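Graduate student: 林佳慧 (Chia-Hui Lin)
Thesis title: Near-Duplicate Mail Detection Based On URL Information (以URL資訊為基礎的相似郵件偵測系統設計)
Advisor: 葉春超 (Chun-Chao Yeh)
Degree: Master's
Institution: National Taiwan Ocean University (國立臺灣海洋大學)
Department: Department of Computer Science and Engineering (資訊工程學系)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar)
Language: English
Number of pages: 166
Keywords: spam (垃圾信)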
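Commerce on the Internet is booming, and many merchants use email as an advertising medium. Bulk email sent without the recipients' consent has reached epidemic levels. Email addresses are easy to collect, and sending email in bulk is immediate and costs far less than traditional mail. Sending advertisements by email has therefore become an important marketing technique, and it has also caused spam to proliferate. If companies were willing to spend as much money on junk email as they do on junk physical mail, an Internet user would probably receive hundreds of spam messages a day. Whenever users send a message, fill in a form on the web, or even register a product on the vendor's website, their email addresses can easily fall into the hands of spammers. The flood of spam wastes server and Internet resources and causes considerable trouble for email users, and it seriously threatens both the effective use of Internet resources and the use of email.
In recent years, many researchers have recognized the importance of anti-spam techniques, and many such techniques have been proposed. Most of them judge each message by whether its content contains improper advertising. In this thesis, we instead propose using the similarity between mails to detect near-duplicate mail, with features selected from the URL information contained in the mail content. Generally, an advertising mail is copied in large numbers and sent to Internet users, so the mails delivered to different users all originate from the same advertising mail. If the copies were identical, duplicate-detection techniques would suffice. However, spammers have become much better at evading such detection: they can insert random text or personalize each copy according to the recipient, for example by the recipient's name, the recipient's address, or the subject. These mails look different, but they do in fact come from the same spam.
In this thesis we therefore design a near-duplicate mail detection system based on URL information, in the hope of assisting spam detection. We also study different duplicate-detection methods, implement them, and compare their performance with that of our system. We use real mails in the experiments to make the results more meaningful. We mainly compare against the octet-based histogram method and two recently proposed near-duplicate document detection techniques (I-Match and Winnowing). The results show that, with the spam we have collected so far as test samples, the proposed method is more accurate than the other three methods. We also analyze how different spam techniques affect the method. We believe these results will be of considerable help in designing effective spam detection systems in the future.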
As the commercialization of the Internet continues, unsolicited email has reached epidemic proportions as more and more marketers turn to email as an advertising medium [27]. The usefulness of email is seriously threatened because address lists are easy to collect and messages are cheap to mass-distribute. If companies spent as much money on sending junk email as they do on sending junk physical mail, an Internet user would likely receive hundreds of junk messages per day. Every time an Internet user exposes his or her email address to the public, for example by listing it on or registering with a web site, spammers can easily obtain it.
Recently, many researchers have become aware of the importance of developing techniques to combat spam. Various anti-spam techniques have been proposed, and most of them are based on intra-mail scanning, which judges each message by its own content. In this thesis, we provide an inter-mail scan scheme for spam detection based on the URL information in mail content. Usually, a spam message is massively reproduced and delivered to Internet users. Many copies of a spam message delivered to different receivers are identical, and they can be easily detected by comparing their fingerprints exactly. However, more and more intelligent spam delivery systems are able to generate customized copies (according to, for example, the receiver's name, email address, or the email subject) for different receivers. The contents of these customized copies are not exactly the same, but they deliver the same message to the receivers.
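The limitation of exact fingerprint matching described above can be illustrated with a minimal Python sketch. It assumes a SHA-256 digest of the whitespace-normalized body as the fingerprint; the hash choice, the normalization, and the sample messages are illustrative assumptions, not details taken from the thesis.

```python
import hashlib


def fingerprint(body: str) -> str:
    """Return a SHA-256 digest of the whitespace-normalized mail body.

    Assumption for illustration only: the thesis does not specify
    which fingerprint function exact-duplicate detectors use.
    """
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical sample messages.
original = "Buy cheap watches at http://shop.example.com/deal today!"
exact_copy = "Buy cheap watches at   http://shop.example.com/deal today!"
customized = "Dear Alice, buy cheap watches at http://shop.example.com/deal today!"

# An identical copy (up to whitespace) has the same fingerprint.
same_as_copy = fingerprint(original) == fingerprint(exact_copy)
# A personalized copy does not, although it advertises the same URL.
same_as_customized = fingerprint(original) == fingerprint(customized)

print(same_as_copy, same_as_customized)  # prints "True False"
```

A single inserted greeting is enough to change the digest, which is why exact matching alone misses customized copies.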
In this thesis, a near-duplicate mail detection system is developed based on URL information. Extensive empirical results are reported, using real mails for training and testing under different spam behaviors. To better understand the strength of the proposed scheme, we also investigate other near-duplicate (mail/document) detection schemes and compare them in terms of accuracy. We compare against three approaches from the literature: the octet-based histogram method, I-Match, and Winnowing. Using thousands of real mails we collected as test samples, the experimental results show that the proposed strategy outperforms the other three approaches in terms of accuracy. We hope the results of this thesis can give more insight into spam and anti-spam techniques for the design of spam filtering systems.
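To make the idea of comparing mails by their URL information concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the regular expression, the choice to keep host and path while dropping the query string, the Jaccard similarity, and the 0.5 threshold are not taken from the thesis, whose actual feature extraction and decision model are described in its Chapter 3.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, deliberately simple pattern for http(s) URLs in plain text.
URL_PATTERN = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)


def url_features(body: str) -> set:
    """Extract 'host/path' features from every URL in a mail body.

    Assumption: query strings are dropped so that per-recipient
    tracking parameters do not change the feature.
    """
    features = set()
    for raw in URL_PATTERN.findall(body):
        try:
            parts = urlsplit(raw.rstrip(".,;:!?)"))
        except ValueError:
            continue  # skip malformed URLs
        host = (parts.hostname or "").lower()
        if host:
            features.add(host + parts.path.rstrip("/"))
    return features


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(mail_a: str, mail_b: str, threshold: float = 0.5) -> bool:
    """Flag two mails as near-duplicates if their URL sets overlap enough."""
    return jaccard(url_features(mail_a), url_features(mail_b)) >= threshold


# Hypothetical sample messages: two personalized copies and one unrelated mail.
mail_a = ("Hi Bob, visit http://shop.example.com/deal?id=123 now. "
          "Unsubscribe: http://shop.example.com/out")
mail_b = ("Hello Carol! Visit http://shop.example.com/deal?id=987 today. "
          "Unsubscribe: http://shop.example.com/out")
mail_c = "Meeting notes are at http://intranet.example.org/notes"

print(is_near_duplicate(mail_a, mail_b))  # prints "True"
print(is_near_duplicate(mail_a, mail_c))  # prints "False"
```

Mails without any URL yield an empty feature set and never match under this sketch; the thesis treats such mails separately.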
Chapter 1 Introduction
1.1 Background
1.2 Near-duplicated mails
1.3 The proposed near-duplicated mail detection scheme
1.4 Contributions and main results
1.5 Thesis organization
Chapter 2 Background and Related Works
2.1 Background
2.2 Detection techniques used in preventing spam mails
2.2.1 Rule-based
2.2.2 Statistical-classification based on content information
2.2.3 Checking faked header information
2.2.3.1 Check Faked SMTP
2.2.3.2 Check Faked From
2.2.4 Hybrid
2.2.5 Hardware-based
2.3 Survey of duplicate document detection
2.3.1 I-Match
2.3.2 Winnowing
2.4 Near-duplicate detection system based on histogram
Chapter 3 Design of Near-Duplicated Mail Detection System
3.1 Why URL information
3.2 System architecture
3.3 Feature extraction
3.3.1 Algorithm of feature extraction
3.4 Mails containing URL information
3.4.1 Possible tricks against URL checking
3.4.2 Proposed mechanisms for URL information processing
3.4.3 Granularity of URL information for comparison
3.5 Mails containing no URL information
3.6 Decision model
3.6.1 Similarity measurement
3.6.2 Decision function
3.6.3 Threshold values
Chapter 4 Data Set and Evaluation Procedures
4.1 The data sets
4.1.1 Characteristics of the data sets
4.1.1.1 Duplicate copies
4.1.1.2 Mail size and content types
4.2 Evaluation of the system performance
4.2.1 Performance measure
4.2.2 Evaluation procedure
Chapter 5 Performance Evaluation
5.1 Experiment setup
5.2 Experiment results on mails containing URL information
5.2.1 Strategies without subtoken mechanism
5.2.1.1 Experiment 5-1: Full-length without subtoken
5.2.1.2 Experiment 5-2: Ignore-CGI without subtoken
5.2.1.3 Experiment 5-3: IP-only without subtoken
5.2.1.4 Experiment 5-4: Truncated domain-name without subtoken
5.2.1.5 Summary on strategies without subtoken mechanism
5.2.2 Strategies with subtoken mechanism
5.2.2.1 Experiment 5-5: Ignore-CGI with subtoken
5.2.2.2 Experiment 5-6: Truncated domain-name with subtoken
5.2.2.3 Experiment 5-7: Full-length with subtoken
5.2.2.4 Summary on the performance of the mails containing URL information
5.3 Experiment results for mails containing no URL information
5.3.1 Experiment 5-8: mails not containing URL information
5.4 Overall statistics
5.4.1 Architecture of overall system
5.4.2 Experiment 5-9: overall system performance
Chapter 6 Performance Test
6.1 Results on mails containing URL information
6.1.1 Strategies without subtoken mechanism
6.1.1.1 Experiment 6-1: on strategies without subtoken mechanism
6.1.2 Strategies with subtoken mechanism
6.1.2.1 Experiment 6-2: on strategies with subtoken mechanism
6.1.3 Overall performance on mails containing URL information
6.2 Overall performance
6.4 Appendix
A. Tables for strategies without subtoken mechanism (for URL mails only)
B. Tables for strategies with subtoken mechanism (for URL mails only)
C. Tables for mails containing no URL information (for non-URL mails only)
D. Tables for overall performance (for all mails)
Chapter 7 Performance Comparison
7.1 On the mails containing URL information
7.1.1 I-Match-like
7.1.1.1 Setting
7.1.1.2 Results of I-Match-like system
7.1.2 Winnowing
7.1.2.1 Setting
7.1.2.2 Results of winnowing system
7.1.3 Octet histogram-based duplicate mail detection
7.1.3.1 Setting
7.1.3.2 Results of octet histogram-based duplicate mail detection
7.2 Mails containing no URL information
7.3 Overall performance
7.4 Appendix
A. I-Match-like performance
B. Winnowing-based performance
C. Octet histogram-based performance
D. Overall performance
Chapter 8 Conclusions
References
[1] Robert J. Hall, “How to avoid unwanted email”, Communications of the ACM, Vol. 41, No. 3, March 1998.
[2] http://members.aol.com/emailfaq/emailfaq.html#3e
[3] Flavio D. Garcia, Jaap-Henk Hoepman, and Jeroen van Nieuwenhuizen, “Spam Filter Analysis”, submitted to SEC 2004.
[4] Open WebMail project available at http://openwebmail.org/
[5] Sendmail available at http://www.sendmail.org/
[6] Postfix available at http://www.postfix.org/
[7] Qmail available at http://www.qmail.org/
[8] J. Myers, “Post Office Protocol - Version 3”, RFC 1939, May 1996.
[9] M. Crispin, “Internet Message Access Protocol”, RFC 2060, December 1996.
[10] SpamAssassin available at http://spamassassin.apache.org/
[11] TMDA (Tagged Message Delivery Agent) available at http://tmda.net/
[12] Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos, “A Memory-Based Approach to Anti-Spam Filtering”, National Centre for Scientific Research(NCSR) Demokritos, Technical Report Demo, 2001.
[13] Paul Graham, “A Plan for Spam”, available at http://www.paulgraham.com/spam.html
[14] Bayesian available at http://www.bayesian.org/
[15] Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, and Constantine D. Spyropoulos, “An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages”, in Proc. of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, pp. 160-167.
[16] Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sakkis, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos, “Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach”, in H. Zaragoza, P. Gallinari, and M. Rajman (Eds.), Proc. Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), Lyon, France, pp. 1-13, 2000.
[17] POPFile available at http://popfile.sourceforge.net/
[18] RFC822 available at http://www.faqs.org/rfcs/rfc822.html
[19] Spamkiller for MailServers available at http://www.nai.com/us/products/mcafee/antispam/spk_mailserver.htm
[20] No Spam Today! for Workstations available at http://www.nospamtoday.com/workstation/
[21] Pop3proxy available at http://mcd.perlmonk.org/pop3proxy/
[22] Spamkiller Appliances available at http://www.nai.com/us/products/mcafee/antispam/spk_appliances.htm
[23] Abdur Chowdhury, Ophir Frieder, David Grossman, and Mary Catherine McCabe, “Collection Statistics for Fast Duplicate Document Detection”, ACM Transactions on Information Systems, Vol. 20, No. 2, April 2002, Pages 171-191.
[24] Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken, “Winnowing: Local Algorithms for Document Fingerprinting”, SIGMOD 2003, June 9-12, 2003, San Diego, CA.
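[25] 葉迺瑋, “Design and Implementation of a Near-Duplicate Mail Detection System” (相似郵件偵測系統設計與實作), Master's thesis, Department of Computer Science and Engineering, National Taiwan Ocean University, June 2004.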
[26] E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, and P. Samarati, “An Open Digest-based Technique for Spam Detection”, In Proceedings of the 4th IEEE international conference on peer-to-peer computing, 2004.
[27] Shane Hird, Technical Solutions for Controlling Spam, Distributed Systems Technology Center, available at http://security.dstc.edu.au/papers/technical_spam.pdf
[28] Distributed Checksum Clearinghouse (DCC), available at: http://www.rhyolite.com/anti-spam/dcc/
[29] D. Fetterly, M. Manasse, and M. Najork, “On the evolution of clusters of near-duplicate web pages”, in Proceedings of the 1st Latin American Web Congress, pages 37-45, 2003.
[30] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz, “A Bayesian approach to filtering junk E-Mail”, in Proc. of AAAI Workshop on Learning for Text Categorization, July 1998, Madison, Wisconsin.
[31] Simon Tong, and Daphne Koller, “Support Vector Machine Active Learning with Applications to Text Classification”, Journal of Machine Learning Research, vol. 2, pp.45-66, 2001.
[32] Klaus-Robert Mueller, et al., “An introduction to kernel-based learning algorithms”, IEEE Transactions on Neural Networks, March 2001.
[33] J.R. Quinlan, “C4.5: Programs for Machine Learning”, San Mateo, Calif.: Morgan Kaufmann Publishers, 1993.
[34] Fernando J. Corbato, “On computer system challenges”, JACM, 50(1):30-31, January 2003.

[35] S. Olsen, “Spam: It's completely out of control”, CNET News.com, March 21 2002. Available: http://zdnet.com.com/2100-1106-865442.html.
[36] Geoff Hulten, Anthony Penta, Gopalakrishnan Seshadrinathan, and Manav Mishra, “Trends in Spam Products and Methods”, First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 30 and 31, 2004.
[37] S. Machlis, “Uh-oh: Spam's getting more sophisticated”, Computerworld, Jan 17 2003.
[38] John Graham-Cumming, “How to beat an adaptive spam filter”, MIT Spam Conference, 2004.
[39] Gregory L. Wittel and S. Felix Wu, “On attacking statistical spam filters”, in Proceedings of CEAS04.
[40] RFC 2045 available at ftp://ftp.rfc-editor.org/in-notes/rfc2045.txt
[41] RFC 2046 available at ftp://ftp.rfc-editor.org/in-notes/rfc2046.txt
[42] RFC 2047 available at ftp://ftp.rfc-editor.org/in-notes/rfc2047.txt
[43] RFC 2048 available at ftp://ftp.rfc-editor.org/in-notes/rfc2048.txt
[44] RFC 2049 available at ftp://ftp.rfc-editor.org/in-notes/rfc2049.txt
[45] K-Nearest Neighbor available at http://blue.lins.fju.edu.tw/~tseng/ResearchResults/categorization.htm
[46] Nearest Neighbor available at http://www.cs.umd.edu/~brabec/quadtree/nearest.htm