Author: Chia-shyang Lin
Thesis title: An Enhanced Naïve Bayesian Classifier on Spam Filtering
Advisors: S.M. Tung, Don-her Shieh
Keywords: Bayes' Theorem, email, spam
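E-mail's influence on people's work and daily life has earned it the title of the Internet's "killer application", and it has become one of the most convenient communication channels among businesses and individuals. At the same time, most users suffer from a bombardment of spam. Among existing solutions to this problem, content-based filtering is the approach best suited to individual client-side use, and algorithms based on Bayes' theorem are the most common of these. This study examines two filtering methods based on Bayes' theorem, Naïve Bayes and Robinson (2003), and proposes three improved algorithms. Experiments show that the methods based on higher attribute dimensionality and on feedback learning achieve lower error rates than Naïve Bayes and Robinson (2003), and that the feedback learning algorithm also improves overall across all evaluation metrics.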
Spam has been viewed as a serious threat to the Internet, flooding users' inboxes and costing businesses billions of dollars in wasted bandwidth, storage, and staff time. Many responses to the worsening spam problem have been studied, ranging from technical to regulatory. The Naïve Bayes classifier is widely used in text categorization tasks and is also very popular among anti-spam researchers. In this study, we analyze the Naïve Bayes classifier and the modification by Robinson (2003), and then propose three enhancements. The experimental results show that two of the proposed methods perform better in most cases than the traditional Naïve Bayes model while maintaining a good detection rate and eliminating the false positive problem.
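The two baseline filters named in the abstract can be illustrated with a small sketch. The Python below is not the thesis's implementation. It assumes a bag-of-words Naïve Bayes model with add-one (Laplace) smoothing, and it shows the smoothed word probability from Robinson (2003), f(w) = (s·x + n·p(w)) / (s + n). The function names, the toy messages, and the default values s = 1 and x = 0.5 are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def train(messages):
    """Count tokens per class. messages: iterable of (tokens, label),
    where label is 'spam' or 'ham' (hypothetical labels for this sketch)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for tokens, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def spam_log_odds(tokens, model):
    """Naive Bayes log-odds of spam vs. ham; positive means 'more likely spam'."""
    word_counts, doc_counts, vocab = model
    # Prior log-odds from the class frequencies in the training data.
    score = math.log(doc_counts["spam"] / doc_counts["ham"])
    totals = {c: sum(word_counts[c].values()) for c in ("spam", "ham")}
    for token in tokens:
        if token not in vocab:
            continue  # tokens never seen in training carry no evidence here
        # Add-one smoothing keeps every conditional probability non-zero.
        p_spam = (word_counts["spam"][token] + 1) / (totals["spam"] + len(vocab))
        p_ham = (word_counts["ham"][token] + 1) / (totals["ham"] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

def robinson_f(p_w, n, s=1.0, x=0.5):
    """Robinson's smoothed probability for a word seen in n messages.
    p_w is the raw spam probability of the word, x the assumed probability
    for an unseen word, and s the strength given to that assumption."""
    return (s * x + n * p_w) / (s + n)

# Toy training set, invented for this sketch.
TOY = [
    ("buy cheap pills".split(), "spam"),
    ("cheap pills now".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch at noon".split(), "ham"),
]
model = train(TOY)
```

A positive score from `spam_log_odds` leans toward spam and a negative one toward ham, while `robinson_f` pulls the probability of a rarely seen word toward the neutral value `x`.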
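Chinese Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures
1. Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Research Scope
1.5 Research Process
1.6 Thesis Organization
2. Literature Review
2.1 Basic Concepts of E-mail
2.1.1 Components of an E-mail System
2.1.2 Main E-mail Communication Protocols
2.1.3 Open Relay
2.2 Problems Caused by Spam
2.3 Challenges in Handling the Spam Problem
2.4 Spam Detection and Filtering
2.4.1 Social Solutions
2.4.2 Technical Solutions
2.5 Summary
3. System Design
3.1 Filtering Methods Based on Bayes' Theorem
3.1.1 Naïve Bayes Filtering
3.1.2 The Bayesian Method Proposed by Robinson (2003)
3.2 Improvements to the Methods
3.2.1 Adjusting the Posterior Probability of Classification Attributes
3.2.2 Increasing Attribute Dimensionality
3.2.3 Feedback Learning
4. Experimental Design
4.1 Dataset
4.2 Evaluation Metrics
4.3 Experimental Results
4.3.1 Experiment 1 ( )
4.3.2 Experiment 2 ( )
4.3.3 Discussion of Experimental Results
5. Conclusions and Future Suggestions
5.1 Research Conclusions
5.2 Research Limitations
5.3 Future Research Directions
References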
1.Allman, E., (2003, Sep), “The FTC and SPAM”, QUEUE, pp. 62-69, ACM Press.
2.Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Paliouras, G., Spyropoulos, C. D., (2000), “An Evaluation of Naive Bayesian Anti-Spam Filtering”, In Workshop on Machine Learning in the New Information Age.
3.Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., Spyropoulos, C. D., (2000), “An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages”, In Proc. of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 160-167, Athens, Greece.
4.Belkin, N. J., Croft, W. B., (1992), “Information filtering and information retrieval: two sides of the same coin?”, Communications of the ACM, Volume 35, Issue 12, pp. 29-38.
5.Baayen, H., Van Halteren, H., Neijt, A., Tweedie, F., (2002), “An experiment in authorship attribution”, Proceedings of JADT 2002, St. Malo, pp. 29-37.
6.Carreras, X., Màrquez, L., (2001), “Boosting Trees for Anti-Spam Email Filtering”, Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing
7.Cerf, V. G., (2005, Apr), “Spam, Spim, and Spit”, Communications of the ACM, Volume 48, Issue 4, pp. 39-43.
8.Cheng, J. and Greiner, R., (1999, Aug), “Comparing Bayesian Network Classifiers.”, Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI-99), Sweden.
9.Schwartz, A., (2004, Jul), SpamAssassin, First Edition, ISBN: 0-596-00707-8, Publisher: O’Reilly.
10.Cournane, A. and Hunt, R., (2004, Mar), “An Analysis of the Tools used for the Generation and Prevention of Spam”, Computers and Security, Elsevier, UK, Vol. 23, No. 2, pp. 154-166.
11.Cranor, L. F., LaMacchia, B. H., (1998, Aug), “Spam!”, Communications of the ACM, Vol. 41, No. 8, pp. 74-83.
12.Cunningham, P., Nowlan, N., Delany, S. J., Haahr, M., (2003), “A Case-Based Approach to Spam Filtering that can Track Concept Drift”, Technical Report TCD-CS-2003-16, Trinity College Dublin, Ireland.
13.Deepak, P., Parameswaran, S., (2005, Jan), “Spam filtering using spam mail communities”, Proceedings. The 2005 Symposium on Applications and the Internet, pp. 377 – 383
14.Dent, K. D., (2003, Dec), Postfix: The Definitive Guide, First Edition, ISBN: 0-596-00212-2, Publisher: O’Reilly.
15.Domingos, P. and Pazzani, M., (1996), “Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier”, In Proc. of the 13th International Conference on Machine Learning, pp. 105-112, Bari, Italy.
16.Drucker, H., Wu, D., Vapnik, V., (1999), “Support vector machines for spam categorization”, IEEE Transactions on Neural Networks, 10(5), pp. 1048-1054.
17.Dunham, M. H., (2003), Data Mining Introductory and Advanced Topics, Pearson Education Inc.
18.Fawcett, T., (2003), “"In vivo" spam filtering: a challenge problem for KDD”, Contributed articles, ACM SIGKDD Explorations Newsletter, Volume 5 , Issue 2, pp. 140 - 148
19.Gburzynski P., Maitan J., (2004, Feb), “Fighting the spam wars: A remailer approach with restrictive aliasing”, ACM Transactions on Internet Technology (TOIT), Volume 4, Issue 1, pp. 1-30.
20.GFI Whitepaper, (2005), Why Bayesian Filtering is the most effective anti-spam technology, GFI Software LTD.
21.Giraud-Carrier, C., (2000, Jun), “Unifying Learning with Evolution Through Baldwinian Evolution and Lamarckism: A Case Study”, In Proceedings of the Symposium on Computational Intelligence and Learning (CoIL-2000), pp. 36-41, MIT GmbH.
22.Hidalgo, J. M. G., (2002), “Evaluating Cost-sensitive Unsolicited Bulk Email Categorization”, In Proceedings of SAC-02, 17th ACM Symposium on Applied Computing, pp 615-620, Madrid, ES.
23.Graham, P., (2004), “Better Bayesian Filtering”, In Proceedings of Spam Conference, 2004, Massachusetts Institute of Technology
24.Graham-Cumming, J., (2003), “The spammers' compendium”, In Proceedings of the Spam Conference 2003, Massachusetts Institute of Technology.
25.Grimes, Galen A., (2004, May), “Issues with spam”, Computer Fraud & Security, Volume 2004, Issue 5, Pages 12-16
26.Han, J. and Kamber, M., (2001), Data Mining: Concepts and Techniques, Morgan Kaufmann, pp. 284-287.
27.Hulten G., Penta A., Seshadrinathan G., Mishra M., (2004), “Trends in Spam Products and Methods”, In First Conference on Email and Anti-Spam (CEAS) 2004 Proceedings.
28.Ivey, K.C., (1998, Apr), “Spam: the plague of junk E-mail”, Computer Applications in Power, IEEE, Volume 11, Issue 2, pp. 15-16
29.Jin, R. and Si, L., (2004, Jul), “A study of methods for normalizing user ratings in collaborative filtering”, Proceedings of the 27th annual international conference on Research and development in information retrieval, pp. 568-569.
30.Jung, J., Sit, E., (2004), “Traffic characterization and SPAM: An empirical study of spam traffic and the use of DNS black lists”, Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pp. 370-375.
31.Kohavi, R., (1995), “A study of cross-validation and bootstrap for accuracy estimation and model selection”, In Proc. of the 14th Int. Joint Conf. on AI, Vol. 2, Canada, 1995.
32.Lai, C. C., Tsai, M. C., (2004, Dec), “An Empirical Performance Comparison of Machine Learning Methods for Spam E-Mail Categorization”, Fourth International Conference on Hybrid Intelligent Systems, Proceedings, pp. 44-48
33.Lee, Y., (2005, Jun), “The CAN-SPAM Act: a silver bullet solution?”, Communications of the ACM, Volume 48 , Issue 6, pp. 131-132
34.McWilliams, B. S., (2004, Oct), Spam Kings, First Edition, ISBN: 0-596-00732-9, Publisher: O’Reilly.
35.Langley, P., Iba, W. and Thompson, K., (1992), “An Analysis of Bayesian Classifiers”, In Proc. of the 10th National Conference on Artificial Intelligence, pp. 223-228, San Jose, California.
36.Lashkari, Y., (1995), “Feature Guided Automated Collaborative Filtering”, Masters Thesis, MIT Media Laboratory.
37.Loder, T., Alstyne, M. V., Wash, R., (2004), “An Economic Answer to Unsolicited Communication”, Proceedings of the 5th ACM conference on Electronic commerce, Session 2, pp. 40-50.
38.Ludlow, M., (2002), “Just 150 'spammers' blamed for e-mail woe”, The Sunday Times, 1st December, page 3.
39.Metzger, J., Schillo, M. and Fischer, K., (2003, June), “A multiagent-based peer-to-peer network in java for distributed spam filtering.”, In Proc. of the CEEMAS, Czech Republic.
40.O’Brien, C. and Vogel, C., (2003), “Spam filters: Bayes vs. chi-squared; letters vs. words”, In Proceedings of the International Symposium on Information and Communication Technologies.
41.O’Brien, C. and Vogel, C., (2004), “Comparing SpamAssassin with CBDF Email Filtering”, In Proceedings of the 7th Annual CLUK Research Colloquium.
42.Paulson, D. L., (2003, Jul), “Group Considers Drastic Measures to Stop Spam”, News Briefs, Computer, IEEE Computer Society, Volume 36, Issue 7, pp. 21-22.
43.Pfleeger, S. L., Bloom, G., (2005, Mar), “Canning Spam: Proposed Solutions to Unwanted Email”, IEEE Security & Privacy, IEEE Computer Society.
44.Robinson, G., (2003, Mar), “A Statistical Approach to the Spam Problem”, Linux Journal, Volume 2003, Issue 107
45.Sahami, M., Dumais, S., Heckerman, D. and Horvitz, E., (1998), “A Bayesian Approach to Filtering Junk E-mail”, In Learning for Text Categorization – Paper from the AAAI Workshop, pp. 55-62, Madison Wisconsin. AAAI Technical Report WS-98-05.
46.Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C. D., Stamatopoulos, P., (2003), “A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists”, Information Retrieval, 6(1), pp. 49-73.
47.Weber, R., (2004, Sep), “The Grim Reaper: The Curse of E-Mail”, Editor’s Comments, MIS Quarterly Vol. 28 No. 3, pp. 3-13
48.Whitworth, B., Whitworth, E., (2004, Oct), “Spam and the Social-Technical Gap”, Computer, IEEE Computer Society, Volume 37, Issue 10, pp. 38 - 45
49.Wu, Y. F., (1999), Learning with bayesian networks, Publications of Mississippi State University, Institute for Signal and Information Processing 1999.
50.Zhang, L., Zhu, J., Yao, T., (2004, Dec), “An Evaluation of Statistical Spam Filtering Techniques”, ACM Transactions on Asian Language Information Processing, Vol. 3, No. 4, Pages 243-269
51.Zhou, F., Zhuang, L., Zhao, B. Y., Huang, L., Joseph, A. D., Kubiatowicz, J., (2003), “Approximate Object Location and Spam Filtering on Peer-to-peer Systems”, In Proc. of Middleware (Rio de Janeiro, Brazil, June 2003), ACM, pp. 1-20.
1.Anti Spam Research Group, http://asrg.sp.am/
2.Bogofilter project, http://bogofilter.sourceforge.net/
3.CAUCE, About the problem, Coalition Against Unsolicited Commercial Email, available from: http://www.cauce.org/about/problem.shtml.
4.Ferris Research, http://www.ferris.com/
5.Graham, P., (2002, Aug), A plan for spam, Retrieved December 23, 2004, from http://www.paulgraham.com/spam.html
6.Graham, P., (2003, Aug), Filters that Fight back, Retrieved December 23, 2004, from http://www.paulgraham.com/ffb.html
7.Jupiter Research, http://www.jupiterresearch.com/bin/item.pl/home
8.Mason, Justin. The SpamAssassin Homepage. Available from: http://spamassassin.org/index.html, 2004
9.Nielsen//NetRatings, http://www.netratings.com/
10.Prakash, V.V., Vipul’s Razor, http://razor.sourceforge.net/
11.SpamBayes Project, http://spambayes.sourceforge.net/
12.Spammers Grab MSN Hotmail addresses, http://www.spamhaus.org/news.lasso?article=6
13.RFC Editor Homepage, http://www.rfc-editor.org/
14.Sullivan, B. (2003, Aug), Spam Wars: How Unwanted Email is Burying the Internet, MSNBC, at http://www.msnbc.com/news/941040.asp
15.The Internet Engineering Task Force, http://www.ietf.org/
16.UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
17.李欣茹, (2004), “Severe Spam Gives Enterprises a Headache”, Retrieved December 23, 2004, from http://taiwan.cnet.com/enterprise/features/0,2000062876,20085772-3,00.htm
18.李欣茹, 郭和杰, (2004), “Survey Report on Office ‘Mail’ Harassment”, Retrieved December 23, 2004, from http://taiwan.cnet.com/enterprise/features/0,2000062876,20085772-2,00.htm