National Digital Library of Theses and Dissertations in Taiwan

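Graduate student: Wan-Yu Huang (黃琬瑜)
Thesis title: An Efficient Method of Filtering Spam by Analyzing the HTML Format of E-Mails (一個有效率分析HTML格式郵件特徵之垃圾信過濾方法)
Advisor: Wei-Pang Yang (楊維邦)
Degree: Master's
Institution: National Dong Hwa University (國立東華大學)
Department: Master's Degree Program in Digital Knowledge Management
Discipline: Business and Management
Field: Other Business and Management
Thesis type: Academic thesis
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Number of pages: 95
Keywords (translated from Chinese): machine learning algorithm; decision tree; e-mail format; text categorization; spam filtering
Keywords (English): text categorization; spam filtering; machine learning; decision tree; e-mail format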
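With the rapid growth and spread of the Internet, e-mail has become not only an indispensable everyday communication tool but also a major marketing channel. E-mail is convenient, fast, inexpensive, and unrestricted in reach, but it has also brought a flood of spam, which threatens network security and inconveniences users. In recent years in particular, spammers have begun sending HTML-encoded messages, placing URLs in HTML attributes such as HREF, SRC, and ACTION to entice users to click, which can lead to leaked user data or malware being installed on the user's computer. Moreover, earlier studies have not used HTML code and URL addresses as features for spam filtering. This study therefore proposes an efficient spam-filtering method that analyzes the features of HTML-format e-mail. By analyzing the message format, it divides e-mails into two categories, Text and HTML, according to Content type; it then analyzes the header and body information of unknown messages in each category, using an improved ID3 decision tree as the machine learning algorithm. To address the high FP and FN values of machine learning algorithms, as well as the time sensitivity of spam keywords, the study adds a re-learning stage to the training and execution stages; this stage re-learns both the score-adjustment mechanism and the keywords in order to raise filtering accuracy.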
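Experiments confirm that analyzing message format together with continual learning classifies e-mail effectively and correctly. Overall, the proposed filtering mechanism achieves an accuracy of at least 97.25% and at most 99.63%. Considering the classification results of the HTML decision tree alone, accuracy reaches 99.88%, which shows that the method works well for HTML-format e-mail. In addition, the FP rate and FN rate drop markedly after the re-learning stage; the FN rate in particular stays below 1% in all cases, and the filter for HTML-format e-mail produces no false negatives at all.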
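The abstract names an improved ID3 decision tree as the learning algorithm but does not describe the improvements. As a point of reference only, the following minimal Python sketch shows the standard ID3 split criterion (information gain, after Quinlan), not the thesis's modified version. The feature names `url_in_href` and `has_subject` and the four toy messages are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the data on one feature."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Standard ID3 splits on the feature with the largest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: two boolean features per message.
ROWS = [
    {"url_in_href": True, "has_subject": True},
    {"url_in_href": True, "has_subject": False},
    {"url_in_href": False, "has_subject": True},
    {"url_in_href": False, "has_subject": False},
]
LABELS = ["spam", "spam", "legitimate", "legitimate"]

if __name__ == "__main__":
    print(best_feature(ROWS, LABELS, ["url_in_href", "has_subject"]))
```

In this toy set, `url_in_href` separates the two classes perfectly, so it has the larger gain and would be chosen as the root split.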
In recent years, spammers have begun to send HTML-format e-mails. They may place suspicious URLs in HTML attributes such as HREF, SRC, and ACTION to entice users to click on the message, which can result in leakage of user data. However, the literature on spam filtering that uses HTML format and URL addresses as features is very scarce. In this paper, we propose an efficient method of filtering spam by analyzing the HTML format of e-mails. The method is divided into four processes. In the first process, we analyze the e-mail format and separate e-mails into HTML and Text categories according to the Content-Type. In the second process, we analyze the header and body information of unknown e-mails and apply an improved ID3 decision tree algorithm to each of the two categories. In the third process, we score each unknown e-mail according to the result of the second process and classify it as spam or legitimate mail. Finally, to reduce the high FP (false positive) and FN (false negative) rates and to address the time sensitivity of spam keywords, the fourth process applies re-learning to increase filtering accuracy.
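The first process described above, splitting mail by Content-Type and looking at URLs carried in HREF, SRC, and ACTION, can be illustrated with Python's standard library. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function name `categorize`, the class `UrlAttributeCollector`, and the rule that a message with any `text/html` part counts as HTML are all hypothetical choices made for illustration.

```python
from email import message_from_string
from html.parser import HTMLParser

# Attributes the abstract names as carriers of spam URLs.
URL_ATTRIBUTES = {"href", "src", "action"}


class UrlAttributeCollector(HTMLParser):
    """Collects (tag, attribute, value) for href/src/action attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        for name, value in attrs:
            if name in URL_ATTRIBUTES and value:
                self.urls.append((tag, name, value))


def categorize(raw_message):
    """Return ("HTML" or "Text", list of (tag, attribute, url)).

    Assumption: a message is treated as HTML if any MIME part is text/html.
    """
    msg = message_from_string(raw_message)
    collector = UrlAttributeCollector()
    category = "Text"
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        category = "HTML"
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        collector.feed(text)
    collector.close()
    return category, collector.urls
```

The collected tuples could then serve as inputs to whatever body features a classifier uses; which features the thesis actually derives from them is not stated in this record.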
According to the experiments, the overall accuracy of the proposed method can reach 99.63%, and applying the re-learning process reduces the FP rate and FN rate to 2.35% and 0.32%, respectively. Considering only the classification of HTML-format e-mails, accuracy reaches 99.88%, with an FP rate below 1% and an FN rate of 0%. The experimental results show that the method deals with spam effectively and with high accuracy, especially for HTML-format e-mails.
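The accuracy, FP rate, and FN rate reported above are not defined in this record. The sketch below assumes the common definitions, with spam as the positive class: the FP rate is the share of legitimate mail flagged as spam, and the FN rate is the share of spam that passes as legitimate. The function name `filter_metrics` and the counts in the usage line are made up for illustration and are not the thesis's data.

```python
def filter_metrics(tp, tn, fp, fn):
    """Compute accuracy, FP rate, and FN rate from confusion-matrix counts.

    tp: spam classified as spam          tn: legitimate classified as legitimate
    fp: legitimate classified as spam    fn: spam classified as legitimate
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "fp_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "fn_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


if __name__ == "__main__":
    # Made-up counts: 100 spam and 100 legitimate messages.
    print(filter_metrics(tp=90, tn=95, fp=5, fn=10))
```

Under these definitions, accuracy alone can hide an imbalance between the two error types, which is presumably why the abstract reports all three figures.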
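Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 E-mail Format Standards
2.2 HTML E-mail
2.3 Common Spam Filtering Mechanisms
2.4 Decision Tree Algorithms
2.4.1 The ID3 Decision Tree Algorithm
Chapter 3 Research Design
3.1 Research Framework and Process
3.2 Research Design
3.2.1 E-mail Analysis Stage
3.2.2 E-mail Training Stage
3.2.3 Execution Stage
3.2.4 Re-learning Stage
Chapter 4 Experimental Design and Analysis
4.1 Dataset Selection
4.2 Experimental Design
4.3 Performance Measures
4.4 Analysis of Experimental Results
4.4.1 Analysis of the Results of the Proposed Method
4.4.2 Effect of the Re-learning Stage on the Results
4.4.3 Reliability Analysis of the Proposed Method
4.4.4 Comparison of the Proposed Method with Existing Methods
Chapter 5 Conclusions and Future Work
Chapter 6 References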


