National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

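Author: 王浩羽 (Hao-Yu Wang)
Title: 辨識垃圾留言:以YouTube為例
Title (English): Identifying Machine Spams in YouTube Comments
Advisor: 柯士文 (Shih-Wen Ke)
Degree: Master's
Institution: Chung Yuan Christian University (中原大學)
Department: Graduate Institute of Information Engineering (資訊工程研究所)
Discipline: Engineering
Field: Electrical and Computer Engineering
Type of thesis: Academic thesis
Year of publication: 2014
Academic year of graduation: 102 (ROC calendar)
Language: Chinese
Number of pages: 73
Keywords (Chinese, translated): information retrieval; text classification; text preprocessing; feature selection; spam; classifier
Keywords (English): text classifier; spam filtering; feature selection; text preprocessing; text classification; information retrieval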
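Abstract (Chinese original, translated)
With the rapid growth of the Internet, more and more businesses choose to promote their products on online platforms. Such advertising messages degrade the user experience, bring no benefit, or lack useful information, and are therefore called spam.
Most current spam classification methods are built on traditional text classification. They work well for general spam classification, and in the email domain spam classification techniques have been used successfully to filter junk mail. On today's popular social networking platforms, however, these techniques cannot accurately distinguish spam from non-spam.
This study therefore divides spam on social networking platforms into human spam and machine spam, where machine spam denotes advertising spam sent by automated means and human spam denotes any spam that is not machine spam. In a two-part experiment, we use bag-of-words, bigram, and part-of-speech features, term frequency and χ² feature selection, and an SVM classifier to observe how each setting relates to classification performance. We then apply our relabeling method to convert the original data into three classes (ham, human spam, and machine spam) and run classification tests. The results of the two parts are used to assess whether dividing social-network spam classification into ham, human spam, and machine spam is feasible and sound.
The experimental results show that machine spam differs from ham to a certain degree, whereas no decisive features distinguish human spam from ham. From these results we infer that, for spam classification on social networks, the three classes ham, human spam, and machine spam perform better than the traditional two classes ham and spam.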


Abstract (English)
The Internet is now ubiquitous, so more and more marketers and campaigners use it to advertise their services and products. However, these advertisements can hurt the user experience and usually provide little useful information. We regard this kind of information as spam, that is, generally unwanted or unsolicited messages.
Spam filtering is built upon traditional text classification techniques. Conventional ‘binary’ spam filtering has been applied successfully to email, where each message is classified as either spam or non-spam (a.k.a. ham). However, spam on social networking sites is harder to identify with traditional techniques because the spam messages differ in nature.
This study aims to identify spam messages on social networking sites, a.k.a. social spam. In the experiments, we use our ‘relabeling’ technique to convert the binary spam filtering task into a three-class problem: ham, plus spam divided into two categories, machine spam and human spam. Machine spam is generated and posted on social networking sites by a computer programme, whereas human spam is written and posted by a human user but reported as spam by another user. In our investigation, bag-of-words, bigram and part-of-speech features are used to represent the messages. Two feature selection techniques, namely term frequency and χ², are used to reduce the document dimensionality, and SVM is chosen as the classifier.
The results show the classifier performs much better at the three-class problem (i.e. ham, machine spam and human spam) than the binary (i.e. ham and spam) problem. This suggests that the difference between machine spam and ham is much greater than that between human spam and ham, and that treating spam filtering for social networking sites as a three-class problem is more appropriate.
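As a rough illustration of the feature pipeline named in the abstract, the sketch below extracts bag-of-words and bigram features and ranks them by term frequency and by the χ² statistic. It is not the thesis's implementation: the tokenizer, the function names, the toy comments, and the `machine_spam` label string are all assumptions made for this example, and the SVM training step is left out.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Unigram features: the token list itself."""
    return tokenize(text)

def bigrams(text):
    """Bigram features: each pair of adjacent tokens joined by a space."""
    tokens = tokenize(text)
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def top_k_by_tf(docs, k, extract=bag_of_words):
    """Term-frequency selection: keep the k features with the highest total count."""
    counts = Counter()
    for doc in docs:
        counts.update(extract(doc))
    return [term for term, _ in counts.most_common(k)]

def chi_square(docs, labels, target, extract=bag_of_words):
    """Chi-square score of each feature against the class `target`.

    For a 2x2 term/class table with
      a = target docs containing the term, b = other docs containing it,
      c = target docs without it,          d = other docs without it,
    the score is N * (a*d - c*b)^2 / ((a+c) * (b+d) * (a+b) * (c+d)).
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    in_target, out_target = Counter(), Counter()
    for doc, y in zip(docs, labels):
        feats = set(extract(doc))  # document-level presence, not raw counts
        (in_target if y == target else out_target).update(feats)
    scores = {}
    for term in set(in_target) | set(out_target):
        a, b = in_target[term], out_target[term]
        c, d = n_target - a, (n - n_target) - b
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores

# Toy comments, invented for this example only.
docs = [
    "check out my channel free subscribers",
    "free subscribers click here",
    "great song love it",
    "love this video",
]
labels = ["machine_spam", "machine_spam", "ham", "ham"]

scores = chi_square(docs, labels, "machine_spam")
ranked = sorted(scores, key=scores.get, reverse=True)
```

On this toy data, terms that occur only in one class (such as "free" or "love") receive the highest χ² scores, which is the property that makes the statistic useful for keeping a small, discriminative feature set before training a classifier.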


Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives and Structure
Chapter 2 A Review of Traditional Text Classification
2.1 Text Preprocessing
2.1.1 Text Extraction
2.1.2 Stopword Removal
2.1.3 Stemming
2.2 Document Representation
2.2.1 Bag-of-words
2.2.2 n-gram
2.2.3 Term Weighting
2.3 Feature Selection
2.3.1 Term Frequency (TF)
2.3.2 Document Frequency (DF)
2.3.3 Information Gain (IG)
2.3.4 Mutual Information (MI)
2.3.5 χ² Statistic (χ²)
2.4 Classifiers
2.4.1 k-Nearest Neighbors Algorithm (k-NN)
2.4.2 Support Vector Machine (SVM)
2.4.3 Decision Tree
2.4.4 Comparison and Discussion of Classifiers
2.5 Evaluation Metrics
Chapter 3 Social Spam Classification
3.1 Spam
3.2 Email Spam Classification
3.3 Spam Classification on Other Social Networks
3.4 Analysis and Discussion
Chapter 4 Experimental Method
4.1 Dataset
4.2 Classifier
4.3 Evaluation Method
4.4 Experiment 1
4.4.1 Dataset
4.4.2 Experimental Design and Procedure
4.5 Experiment 2
4.5.1 Dataset
4.5.2 Relabeling
4.5.3 Experimental Design and Procedure
Chapter 5 Experimental Results and Analysis
5.1 Results of Experiment 1
5.1.1 Experiment 1-1
5.1.2 Experiment 1-2
5.1.3 Experiment 1-3
5.1.4 Discussion of Experiment 1 Results
5.2 Results of Experiment 2
5.2.1 Results on the Validation Set
5.2.2 Results on the Testing Set
5.2.3 Discussion of Experiment 2 Results
5.3 Analysis of Overall Experimental Results
Chapter 6 Conclusion
6.1 Summary of Experimental Results and Conclusions
6.2 Future Research Directions
References
Appendix


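List of Figures
Figure 2-1 Flowchart of traditional text classification.
Figure 2-2 Preprocessing flowchart.
Figure 2-3 Part-of-speech conversion, using a news release as an example.
Figure 2-4 Comparison before and after extraction, using YouTube comments as an example.
Figure 2-5 Part of a stopword list.
Figure 2-6 Illustration of stemming.
Figure 2-7 Vector Space Model.
Figure 2-8 Illustration of kNN classification.
Figure 2-9 Illustration of Euclidean distance.
Figure 2-10 Illustration of cosine similarity.
Figure 2-11 Illustration of a hyperplane.
Figure 2-12 Illustration of a decision tree.
Figure 2-13 ROC curve.
Figure 3-1 Result of reporting a spam message on YouTube.
Figure 3-2 Self-similarity matrix (Lin et al., 2008).
Figure 3-3 Clock-like visualization (Lin et al., 2008).
Figure 3-4 Relationship among ham, human spam, and machine spam.
Figure 4-1 Flowchart of Experiment 1.
Figure 4-2 Relabeling algorithm.
Figure 4-3 Relabeling case 1.
Figure 4-4 Relabeling case 2.
Figure 4-5 Flowchart of Experiment 2.
Figure 5-1 Differences in classification results on general text features.
Figure 5-2 Classification performance of bag-of-words at various numbers of features.
Figure 5-3 Classification performance of bigram at various numbers of features.
Figure 5-4 Classification performance of bag-of-words with the top 1,000 features selected by TF and χ².
Figure 5-5 Classification performance of bigram with the top 1,000 features selected by TF and χ².
Figure 6-1 YouTube spam comment reporting menu.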

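List of Tables
Table 2-1 Confusion matrix.
Table 4-1 Class distribution of the original dataset.
Table 4-2 Evaluation example 1.
Table 4-3 Evaluation example 2.
Table 4-4 Training and test sets for Experiment 1.
Table 4-5 Sizes of the training and test sets for Experiment 1 after preprocessing.
Table 4-6 Sizes of the training, validation, and test sets.
Table 4-7 Sizes of the training, validation, and test sets after relabeling.
Table 5-1 Classification results using bag-of-words on part-of-speech features.
Table 5-2 Classification results using bigram on part-of-speech features.
Table 5-3 Classification results for bag-of-words using basic preprocessing and an RBF SVM.
Table 5-4 Classification results for bag-of-words using preprocessing method 2 and an RBF SVM.
Table 5-5 Classification results for bag-of-words using preprocessing method 2 and a linear SVM.
Table 5-6 Classification results for bigram using preprocessing method 2 and a linear SVM.
Table 5-7 Testing-set classification results for bag-of-words using preprocessing method 2 and a linear SVM.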

AMATI, G. & VAN RIJSBERGEN, C. J. 2002. Probabilistic Models of Information Retrieval Based on Measuring The Divergence from Randomness. ACM Transactions on Information Systems, 20, 357-389.
BAI, J., NIE, J.-Y. & PARADIS, F. 2004. Using language models for text classification. The Asia Information Retrieval Symposium, AIRS '04.
BASNET, R., MUKKAMALA, S. & SUNG, A. 2008. Detection of Phishing Attacks: A Machine Learning Approach. In: PRASAD, B. (ed.) Soft Computing Applications in Industry.
BHATTARAI, A., RUS, V. & DASGUPTA, D. 2009. Characterizing Comment Spam in the Blogosphere Through Content Analysis. The IEEE Symposium on Computational Intelligence in Cyber Security, CICS '09.
BREIMAN, L., FRIEDMAN, J., STONE, C. J. & OLSHEN, R. A. 1984. Classification and Regression Trees.
BRILL, E. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-Of-Speech Tagging. Computational Linguistics, 21, 543-565.
CELIK, K. & GUNGOR, T. 2013. A Comprehensive Analysis of Using Semantic Information in Text Categorization. The IEEE International Symposium on Innovations in Intelligent Systems and Applications, INISTA '13.
CHANG, C.-C. & LIN, C.-J. LIBSVM -- A Library for Support Vector Machines [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
CHANG, C.-C. & LIN, C.-J. 2011. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2, 1-27.
CORTES, C. & VAPNIK, V. 1995. Support-Vector Networks. Machine Learning, 20, 273-297.
DA LUZ, A., VALLE, E. & DE A ARAUJO, A. 2012. A Context-aware Description for Content Filtering on Video Sharing Social Networks. The IEEE International Conference on Multimedia and Expo, ICME '12.
DHAMIJA, R., TYGAR, J. D. & HEARST, M. 2006. Why phishing works. The SIGCHI Conference on Human Factors in Computing Systems, CHI '06.
DRUCKER, H., WU, S. & VAPNIK, V. N. 1999. Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10, 1048-1054.
EL-KHAIR, I. A. 2006. Effects of Stop Words Elimination for Arabic Information Retrieval: A Comparative Study. International Journal of Computing & Information Sciences, 4, 119-133.
FETTE, I., SADEH, N. & TOMASIC, A. 2007. Learning to Detect Phishing Emails. The 16th International Conference on World Wide Web, WWW '07.
GEUTNER, P. 1997. Fuzzy Class Rescoring: A Part-Of-Speech Language Model. In: KOKKINAKIS, G., FAKOTAKIS, N. & DERMATAS, E. (eds.) The 5th European Conference on Speech Communication and Technology, EUROSPEECH '97.
GONÇALVES, C. A., GONÇALVES, C. T., CAMACHO, R. & OLIVEIRA, E. C. 2010. The Impact of Pre-processing on the Classification of MEDLINE Documents. The 10th International Workshop on Pattern Recognition in Information Systems, PRIS '10.
GOOGLE. YouTube Data API [Online]. Available: https://developers.google.com/youtube/.
HAMID, I. R. A. & ABAWAJY, J. 2011. Phishing Email Feature Selection Approach. The 10th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom '11.
JENQ-HAUR, W. & MING-SHENG, L. 2011. Using Inter-comment Similarity for Comment Spam Detection in Chinese Blogs. The International Conference on Advances in Social Networks Analysis and Mining, ASONAM '11.
LAN, M., TAN, C.-L., LOW, H.-B. & SUNG, S.-Y. 2005. A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with Support Vector Machines. Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, WWW '05.
LEHMANN, M. 2009. String::Similarity [Online]. Available: http://search.cpan.org/~mlehmann/String-Similarity-1.04/Similarity.pm.
LEWIS, D. D. 1992. An Evaluation of Phrasal and Clustered Representations on A Text Categorization Task. The 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '92.
LIN, Y.-R., SUNDARAM, H., CHI, Y., TATEMURA, J. & TSENG, B. L. 2008. Detecting Splogs via Temporal Dynamics Using Self-Similarity Analysis. ACM Transactions on the Web, 2, 1-35.
LOVINS, J. B. 1968. Development of A Stemming Algorithm. Mechanical Translation and Computational Linguistics, 11.
LUHN, H. P. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2, 159-165.
MA, L., OFOGHI, B., WATTERS, P. & BROWN, S. 2009. Detecting Phishing Emails Using Hybrid Features. The Symposia and Workshops on Ubiquitous, Autonomic and Trusted Computing, UIC-ATC '09.
MANSOUR, Y. 1997. Pessimistic Decision Tree Pruning Based on Tree Size. The 14th International Conference on Machine Learning, ICML '97.
MEDLOCK, B. 2003. A Language Model Approach to Spam Filtering. http://www.benmedlock.co.uk/medlock-03.pdf [Online].
MYERS, E. W. 1986. An O(ND) Difference Algorithm and Its Variations. Algorithmica, 1, 251-266.
O'CALLAGHAN, D., HARRIGAN, M., CARTHY, J. & CUNNINGHAM, P. 2012. Network Analysis of Recurring YouTube Spam Campaigns. The 6th International AAAI Conference on Weblogs and Social Media, ICWSM '12.
PONTE, J. M. & CROFT, W. B. 1998. A language modeling approach to information retrieval. The 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98.
PORTER, M. F. 1980. An algorithm for suffix stripping. Program.
QAMAR, A. M., GAUSSIER, E., CHEVALLET, J. P. & JOO-HWEE, L. 2008. Similarity Learning for Nearest Neighbor Classification. The 8th IEEE International Conference on Data Mining, ICDM '08.
RAJADESINGAN, A. & MAHENDRAN, A. 2012. Comment Spam Classification in Blogs through Comment Analysis and Comment-Blog Post Relationships. The 13th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing '12.
ROBERTSON, S. E. & SPARCK JONES, K. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Science, 27, 129-146.
SALTON, G., WONG, A. & YANG, C. S. 1975. A Vector Space Model for Automatic Indexing. Communications of the ACM.
SHEN, D., SUN, J.-T., YANG, Q. & CHEN, Z. 2006. Text Classification Improved Through Multigram Models. The 15th ACM International Conference on Information and Knowledge Management, CIKM '06.
SYMANTEC 2014. Internet Security Threat Report, Volume 19.
TOUTANOVA, K., KLEIN, D., MANNING, C., MORGAN, W., RAFFERTY, A., GALLEY, M. & BAUER, J. 2004. Stanford Log-Linear Part-Of-Speech Tagger [Online]. Available: http://nlp.stanford.edu/software/tagger.shtml.
TOUTANOVA, K., KLEIN, D., MANNING, C. D. & SINGER, Y. 2003. Feature-Rich Part-Of-Speech Tagging with a Cyclic Dependency Network. The Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL '03.
TOUTANOVA, K. & MANNING, C. D. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-Of-Speech Tagger. The Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, EMNLP '00.
VAN RIJSBERGEN, C. J. 1979. Information Retrieval.
YANG, Y. & PEDERSEN, J. O. 1997. A Comparative Study on Feature Selection in Text Categorization. The 14th International Conference on Machine Learning, ICML '97.
YU, C. T. & SALTON, G. 1976. Precision Weighting—An Effective Automatic Indexing Method. ACM, 23, 76-88.
YU, Y. & CHEN, Y. 2012. A Novel Content Based and Social Network Aided Online Spam Short Message Filter. The 10th World Congress on Intelligent Control and Automation, WCICA '12.
ZHAI, C. & LAFFERTY, J. 2002. Two-Stage Language Models for Information Retrieval. The 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '02.
ZHANG, L., ZHU, J. & YAO, T. 2004. An Evaluation of Statistical Spam Filtering Techniques. ACM Transactions on Asian Language Information Processing, 3, 243-269.
ZHANG, X., ZHU, S. & LIANG, W. 2012. Detecting Spam and Promoting Campaigns in the Twitter Social Network. The 12th IEEE International Conference on Data Mining, ICDM '12.
