Student: 李菽豐 (Shu-fong Li)
Advisor: 陳耀輝 (Yaw-Huei Chen)
Title: 使用潛在狄氏配置改善支援向量機的文章分類表現
Title (English): Using Latent Dirichlet Allocation to Improve Text Classification Performance of Support Vector Machine
Degree: Master's
Institution: National Chiayi University (國立嘉義大學)
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Document type: Academic thesis
Graduation academic year: 104 (2015-2016)
Language: Chinese
Pages: 60
Keywords (Chinese): 文章分類、特徵、主題模型、潛在狄氏配置、支援向量機
Keywords (English): text classification, feature, topic model, latent Dirichlet allocation, support vector machine
Usage statistics:
  • Cited by: 2
  • Views: 207
  • Downloads: 23
Abstract (translated from the Chinese): Text classification is a widely applied natural language processing technique whose purpose is to determine the category of a document, and term frequency is the feature most commonly used by classifiers. A vector space built from term frequencies alone carries little semantic information and cannot easily handle synonymy and polysemy, which leads to inaccurate classification. To improve classification accuracy, this study builds topic models with latent Dirichlet allocation (LDA). In addition to topic features, we develop a topic-term feature that highlights the topic-related terms in a document, converting topic information into features, and classify documents with a support vector machine (SVM). Because features strongly influence machine learning methods, we examine how different term frequency features, topic information features, and their combination affect classification. The experimental results show that no single feature set is necessarily optimal on its own, but combining term frequency features with topic information features lets them compensate for each other's weaknesses and improves classification accuracy.
Abstract (English): Text classification is a widely used technique in natural language processing whose objective is to determine the category of a document, and term frequency is the most commonly used feature in classifiers. Because a vector space built from term frequencies alone does not contain much semantic information, the classifier cannot address the synonymy and polysemy of terms, which may lead to inaccurate classification results. To improve the accuracy of the text classifier, we build topic models with the latent Dirichlet allocation (LDA) method so that topic information can be used in the classifier. In addition, we develop topic-term features, which convert the topic information into features and highlight the topic-related vocabulary in documents, and use them in support vector machines (SVM). Because features play a very important role in machine learning methods, we explore different forms of term frequency features, topic information features, and combinations of the two for their effectiveness in text classification. The experimental results indicate that the combined features enhance text classification accuracy.
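The pipeline the abstract describes can be sketched as follows. This is a minimal illustration using scikit-learn with a made-up toy corpus; per its reference list, the thesis itself uses GibbsLDA++ for topic modeling and LIBLINEAR for classification, so the libraries, parameters, and data here are stand-in assumptions, not the author's actual setup.

```python
# Sketch: combine term-frequency features with LDA topic proportions,
# then classify with a linear SVM (the thesis's core idea).
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

# Toy corpus; the thesis experiments use 20 Newsgroups, RCV1-V2,
# and the Academia Sinica Balanced Corpus of Modern Chinese.
docs = [
    "the goalkeeper saved the penalty in the football match",
    "the striker scored a goal in the final match",
    "the cpu and gpu prices dropped this quarter",
    "new gpu drivers improve cpu rendering performance",
]
labels = [0, 0, 1, 1]  # 0 = sports, 1 = hardware

# Term-frequency (bag-of-words) features.
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)

# Topic-proportion features from an LDA model.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(tf)  # shape: (n_docs, n_topics)

# Concatenate both feature sets so they can compensate for each
# other's weaknesses, as the abstract argues.
combined = hstack([tf, csr_matrix(topic_features)])

clf = LinearSVC()
clf.fit(combined, labels)
preds = clf.predict(combined)
print(preds.tolist())
```

The topic proportions add coarse semantic information (e.g. grouping "cpu" and "gpu" under one topic) that raw term counts lack, which is the intuition behind merging the two feature sets.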
Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Objectives
1.3 Contributions
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Text Classification
2.2 Topic Models
2.3 Improving Text Classification Accuracy with Topics
Chapter 3 Background
3.1 Model Overview
3.1.1 Unigram Model
3.1.2 Mixture of Unigrams Model
3.1.3 pLSI (Probabilistic Latent Semantic Indexing) Model
3.1.4 LDA (Latent Dirichlet Allocation) Model
3.2 LDA Parameter Estimation
Chapter 4 Methodology
4.1 Problem Definition
4.2 Method Description
4.3 Term Frequency Features
4.4 Topic Information Features
4.5 Combining Term Frequency and Topic Features
4.6 Training and Testing the Classification Model
Chapter 5 Experiments and Discussion
5.1 Experimental Setup
5.1.1 Datasets
5.1.2 Experimental Environment
5.1.3 Experimental Design
5.2 Topic Model Training Comparison
5.3 20 Newsgroups
5.3.1 Chi-Square Feature Selection
5.3.2 Comparison of Topic Numbers
5.3.3 Finding the Optimal Number of Added Topic Terms
5.3.4 Comparison of Feature Processing Methods
5.4 RCV1-V2
5.4.1 Chi-Square Feature Selection
5.4.2 Comparison of Topic Numbers
5.4.3 Finding the Optimal Number of Added Topic Terms
5.4.4 Comparison of Feature Processing Methods
5.5 Academia Sinica Balanced Corpus of Modern Chinese
5.5.1 Chi-Square Feature Selection
5.5.2 Comparison of Topic Numbers
5.5.3 Finding the Optimal Number of Added Topic Terms
5.5.4 Comparison of Feature Processing Methods
5.6 Discussion
Chapter 6 Conclusions and Future Work
References
[1] 靳志輝, "LDA 數學八卦" (LDA Mathematical Notes), 2013.
[2] 沈竟, "Short Text Classification with an Information-Gain-Based LDA Model," Journal of Chongqing University of Arts and Sciences (Natural Science Edition), Vol. 30, No. 6, 2011.
[3] 蔡佾翰, "Assessing the Readability of Chinese Articles Using TF-IDF and SVM," Master's thesis, Department of Computer Science and Information Engineering, National Chiayi University, 2011.
[4] A. Asuncion, M. Welling, P. Smyth, and Y.W. Teh, "On smoothing and inference for topic models," Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI2009), pp. 27-34, 2009.
[5] I. Biro and J. Szabo, "Latent Dirichlet Allocation for Automatic Document Categorization," Proceedings of the 19th European Conference on Machine Learning and 12th Principles of Knowledge Discovery in Databases, 2009.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," The Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003.
[7] X. Cheng, X. Yan, Y. Lan, and J. Guo, "BTM: Topic Modeling over Short Texts," IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 12, pp. 2928-2941, 2014.
[8] E. Chen, Y. Lin, H. Xiong, Q. Luo, and H. Ma, "Exploiting probabilistic topic models to improve text categorization under class imbalance," Information Processing and Management, Vol. 47, Issue 2, pp. 202-214, 2011.
[9] CKIP, Academia Sinica Balanced Corpus of Modern Chinese, Academia Sinica. http://asbc.iis.sinica.edu.tw/.
[10] CKIP, A Chinese Word Segmentation System, Academia Sinica. http://ckipsvr.iis.sinica.edu.tw/.
[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A Library for Large Linear Classification," Journal of Machine Learning Research, Vol. 9, pp. 1871-1874, 2008.
[12] G. Heinrich, "Parameter Estimation for Text Analysis," Technical Note, 2008.
[13] T. Hofmann, "Probabilistic Latent Semantic Analysis," Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, pp. 289-296, 1999.
[14] M. Ikonomakis, S. Kotsiantis, and V. Tampakas, "Text Classification Using Machine Learning Techniques," Wseas Transactions on Computers, Vol. 4, Issue 8, pp. 966-974, 2005.
[15] T. K. Landauer, P. W. Foltz, and D. Laham, "An Introduction to Latent Semantic Analysis," Discourse Processes, 1998.
[16] C.-J. Lin, LIBLINEAR - A Library for Large Linear Classification, http://www.csie.ntu.edu.tw/~cjlin/liblinear/, Nov. 2014.
[17] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2009.
[18] D. Newman, E. V. Bonilla, and W. Buntine, "Improving Topic Coherence with Regularized Topic Models," Neural Information Processing Systems, pp. 496-504, 2011.
[19] X.-H. Phan, C.-T. Nguyen, D.-T. Le, L.-M. Nguyen, S. Horiguchi, and Q.-T. Ha, "A Hidden Topic-Based Framework toward Building Applications with Short Web Documents," IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 7, pp. 961-976, 2011.
[20] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi, "Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections," Proceedings of the 17th International Conference on World Wide Web, pp. 91-100, 2008.
[21] X.-H. Phan and C.-T. Nguyen, GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation, 2007. http://gibbslda.sourceforge.net/.
[22] A. H. Razavi and D. Inkpen, "Text Representation Using Multi-level Latent Dirichlet Allocation," Canadian AI 2014, LNAI 8436, pp. 215-226, 2014.
[23] T. N. Rubin, A. Chambers, P. Smyth, and M. Steyvers, "Statistical Topic Models for Multi-label Document Classification," Machine Learning, Vol. 88, pp. 157-208, 2012.
[24] W. Sriurai, "Improving Text Categorization by Using a Topic Model," Advanced Computing: An International Journal (ACIJ), Vol. 2, No.6, pp. 21-27, 2011.
[25] M. Steyvers, "Probabilistic Topic Models," Handbook of Latent Semantic Analysis, 2007.
[26] H. M. Wallach, "Topic Modeling: Beyond Bag-of-Words," Proceedings of the 23rd International Conference on Machine Learning, pp. 977-984, 2006.
[27] Y. Wang and Q. Guo, "Multi-LDA Hybrid Topic Model with Boosting Strategy and its Application in Text Classification," Proceedings of the 33rd Chinese Control Conference, pp. 4802-4806, 2014.
[28] J. Weng, E.-P. Lim, J. Jiang, and Q. He, "TwitterRank: Finding Topic-sensitive Influential Twitterers," ACM International Conference on Web Search and Data Mining (WSDM 2010), pp. 261-270, 2010.
[29] Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proceedings of the 14th International Conference on Machine Learning, pp. 412-420, 1997.
[30] S. Zhou, K. Li, and Y. Liu, "Text Categorization Based on Topic Model," International Journal of Computational Intelligence Systems, Vol. 2, No. 4, pp. 398-409, 2009.
[31] D. D. Lewis, RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection, http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README_2004_0404.htm/.
[32] G. Salton and C. Buckley, The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, 1971, http://www.lextek.com/manuals/onix/stopwords2.html/.
[33] S. Mimaroglu, 20 Newsgroups, http://www.cs.umb.edu/~smimarog/textmining/datasets/.