Author: 陳奕安
Title (Chinese): 適用於中文史料文本之標記式主題模型分析方法研究
Title (English): An Enhanced Topic Model Based on Labeled LDA for Chinese Historical Corpora
Advisor: 蔡銘峰
Degree: Master's
Institution: National Chengchi University
Department: Department of Computer Science
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Graduation academic year: 105 (2016–2017)
Language: Chinese
Pages: 43
Chinese keywords: topic model; labeled topic model; Latent Dirichlet Allocation
Abstract (translated from Chinese): This thesis proposes a topic-analysis method for Chinese historical corpora. It builds on the Labeled Latent Dirichlet Allocation (Labeled LDA, LLDA) algorithm so that, given manually labeled Chinese texts, it can identify the vocabulary associated with specific topics. In our algorithm we add topic seed-word information to strengthen the clusters produced by LDA, raising the relevance between each cluster's vocabulary and its topic. In recent years, with the spread of the Internet, the rapid development of information retrieval, and the growth of digital archives, more and more physical books have been digitized and annotated with metadata. Once such valuable historical text data are available, how to apply text-mining techniques to them becomes an important research question. In particular, identifying the topics of documents in large historical corpora interests many scholars, and the LDA topic model is a classic text-mining method for this task. In this study we found that traditional LDA has several problems in describing the topics it clusters, including the high randomness of the topic categories and the low readability of individual topics, which makes subsequent interpretation very difficult. We therefore adopted Labeled LDA, a labeled topic model derived from LDA, which restricts the set of topics that can be generated and thus reduces this randomness. In addition, we introduced improvements that take Chinese word length and user-defined seed words into account, so that the clustered topic vocabulary is more relevant to its topic and easier to describe. In the experiments, we extracted topic vocabulary with the improved algorithm, labeled it manually, and used the labels as ground truth to compute information-retrieval metrics such as Mean Average Precision (MAP). The results confirm that clusters produced with long words and seed words taken into account outperform those of the traditional topic model. We also compared the final results with words ranked by TF-IDF weighting, and the experiments show the difference between the two.
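The two modifications described above — restricting each document's admissible topics to its labels (the Labeled LDA constraint) and boosting the word prior of user-defined seed words — can be sketched as a minimal collapsed Gibbs sampler. This is a hypothetical illustration, not the thesis's actual implementation: the function name, toy corpus, and `seed_boost` parameter are all assumptions made for the example.

```python
import random
from collections import defaultdict

def labeled_lda_with_seeds(docs, labels, seeds, n_iter=200, alpha=0.5,
                           beta=0.1, seed_boost=1.0, rng_seed=42):
    """Collapsed Gibbs sampler for a Labeled-LDA-style model.

    docs   : list of token lists
    labels : list of label lists; a document's tokens may only be
             assigned to topics named in its labels (Labeled LDA)
    seeds  : {topic: set of seed words}; seed words get a boosted
             word prior for their topic, pulling them to its top
    Returns {topic: {word: phi}} -- per-topic word distributions.
    """
    rng = random.Random(rng_seed)
    topics = sorted({t for ls in labels for t in ls})
    vocab = sorted({w for d in docs for w in d})

    def prior(t, w):  # asymmetric word prior: boosted for seed words
        return beta + (seed_boost if w in seeds.get(t, ()) else 0.0)

    n_tw = defaultdict(float)             # topic-word counts
    n_t = defaultdict(float)              # per-topic token totals
    n_dt = [defaultdict(float) for _ in docs]
    z = []                                # current topic per token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.choice(labels[d])     # init within allowed labels
            zs.append(t)
            n_tw[(t, w)] += 1; n_t[t] += 1; n_dt[d][t] += 1
        z.append(zs)

    beta_sum = {t: sum(prior(t, w) for w in vocab) for t in topics}
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = z[d][i]           # remove token's current count
                n_tw[(t_old, w)] -= 1; n_t[t_old] -= 1; n_dt[d][t_old] -= 1
                # sample only among this document's labels
                weights = [(n_dt[d][t] + alpha) *
                           (n_tw[(t, w)] + prior(t, w)) /
                           (n_t[t] + beta_sum[t])
                           for t in labels[d]]
                t_new = rng.choices(labels[d], weights=weights)[0]
                z[d][i] = t_new
                n_tw[(t_new, w)] += 1; n_t[t_new] += 1; n_dt[d][t_new] += 1

    return {t: {w: (n_tw[(t, w)] + prior(t, w)) / (n_t[t] + beta_sum[t])
                for w in vocab} for t in topics}
```

On a toy corpus of "history"- and "geography"-labeled documents, the seed boost reliably places each topic's seed word at the top of its distribution, which is the readability effect the thesis aims for.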
This paper proposes an enhanced topic model based on Labeled Latent Dirichlet Allocation (LLDA) for Chinese historical corpora to discover words related to specific topics. To enhance the performance of traditional LDA and to increase the readability of its clustered words, we incorporate seed-word information and Chinese word length into the traditional LDA algorithm. In this study, we find that traditional LDA has some problems with topic descriptions after clustering. We therefore apply the Labeled LDA algorithm, which is derived from traditional LDA, with the proposed improvements of considering word lengths and related seed words. In our experiments, Mean Average Precision (MAP) is used to evaluate our results against topic words labeled manually by historical experts. The experimental results show that the proposed method, which considers both Chinese word-length information and seed words, is better than the traditional LDA method. In addition, we compare the proposed results with the TF-IDF weighting scheme, and the proposed method also outperforms the TF-IDF method significantly.
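The MAP evaluation described in the abstract can be illustrated with a short sketch. The helper names are hypothetical, and the "relevant" sets stand in for the experts' manual labels: Average Precision rewards placing relevant words near the top of a topic's ranked word list, and MAP averages this over topics.

```python
def average_precision(ranked, relevant):
    """AP for one topic: mean of precision@k over the ranks k at which
    a retrieved word is in the manually labeled relevant set."""
    hits, score = 0, 0.0
    for k, word in enumerate(ranked, start=1):
        if word in relevant:
            hits += 1
            score += hits / k      # precision at this hit's rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over topics; runs is a list of (ranked_words, relevant_set)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, ranking ["a", "b", "c", "d"] against the relevant set {"a", "c"} gives AP = (1/1 + 2/3) / 2 = 5/6: the hit at rank 1 counts fully, while the hit at rank 3 is penalized by the irrelevant word above it.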
Acknowledgments 1
Chinese Abstract 2
Abstract 3
Chapter 1: Introduction 1
1.1 Overview 1
1.2 Traditional Topic Models and Their Limitations 1
1.3 Research Objectives 2
Chapter 2: Related Work 4
2.1 Applications of Topic Models 4
2.2 Models for Labeled Corpora 5
2.3 Long Words in English 5
Chapter 3: Methodology 7
3.1 Overview of Traditional Topic Models 7
3.2 Latent Dirichlet Allocation (LDA) 8
3.3 Labeled LDA 11
3.4 Adaptations for Chinese Text 13
3.4.1 Word Segmentation 13
3.4.2 Long-Word Priority 14
3.4.3 Incorporating Known Information 15
Chapter 4: Experimental Results and Discussion 17
4.1 Experimental Setup 17
4.1.1 Dataset and Preprocessing 17
4.1.2 Word Segmentation Tool 19
4.1.3 Quantitative Evaluation Metrics 19
4.2 Analysis and Discussion of Results 23
4.2.1 Long-Word Priority 23
4.2.2 Considering Seed Words 24
4.2.3 Comparison with Traditional Keyword Extraction 28
4.3 Summary 29
Chapter 5: Conclusion 30
References 32
[1] I. Bhattacharya. A latent Dirichlet model for unsupervised entity resolution. In Proceedings of the 6th SIAM International Conference on Data Mining, volume 124, page 47. SIAM, 2006.
[2] I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet allocation in web spam filtering. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, pages 29–32. ACM, 2008.
[3] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] K.-Y. Chen and B. Chen. On the use of topic models for large-vocabulary continuous speech recognition [in Chinese]. In Proceedings of ROCLING 2009, pages 179–194, 2009.
[6] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 524–531. IEEE, 2005.
[7] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[8] G. E. Hinton and T. J. Sejnowski. Unsupervised Learning: Foundations of Neural Computation. MIT Press, 1999.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM, 1999.
[10] R. A. Horn. The Hadamard product. In Proceedings of Symposia in Applied Mathematics, volume 40, pages 87–169, 1990.
[11] R. V. Lindsey, W. P. Headden III, and M. J. Stipicevic. A phrase-discovering topic model using hierarchical Pitman-Yor processes. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 214–222. Association for Computational Linguistics, 2012.
[12] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 159–168. ACM, 1998.
[13] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 1, pages 248–256. Association for Computational Linguistics, 2009.
[14] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their localization in images. In Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV'05), Volume 1, pages 370–377. IEEE Computer Society, 2005.
[15] Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics, 2006.
[16] X. Wang, A. McCallum, and X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE International Conference on Data Mining, pages 697–702. IEEE Computer Society, 2007.
[17] X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 178–185. ACM, 2006.
[18] D. Xing and M. Girolami. Employing latent Dirichlet allocation for fraud detection in telecommunications. Pattern Recognition Letters, 28(13):1727–1734, 2007.
[19] L. Yao, Y. Zhang, B. Wei, W. Wang, Y. Zhang, X. Ren, and Y. Bian. Discovering treatment pattern in traditional Chinese medicine clinical cases by exploiting supervised topic model and domain knowledge. Journal of Biomedical Informatics, 58(C):260–267, 2015.
[20] 孟海濤, 陳思, and 周睿. Web text classification based on the LDA model [in Chinese]. Journal of Yancheng Institute of Technology (Natural Science Edition), 22(4):56–59, 2009.
[21] 賈西平, 彭宏, 鄭啟倫, 石時需, and 江焯林. A topic-based document retrieval model [in Chinese]. Journal of South China University of Technology (Natural Science Edition), 36(9):37–42, 2008.