National Digital Library of Theses and Dissertations in Taiwan

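Graduate student: 鄭宇傑
Graduate student (romanized name): JHENG, YU-JIE
Thesis title: 以核運算方法與LDA主題模型產生文字標籤之比較研究
Thesis title (English): A Comparative Study of Automatic Text Labeling Using Von Neumann Kernel and LDA Topic Model
Advisor: 陳宗天
Advisor (romanized name): CHEN, TSUNG-TENG
Oral defense committee: 陳志銘, 蔡瑞煌, 李瑞元, 何善輝, 陳宗天
Oral defense committee (romanized names): Chen, Chih Ming; Tsaih, Rua Huan; Lee, Maria R.; Ho, Shan Hui; Chen, Tsung Teng
Oral defense date: 2015-07-24
Degree: Master's
Institution: National Taipei University
Department: Graduate Institute of Information Management
Discipline: Computing
Sub-discipline: General computing
Thesis type: Academic thesis
Year of publication: 2015
Graduation academic year: 103
Language: Chinese
Number of pages: 154
Keywords (Chinese): 隱含狄利克雷分佈; 主題模型; 自動標籤
Keywords (English): Latent Dirichlet Allocation; Topic Model; Automatic Labeling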
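In academic fields, large numbers of documents can be grouped into document clusters based on the relationships between documents or the similarity of their content. To help researchers understand the concepts each cluster expresses, an automatic labeling system analyzes the text of the documents in a cluster and automatically generates a label for that cluster. In recent years, Latent Dirichlet Allocation (LDA) has also been widely applied across fields to produce topic models through statistical and probabilistic analysis. Taking the document clusters of a specific field as its basis, this study uses LDA to uncover the probability-distribution characteristics that relate the documents in each cluster. Building a topic model yields the probabilistic model parameters, from which each cluster's distribution of topic keywords is derived and composed into labels. To test the accuracy of the cluster labels predicted by the LDA topic model, this study adopts the label evaluation framework proposed by Treeratpituk to assess the quality of the labels composed by the LDA topic model system, as validation of the system's performance. This approach records the performance and experimental data for each combination of LDA topic model system parameter settings in order to analyze the system's cluster-label accuracy. The study also applies the academic-cluster keyword extraction technique of the automatic labeling system and compares the cluster-label accuracy of the two approaches, aiming to determine from the experimental data which method predicts more accurately and to adopt the better-performing one. The results show that, with 4 clusters and the number of topics fixed to the range 4~50, setting the number of words per topic to 30 performs best under the Precision and MTRR label evaluation methods, and that label performance declines slightly as the number of words per topic increases. The keyword sets produced by the automatic labeling system and by the LDA topic model are both related to the cluster names to some degree, and each has its own explanatory power for the clusters. The overall cluster-label scores of the two systems were 8.65375 and 6.59098, respectively, and the cluster labels produced by the LDA topic model system obtained the higher label-quality score, indicating higher accuracy in naming clusters.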
There are tools and techniques capable of grouping large numbers of documents into cohesive clusters based on relatedness or similarity metrics between the documents. The resulting clusters need to be properly labeled to facilitate a fast and holistic comprehension of the main themes or topics they carry. Existing systems have employed various theoretical or empirical approaches to label clusters of documents automatically. Our study applied Latent Dirichlet Allocation (LDA) to obtain the most likely keywords for topics in the document clusters. The obtained keywords are then composed into key phrases that serve as the representative labels of the clusters. The appropriateness of the labels is evaluated using the evaluative framework proposed by Treeratpituk. We found that the LDA-based automatic labeling system generates proper cluster labels. We also compared the effectiveness of the LDA-based labeling system with our home-grown kernel-based system. In most cases, the LDA-based system generated better cluster labels than our kernel-based system in the experiment.
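As a rough illustration of how ranked cluster labels might be scored, the sketch below computes a precision-at-N value and a mean total reciprocal rank. It assumes that MTRR stands for mean total reciprocal rank (the sum of 1/rank over all correct labels, averaged over clusters) and that a set of labels judged correct is available for each cluster; the function names, the sample labels, and the correctness sets are all hypothetical and are not taken from the thesis.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_n(ranked_labels: Sequence[str], correct: Set[str], n: int) -> float:
    """Fraction of the top-n candidate labels that are judged correct."""
    top = ranked_labels[:n]
    if not top:
        return 0.0
    return sum(1 for label in top if label in correct) / len(top)


def total_reciprocal_rank(ranked_labels: Sequence[str], correct: Set[str]) -> float:
    """Sum of 1/rank over every correct label in the list (ranks start at 1)."""
    return sum(
        1.0 / rank
        for rank, label in enumerate(ranked_labels, start=1)
        if label in correct
    )


def mean_total_reciprocal_rank(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Average the total reciprocal rank over (ranked_labels, correct) pairs,
    one pair per cluster."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(total_reciprocal_rank(r, c) for r, c in runs) / len(runs)


# Hypothetical candidate labels for two clusters (illustrative data only).
cluster_a = (["topic model", "gibbs sampling", "text mining", "kernel method"],
             {"topic model", "text mining"})
cluster_b = (["link analysis", "cluster labeling"], {"cluster labeling"})

p_at_2 = precision_at_n(cluster_a[0], cluster_a[1], 2)
trr_a = total_reciprocal_rank(cluster_a[0], cluster_a[1])
mtrr = mean_total_reciprocal_rank([cluster_a, cluster_b])
```

Under these assumptions, a system that places correct labels nearer the top of its ranked list receives a higher total reciprocal rank, which is one way two labeling systems could be compared on the same clusters.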
Acknowledgements
National Taipei University Master's Thesis Summary, Second Semester, Academic Year 103
ABSTRACT
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
Section 1. Research Background and Motivation
Section 2. Research Objectives
Section 3. Research Structure
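Chapter 2. Literature Review
Section 1. Topic Models
Section 2. Dirichlet Distribution
Section 3. Probabilistic Latent Semantic Analysis
Section 4. Latent Dirichlet Allocation Model
2.4.1 Latent Dirichlet Allocation
2.4.2 Applications of the Latent Dirichlet Allocation Model
2.4.3 Predecessor Models of Latent Dirichlet Allocation
2.4.4 Principles of the Latent Dirichlet Allocation Model
2.4.5 Latent Dirichlet Allocation Systems and Parameter Selection
Section 5. Gibbs Sampling
Section 6. Feature Selection
Section 7. Semantic Similarity
2.7.1 Latent Semantic Indexing
2.7.2 Kernel-Based Methods
Section 8. Jaccard Similarity Measure
Section 9. Term Association Strength
2.9.1 Mutual Information
2.9.2 Modified Mutual Information
2.9.3 Keyword Network
Section 10. Label Evaluation Framework
2.10.1 Definition of Correct-Label Criteria
2.10.2 Label Evaluation Methods
Section 11. Current State of Automatic Labeling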
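Chapter 3. Research Method
Section 1. Research Flowchart
Section 2. Data Collection
3.2.1 Document Term Indexing
Section 3. Data Processing
3.3.1 Data Processing Flow of the Automatic Labeling System
3.3.2 Processing Flow of the LDA Topic Model System
3.3.3 Composing Key Phrases
Section 4. Data Evaluation
3.4.1 Jaccard Similarity Measure
3.4.2 Evaluation of LDA Topic Model System Parameter Settings
3.4.3 Overall Evaluation of System Label Quality
Section 5. Tools Used
3.5.1 Automatic Labeling System
3.5.2 JGibbLDA2.0 System
3.5.3 Jaccard Similarity Matching System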
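Chapter 4. Implementation and Results
Section 1. Implementation
4.1.1 Overview of the System Implementation Flow
4.1.2 Evaluation of LDA Topic Model System Parameter Settings
4.1.3 Jaccard Similarity Measurement of Keyword Sets
4.1.4 Comparison of System Label Quality Evaluations
Section 2. Results
4.2.1 Evaluation of LDA Topic Model System Parameter Settings
4.2.2 Jaccard Similarity Measurement of Keyword Sets
4.2.3 Comparison of System Label Quality Evaluations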
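Chapter 5. Conclusions and Recommendations
Section 1. Conclusions
Section 2. Research Contributions
Section 3. Research Limitations
Section 4. Suggestions for Future Research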
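References
Appendix 1. List of Articles in the Document Clusters
Appendix 2. Experimental Data for the Evaluation of LDA Topic Model System Parameter Settings
Appendix 2.1. Topic: 4~50, Words: 30
Appendix 2.2. Topic: 4~50, Words: 50
Appendix 2.3. Topic: 4~50, Words: 100
Appendix 2.4. Topic: 4, Words: 30~100
Appendix 2.5. Topic: 6, Words: 30~100
Appendix 2.6. Topic: 10, Words: 30~100
Appendix 3. LDA Topic Model Data (Topic: 12, Words: 30)
Appendix 3.1. Theta Document-Topic Distribution Probabilities (Topic: 12, Words: 30)
Appendix 3.2. Twords Topic-Word Distribution Probabilities (Topic: 12, Words: 30)
Appendix 4. LDA Topic Model Key Two-Word Phrases (Topic: 12, Words: 30)
Appendix 4.1. TOPO Key Two-Word Phrases (Topic: 12, Words: 30) (top 50)
Appendix 4.2. MMI Key Two-Word Phrases (Topic: 12, Words: 30) (top 50)
Appendix 5. Automatic Labeling Key Two-Word Phrases
Appendix 5.1. Automatic Labeling Keyword Sets
Appendix 5.2. TOPO Key Two-Word Phrases
Appendix 5.3. MMI Key Two-Word Phrases
Appendix 6. Cluster Label Quality Evaluation Data
Appendix 6.1. LDA Topic Model Cluster Label Quality Evaluation Data (Topic: 12, Words: 30)
Appendix 6.2. Automatic Labeling System Cluster Label Quality Evaluation Data
Appendix 7. Jaccard Similarity Values and LDA Topic Model Matching Data
Appendix 7.1. Jaccard Similarity Values and LDA Topic Model Matching Table
Appendix 7.2. Jaccard Similarity Values and LDA Topic Model Matching Count Table
Curriculum Vitae
Copyright Statement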

Anthes, G. (2010). Topic models vs. unstructured data. Commun. ACM, 53(12), 16-18. doi: 10.1145/1859204.1859210
Azzopardi, L., Girolami, M., & Van Rijsbergen, C. J. (2004, 25-29 July 2004). Topic based language models for ad hoc information retrieval. Paper presented at the Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics): Springer-Verlag New York, Inc.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. J. Machine Learning Res., 3, 993-1022.
Carmel, D., Roitman, H., & Zwerdling, N. (2009). Enhancing cluster labeling using Wikipedia. In SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (pp. 139-146): Association for Computing Machinery.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading Tea Leaves: How Humans Interpret Topic Models. Advances in Neural Information Processing Systems 22, 288-296.
Cristianini, N., Kandola, J., Elisseeff, A., & Shawe-Taylor, J. (2006). On Kernel Target Alignment. In D. Holmes & L. Jain (Eds.), Innovations in Machine Learning (Vol. 194, pp. 205-256): Springer Berlin Heidelberg.
Cutting, D. R., Karger, D. R., & Pedersen, J. O. (1993). Constant interaction-time scatter/gather browsing of very large document collections. Paper presented at the Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, Pittsburgh, Pennsylvania, USA.
Darling, W. M. (2011). A theoretical and practical implementation tutorial on topic modeling and gibbs sampling. Paper presented at the Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies.
Davidov, D., Gabrilovich, E., & Markovitch, S. (2004). Parameterized generation of labeled datasets for text categorization based on a hierarchical directory. Paper presented at the Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, United Kingdom.
Deerwester, S., Dumais, S., Landauer, T., Furnas, G., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Feldman, R., Fresko, M., Kinar, Y., Lindell, Y., Liphstat, O., Rajman, M., . . . Zamir, O. (1998). Text mining at the term level. In J. Żytkow & M. Quafafou (Eds.), Principles of Data Mining and Knowledge Discovery (Vol. 1510, pp. 65-73): Springer Berlin Heidelberg.
Ferrer I Cancho, R., & Solé, R. V. (2001). The small world of human language. Proceedings. Biological sciences / The Royal Society, 268(1482), 2261-2265. doi: 10.1098/rspb.2001.1800
Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. Paper presented at the Proceedings of the 20th international joint conference on Artifical intelligence, Hyderabad, India.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235. doi: 10.1073/pnas.0307752101
Heinrich, G. (2005). Parameter estimation for text analysis. Web: http://www.arbylon.net/publications/text-est.pdf
Hofmann, T. (1999). Probabilistic latent semantic indexing. Paper presented at the Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, California, USA.
Chen, H., Chung, Y.-M., Ramsey, M., & Yang, C. C. (1998). An intelligent personal spider (agent) for dynamic Internet/Intranet searching. Decision Support Systems, 23(1), 41-58. doi: 10.1016/S0167-9236(98)00035-9
Hu, D. J. (2009). Latent dirichlet allocation for text, images, and music. University of California, San Diego. Retrieved April, 26, 2013.
Ito, T., Shimbo, M., Mochihashi, D., & Matsumoto, Y. (2006). Exploring Multiple Communities with Kernel-Based Link Analysis. In J. Fürnkranz, T. Scheffer & M. Spiliopoulou (Eds.), Knowledge Discovery in Databases: PKDD 2006 (Vol. 4213, pp. 235-246): Springer Berlin Heidelberg.
Jiang, J. J., & Conrath, D. W. (1997). Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Paper presented at the International Conference Research on Computational Linguistics (ROCLING X).
Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21. doi: 10.1108/eb026526
Kakkonen, T., Myller, N., Sutinen, E., & Timonen, J. (2008). Comparison of Dimension Reduction Methods for Automated Essay Grading. Educational Technology & Society, 11(3), 275-288.
Kandola, J. S., Shawe-Taylor, J., & Cristianini, N. (2002). Learning Semantic Similarity. Paper presented at the Advances in Neural Information Processing Systems 15: Neural. http://books.nips.cc/papers/files/nips15/AA22.pdf
Kasliwal, B., Bhatia, S., Saini, S., Thaseen, I. S., & Kumar, C. A. (2014, 21-22 Feb. 2014). A hybrid anomaly detection model using G-LDA. Paper presented at the Advance Computing Conference (IACC), 2014 IEEE International.
Larkey, L. S., Ballesteros, L., & Connell, M. E. (2002). Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis. Paper presented at the Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, Tampere, Finland.
Lau, J. H., Grieser, K., Newman, D., & Baldwin, T. (2011). Automatic labelling of topic models. Paper presented at the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, Portland, Oregon.
Levandowsky, M., & Winter, D. (1971). Distance between Sets. Nature, 234(5323), 34-35.
Linstead, E., Rigor, P., Bajracharya, S., Lopes, C., & Baldi, P. (2007). Mining concepts from code with probabilistic topic models. Paper presented at the Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, Atlanta, Georgia, USA.
Liu, N., Zhang, B., Yan, J., Yang, Q., Yan, S., Chen, Z., . . . Ma, W.-Y. (2004). Learning similarity measures in non-orthogonal space. Paper presented at the Proceedings of the thirteenth ACM international conference on Information and knowledge management, Washington, D.C., USA.
Maskeri, G., Sarkar, S., & Heafield, K. (2008). Mining business topics in source code using latent dirichlet allocation. Paper presented at the Proceedings of the 1st India software engineering conference, Hyderabad, India.
McCandless, M., Hatcher, E., & Gospodnetic, O. (2010). Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Greenwich, CT, USA: Manning Publications Co.
Mehrotra, R., Sanner, S., Buntine, W., & Xie, L. (2013). Improving LDA topic models for microblogs via tweet pooling and automatic labeling. Paper presented at the Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, Dublin, Ireland.
Mei, Q., Shen, X., & Zhai, C. (2007). Automatic labeling of multinomial topic models. Paper presented at the Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, San Jose, California, USA.
Osinski, S., & Weiss, D. (2005). A concept-driven algorithm for clustering search results. Intelligent Systems, IEEE, 20(3), 48-54. doi: 10.1109/MIS.2005.38
Peipeng, L., & Sim, R. T. T. (2014). Research experience of big data analytics: the tools for government: a case using social network in mining preferences of tourists. Paper presented at the Proceedings of the 8th International Conference on Theory and Practice of Electronic Governance, Guimaraes, Portugal.
Popescul, A., & Ungar, L. H. (2000). Automatic labeling of document clusters. Unpublished manuscript, available at http://citeseer.nj.nec.com/popescul00automatic.html
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137. doi: 10.1108/eb046814
Saini, S., Kasliwal, B., & Bhatia, S. (2013). Spam Detection using G-LDA. International Journal of Advanced Research in Computer Science and Software Engineering, 3(10).
Salton, G., & McGill, M. J. (1986). Introduction to Modern Information Retrieval. New York, NY, USA: McGraw-Hill, Inc.
Treeratpituk, P., & Callan, J. (2006). Automatically labeling hierarchical clusters. Paper presented at the Proceedings of the 2006 international conference on Digital government research, San Diego, California, USA.
Wang, Y. (2008). Distributed Gibbs Sampling of Latent Topic Models: The Gritty Details.
Yang, Y., & Pedersen, J. (1997). A comparative study on feature selection in text categorization. Paper presented at the Proceedings of ICML-97, 14th International Conference on Machine Learning.
Zhou, S., Li, K., & Liu, Y. (2008). Text categorization based on topic model. Paper presented at the Proceedings of the 3rd international conference on Rough sets and knowledge technology, Chengdu, China.
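江珅薇. (2007). Keyword extraction from collections of related academic papers: automatic naming of academic fields [相關學術論文集合關鍵詞擷取-學術領域自動命名]. (Master's thesis), National Taipei University, New Taipei City.
林佳宜. (2008). Hierarchical automatic labeling of related document clusters [相關文件群集之階層式自動標籤]. (Master's thesis), National Taipei University, New Taipei City.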
