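Graduate student: Su-hsin Yu (游淑鑫)
Thesis title: Chinese Text Clustering for Topic Detection Based on Word Pattern Relation (基於詞彙模式關聯的主題式中文文本集群)
Advisor: Yen-Ju Yang (楊燕珠)
Degree: Master's
Institution: Tatung University (大同大學)
Department: Department of Information Management (資訊經營學系)
Discipline: Business and Management
Field: General Business
Thesis type: Academic thesis
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: English
Number of pages: 79
Chinese keywords: 文本集群, 主題, 模式關聯, 模式, 詞彙
English keywords: text clustering, topic, pattern relation, pattern, word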
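Abstract (translated from the Chinese):
Searching for information and reading news and magazines are currently the most common online activities. If the information returned is not grouped by topic, however, users must spend more time identifying the topic of each search result. Topic grouping can be addressed with clustering, and the vector space model (VSM) is one way to represent documents before clustering. Its drawback for document clustering lies in its basic assumption that feature terms are mutually independent. Words in natural language are not independent: some words frequently occur together. Because the model compares texts only by matching index terms, it can produce term-matching errors: polysemy causes too much information to be retrieved, and synonymy causes too little.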
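The term-matching weakness described above can be illustrated with a minimal sketch of bag-of-words cosine similarity. The function name `cosine` and the toy documents are assumptions made for illustration; they are not taken from the thesis, which does not list its VSM implementation here.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity of two bag-of-words term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical, already-segmented toy documents (not from the thesis corpus).
car_doc = ["car", "engine", "repair"]
auto_doc = ["automobile", "motor", "maintenance"]  # same topic, different words
fruit_doc = ["apple", "juice", "orchard"]
phone_doc = ["apple", "phone", "launch"]           # different topic, shared word

# Synonymy: related documents share no index term, so they look unrelated.
synonymy_score = cosine(car_doc, auto_doc)
# Polysemy: unrelated documents share one surface form, so they look related.
polysemy_score = cosine(fruit_doc, phone_doc)
print(synonymy_score, polysemy_score)
```

In this sketch the two car-related documents score zero while the two "apple" documents score above zero, which is the mismatch the abstract attributes to treating terms as independent.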
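To address these problems, this study uses term expansion to group related features into a single semantic concept and then derives the corresponding documents, in the expectation that indexing by semantic concept reduces the problems of word co-occurrence, polysemy, and synonymy. In the experiments, sequences of two or three words in the same sentence form a word pattern, which replaces the keyword as the text feature. The strength of each word pattern's distribution across documents is measured by pattern frequency, pattern frequency-inverse document frequency, conditional probability, mutual information, and association norm, and the word patterns are then clustered hierarchically. Each cluster is treated as one semantic concept. Concepts are then merged into a single topic on the basis of the documents in which they co-occur, and the texts corresponding to a topic are regarded as related to it.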
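A minimal sketch of the word patterns and three of the strength measures named above follows. It rests on several assumptions that the record does not confirm: patterns are taken as contiguous word sequences inside one segmented sentence, inverse document frequency is `log(N / df)`, and mutual information is the pointwise form over corpus counts. All function names and the toy corpus are hypothetical. Conditional probability and association norm are omitted because their exact definitions are not given here.

```python
import math
from collections import Counter

def extract_patterns(sentence, n):
    """Contiguous n-word sequences of one segmented sentence (assumed form)."""
    return [tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]

def document_patterns(doc, sizes=(2, 3)):
    """All patterns of a document; patterns never cross sentence boundaries."""
    return [p for sent in doc for n in sizes for p in extract_patterns(sent, n)]

def pattern_strengths(docs, sizes=(2,)):
    """Pattern frequency, document frequency, and PF-IDF for every pattern."""
    pf, df = Counter(), Counter()
    for doc in docs:
        counts = Counter(document_patterns(doc, sizes))
        pf.update(counts)         # total occurrences across the corpus
        df.update(counts.keys())  # one count per document containing it
    pf_idf = {p: pf[p] * math.log(len(docs) / df[p]) for p in pf}
    return pf, df, pf_idf

def pointwise_mutual_information(docs, pattern):
    """log2 of P(x, y) / (P(x) * P(y)) for a two-word pattern."""
    words = Counter(w for doc in docs for sent in doc for w in sent)
    pairs = Counter(p for doc in docs for p in document_patterns(doc, (2,)))
    if pairs[pattern] == 0:
        return float("-inf")
    x, y = pattern
    p_xy = pairs[pattern] / sum(pairs.values())
    p_x = words[x] / sum(words.values())
    p_y = words[y] / sum(words.values())
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical corpus: each document is a list of segmented sentences.
docs = [
    [["typhoon", "warning", "issued"], ["schools", "closed"]],
    [["typhoon", "warning", "lifted"]],
    [["stock", "market", "rises"]],
]
pf, df, pf_idf = pattern_strengths(docs)
pmi = pointwise_mutual_information(docs, ("typhoon", "warning"))
print(pf[("typhoon", "warning")], pf_idf[("typhoon", "warning")], pmi)
```

For Chinese text the sentences would first pass through word segmentation and filtering, which this sketch takes as already done.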
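The experimental results show that the proposed clustering method outperforms traditional VSM clustering with all five feature-strength measures. Pattern frequency gives the best average recall (98.84%), and association norm gives the best average precision (95.26%) and the best average F-measure (96.7%).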
Abstract (English):
Information searching and reading news and magazines are the most common activities of people surfing the Internet today. However, if the information returned is not grouped by topic, the user has to spend more time working out the topic of each search result. Grouping by topic can be handled with clustering techniques, and the vector space model is a method of representing documents before clustering. The weakness of this model for document clustering is its assumption that features are unrelated to one another. Words in natural language are not independent: some words often appear together. Because the model matches texts using word index terms alone, it can produce word-matching errors: polysemy causes too much information to be retrieved, and synonymy causes too little.
To solve these problems, this study uses word expansion to compose related features into a single semantic concept and then retrieves the corresponding documents. We expect that indexing by semantic concept can reduce the problems of collocation, polysemy, and synonymy. In the experiments, a sequence of two or three words in the same sentence forms a word pattern, which replaces the keyword as the feature of the text. The distributional strength of the key patterns is measured by Pattern Frequency, Pattern Frequency-Inverse Document Frequency, Conditional Probability, Mutual Information, and Association Norm. Hierarchical clustering is then applied to the key patterns according to this strength, and every resulting cluster is treated as one semantic concept. Next, semantic concepts are merged into a single topic based on the documents they have in common. The texts corresponding to a topic are then considered related to that topic.
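The two grouping stages in the paragraph above, clustering patterns into concepts and then merging concepts into topics, might look roughly like the sketch below. It assumes single-link agglomerative clustering with cosine similarity over per-document strength vectors and merges concepts that share at least one document. The linkage, the similarity measure, the threshold, and all names and data are assumptions for illustration, not the thesis's actual procedure.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {doc_id: strength}."""
    dot = sum(u[d] * v[d] for d in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def cluster_patterns(vectors, threshold):
    """Single-link agglomerative clustering: merge the two most similar
    clusters until no pair of clusters reaches the threshold."""
    clusters = [{p} for p in vectors]

    def link(a, b):
        return max(cosine(vectors[p], vectors[q]) for p in a for q in b)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters

def merge_concepts(concepts, vectors):
    """Merge concepts that share at least one document into topics.
    Returns a list of (patterns, doc_ids) pairs."""
    topics = []
    for concept in concepts:
        patterns = set(concept)
        docs = set().union(*(vectors[p].keys() for p in concept))
        remaining = []
        for t_patterns, t_docs in topics:
            if t_docs & docs:
                patterns |= t_patterns
                docs |= t_docs
            else:
                remaining.append((t_patterns, t_docs))
        remaining.append((patterns, docs))
        topics = remaining
    return topics

# Hypothetical pattern strengths per document id (not thesis data).
vectors = {
    ("typhoon", "warning"): {0: 2.0, 1: 1.0},
    ("warning", "issued"): {0: 1.0, 1: 1.0},
    ("schools", "closed"): {1: 1.0},
    ("stock", "market"): {2: 3.0},
    ("market", "rises"): {2: 1.0, 3: 1.0},
}
concepts = cluster_patterns(vectors, threshold=0.9)
topics = merge_concepts(concepts, vectors)
for patterns, doc_ids in topics:
    print(sorted(doc_ids), sorted(patterns))
```

With this toy input the strict threshold leaves four concepts, and the shared-document step then folds them into two topics, one per subject area. A library implementation such as an agglomerative clustering routine would be the practical choice for a real corpus.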
The experimental results show that the proposed text clustering outperforms traditional VSM clustering with each of the five feature-strength measures. Pattern Frequency gives the best Average Recall, 98.84%. Association Norm gives the best Average Precision, 95.26%, and also the best Average F-measure, 96.7%.
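For reference, the three reported metrics can be computed as in the sketch below. It assumes the standard set-based definitions for one cluster against one reference topic and a simple unweighted mean over topics. The averaging scheme actually used in the thesis is not stated in this record, and the document ids are made up.

```python
def precision_recall_f(cluster, truth):
    """Precision, recall, and F-measure of one cluster against one topic."""
    cluster, truth = set(cluster), set(truth)
    hits = len(cluster & truth)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def macro_average(pairs):
    """Unweighted mean of each metric over (cluster, truth) pairs (assumed)."""
    scores = [precision_recall_f(c, t) for c, t in pairs]
    return tuple(sum(s[k] for s in scores) / len(scores) for k in range(3))

# Hypothetical document ids: (documents placed in the topic, documents that belong).
pairs = [
    ({1, 2, 3, 4}, {1, 2, 3, 5, 6}),
    ({7, 8}, {7, 8}),
]
avg_precision, avg_recall, avg_f = macro_average(pairs)
print(avg_precision, avg_recall, avg_f)
```

Note that under this averaging the mean F-measure is the mean of per-topic F values, not the F-measure of the mean precision and recall.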
CHINESE ABSTRACT
ENGLISH ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 BACKGROUND AND MOTIVATION
1.2 OBJECTIVE
1.3 THESIS ORGANIZATION
CHAPTER 2 RELATED RESEARCH
2.1 FEATURE SELECTION
2.2 JUDGMENT OF FEATURE INTENSITY
2.2.1 Document Frequency
2.2.2 Term Frequency
2.2.3 Term Frequency-Inverse Document Frequency, tf-idf
2.2.4 Mutual Information
2.2.5 Association Norm Estimation
2.3 VECTOR SPACE MODEL
2.4 THE SYNONYMOUS RELATIONS OF WORD
2.5 CLUSTERING METHOD
2.5.1 Hierarchical Clustering
2.5.2 Partitioning Clustering
2.5.3 Suffix Tree Clustering, STC
2.5.4 Comparison of Clustering Methods
CHAPTER 3 RESEARCH METHODS
3.1 ALGORITHMS
3.2 SYSTEM FRAMEWORK
3.3 KEY PATTERNS EXTRACTION
3.3.1 Word Segmentation
3.3.2 Word Filtering
3.4 KEY PATTERNS RELATION ANALYSIS
3.4.1 Calculation of Pattern Strength
3.4.2 Indication of the Association between Patterns
3.5 KEY PATTERNS CLUSTERING
3.6 TEXTS CLUSTERING
CHAPTER 4 EXPERIMENTAL RESULTS AND EVALUATIONS
4.1 THE TESTING OF TEXT COLLECTION
4.2 THE EVALUATION METHODS OF EXPERIMENT
4.3 THE DESIGN OF THE EXPERIMENT
4.3.1 The Influence of 2NP and 3NP with regards to the topic that only has one related document
4.3.2 The Performance of the Five feature methods in Each Document Collection
4.3.3 The Influence of amount of documents with regards to the five feature clustering methods
4.3.4 The Influence of Key 2NP Quantity on five feature clustering methods
4.3.5 The Influence of the average amount of Key 2NP in documents on the five feature clustering methods
4.3.6 The Influence of the average amount of Key 2NP in the topic on the five feature clustering methods
4.3.7 The Influence of the average amount of documents includes in the topic on the five feature clustering methods
4.4 Analysis of Errors
CHAPTER 5 CONCLUSION AND PROSPECTIVE DIRECTION
BIBLIOGRAPHY