National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Detailed record

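Author: 張家寧
Author (English): Chia-Ning Chang
Title: 以概念萃取為基礎之文件分群與視覺化
Title (English): A Concept Extraction Approach for Document Clustering and Visualization
Advisors: 柯皓仁, 楊維邦
Advisors (English): Hao-Ren Ke, Wei-Pang Yang
Degree: Master's
Institution: National Chiao Tung University (國立交通大學)
Department: Institute of Computer Science and Engineering (資訊科學與工程研究所)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Publication year: 2006
Graduation academic year: 94 (ROC calendar)
Language: Chinese
Pages: 59
Keywords (Chinese): 文件分群, 關鍵字分群, 概念萃取, 主題關鍵字, 視覺化, 引用
Keywords (English): Document Clustering, Keyword Clustering, Concept Extraction, Topic Keyword, Visualization, Citation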
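In recent years the Internet has become the most convenient channel for obtaining information, and entering keywords into a search engine is the most common way to do so. Search engines, however, usually do not filter or screen their results, and the sheer volume of returned data makes judging relevance harder. How to separate the useful from the useless in the retrieved data, build a model that users can easily understand, and thereby turn data efficiently into knowledge that users can readily absorb is an important current research topic. Clustering algorithms analyze data and group similar items by similarity; different clusters carry different meanings and concepts. Automatically extracting the meaning of each cluster and assigning it a concept is one of the main goals of this study.
This study proposes keyword clustering as the means of concept extraction: documents are described by multiple concepts and then clustered on the basis of those concepts. Concept extraction consists of three main steps: feature selection, establishing relationships among features, and feature clustering. The result of feature clustering is the set of concepts contained in all documents. In addition, the similarity of the articles cited within documents (citing articles) is used to establish citation relations between documents, and from these the citation relations between clusters, which yields the relatedness between concepts. Finally, in place of the traditional itemized list, the clustering results and the relatedness between concepts are presented visually.
This study uses papers from the CiteSeer database as its corpus, taking titles, abstracts, and citations as data sources. The abstract field contains only about 1,000 characters, an amount comparable to the result data returned by a keyword search in a search engine. According to the experimental analysis, the extracted concepts adequately express the overall concepts of the documents, and document clustering accuracy is also reasonable, reaching 80%.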
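The step from document-level citation relations to cluster-level relations can be illustrated with a small sketch. The record does not state how citation similarity is measured or aggregated, so everything here is an assumption made for illustration: Jaccard overlap of cited-article sets stands in for the similarity measure, a simple average stands in for the aggregation, and the names `jaccard` and `cluster_relations` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two reference sets: |A & B| / |A | B| (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_relations(citations, labels):
    """Average pairwise reference overlap for each pair of distinct clusters.

    citations: dict mapping doc id -> iterable of cited article ids
    labels:    dict mapping doc id -> cluster id
    """
    sums, counts = {}, {}
    for d1, d2 in combinations(sorted(citations), 2):
        c1, c2 = labels[d1], labels[d2]
        if c1 == c2:
            continue  # only relations between different clusters are kept
        key = (min(c1, c2), max(c1, c2))
        sums[key] = sums.get(key, 0.0) + jaccard(citations[d1], citations[d2])
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```

Cluster pairs with a higher average overlap would be drawn as more closely related concepts in a visualization; any threshold for drawing a link is a further design choice not covered by this sketch.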
The World Wide Web (WWW) contains a vast amount of information, but finding relevant information on it remains a great challenge. Keyword-based querying usually returns many documents; however, they are neither strongly related to one another nor presented in a comprehensible order. Clustering addresses this problem by grouping relevant documents, so that users can find what they need through groups of documents that share similar concepts.
This thesis extracts concepts from a corpus, where each concept is defined as a collection of keywords found in the documents, and clusters documents on the basis of the extracted concepts. The overall process is as follows. First, a clustering algorithm groups similar keywords to create concepts. Second, each document is represented by a vector whose elements indicate the similarity between the document and each concept. Third, documents are clustered according to these vectors. Furthermore, citations between documents are used to construct connections between documents, and these connections are in turn used to discover relations between groups and between concepts. In addition to extracting concepts and clustering documents, this thesis uses visualization to present the clustering results and show the relationships between concepts. Several experiments on CiteSeer documents show that the concepts extracted by this method not only represent each group clearly but also yield good clustering accuracy, about 80%.
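The three-step process described in the abstract (keywords grouped into concepts, documents turned into concept-similarity vectors, documents grouped by those vectors) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the thesis's algorithm: keyword similarity is taken to be the cosine of document co-occurrence profiles, keyword clustering is a greedy single pass with a hypothetical `threshold`, and document clustering is reduced to assigning each document to its strongest concept. All function names are invented for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def keyword_profiles(docs):
    """Describe each keyword by the documents it occurs in (doc index -> count)."""
    profiles = {}
    for i, tokens in enumerate(docs):
        for tok, n in Counter(tokens).items():
            profiles.setdefault(tok, {})[i] = float(n)
    return profiles


def cluster_keywords(profiles, threshold=0.5):
    """Greedy single pass: a keyword joins the first concept whose seed keyword
    it resembles; otherwise it starts a new concept (assumed stand-in method)."""
    concepts = []  # each concept is a list of keywords; element 0 is the seed
    for kw in sorted(profiles):
        for concept in concepts:
            if cosine(profiles[kw], profiles[concept[0]]) >= threshold:
                concept.append(kw)
                break
        else:
            concepts.append([kw])
    return concepts


def concept_vector(tokens, concepts):
    """One element per concept: share of the document's tokens in that concept."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return [sum(counts[k] for k in concept) / total for concept in concepts]


def assign_documents(docs, concepts):
    """Simplified grouping: each document goes to its strongest concept."""
    labels = []
    for tokens in docs:
        vec = concept_vector(tokens, concepts)
        labels.append(max(range(len(vec)), key=vec.__getitem__))
    return labels
```

In practice the concept vectors would be fed to a proper clustering algorithm rather than an arg-max, and the documents would first pass through preprocessing such as stop-word removal and stemming; both are omitted here to keep the sketch short.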
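Chinese Abstract
English Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Motivation and Objectives
1.2 Research Method and Scope
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Clustering Algorithms
2.1.1 Partitioning Methods: k-Means as an Example
2.1.2 Hierarchical Methods: Agglomerative & Divisive as Examples
2.1.3 Model-Based Methods: Self-Organizing Map as an Example
2.1.4 Topic Keyword Clustering
2.2 Clustering Criteria
2.3 Semantic Relatedness of Terms
2.3.1 Pearson's Chi-Square Test
2.3.2 Likelihood Ratio
2.3.3 Mutual Information
2.4 Applications of Visualization
Chapter 3 Concept-Extraction-Based Document Clustering and Visualization
3.1 Preprocessing
3.1.1 Tokenization and Lowercasing
3.1.2 Stop-Word Handling
3.1.3 Part-of-Speech (POS) Tagging
3.1.4 Stemming
3.1.5 Phrase Formation
3.2 Document Clustering Algorithm
3.2.1 Feature Selection
3.2.2 Concept Extraction and Feature Clustering
3.2.3 Document Clustering with Semantic Similarity Vectors
3.3 Cluster Post-Processing
3.3.1 Cluster Labeling
3.3.2 Building Cluster Relations from the Articles Cited by Papers
3.4 Visualization Process
Chapter 4 Analysis and Evaluation of Experimental Results
4.1 Evaluation Methods
4.1.1 Evaluation Against Expert Clustering Results
4.1.2 Evaluation by Cluster Distribution
4.1.3 Evaluation Against Expert-Labeled Pairwise Document Similarity
4.2 Experimental Results
4.2.1 Evaluation Against Expert Clustering Results
4.2.2 Evaluation by Cluster Distribution
4.2.3 Evaluation Against Expert-Labeled Pairwise Document Similarity
4.2.4 Discussion
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
Appendix
Introduction to the Visualization System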