Graduate Student (English): Hao-hsiang Lin
Thesis Title (English): Preference-Anchored Document Clustering Technique: Effects of Term Relationships and Thesaurus
Advisor (English): Chih-Ping Wei
Keywords (English): Personalized document clustering; Document clustering; Preference-based document clustering; Text mining; Hierarchical agglomerative clustering (HAC)
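According to the context theory of document clustering, an individual's document-clustering behavior does not merely consider the attributes (including contents) of documents but also depends on the task and context in which the individual performs the clustering. Therefore, an effective document-clustering technique must be able to account for users' different preferential perspectives and thereby produce preference-specific clustering results. The Preference-Anchored Document Clustering (PAC) technique supports preference-based document clustering and takes the user's clustering preference into account to produce preference-specific clustering results. This thesis investigates two research questions concerning PAC: (1) whether different term relationships can improve the effectiveness of PAC, and (2) whether statistical thesauri constructed from different corpora can improve the effectiveness of PAC. The empirical results show that, given the complete set of anchoring terms, the proposed method achieves the same clustering effectiveness as PAC; however, as the number of anchoring terms decreases, it cannot reach the same clustering effectiveness as PAC and even yields worse clustering effectiveness. The empirical results also show that a statistical thesaurus constructed from a larger corpus does not improve the clustering effectiveness of PAC.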
According to the context theory of classification, individuals' document-clustering behavior depends not only on the attributes (including contents) of documents but also on who is doing the task and in what context. Thus, effective document-clustering techniques need to take users’ categorization preferences into account and generate document clusters from different preferential perspectives. The Preference-Anchored Document Clustering (PAC) technique was proposed to support preference-based document clustering. Specifically, PAC takes a user’s categorization preference into consideration and generates a set of document clusters from that preferential perspective. In this study, we investigate two research questions concerning the PAC technique. The first is “whether the incorporation of broader-term expansion (i.e., the PAC2 technique proposed in this study) will improve the effectiveness of preference-based document clustering,” and the second is “whether the use of a statistical-based thesaurus constructed from a larger document corpus will improve the effectiveness of preference-based document clustering.” Our empirical results show that, compared with PAC, the proposed PAC2 technique neither improves nor degrades the effectiveness of preference-based document clustering when the complete set of anchoring terms is used. However, when only a partial set of anchoring terms is provided, PAC2 does not improve and can even degrade the effectiveness of preference-based document clustering. As to the second research question, our empirical results suggest that a statistical-based thesaurus constructed from a larger document corpus (i.e., the ACM corpus consisting of 14,729 documents) does not improve the effectiveness of PAC or PAC2 for preference-based document clustering.
1.1 Background
1.2 Research Motivation and Objectives
1.3 Organization of the Thesis
2.1 Content-based Document Clustering Techniques
2.2 Preference-Anchored Document Clustering (PAC) Technique
3.1 Statistical-based Thesaurus Construction
3.2 Preference Expansion
3.3 Document Representation
3.4 Clustering
4.1 Collection of Document Corpora
4.2 Collection of Users' Preferred Clustering
4.3 Evaluation Criteria and Procedure
4.4 Experiment 1: Effectiveness of PAC2 vs. PAC
4.4.1 Tuning of Traditional Content-based Document-Clustering Technique
4.4.2 Tuning of PAC and PAC2 Techniques
4.4.3 Comparative Evaluation
4.5 Experiment 2: Effects of Thesaurus on PAC and PAC2
4.5.1 Tuning of PAC and PAC2 Techniques
4.5.2 Comparative Evaluation
