Graduate student (English name): Kang-Di Ting
Thesis title (English): Clustering Articles in a Literature Digital Library Based on Content and Usage
Advisor (English name): San-Yih Hwang
Keywords (English): Digital library; Document categorization; Usage clustering; Document clustering; Content-based clustering
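In prior related work, document classification or document clustering could be applied to the problem this study addresses, but document classification mostly requires the help of experts as well as a predefined set of subject categories or a directory. This study therefore tries to use the usage log recorded by the system to stand in for expert classification, reducing the cost of manual expert labor while still building a browsing interface that fits users' needs. The study proposes two methods that combine article content with the usage log (document categorization based and document clustering based). These are evaluated, together with traditional content-based methods, by comparing the entropy of each against the results of manual expert classification. The results show that, overall, the content-based methods agree more closely with the expert classification.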
Literature digital libraries are among the most important resources for preserving the assets of civilization. To make information search more effective and efficient, many systems provide a browsing interface that eases the task of finding articles. A browsing interface is associated with a subject directory, which guides users to articles that meet their information need. A subject directory contains a set (or a hierarchy) of subject categories, each containing a number of similar articles. How to group the articles in a literature digital library is the theme of this thesis.

Previous work used either document classification or document clustering to assign articles to a set of clusters based on their content. We observed that the articles that meet a single user’s information need do not necessarily fall in a single cluster. In this thesis, we propose to use both the Web log and article content in clustering articles. We propose two hybrid approaches, namely a document categorization based method and a document clustering based method. These alternatives were compared with purely content-based methods. We found that the document categorization based method effectively reduces the number of required click-throughs, at the expense of a slight increase in entropy, which measures the content heterogeneity of each generated cluster.
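
As an illustration of the entropy measure mentioned in the abstract, the following is a minimal sketch and not the thesis's own implementation. It assumes each generated cluster is represented as a list of manually assigned category labels, and it assumes a size-weighted average as the overall score; the function names `cluster_entropy` and `weighted_entropy` are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (base 2) of the category labels inside one cluster.

    0.0 means every article in the cluster has the same label;
    larger values mean the cluster mixes more categories.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in Counter(labels).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def weighted_entropy(clusters):
    """Average entropy over all clusters, weighted by cluster size (an assumption)."""
    n = sum(len(c) for c in clusters)
    if n == 0:
        return 0.0
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


if __name__ == "__main__":
    # Hypothetical labels: one pure cluster and one evenly mixed cluster.
    clusters = [["db", "db"], ["db", "ir"]]
    print(weighted_entropy(clusters))
```

Under these assumptions, a lower score means the generated clusters are closer to homogeneous with respect to the manual categories.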
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivations and Objectives
1.3 Data Description
1.4 Problem Description
1.5 Thesis organization
Chapter 2 Literature review
2.1 Converting an article to a set of vectors
2.2 Keyword Selection
2.2.1 CHI Square Statistics
2.2.2 Information Gains
2.3 Web Usage Clustering
2.3.1 Data preparation for Web usage log
2.3.2 Usage Clustering
Based on frequent itemsets
Based on Hyperclique Patterns
2.4 Content-based Clustering
2.5 Text Categorization
2.5.1 Probabilistic Classifiers
2.5.2 Neural Network Classifiers
2.5.3 Support Vector Machines
Chapter 3 Content-based and hybrid approach
3.1 Content-based clustering
3.1.1 Article Clique Hypergraph Partitioning
3.1.2 K-means
3.2 Hybrid approach
3.2.1 Document categorization based hybrid approach
3.2.2 Document clustering based hybrid approach
Chapter 4 Performance Evaluation
4.1 Performance Metrics
4.2 Experimental Results
4.2.1 Comparing usage coherence of various clustering
4.2.2 Comparing automatic clusters with manual clusters
Chapter 5 Conclusions
Reference
