

Author: Kang-Di Ting
Thesis title: Clustering Articles in a Literature Digital Library Based on Content and Usage
Advisor: San-Yih Hwang
Keywords: Digital library; Document categorization; Usage clustering; Document clustering; Content-based clustering

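In previous related research, document classification or clustering can be applied to the problem this study addresses, but document classification mostly requires the help of experts and a predefined set of subject categories or a directory. This study therefore attempts to use the usage log of the system's users in place of simulated expert classification, reducing the cost of expert manual labor while building a browsing interface that meets users' needs. The study proposes two methods that combine document content with usage logs (document categorization-based and document clustering-based). Finally, these methods and the traditional content-based method are each evaluated by comparing entropy against the results of manual expert classification. The results show that, overall, the content-based method agrees more closely with the expert classification.
Literature digital libraries are among the most important resources for preserving the assets of civilization. To make information search more effective and efficient, many systems provide a browsing interface that eases the task of finding articles. A browsing interface is associated with a subject directory, which guides users to articles that meet their information needs. A subject directory contains a set (or a hierarchy) of subject categories, each containing a number of similar articles. How to group articles in a literature digital library is the theme of this thesis.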

Previous work used either document classification or document clustering to assign articles to a set of article clusters based on their content. We observed that the articles that meet a single user's information need do not necessarily fall in a single cluster. In this thesis, we propose to use both the Web log and article content in clustering articles. We propose two hybrid approaches, namely a document categorization based method and a document clustering based method. These alternatives were compared with content-based methods. We found that the document categorization based method effectively reduces the number of required click-throughs, at the expense of a slight increase in entropy, which measures the content heterogeneity of each generated cluster.
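The entropy measure mentioned in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's exact formula, so the code below assumes a common formulation: the Shannon entropy of the expert-assigned category labels within each cluster, averaged over clusters and weighted by cluster size. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (base 2) of the category labels inside one cluster.

    Assumed formulation, not taken from the thesis: 0 means every article
    in the cluster has the same label; larger values mean more mixing.
    """
    total = len(labels)
    if total == 0:
        return 0.0  # an empty cluster contributes no heterogeneity
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def weighted_entropy(clusters):
    """Average of per-cluster entropies, weighted by cluster size."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


if __name__ == "__main__":
    # Hypothetical expert labels for the articles in two generated clusters.
    clusters = [["db", "db", "db", "ir"], ["ir", "ir"]]
    print(weighted_entropy(clusters))
```

Under this formulation, a lower weighted entropy means the generated clusters agree more closely with the manual categories.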
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivations and Objectives
1.3 Data Description
1.4 Problem Description
1.5 Thesis organization
Chapter 2 Literature review
2.1 Converting an article to a set of vectors
2.2 Keyword Selection
2.2.1 CHI Square Statistics
2.2.2 Information Gains
2.3 Web Usage Clustering
2.3.1 Data preparation for Web usage log
2.3.2 Usage Clustering
Based on frequent itemsets
Based on Hyperclique Patterns
2.4 Content-based Clustering
2.5 Text Categorization
2.5.1 Probabilistic Classifiers
2.5.2 Neural Network Classifiers
2.5.3 Support Vector Machines
Chapter 3 Content-based and hybrid approach
3.1 Content-based clustering
3.1.1 Article Clique Hypergraph Partitioning
3.1.2 K-means
3.2 Hybrid approach
3.2.1 Document categorization based hybrid approach
3.2.2 Document clustering based hybrid approach
Chapter 4 Performance Evaluation
4.1 Performance Metrics
4.2 Experimental Results
4.2.1 Comparing usage coherence of various clustering
4.2.2 Comparing automatic clusters with manual clusters
Chapter 5 Conclusions
Reference