National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

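Author: Hsiang-Chun Tsai (蔡香君)
Thesis title: Web-base Literature Clustering Search (網頁文獻叢集化搜尋)
Advisor: Jau-Min Wong (翁昭旼)
Degree: Master's
Institution: National Taiwan University
Department: Institute of Biomedical Engineering
Discipline: Engineering
Field: General Engineering
Thesis type: Academic thesis
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar)
Language: English
Number of pages: 62
Chinese keywords: 叢集化 (clustering); 關聯法則 (association rule); 資料探勘 (data mining)
English keywords: Document Clustering; Association Rule; Text Mining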
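With the arrival of the information age, the volume of digital literature is growing rapidly. How to quickly find highly relevant material in a large body of digital data and extract the knowledge it contains is an important problem that urgently needs solving. In this thesis we propose a clustering method, Literature Clustering Search (LCS). The method organizes a large amount of data into hierarchical clusters and helps users form a preliminary understanding of that data in a short time. Our method has four steps. First, Metadata Retrieval normalizes the data format. Second, Feature Selection keeps only the words or phrases that are representative of a document as features. Third, Association Rule Mining computes the relations among all features. Finally, we form hierarchical clusters based on these relations. Because association rules represent groups of co-occurring terms, those terms make it easy to understand what each cluster means. In addition, we built an online literature clustering search service to demonstrate our method and results.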
The past two decades have seen a dramatic increase in the amount of information stored in electronic format. Retrieving relevant information from large data sets has become an important issue. We propose a clustering method, which we call Literature Clustering Search (LCS), that generates hierarchical clusters and helps users get an overall picture of the concepts in a massive body of information in a short time. The method has four steps. First, metadata retrieval normalizes the data format. Second, feature selection extracts the words and phrases that represent each document. Third, association rule mining generates relations between features. Finally, documents that share the same association rules are grouped together. Since association rules represent sets of terms that co-occur frequently, the concept behind a cluster can be read directly from its association rules. In addition, we built an online clustering web service to demonstrate the methodology of Literature Clustering Search.
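The last two steps of the abstract (association rule mining over features, then grouping documents that share a rule) can be illustrated with a small sketch. This is not the thesis's implementation: the function names `mine_pair_rules` and `cluster_by_rules`, the toy documents, the restriction to two-term rules, and the threshold values are all assumptions made for illustration, and the result is a flat grouping rather than the hierarchical clusters the thesis describes. Support and confidence follow their standard definitions: the fraction of documents containing both terms, and that count divided by the number of documents containing the left-hand term.

```python
from collections import defaultdict
from itertools import combinations


def mine_pair_rules(docs, min_support=0.5, min_confidence=0.8):
    """Mine two-term rules (lhs -> rhs) from docs: {doc_id: set of terms}.

    Returns {(lhs, rhs): (support, confidence)}. Hypothetical helper,
    limited to term pairs to keep the sketch short.
    """
    n = len(docs)
    count = defaultdict(int)
    for terms in docs.values():
        for term in terms:
            count[frozenset([term])] += 1
        for pair in combinations(sorted(terms), 2):
            count[frozenset(pair)] += 1

    rules = {}
    for itemset, c in count.items():
        support = c / n
        if len(itemset) != 2 or support < min_support:
            continue
        a, b = sorted(itemset)
        for lhs, rhs in ((a, b), (b, a)):
            confidence = c / count[frozenset([lhs])]
            if confidence >= min_confidence:
                rules[(lhs, rhs)] = (support, confidence)
    return rules


def cluster_by_rules(docs, rules):
    """Group documents that contain both terms of a rule (flat grouping)."""
    clusters = defaultdict(set)
    for lhs, rhs in rules:
        for doc_id, terms in docs.items():
            if lhs in terms and rhs in terms:
                clusters[frozenset((lhs, rhs))].add(doc_id)
    return dict(clusters)


# Toy feature sets, assumed to be the output of feature selection.
docs = {
    "d1": {"gene", "expression", "cancer"},
    "d2": {"gene", "expression"},
    "d3": {"cancer", "therapy"},
    "d4": {"cancer", "therapy", "gene"},
}
rules = mine_pair_rules(docs, min_support=0.5, min_confidence=0.8)
clusters = cluster_by_rules(docs, rules)
for label, members in clusters.items():
    print(sorted(label), sorted(members))
```

Each cluster is labeled by the terms of the rule that produced it, which mirrors the abstract's point that co-occurring terms make a cluster's meaning easy to read.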
CHINESE ABSTRACT ii
ABSTRACT iii
ACKNOWLEDGEMENTS iv
TABLE OF CONTENTS v
List of Figures viii
List of Tables x

Chapter 1 INTRODUCTION 1
1.1 Motivation 1
1.2 Purpose 1
1.3 Our Approach 2
1.4 Outline 2

Chapter 2 RELATED WORKS 3
2.1 Feature Selection 4
2.2 Association Rule Mining 5
2.3 Clustering 7
2.3.1 Components of a Clustering Task 7
2.3.2 Well-known Clustering Algorithms 8
2.3.3 Previous Approaches to Document Clustering 9
2.4 A Brief Introduction of Clustering Search Engines 10

Chapter 3 MATERIALS 13
3.1 PubMed 13
3.2 Google™ Search Engine 14
3.3 Reuters-21578, Distribution 1.0 14

Chapter 4 METHODS 15
4.1 Metadata Retrieval 15
4.2 Feature Extraction 19
4.2.1 The Framework of Feature Extraction 19
4.2.2 Part-Of-Speech Tagging 20
4.2.3 Definition of Phrase Patterns 20
4.2.4 Feature Selection 21
4.3 Association Rules 23
4.3.1 Support 23
4.3.2 Confidence 23
4.4 Clustering by Association Rules 24

Chapter 5 CLUSTERING WEBSITE – DESIGN AND EVALUATION 28
5.1 Introduction 28
5.2 General Architecture 28
5.3 The Client 29
5.3.1 The Client Environment 29
5.3.2 A Look at the User Interface 29
5.4 The Server 32
5.4.1 The Server Environment 32
5.4.2 Design Objectives 33
5.4.3 The Clustering Web Server Framework 33
5.5 Evaluation of Clustering Website 36

Chapter 6 EXPERIMENTS AND DISCUSSION 37
6.1 PubMed 37
6.1.1 Experimental Design 37
6.1.2 Experimental Results 37
6.1.3 Improve Accuracy of Results 40
6.2 Google Search Engine 42
6.2.1 Experimental Design 42
6.2.2 Experimental Results 42
6.2.3 Limitations 47
6.3 Reuters-21578 48
6.3.1 Data Corpora 48
6.3.2 Evaluation Metrics 48
6.3.3 Effect of Feature Selection 51
6.3.4 Experimental Design 51
6.3.5 Experimental Results 52
6.4 Discussion 56

Chapter 7 CONCLUSIONS 57
7.1 Contributions 57
7.2 Limitations 57
7.3 Future Works 58

BIBLIOGRAPHY 59
[1]S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30, 107-117, 1998.
[2]Entrez PubMed, http://www.ncbi.nlm.nih.gov/entrez/
[3]U. M. Fayyad and E. Simoudis. Data mining and knowledge discovery. In Proceedings of 1st International Conf. Prac. App. KDD & Data Mining, 3-16, 1997.
[4]G. Forman. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3, 1289-1305, 2003.
[5]L. Ertoz, M. Steinbach, and V. Kumar. Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data. SIAM International Conference on Data Mining, San Francisco, CA, 2003.
[6]W. Pratt, et al. A Knowledge-Based Approach to Organizing Retrieved Documents. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, 80-85, 1999.
[7]Y. F. B. Wu and X. Chen. Extracting Features from Web Search Returned Hits for Hierarchical Classification. In Proceedings of the 2003 International Conference on Information and Knowledge Engineering (IKE'03), 103-108, 2003.
[8]S. Sakurai and A. Suyama. Rule Discovery from Textual Data based on Key Phrase Patterns. ACM Symposium on Applied Computing, 606-612, 2004.
[9]G. Dias, S. Guilloré and J. G. P. Lopes. Extracting Textual Associations from Part-Of-Speech Tagged Corpora. European Association for Machine Translation Workshop on Harvesting Existing Resources, Ljubljana, Slovenia, 2000.
[10]Y. S. Maarek, R. Fagin, I. Z. Ben-Shaul, and D. Pelleg. Ephemeral document clustering for web applications. Technical Report RJ 10186, IBM Research, 2000.
[11]R. Al-Kamha and D. W. Embley. Grouping Search-Engine Returned Citations for Person-Name Queries. In Proceedings of the 6th annual ACM international workshop on Web information and data management, 96-103, 2004.
[12]I-J. Chiang, T.Y. Lin, and J.Y.-J. Hsu. Generating Hypergraph of Term Associations for Automatic Document Concept Clustering. Artificial Intelligence and Soft Computing, Marbella, Spain, 2004.
[13]C. Zhang and S. Zhang. Association Rule Mining. Springer-Verlag, Berlin Heidelberg, 2002.
[14]R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 207-216, 1993.
[15]S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proceedings ACM SIGMOD International Conference on Management of Data, 265-276, 1997.
[16]O. R. Zaïane and M. L. Antonie. Classifying Text Documents by Associating Terms with Text. In Proceedings of the thirteenth Australasian conference on Database technologies, 5, 215-222, 2002.
[17]P. Berkhin. Survey of Clustering Data Mining Techniques. Technical report, Accrue Software, San Jose, California, 2002.
[18]A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31, 264-323, 1999.
[19]A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[20]C. Ordonez. Clustering Binary Data Streams with K-means. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 12-19, 2003.
[21]J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, MA, 1981.
[22]D. D. Lewis. http://www.research.att.com/~lewis
[23]H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma and J. Ma. Learning to Cluster Web Search Results. In Proceedings of the 27th annual international conference on Research and development in information retrieval, 210-217, 2004.
[24]K. Kummamuru, et al. A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results. In Proceedings of the 13th international conference on World Wide Web, 658-665, 2004.
[25]F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34, 1-47, 2002.
[26]P. Ferragina and A. Gullì. The Anatomy of a Hierarchical Clustering Engine for Web-page, News and Book Snippets. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), 395-398, 2004.
[27]O. Zamir and O. Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results. Computer Networks, 31, 1361-1374, 1999.
[28]W. Xu and Y. H. Gong. Document Clustering by Concept Factorization. In Proceedings of the 27th annual international conference on Research and development in information retrieval, 202-209, 2004.
[29]T. Li, S. Ma and M. Ogihara. Document Clustering via Adaptive Subspace Iteration. In Proceedings of the 27th annual international conference on Research and development in information retrieval, 218-225, 2004.
[30]S. Siersdorfer and S. Sizov. Restrictive Clustering and Metaclustering for Self-Organizing Document Collections. In Proceedings of the 27th annual international conference on Research and development in information retrieval, 226-233, 2004.
[31]Vivísimo, http://www.vivisimo.com
[32]KartOO, http://www.kartoo.com
[33]Mooter, http://www.mooter.com
[34]O. Mason. QTAG. http://www.english.bham.ac.uk/staff/omason/index.html
[35]Z. H. Deng et al. A Comparative Study on Feature Weight in Text Categorization. In Proceedings of the Sixth Asia Pacific Web Conference (APWEB 2004), Hangzhou, China, Lecture Notes in Computer Science (LNCS 3007), Springer-Verlag, 588-597, 2004.
[36]TouchGraph. http://touchgraph.sourceforge.net
[37]Y. T. Chang. Biology Knowledge Representation. M.S. Thesis, Institute of Biomedical Engineering, National Taiwan University. http://bioinfo.bme.ntu.edu.tw/ontomarker/
[38]B. C. M. Fung, K. Wang, and M. Ester. Hierarchical Document Clustering Using Frequent Itemsets. In Proceedings of the 2003 SIAM International Conference on Data Mining (SDM'03), 59-70, 2003.
[39]G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.