臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

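Graduate student: 紀均易 (Chun-Yi Chi)
Thesis title: 文件分類中自動訓練資料收集法 (Automatic Training Corpora Acquisition for Document Classification)
Advisor: 鄭卜壬 (Pu-Jen Cheng)
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Academic year of graduation: 96 (ROC calendar, i.e. 2007–2008)
Language: English
Number of pages: 70
Chinese keywords: 文件分類 (document classification); 訓練資料 (training data)
English keywords: Document Classification; Training Data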
Document classification has been a classic problem in several fields for many years. However, most previous work assumes that the corpora are explicitly labeled and well classified. In this work, we concentrate on the automatic acquisition of good-quality training data. We propose mining approaches that collect training data from a given unlabeled corpus or from the web. The proposed approaches are fully automatic; the only human effort required is to define the classes in advance.
In our work, the concept behind a class name is captured by comparing the class with the other classes, which yields the common concept among the classes. Moreover, we iteratively discover discriminative concepts within each class. By finding both common concepts and discriminative concepts, we can acquire training data of high quality. The evaluation gives empirical evidence that the classifiers thus created achieve promising accuracy. In short, the automatic acquisition of good-quality training data by the proposed methods is the primary contribution of this work.
Acknowledgements
Chinese Abstract
Abstract
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Motivation
1.2 Previous Work
1.3 Basic Idea
1.4 Challenges
1.5 Proposed Approach
1.6 Experiments
1.7 Contributions
1.8 Thesis Outline
Chapter 2: Related Work
2.1 Query Expansion
2.2 Using Web to Acquire Training Data
2.3 Unlabeled Training Data in Machine Learning
Chapter 3: The Problem
Chapter 4: Overview of Our Approach
4.1 Common-Concept: DMOZ Topic Hierarchy Method
4.2 Common-Concept: Co-Occurrence Method
4.3 Common-Concept: Context-Based Method
4.4 Discriminative-Concept: Difference-Based Method
4.5 Discriminative-Concept: Similarity-Based Method
Chapter 5: Experiments
5.1 Different Associated Terms to Class Names
5.2 Different Size of Search Results
5.2.1 Closed Set
5.2.2 Open Set
5.3 Different Size of Expansion Results
5.3.1 Closed Set
5.3.2 Open Set
5.4 Combination of Closed and Open Sets
Chapter 6: Discussion
Chapter 7: Application
Chapter 8: Conclusion and Future Work
References