Graduate Student (English): Chun-Yi Chi
Thesis Title (English): Automatic Training Corpora Acquisition for Document Classification
Advisor (English): Pu-Jen Cheng
Keywords (English): Document Classification; Training Data
Document classification has been a standard problem in several fields for many years. However, most previous work assumes that the corpora are explicitly labeled and well classified. In this work, we concentrate on the automatic acquisition of high-quality training data. We propose mining approaches that collect training data from a given unlabeled corpus or from the web. The approaches are fully automatic: the only human input required is the definition of the classes in advance.
In our work, the concept behind a class name is captured by comparing the class with the other classes, which gives the common concept among classes. Moreover, we discover discriminative concepts iteratively within each class. By finding both common concepts and discriminative concepts, we can acquire high-quality training data. The evaluation gives empirical evidence that classifiers trained on this data achieve promising accuracy. In short, the primary contribution of this work is the automatic acquisition of high-quality training data by the proposed methods.
Acknowledgements
Abstract
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Motivation
1.2 Previous Work
1.3 Basic Idea
1.4 Challenges
1.5 Proposed Approach
1.6 Experiments
1.7 Contributions
1.8 Thesis Outline
Chapter 2: Related Work
2.1 Query Expansion
2.2 Using Web to Acquire Training Data
2.3 Unlabeled Training Data in Machine Learning
Chapter 3: The Problem
Chapter 4: Overview of Our Approach
4.1 Common-Concept: DMOZ Topic Hierarchy Method
4.2 Common-Concept: Co-Occurrence Method
4.3 Common-Concept: Context-Based Method
4.4 Discriminative-Concept: Difference-Based Method
4.5 Discriminative-Concept: Similarity-Based Method
Chapter 5: Experiments
5.1 Different Associated Terms to Class Names
5.2 Different Size of Search Results
5.2.1 Closed Set
5.2.2 Open Set
5.3 Different Size of Expansion Results
5.3.1 Closed Set
5.3.2 Open Set
5.4 Combination of Closed and Open Sets
Chapter 6: Discussion
Chapter 7: Application
Chapter 8: Conclusion and Future Work
References
