National Digital Library of Theses and Dissertations in Taiwan

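Author: 江文成
Author (English): Wen-Cheng Chiang
Thesis title: 基於時間分析之文件自動分類系統
Thesis title (English): Automatic Document Classification Based on Temporal Analysis
Advisor: 李秀惠
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2009
Academic year of graduation: 97 (ROC calendar)
Language: English
Pages: 63
Keywords (translated from Chinese): temporal analysis; document classification; classifier
Keywords (English): temporal analysis; optimum training set; document classification; classifier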
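Abstract (translated from the Chinese original):

With the spread of the Internet, online documents have gradually accumulated into a huge virtual database. Finding the desired documents in this mass of online material correctly and efficiently has become an important problem. Document classification is a commonly used technique in information retrieval, and classification quality often affects retrieval results. A good document classification system therefore helps administrators organize documents more easily and helps users retrieve the documents they actually want more efficiently.

As trends change, online documents from different periods exhibit different writing styles. For example, in any field, the topics studied or discussed in a given period vary with their popularity, which affects how often certain popular terms appear in documents; an article may also be assigned to different classes because the prevailing understanding differs between periods. Traditional document classification systems do not take such temporal characteristics into account. Intuitively, if the document collection spans several periods, the classification results of such a system may be unsatisfactory.

In this thesis, we first design a set of temporal analysis experiments to verify that such analysis can indeed improve classification results. We then propose an automatic document classification method based on temporal analysis, which uses the temporal characteristics of documents to train classifiers for different periods. Analyzing temporal factors in advance reduces the training data the classifier requires and also yields better classification performance.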
The widespread use of the Internet has increased the amount of information that is stored and accessible on the web, so retrieving relevant information efficiently from this large volume is becoming increasingly important. Automatic Document Classification (ADC) is a common strategy that associates information with semantically meaningful classes and can thereby improve retrieval efficiency. However, traditional ADC does not consider temporal factors when constructing a classifier. New information may appear, and specific terms may disappear, over time. As a result, some documents may be classified differently at different times.

We first discuss several temporal issues and design experiments to evaluate the influence of temporal factors on classification. We then propose a temporal analysis strategy that explores the optimum training set for constructing a temporal classifier. With this temporal analysis process, we reduce the amount of data needed to train the classifier and improve classification performance.
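
The idea of searching for a smaller, time-restricted training set can be illustrated with a toy sketch. This is not the thesis's procedure, which this record does not describe in detail: the corpus, the test documents, the function names, and the tie-breaking rule are all hypothetical, and a simple nearest-centroid classifier with cosine similarity stands in for the SVM listed in the outline. The sketch trains on documents from a candidate start year onward, measures accuracy on later documents, and prefers the most recent start year among the best scores.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy corpus of (year, label, text); not data from the thesis.
# The term "cloud" drifts from the "weather" class to the "computing" class.
DOCS = [
    (2000, "weather", "cloud rain storm"),
    (2000, "computing", "disk cpu memory"),
    (2001, "weather", "cloud storm wind"),
    (2001, "computing", "cpu memory cache"),
    (2008, "weather", "rain wind forecast"),
    (2008, "computing", "cloud server storage"),
]
# Hypothetical later documents used to evaluate each training set.
TEST = [
    (2009, "computing", "cloud platform"),
    (2009, "weather", "rain forecast"),
]


def vectorize(text):
    """Bag-of-words term counts."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train_centroids(docs):
    """Sum the term counts of each class into one centroid per class."""
    centroids = defaultdict(Counter)
    for _, label, text in docs:
        centroids[label].update(vectorize(text))
    return centroids


def classify(centroids, text):
    """Assign the class whose centroid is most similar to the document."""
    vec = vectorize(text)
    return max(sorted(centroids), key=lambda label: cosine(centroids[label], vec))


def accuracy_by_start_year(docs, test):
    """Accuracy on `test` when training only on docs from each start year on."""
    scores = {}
    for start in sorted({year for year, _, _ in docs}):
        train = [d for d in docs if d[0] >= start]
        centroids = train_centroids(train)
        hits = sum(classify(centroids, text) == label for _, label, text in test)
        scores[start] = hits / len(test)
    return scores


def pick_start_year(scores):
    """Best accuracy wins; ties go to the latest start (smallest training set)."""
    return max(scores, key=lambda start: (scores[start], start))
```

On this toy data, calling `accuracy_by_start_year(DOCS, TEST)` shows that training on the whole corpus misclassifies the "cloud platform" document because the older "weather" usage of "cloud" dominates, while training from 2001 or 2008 onward classifies both test documents correctly; `pick_start_year` then selects 2008, the smallest of the best-scoring training sets.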
Chinese Abstract
Abstract
Chapter 1 Introduction
1.1 Motivation
1.2 Research Objectives
1.3 Organization of This Thesis
Chapter 2 Background
2.1 Automatic Document Classification
2.2 Support Vector Machine (SVM)
2.2.1 SVM Concepts
2.2.2 Non-Linear Classification
Chapter 3 System Architecture
3.1 System Overview
3.2 Data Extraction
3.3 Data Pre-Processing
3.4 Temporal Effects Analysis
3.5 Extraction of Optimum Training Set
Chapter 4 Characterization of Sampling and Temporal Effects
4.1 Characterizing the Sampling Effects
4.1.1 Sampling Effects of Year
4.1.2 Sampling Effect of the Whole Corpus
4.2 Characterizing and Quantifying the Temporal Effects
4.2.1 Selection of Training Data
4.2.2 Evaluation of Class Distribution
4.2.3 Evaluation of Class Similarity
Chapter 5 Experiments and Results
5.1 Pre-procedure for Exploring Optimum Training Set
5.2 Peak Accuracy Distribution
5.3 Exploring the Optimum Training Set
Chapter 6 Conclusion and Future Works
6.1 Conclusion
6.2 Discussion
6.3 Future Works
References
[Burges 98] Christopher J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery (DMKD), vol. 2, no. 2, pp. 121-167, 1998.

[COSIM] Cosine Similarity, URL=http://en.wikipedia.org/wiki/Cosine_similarity

[IDOM04] URL=http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words

[Joachims 97] T. Joachims, “A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization,” International Conference on Machine Learning (ICML), pp. 143-151, 1997.

[LIBSVM] URL=http://www.csie.ntu.edu.tw/~cjlin/libsvm/

[Lovins 68] Julie Beth Lovins, “Development of a stemming algorithm,” Mechanical Translation and Computational Linguistics, vol. 11, pp. 22-31, 1968.

[MRA 08] Fernando Mourão, Leonardo Rocha, Renata Araújo, Thierson Couto, Marcos Gonçalves and Wagner Meira Jr., “Understanding temporal aspects in document classification,” International Conference on Web Search and Web Data Mining (WSDM), 2008.

[Porter 80] M. Porter, “An algorithm for suffix stripping,” Program, vol. 14, pp. 130-137, 1980.

[SB 88] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing and Management, vol. 24, pp. 513-523, 1988.

[Sebastiani 02] F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.

[QZZ 08] Li-Qing Qiu, Ru-Yi Zhao, Gang Zhou and Sheng-Wei Yi, “An extensive empirical study of feature selection for text categorization,” IEEE/ACIS International Conference on Computer and Information Science (ICIS), pp. 312-315, 2008.