臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Author: 鄭佩琪
Author (English): Pei-Chi Cheng
Title: 應用在文件分類的領域空間權重機制
Title (English): Domain-space Weighting Scheme for Document Classification
Advisor: 曾憲雄
Advisor (English): Shian-Shyong Tseng
Degree: Master's
Institution: 國立交通大學 (National Chiao Tung University)
Department: 資訊科學系所 (Computer and Information Science)
Discipline: Engineering
Academic field: Electrical and Computer Engineering
Thesis type: Academic thesis
Publication year: 2004
Graduation academic year: 92 (ROC calendar)
Language: English
Number of pages: 41
Keywords (Chinese): 文件分類, 文件表示, 維度縮減, 文詞權重
Keywords (English): document classification, document representation, dimension reduction, term weighting
Usage statistics:
  • Cited by: 0
  • Views: 191
  • Rating: (none)
  • Downloads: 0
  • Saved to reading lists: 1
With the growth and spread of electronic documents, automatic document classification has become increasingly important for helping users discover and manage information. Many typical classification methods, such as C4.5, SVM, and naïve Bayesian, have been applied to build document classifiers. However, most of these are batch-based mining techniques and cannot handle the category adaptation problem, in which new categories are added as time goes on. In addition, regarding the document representation problem, most representations describe documents in a term-space, which may produce many unrepresentative dimensions and thereby reduce the efficiency and effectiveness of the classifier.

This thesis proposes a domain-space weighting scheme that represents documents in a domain-space and builds the document classifier incrementally, thereby addressing the category adaptation problem and the document representation problem described above. The scheme consists of three phases: the Training Phase, the Discrimination Phase, and the Tuning Phase. In the Training Phase, the scheme extracts, for each category, the features that are representative of that category, weights them by their importance to the category, and stores the results in the feature-domain association weighting table, which records the degree of association between each feature and all related domains. In the Discrimination Phase, the scheme lowers the weights of features with little discriminating power so as to reduce their influence on classification. At this point the classifier has been constructed from the feature-domain association weighting table. The Tuning Phase is optional and uses information from tuning documents to strengthen the classifier. In our experiments, we evaluate the constructed classifier on the standard Reuters-21578 benchmark based on the “ModApte” split version. The results show that with enough training documents the classifier is more effective, and that the Tuning Phase strengthens it further.
As digital documents continue to evolve and become more widely available, automatic document classification (also known as document categorization) has become increasingly important for managing information and discovering what is useful to users. Many typical classification approaches, such as C4.5, SVM, and naïve Bayesian, have been applied to build classifiers. However, most of them are batch-based mining approaches, which cannot resolve the category adaptation problem. As for the document representation problem, documents are usually represented in a term-space, which may produce many less representative dimensions and thereby reduce both efficiency and effectiveness.
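The term-space representation mentioned above can be illustrated with a short sketch. It is not taken from the thesis; the toy document collection and the plain TF-IDF weighting are illustrative assumptions. The point is simply that every distinct term becomes its own dimension, so the vector space grows with the vocabulary and many dimensions carry little discriminative information.

```python
# Minimal illustration (not from the thesis) of a term-space representation:
# every distinct term becomes one dimension of the document vector.
import math
from collections import Counter

# A hypothetical toy collection; real corpora such as Reuters-21578 have
# tens of thousands of distinct terms.
docs = [
    "oil prices rise as crude supply falls",
    "central bank raises interest rates again",
    "crude oil exports and supply contracts",
]

# Term frequency per document.
tf = [Counter(d.split()) for d in docs]

# Inverse document frequency over the whole collection.
vocab = sorted({t for counts in tf for t in counts})
idf = {t: math.log(len(docs) / sum(1 for counts in tf if t in counts))
       for t in vocab}

# Each document becomes a |vocab|-dimensional TF-IDF vector in term-space.
vectors = [[counts[t] * idf[t] for t in vocab] for counts in tf]

print(len(vocab), "term-space dimensions for only", len(docs), "documents")
```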

In this thesis, we propose a domain-space weighting scheme that represents documents in a domain-space and incrementally constructs a classifier, resolving both the document representation and the category adaptation problems. The proposed scheme consists of three major phases: the Training Phase, the Discrimination Phase, and the Tuning Phase. In the Training Phase, the scheme incrementally extracts and weights features from each individual category, and then integrates the results into the feature-domain association weighting table, which maintains the association weight between each feature and all involved categories. In the Discrimination Phase, it diminishes the weights of features with lower discriminating power. A classifier can then be constructed from the feature-domain association weighting table. Finally, the optional Tuning Phase strengthens the classifier using feedback information from tuning documents. Experiments on the standard Reuters-21578 benchmark, based on the “ModApte” split version, show that with enough training documents the classifier constructed by the proposed scheme is rather effective, and that it is further strengthened by the Tuning Phase.
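The following sketch illustrates the general idea of a feature-domain association weighting table and the phases outlined above. It is not the thesis's actual algorithm: the class name, the raw term-frequency feature weights, and the logarithmic discrimination factor are assumptions made for illustration, since the exact weighting, discrimination, and tuning formulas are defined in the thesis itself (the Tuning Phase is omitted here).

```python
# Simplified sketch (illustrative assumptions only) of a feature-domain
# association weighting table and an incremental, domain-space classifier.
import math
from collections import Counter, defaultdict


class DomainSpaceClassifier:
    def __init__(self):
        # table[feature][category] -> association weight between a feature
        # (term) and a category (domain).
        self.table = defaultdict(lambda: defaultdict(float))

    def train(self, documents, category):
        """Training Phase: incrementally extract features from the documents
        of one category and accumulate their weights in the table."""
        for doc in documents:
            for term, freq in Counter(doc.split()).items():
                self.table[term][category] += freq

    def discriminate(self):
        """Discrimination Phase: diminish the weights of features spread over
        many categories, i.e. features with low discriminating power."""
        categories = {c for weights in self.table.values() for c in weights}
        for weights in self.table.values():
            spread = len(weights)  # categories the feature occurs in
            factor = math.log(1 + len(categories) / spread)
            for category in weights:
                weights[category] *= factor

    def classify(self, doc):
        """Map a document into domain-space and label it with the category
        that accumulates the highest weight."""
        scores = defaultdict(float)
        for term in doc.split():
            for category, weight in self.table.get(term, {}).items():
                scores[category] += weight
        return max(scores, key=scores.get) if scores else None


clf = DomainSpaceClassifier()
clf.train(["crude oil supply falls", "oil exports rise"], "crude")
clf.train(["interest rates rise", "central bank rates"], "interest")
clf.discriminate()
print(clf.classify("oil supply contracts"))  # -> crude
```

Because train() only accumulates weights for the category it is given, a new category can be added later without rebuilding the table for existing ones, which mirrors the incremental construction emphasized in the abstract.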
摘要 I
Abstract III
誌謝 V
Table of Contents VI
List of Figures VII
List of Tables VIII
List of Algorithms IX
Chapter 1: Introduction 1
Chapter 2: Related Work 4
2.1 Document Representation 4
2.2 Classifier Construction 5
2.3 Classifier Evaluation 8
Chapter 3: Domain-space Weighting Scheme for Document Classification 11
Chapter 4: Classifier Construction Based on Domain-space Document Representation 14
4.1 Training Phase 17
4.2 Discrimination Phase 19
4.3 Tuning Phase 21
Chapter 5: Document Labeling by the Constructed Classifier 25
Chapter 6: Experiments 27
6.1 Experimental Setting 27
6.2 Experimental Results Analysis 27
Chapter 7: Conclusions and Future Work 36
Bibliography: 38