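Author: Jia-Ching Ying (英家慶)
Thesis title: Automatic Chinese Text Categorization Using N-gram Model (使用N-gram模型對中文文件自動分類)
Advisor: Show-Jane Yen (顏秀珍)
Degree: Master's
Institution: Ming Chuan University (銘傳大學)
Department: Department of Computer and Communication Engineering, Master's Program (資訊傳播工程學系碩士班)
Discipline: Communication
Subfield: General Mass Communication
Thesis type: Academic thesis
Publication year: 2007
Graduation academic year: 95 (ROC calendar)
Language: Chinese
Pages: 56
Keywords: text classification (文件分類); N-gram-based classification (基於N-gram的分類模型); feature selection (屬性篩選); word segmentation (中文斷詞); logistic regression (羅吉斯迴歸)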
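Chinese text classification is an important and well-known technique in machine learning. However, earlier work often leaves the relationships between characters undiscussed and ignores the effect of Chinese word segmentation. We believe that the effect of word segmentation on text classification is significant and cannot be ignored. Although some previous studies have addressed this problem, they introduced further problems: for example, they handle unknown characters poorly, and they disregard sentence structure by treating a whole document as a single sequence of characters. In addition, previous work has not proposed a feature selection method well suited to N-gram-based classification models. In this study, we propose an N-gram-based classification method that takes the relationships between characters into account. We also propose a new smoothing method that uses logistic regression to estimate N-gram probabilities, and we modify the chi-square feature selection method to suit N-gram-based classification. Together, these address the earlier inability to model relationships between characters without introducing the problems of traditional N-gram-based classification models described above, which would otherwise reduce classification accuracy. Experimental results show that the proposed method generally outperforms earlier N-gram-based classification models by about 11% in micro-average F-measure.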
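As a rough illustration of the chi-square feature selection mentioned in the abstract, the sketch below scores a character N-gram against a category using the standard chi-square statistic over a 2x2 contingency table of document counts. This is the textbook statistic, not the modified version proposed in the thesis, whose details are not given in this record. The function names and the toy documents are hypothetical.

```python
from typing import List, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square statistic for a term/category 2x2 table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column of the table is empty, so the term carries
        # no usable evidence about this category.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_for_term(docs: List[Tuple[str, str]], term: str, category: str) -> float:
    """Count the four table cells for `term` and `category`, then score them.

    `docs` is a list of (text, label) pairs; `term` is a character N-gram
    matched as a plain substring (an illustrative simplification).
    """
    a = b = c = d = 0
    for text, label in docs:
        has_term = term in text
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return chi_square(a, b, c, d)


# Hypothetical toy corpus, for illustration only.
toy_docs = [
    ("股市上漲", "finance"),
    ("股市下跌", "finance"),
    ("球隊獲勝", "sports"),
    ("球員受傷", "sports"),
]
```

A higher score means the N-gram's presence is more strongly associated with the category, so a selection step would typically keep the top-scoring N-grams per category.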
Chinese text classification is an important and well-known technique in machine learning. However, most applications avoid the problem of word segmentation and ignore the relationships between words, so it is important to build a classifier suited to Chinese text. In this paper, we propose an N-gram-based language model for Chinese text categorization that takes the relationships between words into account. To handle the out-of-vocabulary problem, we also propose a novel smoothing approach based on logistic regression to improve accuracy. The experimental results show that our approach outperforms earlier N-gram-based classification models by more than 11% on micro-average F-measure.
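To make the idea of using N-gram language models as text classifiers concrete, here is a minimal sketch that trains one character-level N-gram model per category and assigns a document to the category whose model gives it the highest log-probability. It needs no word segmentation because it works directly on characters. Note the substitution: add-one (Laplace) smoothing stands in for the logistic-regression-based smoothing proposed in the thesis, which this record does not describe in detail. The class name, the `^` padding symbol, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NgramClassifier:
    """One character N-gram language model per category (illustrative sketch)."""

    def __init__(self, n: int = 2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # label -> (context, char) counts
        self.context_counts = defaultdict(Counter)  # label -> context counts
        self.doc_counts = Counter()                 # label -> number of documents
        self.vocab = set()                          # characters seen in training

    def _grams(self, text):
        # Pad the start so the first character also has a full context.
        padded = "^" * (self.n - 1) + text
        for i in range(len(text)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def fit(self, docs):
        for text, label in docs:
            self.doc_counts[label] += 1
            for context, char in self._grams(text):
                self.ngram_counts[label][(context, char)] += 1
                self.context_counts[label][context] += 1
                self.vocab.add(char)
        return self

    def log_score(self, text, label):
        # log P(label) + sum of log P(char | context, label).
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        for context, char in self._grams(text):
            # Add-one smoothing (an assumption, NOT the thesis's method).
            num = self.ngram_counts[label][(context, char)] + 1
            den = self.context_counts[label][context] + vocab_size
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Hypothetical toy training data, for illustration only.
train = [
    ("股市上漲", "finance"),
    ("股價下跌", "finance"),
    ("球隊獲勝", "sports"),
    ("球員受傷", "sports"),
]
clf = NgramClassifier(n=2).fit(train)
```

Calling `clf.predict(text)` returns the label of the best-scoring model. Smoothing matters here because any N-gram unseen in a category's training data would otherwise receive zero probability and rule that category out entirely.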
Abstract in Chinese
Abstract in English
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Related Works
2.1 Vector Space Models
2.2 Probabilistic Models
2.2.1 Naive Bayes
2.2.2 N-gram-based classification
2.2.3 N-gram Smoothing Methods
2.3 Feature Selection
2.3.1 Mutual information
2.3.2 Information gain
2.3.3 Chi-square Statistic
Chapter 3 Logistic-regression-based N-gram Models
3.1 Significant N-grams
3.2 N-gram model smoothing estimator
3.3 Language models as text classifiers
Chapter 4 Empirical evaluation
4.1 Experimental paradigm
4.2 Measuring classification performance
4.3 Experimental result and analysis
Chapter 5 Conclusion