National Digital Library of Theses and Dissertations in Taiwan

Author: Wei-Ting Chen (陳威廷)
Title: Multiclass Support Vector Learning with Applications to Text Classification (多類支持向量學習應用在電子文件分類之研究)
Advisor: Jiann-Horng Lin (林建宏)
Degree: Master's
Institution: I-Shou University
Department: Department of Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year published: 2002
Graduation academic year: 90 (2001-2002)
Language: Chinese
Pages: 73
Keywords (Chinese): 支持向量機; 電子文件分類; 多類別分類
Keywords (English): Support Vector Machines; Text Classification; Multiclass Classification
Usage statistics: cited 2 times; 380 views; 0 downloads; bookmarked in 2 reading lists.
ABSTRACT (translated from the Chinese)
In this thesis, we propose a new method for text classification based on multiclass support vector learning. Support Vector Machines (SVMs) are linear learning systems in a high-dimensional feature space, whose learning algorithm derives from optimization theory and statistical learning theory. The main motivation for this thesis is the rapid growth of online information: text classification has become the key technique for handling and organizing electronic documents, for example for classifying and searching news articles on the World Wide Web. Text has the following properties: (1) a high-dimensional input space; (2) few irrelevant features; (3) document vectors are sparse; (4) most text classification problems are linearly separable. SVMs are therefore a very suitable classification tool. For multiclass support vector learning, we propose an improved Decision Directed Acyclic Graph classifier strategy and apply it to text classification. We develop fast and accurate software tools for text classification, offering a solution to the complex problem of searching and organizing large databases. We further extend the ideas behind our support vector learning system to study text classification in greater depth.
Keywords: Support Vector Machines, Text Classification, Multiclass Classification
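The abstract characterizes SVMs as linear learning systems trained with tools from optimization theory. As a minimal, self-contained illustration (not the solver used in the thesis; the thesis, like standard SVM software, works from the dual quadratic program), here is full-batch sub-gradient descent on the primal soft-margin objective. All names, data, and parameter values below are invented for illustration:

```python
def train_linear_svm(xs, ys, lam=0.01, eta=0.1, epochs=100):
    """Sketch: minimize lam/2 * ||w||^2 + mean(hinge loss) by full-batch
    sub-gradient descent. ys must contain labels in {-1, +1}."""
    dim, n = len(xs[0]), len(xs)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        gw = [lam * wi for wi in w]   # gradient of the regularizer
        gb = 0.0
        for x, y in zip(xs, ys):
            # hinge sub-gradient: active only when the margin is below 1
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
                for i, xi in enumerate(x):
                    gw[i] -= y * xi / n
                gb -= y / n
        w = [wi - eta * gi for wi, gi in zip(w, gw)]
        b -= eta * gb
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

On a toy linearly separable set (negatives near the origin, positives near (5, 5)) this recovers a separating hyperplane; real text data would instead use sparse high-dimensional document vectors, as the abstract notes.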

ABSTRACT
In this thesis, we propose a new method of text classification based on multiclass support vector learning. Support Vector Machines (SVMs) are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. With the rapid growth of online information, text classification has become one of the key techniques for handling and organizing text data; it is used to classify news stories and to find interesting information on the World Wide Web. To see which learning methods are promising for text classifiers, consider the properties of text: (1) a high-dimensional input space; (2) few irrelevant features; (3) sparse document vectors; (4) most text categorization problems are linearly separable. For multiclass support vector learning, we propose an improved Decision Directed Acyclic Graph classifier strategy with applications to text classification. We develop software tools for rapid and accurate text classification, providing an alternative approach to the highly complex problems of database search and organization. Benchmark datasets with different characteristics are used for a comparative study.
Keywords: Support Vector Machines, Text Classification, Multiclass classification
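A plain DDAG over k classes evaluates only k-1 of the k(k-1)/2 pairwise classifiers per test point: the list of candidate classes starts full, each node pits the first remaining class against the last, and the loser is eliminated. The sketch below shows plain DDAG inference only; the pairwise rules are stand-in nearest-center classifiers rather than trained SVMs, and the thesis's *improved* variant is not reproduced here:

```python
def ddag_predict(x, classes, pairwise):
    """Plain DDAG inference: pit first vs. last remaining class,
    let the binary classifier for that pair eliminate the loser."""
    remaining = sorted(classes)
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        winner = pairwise[(a, b)](x)      # binary classifier for pair (a, b)
        remaining.remove(b if winner == a else a)
    return remaining[0]

# Stand-in pairwise rules: pick the class with the nearer 1-D center.
centers = {0: 0.0, 1: 5.0, 2: 10.0}
pairwise = {
    (a, b): (lambda x, a=a, b=b:
             a if abs(x - centers[a]) <= abs(x - centers[b]) else b)
    for a in centers for b in centers if a < b
}
```

With three classes, any prediction visits exactly two nodes, e.g. (0 vs 2) then the survivor against class 1.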

Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
  1.1 Background and Motivation
  1.2 Organization of the Thesis
Chapter 2  Text Classification
  2.1 Introduction to Text Classification
  2.2 Related Work
  2.3 The Superior Accuracy of Support Vector Machines
    2.3.1 Comparison of SVMs with Other Methods
    2.3.2 Comparison of SVM Multiclass Decision Strategies
Chapter 3  Support Vector Machines
  3.1 Basic Concepts of SVMs
  3.2 SVM Algorithms
  3.3 Support Vector Classification, Clustering, Regression, and Fuzzy SVMs
  3.4 Applications of SVMs
Chapter 4  Multiclass Support Vector Learning
  4.1 One-against-Rest Classifier Strategy
  4.2 One-against-One Classifier Strategy
  4.3 Hierarchical and Tree-Structured SVM Classifier Strategies
  4.4 Decision Directed Acyclic Graph Classifier Strategy
  4.5 Comparison of the Multiclass Methods
Chapter 5  Multiclass Support Vector Learning for Text Classification
  5.1 Overall Architecture
  5.2 Overview of the Processing Pipeline
  5.3 Input Format and Sources
  5.4 Preprocessing
    5.4.1 Word Frequency Counting
    5.4.2 Stop-Word Removal
    5.4.3 Feature Selection
  5.5 Core Processing
    5.5.1 Core Processing Architecture
    5.5.2 The Multiclass Classification Problem
Chapter 6  Experimental Results
  6.1 Data Sources and Format
    6.1.1 Data Sources
    6.1.2 Data Format
  6.2 Experimental Procedure
  6.3 Results
Chapter 7  Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
References
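The preprocessing stages outlined for Chapter 5 (word frequency counting, stop-word removal, feature selection) can be sketched end to end. The stop list and the frequency-based selection below are simplified stand-ins invented for illustration; the thesis itself uses web-based frequency indexers and its own feature-selection scheme:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # toy stop list (assumption)

def preprocess(docs, n_features):
    # 5.4.1 word frequency counting / 5.4.2 stop-word removal
    token_docs = [[w for w in d.lower().split() if w not in STOP_WORDS]
                  for d in docs]
    # 5.4.3 feature selection: keep the most frequent terms (a simple stand-in)
    freq = Counter(w for toks in token_docs for w in toks)
    vocab = [w for w, _ in freq.most_common(n_features)]
    index = {w: i for i, w in enumerate(vocab)}
    # sparse term-frequency document vectors: {feature_index: count}
    vectors = [Counter(index[w] for w in toks if w in index)
               for toks in token_docs]
    return vocab, vectors
```

The sparse `{index: count}` representation reflects the text property cited in the abstract: document vectors are high-dimensional but mostly zero.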
List of Figures
Figure 1: Multiclass category options on a web page (1)
Figure 2: Multiclass category options on a web page (2)
Figure 3: Single perceptron
Figure 4: SVM binary classification
Figure 5: Overall architecture of Support Vector Machines
Figure 6: Separating hyperplane for separable data
Figure 7: An example of linearly non-separable data
Figure 8: Mapping to a higher-dimensional space
Figure 9: Support vector clustering
Figure 10: Support vector regression
Figure 11: Fuzzy support vector machines
Figure 12: One-against-rest classifier strategy
Figure 13: One-against-one classifier strategy I
Figure 14: One-against-one classifier strategy II
Figure 15: Hierarchies or trees of binary SVM classifiers
Figure 16: Decision Directed Acyclic Graph I
Figure 17: Decision Directed Acyclic Graph II
Figure 18: Basic scheme of text classification
Figure 19: SVM for text classification
Figure 20: Illustration of input vectors
Figure 21: The Northern Webs Spider View web page
Figure 22: The WORDTRACKER web page
Figure 23: The Web Frequency Indexer web page
Figure 24: Frequency counting with the Web Frequency Indexer
Figure 25: Picking stop words from bags of words
Figure 26: Example of the DDAG for multiclass classification
Figure 27: Example of the proposed multiclass support vector learning
Figure 28: Data format of the Reuters-21578 dataset
Figure 29: Word segmentation and article-number tagging
Figure 30: Data format before program input
Figure 31: Training and test data formats before classification
Figure 32: Parameter values required by the decision function
Figure 33: Output of each node
Figure 34: Accuracy at different dimensionalities
List of Tables
Table 1: Classification accuracy from Thorsten Joachims [26]
Table 2: Classification accuracy from Thorsten Joachims [29]
Table 3: Classification accuracy from Thorsten Joachims [29]
Table 4: Classification error from Jason and Ryan [42]
Table 5: Classification error from Jason and Ryan [42]
Table 6: Classification error from Friedhelm Schwenker [43]
Table 7: Classification error from John and Nello [40]
Table 8: Comparison of the multiclass methods
Table 9: Overview of the processing pipeline
Table 10: Number of training documents per class
Table 11: Accuracy comparison of the methods

References
[1] McCallum, Andrew and Kamal Nigam. “A comparison of event models for Naive Bayes text classification.” AAAI-98 Workshop on Learning for Text Categorization, 1998.
[2] James Allan and Vipin Kumar and Paul Thompson, “Text Mining.”, Material from IMA Talks, IMA HOT TOPICS Workshop, http://www.ima.umn.edu/reactive/spring/tm.html
[3] D. P. Bertsekas, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, Englewood Cliffs, NJ, 1989.
[4] B. Boser, I. Guyon and V. Vapnik, “A training algorithm for optimal margin classifiers.”, In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992.
[5] Michael Brown, ”Support vector machines.”, 1999. http://www.cse.ucsc.edu/research/compbio/genex/genexTR2html/node3.html
[6] A. Ben-Hur and D. Horn and H.T. Siegelmann and V. Vapnik, “Support vector clustering.”, Journal of Machine Learning Research, vol.2, pp.125-137, 2001.
[7] E. J. Bredensteiner and Kristin P. Bennett, “Multicategory classification by support vector machines.”, Computational Optimization and Applications, vol.12, pp.53-79, 1999.
[8] T. B. Trafalis and H. Ince, “Support vector machine for regression and applications to financial forecasting.”, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol.6, pp.248-353, 2000.
[9] Nello Cristianini and John Shawe-Taylor, Support Vector Machines, Cambridge University Press, UK, 2000.
[10] C. Cortes and V. Vapnik, “Support-vector networks.”, Machine Learning, 20: pp.273-297, 1995.
[11] Apté, C., Damerau, F. and Weiss, S., “Automated learning of decision rules for text categorization.”, ACM Transactions on Information Systems, vol.12, no.3, pp.233-251, 1994.
[12] C. Campbell, “Algorithmic approaches to training support vector machines: a survey.”, Proceedings of ESANN 2000, D-Facto Publications, Belgium, pp.27-36, 2000.
[13] Nello Cristianini, “Support vector and kernel machines.”, ICML 2001 tutorial. http://www.support-vector.net/icml-tutorial.pdf
[14] Melanie Dumas, “Emotional expression recognition using support vector machines.”, CSE 254: Seminar on Learning Algorithms, Spring 2001.
[15] Lewis, D.D. and Ringuette, M., “A comparison of two learning algorithms for text categorization.” In Third Annual Symposium on Document Analysis and Information Retrieval, pp.81-93, 1994.
[16] Wiener, E., Pedersen, J.O. and Weigend, A.S., “A neural network approach to topic spotting.”, In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR ‘95), 1995.
[17] J.H. Friedman. “Another approach to polychotomous classification.”, Technical report, Stanford University, Department of Statistics, 1996.
[18] Gudivada et al., “Information retrieval on the World Wide Web.”, IEEE Internet Computing, vol.1, no.5, pp.58-68, September 1997.
[19] Guodong Guo, Stan Z. Li and Kapluk Chan, “Face recognition by support vector machines.”, Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000.
[20] J. Hynek, K. Jezek and O. J. Rohlik, “Short document categorization - itemsets method.”, PKDD'2000 MLTIA Workshop, Lyon, September 12, 2000.
[21] C.-W. Hsu and C.-J. Lin., “A comparison of methods for multi-class support vector machines.” , IEEE Transactions on Neural Networks, vol.13, pp.415-425., 2002.
[22] Heckerman, D., Geiger, D. and Chickering, D.M., “Learning Bayesian networks: the combination of knowledge and statistical data.”, Machine Learning, vol.20, pp.131-163, 1995.
[23] Schutze, H., Hull, D. and Pedersen, J.O., “A comparison of classifiers and document representations for the routing problem.”, In SIGIR ’95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.229-237, 1995.
[24] ISIS Research Group, ”Support vector machines.”, 2001. http://www.isis.ecs.soton.ac.uk/resources/svminfo/
[25] J.-S. R. Jang, C.-T. Sun and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice-Hall, 1997.
[26] Thorsten Joachims, “Text categorization with support vector machines: learning with many relevant features.”, Proceedings of ECML-98, 10th European Conference on Machine Learning, 1998.
[27] Thorsten Joachims, "A statistical learning model of text classification with support vector machines.", Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, New Orleans, US, ACM Press, New York, pp.128-136, 2001.
[28] Thorsten Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization.", International Conference on Machine Learning, pp.143-151, 1997.
[29] Thorsten Joachims, "Transductive inference for text classification using support vector machines.", Proc. 16th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA, pp.200-209, 1999.
[30] Jin-Tsong Jeng and Tsu-Tian Lee, “Support vector machines for the fuzzy neural networks.”, IEEE SMC '99 Conference Proceedings, IEEE International Conference on Systems, Man, and Cybernetics, vol.6, pp.115-120, 1999.
[31] Vojislav Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, The MIT Press, Cambridge, MA, 2001.
[32] Fuka, K. and Hanka, R., “Feature set reduction for document classification problems.”, IJCAI-01 Workshop: Text Learning: Beyond Supervision, Seattle, August 2001.
[33] U. Kreßel, “Pairwise classification and support vector machines.”, In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods, chapter 15, pp.255-268, The MIT Press, 1999.
[34] Raymond Kosala and Hendrik Blockeel, “Web mining research: a survey.”, SIGKDD Explorations, ACM SIGKDD, vol.2, issue 1, pp.1-15, July 2000.
[35] Chun-Fu Lin and Sheng-De Wang, “Fuzzy support vector machines.”, IEEE Transactions on Neural Networks, vol.13, no.2, pp.464-471, March 2002.
[36] E. Mayoraz and E. Alpaydin, "Support vector machines for multi-class classification.", IWANN'99, Alicante, Spain, June 1999.
[37] Larry M. Manevitz and Malik Yousef, “One-class SVMs for document classification.”, Journal of Machine Learning Research, vol.2, pp.139-154, 2001.
[38] NRC's Interactive Media Research Lab and Industry Canada's Schoolnet project, “Text Classification.”, 1998. http://www.iit.nrc.ca/II_public/Classification
[39] Fuhr, N., Hartmann, S., Lustig, G., Schwantner, M. and Tzeras, K., “AIR/X - a rule-based multi-stage indexing system for large subject fields.”, In Proceedings of RIAO’91, pp.606-623, 1991.
[40] John C. Platt, et al., “Large margin DAGs for multiclass classification.”, in Advances in Neural Information Processing Systems 12, S.A. Solla, T.K. Leen and K.-R. Muller (eds.), pp.547-553, MIT Press, 2000.
[41] S.M. Ruger, S.E. Gauch, “Feature reduction for document clustering and classification.”, DTR 2000/8. Department of Computing, Imperial College London, pp.1-9, September 2000.
[42] J. Rennie, "Improving multi-class text classification with the support vector machines", Master's thesis, Massachusetts Institute of Technology, 2001.
[43] F. Schwenker, “Hierarchical support vector machines for multi-class pattern recognition.”, IEEE 4th International Conference on Knowledge-Based Intelligent Engineering Systems & Allied Technologies, Brighton, UK, Aug. 30 - Sept. 1, 2000, pp.561-565.
[44] Sam Scott and Stan Matwin, “Feature engineering for text classification.”, Proc. 16th International Conf. on Machine Learning, pp.1-13, 1999.
[45] V. Sindhwani, P. Bhattacharyya and S. Rakshit, “Information theoretic feature crediting in multiclass support vector machines.”, in SIAM International Conference on Data Mining, Chicago, USA, April 2001.
[46] Dumais, S. T., Platt, J., Heckerman, D. and Sahami, M., “Inductive learning algorithms and representations for text categorization.”, Proceedings of the Seventh International Conference on Information and Knowledge Management, ACM Press, New York, pp.148-155, 1998.
[47] Y. Tan, Y. Xia and J. Wang, “Neural network realization of support vector methods for pattern classification.”, IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), Como, Italy, July 24-27, 2000, pp.411-416.
[48] Kristina Toutanova, Francine Chen, Kris Popat, and Thomas Hofmann, “Text classification in a hierarchical mixture model for small training sets.”, Proceedings of the Tenth International ACM Conference on Information and Knowledge Management, CIKM 2001.
[49] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag Inc., New York, 1995.
[50] Shivakumar Vaithyanathan, Jian-Chang Mao and Byron Dom, “Hierarchical Bayes for text classification.”, PRICAI Workshop on Text and Web Mining, pp.36-43, 2000.
[51] J. Weston and C. Watkins, "Multi-Class Support Vector Machines.". Royal Holloway Technical report CSD-TR-98-04, 1998.
[52] Cohen, W.W. and Singer, Y. “Context-sensitive learning methods for text categorization.”, In SIGIR ’96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.307-315, 1996.
[53] Yiming Yang and Jan O. Pedersen, "A comparative study on feature selection in text categorization", International Conference on Machine Learning,.pp.412-420, 1997.
[54] Yang, Y. “Expert network: Effective and efficient learning from human decisions in text categorization.”, Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 13-22, 1994.
[55] Yang, Y. and Chute, C.G., “An example-based mapping method for text categorization and retrieval.”, ACM Transactions on Information Systems, vol.12, no.3, pp.252-277, 1994.
[56] Yuan Yao and Paolo Frasconi, “Fingerprint classification with combinations of support vector machines.”, AVBPA 2001, pp.253-258.
[57] 蘇木村 and 張孝德, 機器學習:類神經網路、模糊系統以及基因演算法 [Machine Learning: Neural Networks, Fuzzy Systems, and Genetic Algorithms], Chuan Hwa Science & Technology Book Co., 2nd ed., March 2000.
