National Digital Library of Theses and Dissertations in Taiwan
Detailed Record
Author: 林俊良
Author (English): Jun-liang Lin
Title: 應用Google N-gram 與 PTFICF的自動文件分類系統
Title (English): A New Auto Document Category System by Using Google N-gram and Probability based Term Frequency and Inverse Category Frequency
Advisor: 黃文楨
Advisor (English): Wen-Chen Huang
Degree: Master's
Institution: National Kaohsiung First University of Science and Technology (國立高雄第一科技大學)
Department: Graduate Institute of Information Management (資訊管理研究所)
Discipline: Computing
Field: General Computing
Thesis Type: Academic thesis
Year of Publication: 2012
Graduation Academic Year: 100 (2011-2012)
Language: Chinese
Number of Pages: 77
Keywords (Chinese): Google N-gram, PTFICF, SVM, 文件分類 (document classification)
Keywords (English): PTFICF, Google N-gram, Text Classification, SVM
Statistics:
  • Cited by: 1
  • Views: 508
  • Rating: (none)
  • Downloads: 0
  • Bookmarked: 0
With the widespread use of information technology, electronic documents within and between enterprises and institutions continue to accumulate rapidly. How to apply automated techniques to assist manual classification quickly and effectively, so as to cope with large-scale classification demands, has become an important issue in information services and knowledge management. Keywords are the smallest units that carry a document's topical meaning; consequently, most automated processing of unstructured documents, such as knowledge mining, automatic filtering, automatic summarization, event tracking, and concept retrieval, must first extract keywords from the documents before further analysis can proceed.
This study proposes the N-gram Segmentation Algorithm (NSA) to remedy the problem of traditional static extraction of fixed-length terms. NSA combines stopword removal, stemming, and N-gram choosing, and draws on the Google N-gram corpus so that the N-gram keywords extracted from a document are representative. In addition to the keyword extraction method, this study also proposes a new term weighting method: using the occurrence counts provided by the Google N-gram corpus, document term frequencies are weighted in a tiered fashion, raising the weight of terms that are more meaningful within a particular category.
We use NSA to extract keywords from documents, weight the extracted keywords with Probability based Term Frequency and Inverse Category Frequency (PTFICF), and finally classify the documents with an SVM classifier. Three experiments were set up. Experiment 1 used Classic4 as a balanced dataset, and the F1 value reached 96.4%. Experiment 2 used Reuters-21578 as an imbalanced dataset, and the F1 value reached 78.7%. Experiment 3 applied the Google frequency weighting method, and the results confirmed that the larger the Google frequency, the higher the classification accuracy. The proposed methods are not only more accurate than traditional methods; the classifier training time is also reduced to one tenth of the original.
Electronic documents within and between companies and organizations are growing rapidly, and automatic classification has become an important issue in information services and knowledge management. Keywords are the smallest units that represent a document. Therefore, almost every document automation task, such as knowledge mining, automatic filtering, automatic summarization, event tracking, or concept retrieval, has to retrieve keywords from documents first and then proceed with analytical processing.
We propose the N-gram Segmentation Algorithm (NSA) in this study to address the problem of static keyword retrieval. NSA combines stopword removal, stemming, and N-gram choosing with the Google N-gram corpus, allowing meaningful N-gram keywords to be fetched from documents. In addition to the keyword extraction method, this research also proposes a new keyword weighting method that uses the Google N-gram frequency as a weight on term frequency, enhancing the weighting of keywords that matter within a particular category.
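To make the extraction step concrete, the following is a minimal Python sketch of NSA-style preprocessing, assuming NLTK's Porter stemmer. The STOPWORDS set, the GOOGLE_NGRAM_COUNTS table, and the nsa_keywords helper are hypothetical stand-ins for the thesis's actual resources (the SMART stopword list and the Google 5-gram corpus), and the selection rule of Section 3.4.3 may differ in detail.

```python
import re

from nltk.stem import PorterStemmer  # Porter stemming, as in Section 3.4.2

# Stand-in stopword set; the thesis uses the SMART stopword list [25].
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "with"}

# Hypothetical lookup table mapping an n-gram to its count in the
# Google N-gram corpus [10]. Keys here are pre-stemmed so they match the
# processed tokens below; a real implementation would consult the corpus
# files with surface forms instead.
GOOGLE_NGRAM_COUNTS = {
    "support vector machin": 1_200_000,
    "text classif": 850_000,
}

stemmer = PorterStemmer()

def nsa_keywords(text, max_n=3, min_google_count=1):
    """Sketch of NSA-style extraction: drop stopwords, stem the rest,
    then keep only n-grams attested in the Google N-gram corpus."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [stemmer.stem(t) for t in tokens if t not in STOPWORDS]
    keywords = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if GOOGLE_NGRAM_COUNTS.get(ngram, 0) >= min_google_count:
                keywords.append(ngram)
    return keywords

print(nsa_keywords("Support vector machines for text classification"))
# -> ['text classif', 'support vector machin']  (stemmed forms)
```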
Probability based Term Frequency and Inverse Category Frequency (PTFICF) is used to weight the keywords in documents, and an SVM is then used to classify the test documents. This study sets up three experiments. Experiment 1 used Classic4 as a balanced dataset, and the F1 value reached 96.4%. Experiment 2 used Reuters-21578 as an imbalanced dataset, and the F1 value reached 78.7%. Experiment 3 used the Google frequency as a weighting method, and the results demonstrated that the higher the Google frequency, the more accurate the classification. Overall, the proposed methods are more accurate than traditional methods and reduce training time by 90%.
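The abstract does not reproduce the PTFICF formula, so the sketch below assumes the common TF-ICF pattern with a probability-normalized term frequency, plus a simple threshold boost standing in for the tiered Google frequency weighting (G+TF) of Section 3.5. The functions ptficf and g_plus_tf and all parameter values are illustrative assumptions, not the thesis's exact definitions.

```python
import math
from collections import Counter

def ptficf(doc_tokens, term, category_vocabularies):
    """Illustrative PTFICF weight. Assumed form: probability-based term
    frequency P(term | document) multiplied by the inverse category
    frequency log(|C| / cf(term)), where cf(term) is the number of
    categories whose training documents contain the term."""
    p_tf = Counter(doc_tokens)[term] / max(len(doc_tokens), 1)
    cf = sum(1 for vocab in category_vocabularies if term in vocab)
    if cf == 0:
        return 0.0
    icf = math.log(len(category_vocabularies) / cf)
    return p_tf * icf

def g_plus_tf(weight, google_count, threshold=10**6, boost=1.5):
    """Illustrative positive Google-frequency weighting (G+TF): terms
    whose Google N-gram count clears a threshold receive a boost;
    the threshold and boost values are hypothetical."""
    return weight * boost if google_count >= threshold else weight
```

The weighted keyword vectors would then be passed to an SVM classifier, for example scikit-learn's LinearSVC, to reproduce the classification step.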
Chapter 1: Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Research Contributions and Significance
1.4 Thesis Organization
Chapter 2: Literature Review
2.1 Keyword Extraction Methods
2.2 Feature Term Weighting Methods
Chapter 3: Research Method
3.1 Data Sample Selection
3.2 Google 5-gram Corpus
3.3 Classifier Selection
3.4 N-gram Segmentation Algorithm
3.4.1 Stopword Removal
3.4.2 Stemming
3.4.3 N-gram Choosing
3.5 Google Frequency Weighting Method
3.5.1 Google Frequency
3.5.2 Google based Term Frequency
Chapter 4: Experiments
4.1 Experimental Limitations
4.2 Experimental Datasets
4.3 Evaluation Methods
4.4 Experiment 1: Balanced Dataset (Classic4)
4.4.1 Experimental Setup
4.4.2 Experimental Analysis
4.5 Experiment 2: Imbalanced Dataset (Reuters-21578)
4.5.1 Experimental Setup
4.5.2 Experimental Analysis
4.6 Experiment 3: Google Frequency Weighting Effects
4.6.1 Google Frequency
4.6.2 Positive Weighting (G+TF)
4.6.3 Negative Weighting (G-TF)
4.7 Discussion of Experimental Results
4.7.1 Discussion of Experiment 1
4.7.2 Discussion of Experiment 2
4.7.3 Discussion of Experiment 3
Chapter 5: Conclusions and Future Research
Chapter 6: References
6.1 Chinese References
6.2 English References
6.1 Chinese References
[1] 林延璉, "使用句構分析模型與向量支持機的自動文件分類架構", Master's thesis, Graduate Institute of Information Management, National Kaohsiung First University of Science and Technology, 2011.
[2] 曾元顯, "文件主題自動分類成效因素探討", 中國圖書館學會會報, no. 68, 2002, pp. 62-83.
[3] 蔡純純, "中文新聞文件空間資訊擷取之研究─以火災、搶劫、車禍事件為例", Master's thesis, Department of Geography, National Taiwan University, 2002.
[4] 鄭為倫 and 王台平, "運用特徵詞權重改善文件自動分類之成效-以貝氏分類器為例", 第一屆資訊管理學術暨專案管理實務研討會, paper IMPM-E18, 2005.
6.2 English References
[5] Al-Kofahi, K., Tyrrell, A., Travers, A. V. T., and Jackson, P., "Combining Multiple Classifiers for Text Categorization", Proceedings of the Tenth International Conference on Information and Knowledge Management, 2001, pp. 97-104.
[6] Azzopardi, L., Porter Stemming with C#.Net, retrieved June 18, 2012, from the University of Paisley, Scotland, web site: http://tartarus.org/~martin/PorterStemmer/
[7] Bekkerman, R., El-Yaniv, R., Winter, Y., and Tishby, N., "On Feature Distributional Clustering for Text Categorization", Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, pp. 146-153.
[8] Bergsma, S., Pitler, E., and Lin, D., "Creating Robust Supervised Classifiers via Web-Scale N-gram Data", Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), 2010, pp. 865-874.
[9] Bergsma, S., Lin, D., and Goebel, R., "Web-Scale N-gram Models for Lexical Disambiguation", Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI '09), 2009, pp. 1507-1512.
[10] Brants, T., and Franz, A., Google N-gram Corpus, retrieved June 18, 2012, from: http://books.google.com/ngrams/datasets
[11] Chen, C. L., Tseng, S. C., and Liang, T., "Mining fuzzy frequent itemsets for hierarchical document clustering", Information Processing and Management, 2010, vol. 46, pp. 193-211.
[12] Church, K. W., and Hanks, P., "Word association norms, mutual information and lexicography", Computational Linguistics, 1990, 16(1), pp. 22-29.
[13] Debole, F., and Sebastiani, F., "Supervised term weighting for automated text categorization", Proceedings of the 2003 ACM Symposium on Applied Computing, 2003, pp. 784-788.
[14] Elisseeff, A., and Weston, J., "A kernel method for multi-labelled classification", Advances in Neural Information Processing Systems, 2002, pp. 681-687.
[15] Ferreira, A., and Figueiredo, M., "An unsupervised approach to feature discretization and selection", Pattern Recognition, 2012, 45(9), pp. 3048-3060.
[16] Ferreira, A., and Figueiredo, M., "Efficient Unsupervised Feature Selection for Sparse Data", International Conference on Computer as a Tool (EUROCON), IEEE, 2011.
[17] Fung, B. C. M., Wang, K., and Ester, M., "Hierarchical document clustering using frequent itemsets", Proceedings of the 3rd SIAM International Conference on Data Mining, 2003, pp. 59-70.
[18] Guan, H., Zhou, J., and Guo, M., "A class-feature-centroid classifier for text categorization", Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 201-210.
[19] Hughes, T., and Ramage, D., "Lexical semantic relatedness with random graph walks", Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.
[20] Jiang, M., Jensen, E., Beitzel, S., and Argamon, S., "Choosing the right bigrams for information retrieval", Proceedings of the 2004 Meeting of the International Federation of Classification Societies, Chicago, IL, 2004.
[21] Joachims, T., "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", Proceedings of the European Conference on Machine Learning, 1998, pp. 137-142.
[22] Leopold, E., and Kindermann, J., "Text categorization with support vector machines: how to represent texts in input space", Machine Learning, 2002, vol. 46, pp. 423-444.
[23] Liu, Y., Loh, H. T., and Sun, A., "Imbalanced text classification: A term weighting approach", Expert Systems with Applications, 2009, vol. 36, pp. 690-701.
[24] Peat, H. J., and Willett, P., "The limitations of term co-occurrence data for query expansion in document retrieval systems", Journal of the American Society for Information Science, 1991, 42(5), pp. 379-380.
[25] Salton, G., and Buckley, C., Stop Word List 2, retrieved June 18, 2012, from the Experimental SMART Information Retrieval System, Cornell University: http://www.lextek.com/manuals/onix/stopwords2.html
[26] Saracoğlu, R., Tutuncu, K., and Allahverdi, N., "A new approach on search for similar documents with multiple categories using fuzzy clustering", Expert Systems with Applications, 34(4), 2008, pp. 2545-2554.
[27] Steinbach, M., Karypis, G., and Kumar, V., "A comparison of document clustering techniques", Workshop on Text Mining, 2000, pp. 109-111.
[28] Tandon, N., and de Melo, G., "Information Extraction from Web-Scale N-Gram Data", SIGIR 2010 Web N-gram Workshop, 2010.
[29] Wang, K., Thrasher, C., Viegas, E., Li, X., and Hsu, P., "An Overview of Microsoft Web N-gram Corpus and Applications", Proceedings of the NAACL HLT 2010 Demonstration Session, 2010, pp. 45-48.
[30] Yang, Y., and Pedersen, J. O., "A Comparative Study on Feature Selection in Text Categorization", Proceedings of the International Conference on Machine Learning, 1997, pp. 412-420.
[31] Yu, L. C., Wu, C. H., Philpot, A., and Hovy, E., "OntoNotes: Sense Pool Verification Using Google N-gram and Statistical Tests", Linguistic Data Consortium (LDC), 2007.
[32] Zhang, M. L., "ML-RBF: RBF neural networks for multi-label learning", Neural Processing Letters, 2009, 29(2), pp. 61-74.