Graduate student (English name): Jun-liang Lin
Thesis title: An Automatic Document Classification System Using Google N-gram and PTFICF
Thesis title (English): A New Auto Document Category System by Using Google N-gram and Probability based Term Frequency and Inverse Category Frequency
Advisor (English name): Wen-Chen Huang
Chinese keywords: Google N-gram; PTFICF; SVM; document classification
English keywords: PTFICF; Google N-gram; Text Classification; SVM
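This study proposes the N-gram Segmentation Algorithm (NSA) to address the limitation of traditional methods that statically extract terms of a fixed length. NSA combines stopword removal, stemming, and N-gram selection, and uses the Google N-gram corpus so that the N-gram keywords extracted from a document are representative. Besides the keyword extraction method, this study also proposes a new term weighting method: based on the occurrence counts provided by the Google N-gram corpus, document term frequencies are weighted in graded levels, which raises the weight of the more meaningful terms within a particular group.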
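We use NSA to extract the keywords from documents, weight the extracted keywords with Probability based Term Frequency and Inverse Category Frequency (PTFICF), and finally classify the documents with an SVM classifier. Three experiments were conducted: Experiment 1 used Classic4 as a balanced dataset and reached an F1 value of 96.4%; Experiment 2 used Reuters-21578 as an imbalanced dataset and reached an F1 value of 78.7%; Experiment 3 used the Google frequency weighting method, and its results confirm that the higher the Google frequency, the higher the classification accuracy. The proposed method not only classifies more accurately than traditional methods, but also needs only one tenth of the original classifier training time.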
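The abstract does not give the PTFICF formula, so the following is only a minimal sketch of a generic term frequency and inverse category frequency (TF-ICF) weight, not the probability-based variant proposed in the thesis. The function name `tf_icf`, the input layout, and the use of a natural logarithm are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_icf(docs_by_category):
    """Generic TF-ICF sketch (an assumption, not the thesis's PTFICF).

    docs_by_category maps a category name to a list of tokenized documents.
    Returns {category: {term: weight}}.
    """
    # Pool the term counts of all documents in each category.
    term_counts = {
        category: Counter(term for doc in docs for term in doc)
        for category, docs in docs_by_category.items()
    }
    n_categories = len(term_counts)
    # Number of categories in which each term occurs at least once.
    category_freq = Counter(
        term for counts in term_counts.values() for term in counts
    )
    weights = {}
    for category, counts in term_counts.items():
        total = sum(counts.values())
        weights[category] = {
            # Relative term frequency times log inverse category frequency.
            term: (n / total) * math.log(n_categories / category_freq[term])
            for term, n in counts.items()
        }
    return weights

docs = {
    "med": [["heart", "patient"], ["patient", "study"]],
    "aero": [["wing", "study"]],
}
w = tf_icf(docs)
```

In this toy input, a term that occurs in every category (`study`) receives weight zero, while a term confined to one category keeps a positive weight. The resulting vectors could then be passed to any SVM implementation.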
The volume of electronic documents exchanged between companies and organizations is growing fast. Automatic classification is an important issue in information services and knowledge management. Keywords are the smallest units that represent a document. Therefore, almost every kind of automated document processing, such as knowledge mining, automatic filtering, automatic summarization, event tracking, or concept retrieval, must first extract keywords from documents before any analysis can proceed.
We propose the N-gram Segmentation Algorithm (NSA) to address the problem of statically extracting keywords of a fixed length. NSA combines stopword removal, stemming, and N-gram selection with the Google N-gram corpus to extract meaningful N-gram keywords. In addition to the keyword extraction method, this research proposes a new keyword weighting method that uses the Google N-gram frequency as a weight on term frequency. This method raises the weight of the more meaningful keywords within a particular group.
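The three NSA stages named above (stopword removal, stemming, N-gram selection against a reference corpus) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's algorithm: the stopword list, the toy stemmer, the `CORPUS_COUNTS` table standing in for the Google N-gram corpus, and the longest-match selection rule are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-ins for a real stopword list and the Google N-gram corpus.
STOPWORDS = {"the", "of", "a", "an", "is", "in", "and", "to", "for"}
CORPUS_COUNTS = {
    ("support", "vector", "machine"): 120000,
    ("vector", "machine"): 300000,
    ("text", "classification"): 250000,
}

def toy_stem(word):
    """Crude suffix stripping; a real system would use e.g. a Porter stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(text, max_n=5, min_count=1):
    """Return a Counter of keywords, preferring the longest attested n-gram."""
    words = re.findall(r"[a-z]+", text.lower())
    tokens = [toy_stem(w) for w in words if w not in STOPWORDS]
    keywords = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first; a single token always matches.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            gram = tuple(tokens[i:i + n])
            if n == 1 or CORPUS_COUNTS.get(gram, 0) >= min_count:
                keywords[" ".join(gram)] += 1
                i += n
                break
    return keywords

result = extract_keywords(
    "The support vector machines is used for text classification"
)
```

Because the keyword length is decided by what the corpus attests rather than by a fixed N, the sketch yields multi-word keywords such as "support vector machine" alongside single tokens.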
The Probability based Term Frequency and Inverse Category Frequency (PTFICF) is used to weight the keywords in documents. Finally, we use an SVM to classify the test documents. This study comprises three experiments. Experiment 1 used Classic4 as a balanced dataset and reached an F1 value of 96.4%. Experiment 2 used Reuters-21578 as an imbalanced dataset and reached an F1 value of 78.7%. Experiment 3 used the Google frequency as a weighting method, and the results show that the higher the Google frequency, the more accurate the classification. Overall, the proposed methods are more accurate than traditional methods, and they also reduce the training time by 90%.
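For reference, the F1 value is the harmonic mean of precision and recall. The abstract does not state whether the reported figures are macro- or micro-averaged, so the macro average below is an assumption; the function names are illustrative.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from parallel lists of labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (an assumed averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
per_class = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
```

On an imbalanced dataset the macro average gives every class equal weight regardless of its size, which is one reason the averaging choice matters when comparing reported scores.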
Chapter 1. Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Research Contributions and Significance
1.4 Thesis Organization
Chapter 2. Literature Review
2.1 Keyword Extraction Methods
2.2 Feature Term Weighting Methods
Chapter 3. Research Method
3.1 Data Sample Selection
3.2 Google 5-gram Corpus
3.3 Classifier Selection
3.4 N-gram Segmentation Algorithm
3.4.1 Stopword Removal (Stopwords)
3.4.2 Stemming
3.4.3 N-gram Selection (N-gram Choosing)
3.5 Google Frequency Weighting Method
3.5.1 Google Frequency
3.5.2 Google based Term Frequency
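Chapter 4. Experiments
4.1 Experimental Limitations
4.2 Experimental Datasets
4.3 Evaluation Methods
4.4 Experiment 1: Balanced Dataset (Classic4)
4.4.1 Experimental Setup
4.4.2 Experimental Analysis
4.5 Experiment 2: Imbalanced Dataset (Reuters-21578)
4.5.1 Experimental Setup
4.5.2 Experimental Analysis
4.6 Experiment 3: Effect of Google Frequency Weighting
4.6.1 Google Frequency
4.6.2 Positive Weighting (G+TF)
4.6.3 Negative Weighting (G-TF)
4.7 Discussion of Experimental Results
4.7.1 Discussion of Experiment 1
4.7.2 Discussion of Experiment 2
4.7.3 Discussion of Experiment 3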
Chapter 5. Conclusions and Future Work
Chapter 6. References
6.1 Chinese References
6.2 English References
