National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

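Graduate student: 黃健哲
Graduate student (English name): Jian-Jhe Huang
Thesis title: 以遺傳演算法為基礎結合交互資訊之自動化中文斷詞系統
Thesis title (English): An Automatic Chinese Word Segmentation System Based on Integration of Genetic Algorithm and Mutual Information
Advisor: 洪智力
Advisor (English name): Chih-Li Hung
Degree: Master's
Institution: Chung Yuan Christian University (中原大學)
Department: Graduate Institute of Information Management
Discipline: Computer Science
Field: General Computer Science
Thesis type: Academic thesis
Publication year: 2011
Graduation academic year: 99
Language: Chinese
Pages: 54
Keywords (Chinese): 中文文章分類 (Chinese document classification); 中文資訊處理 (Chinese information processing); 中文斷詞器 (Chinese word segmenter); 交互資訊 (mutual information); 遺傳演算法 (genetic algorithm)
Keywords (English): Genetic Algorithm; Chinese Segmentation System; Chinese Information Processing; Chinese Segmenter; Chinese Document Classification; Mutual Information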
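With the rapid development of information technology, computer processing of Chinese has moved from the question of how to display Chinese characters to the question of how to make a computer understand the content of a document. Every area of Chinese information processing, such as Chinese speech recognition, Chinese information retrieval, machine translation, natural language understanding, and Chinese text mining, must first perform Chinese word segmentation, which cuts sentences or documents into smaller lexical units that a machine can then understand and process. Modern Chinese word segmentation systems still depend on the lexical knowledge in a back-end lexicon or corpus, and building a lexicon requires a great deal of human labor. When new or unknown words appear and the lexicon cannot be updated quickly enough to keep up with them, the output of the segmenter is directly affected. To address these problems, this study proposes a dynamically adaptive Chinese word segmentation system that collects Chinese text from the web to build and update its lexicon. For the segmentation computation, a genetic algorithm searches for the best segmentation combination, with a fitness function based on the concept of mutual information to better account for the relationships between adjacent words. Finally, on a Chinese document classification task, the proposed segmenter is compared with a genetic-algorithm-based segmenter proposed in earlier work and with the well-known segmenter CKIP. The results show that when the documents to be classified are themselves used as the source for building the lexicon, documents segmented by the proposed segmenter are classified with clearly higher accuracy than those segmented by the other two. Among segmenters that both use a genetic algorithm to compute the segmentation, the proposed fitness function incorporating mutual information also outperforms the earlier fitness function based on word length and word frequency under the same comparison baseline. According to these results, if the proposed lexicon-building method can collect vocabulary more broadly, it can improve the segmentation results of existing Chinese word segmenters.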
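The abstract does not give the thesis's exact formulas, so the following is only a minimal sketch of how mutual information between adjacent characters might be estimated from a collected corpus. It assumes pointwise mutual information over character unigram and bigram counts; the function names `build_counts` and `pmi` and the three-line toy corpus are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def build_counts(texts, max_n=2):
    """Count character n-grams (n = 1..max_n) over raw text lines.

    Returns (counts, totals): counts maps each n-gram string to its frequency,
    totals maps n to the total number of n-grams of that length.
    """
    counts = Counter()
    totals = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
                totals[n] += 1
    return counts, totals


def pmi(a, b, counts, totals):
    """Pointwise mutual information log2(P(ab) / (P(a) * P(b))).

    Assumes the corpus contains at least one bigram.
    Returns -inf when the pair 'ab' never occurs in the corpus.
    """
    p_ab = counts[a + b] / totals[2]
    if p_ab == 0:
        return float("-inf")
    p_a = counts[a] / totals[1]
    p_b = counts[b] / totals[1]
    return math.log2(p_ab / (p_a * p_b))


# Toy corpus (illustrative only, not the thesis data).
corpus = ["中文斷詞", "中文資訊", "斷詞系統"]
counts, totals = build_counts(corpus)
inside_word = pmi("中", "文", counts, totals)    # pair that forms a word
across_words = pmi("文", "斷", counts, totals)   # pair that straddles two words
```

On this toy corpus the pair inside a word scores higher than the pair that straddles a word boundary, which is the kind of signal a mutual-information-based fitness function could exploit.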


With the development of information technology, Chinese information processing has shifted from simply displaying Chinese characters on a computer screen to understanding meaning and context. All of its subareas, including Chinese speech recognition, Chinese information retrieval, Chinese document classification, Chinese machine translation, natural language understanding, and Chinese text mining, depend on a first step of Chinese word segmentation, which splits documents or sentences into meaningful words that a machine can process. Current Chinese word segmentation systems still rely on the knowledge in a back-end lexicon or corpus, and building such a lexicon requires a great deal of manual work. If a new or unknown word, such as a person name, place name, or event name, appears and the lexicon is not updated promptly, the segmentation result suffers. To address this problem, we propose a Chinese word segmentation system with dynamic adaptive ability in lexicon building: it collects Chinese documents from the Internet and uses them to build and update the lexicon automatically. To compute the segmentation, we use a genetic algorithm whose fitness function is based on the concept of mutual information, borrowed from statistics, to better capture the relationships between neighboring words. Finally, because there is no absolute criterion for judging whether a segmentation result is good, we evaluate segmentation quality through Chinese document classification, comparing our system with another GA-based Chinese word segmentation system proposed by Chen (2000) and with CKIP, the well-known Chinese word segmentation system developed by Academia Sinica.
The results show that when the document set to be classified is also used as training data for lexicon building, the proposed segmenter yields higher classification accuracy than the other two segmenters. Among GA-based segmenters compared on the same basis, our fitness function that incorporates mutual information also outperforms Chen's fitness function, which uses word length and frequency. These results suggest that if the proposed lexicon-building approach were used to collect words on a much larger scale, it could greatly improve the results of current Chinese word segmentation systems.
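To make the idea of searching for a segmentation with a genetic algorithm more concrete, here is a minimal sketch under stated assumptions. It is not the thesis's implementation: the binary boundary encoding, the fitness definition, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), and the association scores in `assoc` are all hypothetical choices made for illustration. In practice the scores could come from a mutual information estimate over a corpus.

```python
import random


def decode(sentence, bits):
    """Split the sentence after position i wherever bits[i] == 1."""
    words, start = [], 0
    for i, bit in enumerate(bits):
        if bit:
            words.append(sentence[start:i + 1])
            start = i + 1
    words.append(sentence[start:])
    return words


def fitness(sentence, bits, assoc):
    """Reward strong association inside words and weak association across cuts.

    assoc maps a two-character string to an association score (assumed given);
    unseen pairs score 0.
    """
    score = 0.0
    for i, bit in enumerate(bits):
        a = assoc.get(sentence[i:i + 2], 0.0)
        score += -a if bit else a
    return score


def segment(sentence, assoc, pop_size=20, generations=40, p_mut=0.1, seed=0):
    """Evolve boundary bit strings and return the best segmentation found."""
    rng = random.Random(seed)
    n = len(sentence) - 1  # one candidate boundary between each adjacent pair
    if n <= 0:
        return [sentence]

    def score(bits):
        return fitness(sentence, bits, assoc)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = [max(pop, key=score)[:]]  # elitism: carry over the current best
        while len(nxt) < pop_size:
            # Binary tournament selection for each parent.
            p1 = max(rng.sample(pop, 2), key=score)
            p2 = max(rng.sample(pop, 2), key=score)
            cut = rng.randint(0, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - b if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
    return decode(sentence, max(pop, key=score))


# Hypothetical association scores for a four-character example.
assoc = {"中文": 3.0, "文斷": -1.0, "斷詞": 3.0}
words = segment("中文斷詞", assoc)
```

A genetic algorithm is a heuristic search, so the returned segmentation is the best candidate found rather than a guaranteed optimum. The fixed `seed` only makes runs repeatable.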


Table of Contents
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Questions
1.4 Research Objectives
1.5 Thesis Organization
Chapter 2 Literature Review
2.1 Introduction to Chinese Word Segmentation
2.2 Traditional Chinese Word Segmentation Methods
2.2.1 Rule-Based Segmentation
2.2.1.1 Backtracking Maximum Matching
2.2.1.2 Forward-Backward Maximum Matching
2.3 Statistical Segmentation
2.3.1 Simple Probabilistic Segmentation
2.3.2 Mutual Information
2.4 Hybrid Segmentation
2.5 Genetic Algorithms
2.5.1 Characteristics of Genetic Algorithms
2.5.2 How Genetic Algorithms Operate
2.6 Support Vector Machines
Chapter 3 Research Method
3.1 System Architecture
3.2 Initial Training Module
3.2.1 Data Sources
3.2.2 Crawler
3.2.3 Body Text Extraction
3.2.4 Document Preprocessing
3.2.5 N-gram Algorithm
3.3 Automatic Training Module
3.4 Lexicon Pruning
3.5 Segmentation Optimization Module
3.5.1 Chromosome Encoding
3.5.2 Fitness Function
3.6 Evaluation by Chinese Document Classification
Chapter 4 Experimental Results
4.1 Experimental Setup
4.1.1 Chinese Word Segmenters
4.1.2 Training and Test Sets
4.1.3 SVM Classifier Parameter Settings
4.1.4 Evaluation Method
4.2 Experimental Results and Analysis
4.2.1 Experiment 1
4.2.2 Experiment 2
Chapter 5 Conclusions and Future Work
References
Appendix A: Chinese Stop Word List
Appendix B: Symbol Table

List of Figures
Figure 2.1 Basic flow of statistical segmentation
Figure 2.2 Genetic algorithm workflow
Figure 2.3 SVM classification diagram
Figure 3.1 System architecture
Figure 3.2 News document links
Figure 3.3 Document preprocessing
Figure 3.4 Partial lexicon contents
Figure 3.5 Chromosome encoding

List of Tables
Table 2.1 Word probability table
Table 3.1 Confusion matrix
Table 4.1 Segmenter aliases
Table 4.2 Document dataset categories and article counts
Table 4.3 SVM parameter settings
Table 4.4 Experiment 1: CKIP classification results
Table 4.5 Experiment 1: GA + word frequency segmenter classification results
Table 4.6 Experiment 1: GA+MI segmenter classification results
Table 4.7 Experiment 2: GA + word frequency segmenter classification results
Table 4.8 Experiment 2: GA+MI segmenter classification results
Table 4.9 Results of the paired-sample test of mean differences

References
English-language references:
(1) Feng, Y. (1998). Design and analysis of Chinese automatic segmenting system based on neural network. Journal of The China Society for Scientific and Technical Information, 1.
(2) Foo, S., & Li, H. (2004). Chinese Word Segmentation and its effect on information retrieval. Information Processing and Management, 40(1), 161-190.
(3) Fung, P., & Wu, D. (1999). Statistical augmentation of a Chinese machine-readable dictionary. Natural Language Processing Using Very Large Corpora, 137.
(4) Galil, Z. (1986). Efficient algorithms for finding maximum matching in graphs. ACM Computing Surveys (CSUR), 18(1), 23-38.
(5) Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Boston: Addison-Wesley Longman Publishing Co., Inc.
(6) Li, H., & Yuan, B. (1998). Chinese Word Segmentation. Proceedings of the 12th Pacific Asia Conference on Language, Information and Computation, 212-217.
(7) He, J., & Chen, L. (2008). Chinese Word Segmentation based on the improved Particle Swarm Optimization neural networks. Cybernetics and Intelligent Systems 2008 IEEE Conference, 695-699.
(8) Hong, C. M., Chen, C. M., & Chiu, C. Y. (2006). New word extraction utilizing Google News corpuses for supporting lexicon-based Chinese Word Segmentation systems. Neural Networks 2006 IJCNN, 3040-3046.
(9) Wang, H., & Cui, M. (2009). A Chinese Word Segmentation based on machine learning. Education Technology and Computer Science, ETCS'09, 2, 610-613.
(10) Lipski, W., & Preparata, F. P. (1981). Efficient algorithms for finding maximum matchings in convex bipartite graphs and related problems. Acta Informatica, 15, 329-346.
(11) Wong, P., & Chan, C. (1996). Chinese Word Segmentation based on maximum matching and word binding force. Proceedings of the 16th conference on Computational linguistics, 1, 200-203.
(12) Wu, Z., & Tseng, G. (1993). Chinese text segmentation for text retrieval: Achievements and problems. Journal of the American Society for Information Science, 44, 532-542.
(13) Yang, C. C., Yen, J., Yung, S. K., & Chung, A. K. (1998). Chinese indexing using mutual information. Proceedings of the First Asia Digital Library Workshop, 1, 57-64.
(14) Zhen, L., & Li, Y. (2010). Reverse backtracking research of Chinese segmentation based on dictionary of hash structure. Information Technology and Computer Science (ITCS), 2010 Second International Conference, 265-267.
(15) Ma, W. Y., & Chen, K. J. (2003). Introduction to CKIP Chinese word segmentation system for the first international Chinese Word Segmentation Bakeoff. Proceedings of the second SIGHAN workshop on Chinese language processing, 17, 168-171.
(16) Manning, C., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
(17) Reinman, S. L. (2010). The World Factbook. Oklahoma: Emerald Group Publishing Limited.
(18) Chen, K. J., & Ma, W. Y. (2002). Unknown word extraction for Chinese documents. COLING '02 Proceedings of the 19th international conference on Computational linguistics, 1, 1-7.
(19) Chen, K. J., & Liu, S. H. (1992). Word Identification for Mandarin Chinese Sentences. COLING '92 Proceedings of the 14th conference on Computational linguistics, 101-107.
(20) Lu, X. (2005). Towards a hybrid model for Chinese Word Segmentation. Proceedings of Fourth SIGHAN Workshop on Chinese Language Processing, 189-192.
(21) Chau, M., Lu, Y., Fang, X., & Yang, C. C. (2009). Characteristics of character usage in Chinese Web searching. Information Processing and Management, 45, 115-130.
(22) Jia, H., Lin, C., & Hong, X. (2007). Chinese Word Segmentation using back propagation trained by genetic algorithm DCDIS A Supplement. Advances in Neural Networks, 14, 416-420.
(23) Zhang, H. P., Liu, Q., Zhang, H., & Cheng, X. (2002). Automatic recognition of Chinese unknown words based on roles tagging. In Proceedings of First SIGHAN Workshop on Chinese Language Processing, 71-77.
(24) Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3), 273-297.
(25) Zhou, S., & Guan, J. (2002). Chinese Documents Classification Based on N-Grams. Computational Linguistics and Intelligent Text Processing, 2276, 31-50.
(26) Yang, S., Zhu, H., Apostoli, A., & Cao, P. (2007). N-gram Statistics in English and Chinese: Similarities and Differences. Journal of IEEE Internet Computing, 454-460.
Chinese-language references:
(1) 陳稼興, 謝佳倫, & 許芳誠. (2000). 以遺傳演算法為基礎的中文斷詞研究 [A study of Chinese word segmentation based on genetic algorithms]. 資訊管理研究期刊, 8-24.
(2) 陳永德. (1997). 中文斷詞中長詞優先、詞頻比對及前詞優先規則之使用 [The use of longest-word-first, word-frequency comparison, and preceding-word-first rules in Chinese word segmentation]. Doctoral dissertation, Graduate Institute of Psychology, National Taiwan University.
(3) 林千翔. (2004). 基於特製隱藏式馬可夫模型之中文斷詞研究 [A study of Chinese word segmentation based on a specialized hidden Markov model]. Master's thesis, Graduate Institute of Computer Science and Information Engineering, National Central University.
