National Digital Library of Theses and Dissertations in Taiwan

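Graduate student: 黃政傑
Graduate student (English name): Jeng Jie Huang
Thesis title: 應用於Blog Connect的中文未知詞擷取模型
Thesis title (English): A Chinese Unknown Words Extraction Model for The Blog Connect
Advisor: 呂瑞麟
Advisor (English name): Eric Jui-Lin Lu
Oral defense committee: 陳克健, 陳宜惠
Oral defense date: 2014-07-25
Degree: Master's
Institution: National Chung Hsing University (國立中興大學)
Department: Information Management (資訊管理學系所)
Discipline: Computing
Subdiscipline: General Computing
Thesis type: Academic thesis
Year of publication: 2014
Academic year of graduation: 102
Language: Chinese
Number of pages: 36
Chinese keywords: 未知詞, 中文斷詞, 查詢關鍵字
English keywords: Unknown word, Chinese segmentation, Queried keyword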
Usage counts:
  • Cited by: 0
  • Views: 204
  • Downloads: 21
  • Saved to bibliography lists: 2
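Abstract (translated from Chinese): Unlike text in Western languages, Chinese text is not written with a blank between words, so word segmentation is a critical step in processing Chinese data, and one of its main problems is the handling of unknown words. Traditional unknown word extraction methods generally target a single article and extract unknown words sentence by sentence. The queried keywords that the Blog Connect platform collects when users search for a blog article, however, are not complete sentences. We therefore propose an unknown word extraction method that extracts unknown words from queried keywords and thereby improves the accuracy of their segmentation. In this thesis, unknown words are handled in two phases: detection and extraction. In the detection phase, we set conditions based on the characteristics of the queried keyword set and on keyword frequency to detect queried keywords that may contain unknown words. In the extraction phase, our proposed algorithm, combined with processing rules, extracts unknown words recursively. Experimental results show that our method helps improve the segmentation accuracy of queried keywords, reaching an F-measure of 76.75%. Our method also outperforms CKIP, a segmentation system with unknown word recognition: of the 988 unknown words in the experimental data, our method extracted 689, whereas CKIP extracted only 573.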
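The two-phase process described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the toy lexicon and queries, the forward maximum matching segmenter, the detection rule (flag any token missing from the lexicon), the merge rule (merge the adjacent pair whose concatenation appears in the most queries), and the `min_freq` threshold. The thesis's actual Pattern Condition and Statistical Condition are not given in this record.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching; unmatched characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def may_contain_unknown(tokens, lexicon):
    """Detection phase (assumed rule): flag any token missing from the lexicon."""
    return any(tok not in lexicon for tok in tokens)


def merge_bottom_up(tokens, queries, lexicon, min_freq=2):
    """Extraction phase (assumed rule): recursively merge the adjacent pair
    whose concatenation appears in the most queries, if at least min_freq."""
    best_idx, best_freq = None, 0
    for i in range(len(tokens) - 1):
        # Skip pairs made of two known words; they need no merging.
        if tokens[i] in lexicon and tokens[i + 1] in lexicon:
            continue
        merged = tokens[i] + tokens[i + 1]
        freq = sum(merged in q for q in queries)
        if freq >= min_freq and freq > best_freq:
            best_idx, best_freq = i, freq
    if best_idx is None:
        return tokens  # no pair qualifies, so merging stops
    joined = tokens[best_idx] + tokens[best_idx + 1]
    return merge_bottom_up(
        tokens[:best_idx] + [joined] + tokens[best_idx + 2:],
        queries, lexicon, min_freq,
    )


def extract_unknown_words(queries, lexicon, min_freq=2):
    """Run detection, then extraction, over a set of queried keywords."""
    found = set()
    for q in queries:
        tokens = segment(q, lexicon)
        if may_contain_unknown(tokens, lexicon):
            for tok in merge_bottom_up(tokens, queries, lexicon, min_freq):
                if len(tok) > 1 and tok not in lexicon:
                    found.add(tok)
    return found


# Hypothetical toy data: "馬卡龍" (macaron) is absent from the lexicon.
lexicon = {"食譜", "價格", "台北"}
queries = ["馬卡龍食譜", "馬卡龍價格", "台北馬卡龍"]
print(extract_unknown_words(queries, lexicon))
```

In this toy run the single characters 馬, 卡, and 龍 are merged step by step because their concatenations recur across queries, while a longer string such as "馬卡龍食譜" occurs only once and is left unmerged. Using frequency over the whole query set rather than sentence context reflects the abstract's point that queried keywords are not complete sentences.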
Abstract (English): Because original Chinese texts contain no blanks to mark word boundaries, the main goal of Chinese word segmentation is to identify words. One of the major problems in word segmentation is the unknown word, that is, the occurrence of an out-of-vocabulary word. Based on our observation, the queried keywords collected from Blog Connect are mostly incomplete sentences, whereas Chinese unknown word extraction methods are mainly designed for processing complete sentences. Therefore, we propose a Chinese unknown word extraction model for Blog Connect. We use a two-phase approach to solve the unknown word problem: the first phase detects unknown words and the second phase extracts them. In the detection phase, we use the characteristics of the queried keyword set, together with keyword frequency and probability, to establish rules. These rules determine whether a queried keyword contains unknown words. In the extraction phase, we propose a variant of the bottom-up merging algorithm, combined with rules, to extract unknown words recursively. The experimental results (F-measure of 76.75%) show that our method improves the performance of Chinese word segmentation for queried keywords. Of the 988 unknown words in our experimental data, our method extracts 689, whereas CKIP extracts only 573.
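The comparison with CKIP can be put on a common scale with a little arithmetic. The sketch below uses only the three counts stated in the abstract and assumes that the 689 and 573 extracted words are all correct extractions, so each ratio is the share of the 988 unknown words that a system recovered. The `f_measure` helper shows the common balanced form of the metric; whether the thesis uses exactly this form is an assumption. The abstract's precision and recall are not reported, so the 76.75% figure cannot be reproduced here.

```python
def coverage(extracted: int, total: int) -> float:
    """Share of the gold-standard unknown words that a system extracted."""
    return extracted / total


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F1 form)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


TOTAL_UNKNOWN_WORDS = 988   # from the abstract
PROPOSED_EXTRACTED = 689    # from the abstract
CKIP_EXTRACTED = 573        # from the abstract

proposed = coverage(PROPOSED_EXTRACTED, TOTAL_UNKNOWN_WORDS)
ckip = coverage(CKIP_EXTRACTED, TOTAL_UNKNOWN_WORDS)

print(f"Proposed method: {proposed:.1%}")
print(f"CKIP:            {ckip:.1%}")
print(f"Additional unknown words extracted: {PROPOSED_EXTRACTED - CKIP_EXTRACTED}")
```

Under that assumption, the proposed method recovers about 69.7% of the unknown words and CKIP about 58.0%, a difference of 116 words.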
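Chinese Abstract i
English Abstract ii
Table of Contents iii
List of Tables iv
List of Figures v
1. Introduction 1
2. Related Work 5
3. Research Method 9
3.1. Blog Connect 9
3.1.1. Preprocessing 10
3.2. Unknown Word Detection 11
3.3. Unknown Word Extraction 12
3.3.1. Chinese Personal Name Extraction 13
3.3.2. Transliterated Western Name Extraction 14
3.3.3. Extraction of Unknown Words of No Specific Type 14
3.3.3.1. Priority 16
3.3.3.2. Unknown Word Extraction Rules (General Rule) 17
3.3.3.2.1. Pattern Condition 17
3.3.3.2.2. Statistical Condition 18
3.3.3.3. Example of the Variant Bottom-Up Merging Algorithm 20
4. Experiment and Analysis 23
4.1. Experiment 1 24
4.2. Experiment 2 26
5. Conclusion and Future Work 29
References 30
Appendix 34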