臺灣博碩士論文加值系統

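Author: 林義証
Author (English): Yih-Jeng Lin
Thesis title: 中文常用字串-一個優於傳統語言模型的新觀念
Thesis title (English): Chinese Frequent Strings, A New Concept Over Traditional Language Models
Advisor: 余明興
Advisor (English): Ming-Shing Yu
Degree: Doctoral
Institution: National Chung Hsing University (國立中興大學)
Department: Department of Applied Mathematics
Discipline: Mathematics and Statistics
Field: Mathematics
Thesis type: Academic thesis
Year of publication: 2002
Academic year of graduation: 90 (ROC academic year)
Language: English
Number of pages: 68
Keywords (Chinese): 中文常用字串; 正規化混淆度; 音轉字; 字轉音; 韻律段; 中文自然語言處理; 語言模型; 未知詞
Keywords (English): CFS; normalized perplexity; phoneme-to-character; character-to-phoneme; prosodic segment; Chinese natural language processing; language models; unknown words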
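This dissertation proposes a new concept called the Chinese frequent string (CFS), that is, a string that we use frequently. CFSs include words found in ordinary dictionaries, unknown words, and other frequently occurring strings, such as 「只得將」, 「分分秒秒」, 「為對方著想」, and 「並沒有人」. We find that a unigram model of CFSs is superior to traditional language models in many respects.
In this dissertation we propose a method for extracting CFSs; without the aid of a dictionary, we can extract CFSs from a Chinese corpus. We also evaluate whether these CFSs are appropriate.
We find that CFSs have many properties superior to those of traditional language models and dictionaries, and we compare their performance in solving several natural language processing problems.
In addition, we propose a method for assigning appropriate parts of speech to CFSs. These parts of speech can provide important information for solving Chinese natural language processing problems, such as word formation.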

This dissertation proposes a new concept, the Chinese frequent string (CFS): a string that people use frequently. CFSs include the words defined in a traditional lexicon, unknown words, and other frequently occurring Chinese strings. Examples of CFSs are “只得將 (can only let)”, “分分秒秒 (every minute and every second)”, “為對方著想 (bearing in mind the interest of each other)”, and “並沒有人 (and nobody)”. We find that a unigram language model over CFSs is superior to traditional language models in many respects.
We propose a method for extracting CFSs from a Chinese corpus without using a dictionary. We also evaluate the extracted CFSs and show that they are appropriate.
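The record does not describe the extraction procedure itself, so the following is only a generic sketch of dictionary-free extraction, not the dissertation's method. It assumes a plain frequency threshold over character substrings, followed by removing any string that occurs only inside a longer frequent string. The function names, `max_len`, `min_count`, and the toy corpus are all hypothetical.

```python
from collections import Counter


def extract_frequent_strings(sentences, max_len=4, min_count=2):
    """Count character substrings of length 2..max_len and keep the frequent ones.

    Assumption: a simple frequency threshold stands in for the real criterion.
    """
    counts = Counter()
    for sentence in sentences:
        n = len(sentence)
        for start in range(n):
            for length in range(2, max_len + 1):
                if start + length > n:
                    break
                counts[sentence[start:start + length]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def drop_subsumed(frequent):
    """Drop a string when a longer frequent string containing it has the same count,
    i.e. the shorter string never occurs on its own in the corpus."""
    kept = {}
    for s, c in frequent.items():
        subsumed = any(
            len(t) > len(s) and s in t and frequent[t] == c for t in frequent
        )
        if not subsumed:
            kept[s] = c
    return kept


# Hypothetical toy corpus; no dictionary is consulted at any point.
corpus = ["並沒有人知道", "並沒有人來", "他並沒有說"]
candidates = drop_subsumed(extract_frequent_strings(corpus))
print(candidates)
```

On this toy corpus the sketch keeps 「並沒有」 and 「並沒有人」 and discards fragments such as 「沒有人」 that appear only inside a longer kept string. A real corpus would need a much higher threshold and a more careful filter.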
We find that CFSs have many properties that make them superior to traditional language models and traditional dictionaries, and we compare the three. We also apply CFSs to several Chinese natural language processing problems and find that a unigram language model over CFSs is superior to traditional language models and lexicons in solving them.
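As a rough illustration of how a unigram model over a string inventory can be used, the sketch below splits a sentence into the sequence of strings with the highest product of unigram probabilities, using dynamic programming. It is an assumption-laden example, not the dissertation's system: the `segment` function, the probability values, and the fallback `unknown_prob` for unseen single characters are all hypothetical.

```python
import math


def segment(text, unigram_prob, max_len=5, unknown_prob=1e-8):
    """Return the split of text that maximizes the product of unigram probabilities."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last piece in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in unigram_prob:
                p = unigram_prob[piece]
            elif len(piece) == 1:
                p = unknown_prob    # assumed fallback so every character is reachable
            else:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Hypothetical probabilities; a real model would estimate them from corpus counts.
probs = {"並沒有人": 0.002, "沒有": 0.01, "人": 0.02, "並": 0.005, "知道": 0.008}
result = segment("並沒有人知道", probs)
print(result)
```

Because 「並沒有人」 is a single entry, the two-piece split scores higher than the split into 「並」, 「沒有」, 「人」, and 「知道」. In this example, a long string acting as one unit gives some of the context that a word-level model would need a higher-order n-gram to capture.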
We also propose a method for assigning appropriate part-of-speech (POS) information to CFSs. Such POS information can be important for many Chinese language processing problems, such as parsing.

List of Tables
List of Figures
Chapter 1 Introduction
1.1 The motivation
1.2 Some related works
1.3 What is a Chinese frequent string
1.4 Structure of the dissertation
Chapter 2 Extracting Chinese Frequent Strings
2.1 The main idea
2.2 Constructing the MayBe database
2.3 Extracting Chinese frequent strings from MayBe
2.4 Implementation
2.5 Evaluations of Chinese frequent strings
2.5.1 Comparing the word boundaries
2.5.2 Comparing the normalized perplexity
2.5.3 The precision rate and recall rate
2.5.4 The MOS measurement
Chapter 3 Some Properties of CFS versus LM and ASCED
3.1 The lexicon
3.2 CFSs vs. LMs
3.3 CFSs vs. ASCED
Chapter 4 Assign Part-of-Speech Information to CFSs
4.1 Extracting the parsing rules from Sinica treebank version 1.0
4.2 Determining the parts-of-speech of a CFS
4.3 The experiment results
Chapter 5 Applications in Mandarin and Taiwanese TTS Systems
5.1 The Chinese character-to-phoneme task
5.2 The determination of prosodic segments
5.3 The application of CFSs in a Taiwanese TTS system
Chapter 6 Applications in Chinese Natural Language Processing
6.1 The Chinese phoneme-to-character conversion
6.2 The Chinese toneless phoneme-to-character conversion
6.3 The Chinese spelling error correction issue
Chapter 7 Conclusions and Future Works
7.1 Some properties of CFSs over LM and ASCED
7.2 Some possible future works
Acknowledgement
References
Appendix A: Abbreviations in this paper
Appendix B: Some Chinese Frequent Strings
Publications

