
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

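Author: 姜天戩
Author (romanized name): Jiang, Tian-Jian
Thesis title: Syllable Word Segmentation for Mandarin Chinese via Double Ranking of the Left and Right Context
Advisor: 許聞廉
Degree: Doctoral
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2012
Graduation academic year: 100 (ROC calendar)
Language: English
Number of pages: 65
Chinese keywords: 中文斷詞; 交疊歧義; 雙重排名
English keywords: Chinese word segmentation; overlapping ambiguity; double rank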
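Abstract (translated from the Chinese original): Syllable word segmentation is a part of Chinese Zhuyin/Pinyin input methods. Compared with Chinese word segmentation, homophone ambiguity introduces more overlapping boundaries. Chinese Zhuyin/Pinyin input methods usually assume that the input is a complete sentence and evaluate performance on well-formed corpora. However, most Pinyin users prefer to enter short chunks one at a time, most of them only one or two characters long, and this usage is even more common on handheld devices with limited computing resources. When the goal is the best possible syllable-to-character conversion, short chunks often fail to provide enough context, especially when a chunk contains overlapping boundaries. These overlapping ambiguities are directional. This dissertation therefore proposes a double ranking strategy that exploits the left and right context. Experimental results show that double ranking performs better than the frequency-based method, which needs little memory and is fast enough, and that it occupies very little space while performing on par with the conditional random fields model, which needs a great deal of memory and is rather slow.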
Syllable word segmentation, as a part of Chinese phonetic input methods (CPIM), involves more overlapping boundaries than word segmentation because of homophone ambiguities. A CPIM usually assumes that the input is a complete sentence and evaluates performance on a well-formed corpus. However, most Pinyin users prefer progressive text entry in short chunks, mostly of one or two words each, a habit that is even more common on handheld devices with limited computing power. Short chunks do not provide enough context to perform the best possible syllable-to-character conversion, especially when a chunk contains overlapping boundaries. Those overlapping ambiguities show directional tendencies. This dissertation proposes a double ranking (DR) strategy on the left and right context. Experiments show that DR outperforms the frequency-based method (low memory and fast), and that it needs far less memory than the conditional random fields model (large memory and slow) while delivering competitive performance.
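To make the notion of overlapping boundaries concrete, the following is a minimal illustrative sketch, not the dissertation's double ranking method. It assumes a hypothetical toy lexicon of toneless Pinyin syllable words (the entries, the function names `segmentations` and `overlapping_spans`, and the sample input are all invented for illustration). It enumerates every lexicon-consistent segmentation of a short syllable chunk and reports the pairs of candidate words whose spans partially overlap.

```python
from typing import List, Sequence, Set, Tuple

Word = Tuple[str, ...]
Span = Tuple[int, int]

# Hypothetical toy lexicon of toneless Pinyin syllable words (assumption,
# not taken from the dissertation). Characters in comments are examples only.
LEXICON: Set[Word] = {
    ("fa", "zhan"),    # e.g. 發展
    ("zhong",),        # e.g. 中
    ("zhong", "guo"),  # e.g. 中國
    ("guo", "jia"),    # e.g. 國家
    ("jia",),          # e.g. 家
}


def segmentations(syllables: Sequence[str], lexicon: Set[Word]) -> List[List[Word]]:
    """Return every way to split the syllables into words found in the lexicon."""
    if not syllables:
        return [[]]
    results: List[List[Word]] = []
    for end in range(1, len(syllables) + 1):
        head = tuple(syllables[:end])
        if head in lexicon:
            for rest in segmentations(syllables[end:], lexicon):
                results.append([head] + rest)
    return results


def overlapping_spans(syllables: Sequence[str], lexicon: Set[Word]) -> List[Tuple[Span, Span]]:
    """Return pairs of lexicon-word spans (start, end) that partially overlap."""
    n = len(syllables)
    spans = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n + 1)
        if tuple(syllables[i:j]) in lexicon
    ]
    # Two spans overlap ambiguously when the second starts strictly inside
    # the first and ends strictly after it.
    return [(a, b) for a in spans for b in spans if a[0] < b[0] < a[1] < b[1]]


chunk = ["fa", "zhan", "zhong", "guo", "jia"]
all_segmentations = segmentations(chunk, LEXICON)
ambiguous_pairs = overlapping_spans(chunk, LEXICON)
```

In this toy input, the candidates "zhong guo" and "guo jia" compete for the shared syllable "guo", which is the kind of overlapping ambiguity the abstract refers to; deciding between them requires context to the left or right of the chunk.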
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 MOTIVATIONS
1.2 CHINESE WORD SEGMENTATION
1.3 OUT-OF-VOCABULARY AND IN-VOCABULARY
1.4 CHINESE SHORT SYLLABLE WORD SEGMENTATION
1.5 AMBIGUITY ON SURFACE PATTERNS: WORDS AND SHORT SYLLABLE WORDS
1.6 DOUBLE RANKING STRATEGY
1.7 OUTLINES OF THIS DISSERTATION
CHAPTER 2 RELATED WORKS ON SEGMENTATION
2.1 CHINESE WORD SEGMENTATION
2.1.1 The State of the Art
2.1.2 Unsupervised Feature Selection
2.1.3 Evaluation Metrics
2.2 PREDICTIVE PHONETIC INPUT METHOD AND SYLLABLE-TO-WORD CONVERSION IN CHINESE
2.3 CONCEPTS SIMILAR TO DOUBLE RANKING STRATEGY
CHAPTER 3 CHINESE SYLLABLE WORD SEGMENTATION USING MINIMAL LEFT AND RIGHT OVERLAPPING CONTEXT
3.1 SYLLABLE WORD SEGMENTATION
3.2 DOUBLE RANK ASSIGNMENT PROBLEM
3.2.1 Problem Definition
3.2.2 The Feedback Arc Set Problem (FASP)
3.3 ALGORITHMS
3.3.1 Rank Pre-assignment
3.3.2 Genetic Algorithm for the DRAP
3.4 EXPERIMENTS
3.4.1 Data Set
3.4.2 Closed Test on Toneless Pinyin Syllables
3.4.3 Open Test on Toneless Pinyin Syllables
3.4.4 Additional Closed Test on Tonal Pinyin Syllables
3.5 DISCUSSION
3.6 SPACE REQUIREMENT
3.7 REINFORCEMENT MEMOIZATION
3.8 SUMMARY
CHAPTER 4 CONCLUSION AND FUTURE WORKS
4.1 CONCLUSION
4.1.1 Defense
4.1.2 Summary
4.2 FUTURE WORKS
BIBLIOGRAPHY

