National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Graduate Student: 楊顓溥
Graduate Student (English): Chuan-Pu Yang
Thesis Title: 中文縮寫詞研究
Thesis Title (English): A Study of Chinese Abbreviations
Advisor: 黃純敏
Advisor (English): Chuen-Min Huang
Degree: Master's
Institution: 國立雲林科技大學 (National Yunlin University of Science and Technology)
Department: 資訊管理系碩士班 (Graduate Institute of Information Management)
Discipline: Computer Science
Academic Field: General Computer Science
Thesis Type: Academic thesis
Year of Publication: 2005
Graduation Academic Year: 93 (2004-2005)
Language: English
Pages: 53
Keywords (Chinese): 特徵選取, 最大熵, 中文縮寫詞, 最長共同子序列
Keywords (English): Longest Common Subsequence, Maximum Entropy Principle, Chinese abbreviation, Feature Selection
Usage statistics:
  • Cited by: 3
  • Views: 1077
  • Downloads: 59
  • Bookmarked: 0
In Chinese documents, words frequently appear in abbreviated form; for example, 「台灣鐵路局」 (Taiwan Railway Administration) is shortened to 「台鐵局」. This high degree of "abbreviability" saves time and adds convenience, but it also poses challenges for Chinese text processing. In a keyword-based information retrieval system, the abbreviated form and the original form of a query term are treated by the search engine as two different words, so the returned results miss a great deal of relevant information. Abbreviations likewise degrade system performance in Chinese word segmentation, automatic document clustering, and term-weight computation.

To address these problems, this study proposes a mechanism for mapping between Chinese abbreviations and their original forms. The mechanism links each abbreviation in a document to its corresponding original form without relying on any fixed dictionary; in effect, it uses the corpus to build a dynamic abbreviation lookup table, and it can easily be ported to other languages. A sketch of the core matching idea follows.
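The keyword list above names the longest common subsequence (LCS), which the thesis reviews in Chapter 2 as a sequence-similarity measure. Below is a minimal Python sketch, not the thesis's actual implementation, of how an LCS check can screen whether a full form is a plausible expansion of an abbreviation; the function names `lcs_length` and `is_plausible_pair` are illustrative, and only the example pair comes from the abstract.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b,
    computed with the standard O(len(a)*len(b)) dynamic program."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def is_plausible_pair(abbrev: str, full: str) -> bool:
    """A full form is a plausible expansion when every character of the
    abbreviation occurs in it in order, i.e. the LCS of the two strings
    is the abbreviation itself."""
    return lcs_length(abbrev, full) == len(abbrev)


# The example from the abstract: '台鐵局' abbreviates '台灣鐵路局'.
print(is_plausible_pair('台鐵局', '台灣鐵路局'))  # True
print(is_plausible_pair('台鐵局', '高速公路局'))  # False: '台' and '鐵' are absent
```

A check like this only narrows the candidate set; choosing the single best candidate is the job of the maximum entropy model described next.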

In this thesis, we design several experiments over a corpus of 8,500 electronic news documents. The work has two major parts: selecting the best candidate terms, and extracting correct abbreviation/original-form pairs. Each experiment runs as a dual process, mapping original forms to abbreviations and abbreviations to original forms. For best-candidate selection, we extract contextual information to train a maximum entropy model; the resulting selection precision averages 80%-90%. For the mapping experiments, precision reaches up to 70% in the abbreviation-to-original-form direction and up to 80% in the original-form-to-abbreviation direction.
Abbreviation is common in Chinese text: for instance, '台灣鐵路局' is often shortened to '台鐵局'. The transformation is time-saving and convenient, but this merit also brings challenges for Chinese text processing. In a keyword-based information retrieval system, using the abbreviated form and the original form as query terms usually returns different results, even though the two forms share the same meaning. Abbreviation also has a marked influence on Chinese word segmentation, automatic document clustering, and term weighting.

To resolve this ambiguity, we propose an approach that connects the two forms and constructs an abbreviation list automatically from a corpus, without any fixed dictionary.

In this study, we conduct three major experiments on 8,500 documents collected from a news website. Each experiment is a dual process, running from original form to abbreviated form and back. In the first experiment, we employ a maximum entropy model that draws on many contextual features to locate the best candidate. In the second experiment, we recover original forms from their abbreviations; the third finds abbreviations from their original forms. The precision ratios reach 80%-90%, 70%, and 80%, respectively.
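The abstract's candidate-selection step trains a maximum entropy model on contextual features; the actual feature templates are given in Section 3.5.2 of the thesis. As a rough sketch of the idea only, the snippet below uses scikit-learn's logistic regression (equivalent to a conditional maximum entropy classifier) in place of the thesis's trainer, and the feature names, context words, and training pairs are all made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def context_features(abbrev, candidate, left_word, right_word):
    """Hypothetical contextual features for one (abbreviation, candidate)
    pair; the thesis's real feature templates are in Section 3.5.2."""
    return {
        'len_ratio': len(abbrev) / len(candidate),
        'first_char_match': int(abbrev[0] == candidate[0]),
        'left=' + left_word: 1,   # word to the left of the occurrence
        'right=' + right_word: 1, # word to the right of the occurrence
    }

# Toy training set: label 1 marks the correct original form.
X = [context_features('台鐵局', '台灣鐵路局', '搭乘', '列車'),
     context_features('台鐵局', '台北監理局', '前往', '辦理')]
y = [1, 0]

vec = DictVectorizer()
model = LogisticRegression(max_iter=1000)  # multinomial logit == conditional maxent
model.fit(vec.fit_transform(X), y)

# Score a new candidate in a new context; the most probable class wins.
test = context_features('台鐵局', '台灣鐵路局', '昨日', '宣布')
print(model.predict_proba(vec.transform([test])))
```

Under this scheme, every surviving candidate is scored with a probability conditioned on its context, and the highest-probability candidate is selected, mirroring the step the abstract reports at 80%-90% precision.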
Table of Contents
Table Index VII
Figure Index IX
Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Objectives 2
1.3 Research Limitation 2
1.4 Research Contribution 2
1.5 Research Framework 3
Chapter 2 Literature Review 4
2.1 Chinese word segmentation 4
2.2 Sequence Similarity 4
2.2.1 Longest Common Subsequence, LCS 4
2.2.2 Longest Common Consecutive Subsequence, LCCS 7
2.3 Expansion of Abbreviations 7
2.4 Principle of Maximum Entropy 8
2.4.1 Overview 8
2.4.2 Maximum Entropy for NLP 9
Chapter 3 Experiments 13
3.1 Overview 13
3.2 Part-of-Speech Tag and Chinese Word Segmentation 14
3.3 Extract Possible Abbreviations and Original Forms 14
3.4 Find the Candidates of Abbreviations and Original Forms 15
3.5 Choosing the Best Candidates of Abbreviations and of Original Forms 16
3.5.1 Procedure 17
3.5.2 Feature Template 18
3.5.3 Extract Contextual Information 20
Chapter 4 Experiments Design and Results 22
4.1 Evaluation Criteria 22
4.2 Resource 23
4.3 Results and Analysis 23
4.3.1 Choosing the Best Candidate 23
4.3.1.1 Choosing the Best Candidate of Original Forms 24
4.3.1.2 Choosing the Best Candidate of Abbreviation 28
4.3.1.3 Summary 31
4.3.2 Finding Corresponding Original forms of Abbreviations 32
4.3.2.1 Summary 36
4.3.3 Finding Corresponding Abbreviations of Original forms 36
4.3.3.1 Summary 38
Chapter 5 Conclusion 40
5.1 Research Contributions 40
5.2 Future Works 41
Reference 42
Appendix 44
1. Terada, A., & Tokunaga, T. (2001). Automatic disabbreviation by using context information. In Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium Workshop on Automatic Paraphrasing: Theories and Applications, 21-28.
2. Berger, A., Della Pietra, S., & Della Pietra, V. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39-71.
3. Borthwick, A., Sterling, J., Agichtein, E., & Grishman, R. (1998). Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora.
4. Brown, P., Della Pietra, S., Della Pietra, V., & Mercer, R. (1991). A statistical approach to sense disambiguation in machine translation. In Proceedings of the DARPA Workshop on Speech and Natural Language.
5. Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), 1470-1480.
6. Elmi, M. A., & Evens, M. (1998). Spelling correction using context.
7. Greiff, W. R., & Ponte, J. M. (2000). The maximum entropy approach and probabilistic IR models. ACM Transactions on Information Systems, 18, 246-287.
8. Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620-630.
9. Kehler, A. (1997). Probabilistic coreference in information extraction. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing.
10. Larkey, L., Ogilvie, P., Price, A., & Tamilio, B. (2000). Acrophile: An automated acronym extractor and server. In Proceedings of the ACM Digital Libraries Conference, 205-214.
11. Lin, H., & Yuan, C. F. (2002). Chinese part-of-speech tagging based on maximum entropy method. In Proceedings of the First International Conference on Machine Learning and Cybernetics, Beijing.
12. Park, Y., & Byrd, R. J. (2001). Hybrid text mining for finding abbreviations and their definitions. In Proceedings of EMNLP 2001.
13. Pavlov, D. (2003). Sequence modeling with mixtures of conditional maximum entropy distributions. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM '03).
14. Pavlov, D., Popescul, A., Pennock, D., & Ungar, L. (2003). Mixtures of conditional maximum entropy models. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).
15. Della Pietra, V., Della Pietra, S., & Lafferty, J. (1995). Inducing features of random fields. Technical Report CMU-CS-95-144, School of Computer Science, Carnegie Mellon University.
16. Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
17. Ratnaparkhi, A. (1997). A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania.
18. Ratnaparkhi, A., Reynar, J., & Roukos, S. (1994). A maximum entropy model for prepositional phrase attachment. In Proceedings of the Human Language Technology Workshop (ARPA, 1994), 250-255.
19. Reynar, J. C., & Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, 16-19.
20. Taghva, K., & Gilbreth, J. (1999). Recognizing acronyms and their definitions. International Journal on Document Analysis and Recognition (IJDAR), 191-198.
21. Terada, A., Tokunaga, T., & Tanaka, H. (2004). Automatic expansion of abbreviations by using context and character information. Information Processing and Management, 31-45.
22. Toole, J. (2000). A hybrid approach to the identification and expansion of abbreviations. In Proceedings of RIAO 2000, 1, 725-736.
23. 賴育佐 (2003). A probabilistic model of Chinese abbreviations (中文縮寫詞之機率統計模式). Master's thesis, Department of Computer Science and Information Engineering, National Chi Nan University.