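Graduate student: Bai, Ming-Hong (白明弘)
Thesis title: Extraction of Bilingual Multiword Expressions with Application to Bilingual Concordancer
Advisors: Chang, Jason S. (張俊盛); Chen, Keh-Jiann (陳克健)
Degree: Doctoral
Institution: National Tsing Hua University
Department: Department of Computer Science
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2013
Academic year of graduation: 101 (ROC calendar)
Language: English
Number of pages: 97
Keywords: Machine Translation (機器翻譯); Computer-Assisted Translation (電腦輔助翻譯); Word Alignment (詞語對齊); Multiword Expression (多詞表達)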
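A bilingual concordancer is a computer-assisted translation tool built on a parallel corpus. When the user enters a word or phrase, the bilingual concordancer extracts the sentences containing that word or phrase from the parallel corpus. It then marks where the translation equivalents occur in the translated sentences and reorders the sentences by translation relevance. This output lets users not only learn the equivalent translations but also study, from the sentences, how the translations of the word or phrase are used. A bilingual concordancer is therefore a very practical tool for dictionary editors, professional translators, and second language learners.
Extraction of translation equivalents for multi-word expressions is the most important technique in a bilingual concordancer. For example, highlighting translation equivalents and generating a translation equivalents list both rely on high-quality extraction of translation equivalents. So far, however, the techniques for extracting translation equivalents still leave considerable room for improvement.
In this thesis, we examine several problems in existing methods for extracting translation equivalents of multi-word expressions, including the over-alignment problem and the under-alignment problem. We propose a new extraction model that addresses these problems in order to improve translation quality. Using the proposed model, we also built a working bilingual concordancer as a computer-assisted translation system. To test the quality of the system, we use three different types of multi-word expressions as test data for the bilingual concordancer, and we compare the results against existing statistical translation models.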

A bilingual concordancer is a computer-assisted translation tool that uses a parallel corpus as its knowledge base. Given a word or phrase, the bilingual concordancer retrieves from the parallel corpus the aligned sentence pairs whose source sentences contain that word or phrase. It then identifies the translation equivalents in the target sentences and reorders the sentence pairs according to the correlation between the query string and the translation equivalents. It thus helps users not only find translation equivalents of the query but also see the various contexts in which they occur. As a result, it is extremely useful for bilingual lexicographers, human translators, and second language learners.
Extraction of bilingual multi-word expressions is the most important part of a bilingual concordancer. For example, highlighting translation equivalents in the target sentence and generating a list of translation equivalents both depend heavily on a high-quality extraction model. However, existing models for extracting translation equivalents still have many problems and leave room for improvement.
In this thesis, we discuss some problems of the existing models for extracting bilingual multi-word expressions, including the over-alignment problem and the under-alignment problem. We then propose a novel model that addresses these problems to improve the quality of the extracted translation equivalents. Further, we implement a bilingual concordancer that employs the proposed translation extraction model. To measure the performance of the bilingual concordancer, we use three types of multi-word expressions as test targets. The results are compared with those of existing statistical machine translation models.
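
The retrieve, spot, and rank steps described in the abstract can be illustrated with a minimal Python sketch. This is not the model proposed in the thesis: it assumes a single-word query, a tiny invented corpus, and a plain Dice coefficient over sentence-level co-occurrence as a stand-in for the correlation measure. The names CORPUS, dice_scores, and concordance are hypothetical.

```python
from collections import Counter

# Toy pre-tokenized parallel corpus (invented data, for illustration only).
CORPUS = [
    ("the bank raised interest rates".split(), "銀行 提高 了 利率".split()),
    ("she opened a bank account".split(), "她 開 了 一 個 銀行 帳戶".split()),
    ("interest rates fell".split(), "利率 下降 了".split()),
    ("he sat by the river".split(), "他 坐 在 河邊".split()),
]

def dice_scores(corpus, query):
    """Dice coefficient between the query and each co-occurring target word.

    Assumption: a simple stand-in measure, counted over sentence pairs.
    """
    query_freq = 0
    target_freq = Counter()
    cooc = Counter()
    for src, tgt in corpus:
        has_query = query in src
        if has_query:
            query_freq += 1
        for word in set(tgt):
            target_freq[word] += 1
            if has_query:
                cooc[word] += 1
    return {w: 2 * c / (query_freq + target_freq[w]) for w, c in cooc.items()}

def concordance(corpus, query):
    """Retrieve pairs containing the query, spot one equivalent, and rank."""
    scores = dice_scores(corpus, query)
    hits = []
    for src, tgt in corpus:
        if query not in src:
            continue
        # Spot the target word most strongly associated with the query.
        best = max(tgt, key=lambda w: scores[w])
        hits.append((scores[best], src, tgt, best))
    # Rank sentence pairs by the score of the spotted equivalent.
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits

if __name__ == "__main__":
    for score, src, tgt, best in concordance(CORPUS, "bank"):
        marked = " ".join(f"[{w}]" if w == best else w for w in tgt)
        print(f"{score:.2f}  {' '.join(src)}  |  {marked}")
```

A word-level measure like this cannot pick out multi-word equivalents, and frequent function words such as 了 still score highly, which hints at why a dedicated extraction model matters for a concordancer.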

Contents
Abstract (in Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Bilingual Concordancer
1.2 Extraction of Bilingual Multiword Expressions
1.3 Thesis Goals
Chapter 2 Extraction of Translation Equivalents for Multiword Expressions
2.1 Problem Statement
2.2 Extracting Translation Equivalences
2.2.1 Selecting Candidate Words
2.2.2 Local Normalized Correlation
2.2.3 Normalized Correlation
2.2.4 Generation and Ranking of Candidate Translations
2.2.5 Generating Possible Translations
2.2.6 Filtering Common Subsequences
2.2.7 Selection of Candidate Translations
2.3 Experiments
2.3.1 Evaluation of Word Candidates
2.3.2 Evaluating Extracted Translations
2.4 Applying MWE Translations to MT
2.4.1 Experimental Settings
2.4.2 Selection of MWEs
2.4.3 Extra Information
2.4.4 Evaluation Results
2.5 Summary
Chapter 3 Bilingual Concordancer
3.1 The System
3.2 Extraction of Bilingual Multi-word Expressions
3.3 Ranking
3.4 Evaluation
3.4.1 Experimental Setting
3.4.2 Evaluation of Translation Spotting
3.4.3 Evaluation of Ranking
Chapter 4 Chinese Word Alignment
4.1 Problem Statement
4.2 Word Segmentation Adjustment
4.3 Affix Rule Method
4.3.1 Training Data
4.3.2 Word-to-Morpheme Alignment
4.3.3 Rule Extraction
4.4 Impurity Measure Method
4.4.1 Impurity Measure of Translation
4.4.2 Target Word Selection
4.4.3 Best Breaking Point
4.5 Experiment
4.6 Summary
Chapter 5 Translation of Unknown Words
5.1 Problem Statement
5.2 The TTR Model
5.2.1 Definition of TTR
5.2.2 Translation Process
5.2.3 Translation Probability and Lexical Weighting
5.2.4 Extraction of TTRs
5.2.5 Classifier and Rule Fitting Probability
5.2.6 Synchronous Morphological Rule
5.3 Experimental Setting
5.3.1 The Baseline SMT System and Data Sets
5.3.2 Training
5.4 Experimental Results
5.4.1 Impact of Unknown Word Identification
5.4.2 OOV Classification
5.4.3 TTR Selection
5.4.4 BLEU Score
5.5 Summary
Chapter 6 Conclusion
Bibliography
Appendix A – Chinese Idioms for Testing Bilingual Concordancer
Appendix B – Lists of Template Rules
Publications

