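Graduate student: 林哲民
Graduate student (English name): Zhe-Min Lin
Thesis title (Chinese): 微型語料庫的自動處理:賽夏語詞性標記、部份剖析及其應用
Thesis title (English): Automatic Processing of Languages with Small-Scaled Corpus: Part-of-Speech Tagging and Partial Parsing SaiSiyat and Applications
Advisor: 宋麗梅
Advisor (English name): Li-May Sung
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Linguistics
Discipline: Humanities
Field: Linguistics
Thesis type: Academic thesis
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar)
Language: English
Number of pages: 104
Keywords (translated from Chinese): Formosan Austronesian languages; transformation-based error-driven learning; tag set; online corpus; fieldwork text processing; Wittgenstein
Keywords (English): Formosan Austronesian languages; tag set; transformation-based error-driven learning; online corpus design; fieldwork process; Wittgenstein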
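This thesis studies part-of-speech tagging and partial parsing techniques for small corpora of fewer than twenty thousand words, and proposes three applications.
The NTU corpus of Austronesian languages is an intonation-unit (IU) based corpus, of which the SaiSiyat portion contains about twelve thousand words. Chapter 1 introduces the current difficulties in processing Austronesian corpora: in particular, the corpora are too small for statistical natural language processing, so other approaches must be sought. Chapter 2 introduces a newly designed tag set that reflects the linguistic characteristics of SaiSiyat and applies it to part-of-speech tagging. The lexical (gloss-based) method extracts grammatical information from fieldwork records and reaches an accuracy of about 75%; the transformation-based error-driven learning (TBL) algorithm then raises the accuracy to 85%. The chapter pays special attention to the difficulty of distinguishing the SaiSiyat nominative and accusative case marker (ka).
Chapter 3 introduces binary partial parsing of SaiSiyat. Partial parsing prepares the ground for noun-phrase extraction and for other applications. We tried a shortest-path method based on Kullback-Leibler divergence and the TBL method. The accuracy of the former drops quickly as clause length grows, and it requires a large amount of computation time; the latter reaches about 70% accuracy, which meets the requirement we set.
Chapter 4 connects the tagged corpus with linguistic research, native speakers, and the general public. Machine-aided annotation lets linguists process collected data faster and more accurately. Considering the different needs of the public and of linguists, we designed an integrated platform for online multimedia corpora and adjusted its detailed design around three features: standardisation, accessibility, and interoperability.
Finally, the thesis discusses the philosophical significance of natural language processing from the perspective of the early and late philosophy of Wittgenstein. We emphasise the connection between how a word is used in a language and what it means, and hold that a computer cannot break through the boundary of the micro-cosmos formed by the texts in a corpus.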
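The two-stage tagging idea described above (a lexicon-based first pass, then transformation-based error-driven learning) can be illustrated with a minimal sketch. This is not the thesis implementation: the tokens (`mk`, `n1`, `v1`), the tags, the toy gold standard, and the single rule template ("change tag A to B when the previous tag is C") are all assumptions made for illustration, and they say nothing about actual SaiSiyat grammar.

```python
def initial_tags(words, lexicon, default="N"):
    """First pass: look each word up in a (hypothetical) lexicon."""
    return [lexicon.get(w, default) for w in words]

def apply_rule(tags, rule):
    """Apply 'change from_tag to to_tag when the previous tag is prev_tag'."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are tested on the tags as they were before this rule ran.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def errors(tagged, gold):
    """Number of token positions whose tag differs from the gold standard."""
    return sum(t != g for ts, gs in zip(tagged, gold) for t, g in zip(ts, gs))

def learn_rules(current, gold, max_rules=10):
    """Greedily add the rule that removes the most errors until none helps."""
    rules = []
    for _ in range(max_rules):
        # Propose one candidate rule for every position that is still wrong.
        candidates = set()
        for ts, gs in zip(current, gold):
            for i in range(1, len(ts)):
                if ts[i] != gs[i]:
                    candidates.add((ts[i], gs[i], ts[i - 1]))
        base = errors(current, gold)
        best, best_gain = None, 0
        for rule in sorted(candidates):
            gain = base - errors([apply_rule(ts, rule) for ts in current], gold)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:
            break
        rules.append(best)
        current = [apply_rule(ts, best) for ts in current]
    return rules, current

# Toy data: "mk" stands for an ambiguous marker that the lexicon always tags NOM.
lexicon = {"mk": "NOM", "n1": "N", "n2": "N", "v1": "V"}
sentences = [["mk", "n1", "v1", "mk", "n2"], ["v1", "mk", "n1"], ["mk", "n2", "v1"]]
gold = [["NOM", "N", "V", "ACC", "N"], ["V", "ACC", "N"], ["NOM", "N", "V"]]

start = [initial_tags(s, lexicon) for s in sentences]
rules, final = learn_rules(start, gold)
print("learned rules:", rules)
print("errors before/after:", errors(start, gold), errors(final, gold))
```

The learner only needs a small hand-corrected sample to rank candidate rules, which is one reason this family of methods is attractive when a corpus is too small for purely statistical tagging.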
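For readers unfamiliar with the measure named above, the following sketch computes the standard Kullback-Leibler divergence, D(P || Q) = sum over x of P(x) log(P(x) / Q(x)), between the right-context distributions of two tags. The abstract does not say how the thesis combines the divergence with a shortest-path search, so that step is not attempted here; the tag sequences, the add-alpha smoothing, and all function names are illustrative assumptions.

```python
import math
from collections import Counter

def distribution(counts, vocab, alpha=1.0):
    """Turn raw counts into add-alpha smoothed probabilities over vocab."""
    total = sum(counts.get(v, 0) for v in vocab) + alpha * len(vocab)
    return {v: (counts.get(v, 0) + alpha) / total for v in vocab}

def kl_divergence(p, q):
    """D(P || Q) in nats; q must be non-zero wherever p is non-zero."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def next_tag_counts(tag_sequences, tag):
    """Count which tags immediately follow `tag` inside each sequence."""
    counts = Counter()
    for seq in tag_sequences:
        for a, b in zip(seq, seq[1:]):
            if a == tag:
                counts[b] += 1
    return counts

# Toy tag sequences, one per unit (invented for illustration).
tag_sequences = [["N", "V", "N"], ["M", "N", "V"], ["N", "V", "M", "N"]]
vocab = sorted({t for seq in tag_sequences for t in seq})

p = distribution(next_tag_counts(tag_sequences, "N"), vocab)
q = distribution(next_tag_counts(tag_sequences, "V"), vocab)
print("D(N || V) =", kl_divergence(p, q))
print("D(V || N) =", kl_divergence(q, p))
```

Smoothing keeps every probability positive, so the logarithm is always defined. Note that the divergence is asymmetric, which is why both directions are printed.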
This thesis demonstrates an effective method for tagging and parsing a corpus of no more than twenty thousand words, along with three applications that take advantage of the processed corpus. The NTU corpus of Austronesian languages, an intonation-unit (IU) based corpus, is chosen for processing. Chapter 1 introduces current problems in the automatic processing of Austronesian languages. Because small-scale corpora limit the use of statistical natural language processing, an alternative method is needed for Austronesian corpora. Chapter 2 defines a new tag set that reflects the linguistic particularities of SaiSiyat, the object language of this thesis. Two methods of assigning part-of-speech tags, the gloss-based approach (accuracy 75%) and transformation-based error-driven learning (TBL, accuracy 85%), are evaluated and reported to be robust. The difficulty of distinguishing the SaiSiyat nominative and accusative case markers receives special discussion. A partial parser is useful in preparing a corpus for noun-phrase extraction and further analyses. In Chapter 3, the tagged corpus is parsed into binary trees by two methods: a statistical approach based on Kullback-Leibler divergence, and TBL. The accuracy of the former declines quickly as IU length increases, and it needs a large amount of computation time, while the accuracy of the latter is a little below 70%. Chapter 4 shows how an annotated corpus relates to linguistic research, to native speakers of the object language, and to the public. Machine-aided annotation helps linguists process collected data quickly. An integrated platform for multimedia online corpora is also designed in this chapter to serve both linguists and the public. The last chapter discusses natural language processing from the points of view of the early and the late Wittgenstein. We agree with the idea that a word has as many meanings as it has actual uses. The computer therefore cannot go beyond the boundary of the micro-cosmos composed of the texts given in a corpus.
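The gloss-based approach mentioned in the abstract derives tags from fieldwork glosses. The abstract does not describe the glossing conventions or the mapping used, so the sketch below is only one plausible reading: the gloss-to-tag heuristics, the tag names, and the tokens `w1` to `w4` are all hypothetical.

```python
from collections import Counter, defaultdict

def tag_from_gloss(gloss):
    """Hypothetical heuristic mapping a gloss string to a coarse tag."""
    if gloss.isupper():            # assumed: all-caps glosses mark grammatical items
        return "FUNC"
    if gloss.startswith("to_"):    # assumed: verbal glosses are written 'to_...'
        return "V"
    return "N"

def build_lexicon(aligned_pairs):
    """Give each word the tag most often implied by its glosses."""
    votes = defaultdict(Counter)
    for word, gloss in aligned_pairs:
        votes[word][tag_from_gloss(gloss)] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}

def tag(words, lexicon, unknown="UNK"):
    """Tag running text by lexicon lookup; unseen words get `unknown`."""
    return [(w, lexicon.get(w, unknown)) for w in words]

# Toy word-gloss pairs standing in for fieldwork records (invented data).
aligned_pairs = [
    ("w1", "NOM"), ("w2", "person"), ("w3", "to_eat"),
    ("w1", "NOM"), ("w4", "to_walk"), ("w4", "road"), ("w4", "to_walk"),
]

gloss_lexicon = build_lexicon(aligned_pairs)
print(gloss_lexicon)
print(tag(["w1", "w4", "w9"], gloss_lexicon))
```

A lookup of this kind cannot resolve words whose tag depends on context (here `w4` is always tagged `V`), which is the sort of residual error a later rule-learning pass such as TBL is meant to reduce.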
1 Introduction
1.1 SaiSiyat Language
1.2 NTU SaiSiyat corpus
1.3 Corpus size and the scopes of tagging and parsing
1.3.1 The scope of POS-tagging
1.3.2 The scope of partial parsing
1.4 Integrated applications
2 Syntactic Word-class Tagging of the SaiSiyat Corpus
2.1 Tag set
2.2 Methodology
2.2.1 Mutual divergence and word-aligned corpus
2.2.2 Method 1: a gloss-based strategy
2.2.3 Method 2: transformation-based error-driven learning
2.3 Evaluation of Gloss-based Method and TBL algorithm
2.3.1 Scoring
2.3.2 Results and discussion
2.4 Slight modifications to improve accuracy
2.4.1 Manual correction of lexicon
2.4.2 Preservation of intonation unit
2.5 Summary
3 Partial Parsing SaiSiyat
3.1 Automatic grammar induction
3.1.1 Methods
3.1.2 Results and discussion
3.2 Transformation-based error-driven parsing
3.2.1 Methods
3.2.2 Results and discussion
3.3 Summary
4 Applications
4.1 Bigram/trigram retrieval
4.2 Machine-aided glossary annotation
4.2.1 Brute-force method
4.2.2 Problems in the lexicon
4.2.3 Interim summary
4.3 Design of a publicly accessible corpus
4.3.1 Standardisation of text commitment and standards of committed texts
4.3.2 Database design
4.3.3 Back-end programmes and the POS-tagger
4.3.4 Unified output interface
4.3.5 Interoperability
4.3.6 Summary
4.4 Summary
5 Conclusion
5.1 Boundary of Natural Language Processing
5.2 Pure substitution of symbols
5.3 Rules and the understanding of meanings
5.4 A recognition of the world
5.5 Conclusion
A Coding List
B Database Schema