National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: Hen-Hsen Huang (黃瀚萱)
Title: Classical Chinese Sentence Division by Sequence Labeling Approaches (以序列標記方法解決古漢語斷句問題)
Advisor: Chuen-Tsai Sun (孫春在)
Degree: Master's
Institution: National Chiao Tung University
Department: Institute of Computer Science and Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis Type: Academic thesis
Year of Publication: 2008
Graduation Academic Year: 96 (2007–2008)
Language: Chinese
Pages: 74
Keywords (Chinese): 古漢語斷句; 自然語言處理; 文本分割; 序列標記; 條件隨機域
Keywords (English): Classical Chinese sentence division; natural language processing (NLP); text segmentation; sequence labeling; conditional random fields (CRFs)
Usage statistics:
  • Cited by: 2
  • Views: 591
  • Downloads: 85
  • Bookmarked: 2
Sentence division is an issue peculiar to Classical Chinese processing. Before the 20th century, the Chinese writing system made no use of punctuation. When reading classical texts, a reader must first identify where the text should pause or break before its meaning can be understood. Because there are no explicit rules or procedures for sentence division, the task rests entirely on the reader's linguistic intuition and experience: different readers often divide the same passage differently, and different divisions yield different interpretations. Sentence division is therefore an important and difficult first step in processing classical texts.
In the past there was no satisfactory automatic method, and sentence division was mostly carried out by hand by experts in literature and history. Although punctuated, divided editions of the common classics and histories now exist, countless ancient documents, including those still being excavated, await sentence division.
In this study, I apply two sequence labeling models, hidden Markov models (HMMs) and conditional random fields (CRFs), to build a Classical Chinese sentence division system, and obtain good results in the experiments. The experiments also show that, given training data of sufficient quality and quantity, the models generalize across texts, authors, and genres. For example, a model trained on the Records of the Grand Historian (《史記》) performs well on other Old Chinese texts. These results demonstrate the feasibility of automatic Classical Chinese sentence division, which can assist human experts in digital archiving, text mining, and information extraction, speeding up the processing of large volumes of historical documents.
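The character-labeling formulation the abstract describes can be illustrated with a small sketch. This is not the thesis's code: the punctuation set and the two label names ("B" for a character followed by a break, "N" otherwise) are illustrative assumptions.

```python
# Convert punctuated Classical Chinese into (character, label) training pairs.
# "B" marks a character after which a sentence/clause break occurs, "N" otherwise.
# The punctuation set and label names are illustrative assumptions.
PUNCT = set("，。、；：？！")

def to_labeled_sequence(text):
    chars, labels = [], []
    for ch in text:
        if ch in PUNCT:
            if labels:              # mark the preceding character as a break point
                labels[-1] = "B"
        else:
            chars.append(ch)
            labels.append("N")
    return list(zip(chars, labels))

pairs = to_labeled_sequence("學而時習之，不亦說乎。")
# breaks fall after 之 and 乎, so those characters receive label "B"
```

Training a sequence labeler then amounts to predicting these labels from character context; predicted "B" labels are converted back into break points in unpunctuated text.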
Sentence segmentation is a special issue in Classical Chinese language processing. To facilitate reading and processing of raw Classical Chinese data, I propose a statistical method that splits unstructured Classical Chinese text into smaller pieces such as sentences and clauses. To build this segmenter, I transform the sentence segmentation task into a character labeling task and apply two sequence labeling models, hidden Markov models (HMMs) and conditional random fields (CRFs), to perform the labeling. My methods are evaluated on nine datasets spanning several eras (from the 5th century BCE to the 19th century). The CRF segmenter achieves acceptable performance and can be applied to a variety of data from different eras.
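As a rough illustration of the HMM side of the approach, the following is a minimal log-space Viterbi decoder over two break/no-break states. The probabilities are hand-set toy assumptions for demonstration, not the thesis's trained parameters.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state sequence for obs under an HMM (log-space Viterbi)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-9))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s].get(obs[t], 1e-9)))
            back[t][s] = prev
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))

# Toy parameters: "N" = no break after the character, "B" = break after it.
# 之 is made likely to end a clause purely for illustration.
states = ["N", "B"]
start_p = {"N": 0.9, "B": 0.1}
trans_p = {"N": {"N": 0.7, "B": 0.3}, "B": {"N": 0.9, "B": 0.1}}
emit_p = {"N": {"學": 0.4, "而": 0.4, "之": 0.2},
          "B": {"學": 0.05, "而": 0.05, "之": 0.9}}

path = viterbi(["學", "而", "之"], states, start_p, trans_p, emit_p)
# path == ["N", "N", "B"]: a break is predicted after 之
```

In a real system the transition and emission probabilities would be estimated from a punctuated training corpus; the CRF variant replaces these generative probabilities with feature-based potentials over character windows.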
Table of Contents:
Abstract (in Chinese)
Abstract (in English)
Acknowledgments
Table of Contents
List of Tables
List of Figures
1. Introduction
1.1 Motivation
1.2 Problem Description
1.3 Research Goals
2. Related Work
2.1 Chinese Word Segmentation
2.2 Sentence Boundary Detection
2.3 Part-of-Speech Tagging
2.4 Markov Model Taggers
2.5 Conditional Random Fields
2.5.1 Overview
2.5.2 Model Definition
2.5.3 Parameter Estimation
2.5.4 Averaged Perceptron Training
2.6 Linguistic Characteristics of Classical Chinese
3. System Design
3.1 Evaluation Metrics
3.2 Datasets
3.2.1 Corpus Selection
3.2.2 Data Collection and Processing
3.3 Classical Chinese Sentence Division Models
3.3.1 Sequence Labeling Approach
3.3.2 Basic System Architecture
3.3.3 Markov Model Tagger
3.3.4 Conditional Random Fields
4. Experiments
4.1 Experimental Design
4.2 Experiment 1: Performance of Sentence Division Models
4.2.1 Method
4.2.2 Results and Analysis
4.3 Experiment 2: Comparison of Training Data
4.3.1 Method
4.3.2 Results and Analysis
4.4 Experiment 3: Cross-Era Applicability of Training Data
4.4.1 Method
4.4.2 Results and Analysis
4.5 Discussion of Evaluation Metrics
5. Conclusion
References
[1] 楊樹達 (Yang Shuda), 《古書句讀釋例》 [Examples of Sentence Division in Ancient Books]. Shanghai: Shanghai Classics Publishing House, 2007.
[2] 李鐸 (Li Duo) and 王毅 (Wang Yi), "關於古代文獻信息化工程與古典文學研究之間互動關系的對話" [A dialogue on the interaction between the digitization of ancient documents and research on classical literature], 《文學遺產》 [Literary Heritage], no. 1, 2005, pp. 126-160.
[3] 林爾正 (Lin Erzheng) and 林丹紅 (Lin Danhong), "計算機應用於古籍整理研究概況" [A survey of computer applications in the collation of ancient books], 《情報探索》 [Information Exploration], no. 6, 2007, pp. 28-29.
[4] J. Gao, M. Li, and C. Huang, "Improved Source-Channel Models for Chinese Word Segmentation," in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Japan, 2003.
[5] H. Zhang, Q. Liu, X. Cheng, H. Zhang, and H. Yu, "Chinese Lexical Analysis Using Hierarchical Hidden Markov Model," in Proceedings of the Second SIGHAN Workshop, Japan, 2003, pp. 63-70.
[6] N. Xue, "Chinese Word Segmentation as Character Tagging," International Journal of Computational Linguistics and Chinese Language Processing, vol. 8, no. 1, pp. 29-48, 2003.
[7] F. Peng, F. Feng, and A. McCallum, "Chinese Segmentation and New Word Detection using Conditional Random Fields," in Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), 2004, pp. 562-568.
[8] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 282-289.
[9] R. Mitkov, The Oxford Handbook of Computational Linguistics. New York: Oxford University Press, 2003.
[10] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth International Group, 1984.
[11] S. M. Humphrey, "Research on Interactive Knowledge-Based Indexing: The Medindex Prototype," in Symposium on Computer Applications in Medical Care, 1989, pp. 527-533.
[12] D. D. Palmer and M. A. Hearst, "Adaptive Sentence Boundary Disambiguation," in Proceedings of the 1994 Conference on Applied Natural Language Processing (ANLP), Stuttgart, Germany, 1994, pp. 78-83.
[13] M. D. Riley, "Some Applications of Tree-Based Modeling to Speech and Language Indexing," in Proceedings of the DARPA Speech and Natural Language Workshop, Morgan Kaufmann, 1989, pp. 339-352.
[14] J. C. Reynar and A. Ratnaparkhi, "A Maximum Entropy Approach to Identifying Sentence Boundaries," in Proceedings of the 5th Conference on Applications of Natural Language Processing, 1997, pp. 16-19.
[15] S. Cuendet, D. Hakkani-Tür, and E. Shriberg, "Automatic Labeling Inconsistencies Detection and Correction for Sentence Unit Segmentation in Conversational Speech," in Proceedings of MLMI 2007, Brno, Czech Republic, 2007.
[16] L. Huang, Y. Peng, H. Wang, and Z. Wu, "Statistical Part-of-Speech Tagging for Classical Chinese," in Text, Speech, and Dialogue: 5th International Conference (TSD 2002), 2002, pp. 115-122.
[17] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Pearson Education, 2006.
[18] E. Alpaydin, Introduction to Machine Learning. Cambridge, MA: The MIT Press, 2004.
[19] A. Berger, S. Della Pietra, and V. Della Pietra, "A Maximum Entropy Approach to Natural Language Processing," Computational Linguistics, vol. 22, no. 1, pp. 39-71, 1996.
[20] S. Abney, R. E. Schapire, and Y. Singer, "Boosting Applied to Tagging and PP Attachment," in Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, 1999, pp. 38-45.
[21] Y. Altun and T. Hofmann, "Large Margin Methods for Label Sequence Learning," in Proceedings of the 8th European Conference on Speech Communication and Technology (EuroSpeech), 2003.
[22] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Application in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[23] A. J. Viterbi, "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260-267, 1967.
[24] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press, 1999.
[25] A. McCallum, D. Freitag, and F. Pereira, "Maximum Entropy Markov Models for Information Extraction and Segmentation," in Proceedings of International Conference on Machine Learning 2000, Stanford, California, 2000, pp. 591-598.
[26] H. M. Wallach, "Conditional Random Fields: An Introduction," University of Pennsylvania CIS Technical Report 2004.
[27] R. Feldman and J. Sanger, The Text Mining Handbook. New York: Cambridge University Press, 2007.
[28] F. Sha and F. Pereira, "Shallow Parsing with Conditional Random Fields," in Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2003, pp. 134-141.
[29] Y. Liu, A. Stolcke, E. Shriberg, and M. Harper, "Using Conditional Random Fields for Sentence Boundary Detection in Speech," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005, pp. 451-458.
[30] S. Della Pietra, V. Della Pietra, and J. Lafferty, "Inducing Features of Random Fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4, pp. 380-393, 1997.
[31] A. McCallum, "MALLET: A Machine Learning for Language Toolkit," 2002.
[32] M. Collins, "Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2002, pp. 1-8.
[33] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, vol. 65, pp. 384-408, 1958.
[34] Y. Freund and R. E. Schapire, "Large Margin Classification using the Perceptron Algorithm," Machine Learning, vol. 37, no. 3, pp. 277-296, 1999.
[35] M. Collins and N. Duffy, "New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 263-270.
[36] A. McCallum and C. Sutton, "An Introduction to Conditional Random Fields for Relational Learning," in Introduction to Statistical Relational Learning. Cambridge, MA: The MIT Press, 2007, pp. 1-35.
[37] 楊樹達 (Yang Shuda), 《詞詮》 [An Explication of Function Words]. Shanghai: Shanghai Classics Publishing House, 2006.
[38] 朱自清 (Zhu Ziqing), 《經典常談》 [Informal Talks on the Classics]. Shanghai: Fudan University Press, 2004.
[39] S. W. Durrant, The Cloudy Mirror: Tension and Conflict in the Writing of Sima Qian. Albany: State University of New York Press, 1995.
[40] S. Chen, J. Hsiang, H. Tu, and M. Wu, "On Building a Full-Text Digital Library of Historical Documents," in Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, 2007, pp. 49-60.
[41] T. A. Cohn, "Scaling Conditional Random Fields for Natural Language Processing," Ph.D. dissertation, Department of Computer Science and Software Engineering, University of Melbourne, 2007.