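Author: YU-TING LAI (賴郁婷)
Title: Unsupervised Event Type Identification of Historical Texts: A Case Study of Wei-so Events in the Ming Shilu
Title (Chinese): 非監督式歷史文本事件類型識別──以《明實錄》中之衛所事件為例
Advisor: Richard Tzong-Han Tsai (蔡宗翰)
Degree: Master's
Institution: National Central University
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and computer engineering
Thesis type: Academic thesis
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: English
Number of pages: 54
Keywords (Chinese): 事件類型辨識; 文本聚類; Paragraph Vector; 明實錄; 衛所; 自然語言處理; 古漢語
Keywords (English): Event type identification; Text clustering; Paragraph Vector; Ming Shilu; Wei-suo; Natural Language Processing; Classical Chinese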
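Abstract (translated from the Chinese):
  Research on natural language technology for classical Chinese is constrained by the scarcity of classical Chinese resources, and existing work remains at the early stages of sentence punctuation, word segmentation, and named entity extraction. Identifying specific topics or events in text, however, has always been an important goal of information extraction, and applying event extraction techniques to historical texts should also be of great help to humanities scholars.
  Existing event extraction techniques all require event templates to be defined in advance, and the existing templates do not fit historical documents. Defining event templates and annotating training data both demand substantial time and labor and depend on expert knowledge, which is especially difficult for historical texts. We therefore use text clustering as a preprocessing step for event extraction, aiming to identify the event types contained in the text so that event templates can be induced later. Text clustering groups similar passages together, so paragraphs of the same event type fall into the same cluster. The unsupervised event type identification method proposed in this thesis first vectorizes the text with the Paragraph Vector model, takes the clustering results as event types, and then trains an event type classifier.
  This study implements a preliminary automated event type identification for text and applies it to the Ming Shilu, taking the identification of wei-so-related events as an example. It also develops a web system that helps researchers trace the course of events more quickly. We hope on the one hand to offer humanities scholars a new research method, and on the other to propose a new research direction for text mining of classical Chinese, laying a foundation for future event extraction research.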

Abstract (English):
  Natural language processing (NLP) for classical Chinese is very challenging because of the lack of resources. Existing work has focused mainly on named entity recognition (NER), sentence segmentation, and word segmentation, and much remains to be done before a fine-grained event extraction system for classical Chinese can be built.
  Current event extraction methods require the target event types to be specified in advance, which is a high barrier for historical texts. The lack of word boundaries and POS tags is a further obstacle to applying these methods. We therefore develop a tool that classifies paragraphs into event categories, which will make it easier to develop new extraction tools. We first embed the texts with the Paragraph Vector model and apply unsupervised text clustering to group paragraphs by event type. We then use the resulting cluster labels as training data for an automatic text classifier.
  In this thesis, we propose an unsupervised event type identification approach based on paragraph embedding and apply it to the Ming Shilu, focusing on events involving “wei-so”. We also develop a web interface that gives users an overview of how an event unfolds. We believe such a tool can help historians systematically analyze the evolution of historical events. The system also offers a new research direction for mining historical texts and lays a foundation for future work on event extraction from historical texts.
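
The pipeline described in the abstract (embed paragraphs, cluster them without labels, then classify new paragraphs by cluster) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: unit-length character-bigram counts stand in for Paragraph Vector embeddings, a small cosine-based k-means with a deterministic farthest-first initialisation stands in for the clustering module, and a nearest-centroid rule stands in for the trained classifier. All function names and the English toy paragraphs are hypothetical and are not taken from the Ming Shilu.

```python
import math
from collections import Counter


def embed(text, n=2):
    """Stand-in embedding (assumption): unit-length character n-gram counts.

    The thesis uses Paragraph Vector; this only keeps the sketch dependency-free.
    """
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(g, 0.0) for g, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for v in vectors:
        for g, x in v.items():
            total[g] += x
    return {g: x / len(vectors) for g, x in total.items()}


def nearest(v, centroids):
    """Index of the centroid most similar to v."""
    return max(range(len(centroids)), key=lambda j: cosine(v, centroids[j]))


def cluster(vectors, k, max_iters=20):
    """Cosine k-means with a deterministic farthest-first initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector least similar to every seed chosen so far.
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iters):
        new_labels = [nearest(v, centroids) for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids


# Hypothetical toy paragraphs covering two made-up event types.
paragraphs = [
    "garrison troops moved north",
    "grain tax was reduced",
    "troops left the garrison",
    "the grain tax was late",
    "garrison troops were moved",
    "grain tax reduced again",
]
labels, centroids = cluster([embed(p) for p in paragraphs], k=2)


def classify(text):
    """Assign an unseen paragraph to the closest discovered event type."""
    return nearest(embed(text), centroids)


print(labels)
print(classify("troops returned to the garrison"), classify("grain tax was raised"))
```

In a fuller version, the embedding step would be a trained Paragraph Vector model and the nearest-centroid rule would be replaced by a classifier trained on the cluster labels. Character n-grams are used here only because classical Chinese lacks word boundaries, which makes character-level features a convenient placeholder.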

Abstract (Chinese)
Abstract
Acknowledgments
Contents
List of figures
List of tables
1 Introduction
2 Related Works
2.1 Classical Chinese processing
2.2 Sentence clustering
2.3 Sentence representation
3 Method
3.1 Formal problem definition
3.2 System flow
3.2.1 Module 1 – Time extraction module
3.2.2 Module 2 – Named Entity Recognizer
3.2.3 Module 3 – Wei-so entities linking
3.2.4 Module 4 – Paragraph embedding
3.2.5 Module 5 – Clustering
3.2.6 Module 6 – Classifier
4 Experiment
4.1 Dataset
4.2 Experimental protocols
4.2.1 Baseline
4.2.2 The proposed method
4.3 Evaluation methodology
4.4 Experimental results
4.4.1 Considering different training texts of paragraph vectors
4.4.2 Comparison between different parameters
4.4.3 Comparison between different dimensions
4.4.4 Comparative with baseline
5 Discussion
5.1 Result
5.2 Error analysis
6 Humanity interpretation
6.1 System introduction
6.2 Analysis and compare result
7 Conclusion
8 Future work
References


[1] Chinea-Rios, Mara, Germán Sanchis-Trilles, and Francisco Casacuberta. "Sentence clustering using continuous vector space representation." Iberian Conference on Pattern Recognition and Image Analysis. Springer International Publishing, 2015.
[2] Le, Quoc V., and Tomas Mikolov. "Distributed Representations of Sentences and Documents." ICML. Vol. 14. 2014.
[3] Chang, Yung-Chun, et al. "Linguistic Template Extraction for Recognizing Reader-Emotion and Emotional Resonance Writing Assistance." ACL-IJCNLP (2015): 775-780.
[4] Wang, Li. Hanyu Shigao. Vol. 2. Science Press, 1958.
[5] Huang, Hen-Hsen, Chuen-Tsai Sun, and Hsin-Hsi Chen. "Classical chinese sentence segmentation." Proceedings of CIPS-SIGHAN Joint Conference on Chinese Language Processing. 2010.
[6] Shi, Min, X. H. Chen, and B. Li. "CRF Based Research on a Unified Approach to Word Segmentation and POS Tagging for Pre-Qin Chinese." Journal of Chinese Information Processing 2.24 (2010): 39-45.
[7] Liu, Shih-Gang. "Automated Annotation of Person Name of the Veritable Records of the Qing Dynasty." Master Thesis, Department of Computer Science and Information Engineering, National Taiwan University (2012): 1-50.
[8] Kao, Shin-Kai. "Automated Annotation of Geo-information of Historical Documents: A Case Study with the Veritable Records of the Qing Dynasty." Master Thesis, Department of Computer Science and Information Engineering, National Taiwan University (2013): 1-40.
[9] Pang, Wai-him, et al. "Automated Name-extraction in Chinese Classics: Applying PMI (Pointwise Mutual Information) Segmentation to Zizhi Tongjian." Digital Humanities and Craft: Technological Change (2014): 232.
[10] Tang, Yafen. "Research of Automatically Recognizing Name in Pre-Qin Ancient Chinese Classics." XIANDAI TUSHU QINGBAO JISHU 29.7/8 (2013): 63-68.
[11] Li, Qi, Heng Ji, and Liang Huang. "Joint Event Extraction via Structured Prediction with Global Features." ACL (1). 2013.
[12] Aliguliyev, Ramiz M. "A new sentence similarity measure and sentence based extractive technique for automatic text summarization." Expert Systems with Applications 36.4 (2009): 7764-7772.
[13] Wang, Dingding, et al. "Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization." Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2008.
[14] Sarkar, Kamal. "Sentence clustering-based summarization of multiple text documents." International Journal of Computing Science and Communication Technologies 2.1 (2009): 325-335.
[15] Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.
[16] Wei, Furu, et al. "Query-sensitive mutual reinforcement chain and its application in query-oriented multi-document summarization." Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2008.
[17] Kumaran, Giridhar, and James Allan. "Text classification and named entities for new event detection." Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2004.
[18] Hammouda, Khaled M., and Mohamed S. Kamel. "Efficient phrase-based document indexing for web document clustering." IEEE Transactions on knowledge and data engineering 16.10 (2004): 1279-1296.
[19] Zhao, Lin, Xuanjing Huang, and Lide Wu. "Fudan university at DUC 2005." Proceedings of DUC. Vol. 2005. 2005.
[20] Kotlerman, Lili, et al. "Sentence clustering via projection over term clusters." Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation. Association for Computational Linguistics, 2012.
[21] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[22] MacQueen, James. "Some methods for classification and analysis of multivariate observations." Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Vol. 1. No. 14. 1967.
[23] Qian, Gang, et al. "Similarity between Euclidean and cosine angle distance for nearest neighbor queries." Proceedings of the 2004 ACM symposium on Applied computing. ACM, 2004.
[24] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[25] Dai, Andrew M., Christopher Olah, and Quoc V. Le. "Document embedding with paragraph vectors." arXiv preprint arXiv:1507.07998 (2015).
[26] Andrés-Ferrer, Jesús, Germán Sanchis-Trilles, and Francisco Casacuberta. "Similarity word-sequence kernels for sentence clustering." Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer Berlin Heidelberg, 2010.
[27] Yue, Chih-chia. "The Evolution of the Military System in Chiang-hsi during the Ming Dynasty." Bulletin of the Institute of History and Philology (BIHP) Vol. 66-4 (1995.12).
[28] Wikipedia, Hundred Family Surnames, https://en.wikipedia.org/wiki/Hundred_Family_Surnames
