National Digital Library of Theses and Dissertations in Taiwan

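Author: 李祥賓 (Hsiang-Pin Lee)
Title: Text Summarization on News (original Chinese title: 新聞文件摘要之研究)
Advisors: 柯淑津 (Sue J. Ker); 陳培敏 (Pei-Min Chen)
Degree: Master's
Institution: 東吳大學 (Soochow University)
Department: 資訊科學系 (Department of Information Science)
Discipline: Engineering
Field: Electrical and Information Engineering
Thesis type: Academic thesis
Year of publication: 2001
Graduation academic year: 89 (ROC calendar, i.e. 2000-2001)
Language: Chinese
Pages: 55
Keywords: text summarization; text categorization; vocabulary weight; vocabulary position; WordNet; statistical processing; word sense disambiguation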
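Abstract (translated from the Chinese): Today, with information technology highly developed, users can easily obtain the information they need over the Internet. To let users browse documents quickly and judge whether a given document is what they are looking for, the important content must be extracted from the document to form a summary for the user's reference. Traditional summarization is mostly done by hand, which consumes considerable labor and cannot meet timeliness requirements. Automatic text summarization is therefore indispensable. This thesis summarizes news documents, aiming for summaries that convey the important information in each document.
This thesis applies three summarization techniques to Reuters news documents: using information retrieval techniques to select the important words in a document, judging the importance of a sentence by where it appears, and expanding the title vocabulary. We identify important words in the document to express the concepts it contains, and we analyze documents to find which positions the topic usually occupies. In addition, we consider the title quite important to a document, so we use WordNet to find words related to the title and expand the title vocabulary. This gives the title more weight and helps locate the sentences in the document that are most related to it.
In the experiments, we design separate experimental groups for each of the three summarization methods, classify the resulting summaries, and evaluate summarization quality by classification precision. The results show that all three methods are effective for summarization. Finally, this thesis proposes a method that combines title-vocabulary expansion with important positions. It achieves a precision of 71.9%, a 9.6% improvement over the baseline precision of 65.6%.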
Abstract (English): The swift development of information technology and the Internet has resulted in a problem of information overload. Hence it is imperative to find a way to help users browse through documents efficiently and effectively. Text summarization could be a remedy to this problem. Traditional text summarization is usually done manually, which consumes considerable human resources and cannot meet real-time demand. Therefore, it is necessary to automate the process.
This thesis presents three methods of text summarization on the Reuters news corpus. First, we use information retrieval techniques to collect the important vocabulary of the document (the Important Vocabulary Extract Policy). Second, we determine the significance of a sentence from its position in the document (the Optimal Position Policy). Last, we expand the vocabulary of the title (the Title Expand Policy). To express the concepts of the document, we extract its important vocabulary, and we analyze document structure to find which positions the document's subject usually occupies. Moreover, we believe that the title is highly significant to the document. We therefore expand the title vocabulary with related words from WordNet, and then use the expanded set of words to find the appropriate sentences for the summary.
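The three policies described above can be sketched roughly as sentence-scoring functions. This is a minimal illustration, not the thesis's implementation: the TF-IDF weighting, the `1 / (1 + index)` position prior, and the hand-written `related_words` map (standing in for a WordNet lookup) are all assumptions, and every function name is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_weights(doc_tokens, corpus_token_sets):
    """Assumed weighting: term frequency times smoothed inverse document frequency."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus_token_sets)
    weights = {}
    for word, freq in tf.items():
        df = sum(1 for token_set in corpus_token_sets if word in token_set)
        weights[word] = freq * math.log((1 + n_docs) / (1 + df))
    return weights


def vocabulary_score(sentence, weights):
    """Important-vocabulary policy: sum the weights of the distinct words in the sentence."""
    return sum(weights.get(word, 0.0) for word in set(tokenize(sentence)))


def position_score(index):
    """Position policy with an assumed prior: earlier sentences score higher."""
    return 1.0 / (1 + index)


def title_score(sentence, title, related_words):
    """Title-expansion policy: count sentence words found in the expanded title vocabulary.

    `related_words` is a hypothetical dict mapping a title word to a set of
    related words; in the thesis this role is played by WordNet.
    """
    expanded = set(tokenize(title))
    for word in list(expanded):
        expanded |= related_words.get(word, set())
    return len(expanded & set(tokenize(sentence)))
```

For example, with `related_words = {"oil": {"crude", "petroleum"}, "rise": {"climb", "increase"}}`, the title "Oil prices rise" matches the sentence "Crude futures climb in London" on two words that the unexpanded title would have missed.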
In the experiments, we design separate experiments for each of the three summarization methods. The resulting summaries are then evaluated by text categorization. Experimental results indicate that all of the methods used in this thesis achieve acceptable performance. Finally, this thesis proposes a method that combines two policies, Optimal Position and Title Expand. Against a baseline precision of 65.6%, the proposed method achieves a precision of 71.9%, a relative improvement of 9.6%.
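The reported 9.6% is consistent with a relative gain over the baseline, (71.9 - 65.6) / 65.6, rather than the 6.3-point absolute difference. The sketch below checks that arithmetic and shows one plausible reading of the evaluation, in which precision is the fraction of summaries assigned the correct category. That definition and both function names are assumptions; the abstract does not define the precision measure.

```python
def categorization_precision(predicted, gold):
    """Assumed measure: fraction of summaries whose predicted category matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def relative_improvement(baseline, proposed):
    """Gain of the proposed score over the baseline, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (proposed - baseline) / baseline


# Figures from the abstract: baseline 65.6%, combined method 71.9%.
gain = relative_improvement(65.6, 71.9)
print(f"relative improvement: {gain * 100:.1f}%")
```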
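Acknowledgments
Chinese Abstract
English Abstract
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
2 Related Work
2.1 Document Scope
2.2 Words and Concepts
2.3 Documents in Different Languages
2.4 Other Important Summarization Methods
3 Research Resources
3.1 Reuters News Corpus
3.2 WordNet
3.3 Brown Corpus Annotated with Word Sense Information
4 Research Methods
4.1 Selecting Important Vocabulary
4.2 Identifying Topic Sentences by Position
4.3 Expanding Title Vocabulary
4.3.1 Word Sense Disambiguation with WordNet
4.3.2 Word Sense Disambiguation Using Corpus Sense Probabilities
4.4 Evaluation Method
5 Summarization Experiments on English Documents
5.1 Experimental Data
5.2 Selecting Important Vocabulary
5.2.1 Experimental Design
5.2.2 Results and Discussion
5.3 Identifying Topic Sentences by Position
5.3.1 Experimental Design
5.3.2 Results and Discussion
5.4 Expanding Title Vocabulary
5.4.1 Experimental Design
5.4.2 Results and Discussion
5.5 Combining Title Vocabulary Expansion with the Important-Position Summarization Method
5.5.1 Experimental Design
5.5.2 Results and Discussion
6 Conclusions and Future Directions
References
Appendix 1: Hierarchy Diagram of Nouns in WordNet
Appendix 2: Categories and Document Counts in Reuters News Corpus Version 3
Appendix 3: English Stop Word List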