National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

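Graduate student: 唐顥
Graduate student (English name): Hao Tang
Thesis title: 由語音文件中擷取關鍵語彙之研究
Thesis title (English): Key Term Extraction from Spoken Documents
Advisor: Lin-Shan Lee (李琳山)
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Electrical Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Number of pages: 52
Chinese keywords: 關鍵語彙; 關鍵詞; 語音文件
English keywords: key term; key word; spoken document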
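Key term extraction has long been an important problem, but research on it remains limited and is scattered across different conferences or confined to parts of individual papers. The field still lacks a complete overview and a unified comparison. This thesis first surveys the methods in common use, analyzes the strengths and weaknesses of each, proposes a more unified evaluation scheme, and then compares the methods under it. In addition, earlier methods relied on too much manual intervention, such as removing stop words by hand. To preserve a realistic key term extraction setting, we run our experiments without heavy preprocessing or postprocessing of the corpus.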
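Unlike previous work, we no longer filter candidates with a single feature. Instead, we use a naive Bayes classifier and a Markov model to extract key terms from several features at once. Although the final F-measures are only 18.0% and 17.0%, respectively, both models recall more than half of the key terms, with recall of 55.6% and 54.2%, respectively. We also experiment with feature selection and find that certain feature combinations perform better and that both models perform stably.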
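As a rough check on these figures, the precision implied by each F-measure and recall pair can be worked out. The sketch below assumes the F-measure is the balanced F1 score, F = 2PR / (P + R). The abstract does not state the definition, and the thesis proposes its own evaluation scheme, so the result is only indicative. The function name `implied_precision` is ours, not the author's.

```python
def implied_precision(f_measure: float, recall: float) -> float:
    """Solve F = 2PR / (P + R) for P, assuming the balanced F1 definition."""
    denominator = 2.0 * recall - f_measure
    if denominator <= 0.0:
        raise ValueError("no valid precision exists for this F-measure/recall pair")
    return f_measure * recall / denominator


# Figures reported in the abstract, expressed as fractions: (F-measure, recall).
reported = {
    "naive Bayes": (0.180, 0.556),
    "Markov model": (0.170, 0.542),
}

for model_name, (f_measure, recall) in reported.items():
    precision = implied_precision(f_measure, recall)
    print(f"{model_name}: implied precision = {precision:.1%}")
```

Under that assumption the implied precision is roughly 10.7% for the naive Bayes classifier and 10.1% for the Markov model, which would mean the high recall comes with many false positives.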
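Feature analysis shows that words usually treated as stop words are quite likely to be cues for identifying key terms. We call these words cue words. On average, the models perform better with cue words than without them. Some stop words therefore carry important information and should not be removed indiscriminately.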
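We further find that the training and test sets differ considerably: more than half of the words in the test set are out-of-vocabulary words. That both models perform this well under such mismatched conditions indicates that using a model is more effective and more stable than using a single feature.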
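The idea of combining several features in one naive Bayes classifier, rather than thresholding a single feature, can be sketched as follows. This is a minimal illustration and not the thesis's implementation: it assumes Gaussian class-conditional densities (one of the distributions named in the thesis outline), uses two features named after TF-IDF and position entropy, and runs on invented toy values that make no claim about how those features actually behave. All function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels):
    """Estimate a class prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance positive when a feature is constant.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, 1e-6)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(samples), means, variances)
    return model


def log_gaussian(x, mean, variance):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


def predict(model, features):
    """Return the class with the highest log posterior (up to a constant)."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(features, means, variances)
        )
    return max(model, key=score)


# Invented toy values: (tf-idf, position entropy) for each candidate term.
train_x = [(0.9, 0.2), (0.8, 0.3), (0.7, 0.25), (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
train_y = ["key", "key", "key", "other", "other", "other"]

model = fit_gaussian_nb(train_x, train_y)
print(predict(model, (0.85, 0.22)))
print(predict(model, (0.12, 0.88)))
```

Because the per-feature log likelihoods are summed, every feature contributes to the decision. That is the contrast with single-feature filtering that the abstract draws.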

Oral Defense Committee Certification
Abstract
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Related Work
1.2.1 Manual Indexing
1.2.2 Automatic Key Term Extraction
1.3 Methods and Results of This Thesis
1.4 Chapter Organization
2 Background
2.1 Naive Bayes Classifier
2.2 Markov Model
2.2.1 Parameter Estimation
2.2.2 Viterbi Algorithm
2.3 Chapter Summary
3 Features
3.1 Term Frequency and Inverse Document Frequency
3.2 Position Entropy
3.3 Co-occurrence
3.4 Chapter Summary
4 Naive Models with Continuous Distributions
4.1 Gaussian Distribution
4.2 Exponential Distribution
4.3 Beta Distribution
4.4 Chapter Summary
5 Experiments and Analysis
5.1 Corpus
5.2 Experimental Design
5.3 Evaluation Method
5.4 Analysis of Results
5.5 Chapter Summary
6 Key Term Graph
6.1 Connected Graph
6.2 Small World
6.3 Chapter Summary
7 Conclusion and Future Work
7.1 Summary
7.2 Future Work
References
