National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

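Graduate student: 鄭順謙 (Shun-Chien Cheng)
Thesis title: 以拔靴法產生人造物專名辭典
Thesis title (English): Creating Gazetteers of Artifact Entities by Bootstrapping Method
Advisor: 林川傑 (Chuan-Jie Lin)
Degree: Master's
Institution: National Taiwan Ocean University (國立臺灣海洋大學)
Department: Department of Computer Science and Engineering (資訊工程學系)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 66
Keywords (Chinese): 專名實體辨識; 人造物; 專名辭典; 拔靴法
Keywords (English): named entity recognition; artifact; gazetteer; bootstrapping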
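Abstract (translated from the Chinese original):
This thesis studies methods for recognizing named entities of the artifact category. Artifacts are divided into ten types. The work starts with the MOVIE type, which is rich in resources and easy to verify. After several approaches were studied, the most successful strategy, generating gazetteers automatically with bootstrapping and the Internet, is extended to the other types.
Looking up quoted strings in a manually built gazetteer gives an F-measure of 48.2% on the development corpus.
Next, a companion gazetteer (a gazetteer of person names related to the category) is proposed. The companion gazetteer and the named entity gazetteer add entries to each other by bootstrapping, so that the gazetteer is built automatically. With an unlabeled corpus as the development resource, this gives an F-measure of 62.6% on the development corpus.
Filtering criteria for the companion gazetteer and the named entity gazetteer are then added, and the Internet replaces the corpus as the development resource for building the gazetteer automatically. In addition, when a string is not in the gazetteer, a mechanism immediately retrieves data from the Internet and applies the filtering criteria. The best result is an F-measure of 69.6% on the development corpus and 55.6% on the test corpus.
As a supplement, the thesis reports the context feature term strategies that were studied earlier for artifact named entity recognition. Context feature terms were selected automatically by conditional probability, by occurrence frequency, by chi-square with the whole document as the context window, and by chi-square with a window of 40 surrounding words. The best F-measure, 63.7% on the development corpus, was obtained with the top 100 context feature terms selected by chi-square with the whole document as the context window.
Bootstrapping was also tried for machine learning over context feature terms, with the manual annotations of the development corpus as the seed. The best result, obtained with filtering of positive instances and after one iteration, is an F-measure of 64.9% on the development corpus.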
Abstract (English):
This thesis studies artifact named entity recognition. Ten types of artifacts are defined. Because resources for MOVIE artifacts are abundant, the performance of a named entity recognizer is easier to verify on this type. Starting from MOVIE name recognition, a method is proposed to create gazetteers from the Internet with a bootstrapping algorithm. The method can be extended to the other artifact types.
Looking up every quoted string of the development set in a hand-built gazetteer achieved an F-measure of 48.2%.
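The quoted-string lookup and its evaluation can be illustrated with a minimal sketch. The set of quote marks, the function names, and the sample sentence are assumptions made for illustration; the record does not say which quote marks or matching rules the thesis used.

```python
import re

# Assumed quote pairs: Chinese corner brackets, book-title marks, ASCII quotes.
# The thesis record does not list the quote marks actually used.
QUOTE_PATTERN = re.compile(r'「([^「」]+)」|《([^《》]+)》|"([^"]+)"')


def quoted_strings(text):
    """Return every string found inside one of the assumed quote pairs."""
    return [next(g for g in match.groups() if g is not None)
            for match in QUOTE_PATTERN.finditer(text)]


def lookup_movie_names(text, gazetteer):
    """Keep only the quoted strings that appear in the gazetteer."""
    return [s for s in quoted_strings(text) if s in gazetteer]


def f_measure(predicted, gold):
    """Balanced F-measure: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample data, not taken from the thesis corpus.
sample_text = "他看了「臥虎藏龍」和《悲情城市》,也讀了「紅樓夢」。"
sample_gazetteer = {"臥虎藏龍", "悲情城市"}
sample_gold = {"臥虎藏龍", "悲情城市", "無間道"}
sample_predicted = lookup_movie_names(sample_text, sample_gazetteer)
```

In this toy case the lookup finds two of the three gold names with no false positives, so precision is 1 and recall is 2/3. A fixed gazetteer can never recover a name it does not contain, which is the gap that automatic gazetteer construction is meant to close.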
The idea of a “companion gazetteer” is then proposed. For MOVIE names, the companion gazetteer is a list of names of persons relevant to the movie industry. The proposed bootstrapping algorithm alternately adds new entries to the MOVIE gazetteer and to the movie-related PERSON gazetteer, drawing on an unlabeled corpus. Recognizing movie names in the development set with the final version of the MOVIE gazetteer achieved an F-measure of 62.6%.
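The alternation between the two gazetteers can be sketched as follows. This is only an illustration of the general idea, not the algorithm of the thesis: the document-level co-occurrence rule, the `min_freq` threshold, the `max_rounds` limit, and the input format are all assumptions.

```python
from collections import Counter


def bootstrap(docs, movie_seed, min_freq=2, max_rounds=10):
    """Grow a movie gazetteer and a companion person gazetteer in turns.

    docs: list of (quoted_strings, person_names) pairs, one per document.
    The co-occurrence rule and frequency threshold are illustrative assumptions.
    """
    movies, persons = set(movie_seed), set()
    for _ in range(max_rounds):
        # Step 1: persons seen in documents that mention a known movie.
        person_counts = Counter()
        for quoted, names in docs:
            if movies & set(quoted):
                person_counts.update(set(names) - persons)
        new_persons = {p for p, c in person_counts.items() if c >= min_freq}
        persons |= new_persons

        # Step 2: quoted strings seen in documents that mention a known person.
        movie_counts = Counter()
        for quoted, names in docs:
            if persons & set(names):
                movie_counts.update(set(quoted) - movies)
        new_movies = {m for m, c in movie_counts.items() if c >= min_freq}
        movies |= new_movies

        # Stop once a full round adds nothing to either gazetteer.
        if not new_persons and not new_movies:
            break
    return movies, persons


# Hypothetical toy corpus: letters stand for quoted strings, p1..p3 for persons.
toy_docs = [
    (["A"], ["p1"]),
    (["A"], ["p1", "p2"]),
    (["B"], ["p1"]),
    (["B", "C"], ["p1"]),
    (["C"], ["p3"]),
]
toy_movies, toy_persons = bootstrap(toy_docs, {"A"})
```

Starting from the seed `A`, the person `p1` is admitted because it co-occurs with `A` twice, and `B` is then admitted because it co-occurs with `p1` twice. `C` and `p2` stay out because each is supported by only one document, which shows how a frequency threshold limits drift.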
The data source used to create the MOVIE gazetteer was then shifted to the Internet. Two filtering rules were added to the bootstrapping algorithm to select movie names and movie-related person names more accurately. When a candidate string was not in the gazetteer, it was judged immediately against Internet resources using the filtering rules. This method achieved F-measures of 69.6% on the development set and 55.6% on the test set, the best results in this thesis.
Context feature terms were also studied, but they did not prove to be a good solution. Context feature terms were selected by conditional probability, corpus frequency, or chi-square, with the context window set to either the whole document or a 40-word passage. Two features corresponding to each feature term were then used to train a MOVIE name identifier by machine learning. The best performance was an F-measure of 63.7% on the development set, obtained when the context window was the whole document and the top 100 context feature terms were selected by chi-square value.
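Chi-square term selection can be sketched with the standard 2×2 contingency statistic. The input format (one set of context terms per candidate string, with a movie or non-movie label) and the function names are assumptions; the record does not describe how the thesis built its contingency counts.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table.

    a: positive contexts containing the term, b: negative contexts containing it,
    c: positive contexts without it,          d: negative contexts without it.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def top_context_terms(contexts, labels, k=100):
    """Rank terms by chi-square against the movie/non-movie label; keep the top k.

    contexts: one collection of terms per candidate string (assumed format).
    labels:   True if the candidate is a movie name, else False.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for terms, is_movie in zip(contexts, labels):
        (pos_df if is_movie else neg_df).update(set(terms))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy contexts, not taken from the thesis corpus.
toy_contexts = [
    {"director", "release"},
    {"director", "box_office"},
    {"author", "publish"},
    {"author", "release"},
]
toy_labels = [True, True, False, False]
toy_top = top_context_terms(toy_contexts, toy_labels, k=2)
```

Note that chi-square is symmetric: it scores `author`, which signals a non-movie, as highly as `director`, which signals a movie. A term such as `release`, spread evenly over both classes, scores zero and is dropped.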
A bootstrapping method was also proposed to select context feature terms for machine learning. With the list of movie names in the development set as the seed, the best F-measure was 64.9%, obtained when filtering rules were applied.
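Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
1.1. Background
1.2. Artifact Named Entity Recognition
Chapter 2. Experiments on Movie Name Recognition
2.1. Experimental Data
2.2. Quoted Strings
2.3. Gazetteers
Chapter 3. Creating Gazetteers with Companion Gazetteers and Bootstrapping
3.1. Creating Gazetteers from a Corpus
3.2. Creating Gazetteers from the Internet
3.2.1. Trigger Words
3.2.2. Trigger-Word Combination Queries
3.2.3. Trigger-Word Adjacency Queries
3.2.4. Filtering Criteria for Movie Names
3.2.5. Filtering Criteria for Movie-Related Person Names
3.2.6. Bootstrapping Algorithm
3.3. Experiments and Results
3.3.1. Experiments with the Movie Gazetteer Created from a Corpus
3.3.2. Experiments with the Movie Gazetteer Created from the Internet
3.4. Comparison of Parameter Settings
3.4.1. Effect of the Occurrence-Frequency Threshold
3.4.2. Effect of Gazetteer Seed Size
3.4.3. Effect of Different Gazetteer Seeds
3.5. Discussion of Method Performance
3.5.1. Accuracy of the Movie Gazetteer
3.5.2. Best Performance of Gazetteer Lookup
3.5.3. Coverage of the Movie Name Filtering Criteria
3.5.4. System Performance with the Dynamic Judgment Mechanism
3.5.5. Experiments on TV Program Names
Chapter 4. Building a Movie Name Classifier with Context Feature Terms
4.1. Context Feature Terms
4.2. Experiments and Discussion of Results
4.2.1. Experiment 1: A Small Test of Selecting Context Feature Terms
4.2.2. Experiment 2: Selecting Context Feature Terms from an Unlabeled Corpus by Conditional Probability
4.2.3. Experiment 3: Selecting Context Feature Terms from an Unlabeled Corpus by Chi-Square
4.2.4. Experiment 4: Determining the Number of Context Feature Terms to Select
4.2.5. Experiment 5: Restricting the Context Window When Selecting Context Feature Terms
4.3. Creating a Movie Name Classifier with Context Feature Terms and Bootstrapping
4.3.1. Algorithm
4.3.2. Experiments and Discussion of Results
Chapter 5. Conclusion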