National Digital Library of Theses and Dissertations in Taiwan

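Graduate student: 丁少弘
Graduate student (English name): Shao-Hong Ding
Thesis title: PNRS-基於字典樣式搜尋和探勘的蛋白質名稱辨識系統
Thesis title (English): PNRS - Protein Name Recognition System Using Dictionary-Based Pattern Search and Mining
Advisor: 林宣華
Advisor (English name): Shian-Hua Lin
Degree: Master's
Institution: National Chi Nan University (國立暨南國際大學)
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2008
Graduation academic year: 96 (ROC calendar; 2007-2008)
Language: English
Number of pages: 46
Keywords (Chinese): 蛋白質名稱辨識; 序列樣式探勘; 經驗法則; 關聯探勘
Keywords (English): Protein Name Recognition; Sequential Pattern Mining; Heuristic Rules; Mining Association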
Recognizing biological named entities is a fundamental and important step before knowledge about proteins and genes can be extracted automatically from biomedical literature. We propose a novel method that integrates dictionary, heuristic, and data mining approaches to recognize protein names in the literature effectively. Using a protein name dictionary together with heuristic rules published in related work, which are based on the morphological features of protein names, the core tokens of a protein name can be detected efficiently. The exact boundaries of the complete protein name, however, are hard to identify. We therefore treat each token of a protein name as an item and apply association mining to the protein name dictionary to discover significant sequential patterns (SSPs). Based on the SSPs, the detected core tokens are extended so that the boundaries of the protein name can be identified correctly. On the widely used Yapex test corpus (Yapex101), the Protein Name Recognition System (PNRS) achieves an F-measure of 74.5%, which is higher than the results of the protein name recognition systems reported in the literature currently retrievable through Google Scholar.
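The core idea of the abstract, mining token patterns from a name dictionary and using them to extend a detected core token, can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a simplified notion of "pattern" (a contiguous token n-gram that occurs in at least `min_support` dictionary names), and the function names `mine_patterns` and `extend_core`, the toy dictionary, and the thresholds are all hypothetical. The thesis's actual SSP measure and its AD-threshold are not described in this record.

```python
from collections import Counter


def mine_patterns(dictionary, min_support=2, max_len=3):
    """Return contiguous token n-grams (2..max_len tokens) that occur in at
    least min_support dictionary names. Simplified stand-in for SSP mining."""
    counts = Counter()
    for name in dictionary:
        tokens = name.lower().split()
        seen = set()  # count each n-gram at most once per name
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        counts.update(seen)
    return {p for p, c in counts.items() if c >= min_support}


def extend_core(tokens, core_index, patterns, max_len=3):
    """Grow a span around the core token: whenever a mined pattern matches a
    window that overlaps the span and reaches outside it, widen the span."""
    low = [t.lower() for t in tokens]
    start, end = core_index, core_index + 1  # half-open span [start, end)
    changed = True
    while changed:
        changed = False
        for n in range(2, max_len + 1):
            first = max(0, start - n + 1)
            last = min(end, len(low) - n + 1)
            for i in range(first, last):
                window = tuple(low[i:i + n])
                if window in patterns and (i < start or i + n > end):
                    start, end = min(start, i), max(end, i + n)
                    changed = True
    return " ".join(tokens[start:end])


# Toy dictionary (hypothetical entries, for illustration only).
toy_dictionary = [
    "protein kinase C",
    "protein kinase A",
    "tyrosine kinase receptor",
    "MAP kinase kinase",
    "tyrosine kinase 2",
]
patterns = mine_patterns(toy_dictionary)

sentence = "The protein kinase inhibitor blocked signalling".split()
# Assume an earlier dictionary/heuristic step flagged "kinase" (index 2).
name = extend_core(sentence, 2, patterns)
print(name)
```

In this toy run only the bigrams ("protein", "kinase") and ("tyrosine", "kinase") reach the support threshold, so the core token "kinase" is extended to the left but not to the right. A real system would need a significance measure rather than a raw count, and a final filtering step such as the Protein Name Filter listed in the contents.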
Chinese Abstract
Abstract
Contents
List of Tables
List of Figures
1. Introduction
2. Related Works
2.1. Protein Databases
2.1.1. EMBL-EBI (European Bioinformatics Institute)
2.1.2. PIR (Protein Information Resource)
2.1.3. UniProt (Universal Protein Resource)
2.1.4. NCBI (National Center for Biotechnology Information)
2.1.5. Databases Used in the Thesis
2.2. Protein Name Recognition Methods
2.2.1. Dictionary-Based Methods
2.2.2. Learning-Based Methods
2.2.3. Alignment-Based Methods
2.2.4. Rule-Based Methods
2.2.5. Comparisons of PNR Methods
2.2.6. Evaluation Criteria
2.3. Mining Association and Sequential Pattern
2.3.1. Mining Association
2.3.2. Sequential Pattern
3. The Implementation
3.1. Concepts and Ideas
3.2. The System
3.2.1. Sentence Splitter
3.2.2. Core Token Selector (CTS)
3.2.3. Significant Sequential Pattern (SSP) Miner
3.2.4. Core Token Extender (CTE)
3.2.5. Protein Name Filter
4. Experiments and Evaluations
4.1.1. Experiments: Dictionary + Rule
4.1.2. Experiments: SSP (L-Ext vs. R-Ext)
4.1.3. Experiments: SSP (AD-Threshold)
4.1.4. Experiments: Different Dictionary (Swiss-Prot vs. TrEMBL)
4.1.5. Discussion
5. Conclusion and Future Works
6. References
[1]Agrawal, R., Imielinski, T., and Swami, A., “Mining association rules between sets of items in large databases,” Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, D.C., May, 1993.
[2]Agrawal, R. and Srikant, R., “Fast algorithms for mining association rules,” Proceedings of International Conference on Very Large Databases, Santiago, Chile, September 1994, pp. 487-499.
[3]Agrawal, R. and Srikant, R., “Mining sequential patterns,” Proceedings of International Conference on Data Engineering, Taipei, Taiwan, March, 1995, pp. 3–14.
[4]Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J., “Basic local alignment search tool,” Journal of Molecular Biology, 1990, 215:403-410.
[5]Blaschke, C., Andrade, M. A., Ouzounis, C. and Valencia, A., “Automatic Extraction of Biological Information from Scientific Text: Protein-protein Interactions,” Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, 1999, pp. 60-67.
[6]Brill, E., “Some advances in transformation-based part of speech tagging,” Proceedings of the Twelfth National Conference on Artificial Intelligence, Volume 1, 1994, pp. 722-727.
[7]Chang, J.T., Schutze, H. and Altman, R., "GAPSCORE: finding gene and protein names one word at a time," Bioinformatics, 20, 2004, 216-25.
[8]Chiang JH and Yu HC, “MeKE: Discovering the Functions of Gene Products from Biomedical Literature via Sentence Alignment,” Bioinformatics, 19, 2003, 1417-1422.
[9]Collier, N., Nobata, C. and Tsujii, J., “Extracting the names of genes and gene products with a hidden markov model,” Proceedings of the 18th International Conference on Computational Linguistics, 2000, pp. 201-207.
[10]Egorov, S., Yuryev, A. and Daraselia, N., “A Simple and Practical Dictionary-Based Approach for Identification of Proteins in MEDLINE Abstracts,” Journal of the American Medical Informatics Association, Feb., 2004.
[11]Franzen, K., Eriksson, G., Olsson, F., Asker, L., Liden, P. and Coster, J., “Protein Names and How to Find Them,” International Journal of Medical Informatics, Dec., 2002.
[12]Fukuda, K., Tsunoda, T., Tamura, A. and Takagi, T., “Toward information extraction: identifying protein names from biological papers,” Proceedings of the 3rd Pacific Symposium on Biocomputing, 1998, 707-718.
[13]Hanisch, D., Fluck, J., Mevissen, HT. and Zimmer, R., “Playing biology's name game: identifying protein names in scientific text,” Proceedings of the 8th Pacific Symposium on Biocomputing, 2003, 403-414.
[14]Huang, M. L., Zhu, X. Y., Hao, Y., Payan, D. G., Qu, K. B. and Li, M., “Discovering patterns to extract protein-protein interactions from full texts,” Bioinformatics, 20, 2004, 3604-3612.
[15]Kazama, J., Makino, T., Ohta, Y. and Tsujii, J., “Tuning support vector machines for biomedical named entity recognition,” Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, 2002, 1–8.
[16]Krauthammer, M., Rzhetsky, A., Morozov, P. and Friedman, C., “Using BLAST for identifying gene and protein names in journal articles,” Gene, 259, 2000, 245-252.
[17]Lin, S.-H., Shih, C.-S., Chen, M.C., Ho, J.-M., Ko, M.-T. and Huang, Y.-M., “Extracting Classification Knowledge of Internet Documents: A Semantics Approach,” Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 241-249.
[18]Lipman, D. J. and Pearson, W.R., “Rapid and sensitive protein similarity searches,” Science, 227, 1985, 1435–1441.
[19]Proux, D., Rechenmann, F., Julliard, L., Pillet, V. V. and Jacq, B., “Detecting gene symbols and names in biological texts: A First Step toward Pertinent Information Extraction,” The 9th Workshop on Genome Informatics, 1998, 72-80.
[20]Malik, R., Franke, L. and Siebes, A., “Combination of text-mining algorithms increases the performance,” Bioinformatics, 22, 2006, 2151-2157.
[21]Mika, S. and Rost, B., “Protein names precisely peeled off free text,” Bioinformatics, 20, 2004, i241–i247.
[22]Narayanaswamy, M., Ravikumar, K. E. and Vijay-Shanker, K., “A biological named entity recognizer,” Proceedings of the 8th Pacific Symposium on Biocomputing, 2003, 427–438.
[23]Nobata, C., Collier, N. and Tsujii, J., “Automatic term identification and classification in biology texts,” Proceedings of the 5th Natural Language Pacific Rim Symposium, 1999, 369–375.
[24]Richard Tzong-Han Tsai, Shih-Hung Wu, Wen-Chi Chou, Yu-Chun Lin, Ding He, Jieh Hsiang, Ting-Yi Sung and Wen-Lian Hsu, “Various criteria in the evaluation of biomedical named entity recognition,” BMC Bioinformatics 2006, 7:92 doi:10.1186/1471-2105-7-92.
[25]Richard Tzong-Han Tsai, Cheng-Lung Sung, Hong-Jie Dai, Hsieh-Chuan Hung, Ting-Yi Sung and Wen-Lian Hsu, “NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition,” BMC Bioinformatics 2006, 7(Suppl 5):S11 doi:10.1186/1471-2105-7-S5-S11.
[26]Ryan McDonald and Fernando Pereira, “Identifying gene and protein mentions in text using conditional random fields,” BMC Bioinformatics 2005, 6(Suppl 1):S6 doi:10.1186/1471-2105-6-S1-S6.
[27]Salton, G. and McGill, M. J., “Introduction to Modern Information Retrieval,” McGraw-Hill, 1983.
[28]Seki, K. and Mostafa, J., “A probabilistic model for identifying protein names and their name boundaries,” Proceedings of the 2003 IEEE Computer Society Bioinformatics Conference (CSB 2003), 2003, pp. 251–258.
[29]Seki, K. and Mostafa, J., “An approach to protein name extraction using heuristics and a dictionary,” Proceedings of the American Society for Information Science and Technology Annual Conference (ASIST), 2003.
[30]Tanabe, L. and Wilbur, W.J., “Tagging gene and protein names in biomedical texts,” Bioinformatics, 18, 2002, 1124-1132.
[31]Tapanainen, P. and Järvinen, T., “A non-projective dependency parser,” Proceedings of the 5th Conference on Applied Natural Language Processing, Washington D.C., 1997, pp. 64-71.
[32]Tsuruoka, Y. and Tsujii, J., “Boosting precision and recall of dictionary-based protein name recognition,” Proceedings of the ACL’2003 Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan, 2003, pp. 41–48.
[33]Wen-Juan Hou and Hsin-Hsi Chen, “Enhancing performance of protein and gene name recognizers with filtering and integration strategies,” Journal of Biomedical Informatics, 37, 2004, 448-460.
[34]Yamamoto, K., Kudo, T., Konagaya, A. and Matsumoto, Y., “Use of morphological analysis in protein name recognition,” Journal of Biomedical Informatics, Volume 37, Issue 6, 2004, 471-482.
[35]Yeganova, L., Smith, L. and Wilbur, W.J., “Identification of related gene/protein names based on an HMM of name variations,” Computational Biology and Chemistry, Volume 28, Issue 2, 2004, 97-107.
[36]Yu, H. and Agichtein, E., “Extracting synonymous gene and protein terms from biological literature,” Bioinformatics, 19 (Suppl 1), 2003, i340-i349.
[37]Zhenzhen Kou, William W. Cohen and Robert F. Murphy, “High-recall protein entity recognition using a dictionary,” Bioinformatics, 21, 2005, i266-i273.
[38]Zhou, G.D., Zhang, J., Su, J., Shen, D. and Tan, C.L., “Recognizing names in biomedical texts: a machine learning approach,” Bioinformatics, 20, 2004, 1178–1190.