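Graduate student: 顏敏哲 (Min-Zhe Yan)
Thesis title: Web-Based Information Extraction System - A Case Study on Protein-Protein Interactions (Web為基礎的資訊擷取系統-以蛋白質交互作用為例)
Advisor: 林宣華 (Shian-Hua Lin)
Degree: Master's
Institution: National Chi Nan University (國立暨南國際大學)
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Publication year: 2008
Graduation academic year: 96 (ROC calendar)
Language: English
Pages: 58
Keywords: Information Extraction; Text Mining; Sequence Alignment; Protein-Protein Interactions; Protein Name Recognition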
With the rapid growth of the Web, search engines return large numbers of pages, yet only a small fraction of the retrieved information meets the user's needs. An intelligent information extraction system that processes large volumes of pages efficiently and turns them into useful knowledge would benefit applications in many domains. The "Object-Object Relation (Event)" is a common form of knowledge representation, and the content of many domain-specific websites implicitly contains this kind of knowledge. For example, NCBI PubMed manages more than 18 million biomedical papers, whose text records a large amount of information on Protein-Protein Interactions (PPIs). News pages record many relationships among "Person, Event, Time, Location, and Thing". Travel and blog pages describe "Location-Location Relations" and "People-Location Journeys", including people's evaluations of the locations they visit. In this thesis, we propose an information extraction platform that uses the Web as its information source. Taking the semi-automatic extraction of PPI information from the biomedical literature as a case study, we investigate the design and implementation issues that arise when adapting the platform to other application domains. The system, Object-Object Relation Extraction System (O2RES), is divided into three subsystems: Entity/Relation Recognizer, Translator, and Event Extractor. Each subsystem consists of reusable modules that can be swapped or customized to extract the events required by different applications. In cooperation with researchers at Taichung Veterans General Hospital (TVGH), PPI information was annotated in biomedical literature to serve as test data, and the Protein-protein Interaction Extraction System (PIES@O2RES) was implemented on top of O2RES. On the TVGH data, the system achieves a recall of 92.25% and a precision of 80.95% (F1 = 86.23%). On the BioCreative-PPI test data, recall and precision are 48.24% and 91.11% (F1 = 63.08%), which exceeds the best previously reported result (F1 = 56%).
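The F1 values quoted in the abstract can be checked from the reported recall and precision. The sketch below assumes the standard definition of F1 as the harmonic mean of recall and precision; the function name `f1_score` is chosen here for illustration and is not taken from the thesis.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision.

    Both arguments must use the same scale (both percentages or both fractions);
    the result is on that same scale.
    """
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Recall and precision as reported in the abstract, in percent.
tvgh_f1 = f1_score(recall=92.25, precision=80.95)
biocreative_f1 = f1_score(recall=48.24, precision=91.11)

print(f"TVGH F1 = {tvgh_f1:.2f}%")
print(f"BioCreative-PPI F1 = {biocreative_f1:.2f}%")
```

Rounded to two decimals, both results agree with the abstract (86.23% and 63.08%). The BioCreative-PPI figure shows how a low recall pulls F1 down even when precision is above 90%.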
Chinese Abstract
Abstract
Contents
List of Tables
List of Figures
1. Introduction
2. Related Works
2.1. Block Extraction Tag-based Method
2.2. Protein-Related Resources on the Web
2.3. Protein Name Recognition in Literatures
2.4. PPI-Related Database on the Web
2.5. PPI Information Extraction from Literatures
2.6. Web-Based Biomedical Information Extraction System
3. The PIES@O2RES
3.1. The General Architecture of O2RES
3.2. Travel@O2RES: Ideas and Concepts
4. The PPI Extractor Model
4.1. Entity Relation Recognizer
4.1.1. E-R Recognizer Dictionary-based
4.1.2. E-R Recognizer PNRS-based
4.2. Sentence Translator
4.3. Sequence Refinement
4.3.1. Abbreviation
4.3.2. P, …, P and P
4.3.3. Filtering Noise
4.3.4. … not only … but also …
4.4. Pattern Matcher
4.5. Entity Relation Event Extractor
5. PIES@O2RES User Interface
6. Experiments
6.1. Experiment Settings and Evaluation
6.1.1. The Biomedical Literatures
6.1.2. The Protein Names
6.1.3. The Interaction Keywords
6.1.4. The PPI patterns
6.1.5. Recall and Precision Evaluations
6.2. Experiment - TVGH
6.2.1. Testing Data
6.2.2. Experiment Methods and Observations
6.3. Experiment - BioCreAtIve PPI corpus
6.3.1. BioCreAtIve (BC)-PPI corpus preparation
6.3.2. The experimental result
6.4. Experiment - DIP PPI corpus
6.4.1. DIP-PPI corpus preparation
6.4.2. The evidence for PPIs result
7. Conclusion and Future Works
8. References
Appendix A  List of Interaction Keywords
Appendix B  50 PMIDs of Randomly Selected Testing Abstracts
Appendix C  XML-format PPI Information Extracted Result Example
Appendix D  The scoring matrix and gap penalties
Appendix E  XML-format Testing Data from TVGH