National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

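Graduate student: 桂卓慶
Graduate student (English name): Tock-kheng Kooi
Thesis title: 利用文字探勘技術萃取轉錄因子與目標基因調控資訊
Thesis title (English): USING TEXT MINING TECHNIQUES TO EXTRACT REGULATION BETWEEN TRANSCRIPTION FACTOR AND TARGET GENE
Advisor: 王惠嘉
Advisor (English name): Hei-Chia Wang
Degree: Master's
Institution: National Cheng Kung University (國立成功大學)
Department: Institute of Information Management (資訊管理研究所)
Discipline: Computing
Sub-discipline: General Computing
Thesis type: Academic thesis
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar)
Language: Chinese
Number of pages: 43
Chinese keywords: relationship between target gene and transcription factor (目標基因與轉錄因子關係); information extraction (資訊萃取)
English keywords: information extraction; regulation information between TF and TGene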
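With the complete decoding of the human genome sequence, biological experiments continue to be launched in large numbers. The human body functions because many healthy proteins continuously interact with one another (Protein-Protein Interaction; PPI). Proteins are the final products of genes, produced through the chain DNA -> mRNA -> protein. Transcriptional regulation is the first and most important step in the process by which a gene gives rise to a protein. This study is concerned with identifying transcription factors (Transcription Factor; TF) and target genes (Target Gene; TGene) and the regulatory relationships between them, information that is ultimately published in the literature.
As the biological literature grows rapidly year by year, it is difficult for biological researchers to read all of it and extract information on transcription factors and target genes. If information technology can be used to process and filter large volumes of biological literature efficiently and help readers extract the relationships between target genes and transcription factors, it will greatly help biological researchers in setting their experimental targets.
At present, most scholars concentrate on extracting protein-protein interaction information, whereas this study focuses on the regulatory relationship between transcription factors and target genes. The difficulties of this study are: (1) name recognition requires two different biological dictionaries; (2) the extraction of regulatory relationships must be strictly defined, for example, a transcription factor regulates a target gene, but a target gene cannot regulate a transcription factor. In addition, most scholars rarely handle ambiguous wording in sentences. A phrase such as “Previous Studies …” refers to earlier research, so it cannot show that the paper itself actually provides experimental evidence.
Accordingly, this study designs a method that combines negative patterns and positive patterns to analyze the literature returned by PubMed queries and predict the relationships between TFs and TGenes, as a reference for biological researchers' experiments. The experimental results show that the F-measure obtained by combining negative and positive patterns is higher than the F-measure obtained with positive patterns alone.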
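For readers unfamiliar with the metric, the F-measure comparison described above can be computed from raw counts of extracted relations. The following is a minimal Python sketch, not the thesis's evaluation code: the function name `f_measure` and the counts in the usage lines are hypothetical and chosen only for illustration.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """Return the F-measure (F1 when beta == 1) from raw counts.

    true_pos:  extracted TF-TGene relations that are correct
    false_pos: extracted relations that are wrong
    false_neg: correct relations the system failed to extract
    """
    if true_pos == 0:
        # Precision and recall are both zero (or undefined), so report 0.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up counts for illustration only; these are NOT results from the thesis.
positive_only = f_measure(true_pos=40, false_pos=30, false_neg=10)
with_negative = f_measure(true_pos=38, false_pos=10, false_neg=12)
print(positive_only, with_negative)
```

In this invented scenario, negative patterns remove many false positives at the cost of a few true positives, which is the kind of trade-off in which the combined F-measure can come out higher.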
The human genome sequence has been completely decoded. These data are helpful for gene identification and for the study of gene regulation. Gene regulation research includes regulation information between transcription factors (TFs) and target genes (TGenes), which can help biologists learn which TGenes are regulated by a given TF. At present, most regulation information is recorded in the biological literature.
Due to the rapid growth of the biological literature, biologists can hardly find the time to read through all related publications and extract regulation information between TFs and TGenes. Information technology that can filter the literature and extract relationships between TFs and TGenes may therefore improve reading efficiency.
Nowadays, most researchers devote their efforts to protein-protein interaction research, whereas this thesis specializes in extracting regulation between TFs and TGenes. The difficulties are that (1) named entity recognition needs two domain dictionaries and (2) relation recognition must be carefully defined. For example, a TF can regulate TGene expression, but a TGene cannot regulate a TF. In addition, most researchers focus on extracting important information but pay little attention to modality information: a phrase such as “Previous Studies…” refers to earlier work and does not indicate that the paper itself provides experimental evidence.
Therefore, this thesis aims to use text-mining techniques to analyze the literature retrieved from PubMed by TF queries, and to use negative and positive patterns to predict the relationships between TFs and target genes, which may give biologists valuable insight.
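The idea of pairing positive patterns with negative patterns can be illustrated with a small sketch. This is an assumption-laden example rather than the thesis's implementation: the inline `<TF>`/`<TGENE>` tag format, the verb lists, the regular expressions, and the function name `extract_regulations` are all hypothetical, and Python is used here only for illustration. A negative pattern rejects sentences that merely report earlier work, and the positive patterns accept only the TF-regulates-TGene direction.

```python
import re

# Hypothetical positive patterns over sentences whose entities are already
# tagged inline as <TF>...</TF> and <TGENE>...</TGENE> (an assumed format).
POSITIVE_PATTERNS = [
    # Active voice: TF <verb> [up to three filler words] TGene
    re.compile(
        r"<TF>(?P<tf>[^<]+)</TF>\s+(?:activates|represses|regulates|induces)\s+"
        r"(?:\w+\s+){0,3}?<TGENE>(?P<tg>[^<]+)</TGENE>",
        re.IGNORECASE,
    ),
    # Passive voice: TGene is <verb> by TF
    re.compile(
        r"<TGENE>(?P<tg>[^<]+)</TGENE>\s+is\s+"
        r"(?:activated|repressed|regulated|induced)\s+by\s+<TF>(?P<tf>[^<]+)</TF>",
        re.IGNORECASE,
    ),
]

# Hypothetical negative pattern: wording that points to earlier work rather
# than to evidence presented in the paper itself.
NEGATIVE_PATTERNS = [
    re.compile(r"\bprevious(?:ly)?\s+(?:studies|reports|work)\b", re.IGNORECASE),
]


def extract_regulations(sentence: str) -> list[tuple[str, str]]:
    """Return (TF, TGene) pairs, or an empty list if a negative pattern fires."""
    if any(p.search(sentence) for p in NEGATIVE_PATTERNS):
        return []
    return [
        (m.group("tf"), m.group("tg"))
        for p in POSITIVE_PATTERNS
        for m in p.finditer(sentence)
    ]


print(extract_regulations(
    "<TF>p53</TF> activates the expression of <TGENE>CDKN1A</TGENE>."
))
print(extract_regulations(
    "Previous studies showed that <TF>p53</TF> activates <TGENE>CDKN1A</TGENE>."
))
```

Because every positive pattern names the TF as the regulator, a sentence in which a tagged target gene appears as the regulator of a tagged TF yields no match, which mirrors the directionality constraint stated in the abstract.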
1. Introduction
1.1. Research Background
1.2. Research Motivation and Objectives
1.3. Research Scope and Limitations
1.4. Research Process
1.5. Thesis Organization
2. Literature Review
2.1. Biological Resources
2.1.1. PubMed
2.1.2. Sequence Retrieval System (SRS)
2.1.3. HUGO Gene Nomenclature Committee (HGNC)
2.2. Text Mining
2.3. Natural Language Processing
2.3.1. Part-of-Speech Tagging
2.3.2. Stemming
2.3.3. Pointwise Mutual Information
2.4. Lexical Resources
2.4.1. WordNet and Related Techniques
2.5. Related Work
2.5.1. Pattern-Based Approaches
2.5.2. Parsing-Based Approaches
2.6. Summary
3. Research Method
3.1. Research Framework
3.2. Abstracts Retrieve and Process Module
3.2.1. Query TF in PubMed
3.2.2. Preprocessing
3.2.3. TF and TGene Tagging
3.2.4. Sentence Restriction
3.3. Training Phase
3.3.1. Positive Pattern Training Module
3.3.1.1. Sentence Structure Representation
3.3.1.2. Pattern Generation
3.3.1.3. Pattern Validation
3.3.2. Negative Pattern Training Module
3.3.2.1. Sentence Filtering
3.3.2.2. Pattern Generation
3.4. Testing Phase
3.4.1. Pattern Matching
4. System Implementation and Validation
4.1. System Implementation
4.1.1. Implementation Environment
4.1.2. Packages and Modules Used
4.1.3. System Processing Flow
4.2. Experimental Method
4.2.1. Data Sources
4.2.2. Baselines for Comparison
4.2.3. Choice of Evaluation Metrics
4.3. Experimental Results and Analysis
5. Conclusions and Future Research Directions
5.1. Research Results
5.2. Future Research Directions
References