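Graduate student: 楊晴涵 (Cing-Han Yang)
Thesis title: Interdomain Boundary Detection by Sequence Alignment and Support Vector Machine (基於序列排比與支持向量機之蛋白質域邊界辨識)
Advisor: 白敦文 (Tun-Wen Pai)
Degree: Master's
Institution: National Taiwan Ocean University
Department: Department of Computer Science and Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2013
Academic year of graduation: 101
Language: Chinese
Number of pages: 34
Keywords: protein functional domain; domain boundary; amino acid pairs; secondary structure; LIBSVM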
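Proteins are composed of basic functional domains, and functional domains are the fundamental units of evolution. Most proteins contain several functional domains at once, and such multi-domain proteins may have arisen from selective pressure during the evolution of species. Different combinations of domains may affect structural stability and may take part in protein-protein interactions or cell-cycle regulation; if a functional domain mutates, however, the resulting abnormal protein structure can cause disease. The main goal of this study is to improve the correct identification of boundaries between functional domains through alignment analysis against a protein domain sequence database combined with machine learning techniques, in the hope that the identification results will strengthen studies of protein structural stability, protein function annotation, protein domain interactions, and biological evolution. The system collects known protein sequences comprehensively and builds a protein domain database; it statistically analyzes the amino acid pair composition of interdomain boundary sequences, the boundary sequence lengths, and the distribution of secondary structure elements, then uses these data for feature training and to design a method that identifies functional domain boundaries automatically.
This study first builds a basic sequence database of protein domains. The known domain sequences come from the protein families database (Pfam): from the expert-curated Pfam-A dataset, representative sequences of 12,273 different functional domains were selected as the base database for sequence alignment. When the first stage cannot detect the functional domain positions of a query sequence by conventional sequence alignment, a secondary structure prediction tool is used for structural prediction analysis; together with statistical information on known functional domains, the probabilities of amino acid pair combinations at domain boundaries, and domain lengths as features, a support vector machine classifier then automatically identifies the protein functional domain boundaries. In experiments on 1868 protein sequences, automatic determination of domain boundary positions reached an accuracy of 86%. This automatic domain boundary identification system can give biologists a practically useful reference before they design biological experiments.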

A protein is composed of at least one functional domain, which is considered a fundamental evolutionary unit. Most proteins contain multiple domains, which arise through gene duplication events and are likely shaped by selective pressure during evolution. Different domain combinations are involved in protein structure stability, protein-protein interactions, and cell-cycle regulation. However, mutations in functional domains may result in abnormal protein structures and lead to serious diseases. The main goal of this study is to combine protein sequence alignment with machine learning to improve interdomain boundary detection. Accurate domain boundary detection could provide useful information for studies of protein structure stability, functional annotation, protein domain interaction, and evolutionary biology. In this study, we collected a comprehensive set of protein sequences and their corresponding domain annotations as training datasets. Features based on amino acid pairs, interdomain boundary length, and the distribution of secondary structure elements were analyzed and used to train a model that identifies protein domain boundary locations automatically.
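
The amino acid pair feature can be illustrated with a minimal sketch. The abstract does not give the exact statistic, so the log-odds form, the pseudocount, and the function names `pair_counts`, `pair_propensity`, and `window_score` below are assumptions for illustration only, not the thesis's definitions.

```python
import math
from collections import Counter

def pair_counts(segments):
    """Count adjacent amino acid pairs (dipeptides) over a list of sequences."""
    counts = Counter()
    for seq in segments:
        for a, b in zip(seq, seq[1:]):
            counts[a + b] += 1
    return counts

def pair_propensity(boundary_segments, domain_segments, pseudocount=1.0):
    """Log-odds of each pair occurring in boundary versus domain regions.

    Positive values mean the pair is relatively enriched in boundaries.
    The log-odds form and the pseudocount are illustrative assumptions.
    """
    b = pair_counts(boundary_segments)
    d = pair_counts(domain_segments)
    pairs = set(b) | set(d)
    b_total = sum(b.values()) + pseudocount * len(pairs)
    d_total = sum(d.values()) + pseudocount * len(pairs)
    return {
        p: math.log(((b[p] + pseudocount) / b_total)
                    / ((d[p] + pseudocount) / d_total))
        for p in pairs
    }

def window_score(window, propensity):
    """Mean propensity of the pairs in a window; unseen pairs score 0."""
    pairs = [a + b for a, b in zip(window, window[1:])]
    if not pairs:
        return 0.0
    return sum(propensity.get(p, 0.0) for p in pairs) / len(pairs)

# Toy data: made-up linker-like and domain-like fragments.
propensity = pair_propensity(["GSGS", "PGSP"], ["LLVV", "AVLL"])
```

A score like `window_score` could serve as one numeric feature per sequence position, alongside boundary length and secondary structure features.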
In this thesis, a protein sequence database containing protein domain annotations was established for sequence alignment. The known protein domain sequences are derived from the Pfam database, where the expert-curated Pfam-A set provides 12,273 representative sequences for different domain annotations. If the domains of a query protein cannot be located through sequence alignment in the first stage, a secondary structure prediction tool is applied as an alternative approach. Integrating the statistical characteristics of domains, the occurrence frequencies of amino acid pairs, and the length distributions of domain boundaries, we employed a support vector machine classifier to identify protein domain boundaries. The proposed system achieved a precision of 86% on a testing set of 1868 proteins, showing that it can automatically detect interdomain boundaries in an unknown protein sequence. This identification system offers biologists a practical reference before they design biological experiments.
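
The two-stage flow described above (alignment first, classifier as fallback) might be organized roughly as follows. This is a sketch under stated assumptions: `detect_boundaries`, `align`, `featurize`, and `classify` are hypothetical names, and the toy stand-ins below replace the real Pfam-A alignment, secondary structure prediction, and trained SVM, none of which are specified in the abstract.

```python
from typing import Callable, List, Optional, Sequence

def detect_boundaries(
    sequence: str,
    align: Callable[[str], Optional[List[int]]],
    featurize: Callable[[str, int], Sequence[float]],
    classify: Callable[[Sequence[float]], bool],
) -> List[int]:
    """Return predicted boundary positions for one protein sequence.

    Stage 1: `align` returns boundary positions, or None when alignment
    cannot locate the domains (hypothetical convention).
    Stage 2: otherwise every position is featurized and passed to the
    classifier, and positions classified as boundary are kept.
    """
    hits = align(sequence)
    if hits is not None:
        return sorted(hits)
    return [i for i in range(len(sequence))
            if classify(featurize(sequence, i))]

# Toy stand-ins (assumptions): alignment always fails, and a position is
# called a boundary when its residue is G, S, or P.
def no_hit(seq):
    return None

def toy_features(seq, i):
    return [1.0 if seq[i] in "GSP" else 0.0]

def toy_classifier(x):
    return x[0] > 0.5

result = detect_boundaries("AAGAA", no_hit, toy_features, toy_classifier)
```

Passing the stages in as callables keeps the control flow separate from any particular alignment tool or SVM library.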

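Chinese Abstract
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
1. Research Background
2. Methods
2.1. Data Collection
2.2. Defining Protein Functional Domain Boundaries
2.3. Protein Sequence Analysis
2.3.1. Amino Acid Pair Propensity Analysis
2.3.2. Feature Statistics of Functional Domains and Domain Boundaries
2.3.3. Secondary Structure Characteristics
2.4. Domain Boundary Length Analysis
2.5. Protein Functional Domain Boundary Detection
2.6. Machine Learning and Training Methods
2.6.1. Feature Selection
2.7. Evaluation of the Method
3. Results
4. Conclusion and Discussion
References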
