National Digital Library of Theses and Dissertations in Taiwan

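Author: Ya-Ping Chen (陳雅萍)
Title: Analyzing the Impacts of Sequence Conservation on Protein RNA-binding Residue Prediction (分析序列保留性對蛋白質中核糖核酸結合殘基預測之影響)
Advisor: 黃乾綱
Committee members: 歐陽彥正, 張瑞益, 陳倩瑜
Oral defense date: 2011-06-30
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Engineering Science and Ocean Engineering
Discipline: Engineering
Field: General Engineering
Thesis type: Academic thesis
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Pages: 60
Keywords: predicting RNA-binding residues; conserved regions; machine learning; sequence identities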
Protein-RNA interactions play a vital role in many stages of gene expression, such as pre-mRNA synthesis, mRNA splicing, and translation. It is generally believed that RNA-binding proteins recognize their target RNA through binding domains or binding motifs. Because the nucleic acid types involved and the structural levels recognized are highly diverse, predicting RNA-binding residues from the primary structure of a protein is a challenging task.

In this thesis, we extend the ProteRNA prediction method and develop two classifiers, a support vector machine (SVM) and a random forest (RF), adding predicted protein disorder as a new feature. For post-processing with sequence-conserved regions derived from WildSpan patterns, we build a discriminator that improves pattern quality by distinguishing RNA-binding residues from other functionally important residues in conserved regions. After accounting for the effects of dataset preparation and for variation in binding sites, the two classifiers achieve Matthews correlation coefficients (MCC) of 0.5288 and 0.4698, respectively, in five-fold cross-validation. On the independent test dataset, the SVM model outperforms the other predictors that provide an online service, achieving an accuracy of 92.12%, sensitivity of 38.10%, specificity of 97.47%, precision of 59.89%, F-score of 0.4657, and MCC of 0.4381. The RF model ranks second, with an accuracy of 90.08%, sensitivity of 34.47%, specificity of 95.59%, precision of 43.62%, F-score of 0.3851, and MCC of 0.3346.

We examine how the performance of the machine learning methods varies across datasets built at different sequence identity levels, and discuss the sources of performance gains and bottlenecks. We find that homologous sequences, and even remote homologs with lower identity, that are present in the same dataset as the query sequence can influence the prediction and bring the result closer to the true distribution of binding sites. In addition, a method that finds the nearest sequence in the dataset by alignment and assigns binding residues accordingly can, in some cases, outperform machine learning methods whose feature vectors are position-specific scoring matrices (PSSMs). Nevertheless, when faced with novel protein sequences in the independent test, the machine learning methods perform well, which demonstrates their generalization ability.
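The evaluation measures quoted in the abstract follow the usual confusion-matrix definitions. The sketch below is illustrative only: it assumes the standard textbook formulas (the thesis's own definitions in its evaluation chapter are not shown here), the function names `binary_metrics` and `f_score_from` are hypothetical, and the counts passed to `binary_metrics` are made-up numbers rather than thesis data. The second helper checks that the reported F-scores agree with the reported precision and sensitivity values.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts.

    Assumes every denominator is nonzero (no guard against empty classes).
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


def f_score_from(precision, sensitivity):
    """F-score as the harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts, for illustration only (not taken from the thesis).
example = binary_metrics(tp=40, fp=10, tn=930, fn=20)

# Consistency check on the values reported in the abstract.
svm_f = f_score_from(precision=0.5989, sensitivity=0.3810)
rf_f = f_score_from(precision=0.4362, sensitivity=0.3447)
print(round(svm_f, 4), round(rf_f, 4))
```

Both recomputed F-scores round to the figures given in the abstract (0.4657 and 0.3851). Note that with binding residues forming a small minority of all residues, accuracy and specificity stay high even at modest sensitivity, which is why MCC and F-score are the more informative measures here.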

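Acknowledgments
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Literature Review
2.1 Central Dogma
2.2 Related Work
2.3 BLASTClust
2.4 Tools for Extracting Information from Amino Acid Sequences
2.4.1 PSI-BLAST
2.4.2 PSIPRED
2.4.3 DISOPRED
2.5 Random Forests
2.6 Support Vector Machines
2.7 WildSpan
Chapter 3 Materials and Methods
3.1 Datasets
3.2 Feature Vector Encoding
3.3 Post-processing with WildSpan Patterns
3.4 System Architecture
3.5 Classifier Performance Evaluation
Chapter 4 Experimental Results
4.1 Parameter Optimization
4.2 Incorporating WildSpan Patterns
4.3 Changing the Dataset Selection Method
4.4 Labeling Binding Sites by Majority Vote
4.5 Independent Test
4.6 Sequence Identity and Predictive Performance
4.6.1 Discussion of the Cross-Validation Datasets
4.6.2 Discussion of the Training Sets for the Independent Test
4.7 Matching Against the Nearest Sequence in the Dataset
Chapter 5 Conclusion
References
Appendix A: Sequence Identity Dendrograms
Appendix B: List of RB33 and RB301 Sequences
Appendix C: Supplementary Prediction Performance Data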

