臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Author: 陳彥宏 (Yen-Hong Chen)
Thesis title: Predicting hub proteins using protein sequence, protein structure and physicochemical properties (使用蛋白質序列、結構與物理化學性質預測 Hub 蛋白質)
Advisor: 李國彬 (Kuo-Bin Li)
Degree: Master's
Institution: National Yang-Ming University (國立陽明大學)
Department: Institute of Biomedical Informatics (生物醫學資訊研究所)
Discipline: Life Sciences
Field: Biochemistry
Thesis type: Academic thesis
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: Chinese
Number of pages: 60
Keywords: protein; protein-protein interaction; Hub protein; protein network
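Abstract (translated from the Chinese): Hub proteins are proteins with a large number of protein-protein interaction partners, and they have potential as drug targets for many diseases and cancers. This study develops a method for predicting hub proteins from protein sequence, using the random forest algorithm in the R package caret. We first collected protein sequences and protein-protein interaction data from the HPRD database as the training and test datasets, defining proteins with more than 10 interaction partners as hub proteins and proteins with only one interaction partner as end proteins. We trained prediction models on three feature sets: (i) a structure-based protein structural feature set, which includes intrinsic protein disorder, regarded in earlier studies as a characteristic of hub proteins; (ii) a sequence-based feature set, comprising amino acid composition, dipeptide composition, and pseudo-amino acid composition; and (iii) protein physicochemical properties, a collection of 544 protein physicochemical property descriptors derived from AAindex. We used random forest-based recursive feature elimination (RF-RFE) to optimize the features used by the prediction model. After random forest parameter selection, the final model achieved an area under the ROC curve (AUROC) of 0.77 in 10-fold cross-validation and 0.76 in independent testing. Finally, we analyzed the relative importance of each feature in the classification to gain new knowledge about hub proteins. The prediction model is released as a web tool at http://bsaltools.ym.edu.tw/predHub.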
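The hub/end labeling rule described in the abstract can be sketched in a few lines. The thesis itself worked in R with caret; the Python below is only an illustration, and the function names, the pair-list input format, and the choice to ignore self-interactions are assumptions rather than details from the thesis. The hub cutoff is left as an explicit parameter because the Chinese text says "more than 10" partners while the English abstract says "ten or more".

```python
from collections import defaultdict


def count_partners(interactions):
    """Map each protein to the set of its distinct interaction partners."""
    partners = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # assumption: a self-interaction is not counted as a partner
        partners[a].add(b)
        partners[b].add(a)
    return partners


def label_proteins(interactions, hub_min_partners):
    """Label proteins as 'hub' or 'end'; proteins in between get no label."""
    labels = {}
    for protein, others in count_partners(interactions).items():
        degree = len(others)
        if degree >= hub_min_partners:
            labels[protein] = "hub"
        elif degree == 1:
            labels[protein] = "end"
    return labels


# Toy network with a deliberately small cutoff so the example stays readable.
toy_pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
toy_labels = label_proteins(toy_pairs, hub_min_partners=3)
print(toy_labels)  # {'A': 'hub', 'D': 'end'}
```

With real data, `hub_min_partners=11` would match "more than 10" and `hub_min_partners=10` would match "ten or more". Proteins with an intermediate number of partners (B and C in the toy network) are excluded from both classes.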
Abstract (English): Hub proteins are proteins with a large number of partners in a protein-protein interaction network. They are often regarded as potential drug targets for diseases such as cancers. This thesis describes a hub protein prediction tool built with machine learning techniques, specifically the random forest implemented in the ‘caret’ R package. The training protein sequences were collected from the Human Protein Reference Database (HPRD). Proteins with ten or more interaction partners are labeled as hub proteins, and those with exactly one interaction partner are labeled as end proteins. Three types of feature sets were used in this study: (i) structure-based features supported by earlier studies, for example, intrinsic disorder regions and protein functional domains; (ii) sequence-based features, including amino acid composition, dipeptide composition and pseudo-amino acid composition; (iii) physicochemical features, in which the 20 amino acid composition values of a protein are replaced by a single numerical value: the sum of a specific amino acid physicochemical property (taken from the AAindex database) weighted by the 20 composition values. The Random Forest Recursive Feature Elimination (RF-RFE) technique was used to select the optimal features from the combination of the three feature types. The final predictor achieves areas under the receiver operating characteristic (ROC) curve of 0.77 and 0.76 in 10-fold cross-validation and an independent test, respectively. Furthermore, we demonstrate that the proposed hub protein predictor and the selected features suggest new insights into hub and end protein classification. Our prediction tool is freely accessible at http://bsaltools.ym.edu.tw/predHub.
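To make the composition-based features concrete, here is a small sketch of amino acid composition, dipeptide composition, and the composition-weighted physicochemical value from item (iii). It is written in Python for illustration only (the thesis used R); the function names are hypothetical, the handling of non-standard residues is an assumption, and `TOY_INDEX` is a made-up property table, not an entry from AAindex. Pseudo-amino acid composition is not covered here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    # Assumption: non-standard residue codes (e.g. X) are simply ignored.
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: residues.count(aa) / len(residues) for aa in AMINO_ACIDS}


def dipeptide_composition(sequence):
    """Fraction of each of the 400 ordered residue pairs among adjacent pairs."""
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for a, b in zip(seq, seq[1:]):
        # Assumption: a pair is skipped if either residue is non-standard.
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            counts[a + b] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return {pair: n / total for pair, n in counts.items()}


def weighted_property(composition, index):
    """Sum of one per-residue property, weighted by the 20 composition values."""
    return sum(composition[aa] * index[aa] for aa in AMINO_ACIDS)


# Made-up property table for illustration only; NOT taken from AAindex.
TOY_INDEX = {aa: (1.0 if aa in "AILMFVW" else 0.0) for aa in AMINO_ACIDS}

aac = amino_acid_composition("AAKK")
dpc = dipeptide_composition("AAKK")
value = weighted_property(aac, TOY_INDEX)
print(aac["A"], aac["K"])  # 0.5 0.5
print(len(dpc))            # 400
print(value)               # 0.5
```

Repeating `weighted_property` once per property table would give one feature per physicochemical index, which is how a single protein can end up with hundreds of such descriptors.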
Acknowledgements i
Abstract ii
Chinese Abstract iv
Table of Contents v
List of Tables viii
List of Figures ix
Chapter 1 Introduction 1
Hub proteins in protein-protein interaction networks 1
Characteristics of hub proteins 2
Previous methods for identifying hub proteins 3
Random forest algorithm 4
Research objectives 5
Chapter 2 Methods and Data Sources 7
Data sources 7
Removal of homologous protein sequences 8
Amino acid sequence features 8
Amino acid composition (AAC) 8
Pseudo-amino acid composition (PseAAC) 8
Dipeptide composition 9
Protein structural features 9
Intrinsic disorder regions (IDR) 9
Protein domain 10
Low-complexity regions (LCRs) 10
Molecular recognition features (MoRFs) 11
Protein length 11
Molecular weight (MW) 12
Secondary structure and relative solvent accessibility 12
Physicochemical properties 13
Predictive modeling 14
Random forest 14
Parameter selection 15
Feature importance 15
Recursive feature elimination (RFE) 17
Pipeline 18
Performance measurement 21
Chapter 3 Results 23
Datasets 23
Comparing the ability of each feature set to predict hub proteins 24
Protein sequence features 24
Protein structural features 25
Protein physicochemical properties 26
Training an all-feature prediction model with parameter and feature optimization 27
Insights into hub protein characteristics from feature importance rankings 30
Chapter 4 Discussion 42
Comparison of predictive performance with a Gene Ontology-based hub protein prediction method 42
Definition of hub proteins 43
Performance of prediction models for the three feature sets 44
Analyzing hub protein characteristics through feature importance 46
Limitations 50
Chapter 5 Conclusion 53
References 54