National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

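Author: 詹宗翰
Author (English): Zong-Han Chan
Thesis title (Chinese): 使用集成模型預測不平衡數據集的蛋白質N-醣基化位點
Thesis title (English): Predicting Protein N-glycosylation Sites on Imbalanced Data Sets by Using Ensemble Models
Advisor: 朱彥煒
Advisor (English): Yen-Wei Chu
Oral defense committee: 謝立青, 董其樺
Oral defense committee (English): Li-Ching Hsieh, Chi-Hua Tung
Oral defense date: 2020-06-22
Degree: Master's
Institution: National Chung Hsing University (國立中興大學)
Department: Institute of Genomics and Bioinformatics (基因體暨生物資訊學研究所)
Discipline: Life Sciences
Field: Bioinformatics
Thesis type: Academic thesis
Publication year: 2021
Graduation academic year: 109 (ROC calendar)
Language: Chinese
Pages: 46
Keywords (Chinese): N-linked醣基化; 樣本不平衡; 機器學習; 集成式模型; XGBoost
Keywords (English): N-linked glycosylation; Imbalanced data; Machine learning; Ensemble model; XGBoost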
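Abstract (translated from Chinese): Glycosylation is the enzyme-controlled process of attaching carbohydrates to proteins or lipids, and it is one of the most common and abundant post-translational modifications across all domains of life. Within cells, glycosylation affects protein folding, antigen recognition, cell-to-cell signaling, gene expression, and metabolic control. N-linked glycosylation is involved in many biological functions and is the most heavily represented type in glycosylation data, so this study selected human N-linked glycosylation as the target for developing a prediction tool. The training data, however, suffer from class imbalance. To address this, the study builds an N-linked glycosylation predictor using ensemble learning. Information from the 21 amino acids surrounding each glycosylation site is encoded with 11 feature encodings in three categories (sequence, structure, and function), and XGBoost, the best-performing ensemble model, was selected for the final model. Feature importance was examined by removing features and observing the effect on accuracy. On the independent test data the model reaches an MCC of 0.743, outperforming other tools. A mouse N-linked glycosylation data set was used as a second independent test set to assess prediction accuracy across species, yielding an MCC of 0.901. The result is a fast and accurate glycosylation site prediction tool for researchers in related fields.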
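As a rough illustration of the windowing step, the sketch below cuts a 21-residue fragment around every asparagine (N) in a protein sequence. The centered layout (10 residues on each side), the `-` padding character, and the function name `extract_windows` are assumptions made for this example; the abstract only states that the 21 amino acids around each site are encoded.

```python
def extract_windows(sequence, flank=10, pad="-"):
    """Return (1-based position, fragment) pairs for every Asn (N) in `sequence`.

    Each fragment has length 2 * flank + 1 with the candidate site in the
    middle. Positions near the termini are padded with `pad` (an assumed
    convention, not taken from the thesis).
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "N":
            # Index i in `sequence` sits at index i + flank in `padded`,
            # so the slice starting at i is centered on the site.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


# Toy sequence with a single candidate site at position 3.
fragments = extract_windows("MKNLT")
for position, fragment in fragments:
    print(position, fragment, len(fragment))
```

Each fragment would then be handed to the feature encoders; which residues count as positive or negative sites depends on the experimental annotations, which this sketch does not model.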
Abstract (English): Glycosylation is the enzymatic process of attaching carbohydrates to proteins or lipids. It is one of the most common and abundant post-translational modifications (PTMs) across biological domains. Among glycosylation types, N-linked glycosylation is involved in many biological functions and has the most available data. In this study, we chose human N-linked glycosylation as the target for developing a prediction tool. In the data set, positive and negative samples are severely imbalanced. To address this problem, this study uses five ensemble models from machine learning to construct N-linked glycosylation prediction tools. Protein sequence fragments are encoded with a total of 11 feature encodings based on the sequence, structure, and function characteristics of amino acids. The XGBoost ensemble model, which gave the best performance, was selected for the final model. This study also explores feature importance through individual feature elimination, in order to reduce the time complexity of model training and further improve prediction performance. Experimentally verified data from different databases are used as the test data set and for comparison with other tools. In the independent test, our model (MCC = 0.743) outperforms the other tools. Using a mouse N-linked glycosylation data set as a second independent test set to evaluate prediction accuracy across species, the MCC reached 0.901. We have therefore developed a fast and accurate glycosylation site prediction tool for use by researchers in related fields.
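The abstract reports performance as the Matthews correlation coefficient (MCC), a metric often preferred over accuracy when classes are imbalanced. The sketch below applies the standard MCC formula to two hypothetical confusion matrices; the counts are invented for illustration and are not results from the thesis.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical imbalanced test set: 10 positive sites, 90 negative sites.
all_negative = dict(tp=0, tn=90, fp=0, fn=10)   # predicts "negative" for everything
reasonable = dict(tp=8, tn=85, fp=5, fn=2)      # finds most positives

for name, counts in (("all-negative", all_negative), ("reasonable", reasonable)):
    print(name, round(accuracy(**counts), 3), round(mcc(**counts), 3))
```

The all-negative classifier reaches 0.9 accuracy while its MCC is 0, which is why a metric that uses all four cells of the confusion matrix is the more informative choice here.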
Chinese Abstract
Abstract
Content of Figures
Content of Tables
1. Introduction
2. Related Works
2.1 Comparative Prediction Tools
2.1.1 NetNGlyc 1.0
2.1.2 GPP
2.1.3 GlycoEP
2.1.4 GlycoPP
2.2 Databases and Tools
2.2.1 UniProt
2.2.2 dbPTM
2.2.3 O-GlycBase v6.00
2.2.4 iLearn
2.2.5 WebLogo 3
2.2.6 Pse-in-One 2.0
2.2.7 NetSurfP-2.0
2.2.8 SignalP-5.0
2.2.9 WEKA
2.2.10 Ensemble Strategies and Models
2.2.10.1 Bagging
2.2.10.2 Boosting
2.2.10.3 Gradient Boosting
2.2.10.4 XGBoost
3. Materials and Methods
3.1 Data Preparation
3.1.1 Training and Testing Dataset
3.1.2 Independent Set A
3.1.3 Independent Set B
3.2 Ensemble Learning
3.3 Feature Encoding
3.3.1 Sequence Based Features
3.3.1.1 iLearn
3.3.1.1.1 Binary
3.3.1.1.2 AAindex
3.3.1.1.3 AAC
3.3.1.1.4 Composition of k-spaced Amino Acid Pairs (CKSAAP)
3.3.1.2 Pse-in-One 2.0
3.3.1.2.1 Basic Kmer
3.3.1.2.2 Parallel Correlation Pseudo Amino Acid Composition (PC-PseAAC)
3.3.1.2.3 Series Correlation Pseudo Amino Acid Composition (SC-PseAAC)
3.3.1.3 WebLogo 3
3.3.1.4 Motif Encoding
3.3.2 Structure Based Features
3.3.2.1 NetSurfP-2.0
3.3.2.1.1 Relative/Absolute Surface Accessibility
3.3.2.1.2 Secondary Structure
3.3.3 Functional Based Features
3.3.3.1 SignalP-5.0
3.4 Model Evaluation
3.4.1 Accuracy (ACC)
3.4.2 F-Measure (F1)
3.4.3 Sensitivity (Sn) & Specificity (Sp)
3.4.4 Matthews Correlation Coefficient (MCC)
3.4.5 K-fold Cross-Validation
3.5 Feature Selection
4. Result and Discussion
4.1 Comparison of Machine Learning Algorithms
4.2 Feature Analysis
4.3 Performance of Independent Test Data
4.4 Case Study
4.4.1 Species Testing
4.5 Structural Analysis for Our Model
5. Conclusion
6. Reference
[1] Stanley P. Golgi glycosylation. Cold Spring Harb Perspect Biol. 2011 Apr 1;3(4):a005199. doi: 10.1101/cshperspect.a005199.
[2] Gavel Y, von Heijne G. Sequence differences between glycosylated and non-glycosylated Asn-X-Thr/Ser acceptor sites: implications for protein engineering. Protein Eng. 1990 Apr;3(5):433-42. doi: 10.1093/protein/3.5.433.
[3] Kowarik M, Young NM, Numao S, Schulz BL, Hug I, Callewaert N, Mills DC, Watson DC, Hernandez M, Kelly JF, Wacker M, Aebi M. Definition of the bacterial N-glycosylation site consensus sequence. EMBO J. 2006 May 3;25(9):1957-66. doi: 10.1038/sj.emboj.7601087.
[4] Thanka Christlet TH, Veluraja K. Database analysis of O-glycosylation sites in proteins. Biophys J. 2001 Feb;80(2):952-60. doi: 10.1016/s0006-3495(01)76074-2.
[5] Krieg J, Hartmann S, Vicentini A, Gläsner W, Hess D, Hofsteenge J. Recognition signal for C-mannosylation of Trp-7 in RNase 2 consists of sequence Trp-x-x-Trp. Mol Biol Cell. 1998 Feb;9(2):301-9. doi: 10.1091/mbc.9.2.301.
[6] Gupta SK, Shukla P. Glycosylation control technologies for recombinant therapeutic proteins. Appl Microbiol Biotechnol. 2018 Dec;102(24):10457-10468. doi: 10.1007/s00253-018-9430-6.
[7] Hossler P. Protein glycosylation control in mammalian cell culture: past precedents and contemporary prospects. Adv Biochem Eng Biotechnol. 2012;127:187-219. doi: 10.1007/10_2011_113.
[8] Jimenez del Val I, Nagy JM, Kontoravdi C. A dynamic mathematical model for monoclonal antibody N-linked glycosylation and nucleotide sugar donor transport within a maturing Golgi apparatus. Biotechnol Prog. 2011 Nov-Dec;27(6):1730-43. doi: 10.1002/btpr.688.
[9] Kremkow BG, Lee KH. Glyco-Mapper: A Chinese hamster ovary (CHO) genome-specific glycosylation prediction tool. Metab Eng. 2018 May;47:134-142. doi: 10.1016/j.ymben.2018.03.002.
[10] McDonald AG, Hayes JM, Bezak T, Głuchowska SA, Cosgrave EF, Struwe WB, Stroop CJ, Kok H, van de Laar T, Rudd PM, Tipton KF, Davey GP. Galactosyltransferase 4 is a major control point for glycan branching in N-linked glycosylation. J Cell Sci. 2014 Dec 1;127(Pt 23):5014-26. doi: 10.1242/jcs.151878.
[11] Medlock GL, Papin JA. Guiding the Refinement of Biochemical Knowledgebases with Ensembles of Metabolic Networks and Machine Learning. Cell Syst. 2020 Jan 22;10(1):109-119.e3. doi: 10.1016/j.cels.2019.11.006.
[12] Hahm YH, Hahm SH, Jo HY, Ahn YH. Comparative Glycopeptide Analysis for Protein Glycosylation by Liquid Chromatography and Tandem Mass Spectrometry: Variation in Glycosylation Patterns of Site-Directed Mutagenized Glycoprotein. Int J Anal Chem. 2018 Sep 2;2018:8605021. doi: 10.1155/2018/8605021.
[13] Kotidis P, Kontoravdi C. Harnessing the potential of artificial neural networks for predicting protein glycosylation. Metab Eng Commun. 2020 May 15;10:e00131. doi: 10.1016/j.mec.2020.e00131.
[14] Gupta R, Jung E, Brunak S. Prediction of N-glycosylation sites in human proteins. 2004. http://www.cbs.dtu.dk/services/NetNGlyc/
[15] Hamby SE, Hirst JD. Prediction of glycosylation sites using random forests. BMC Bioinformatics. 2008 Nov 27;9:500. doi: 10.1186/1471-2105-9-500.
[16] Chauhan JS, Rao A, Raghava GP. In silico platform for prediction of N-, O- and C-glycosites in eukaryotic protein sequences. PLoS One. 2013 Jun 28;8(6):e67008. doi: 10.1371/journal.pone.0067008.
[17] Chauhan JS, Bhat AH, Raghava GP, Rao A. GlycoPP: a webserver for prediction of N- and O-glycosites in prokaryotic protein sequences. PLoS One. 2012;7(7):e40155. doi: 10.1371/journal.pone.0040155.
[18] Naderi-Manesh H, Sadeghi M, Arab S, Moosavi Movahedi AA. Prediction of protein surface accessibility with information theory. Proteins. 2001 Mar 1;42(4):452-9. doi: 10.1002/1097-0134(20010301)42:4<452::aid-prot40>3.0.co;2-q.
[19] Senger RS, Karim MN. Prediction of N-linked glycan branching patterns using artificial neural networks. Math Biosci. 2008 Jan;211(1):89-104. doi: 10.1016/j.mbs.2007.10.005.
[20] Ho T. Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Quebec, Canada. 1995, pp. 278. doi: 10.1109/ICDAR.1995.598994.
[21] Pirooznia M, Deng Y. SVM Classifier - a comprehensive java interface for support vector machine classification of microarray data. BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S25. doi: 10.1186/1471-2105-7-S4-S25.
[22] Wang S, Yao X. Diversity analysis on imbalanced data sets by using ensemble models. IEEE. 2009, pp. 324-331. doi: 10.1109/CIDM.2009.4938667.
[23] Nguyen D, Stutz R, Schorr S, Lang S, Pfeffer S, Freeze HH, Förster F, Helms V, Dudek J, Zimmermann R. Proteomics reveals signal peptide features determining the client specificity in human TRAP-dependent ER protein import. Nat Commun. 2018 Sep 14;9(1):3765. doi: 10.1038/s41467-018-06188-z.
[24] Almagro Armenteros JJ, Tsirigos KD, Sønderby CK, Petersen TN, Winther O, Brunak S, von Heijne G, Nielsen H. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat Biotechnol. 2019 Apr;37(4):420-423. doi: 10.1038/s41587-019-0036-z.
[25] Braga PL, Oliveira ALI, Ribeiro GHT, Meira SRL. Bagging Predictors for Estimation of Software Project Effort. IEEE. 2007, pp. 1595-1600. doi: 10.1109/IJCNN.2007.4371196.
[26] Eibl G, Pfeiffer KP. How to Make AdaBoost.M1 Work for Weak Base Classifiers by Changing Only One Line of the Code. Machine Learning: ECML. 2002, pp. 72–83. doi: 10.1007/3-540-36755-1_7.
[27] Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, pp. 785–794. doi: 10.1145/2939672.2939785.
[28] Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH. dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D622-7. doi: 10.1093/nar/gkj083.
[29] Lu CT, Huang KY, Su MG, Lee TY, Bretaña NA, Chang WC, Chen YJ, Chen YJ, Huang HD. DbPTM 3.0: an informative resource for investigating substrate site specificity and functional association of protein post-translational modifications. Nucleic Acids Res. 2013 Jan;41(Database issue):D295-305. doi: 10.1093/nar/gks1229.
[30] Huang KY, Su MG, Kao HJ, Hsieh YC, Jhong JH, Cheng KH, Huang HD, Lee TY. dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins. Nucleic Acids Res. 2016 Jan 4;44(D1):D435-46. doi: 10.1093/nar/gkv1240.
[31] Gupta R, Brunak S. Prediction of glycosylation across the human proteome and the correlation to protein function. Pac Symp Biocomput. 2002:310–322.
[32] The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017 Jan 4;45(D1):D158-D169. doi: 10.1093/nar/gkw1099.
[33] Boughorbel S, Jarray F, El-Anbari M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS One. 2017 Jun 2;12(6):e0177678. doi: 10.1371/journal.pone.0177678.
[34] Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003 Jan 1;31(1):365-70. doi: 10.1093/nar/gkg095.
[35] Bhat AH, Mondal H, Chauhan JS, Raghava GP, Methi A, Rao A. ProGlycProt: a repository of experimentally characterized prokaryotic glycoproteins. Nucleic Acids Res. 2012 Jan;40(Database issue):D388-93. doi: 10.1093/nar/gkr911.
[36] Choudhary P, Nagar R, Singh V, Bhat AH, Sharma Y, Rao A. ProGlycProt V2.0, a repository of experimentally validated glycoproteins and protein glycosyltransferases of prokaryotes. Glycobiology. 2019 Jun 1;29(6):461-468. doi: 10.1093/glycob/cwz013.
[37] Boutet E, Lieberherr D, Tognolli M, Schneider M, Bansal P, Bridge AJ, Poux S, Bougueleret L, Xenarios I. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. Methods Mol Biol. 2016;1374:23-54. doi: 10.1007/978-1-4939-3167-5_2.
[38] Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007 May 15;23(10):1282-8. doi: 10.1093/bioinformatics/btm098.
[39] Leinonen R, Diez FG, Binns D, Fleischmann W, Lopez R, Apweiler R. UniProt archive. Bioinformatics. 2004 Nov 22;20(17):3236-7. doi: 10.1093/bioinformatics/bth191.
[40] Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, Zhu Y, Powell DR, Akutsu T, Webb GI, Chou KC, Smith AI, Daly RJ, Li J, Song J. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2020 May 21;21(3):1047-1057. doi: 10.1093/bib/bbz041.
[41] Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004 Jun;14(6):1188-90. doi: 10.1101/gr.849004.
[42] Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015 Jul 1;43(W1):W65-71. doi: 10.1093/nar/gkv458.
[43] Klausen MS, Jespersen MC, Nielsen H, Jensen KK, Jurtz VI, Sønderby CK, Sommer MOA, Winther O, Nielsen M, Petersen B, Marcatili P. NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins. 2019 Jun;87(6):520-527. doi: 10.1002/prot.25674.
[44] Pancsa R, Tompa P. Structural disorder in eukaryotes. PLoS One. 2012;7(4):e34687. doi: 10.1371/journal.pone.0034687.
[45] Wood MJ, Hirst JD. Protein secondary structure prediction with dihedral angles. Proteins. 2005 May 15;59(3):476-81. doi: 10.1002/prot.20435.
[46] Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE. A survey of deep neural network architectures and their applications. Neurocomputing. 2017;234:11-26.
[47] Izard JW, Kendall DA. Signal peptides: exquisitely designed transport promoters. Mol Microbiol. 1994 Sep;13(5):765-73. doi: 10.1111/j.1365-2958.1994.tb00469.x.
[48] Holmes G, Donkin A, Witten IH. WEKA: a machine learning workbench. In: Proceedings of ANZIIS '94 - Australian New Zealand Intelligent Information Systems Conference. 1994 Nov, pp. 357–361. doi: 10.1109/ANZIIS.1994.396988.
[49] Breiman L. Bagging predictors. Mach Learn. 1996 Aug;24(2):123–140. doi: 10.1007/BF00058655.
[50] Oza NC. Online bagging and boosting. IEEE. 2005, vol. 3, pp. 2340-2345. doi: 10.1109/ICSMC.2005.1571498.
[51] Ye J, Chow JH, Chen J, Zheng Z. Stochastic gradient boosted distributed decision trees. In: 18th ACM Conference on Information and Knowledge Management. 2009, pp. 2061–2064. doi: 10.1145/1645953.1646301.
[52] Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobot. 2013 Dec 4;7:21. doi: 10.3389/fnbot.2013.00021.
[53] Razi MA, Athappilly K. A comparative predictive analysis of neural networks (NNs), nonlinear regression and classification and regression tree (CART) models. Expert Syst Appl. 2005 Jul;29(1):65–74. doi: 10.1016/j.eswa.2005.01.006.
[54] Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 2000 Jan 1;28(1):374. doi: 10.1093/nar/28.1.374.
[55] Cedano J, Aloy P, Pérez-Pons JA, Querol E. Relation between amino acid composition and cellular location of proteins. J Mol Biol. 1997 Feb 28;266(3):594-600. doi: 10.1006/jmbi.1996.0804.
[56] Chen YZ, Tang YR, Sheng ZY, Zhang Z. Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC Bioinformatics. 2008 Feb 18;9:101. doi: 10.1186/1471-2105-9-101.
[57] Liu B, Wang X, Lin L, Dong Q, Wang X. A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis. BMC Bioinformatics. 2008 Dec 1;9:510. doi: 10.1186/1471-2105-9-510.