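Author: Bongani Brian Dlamini (彭家尼)
Thesis title: Identifying Major Exosomal Protein Biomarkers using the Idea of Transfer Learning with Bidirectional Encoder Representations from Transformers
Thesis title (Chinese): 使用遷移學習與語言模型來鑑別外泌體蛋白之功能類別
Advisor: Dr. Yu-Yen Ou (歐昱言)
Oral defense committee: Nguyen Quoc Khanh Le (黎阮國慶), Chin-Lueh Chang (張經略)
Oral defense date: 2020-07-15
Degree: Master's
Institution: Yuan Ze University (元智大學)
Department: Graduate Program in Biomedical Informatics
Field: Life sciences
Discipline: Bioinformatics
Thesis type: Academic thesis
Year of publication: 2020
Graduation academic year: 108
Language: English
Number of pages: 49
Keywords: Exosome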
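Abstract (translated from the Chinese)

Exosomes are nano-sized vesicles originating inside the body whose main function is communication between cells. They are secreted by cells, and not only healthy cells but also infected cells secrete them. A large body of scientific literature indicates that tumour cells release more exosomes than most cells in order to initiate growth, progression and drug resistance. However, a few methods have been proposed for identifying major exosomal proteins. This study uses a Support Vector Machine (SVM), the Synthetic Minority Over-sampling Technique (SMOTE), and contextualized word embeddings generated by Bidirectional Encoder Representations from Transformers (BERT) to represent exosomal protein sequences and identify major exosomal proteins. BERT is one of the best-performing pre-trained models for natural language processing (NLP). Because the contextualized word embeddings of the BERT pre-trained model can effectively capture the multiple meanings of sentences and words, we apply this idea to protein sequences, hoping to find multiple functions and meanings for the same amino acid within a sequence. First, a dataset of major exosomal protein families was built, with tumour-related exosomal protein annotations. Fixed feature vectors were extracted from the hidden layers of the pre-trained model. The method in this thesis achieved an accuracy of 91.15% and an MCC of 0.82 on the independent dataset. It is worth noting that the proposed feature generation method gives results comparable to those of PSSM features.
Keywords: BERT, SVM, natural language processing, contextualized word embeddings, tetraspanins, Rab proteins, flotillins, ALIX, tumor susceptibility proteins, heat shock proteins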
Identifying Major Exosomal Protein Biomarkers
Using the Idea of Transfer Learning with Bidirectional Encoder Representations from Transformers

Student: Bongani Brian Dlamini Advisor: Prof. Yu-Yen Ou

Department of Computer Science and Engineering
Graduate Program in Biomedical Informatics
Yuan Ze University

ABSTRACT
Exosomes are nano-sized, endosome-derived vesicles that play a major role in cell-to-cell communication. They are secreted by many cell types, both healthy and infected. A substantial body of literature suggests that tumour cells release more exosomes than most cell lines to initiate growth, progression and drug resistance. However, a few methods have been proposed for identifying major exosomal proteins. This study proposes a Support Vector Machine (SVM) combined with the Synthetic Minority Over-sampling Technique (SMOTE), using contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent exosomal protein sequences and identify major exosomal proteins. BERT is one of the highest-performing pre-trained models for Natural Language Processing (NLP). For that reason, contextualized word embeddings from the BERT pre-trained model were applied to capture the multiple meanings that the same amino acid residue can carry in a protein sequence. First, a dataset of major exosomal protein families was established with tumour-related exosomal protein annotations. Fixed feature vectors were then extracted from the hidden layers of the pre-trained model. Finally, the proposed method, trained on the full datasets, achieved an accuracy of 91.15% and a Matthews correlation coefficient (MCC) of 0.82 on the independent datasets. The proposed method yields results comparable to those obtained with position-specific scoring matrix (PSSM) features.
Keywords: BERT, Support Vector Machine, Natural Language Processing, contextualized word embeddings, Tetraspanins, Rab-proteins, Flotillins, ALIX, Tumor Susceptibility Proteins, Heat shock proteins
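
The abstract names two building blocks that are easy to illustrate in isolation: SMOTE-style oversampling of the minority class and the accuracy/MCC evaluation metrics. The following is a minimal Python sketch of both, not the thesis's implementation. The function names, the toy two-dimensional vectors, and the confusion counts are hypothetical and chosen only for illustration; the thesis's real features are BERT-derived vectors, and its reported figures (91.15% accuracy, MCC 0.82) are not reproduced here.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Hypothetical helper; expects at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def accuracy_and_mcc(tp, tn, fp, fn):
    """Accuracy and Matthews correlation coefficient from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Toy data only: three made-up minority-class feature vectors.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like_samples(toy_minority, n_new=5, k=2)

# Made-up confusion counts, unrelated to the thesis's results.
print(accuracy_and_mcc(tp=45, tn=45, fp=5, fn=5))  # (0.9, 0.8)
```

In practice one would more likely rely on maintained implementations, such as the SMOTE class in imbalanced-learn and matthews_corrcoef in scikit-learn, rather than hand-written versions. The sketch only makes explicit that each synthetic point lies on a line segment between two real minority samples, and that MCC uses all four confusion counts, which is why it is often reported alongside accuracy on imbalanced data.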

Title Page
Letter of Approval
Abstract in Chinese
Abstract in English
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Exosomes
1.2 Exosomal Contents
1.2.1 Tetraspanins
1.2.2 Rab-proteins
1.2.3 Annexins
1.2.4 Flotillins
1.2.5 Proteins Involved in ESCRT Complex (ALIX and TSP)
1.2.6 Heat Shock Proteins
1.3 Motivation and Scope of the Thesis
1.4 Thesis Structure
CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Related Research on Exosomal Proteins
2.3 Literature and Development of NLP Research
2.4 Associations of NLP Techniques and Bioinformatics
2.5 Introduction to Word Embedding
2.6 Using Word Embedding Techniques in Solving Bioinformatics Problems
2.7 Using Language Models to Learn a Representation
2.8 Research Objectives
CHAPTER 3 METHODOLOGY
3.1 Methodology Overview
3.2 Data Collection
3.3 Data Search Query
3.4 Traditional Feature Extraction Methods
3.4.1 Amino Acid Composition (AAC)
3.4.2 Dipeptide Pair Composition (DPC)
3.4.3 PSSM Profiles
3.4.4 Feature Extraction using Contextualized Word Embedding from BERT Pre-trained Models
3.5 Feature Normalizing and Standardization
3.6 Dataset Balancing
3.7 Feature Selection with Random Forest
3.8 Classification with Support Vector Machine (SVM)
3.9 Environment Settings
3.9.1 Setting for BERT Summing Method
3.10 Cross-Validation and Evaluation Metrics
CHAPTER 4 EXPERIMENTAL RESULTS
4.1 Sequence Analysis
4.2 Analysis of 10 Most Frequent Motifs in the Datasets
4.3 Performance Results of Traditional Algorithms using PSSM Feature
4.4 Performance Comparison with Existing Feature Extraction Techniques
4.5 Results of Independent Test Sets
CHAPTER 5 CONCLUSION AND FUTURE WORKS
5.1 Research Contributions
5.2 Conclusions
5.3 Limitations and Future Works
References