臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Researcher: 胡光泰 (Quang-Thai Ho)
Thesis Title: Using Continuous Bag of Words to Interpret the Hidden Information of Protein Sequences in Electron Transport Proteins
Advisor: 歐昱言 (Yu-Yen Ou)
Committee Members: 張經略 (Ching-Lueh Chang), 歐展言 (Chan-Yen Ou)
Oral Defense Date: 2018-07-19
Degree: Master's
Institution: 元智大學 (Yuan Ze University)
Department: 資訊工程學系 (Computer Science and Engineering)
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2018
Graduation Academic Year: 106 (2017-2018)
Language: English
Number of Pages: 31
Keywords: machine learning; deep learning; natural language processing; word embedding; transport proteins; electron transport proteins
Usage statistics:
  • Cited by: 1
  • Views: 314
  • Rating:
  • Downloads: 40
  • Added to bibliography lists: 0
Deep learning, a subset of artificial intelligence and machine learning, uses multi-layered artificial neural networks to deliver state-of-the-art performance in tasks such as object detection, speech recognition, and language translation.
In bioinformatics, deep learning has been applied since the early 2000s to transform biomedical data into valuable knowledge, and it has produced remarkable scientific achievements. As a result, it not only shortens study time but also yields more accurate and reliable results.
In this study, I propose a novel approach that uses word embedding, a technique applied successfully in natural language research, to extract hidden information from the amino acid sequences of primary protein structures. Unlike the corpus of a conventional language model, the primary protein structure resembles an unknown language whose 20 words correspond to the 20 amino acids. An ordered chain of these words may itself carry information about biological functions, and that information contributes significantly to the classification of proteins.
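The record includes no code, so the following is only a minimal sketch of the idea described above, not the author's actual pipeline: it treats each amino acid as a "word" and each sequence as a "sentence", then trains continuous-bag-of-words embeddings with the gensim library (an assumption; the thesis itself uses fastText). The toy sequences and all hyperparameter values are placeholders.

# Minimal CBOW sketch: amino acids as "words", sequences as "sentences".
# Assumes gensim >= 4; values below are illustrative, not the thesis settings.
from gensim.models import Word2Vec

# Toy sequences; real data would come from FASTA records (e.g. UniProt).
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGR",
]

# Tokenize: one amino-acid letter per word.
corpus = [list(seq) for seq in sequences]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # embedding dimension
    window=5,         # context size around the centre amino acid
    min_count=1,
    sg=0,             # 0 selects the continuous bag of words (CBOW) model
    epochs=50,
)

# Each amino acid present in the corpus now has a learned vector, e.g. alanine:
print(model.wv["A"][:5])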
The study works directly from the FASTA format of amino acid sequences. I choose electron transport proteins, a group of proteins that transfer electrons during the metabolic processes of cellular function, to carry out this work; studying the identification and classification of electron transport proteins therefore helps improve our understanding of transport processes and their roles in cellular energy production. Facebook's fastText, which implements the hashing trick for fast and memory-efficient feature mapping, is used as the modeling tool. In identifying electron transport proteins among transport proteins, the method achieves a sensitivity of 60.53%, a specificity of 94.84%, an accuracy of 91.71%, and an MCC of 0.53 on independent testing, improving on the metrics reported in previous related works. The proposed technique is provided as a web-based online tool for research purposes and opens up promising avenues for applying natural language methods to problems in this field.
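Since the abstract names fastText and reports sensitivity, specificity, accuracy, and MCC, here is a hedged sketch of how such a classifier could be trained with the official fasttext Python package and how those four metrics are computed from a binary confusion matrix. The training file, labels, and hyperparameters are illustrative assumptions, not the configuration actually used in the thesis.

import math
import fasttext  # official Python bindings for Facebook's fastText

# Write a tiny toy training file: one protein per line, amino acids separated
# by spaces and prefixed with a fastText label (real data would be far larger).
with open("train.txt", "w") as f:
    f.write("__label__electron M K T A Y I A K Q R Q I S F V K S H F S R Q\n")
    f.write("__label__other M G S S H H H H H H S S G L V P R G S H M A S\n")

model = fasttext.train_supervised(
    input="train.txt",
    dim=100,        # embedding dimension
    wordNgrams=2,   # word n-grams, hashed into buckets (the hashing trick)
    bucket=100000,  # number of hash buckets; kept small for this toy example
    epoch=25,
    lr=0.5,
)

# Predict the label of a new space-separated amino acid sequence.
labels, probabilities = model.predict("M K T A Y I A K Q R Q I S F V K")
print(labels, probabilities)

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return sensitivity, specificity, accuracy, mcc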
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
1.1 Motivation
1.2 Research background
1.2.1 Machine learning and deep learning in bioinformatics
1.2.2 Electron transport proteins
1.3 Scope of the study
1.4 Organization of the study
Chapter 2. Literature Review
2.1 Introduction
2.2 Related research on deep learning
2.3 Related research on electron transport proteins
2.5 Purpose of the study and research motivation
Chapter 3. Identifying Electron Transport Proteins
3.1 Data collection
3.1.1 Electron transport proteins and corresponding molecular functions
3.1.2 Pre-processing data
3.2 Feature extraction
3.2.1 Composition of amino acids and amino acid pairs
3.2.2 Position-specific scoring matrices
3.2.3 Continuous bag of words
3.4 FastText
3.5 Effectiveness evaluation method
3.6 Results and discussions
3.6.1 Execution time
3.6.2 The number of words in bags
3.6.3 Identifying electron transport proteins
3.6.4 fastETC
Chapter 4. Conclusions
4.1 Research contributions
4.2 Limitations and further study
References
1. Alipanahi, B. D. (2015). Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nature biotechnology, 33(8), 831.
2. Altschul, S. F. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389-3402.
3. Asgari, E., & Mofrad, M. R. K. (2015). Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, 10(11), e0141287.
4. Boeckmann, B. B. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic acids research, 31(1), 365-370.
5. Chang, C. C. (2011). LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3), 27.
6. Chen, K. M. (2011). Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics, 28(3): 331-341.
7. Chen, S. A. (2011). Prediction of transporter targets using efficient RBF networks with PSSM profiles and biochemical properties. Bioinformatics, 27(15), 2062-2067.
8. Chou, K. C. (2001). Prediction of protein cellular attributes using pseudo‐amino acid composition. Proteins: Structure, Function, and Bioinformatics, 43(3), 246-255.
9. The UniProt Consortium. (2016). UniProt: the universal protein knowledgebase. Nucleic acids research, 45(D1), D158-D169.
10. Frank, E. H. (2004). Data mining in bioinformatics using Weka. Bioinformatics, 20(15), 2479-2481.
11. Gromiha, M. M. (2002). Important amino acid properties for determining the transition state structures of two‐state protein mutants. FEBS letters, 526(1-3), 129-134.
12. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. eprint arXiv:1607.01759, 1607.01759.
13. Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure. 405 (2), 442–451.
14. Mikolov, T. S. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111-3119.
15. Moody, G. (Feb 2004). Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business. Wiley.
16. Ou, Y. Y. (2005). QuickRBF: a package for efficient radial basis function networks. QuickRBF software available at http://csie.org/~yien/quickrbf.
17. Ou, Y. Y. (2010). Classification of transporters using efficient radial basis function networks with position‐specific scoring matrices and biochemical properties. Proteins: Structure, Function, and Bioinformatics, 78(7), 17.
18. Park, K. J. (2003). Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics, 19(13), 1656-1663.
19. Piotr Bojanowski, E. G. (2016). Enriching Word Vectors with Subword Information. eprint arXiv:1607.04606, 1607.04606.
20. Qian, Y., & Woodland, P. C. (2016). Very deep convolutional neural networks for robust speech recognition. In Spoken Language Technology Workshop (SLT), 2016 IEEE, 481-488.
21. Saier Jr, M. H. (2006). TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic acids research, 34(suppl_1), D181-D186.
22. Shi, J. Y. (2007). Prediction of protein subcellular localization by support vector machines using multi-scale energy and pseudo amino acid composition. Amino acids, 33(1), 69-74.
23. Spencer, M. E. (2015). A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM transactions on computational biology and bioinformatics (TCBB), 12(1), 103-112.
24. Taju, S. W. (2016). DeepEfflux: a 2D convolutional neural network model for identifying families of efflux proteins in transporters. Bioinformatics.
25. The UniProt Consortium. (2014). UniProt: a hub for protein information. Nucleic acids research, 43(D1), D204-D212.
26. Yu-Yen Ou, S.-A. C.-Y. (2011). Prediction of transporter targets using efficient RBF networks with PSSM profiles and biochemical properties. Bioinformatics, Volume 27, Issue 15, 1 August 2011, Pages 2062–2067.
27. Zhang, X. Z. (2015). Character-level convolutional networks for text classification. In Advances in neural information processing systems, 649-657.
28. Zhao, M. N. (2014). Prediction of Membrane Transport Proteins and Their Substrate Specificities Using Primary Sequence Information. PLoS ONE, 9(6): e100278. doi:10.1371/journal.pone.0100278.