臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Researcher: 胡光泰 (Quang-Thai Ho)
Thesis Title: Using Continuous Bag of Words to Interpret the Hidden Information of Protein Sequences in Electron Transport Proteins
Advisor: 歐昱言 (Yu-Yen Ou)
Committee Members: 張經略 (Ching-Lueh Chang), 歐展言 (Chan-Yen Ou)
Oral Defense Date: 2018-07-19
Degree: Master's
Institution: 元智大學 (Yuan Ze University)
Department: 資訊工程學系 (Computer Science and Engineering)
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2018
Graduation Academic Year: 106 (2017-2018)
Language: English
Number of Pages: 31
Keywords: machine learning; deep learning; natural language processing; word embedding; transport proteins; electron transport proteins
Usage statistics:
  • Cited by: 1
  • Views: 314
  • Rating:
  • Downloads: 40
  • Added to bibliography lists: 0
Deep learning, a subset of artificial intelligence and machine learning, uses multi-layered artificial neural networks to deliver state-of-the-art performance in tasks such as object detection, speech recognition, and language translation.
In bioinformatics, deep learning has been applied since the early 2000s to transform biomedical data into valuable knowledge, and it has produced remarkable scientific achievements. As a result, it not only shortens study time but also yields more accurate and reliable results.
In this study, I propose a novel approach that uses word embedding, a technique applied successfully in natural language research, to extract hidden information from the amino acid sequences of primary protein structures. Unlike the corpus of a conventional language model, the primary protein structure resembles an unknown language whose 20 words correspond to the 20 amino acids. An ordered chain of these words may itself carry information about biological functions, and that information contributes significantly to the classification of proteins.
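The record includes no code, so the following is only a minimal sketch of the idea described above, not the author's actual pipeline: it treats each amino acid as a "word" and each sequence as a "sentence", then trains continuous-bag-of-words embeddings with the gensim library (an assumption; the thesis itself uses fastText). The toy sequences and all hyperparameter values are placeholders.

# Minimal CBOW sketch: amino acids as "words", sequences as "sentences".
# Assumes gensim >= 4; values below are illustrative, not the thesis settings.
from gensim.models import Word2Vec

# Toy sequences; real data would come from FASTA records (e.g. UniProt).
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGR",
]

# Tokenize: one amino-acid letter per word.
corpus = [list(seq) for seq in sequences]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # embedding dimension
    window=5,         # context size around the centre amino acid
    min_count=1,
    sg=0,             # 0 selects the continuous bag of words (CBOW) model
    epochs=50,
)

# Each amino acid present in the corpus now has a learned vector, e.g. alanine:
print(model.wv["A"][:5])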
The study works directly from the FASTA format of amino acid sequences. I choose electron transport proteins, a group of proteins that transfer electrons during the metabolic processes of cellular function, to carry out this work; studying the identification and classification of electron transport proteins therefore helps improve our understanding of transport processes and their roles in cellular energy production. Facebook's fastText, which implements the hashing trick for fast and memory-efficient feature mapping, is used as the modeling tool. In identifying electron transport proteins among transport proteins, the method achieves a sensitivity of 60.53%, a specificity of 94.84%, an accuracy of 91.71%, and an MCC of 0.53 on independent testing, improving on the metrics reported in previous related works. The proposed technique is provided as a web-based online tool for research purposes and opens up promising avenues for applying natural language methods to problems in this field.
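Since the abstract names fastText and reports sensitivity, specificity, accuracy, and MCC, here is a hedged sketch of how such a classifier could be trained with the official fasttext Python package and how those four metrics are computed from a binary confusion matrix. The training file, labels, and hyperparameters are illustrative assumptions, not the configuration actually used in the thesis.

import math
import fasttext  # official Python bindings for Facebook's fastText

# Write a tiny toy training file: one protein per line, amino acids separated
# by spaces and prefixed with a fastText label (real data would be far larger).
with open("train.txt", "w") as f:
    f.write("__label__electron M K T A Y I A K Q R Q I S F V K S H F S R Q\n")
    f.write("__label__other M G S S H H H H H H S S G L V P R G S H M A S\n")

model = fasttext.train_supervised(
    input="train.txt",
    dim=100,        # embedding dimension
    wordNgrams=2,   # word n-grams, hashed into buckets (the hashing trick)
    bucket=100000,  # number of hash buckets; kept small for this toy example
    epoch=25,
    lr=0.5,
)

# Predict the label of a new space-separated amino acid sequence.
labels, probabilities = model.predict("M K T A Y I A K Q R Q I S F V K")
print(labels, probabilities)

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return sensitivity, specificity, accuracy, mcc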
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
1.1 Motivation
1.2 Research background
1.2.1 Machine learning and deep learning in bioinformatics
1.2.2 Electron transport proteins
1.3 Scope of the study
1.4 Organization of the study
Chapter 2. Literature Review
2.1 Introduction
2.2 Related research on deep learning
2.3 Related research on electron transport proteins
2.5 Purpose of the study and research motivation
Chapter 3. Identifying Electron Transport Proteins
3.1 Data collection
3.1.1 Electron transport proteins and corresponding molecular functions
3.1.2 Pre-processing data
3.2 Feature extraction
3.2.1 Composition of amino acids and amino acid pairs
3.2.2 Position-specific scoring matrices
3.2.3 Continuous bag of words
3.4 FastText
3.5 Effectiveness evaluation method
3.6 Results and discussions
3.6.1 Execution time
3.6.2 The number of words in bags
3.6.3 Identifying electron transport proteins
3.6.4 fastETC
Chapter 4. Conclusions
4.1 Research contributions
4.2 Limitations and further study
References
1. Alipanahi, B. D. (2015). Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nature biotechnology, 33(8), 831.
2. Altschul, S. F. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389-3402.
3. Asgari, E., & Mofrad, M. R. K. (2015). Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, 10(11), e0141287.
4. Boeckmann, B. B. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic acids research, 31(1), 365-370.
5. Chang, C. C. (2011). LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3), 27.
6. Chen, K. M. (2011). Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics, 28(3): 331-341.
7. Chen, S. A. (2011). Prediction of transporter targets using efficient RBF networks with PSSM profiles and biochemical properties. Bioinformatics, 27(15), 2062-2067.
8. Chou, K. C. (2001). Prediction of protein cellular attributes using pseudo‐amino acid composition. Proteins: Structure, Function, and Bioinformatics, 43(3), 246-255.
9. The UniProt Consortium. (2016). UniProt: the universal protein knowledgebase. Nucleic acids research, 45(D1), D158-D169.
10. Frank, E. H. (2004). Data mining in bioinformatics using Weka. Bioinformatics, 20(15), 2479-2481.
11. Gromiha, M. M. (2002). Important amino acid properties for determining the transition state structures of two‐state protein mutants. FEBS letters, 526(1-3), 129-134.
12. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. eprint arXiv:1607.01759, 1607.01759.
13. Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure. 405 (2), 442–451.
14. Mikolov, T. S. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111-3119.
15. Moody, G. (Feb 2004). Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business. Wiley.
16. Ou, Y. Y. (2005). QuickRBF: a package for efficient radial basis function networks. QuickRBF software available at http://csie.org/~yien/quickrbf.
17. Ou, Y. Y. (2010). Classification of transporters using efficient radial basis function networks with position‐specific scoring matrices and biochemical properties. Proteins: Structure, Function, and Bioinformatics, 78(7), 17.
18. Park, K. J. (2003). Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics, 19(13), 1656-1663.
19. Piotr Bojanowski, E. G. (2016). Enriching Word Vectors with Subword Information. eprint arXiv:1607.04606, 1607.04606.
20. Qian, Y., & Woodland, P. C. (2016). Very deep convolutional neural networks for robust speech recognition. In Spoken Language Technology Workshop (SLT), 2016 IEEE, 481-488.
21. Saier Jr, M. H. (2006). TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic acids research, 34(suppl_1), D181-D186.
22. Shi, J. Y. (2007). Prediction of protein subcellular localization by support vector machines using multi-scale energy and pseudo amino acid composition. Amino acids, 33(1), 69-74.
23. Spencer, M. E. (2015). A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM transactions on computational biology and bioinformatics (TCBB), 12(1), 103-112.
24. Taju, S. W. (2016). DeepEfflux: a 2D convolutional neural network model for identifying families of efflux proteins in transporters. Bioinformatics.
25. The UniProt Consortium. (2014). UniProt: a hub for protein information. Nucleic acids research, 43(D1), D204-D212.
26. Yu-Yen Ou, S.-A. C.-Y. (2011). Prediction of transporter targets using efficient RBF networks with PSSM profiles and biochemical properties. Bioinformatics, Volume 27, Issue 15, 1 August 2011, Pages 2062–2067.
27. Zhang, X. Z. (2015). Character-level convolutional networks for text classification. In Advances in neural information processing systems, 649-657.
28. Zhao, M. N. (2014). Prediction of Membrane Transport Proteins and Their Substrate Specificities Using Primary Sequence Information. PLoS ONE, 9(6): e100278. doi:10.1371/journal.pone.0100278.