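Graduate student: 陳君星
Graduate student (English): Chun-hsing Chen
Thesis title: A Study of Protein Sequence Classification Based on Latent Topic Extraction (以隱性主題抽取為基礎之蛋白質序列分類之研究)
Thesis title (English): Remote homology detection based on latent topic extraction in protein databases
Advisor: 葉建華
Advisor (English): Jian-hua Yeh
Degree: Master's
Institution: 真理大學 (Aletheia University)
Department: Master's Program, Department of Information Engineering (資訊工程學系碩士班)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2010
Academic year of graduation: 97
Language: Chinese
Number of pages: 56
Keywords (Chinese, translated): latent topic extraction, protein sequences, support vector machine, protein classification
Keywords (English): SVM, LDA, protein sequences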
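In computational biology, using remote homology detection to uncover the relationships among protein sequences is a core problem. Among the discriminative methods for this kind of problem, the most effective and most common is the support vector machine (SVM). Most SVM-based methods rely on explicit feature vectors or kernel functions to obtain a representative description of each protein sequence, so research in this area largely emphasizes feature extraction that lets the classifier separate proteins by their characteristics.
In this study, our protein data come from Structural Classification of Proteins (SCOP) version 1.53. Our method applies Latent Dirichlet Allocation (LDA), a latent topic extraction technique from natural language processing that extracts latent features very efficiently. For the basic building blocks of protein sequences, we use N-grams to convert each sequence into words, so that every protein sequence is treated as a document represented as a bag of words. LDA is then used to compute document vectors and similarities, and an SVM performs the final classification. In our experiments, we found that LDA combined with SVM performs markedly better than SVM alone or LSA combined with SVM.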
Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods such as the Support Vector Machine (SVM) are among the most effective approaches. Many SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vector representations or kernel functions. This research therefore focuses on feature extraction and efficient representation of protein vectors for SVM protein classification.
In this study, we use protein data from Structural Classification of Proteins (SCOP) version 1.53. Our method uses a latent topic extraction technique, the Latent Dirichlet Allocation (LDA) model, which is an efficient feature extraction technique from natural language processing. The basic building blocks of our model are word documents generated from protein sequences by N-gram segmentation. The LDA phase is applied to these documents to extract latent topics, and the SVM then classifies the resulting latent topic representations. In our experiments, the LDA-SVM model outperforms the LSA-SVM model from previous research.
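The first stage described in the abstract, turning each protein sequence into a document of N-gram words, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record does not give the thesis's N-gram length or say whether the N-grams overlap, so the overlapping 3-grams, the function names, and the toy sequences below are all hypothetical. The TF-IDF weighting follows the vocabulary-filtering step named in the thesis outline, in its plain textbook form rather than the author's exact formula.

```python
import math
from collections import Counter


def ngram_words(sequence, n=3):
    """Split a protein sequence into overlapping n-gram 'words' (n=3 is an assumption)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def bag_of_words(sequence, n=3):
    """Count n-gram occurrences, treating one sequence as one document."""
    return Counter(ngram_words(sequence, n))


def tf_idf(bags):
    """Weight each word by term frequency times log inverse document frequency."""
    num_docs = len(bags)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for bag in bags for word in bag)
    weighted = []
    for bag in bags:
        total = sum(bag.values())
        weighted.append({
            word: (count / total) * math.log(num_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return weighted


# Toy sequences, made up for illustration (not taken from SCOP).
sequences = ["MKVLAAGIV", "MKVLSAGIV", "GGSPLAAGI"]
bags = [bag_of_words(s) for s in sequences]
weights = tf_idf(bags)
print(ngram_words("MKVLA"))
```

Words that occur in every document, such as `AGI` in the toy data, receive a weight of zero, which is why a TF-IDF threshold can serve as a vocabulary filter before the bags of words are passed to a topic model and then to a classifier.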
Chapter 1 Introduction
1-1. Introduction to Proteins
1-2. Research Motivation and Objectives
1-3. System Architecture
Chapter 2 Related Work
2-1. SCOP
2-2. Construction of Protein Building Blocks
2-3. Latent Topic Extraction (LDA)
2-4. Support Vector Machine (SVM)
Chapter 3 Proposed Method
3-1. Building Protein Building Blocks with TF-IDF and N-grams
3-2. Prediction Model Based on Latent Topic Extraction
3-3. Prediction Model Based on SVM Classifier Analysis
Chapter 4 Experimental Procedure
4-1. Converting Sequences to Words (N-gram)
4-2. Vocabulary Filtering (TF-IDF)
4-3. Latent Topic Extraction (LDA)
4-4. Superfamily Classification (SVM)
Chapter 5 Discussion
Chapter 6 Conclusion
References