National Digital Library of Theses and Dissertations in Taiwan

Detailed Record
Author: Ya-Ling Yang (楊雅玲)
Title: Exploring Hierarchical Classifications of Protein Sequences with Classification, Clustering and Mining Association Methods
Advisor: Shian-Hua Lin (林宣華)
Degree: Master's
Institution: National Chi Nan University
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis Type: Academic thesis
Publication Year: 2007
Graduation Academic Year: 95 (2006-2007)
Language: English
Pages: 39
Keywords (Chinese): Clustering; Classification; Protein Sequence
Keywords (English): Clustering; Classification; Protein Sequence; Shortest Path; Hierarchical Agglomerative Clustering Algorithm; Graph Theory; Motif; BLAST; Smith-Waterman; SVM
Times Cited: 0
Views: 250
Downloads: 0
Bookmarks: 0
Chinese Abstract (translated):
With the rapid progress of modern biotechnology, the number of fully sequenced protein sequences is growing quickly, and these sequences are publicly available on the Internet. Among them, only a small fraction have had their functions and structures manually annotated; the additional information for most known protein sequences remains unknown. If similar proteins can be grouped together, biologists gain meaningful information that helps in studying not-yet-annotated proteins. Many public protein classification databases are available on the Web, such as SCOP and Pfam. Because manually annotating and classifying proteins is time-consuming, we can instead use clustering methods to group similar protein sequences into the same cluster and predict the functions of unknown proteins from the annotated proteins in that cluster, since similar sequences may imply similar functions. In addition, we also apply classification methods, characterizing each protein sequence with a set of features or parameters. In this thesis, we first use the BLAST or Smith-Waterman alignment tools to compute similarities between protein sequences, and apply the shortest-path concept from graph theory to optimize these similarities. To incorporate properties of protein function into the clustering algorithm, we treat protein motifs as features and use the vector model from information retrieval to compute functional similarity between proteins. We then merge the two similarity graphs and cluster the proteins with a hierarchical agglomerative clustering algorithm. Separately, we select three sets of features to classify proteins with an SVM. The experimental results show that our methods are not entirely satisfactory, but some useful observations can still be drawn from them.
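The motif-based functional similarity mentioned in the abstract can be sketched with the vector model of information retrieval: each protein becomes a TF-IDF weighted vector over its motifs, and functional similarity is the cosine of two vectors. This is a minimal illustrative sketch, not the thesis's actual implementation; the motif names and the plain TF-IDF weighting are assumptions for the example.

```python
import math
from collections import Counter

def tfidf_vectors(protein_motifs):
    """Build a TF-IDF vector for each protein from its motif list
    (motifs play the role of index terms in the vector model)."""
    n = len(protein_motifs)
    # Document frequency: in how many proteins does each motif occur?
    df = Counter()
    for motifs in protein_motifs.values():
        df.update(set(motifs))
    vectors = {}
    for pid, motifs in protein_motifs.items():
        tf = Counter(motifs)
        # Weight = term frequency * inverse document frequency.
        vectors[pid] = {m: tf[m] * math.log(n / df[m]) for m in tf}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(m, 0.0) for m, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Proteins sharing the same motifs then get similarity 1.0, and proteins with disjoint motif sets get 0.0, which is the functional-similarity graph's edge weight.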
Abstract:
With the improvement of bio-technologies, the number of sequenced proteins has increased rapidly. Among these proteins, only a few are manually annotated with functions and structures; for most sequenced proteins, curated information is still unavailable. If we can categorize similar proteins, this may provide significant information for biologists to investigate novel proteins based on cluster information from well-curated proteins. Several public protein classification databases are available on the Web, such as SCOP and Pfam. Manually constructing classification information for proteins is a tedious and time-consuming task. Therefore, we apply a clustering method to group similar protein sequences into the same cluster. For comparison with the clustering method, we also use a classification method that classifies proteins based on a set of features or parameters characterizing each protein sequence. The BLAST and Smith-Waterman tools are employed to calculate pairwise similarities between protein sequences, so that a set of proteins forms a graph with proteins as nodes and similarities as edge weights. The shortest-path method of graph theory is then applied to optimize the similarities between sequences. Protein function is also considered, to improve the quality of the graph, by regarding protein motifs as features and constructing protein similarities based on the vector model of information retrieval. After combining both graphs, the hierarchical agglomerative clustering (HAC) algorithm is employed to cluster the protein sequences. We also select three features to classify proteins with an SVM. Experiments show that these methods are not good enough; however, we identify directions for future work and gain experience in doing research.
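The graph pipeline in the abstract (similarity graph, then shortest-path optimization, then HAC) can be sketched as follows. This is an illustrative toy, not the thesis's implementation: it assumes alignment similarities have already been converted to distances (e.g. d = 1 − similarity), uses Floyd-Warshall for all-pairs shortest paths, and uses single-linkage HAC with a distance threshold as the stopping criterion.

```python
from itertools import combinations

def shortest_path_distances(dist):
    """All-pairs shortest paths (Floyd-Warshall) over a symmetric
    distance matrix, so proteins linked only through intermediate
    homologs also end up close to each other."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def hac_single_link(dist, threshold):
    """Single-linkage hierarchical agglomerative clustering:
    repeatedly merge the two closest clusters, stopping once the
    closest pair is farther apart than `threshold`."""
    clusters = [{i} for i in range(len(dist))]

    def cluster_dist(a, b):
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        if cluster_dist(clusters[a], clusters[b]) > threshold:
            break
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters
```

For instance, if proteins 0-1 and 1-2 are close but 0-2 are not directly similar, the shortest-path step shrinks d(0,2) to d(0,1)+d(1,2), so HAC can still place all three in one cluster while an unrelated protein stays separate.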
Chinese Abstract
Abstract
Contents
List of Tables
List of Figures
1. Introduction
2. Related Works
2.1. Protein Databases
2.1.1. PDB (Protein Data Bank)
2.1.2. SCOP (Structural Classification of Proteins)
2.1.3. PROSITE
2.2. Tools of Sequence Alignment
2.2.1. Smith-Waterman
2.2.2. BLAST (Basic Local Alignment Search Tool)
2.3. Shortest Path
2.3.1. Illustration
2.3.2. Problems and Solutions
2.3.3. Example
2.4. Vector Model
2.5. Clustering and Classification
2.5.1. Graph-based Clustering Algorithm
2.5.2. Hierarchical-based Clustering Algorithm
2.5.3. Partition-based Clustering Algorithm
2.5.4. Classification Method: SVM
3. The Systems
3.1. HCP
3.1.1. Definitions of Graphs
3.1.2. Data Flow and Algorithm
3.1.3. Sequence Similarity
3.1.4. Function Similarity
3.1.5. Combination Rule
3.2. SVM
3.2.1. The Programs
3.2.2. The Algorithm
3.2.3. Cross Validation
3.2.4. Parameter Selection
4. Experiment
4.1. Data Source
4.2. Evaluation
4.3. The Results of Experiment
5. Conclusion and Future Work
6. References
[1]Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L., and Sonnhammer, E. L. L., “The Pfam Protein Families Database,” Nucleic Acids Research, 28(1):263-266, 2000.
[2]Barker, W. C., Garavelli, J. S., Huang, H., McGarvey, P. B., Orcutt, B. C., Srinivasarao, G. Y., Xiao, C., Yeh, L. S. L., Ledley, R. S., Janda, J. F., Pfeiffer, F., Mewes, H. W., Tsugita, A. and Wu, C., “The Protein Information Resource (PIR),” Nucleic Acids Research, 28(1):41-44, 2000.
[3]Vinga, S., Oliveira, R. G. and Almeida, J. S., “Comparative evaluation of word composition distances for the recognition of SCOP relationships,” Bioinformatics, 20(2):206-215, 2004.
[4]Gibas, C. and Jambeck, P., “Bioinformatics,” O’Reilly, 2002.
[5]Ahn, G. T., Kim, J. H., Hwang, E. Y., Lee M. J. and Han, I. S., “SCOPExplorer: A Tool for Browsing and Analyzing Structural Classification of Proteins (SCOP) Data,” Molecules and Cells, 17(2):360-364, 2004.
[6]Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C., “SCOP: A Structure Classification of Proteins Database for the Investigation of Sequences and Structures,” Journal of Molecular Biology, 247:536-540, 1995.
[7]Vinga, S., Oliveira, R. G. and Almeida, J. S., “Comparative evaluation of word composition distances for the recognition of SCOP relationships,” Bioinformatics, 20(2):206-215, 2004.
[8]Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C. J. A., Hofmann, K. and Bairoch, A., “The PROSITE database, its status in 2002,” Nucleic Acids Research, 30(1):235-238, 2002.
[9]Hulo, N., Sigrist, C. J. A., Saux, V. L., Genevaux, P. S. L., Bordoli, L., Gattiker, A., Castro, E. D., Bucher, P. and Bairoch, A., “Recent improvements to the PROSITE database,” Nucleic Acids Research, 32:D134-D137, 2004.
[10]Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, 25(17): 3389-3402, 1997.
[11]Smith, T. F. and Waterman, M. S., “Identification of Common Molecular Subsequences,” Journal of Molecular Biology, 147:195-197, 1981.
[12]Cormen, T. H., Leiserson, C. E. and Rivest, R. L., “Introduction to Algorithms,” The MIT Press, 1990.
[13]Kawaji, H., Takenaka, Y. and Matsuda, H., “Graph-based clustering for finding distant relationships in a large set of protein sequences,” Bioinformatics, 20(2):243-252, 2004.
[14]Pipenbacher, P., Schliep, A., Schneckener, S., Schomburg, D. and Schrader, R., “ProClust:improved clustering of protein sequences with an extended graph-based approach,” Bioinformatics, 18(2):S182-S191, 2002.
[15]Jain, A. K., Murty, M. N. and Flynn, P. J., “Data Clustering: A Review,” ACM Computing Surveys, 31(3):264-323, 1999.
[16]Sasson, O., Linial, N. and Linial, M., “The metric space of proteins-comparative study of clustering algorithms,” Bioinformatics, 18(1):S14-S21, 2002.
[17]Klösgen, W. and Zytkow, J. M., “Handbook of Data Mining and Knowledge Discovery,” Oxford University Press, 2002.
[18]Cover, T. M. and Hart, P. E., “Nearest Neighbor Pattern Classification,” IEEE Transactions on Information Theory, 13:21-27, 1967.
[19]Breiman, L., "Bagging predictors," Machine Learning, 24:123-140, 1996.
[20]Quinlan, J. R., “Induction of Decision Trees,” Machine Learning, 1(1):81-106, 1986.
[21]Pearl, J. “Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference,” Morgan-Kaufmann, 1988.
[22]Wiener, E., Pedersen, J. O. and Weigend, A. S., “A Neural Network Approach to Topic Spotting,” Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR'95), 1995.
[23]Cortes, C. and Vapnik, V., “Support-Vector Networks,” Machine Learning, 20(3):273-297, September 1995.
[24]Dumais, S., Platt, J., Heckerman, D. and Sahami, M., “Inductive Learning Algorithms and Representations for Text Categorization,” ACM Conference on Information and Knowledge Management (CIKM), 1998.
[25]Joachims, T., “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Proceedings of the 10th European Conference on Machine Learning (ECML), 137-142, Chemnitz, Germany, 1998.
[26]Agrawal, R., Imieliński, T. and Swami, A. “Mining association rules between sets of items in large databases,” ACM SIGMOD Record, Vol. 22, 207-216, 1993.
[27]Wright, S. J., “Primal-dual interior-point methods,” SIAM, Philadelphia, Pennsylvania, 1997.
[28]Knerr, S., Personnaz, L. and Dreyfus, G., “Single-layer learning revisited: a stepwise procedure for building and training a neural network,” Neurocomputing: Algorithms, Architectures and Application, 1990.
[29]Friedman, J., “Another approach to polychotomous classification,” Technical report, Department of Statistics, Stanford University, 1996.
[30]Wu, T. F., Lin, C. J. and Weng, R. C. “Probability Estimates for Multi-class Classification by Pairwise Coupling,” The Journal of Machine Learning Research, Vol. 5, 2004.
[31]Platt, J. C., Cristianini, N. and Shawe-Taylor, J., “Large Margin DAGs for Multiclass Classification,” Advances in Neural Information Processing Systems 12, 547-553, MIT Press, 2000.
[32]Hsu, C. W. and Lin, C. J., “A Comparison of Methods for Multi-class Support Vector Machines,” IEEE Transactions on Neural Networks, 13:415-425, 2002.
[33]Chang, C. C. and Lin, C. J., “LIBSVM: a library for support vector machines,” 2005, http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html