Graduate student (English name): Che-Chi Wu
Thesis title (English): A Study of Error-Correcting Output Codes for Structure Classification of Protein Sequence
Advisor (English name): Jyh-Jong Tasy
Keywords (English): Structure Classification of Protein Sequence; Error-Correcting Output Codes
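Structure classification and prediction of protein sequences has become an important research topic in bioinformatics. Because the number of protein folds currently available is very large, most classification methods in machine learning run into the well-known "false positive" problem. In this master's thesis, we study new approaches based on multi-class classification to reduce the impact of this known problem. We first use support vector machines (SVM) as the base classifiers, and then apply the error-correcting output codes (ECOC) method to achieve high-level multi-class classification. When the scores from multiple parameter datasets are combined, majority voting reduces the interference of erroneous data and increases recognition accuracy. The results show that our method achieves a prediction accuracy of 62.72% on a protein test dataset in which most proteins share less than 25% sequence identity with the training dataset.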
Structure classification and prediction of protein sequences has been
an important research theme in structural bioinformatics. Most
discriminative methods in machine learning suffer from the well-known
"false positives" problem because of the large number of folds
available. In this thesis, we study new approaches based on multi-class
classification methods to reduce the influence of this
problem. We use support vector machines (SVMs) as base
classifiers and apply Error-Correcting Output Codes (ECOC) methods
to achieve high-level multi-class classification. When scores of
multiple parameter datasets are combined, majority voting reduces
noise and increases recognition accuracy. The results show that our
methods achieve a prediction accuracy of 62.72% on a protein test
dataset in which most of the proteins have
below 25% sequence identity with the proteins used in training.
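As a rough illustration of the ECOC decoding and majority voting mentioned in the abstract, the following minimal Python sketch may help. It is not the thesis's implementation: the four fold labels, the 7-bit code matrix (an exhaustive code for four classes), and the function names `hamming`, `ecoc_decode`, and `majority_vote` are all assumptions made for this example, and the binary classifier outputs are hard-coded where a trained SVM would normally supply them.

```python
from collections import Counter

# Hypothetical code matrix (an assumption, not taken from the thesis):
# one 7-bit codeword per class, one column per binary classifier.
# Every pair of codewords differs in 4 positions, so a single wrong
# binary classifier output can still be corrected.
CODE_MATRIX = {
    "fold_A": (1, 1, 1, 1, 1, 1, 1),
    "fold_B": (0, 0, 0, 0, 1, 1, 1),
    "fold_C": (0, 0, 1, 1, 0, 0, 1),
    "fold_D": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a, b):
    """Count the positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits, code_matrix=CODE_MATRIX):
    """Return the class whose codeword is nearest to `bits` in Hamming distance.

    Ties are broken in favour of the class listed first in the matrix.
    """
    return min(code_matrix, key=lambda label: hamming(code_matrix[label], bits))


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


# Stand-in for the outputs of seven binary classifiers on one sequence:
# the codeword of fold_B with its first bit flipped.
noisy_bits = (1, 0, 0, 0, 1, 1, 1)
decoded = ecoc_decode(noisy_bits)

# Stand-in for predictions obtained from three different parameter sets.
combined = majority_vote([decoded, "fold_C", "fold_B"])
```

In this sketch a single erroneous bit is absorbed by the nearest-codeword rule, and the vote across parameter sets then absorbs one disagreeing prediction. How the thesis actually constructs its code matrix and combines scores is described in the chapters listed below, not here.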
1 Introduction
2 Machine Learning Background
2.1 Support Vector Machines
2.2 Multi-class classification methods
2.2.1 One-against-others
2.2.2 One-against-one
2.3 Error-Correcting Output Codes (ECOC)
3 Preprocessing and Data Sets
3.1 Training Dataset
3.2 Independent Test Dataset
3.3 Feature Vector Extraction
4 Experiments and Results
4.1 One-against-one method
4.2 ECOC method
4.3 Combine one-against-one and ECOC
5 Conclusion
