Graduate student (English name): Ting-Kng Tiun
Thesis title (English): A Sequential Feature Selecting Strategy Based on Relevance Between Data Label and Principal Component
Advisor (English name): Wei-Ning Yang
Oral examination committee (English names): Yun-Shiow Chen, Yung-Ho Leu
Keywords (English): Feature Extraction, PCA, R square, Fisher Information, Genetic Algorithm, Area under ROC curve
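We propose a feature extraction method based on the relevance between the data class label and the principal components. The method performs feature selection by exploiting the fact that the individual principal components produced by principal component analysis (PCA) are mutually uncorrelated. The principal components themselves make no direct contribution to classification, so we consider how to bring class information into each principal component, so that each component reflects a difference in classification ability and we can select components on the basis of that difference. We propose three ways of relating the class label to the principal components, namely the correlation coefficient (R^2), the area under the ROC curve, and Fisher information, and compare them with the original PCA method. In this study, we apply PCA to the original data to transform it into mutually uncorrelated attribute vectors, compute the correlation coefficient (R^2), the area under the ROC curve, and Fisher information for these vectors, rank the results, and select accordingly. We use three different classification methods to validate this selection method: a genetic algorithm combined with the area under the ROC curve, the support vector machine (SVM), and the Naive Bayes classifier. After the selected attribute vectors are used to train and build models with the three classification methods, we validate the approach on two data sets. The experimental results show that our proposed approach performs outstandingly in some situations but is no better than traditional PCA in others, possibly because our understanding of the data and of the limitations of the algorithms is not yet clear enough. We hope that follow-up research can explore and extend these aspects.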
A binary classification method predicts the class of an object from its associated feature vector. Traditional classification methods usually suffer when the feature vector is high-dimensional, which creates the need to reduce the number of features. There are two major approaches to doing so. The first selects a subset of the original features, which preserves the original meaning of each feature. However, correlation among the original features makes it difficult to find a proper subset of significant features among a large number of candidates, so the search must resort to stochastic optimization algorithms. The second approach first transforms the original attributes into uncorrelated integrated features by principal component analysis (PCA) and then sequentially searches for the subset of significant integrated features. Because this approach removes the correlation among the integrated features, a sequential search for the subset of significant integrated features becomes feasible, at the cost of losing the interpretability of the significant features.
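
The claim that PCA yields uncorrelated integrated features can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `pca_scores` and the toy data are hypothetical and are not taken from the thesis.

```python
import numpy as np

def pca_scores(X):
    """Project rows of X onto principal components, sorted by decreasing variance."""
    Xc = X - X.mean(axis=0)                 # center each original feature
    cov = np.cov(Xc, rowvar=False)          # sample covariance of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort to descending variance
    return Xc @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
# Hypothetical toy data: three correlated features built from two latent factors.
latent = rng.normal(size=(200, 2))
X = np.column_stack([
    latent[:, 0],
    latent[:, 0] + 0.5 * latent[:, 1],
    latent[:, 1] + 0.1 * rng.normal(size=200),
])

Z, variances = pca_scores(X)
cov_Z = np.cov(Z, rowvar=False)
off_diag = cov_Z - np.diag(np.diag(cov_Z))
print(np.abs(off_diag).max())  # off-diagonal covariances are numerically close to zero
print(variances)               # component variances in non-increasing order
```

The diagonal of `cov_Z` equals the eigenvalues, so ranking components by variance amounts to reading off `variances` in order.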

In this study, we first transform the original features into uncorrelated integrated features by PCA and then rank the integrated features by their variances. To find the subset of significant integrated features, we start with the top-ranked integrated feature and sequentially add further integrated features in rank order.
For each subset of integrated features, a test score, defined as a linear combination of the integrated features in the subset, is generated for classification. The coefficient of each integrated feature in the linear combination is determined by a Genetic Algorithm (GA) so that the area under the Receiver Operating Characteristic (ROC) curve (AUC) of the test score is maximized. Besides this self-developed classifier, we apply two other commonly used classifiers for comparison. Using the training data, the classification accuracy of each subset is evaluated, and the subset with the highest classification accuracy becomes the final subset of significant integrated features used for classification. In addition to ranking the integrated features by variance, we can also rank them by Fisher Information, $R^2$, or AUC and then sequentially grow the subset of integrated features according to the resulting ranks.
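
The label-aware rankings and the sequentially growing subsets can be illustrated with a short standard-library sketch. It rests on several assumptions that the abstract does not confirm: the Fisher discriminant ratio stands in for the Fisher Information measure, whose exact definition is not given here; the AUC is made direction-free with `max(a, 1 - a)`; and all function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance

def split_by_class(z, y):
    """Split one component's scores into positive (1) and negative (0) groups."""
    pos = [a for a, b in zip(z, y) if b == 1]
    neg = [a for a, b in zip(z, y) if b == 0]
    return pos, neg

def r_squared(z, y):
    """Squared Pearson correlation between component scores and 0/1 labels."""
    mz, my = mean(z), mean(y)
    cov = sum((a - mz) * (b - my) for a, b in zip(z, y))
    vz = sum((a - mz) ** 2 for a in z)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vz * vy)

def auc(z, y):
    """Area under the ROC curve by pairwise comparison (ties count one half)."""
    pos, neg = split_by_class(z, y)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    a = wins / (len(pos) * len(neg))
    return max(a, 1 - a)  # assumption: a component may separate in either direction

def fisher_score(z, y):
    """Fisher discriminant ratio (assumed stand-in for Fisher Information)."""
    pos, neg = split_by_class(z, y)
    return (mean(pos) - mean(neg)) ** 2 / (pvariance(pos) + pvariance(neg))

def nested_subsets(scores):
    """Rank component indices by descending score; return growing top-k subsets."""
    ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [ranking[:k] for k in range(1, len(ranking) + 1)]

# Hypothetical toy data: each inner list holds one component's scores.
components = [
    [0.9, 1.1, 1.0, -1.0, -0.8, -1.2],  # separates the two classes well
    [0.5, -0.4, 0.1, 0.3, -0.2, -0.1],  # mostly noise
]
y = [1, 1, 1, 0, 0, 0]

for name, measure in [("R^2", r_squared), ("AUC", auc), ("Fisher", fisher_score)]:
    scores = [measure(z, y) for z in components]
    print(name, nested_subsets(scores))
```

Each nested subset would then be passed to a classifier and the subset with the highest accuracy kept; that evaluation step is omitted from this sketch.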

Experimental results show that ranking by Fisher Information can sometimes yield a better subset than ranking by PCA variance alone. However, ranking by PCA variance gives much more consistent performance and requires less computation. We believe further investigation is needed into the conditions under which Fisher Information or other relevance measures, used as the selection criterion, achieve better classification performance than PCA variance.
1 Research Background and Objectives
1.1 Feature Selection
1.1.1 Selection Based on Attribute Similarity
1.1.2 Selection Using Genetic Algorithms
1.1.3 Minimum Redundancy Maximum Relevance (mRMR) Feature Selection
1.2 Feature Extraction
1.2.1 Principal Component Analysis (PCA)
1.2.2 Linear Discriminant Analysis (LDA)
2 Data Sets and Research Methods
2.1 Introduction to the Data Sets
2.1.1 Diabetic Retinopathy Debrecen (DRD)
2.1.2 Wisconsin Diagnostic Breast Cancer (WDBC)
2.2 Data Processing Methods
2.2.1 Principal Component Analysis (PCA)
2.2.2 Fisher Information
2.2.3 Attribute Correlation Analysis (R^2)
2.2.4 Area Under the ROC Curve
2.3 Classification Methods
2.3.1 Genetic Algorithm with Area Under the ROC Curve
2.3.2 Support Vector Machine (SVM)
2.3.3 Naive Bayes Classifier
2.4 Experimental Procedure
3 Results and Discussion
3.1 Diabetic Retinopathy Debrecen (DRD)
3.2 Wisconsin Diagnostic Breast Cancer (WDBC)
3.3 Conclusions
3.4 Discussion

SVM Example
Accuracy of DRD using GA/AUC
Accuracy of DRD using SVM
Accuracy of DRD using Naive Bayes
Accuracy of WDBC using GA/AUC
Accuracy of WDBC using SVM
Accuracy of WDBC using Naive Bayes
Diabetic Retinopathy Debrecen
Accuracy on DRD Dataset
Wisconsin Diagnostic Breast Cancer
Accuracy on WDBC Dataset
PC Ranking for DRD
PC Ranking for WDBC
