Graduate student (English): Zhen-fu Hong
Thesis title (English): Distance Features in Automatic Data Classification
Advisor (English): Chih-fong Tsai
Keywords (English): feature extraction; classification; clustering
In data mining and pattern classification, feature extraction and representation is a very important step, since the extracted features have a direct and significant impact on classification accuracy. In the literature, a number of novel feature extraction and representation methods have been proposed. However, many of them focus only on domain-specific problems. In this thesis, we introduce a novel distance-based feature extraction method for various pattern classification problems. Specifically, three distances are extracted, based on the distance between a data sample and its intra-cluster center and the distances between the sample and its extra-cluster centers. Experiments are conducted on ten datasets with different numbers of classes, samples, and dimensions. The experimental results using naïve Bayes, k-NN, and SVM classifiers show that concatenating the distance-based features to the original features provided by the datasets improves classification accuracy, except on image-related datasets. In particular, the distance-based features are suitable for datasets with smaller numbers of classes and samples and with lower feature dimensionality. Moreover, two further datasets with similar characteristics are used to validate this finding. The result is consistent with the first experiment: adding the distance-based features improves classification performance.
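The distance-based features described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`cluster_centers`, `distance_features`, `augment`) are hypothetical, cluster assignments are assumed to be given (for example by k-means), and cluster centers are assumed to be the mean of each cluster's members. The sketch computes only the two quantities the abstract names, the Euclidean distance from a sample to its own cluster center and the summed distance to all other cluster centers; the abstract does not spell out how the third distance is formed, so it is left out here.

```python
import math
from collections import defaultdict


def cluster_centers(samples, assignments):
    """Mean vector of each cluster (assumed definition of a cluster center)."""
    groups = defaultdict(list)
    for x, c in zip(samples, assignments):
        groups[c].append(x)
    return {c: [sum(col) / len(pts) for col in zip(*pts)]
            for c, pts in groups.items()}


def distance_features(samples, assignments):
    """For each sample, return (intra-cluster distance, summed extra-cluster distance)."""
    centers = cluster_centers(samples, assignments)
    feats = []
    for x, c in zip(samples, assignments):
        # Euclidean distance to the center of the sample's own cluster
        intra = math.dist(x, centers[c])
        # Sum of Euclidean distances to every other cluster center
        extra = sum(math.dist(x, ctr) for k, ctr in centers.items() if k != c)
        feats.append((intra, extra))
    return feats


def augment(samples, assignments):
    """Concatenate the original features with the distance-based features."""
    feats = distance_features(samples, assignments)
    return [list(x) + list(f) for x, f in zip(samples, feats)]


if __name__ == "__main__":
    samples = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
    assignments = [0, 0, 1, 1]
    for row in augment(samples, assignments):
        print(row)
```

The augmented rows would then be passed to a classifier such as naïve Bayes, k-NN, or an SVM in place of the original feature vectors.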
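Chinese Abstract
English Abstract
Table of Contents
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives and Expected Benefits
1.4 Scope and Assumptions
1.5 Research Procedure
Chapter 2 Literature Review
2.1 Data Mining
2.1.1 Definition of Data Mining
2.1.2 The Data Mining Process
2.1.3 Functions of Data Mining
2.2 Data Mining Techniques
2.2.1 Clustering
K-means Clustering Algorithm
2.2.2 Classification
Support Vector Machine
K-Nearest Neighbor (KNN)
2.3 Feature Selection
2.3.1 Principal Component Analysis (PCA)
2.4 Related Work
Chapter 3 Research Method
3.1 Research Process
3.2 Computing Cluster Centers and Intra-Cluster and Extra-Cluster Distances
3.2.1 Euclidean Distance Formula
3.2.2 Intra-Cluster Distance
3.2.3 Extra-Cluster Distance
3.3 Training and Testing
Chapter 4 Experimental Results
4.1 Experimental Design
4.1.1 Datasets
4.1.2 Classifiers
4.1.3 K-Fold Cross-Validation
4.1.4 PCA for Feature Selection
4.2 PCA Test
4.3 Classification Accuracy
4.4 Discussion of Results
4.5 Validation of Results
Chapter 5 Conclusions and Recommendations
5.1 Conclusions
5.2 Future Work and Recommendations
5.3 Research Limitations
References
Appendix A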
