Graduate student (English name): Yao-hwei Fang
Thesis title (English): The Data Complexity Index to Construct an Efficient Cross-validation Method
Advisor (English name): Der-chiang Li
Keywords (English): Binary Classification; Cross-validation; Data Complexity
Cross-validation is widely used for model evaluation in data mining. However, carrying out a sound evaluation usually requires choosing several important parameters, such as the training data size and the number of experiment runs. This research develops an efficient cross-validation method for binary classification problems, called Complexity-based Efficient (CBE) cross-validation. The method establishes a complexity index, the CBE index, which is positively and highly correlated with classification accuracy. The CBE index is combined with statistical sample size determination to calculate the optimal training data size and number of experiment runs, which reduces model evaluation time for large and complex classification data sets.

The experimental results show that the CBE index is highly correlated with classification accuracy, that CBE cross-validation performs similarly to K-fold cross-validation and Repeated Random Sub-sampling Validation, and that it requires less time than either of them. The CBE index helps users understand the characteristics and structure of the data in advance, and CBE cross-validation helps them find the optimal training data size and number of experiment runs.
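
The abstract does not state the formula used for sample size determination, so the following is only a minimal sketch of the general idea, assuming the textbook normal-approximation rule n = (z * s / E)^2 applied to a handful of pilot runs. The function name `required_runs`, the `margin` parameter, and the pilot accuracy values are hypothetical and are not taken from the thesis.

```python
import math
from statistics import NormalDist, stdev


def required_runs(pilot_accuracies, margin, confidence=0.95):
    """Estimate how many experiment runs are needed so that the mean
    accuracy is known to within +/- margin at the given confidence.

    Assumption: textbook sample size rule n = (z * s / E)^2, where s is
    the standard deviation observed in a few pilot runs. This is not
    necessarily the rule used by CBE cross-validation.
    """
    if len(pilot_accuracies) < 2:
        raise ValueError("need at least two pilot runs to estimate variance")
    if margin <= 0:
        raise ValueError("margin must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample standard deviation of the pilot accuracies.
    s = stdev(pilot_accuracies)
    # Round up, and always perform at least one run.
    return max(1, math.ceil((z * s / margin) ** 2))


# Hypothetical accuracies from five pilot runs of a binary classifier.
pilot = [0.81, 0.84, 0.79, 0.83, 0.80]
print(required_runs(pilot, margin=0.01))
```

A tighter margin or a noisier classifier raises the required number of runs, which is the trade-off between evaluation time and reliability that the abstract describes.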
