Author (English name): Yuh-Shii Chiang
Thesis title (English): Performance Evaluation of Class Distribution on Data Mining
Advisor (English name): Tsung-Yuan Tseng
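In data mining classification, when the class distribution is imbalanced or the misclassification costs of the classes differ greatly, cost should be taken into account and the performance criterion should shift from maximizing accuracy to minimizing misclassification cost; this is known as cost-sensitive classification. The most common way to make a machine learner cost-sensitive is to replace training data drawn at the natural distribution with training data whose class distribution is cost-sensitive, that is, to train the classifier with a higher proportion of training samples from the class with the higher misclassification cost, so as to reduce the total misclassification cost. This study uses four classifiers (decision tree, rough set, back-propagation neural network, and support vector machine) with four UCI data sets. With the class misclassification costs known and the training sample size fixed, the class proportions of the training samples are varied to train different classifiers, which are then tested on test samples at the natural proportion to obtain the total misclassification cost for each class distribution, and the cost sensitivity of each classifier to the class distribution of the training samples is examined in depth. The results show: (1) Classifiers trained with an over-sampled minority class (1:10, 1:9, …, 1:2) keep the misclassification cost stable at its lowest point, and over-sampling the minority class by 2 to 10 times has no significant effect on the total misclassification cost. Conversely, classifiers trained with an under-sampled minority class (2:1, 3:1, …, 10:1) show a gradually increasing total misclassification cost. (2) How strongly a change in class distribution affects the slope of the misclassification cost curve depends on the ratio of the increase in the number of misclassified minority cases to the decrease in the number of misclassified majority cases. (3) Compared with the decision tree, rough set, and back-propagation neural network, the support vector machine is more cost-sensitive to changes in the class distribution of the training samples: when moving from slight over-sampling to slight under-sampling of the minority class (1:2 to 2:1), the misclassification cost rises sharply.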

Under an imbalanced class distribution or widely differing misclassification costs, misclassification cost should be emphasized and the performance measure should switch from maximizing accuracy to minimizing misclassification cost. The most common way to build a cost-sensitive classifier is to replace training data at the natural class distribution with training data that over-samples the class with the higher misclassification cost.
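To illustrate why the performance measure matters, the following minimal sketch compares accuracy with total misclassification cost for two hypothetical classifiers. The test set, the class labels `"maj"` and `"min"`, and the 1:10 cost ratio are all assumptions made up for illustration; none of them come from the thesis.

```python
def accuracy(y_true, y_pred):
    """Fraction of test cases classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def total_cost(y_true, y_pred, cost):
    """Sum of misclassification costs; cost[label] is the assumed cost of
    misclassifying one case whose true class is `label`."""
    return sum(cost[t] for t, p in zip(y_true, y_pred) if t != p)


# Hypothetical test set: 90 majority cases and 10 minority cases.
y_true = ["maj"] * 90 + ["min"] * 10
# Assumed costs: misclassifying a minority case is ten times as costly.
cost = {"maj": 1, "min": 10}

# Classifier A predicts the majority class for every case.
pred_a = ["maj"] * 100
# Classifier B finds every minority case but misclassifies 20 majority cases.
pred_b = ["min"] * 20 + ["maj"] * 70 + ["min"] * 10

print(accuracy(y_true, pred_a), total_cost(y_true, pred_a, cost))  # 0.9 100
print(accuracy(y_true, pred_b), total_cost(y_true, pred_b, cost))  # 0.8 20
```

Under these assumed costs, classifier A has the higher accuracy but five times the total cost of classifier B, which is the situation in which minimizing cost and maximizing accuracy disagree.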
In this paper, we train classifiers on training data with various class distributions and then test them on testing data at the natural class distribution. The accumulated misclassification cost curves across the different training class distributions are then plotted and analyzed to explore the cause-effect relationship. The experiment includes four types of classifier (decision tree, rough set, back-propagation neural network, and support vector machine) and four UCI data sets with imbalanced class distribution. Treating the minority class of each of the four UCI data sets as the actual positive class, misclassified minority cases are counted as false positives (FP) and misclassified majority cases as false negatives (FN).
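The resampling step described above might be sketched as follows. This is only an illustration under stated assumptions: the function `resample_to_ratio`, the pool sizes, the fixed training size of 220, sampling with replacement, and the inclusion of a 1:1 midpoint are all hypothetical choices, not details taken from the thesis.

```python
import random


def resample_to_ratio(majority, minority, maj_parts, min_parts, total, seed=0):
    """Draw a training set of fixed size `total` whose majority:minority
    ratio is maj_parts:min_parts. Sampling is with replacement (an
    assumption), so a class can be over-sampled beyond its pool size."""
    rng = random.Random(seed)
    n_min = round(total * min_parts / (maj_parts + min_parts))
    n_maj = total - n_min
    return rng.choices(majority, k=n_maj), rng.choices(minority, k=n_min)


# Hypothetical pools of training cases (integers stand in for feature rows).
majority_pool = list(range(900))
minority_pool = list(range(900, 1000))

# Sweep from heavy over-sampling of the minority (1:10) to heavy
# under-sampling (10:1); 1:1 is included here as an assumed midpoint.
ratios = [(1, k) for k in range(10, 1, -1)] + [(1, 1)] + [(k, 1) for k in range(2, 11)]
for maj_parts, min_parts in ratios:
    maj_sample, min_sample = resample_to_ratio(
        majority_pool, minority_pool, maj_parts, min_parts, total=220
    )
    print(f"{maj_parts}:{min_parts} -> {len(maj_sample)} majority, {len(min_sample)} minority")
```

Each resampled set would then be used to train one classifier, which is evaluated on a test set kept at the natural class distribution.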
The results show the following. First, classifiers trained with an over-sampled minority class maintain the minimum accumulated misclassification cost over a wide range of majority-to-minority ratios (1:10, 1:9, …, 1:2), while under-sampling the minority class increases the accumulated misclassification cost dramatically. Second, the slope of the accumulated misclassification cost curve is determined by the ratio of the rate of change of false positives (FP) to the rate of change of false negatives (FN). Third, the support vector machine is the most sensitive to the class distribution of the training data: its accumulated misclassification cost increases dramatically in the transition from slight over-sampling of the minority class (1:2) to slight under-sampling of the minority class (2:1).
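The second result can be read off a simple cost decomposition. The notation below is an assumption introduced for illustration and does not appear in the thesis: $r$ is the training class ratio, $N_{\mathrm{min}}(r)$ and $N_{\mathrm{maj}}(r)$ are the numbers of misclassified minority and majority test cases, and $c_{\mathrm{min}}$ and $c_{\mathrm{maj}}$ are the corresponding per-case misclassification costs.

```latex
\[
C(r) = c_{\mathrm{min}}\, N_{\mathrm{min}}(r) + c_{\mathrm{maj}}\, N_{\mathrm{maj}}(r)
\]
% Moving one step toward under-sampling the minority class typically gives
% more minority errors and fewer majority errors:
\[
\Delta C = c_{\mathrm{min}}\, \Delta N_{\mathrm{min}} + c_{\mathrm{maj}}\, \Delta N_{\mathrm{maj}},
\qquad \Delta N_{\mathrm{min}} \ge 0, \quad \Delta N_{\mathrm{maj}} < 0
\]
% The cost curve therefore rises exactly when
\[
\Delta C > 0
\iff
\frac{\Delta N_{\mathrm{min}}}{-\Delta N_{\mathrm{maj}}} > \frac{c_{\mathrm{maj}}}{c_{\mathrm{min}}}
\]
```

Under this sketch, with the costs fixed, the sign and steepness of the cost curve depend only on the ratio of added minority errors to removed majority errors.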

Keywords: cost-sensitive classification; imbalanced class distribution
