



Graduate student (English): Li Wei
Thesis title (English): Using Over-sampling and Multi-classifier Committee Approach for skewed class distribution – a case study of diagnosis model construction of Benign prostate hypertrophy and Cancer of prostate
Advisor (English): Fan Wu
Keywords (English): skewed distribution; Over-sampling; Multi-classifier Committee Approach
For datasets whose classes are evenly distributed, prediction models built with existing data mining classification methods already achieve a reasonable level of prediction accuracy. In real data mining studies, however, datasets often have a skewed class distribution. In clinical settings, healthy people usually far outnumber unhealthy people, so the collected data are inherently skewed. Prediction models built from such skewed data tend to suffer from serious prediction bias.
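The prediction bias described above can be made concrete with a small sketch. The class counts (95 healthy, 5 unhealthy) and the helper names `accuracy` and `recall` are assumptions chosen for illustration and do not come from the thesis. The sketch only shows that a model which always predicts the majority class can score high overall accuracy while detecting none of the minority cases.

```python
# Hypothetical labels: 0 = healthy (majority), 1 = unhealthy (minority).
# The 95/5 split is an assumed example, not data from the thesis.
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all samples that were predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of samples of class `positive` that were predicted as such."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total if total else 0.0


overall_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
print(overall_accuracy, minority_recall)
```

With these assumed counts, overall accuracy is 0.95 while recall on the minority class is 0.0, which is why accuracy alone can be misleading on skewed data.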
Three existing approaches address skewed distributions: under-sampling, over-sampling, and the multi-classifier committee approach. This research adopts over-sampling and the multi-classifier committee approach and improves on them. The objective is to raise prediction accuracy for the minority class of the dataset. The case study concerns benign prostate hypertrophy and cancer of the prostate, and data on these diseases are used to test the classification performance of the proposed algorithm.
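As a rough illustration of the two techniques named above, the following sketch shows a simplified SMOTE-style over-sampler and a majority-vote committee. It is not the improved algorithm developed in the thesis, whose details are not given in this abstract. The function names `smote_like` and `majority_vote`, the parameter `k`, and the sample points are all hypothetical.

```python
import random
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples (simplified SMOTE-style sketch).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def majority_vote(predictions):
    """Committee decision: the label predicted by the most classifiers.

    Tie handling is not addressed in this sketch.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical minority-class points, for illustration only.
minority_points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_like(minority_points, n_new=6, k=2, seed=42)
print(len(new_points))           # number of synthetic samples generated
print(majority_vote([1, 0, 1]))  # three hypothetical classifier outputs
```

Interpolating between neighbouring minority samples, rather than duplicating existing ones, is the basic idea of SMOTE-style over-sampling. A committee then combines the outputs of several classifiers, here by simple voting.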
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Organization of thesis
Chapter 2 Literature Review
2.1 Decision tree and Classification and Regression Tree
2.2 k-means Method
2.3 Skewed class distribution
2.4 Benign Prostate Hypertrophy and Cancer of Prostate
Chapter 3 Algorithm
3.1 Problem definition and prediction strategy
3.2 Modeling phase - Improved SMOTE
3.3 Predict phase - Multi-classifier co-work
Chapter 4 Simulation
4.1 Evaluation criteria
4.2 Experiment design and simulation
4.3 The case of prostate
Chapter 5 Conclusion
5.1 Conclusion and Achievement
5.2 Research Restriction
5.3 Future work
References