Author: Chun-Kai Tseng (曾俊凱)
Title: One-class classification on imbalanced datasets with missing value imputation and instance selection
Title (Chinese): 單一類別分類方法於不平衡資料集-搭配遺漏值填補和樣本選取方法
Advisors: Chih-Fong Tsai (蔡志豐), Kuen-Liang Sue (蘇坤良)
Degree: Master's
Institution: National Central University
Department: Department of Information Management
Discipline: Computer Science
Subfield: General Computer Science
Year of publication: 2020
Academic year of graduation: 108
Language: Chinese
Pages: 91
Keywords: Imbalanced data sets; One-class classification; Missing value imputation; Instance selection
Imbalanced datasets are central to many practical data analysis problems, such as credit card fraud detection, medical diagnosis, and network attack classification. They can be handled either by preprocessing the data or by choosing a different classification method. One-class classification, known in other fields as outlier detection or novelty detection, is the approach taken here: this thesis applies One-Class SVM, Isolation Forest, and Local Outlier Factor to binary classification problems on imbalanced datasets. It further examines the case of incomplete data by simulating missing rates of 10% to 50% and imputing the missing values with methods such as Classification and Regression Trees (CART), so that the imputed data approximate the original data and classification accuracy improves. Because noisy instances in imbalanced data can also degrade the classifiers, instance selection methods, namely the Instance-Based algorithm (IB3), the Decremental Reduction Optimization Procedure (DROP3), and a Genetic Algorithm (GA), are applied to reduce noise in the dataset, cut model training time, and retain the instances that are sufficiently informative.
The baseline is one-class classification on the complete imbalanced data, against which every experiment is compared. The experiments examine missing value imputation combined with one-class classification, identify which instance selection methods improve one-class classification accuracy, and finally examine how the ordering of missing-value simulation, instance selection, and imputation affects classifier accuracy. The results show that, after missing value imputation, classification accuracy on the imbalanced data remains close to the baseline; that instance selection can increase classification accuracy, and that the reduction rate directly affects it; and that, when missing value imputation and instance selection are combined, a process that handles the complete and incomplete data separately can improve classification accuracy. When stable accuracy is the priority, however, simulating and imputing missing values on the complete data together with instance selection performs better.
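The missing-value simulation at rates of 10% to 50% can be illustrated with a minimal sketch. It assumes that cells are removed uniformly at random across the feature matrix (missing completely at random); the abstract does not state the mechanism, so this is an assumption. The function names `simulate_mcar` and `count_missing` and the toy data are hypothetical, and the sketch stops before imputation, which the thesis performs with methods such as CART.

```python
import random


def simulate_mcar(rows, missing_rate, seed=0):
    """Return a copy of `rows` with a fraction of cells replaced by None.

    Cells are drawn uniformly at random over the whole matrix. This MCAR
    mechanism is an assumption, not a detail taken from the thesis.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    if not rows:
        return []
    rng = random.Random(seed)
    n_cols = len(rows[0])
    # Enumerate every (row, column) position so each cell is equally likely.
    cells = [(r, c) for r in range(len(rows)) for c in range(n_cols)]
    n_missing = round(missing_rate * len(cells))
    masked = [list(row) for row in rows]  # leave the input untouched
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = None
    return masked


def count_missing(rows):
    """Count the cells that are marked as missing (None)."""
    return sum(value is None for row in rows for value in row)


# Hypothetical toy feature matrix: 20 instances with 5 features each.
data = [[float(r * 5 + c) for c in range(5)] for r in range(20)]

# The five missing rates named in the abstract.
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    masked = simulate_mcar(data, rate, seed=42)
    print(f"rate={rate:.0%} missing={count_missing(masked)} of {20 * 5} cells")
```

Fixing the seed keeps each simulated dataset reproducible, so the same masked data can be passed to different imputation and instance selection settings for a like-for-like comparison.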
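Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
List of Appendix Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Structure
2. Literature Review
2-1 Imbalanced Datasets
2-2 One-Class Classification (OCC)
2-2-1 One-Class SVM (OCSVM)
2-2-2 Isolation Forest (iForest)
2-2-3 Local Outlier Factor (LOF)
2-3 Missing Data
2-3-1 Missing Value Imputation Process and Methods
2-4 Instance Selection Methods
3. Research Methodology and Design
3-1 Experimental Framework and Preparation
3-2 Experiment 1
3-3 Experiment 2
3-4 Experiment 3-1
3-5 Experiment 3-2
3-6 Evaluation Criteria
4. Experimental Results
4-1 Results of Experiment 1
4-2 Results of Experiment 2
4-3 Results of Experiment 3-1
4-4 Results of Experiment 3-2
4-5 Summary of Experimental Results
5. Conclusion
5-1 Summary
5-2 Research Contributions and Future Work
References
Appendix 1. Detailed Classification Accuracy Results
1-1 Classification Accuracy at Missing Rates of 10% to 50% (MI)
1-2 Classification Accuracy with Instance Selection Methods (IS)
1-3 Classification Accuracy at Missing Rates of 10% to 50% Combined with Instance Selection Methods
Appendix 2. Detailed Instance Selection Reduction Rate Results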