臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Detailed Record

Author: 蔡武霖
Author (English): WU-LIN TSAI
Title: 資料淨化於類別不平衡問題:機器學習觀點 (Data Cleaning for the Class Imbalance Problem: A Machine Learning Perspective)
Advisor: 蔡志豐
Advisor (English): Chih-Fong Tsai
Degree: Master's
Institution: 國立中央大學 (National Central University)
Department: 資訊管理學系在職專班 (Department of Information Management, in-service program)
Discipline: Computing
Field: General Computing
Thesis type: Academic thesis
Publication year: 2018
Graduation academic year: 107
Language: Chinese
Pages: 97
Keywords (Chinese): 機器學習、資料探勘、類別不平衡、抽樣、特徵選取
Keywords (English): Machine Learning, Data Mining, Class Imbalanced Problem, Sampling, Feature Selection
Usage statistics:
  • Cited: 0
  • Views: 250
  • Rating: (none)
  • Downloads: 0
  • Bookmarked: 1
Abstract (Chinese): Machine learning returned to the spotlight after Google's AlphaGo, which also underscores the importance of data collection. In practice, however, the difficulties and constraints of data collection often produce uneven class distributions, which makes classification difficult and inaccurate, because feature selection and imbalance handling (sampling) affect how a classifier learns and performs in the feature space. This study uses datasets from well-known public repositories and designs two workflows to investigate the class imbalance problem, specifically whether feature selection or sampling should come first. Five sampling modules are used for imbalance handling, namely three over-sampling methods and two under-sampling methods, placed either before or after feature selection; two feature selection modules are used, and each workflow is run with and without normalization. For classification, the two classifiers most commonly used in class imbalance research, the support vector machine (SVM) and the decision tree classifier, are adopted. The experiments show that feature selection should be performed before imbalance handling (sampling); after feature selection, SMOTE over-sampling works best for small datasets, while random under-sampling is preferable for large datasets; PCA is recommended when fewer than 20 features are retained and GA when 20 or more dimensions are used; SVM is the better classifier; and as for normalization, it should be omitted for the decision tree and applied for the SVM.
Abstract (English): After AlphaGo, machine learning again caught the public eye and highlighted the essential need for data collection. In reality, however, collected data are often uneven because of the many difficulties and constraints of data collection. Feature selection and imbalance handling (sampling) both affect how a classifier learns and performs in the feature space, which in turn makes classification difficult and inaccurate. This research uses datasets from well-known public repositories and designs two processes to investigate the class imbalance problem, placing sampling either before or after feature selection. Five imbalance-handling (sampling) methods are used, three over-sampling and two under-sampling methods, applied before or after feature selection; two feature selection models are used, and each process is run with and without normalization. The two classifiers most commonly used for class imbalance, the support vector machine and the decision tree, are adopted. The results show that feature selection should be performed before imbalance handling: after feature selection, SMOTE over-sampling performs best for small datasets, while random under-sampling is preferable for large datasets. PCA is recommended when fewer than 20 features are retained, and GA when 20 or more dimensions are used. The best classifier is SVM. As for normalization, the decision tree performs better without it and the support vector machine with it.
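To make the recommended workflow concrete, the following is a minimal sketch, assuming Python with scikit-learn and imbalanced-learn; it is an illustration only, not the thesis's actual code, and the synthetic dataset, component counts, and random seeds are placeholders (the GA-based feature selection recommended above 20 dimensions is not shown). It follows the order the study recommends: feature selection first, then imbalance handling by sampling, then classification with an SVM, evaluated by AUC.

```python
# Illustrative sketch only (not the thesis's actual code): feature selection first,
# then imbalance handling (sampling), then classification, evaluated by AUC.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE      # over-sampling, best for small datasets here
from imblearn.pipeline import Pipeline        # applies samplers only while fitting

# Synthetic imbalanced data standing in for the public-repository datasets.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),         # normalization: used for SVM, omitted for decision trees
    ("select", PCA(n_components=10)),    # feature selection first (PCA when < 20 features are kept)
    ("sample", SMOTE(random_state=42)),  # then over-sample the minority class
    ("clf", SVC(kernel="rbf")),          # SVM was the better-performing classifier
])

# Area under the ROC curve (AUC) is the evaluation measure listed in Section 3.6.2.
auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc").mean()
print(f"Mean AUC: {auc:.3f}")

# For a large dataset, swap SMOTE for imblearn.under_sampling.RandomUnderSampler.
```

Using imbalanced-learn's Pipeline keeps the sampler inside cross-validation, so SMOTE resamples only the training folds and the reported AUC is not inflated by synthetic test samples.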
Chinese Abstract I
Abstract II
Acknowledgements III
Table of Contents IV
List of Tables VI
List of Figures VIII
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 2
1.3 Research Objectives 3
1.4 Research Contributions 3
1.5 Thesis Organization 4
Chapter 2 Literature Review 6
2.1 Data Preprocessing 6
2.2 Feature Selection 7
2.2.1 Supervised Learning 7
2.2.2 Genetic Algorithm (GA) 8
2.2.3 Unsupervised Learning 10
2.2.4 Principal Component Analysis (PCA) 11
2.3 Class Imbalance 12
2.3.1 Over-Sampling 12
2.3.2 SMOTE 13
2.3.3 Borderline-SMOTE 14
2.3.4 ADASYN 16
2.3.5 Under-Sampling 18
2.3.6 Random Under-Sampling 19
2.3.7 Edited Nearest Neighbours 20
2.4 Support Vector Machine (SVM) 21
2.5 Decision Tree Classifier 22
2.6 Normalization 24
2.7 Recent Related Research 26
Chapter 3 Research Methodology 28
3.1 Research Framework 28
3.1.1 Research Framework Workflow 30
3.2 Data Collection and Preprocessing 32
3.2.1 Data Collection 33
3.2.2 Data Preprocessing Workflow 34
3.3 Feature Selection Models and Types 35
3.3.1 Feature Selection Workflow 36
3.4 Imbalanced Data Handling Models and Types 37
3.5 Classification Models and Types 39
3.6 Evaluation Methods 41
3.6.1 Confusion Matrix 41
3.6.2 Area Under the ROC Curve (AUC) 43
3.7 Summary 44
Chapter 4 System Implementation and Experiments 45
4.1 Experimental Environment 46
4.2 Experimental Design 47
4.2.1 Model Parameter Settings 49
4.3 Analysis of Experimental Results 50
4.3.1 Feature Selection vs. Imbalance Handling 51
4.3.2 Over-Sampling vs. Under-Sampling 54
4.3.3 PCA vs. GA 56
4.3.4 Normalization vs. Non-Normalization 59
Chapter 5 Conclusions 71
5.1 Research Summary 71
5.2 Suggestions and Future Research Directions 72
References 73
Appendix 81
1. Inside (2019),“Google Alpha Go,”(accessed 2019/03/10, available at: https://www.inside.com.tw/article/9071-how-alphago-inspire-human-in-go).
2. Wikipedia (2019),“圍棋,”(accessed 2019/03/10, available at: https://zh.wikipedia.org/wiki/%E5%9B%B4%E6%A3%8B).
3. Su, C. T., Chen, L. S., & Yih, Y. (2006). Knowledge acquisition through information granulation for imbalanced data. Expert Systems with applications, 31(3), 531-541.
4. Su, C. T., Yang, C. H., Hsu, K. H., & Chiu, W. K. (2006). Data mining for the diagnosis of type II diabetes from three-dimensional body surface anthropometrical scanning data. Computers & mathematics with applications, 51(6-7), 1075-1092.
5. Liao, T. W. (2008). Classification of weld flaws with imbalanced class data. Expert Systems with Applications, 35(3), 1041-1052.
6. Chae, Y. M., Ho, S. H., Cho, K. W., Lee, D. H., & Ji, S. H. (2001). Data mining approach to policy analysis in a health insurance domain. International journal of medical informatics, 62(2-3), 103-111.
7. Barandela, R., Sánchez, J. S., Garca, V., & Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 849-851.
8. Zhou, Z. H., & Liu, X. Y. (2006). Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge & Data Engineering, (1), 63-77.
9. An, A., & Wang, Y. (2001). Comparisons of classification methods for screening potential compounds. In Proceedings 2001 IEEE International Conference on Data Mining (pp. 11-18). IEEE.

10. Weiss, G. M. (2004). Mining with rarity: a unifying framework. ACM Sigkdd Explorations Newsletter, 6(1), 7-19.
11. Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
12. Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1(1), 81-106.
13. Schölkopf, B., & Smola, A. (2001). Support vector machine. KDD 99: The First Annual International Conference on Knowledge Discovery in Data, 321-357.
14. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463-484.
15. Sun, Z., Song, Q., & Zhu, X. (2012). Using coding-based ensemble learning to improve software defect prediction. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6), 1806-1817.
16. Yang, Z., Tang, W. H., Shintemirov, A., & Wu, Q. H. (2009). Association rule mining-based dissolved gas analysis for fault diagnosis of power transformers. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 39(6), 597-610.
17. Zhu, Z. B., & Song, Z. H. (2010). Fault diagnosis based on imbalance modified kernel Fisher discriminant analysis. Chemical Engineering Research and Design, 88(8), 936-951.
18. Khreich, W., Granger, E., Miri, A., & Sabourin, R. (2010). Iterative Boolean combination of classifiers in the ROC space: An application to anomaly detection with HMMs. Pattern Recognition, 43(8), 2732-2752.
19. Mazurowski, M. A., Habas, P. A., Zurada, J. M., Lo, J. Y., Baker, J. A., & Tourassi, G. D. (2008). Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural networks, 21(2-3), 427-436.
20. Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
21. Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7), 1145-1159.
22. Huang, J., & Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on knowledge and Data Engineering, 17(3), 299-310.
23. Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter, 6(1), 20-29.
24. Napierała, K., Stefanowski, J., & Wilk, S. (2010, June). Learning from imbalanced data in presence of noisy and borderline examples. In International Conference on Rough Sets and Current Trends in Computing (pp. 158-167). Springer, Berlin, Heidelberg.
25. 陳逸真 (2017). Comparison of Imbalanced Data Classification Methods (pp. 17-18).
26. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine learning research, 3(Mar), 1157-1182.
27. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: springer.
28. Bermingham, M. L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., ... & Haley, C. S. (2015). Application of high-dimensional feature selection: evaluation for genomic prediction in man. Scientific reports, 5, 10312.
29. Li, T. S. (2006). Feature selection for classification by using a GA-based neural network approach. Journal of the Chinese Institute of Industrial Engineers, 23(1), 55-64.
30. Liu, H., & Motoda, H. (Eds.). (1998). Feature extraction, construction and selection: A data mining perspective (Vol. 453). Springer Science & Business Media.
31. Liu, H., & Motoda, H. (2012). Feature selection for knowledge discovery and data mining (Vol. 454). Springer Science & Business Media.
32. Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter, 6(1), 1-6.
33. 譚琳 (2008). 非平衡數據挖掘簡介 [An introduction to imbalanced data mining]. Computer Science and Technology Symposium, Nanjing.
34. Weiss, G. M. (2004). Mining with rarity: a unifying framework. ACM Sigkdd Explorations Newsletter, 6(1), 7-19.
35. Barandela, R., Rangel, E., Sánchez, J. S., & Ferri, F. J. (2003, November). Restricted decontamination for the imbalanced training sample problem. In Iberoamerican Congress on Pattern Recognition (pp. 424-431). Springer, Berlin, Heidelberg.
36. Barandela, R., Sánchez, J. S., Garca, V., & Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 849-851.
37. Dorronsoro, J. R., Ginel, F., Sgnchez, C., & Cruz, C. S. (1997). Neural fraud detection in credit card operations. IEEE transactions on neural networks, 8(4), 827-834.
38. Tseng, Y. H., & Chien, J. T. (2017). International Journal of Computational Linguistics & Chinese Language Processing, 22(1).
39. Inside (2019),“監督式學習,”(accessed 2019/03/10, available at: https://www.inside.com.tw/article/9945-machine-learning).
40. Holland, J. H. (1992). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press.
41. Wikipedia(2019),“遺傳演算法,”(accessed 2019/03/15, available at: https://zh.wikipedia.org/wiki/%E9%81%97%E4%BC%A0%E7%AE%97%E6%B3%95).
42. Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559-572.


43. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
44. Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009, April). Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia conference on knowledge discovery and data mining (pp. 475-482). Springer, Berlin, Heidelberg.
45. Tetko, I. V., Livingstone, D. J., & Luik, A. I. (1995). Neural network studies. 1. Comparison of overfitting and overtraining. Journal of chemical information and computer sciences, 35(5), 826-833.
46. Fernández, A., Garcia, S., Herrera, F., & Chawla, N. V. (2018). Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of artificial intelligence research, 61, 863-905.
47. Mi, Y. (2013). Imbalanced classification based on active learning SMOTE. Research Journal of Applied Sciences, Engineering and Technology, 5(3), 944-949.
48. Han, H., Wang, W. Y., & Mao, B. H. (2005, August). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing (pp. 878-887). Springer, Berlin, Heidelberg.
49. He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.
50. More, A. (2016). Survey of resampling techniques for improving classification performance in unbalanced datasets. arXiv preprint arXiv:1608.06048.
51. Medium(2019),“ADASYN,”(accessed 2019/03/15, available at: https://medium.com/@ruinian/an-introduction-to-adasyn-with-code-1383a5ece7aa).
52. He, H., & Garcia, E. A. (2008). Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering, (9), 1263-1284.
53. Mani, I., & Zhang, I. (2003, August). kNN approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of workshop on learning from imbalanced datasets (Vol. 126).
54. Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3), 408-421.
55. Wilson, D. R., & Martinez, T. R. (2000). Reduction techniques for instance-based learning algorithms. Machine learning, 38(3), 257-286.
56. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-297.
57. 雷祖強, 周天穎, 萬絢, 楊龍士, & 許晉嘉 (2007). 空間特徵分類器支援向量機之研究 [A study of support vector machines as spatial feature classifiers]. Journal of Photogrammetry and Remote Sensing, 12(2), 145-163.
58. Cuingnet, R., Rosso, C., Chupin, M., Lehéricy, S., Dormont, D., Benali, H., ... & Colliot, O. (2011). Spatial regularization of SVM for the detection of diffusion alterations associated with stroke outcome. Medical image analysis, 15(5), 729-737.
59. Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
60. Breiman, L. (2017). Classification and regression trees. Routledge.
61. Scikit-learn (2019),“Decision Trees,”(accessed 2019/02/10, available at: https://scikit-learn.org/stable/modules/tree.html).
62. Scikit-learn (2019),“StandardScaler,”(accessed 2019/02/10, available at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).


63. Harvard Business Review (2019),“garbage-in, garbage-out,”(accessed 2019/02/05, available at: https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless).
64. Provost, F., & Kohavi, R. (1998). Guest editors' introduction: On applied research in machine learning. Machine learning, 30(2), 127-132.
65. Marimont, R. B., & Shapiro, M. B. (1979). Nearest neighbour searches and the curse of dimensionality. IMA Journal of Applied Mathematics, 24(1), 59-70.
66. Chávez, E., Navarro, G., Baeza-Yates, R., & Marroquín, J. L. (2001). Searching in metric spaces. ACM computing surveys (CSUR), 33(3), 273-321.
67. Wikipedia (2019),“維數災難,”(accessed 2019/04/15, available at: https://zh.wikipedia.org/wiki/%E7%BB%B4%E6%95%B0%E7%81%BE%E9%9A%BE).
68. Johnstone, I. M., & Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486), 682-693.
69. Lu, Y., Cohen, I., Zhou, X. S., & Tian, Q. (2007, September). Feature selection using principal feature analysis. In Proceedings of the 15th ACM international conference on Multimedia (pp. 301-304). ACM.
70. Aksoy, S., & Haralick, R. M. (2001). Feature normalization and likelihood-based similarity measures for image retrieval. Pattern recognition letters, 22(5), 563-582.
71. Wikipedia(2019),“Feature scaling,”(accessed 2019/05/01, available at: https://en.wikipedia.org/wiki/Feature_scaling).
72. Archive(2019),“normalization,”(accessed 2019/04/20, available at: https://web.archive.org/web/20121230101134/http://www.qsarworld.com/qsar-statistics-normalization.php).
73. Wu, G., & Chang, E. Y. (2005). KBA: Kernel boundary alignment considering imbalanced data distribution. IEEE Transactions on Knowledge & Data Engineering, (6), 786-795.
74. KEEL (2019),“keel Imbalanced data sets,”(accessed 2018/10/10, available at: http://sci2s.ugr.es/keel/imbalanced.php).
75. Uottawa (2019),“NASA Imbalanced data sets,”(accessed 2018/10/10, available at: http://promise.site.uottawa.ca/SERepository/datasets-page.html).
76. Github (2019),“NASA Imbalanced data sets,”(accessed 2018/10/10, available at: https://github.com/klainfo/NASADefectDataset/tree/master/OriginalData/MDP).
77. You, C., Li, C., Robinson, D. P., & Vidal, R. (2018, September). A Scalable Exemplar-Based Subspace Clustering Algorithm for Class-Imbalanced Data. In European Conference on Computer Vision (pp. 68-85). Springer, Cham.
78. Lin, W. C., Tsai, C. F., Hu, Y. H., & Jhang, J. S. (2017). Clustering-based undersampling in class-imbalanced data. Information Sciences, 409, 17-26.
79. Zhai, J., Zhang, S., & Wang, C. (2017). The classification of imbalanced large data sets based on mapreduce and ensemble of elm classifiers. International Journal of Machine Learning and Cybernetics, 8(3), 1009-1017.
80. Sun, Y., Kamel, M. S., & Wang, Y. (2006, December). Boosting for learning multiple classes with imbalanced class distribution. In Sixth International Conference on Data Mining (ICDM'06) (pp. 592-602). IEEE.
81. 林冠宇 (2013). 發展改良式支持向量資料描述改善不平衡資料分類 [Developing an improved support vector data description for imbalanced data classification]. Master's thesis, Department of Industrial Engineering and Management, National Taipei University of Technology.
82. 林佳蒨 (2012). 支援向量機於不平衡資料類別問題之應用 [Applications of support vector machines to class imbalance problems]. Master's thesis, Department of Information Management, National Chi Nan University.
83. 羅隆晉 (2010). 以集群為基礎之多分類器模型對不平衡資料預測之研究 [A clustering-based multi-classifier model for imbalanced data prediction]. Master's thesis, Department of Engineering, Ming Chuan University.
84. 張毓珊 (2009). 以集群為基礎之多分類器模型對不平衡資料預測之研究 [A clustering-based multi-classifier model for imbalanced data prediction]. Master's thesis, Department of Information Management, Chaoyang University of Technology.