Taiwan National Digital Library of Theses and Dissertations

Detailed Record

Author: Cheng-Ting Huang (黃正廷)
Thesis title: Hybrid Sampling Strategy to Class-Imbalanced Classification Problem (混合抽樣策略以改善類別不平衡分類問題)
Advisor: K. Robert Lai (賴國華)
Oral examination committee: Chih-Yueh Chou (周志岳), Chung-Hsien Lan (藍中賢)
Oral defense date: 2015-11-02
Degree: Master's
Institution: Yuan Ze University
Department: Department of Computer Science and Engineering
Discipline: Engineering
Academic field: Electrical and computer engineering
Thesis type: Academic thesis
Year of publication: 2015
Graduation academic year: 105
Language: Chinese
Number of pages: 56
Keywords (Chinese): 製造業資料, 增加少數類別抽樣, 減少多數類別抽樣, 機器學習, 敏感度, 特異度
Keywords (English): manufacturing data, over-sampling, under-sampling, machine learning, sensitivity, specificity
Statistics:
  • Cited by: 0
  • Views: 448
  • Rating:
  • Downloads: 0
  • Saved to bibliography lists: 0
Because manufacturing production data are imbalanced, with defective items far fewer than non-defective ones, the classification ability of machine learning suffers and prediction accuracy on the minority class (the defective items) drops sharply. Solutions fall into two broad categories: improving the data distribution, and improving the algorithm. This study takes the data-distribution route, that is, improved sampling. Sampling methods broadly comprise under-sampling of the majority class, over-sampling of the minority class, and variants of the two. Under-sampling risks discarding useful data, while over-sampling risks overfitting or adding noisy data; both weaknesses hurt classification accuracy. This study combines over-sampling with under-sampling to handle class-imbalanced manufacturing data, and compares the predictive performance of four machine learning algorithms in terms of sensitivity, specificity, G-mean, and related statistics. The results show that this hybrid sampling greatly shortens the time needed to train on production data, and that, paired with Random Forest, LibSVM, or K-nearest neighbor (KNN) classifiers, it clearly raises the overall prediction accuracy, i.e., the G-mean, improving sensitivity on the minority class while minimally affecting specificity on the majority class.

Keywords: manufacturing data, over-sampling, under-sampling, machine learning, sensitivity, specificity
Due to the imbalanced nature of manufacturing production data, where the number of defective products is far smaller than the number of non-defective products, the classification capability of machine learning is impaired and the prediction accuracy on the minority class is greatly reduced. Two approaches are known to address the problem: improving the data distribution, or improving the classification algorithm. This thesis adopts the data-improvement approach, also known as sampling. Under-sampling and over-sampling are the two major sampling methods; under-sampling may remove useful data, while over-sampling may cause overfitting or introduce noisy data. This study combines over-sampling and under-sampling to process the imbalanced manufacturing data, then compares the predictive performance of four machine learning algorithms in terms of sensitivity, specificity, G-mean, and related statistics. The results show that the method significantly reduces training time on optical thin-film manufacturing data, and that classification with Random Forest, LibSVM, or K-nearest neighbor (KNN) markedly improves the overall prediction accuracy, G-mean, which accounts for accuracy on both the majority and the minority class.

Keywords: manufacturing data, over-sampling, under-sampling, machine learning, sensitivity, specificity
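The hybrid strategy described in the abstract, under-sampling the majority class while synthesizing minority samples SMOTE-style, can be sketched as follows. This is a minimal illustration and not the thesis's exact procedure: the target class size (the geometric mean of the two class sizes), the choice of k = 3 neighbours, and the function names are all assumptions made here for the sketch.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """SMOTE-style synthesis: each new point lies on the segment between a
    random minority sample and one of its k nearest minority-class
    neighbours. Assumes at least k + 1 minority samples."""
    rng = np.random.default_rng(rng)
    synth = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances within the minority class
        neighbours = np.argsort(d)[1:k + 1]           # k nearest, self excluded
        j = rng.choice(neighbours)
        synth[t] = X_min[i] + rng.random() * (X_min[j] - X_min[i])
    return synth

def hybrid_sample(X_maj, X_min, rng=None):
    """Hybrid sampling sketch: randomly under-sample the majority class and
    SMOTE the minority class so both classes meet at the geometric mean of
    their original sizes (an assumed target, not the thesis's choice)."""
    rng = np.random.default_rng(rng)
    n_target = int(round(np.sqrt(len(X_maj) * len(X_min))))
    keep = rng.choice(len(X_maj), size=n_target, replace=False)  # random under-sampling
    X_min_new = np.vstack([X_min, smote_oversample(X_min, n_target - len(X_min), rng=rng)])
    return X_maj[keep], X_min_new
```

With 100 majority and 10 minority samples, both classes are brought to round(sqrt(1000)) = 32 samples, so a classifier trains on a balanced set roughly a third the size of the original majority class, which is how the combined method can cut training time while still preserving minority-class information.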
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
1.1. Research background
1.2. Research motivation and significance
1.3. Research objectives
1.4. Thesis organization
Chapter 2: Literature Review
2.1. Imbalanced data
2.2. Approaches to imbalanced data
2.3. Sampling
2.3.1. Under-sampling
2.3.2. Over-sampling
2.4. Classification
2.4.1. LIBSVM (A Library for Support Vector Machines)
2.4.2. Random forests
2.4.3. Naïve Bayes classifier
2.4.4. K-nearest neighbor (KNN) classification
Chapter 3: Methodology
3.1. Problem definition
3.2. Hybrid sampling
3.2.1. Under-sampling
3.2.2. SMOTE over-sampling
3.3. Evaluation measures
Chapter 4: Results and Analysis
4.1. Problem description
4.2. Data preprocessing
4.3. Under-sampling
4.4. Hybrid sampling
4.5. Per-algorithm comparison
4.6. G-mean and F-mean comparison
4.7. Time-efficiency comparison
4.8. Summary of experiments
Chapter 5: Conclusion
References
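The evaluation measures of section 3.3 (sensitivity, specificity, G-mean) can all be computed from a binary confusion matrix. The sketch below treats defective items as the positive (minority) class; the function and variable names are illustrative, not taken from the thesis.

```python
import math

def gmean(tp, fn, tn, fp):
    """G-mean of sensitivity and specificity from a binary confusion matrix,
    with the defective (minority) class as positive.
      sensitivity = TP / (TP + FN): recall on the minority class
      specificity = TN / (TN + FP): recall on the majority class
    The geometric mean is 0 whenever either class is ignored entirely,
    unlike overall accuracy, which can stay high on imbalanced data."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)
```

For example, a classifier that labels everything non-defective on a 10:990 split scores 99% accuracy, yet `gmean(0, 10, 990, 0)` returns 0.0, which is exactly the failure on the minority class that the thesis's G-mean criterion is meant to expose.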
Electronic full text: access is restricted to the campus systems and IP range of the graduate's home institution.
A link to the thesis page at the graduate's home institution is also provided; electronic full text may not be available for download there.
No related journal articles.