Taiwan National Digital Library of Theses and Dissertations

Detailed Record

Author: Cheng-Ting Huang (黃正廷)
Thesis title: Hybrid Sampling Strategy to Class-Imbalanced Classification Problem (混合抽樣策略以改善類別不平衡分類問題)
Advisor: K. Robert Lai (賴國華)
Oral examination committee: Chih-Yueh Chou (周志岳), Chung-Hsien Lan (藍中賢)
Oral defense date: 2015-11-02
Degree: Master's
Institution: Yuan Ze University
Department: Department of Computer Science and Engineering
Discipline: Engineering
Academic field: Electrical and computer engineering
Thesis type: Academic thesis
Year of publication: 2015
Graduation academic year: 105
Language: Chinese
Number of pages: 56
Keywords (Chinese): 製造業資料, 增加少數類別抽樣, 減少多數類別抽樣, 機器學習, 敏感度, 特異度
Keywords (English): manufacturing data, over-sampling, under-sampling, machine learning, sensitivity, specificity
Statistics:
  • Cited by: 0
  • Views: 448
  • Rating:
  • Downloads: 0
  • Saved to bibliography lists: 0
Because manufacturing production data are imbalanced, with defective items far fewer than non-defective ones, the classification ability of machine learning suffers and prediction accuracy on the minority class (the defective items) drops sharply. Solutions fall into two broad categories: improving the data distribution, and improving the algorithm. This study takes the data-distribution route, that is, improved sampling. Sampling methods broadly comprise under-sampling of the majority class, over-sampling of the minority class, and variants of the two. Under-sampling risks discarding useful data, while over-sampling risks overfitting or adding noisy data; both weaknesses hurt classification accuracy. This study combines over-sampling with under-sampling to handle class-imbalanced manufacturing data, and compares the predictive performance of four machine learning algorithms in terms of sensitivity, specificity, G-mean, and related statistics. The results show that this hybrid sampling greatly shortens the time needed to train on production data, and that, paired with Random Forest, LibSVM, or K-nearest neighbor (KNN) classifiers, it clearly raises the overall prediction accuracy, i.e., the G-mean, improving sensitivity on the minority class while minimally affecting specificity on the majority class.

Keywords: manufacturing data, over-sampling, under-sampling, machine learning, sensitivity, specificity
Due to the imbalanced nature of manufacturing production data, where the number of defective products is far smaller than the number of non-defective products, the classification capability of machine learning is impaired and the prediction accuracy on the minority class is greatly reduced. Two approaches are known to address the problem: improving the data distribution, or improving the classification algorithm. This thesis adopts the data-improvement approach, also known as sampling. Under-sampling and over-sampling are the two major sampling methods; under-sampling may remove useful data, while over-sampling may cause overfitting or introduce noisy data. This study combines over-sampling and under-sampling to process the imbalanced manufacturing data, then compares the predictive performance of four machine learning algorithms in terms of sensitivity, specificity, G-mean, and related statistics. The results show that the method significantly reduces training time on optical thin-film manufacturing data, and that classification with Random Forest, LibSVM, or K-nearest neighbor (KNN) markedly improves the overall prediction accuracy, G-mean, which accounts for accuracy on both the majority and the minority class.

Keywords: manufacturing data, over-sampling, under-sampling, machine learning, sensitivity, specificity
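The hybrid strategy described in the abstract, under-sampling the majority class while synthesizing minority samples SMOTE-style, can be sketched as follows. This is a minimal illustration and not the thesis's exact procedure: the target class size (the geometric mean of the two class sizes), the choice of k = 3 neighbours, and the function names are all assumptions made here for the sketch.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """SMOTE-style synthesis: each new point lies on the segment between a
    random minority sample and one of its k nearest minority-class
    neighbours. Assumes at least k + 1 minority samples."""
    rng = np.random.default_rng(rng)
    synth = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances within the minority class
        neighbours = np.argsort(d)[1:k + 1]           # k nearest, self excluded
        j = rng.choice(neighbours)
        synth[t] = X_min[i] + rng.random() * (X_min[j] - X_min[i])
    return synth

def hybrid_sample(X_maj, X_min, rng=None):
    """Hybrid sampling sketch: randomly under-sample the majority class and
    SMOTE the minority class so both classes meet at the geometric mean of
    their original sizes (an assumed target, not the thesis's choice)."""
    rng = np.random.default_rng(rng)
    n_target = int(round(np.sqrt(len(X_maj) * len(X_min))))
    keep = rng.choice(len(X_maj), size=n_target, replace=False)  # random under-sampling
    X_min_new = np.vstack([X_min, smote_oversample(X_min, n_target - len(X_min), rng=rng)])
    return X_maj[keep], X_min_new
```

With 100 majority and 10 minority samples, both classes are brought to round(sqrt(1000)) = 32 samples, so a classifier trains on a balanced set roughly a third the size of the original majority class, which is how the combined method can cut training time while still preserving minority-class information.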
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
1.1. Research background
1.2. Research motivation and significance
1.3. Research objectives
1.4. Thesis organization
Chapter 2: Literature Review
2.1. Imbalanced data
2.2. Approaches to imbalanced data
2.3. Sampling
2.3.1. Under-sampling
2.3.2. Over-sampling
2.4. Classification
2.4.1. LIBSVM (A Library for Support Vector Machines)
2.4.2. Random forests
2.4.3. Naïve Bayes classifier
2.4.4. K-nearest neighbor (KNN) classification
Chapter 3: Methodology
3.1. Problem definition
3.2. Hybrid sampling
3.2.1. Under-sampling
3.2.2. SMOTE over-sampling
3.3. Evaluation measures
Chapter 4: Results and Analysis
4.1. Problem description
4.2. Data preprocessing
4.3. Under-sampling
4.4. Hybrid sampling
4.5. Per-algorithm comparison
4.6. G-mean and F-mean comparison
4.7. Time-efficiency comparison
4.8. Summary of experiments
Chapter 5: Conclusion
References
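The evaluation measures of section 3.3 (sensitivity, specificity, G-mean) can all be computed from a binary confusion matrix. The sketch below treats defective items as the positive (minority) class; the function and variable names are illustrative, not taken from the thesis.

```python
import math

def gmean(tp, fn, tn, fp):
    """G-mean of sensitivity and specificity from a binary confusion matrix,
    with the defective (minority) class as positive.
      sensitivity = TP / (TP + FN): recall on the minority class
      specificity = TN / (TN + FP): recall on the majority class
    The geometric mean is 0 whenever either class is ignored entirely,
    unlike overall accuracy, which can stay high on imbalanced data."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)
```

For example, a classifier that labels everything non-defective on a 10:990 split scores 99% accuracy, yet `gmean(0, 10, 990, 0)` returns 0.0, which is exactly the failure on the minority class that the thesis's G-mean criterion is meant to expose.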
Electronic full text: access is restricted to the campus systems and IP range of the graduate's home institution.
A link to the thesis page at the graduate's home institution is also provided; electronic full text may not be available for download there.
No related journal articles.