
臺灣博碩士論文知識加值系統 (National Digital Library of Theses and Dissertations in Taiwan)


Detailed Record

Author: 羅元澤 (LUO, YUAN-ZE)
Title: 基於Adaboost的特徵選擇在提升分類演算法性能上之研究 (Research on Adaboost-based Feature Selection in Improving the Performance of Classification Algorithms)
Advisor: 李御璽 (LEE, YUE-SHI)
Committee Members: 林川傑 (LIN, CHUAN-JIE); 顏秀珍 (YEN, SHOW-JANE); 李御璽 (LEE, YUE-SHI)
Oral Defense Date: 2022-01-13
Degree: Master's
Institution: 銘傳大學 (Ming Chuan University)
Department: Master's Program, Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese
Pages: 73
Keywords: Adaboost; Decision Tree; Ensemble Learning; Feature Selection
Statistics:
  • Cited: 0
  • Views: 293
  • Downloads: 65
  • Bookmarked: 0
Feature selection is a technique for improving the performance of classification algorithms. It reduces data dimensionality by extracting an effective feature subset from the data, addressing the high model complexity and poor classification performance caused by high-dimensional data. Besides feature selection, ensemble learning, which builds multiple classifiers that jointly decide the class of each instance, can also improve the performance of a classification model.
Adaboost is a leading ensemble learning method. It trains multiple classifiers of the same base classification algorithm by sampling the data. Its strength is that it assigns higher weights to misclassified instances, giving them a higher probability of being used to train the next classifier, thereby continually improving the next classifier's performance. When Adaboost builds its models on decision trees, the trees' built-in feature selection means that, after training, the method can output not only the models but also a feature selection result. If the base algorithm is not a decision tree, however, only the models can be output.
Based on Adaboost, this study proposes a new classification method and two methods for producing feature importance. Our classification method integrates feature selection into Adaboost ensemble learning: each classifier performs feature selection before training on its sampled data, in order to further improve each classifier's performance. Experimental results show that our method does improve classification accuracy. The two feature importance methods also allow Adaboost to output feature selection results when a non-decision-tree base algorithm is used.
Feature selection is a technique that can improve the performance of classification algorithms. It reduces data dimensionality by obtaining an effective feature subset from the data, solving the problems of high model complexity and poor classification performance caused by high-dimensional data. In addition to feature selection, ensemble learning algorithms, which build multiple classification models to jointly determine the classification result, can also improve classification performance.
Adaboost is a leading algorithm in ensemble learning. It trains multiple classifiers of the same base classification algorithm by sampling the data. The advantage of Adaboost is that it gives higher weights to misclassified instances, so that they have a higher probability of being used to train the next classifier, continuously improving the next classifier's performance. When Adaboost builds its models on decision trees, the trees' built-in feature selection means that, after modeling is completed, the method can output a feature selection result in addition to the models. If the base classification algorithm is not a decision tree, however, only the models can be output.
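The behavior described above can be seen directly in scikit-learn: when AdaBoost's base learner is a decision tree, the fitted ensemble reports feature importances as a by-product of training. This is a minimal illustrative sketch, not code from the thesis; the dataset and parameters are assumptions for demonstration only.

```python
# Sketch: AdaBoost with a decision-tree base learner exposes feature
# importances after fitting (illustrative parameters, not the thesis's).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, n_redundant=0,
                           random_state=0)

# The default base learner is a depth-1 decision tree (a "stump"),
# so the trained ensemble exposes feature_importances_.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Importances sum to 1; higher values mark more useful features.
print(np.round(clf.feature_importances_, 3))
```

With a base learner that has no `feature_importances_` of its own (e.g. a naive Bayes classifier), the ensemble cannot provide this output, which is exactly the limitation the thesis's two feature importance methods address.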
Based on Adaboost, this study proposes a classification method and two methods to generate feature importance. The classification method incorporates feature selection into Adaboost: each classifier performs feature selection before training on its sampled data. The experimental results show that our method can indeed improve classification performance. The two feature importance methods allow Adaboost to output feature selection results even when a non-decision-tree base classification algorithm is selected.
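The general idea can be sketched as a boosting loop in which each round runs a filter-style feature selector on the resampled data before fitting a non-tree weak learner, and feature importance is accumulated by crediting each round's selected features with that classifier's voting weight. This is only an illustrative sketch under assumed choices (SelectKBest with the ANOVA F-score, Gaussian naive Bayes, k = 5); the thesis's exact procedures may differ.

```python
# Sketch: feature selection inside each boosting round, plus a simple
# importance aggregation for a non-tree base learner. All parameter
# choices here are illustrative assumptions, not the thesis's settings.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=12,
                           n_informative=4, n_redundant=0,
                           random_state=1)
n, d = X.shape
w = np.full(n, 1.0 / n)      # AdaBoost-style sample weights
importance = np.zeros(d)     # aggregated feature importance
models = []

for t in range(20):
    # 1) Resample the data according to the current weights.
    idx = np.random.default_rng(t).choice(n, size=n, p=w)
    Xs, ys = X[idx], y[idx]
    # 2) Feature selection on the sampled data, before training.
    mask = SelectKBest(f_classif, k=5).fit(Xs, ys).get_support()
    # 3) Train a non-tree weak learner on the selected features only.
    clf = GaussianNB().fit(Xs[:, mask], ys)
    pred = clf.predict(X[:, mask])
    err = w[pred != y].sum()
    if err >= 0.5:           # weaker than chance: skip this round
        continue
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
    # 4) Credit the selected features with this classifier's weight.
    importance[mask] += alpha
    models.append((alpha, mask, clf))
    # 5) Upweight misclassified samples for the next round.
    w *= np.exp(alpha * np.where(pred == y, -1.0, 1.0))
    w /= w.sum()

importance /= importance.sum()
print(np.round(importance, 3))
```

The aggregated `importance` vector plays the role of the feature selection result that plain Adaboost can only produce with a decision-tree base learner.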
Abstract (in Chinese) i
Abstract (in English) ii
Acknowledgments iii
Table of Contents iv
List of Tables vi
List of Figures viii
Chapter 1 Introduction 1
Chapter 2 Literature Review 4
Section 1 Feature Selection 4
1. Filter-based Feature Selection 4
(1) Analysis of Variance 5
(2) Pearson Correlation Coefficient 5
(3) Chi-Square Test 5
(4) Selection of Highly Correlated Features 7
2. Wrapper-based Feature Selection 7
(1) Forward Search 8
(2) Backward Search 8
(3) Bidirectional Search 9
(4) Genetic Search 9
3. Embedded Feature Selection 11
(1) Decision Tree 11
(2) Random Forest 15
(3) Boosting 19
Section 2 Ensemble Learning 19
1. Adaboost 19
(1) Setting the Initial Sampling Probabilities of the Data 20
(2) Training Weak Classifiers and Computing Their Error Rates 20
(3) Computing Weak Classifier Weights from the Error Rates 20
(4) Updating the Data Sampling Probabilities 21
(5) Multi-model Voting 21
2. SAMME 22
(1) Setting the Initial Data Weights 22
(2) Training Weak Classifiers and Computing Their Error Rates 22
(3) Computing Weak Classifier Weights from the Error Rates 23
(4) Updating the Data Weights 23
(5) Multi-model Voting 24
3. SAMME.R 25
(1) Setting the Initial Data Weights and Training Weak Classifiers 25
(2) Computing Each Weak Classifier's Per-class Prediction Probabilities 25
(3) Encoding the Target Attribute 26
(4) Updating the Data Weights 26
(5) Computing the Basis for Multi-model Voting 27
(6) Multi-model Voting 28
4. Computing Feature Importance with Adaboost 29
Chapter 3 Research Methods 31
Section 1 Classification 31
Section 2 Feature Importance (Method 1) 31
Section 3 Feature Importance (Method 2) 32
Chapter 4 Experimental Results 35
Section 1 Evaluation Methods 35
1. Cross-validation 35
2. AUC 36
(1) Sorting the Classifier's Predicted Probabilities 37
(2) Computing the Cumulative Proportion of 1s 37
(3) Computing the Cumulative Proportion of 0s 37
(4) Plotting the Curve 38
Section 2 Experimental Data 39
Section 3 Classification Experiments 39
1. Experimental Parameters 40
2. Real-world Data 40
3. Synthetic Data 43
4. Why SAMME.R Is Not Used 46
Section 4 Feature Importance Experiments 47
1. Experimental Parameters 47
2. Results on Real-world Data 48
(1) Selecting the Top 20% Most Important Features 50
(2) Selecting the Top 40% Most Important Features 51
(3) Selecting the Top 60% Most Important Features 53
(4) Selecting the Top 80% Most Important Features 54
3. Results on Synthetic Data 56
Chapter 5 Conclusions and Future Work 61