Graduate student (English name): Song-Hui Ye
Thesis title (English): Methods for the Random Forests Analysis to Choose Important Features
Advisor (English name): Chang-Yun Lin
Oral defense committee (English names): Shin-Fu Tsai, Ming-Chung Chang
Keywords (English): Decision Tree, Feature Selection, Machine Learning, Random Forests, Regression, SHAP
Machine learning has developed rapidly in recent years and has added many methods to the data analyst's toolkit, but explaining what happens inside these models remains an important problem. Explainable machine learning has therefore gradually become an important research direction. In many commercial applications it is not enough to build an accurate model; the interpretability of the model deserves equal attention. Beyond knowing what a model predicts, we usually also want to know which features determine the prediction, that is, how each feature affects the model. A good model can identify which variables are important and can help us detect problems early and propose improvements. Data scientists need to guard against problems in their models and help decision makers understand how to use them correctly. The more demanding the setting, the more a model must show how it works in order to avoid wrong decisions.

The decision tree is a well-known algorithm for data analysis and machine learning, proposed in the 1980s. It has a tree structure consisting of a root node, internal nodes, and leaf nodes: each internal node tests a feature on the current subset of the data, and each leaf node represents a class label. The CART algorithm uses the Gini index to measure the impurity of the data and also applies decision trees to regression problems. The more widely used random forest is a classifier composed of many decision trees. Each tree is trained on a training set generated from the data set by bootstrap sampling, and a random subset of features is considered when splitting each node, which lowers the error rate of the forest.

This thesis takes the random forest as its basis and uses the Gini measure from the CART algorithm as a first method of ranking features. Because a random forest also provides an out-of-bag (OOB) score as a built-in test benchmark, this leads to a second method, which permutes features to assess their importance. To make machine learning more transparent, Scott M. Lundberg proposed SHAP in 2017, a fair-allocation method for ranking feature importance. Finally, building on regression analysis, a fourth method removes features and observes how the prediction score changes for each one, which yields another ranking. The thesis compares these four methods to see under what circumstances the choice of algorithm affects the importance ranking, and gives several examples to check whether the inferences hold.
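The "fair allocation" idea behind the Shapley value, on which SHAP is built, can be illustrated with a small sketch. This is not the thesis's code and not the `shap` library: the function `shapley_values`, the toy model, and the choice of replacing absent features with a fixed baseline are all assumptions made for illustration. It averages each feature's marginal contribution over every ordering of the features, so it is only practical for a handful of features.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction (hypothetical helper).

    Assumption: a feature that is "absent" takes its value from `baseline`.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)          # start with every feature absent
        prev = predict(current)
        for i in order:
            current[i] = x[i]             # feature i joins the coalition
            new = predict(current)
            phi[i] += new - prev          # marginal contribution of feature i
            prev = new
    n_orders = factorial(n)
    return [total / n_orders for total in phi]


def toy_model(v):
    # Illustrative model: two main effects, one interaction, and an unused feature.
    return 2.0 * v[0] + 3.0 * v[1] + v[0] * v[1]


x = [1.0, 1.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # [2.5, 3.5, 0.0]
```

The interaction term contributes 1.0 to the prediction and is split equally between the first two features, the unused third feature receives zero, and the three values sum to the difference between the prediction at `x` and at the baseline.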
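Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Process
Chapter 2 Literature Review
2.1 Decision Trees
2.2 Random Forests
2.2.1 Bagging (Bootstrap Aggregation)
2.2.2 Boosting
2.3 SHAP
2.3.1 Shapley Values
2.3.2 SHAP Values
Chapter 3 Research Methods
3.1 Gini Impurity (Gini Importance)
3.2 Feature Permutation (Permutation Importance)
3.3 SHAP
3.4 Feature Removal (Column Remove)
Chapter 4 Examples
4.1 Simulated Data
4.1.1 Data Set 1
4.1.2 Data Set 2
4.1.3 Data Set 3
4.1.4 Data Set 4
4.1.5 Simulation Summary
4.2 Real Data
4.2.1 Results of a Single Experiment
4.2.2 Average Results over 100 Runs
4.3 Data Visualization
4.3.1 Inferring Coefficient Signs
4.3.2 Player A
4.3.3 Player B
Chapter 5 Conclusion
5.1 Changing the Model
5.2 Dummy Variables
5.3 Multicollinearity
References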
[1] M. A. M. Hasan, M. Nasser, S. Ahmad, and K. I. Molla, “Feature selection for intrusion detection using random forest,” Journal of information security, vol. 7, no. 3, pp. 129–140, 2016.
[2] A. Géron, Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media, 2019.
[3] H. Sharma and S. Kumar, “A survey on decision tree algorithms of classification in data mining,” International Journal of Science and Research (IJSR), vol. 5, no. 4, pp. 2094–2097, 2016.
[4] J. R. Quinlan, “Induction of decision trees,” Machine learning, vol. 1, no. 1, pp. 81–106, 1986.
[5] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[6] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and regression trees. CRC press, 1984.
[7] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih, “A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms,” Machine learning, vol. 40, no. 3, pp. 203–228, 2000.
[8] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[9] L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.
[10] L. Breiman, “Out-of-bag estimation,” 1996.
[11] L. S. Shapley, “A value for n-person games,” Contributions to the Theory of Games, vol. 2, no. 28, pp. 307–317,
[12] S. Hart, “Shapley value,” in Game Theory. Springer, 1989, pp. 210–216.
[13] S. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” arXiv preprint arXiv:1705.07874, 2017.