National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

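Author: Shih-Shiung Ling (凌士雄)
Thesis title: Empirical Evaluations of Different Strategies for Classification with Skewed Class Distribution (非對稱性分類分析解決策略之效能比較)
Advisors: Tsang-Hsiang Cheng (鄭滄祥); Chih-Ping Wei (魏志平)
Degree: Master's
Institution: National Sun Yat-sen University (國立中山大學)
Department: Department of Information Management (資訊管理學系研究所)
Discipline: Computer Science (電算機學門)
Subdiscipline: General Computer Science (電算機一般學類)
Thesis type: Academic thesis
Publication year: 2004
Graduation academic year: 92 (ROC calendar)
Language: Chinese
Pages: 72
Keywords (Chinese): 分類分析; 非對稱性分配; 決策樹歸納技術; 多專家分類器; 增加少數法; 減少多數法
Keywords (English): Classification Analysis; Decision Tree Induction; Multi-classifier Committee Approach; Under-sampling; Over-sampling; Skewed Class Distribution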
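Commonly used classification analysis techniques can build models with good predictive performance when applied to datasets whose classes are evenly distributed. In practical applications such as credit card fraud detection, however, datasets often exhibit a skewed class distribution in which class sizes are extremely uneven. Classification models built with general-purpose techniques therefore tend to suffer from severe prediction bias and fail to correctly classify the rare target instances.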

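Under-sampling, over-sampling, and multi-classifier committees are the strategies most often used in the literature to address skewed class distributions, yet few studies compare their effectiveness. This study therefore collected ten datasets with skewed class distributions, treated the skew in each with the under-sampling, over-sampling, and multi-classifier committee strategies, and then built classifiers with the widely used C4.5 decision tree. Comparing the resulting classifiers reveals the effectiveness differences among the strategies, along with the characteristics and suitable application contexts of each.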
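
As a rough illustration of the two resampling strategies, the following minimal Python sketch applies random under-sampling and random over-sampling to a toy two-class dataset. The function names, the toy labels "ok" and "fraud", and the choice to balance the classes to a 1:1 ratio are all assumptions made for illustration; the record does not state which sampling procedure or target ratio the thesis actually used.

```python
import random
from collections import Counter

def under_sample(rows, labels, majority, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumes two classes and that the majority class is the larger one.
    """
    rng = random.Random(seed)
    maj = [i for i, y in enumerate(labels) if y == majority]
    mino = [i for i, y in enumerate(labels) if y != majority]
    keep = sorted(rng.sample(maj, len(mino)) + mino)
    return [rows[i] for i in keep], [labels[i] for i in keep]

def over_sample(rows, labels, majority, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    maj = [i for i, y in enumerate(labels) if y == majority]
    mino = [i for i, y in enumerate(labels) if y != majority]
    extra = [rng.choice(mino) for _ in range(len(maj) - len(mino))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Hypothetical toy data: 8 majority ("ok") rows and 2 minority ("fraud") rows.
rows = [[i] for i in range(10)]
labels = ["ok"] * 8 + ["fraud"] * 2

_, y_under = under_sample(rows, labels, majority="ok")
_, y_over = over_sample(rows, labels, majority="ok")
print(Counter(y_under))  # class counts after under-sampling
print(Counter(y_over))   # class counts after over-sampling
```

Under-sampling discards majority-class information, while over-sampling repeats minority rows and enlarges the training set; either resampled set would then be passed to the base learner.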

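The ten datasets were evaluated using 10-fold cross-validation, with precision, recall, and the F1-measure as the three criteria for comparing the strategies. The empirical results show that the multi-classifier committee strategy effectively improves classification of minority-class instances under every criterion. If an application emphasizes recall, over-sampling improves the classifier more effectively; if it emphasizes precision, building the classifier directly from the original dataset is recommended.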
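
The three evaluation criteria can be computed for the minority class as in the sketch below. This uses the standard definitions of precision, recall, and F1; the function name and the toy label vectors are hypothetical and are not results from the thesis.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical predictions: 1 marks the minority (target) class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p, r, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(p, r, f1)
```

In a 10-fold cross-validation, these values would be computed on each held-out fold and then averaged across the ten folds.
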
Existing classification analysis techniques (e.g., decision tree induction) generally exhibit satisfactory classification effectiveness when dealing with data whose class distribution is not skewed. However, real-world applications (e.g., churn prediction and fraud detection) often involve decision outcomes that are highly skewed. Such a highly skewed class distribution, if not properly addressed, would imperil the resulting learning effectiveness.

In this study, we empirically evaluate three approaches to classification with a highly skewed class distribution: under-sampling, over-sampling, and the multi-classifier committee. Because of its popularity, C4.5 is selected as the underlying classification analysis technique. Based on 10 datasets with highly skewed class distributions, our empirical evaluations suggest that the multi-classifier committee generally outperforms the under-sampling and over-sampling approaches when recall, precision, and the F1-measure are used as the evaluation criteria. Furthermore, for applications aiming at a high recall rate, the over-sampling approach is suggested. On the other hand, if precision is the primary concern, the classification model induced directly from the original datasets is recommended.
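
The abstract does not describe how the multi-classifier committee is built. One common construction, assumed here purely for illustration, splits the majority class into disjoint subsets, pairs each subset with all minority examples, trains one classifier per subset, and combines their predictions by majority vote. In the sketch below the base learner is a trivial nearest-centroid stand-in, not C4.5, and every function name and data value is hypothetical.

```python
import random
from collections import Counter

def train_centroid(rows, labels):
    """Stand-in base learner (NOT C4.5): store the mean feature value per class."""
    sums, counts = Counter(), Counter()
    for (x,), y in zip(rows, labels):
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}

def predict_centroid(model, row):
    """Predict the class whose stored mean is closest to the single feature."""
    (x,) = row
    return min(model, key=lambda y: abs(x - model[y]))

def train_committee(rows, labels, majority, n_members, seed=0):
    """Train one member per disjoint majority slice, each with all minority rows."""
    rng = random.Random(seed)
    maj = [i for i, y in enumerate(labels) if y == majority]
    mino = [i for i, y in enumerate(labels) if y != majority]
    rng.shuffle(maj)
    committee = []
    for k in range(n_members):
        part = maj[k::n_members] + mino  # disjoint majority slice + every minority row
        committee.append(
            train_centroid([rows[i] for i in part], [labels[i] for i in part])
        )
    return committee

def predict_committee(committee, row):
    """Combine member predictions by simple majority vote."""
    votes = Counter(predict_centroid(m, row) for m in committee)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 12 majority rows near 0..11, 3 minority rows near 20..24.
rows = [[float(i)] for i in range(12)] + [[20.0], [22.0], [24.0]]
labels = ["ok"] * 12 + ["fraud"] * 3

committee = train_committee(rows, labels, majority="ok", n_members=3)
print(predict_committee(committee, [23.0]))
print(predict_committee(committee, [2.0]))
```

Each member sees a roughly balanced training set without discarding any majority rows overall, which is one plausible reason such committees can do well on all three criteria.
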
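Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation and Objectives
Section 3 Thesis Organization
Chapter 2 Literature Review
Section 1 Classification Analysis Techniques
1. Decision Trees
2. Backpropagation Neural Networks
3. Nearest-Neighbor Classification
Section 2 Strategies for Handling Skewed Class Distributions
1. Under-sampling
2. Over-sampling
3. Multi-classifier Committee
Chapter 3 Empirical Datasets
Chapter 4 Empirical Evaluation
Section 1 Design of the Under-sampling Method
Section 2 Design of the Over-sampling Method
Section 3 Evaluation Procedure and Evaluation Metrics
Section 4 Analysis of Empirical Results
Chapter 5 Conclusions
Section 1 Summary of Conclusions and Contributions
Section 2 Future Research Directions
References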