National Digital Library of Theses and Dissertations in Taiwan

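Author: 張峻銘
Author (English): Chun-Min Chang
Thesis title: 對不平衡的資料有效率的訓練和自我訓練的門檻分析
Thesis title (English): Efficient Training for Imbalance Data and Threshold Analysis for Self-training
Advisor: 林守德
Advisor (English): Shou-De Lin
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Networking and Multimedia
Discipline: Computer Science
Field: Software Development
Thesis type: Academic thesis
Year of publication: 2009
Academic year of graduation: 97 (ROC calendar, 2008-2009)
Language: Chinese
Number of pages: 32
Keywords (Chinese): semi-supervised learning (半監督式學習); self-training (自我訓練); imbalanced data (不平衡資料)
Keywords (English): semi-supervised learning; self-training; imbalanced data; kddcup 08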
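This thesis proposes two methods for classification on imbalanced data. First, we propose a self-training method that has a lower-dimensional parameter space than earlier self-training methods and improves performance. Using the labeled training data, a prediction-confidence threshold is trained separately for each classifier and used to identify high-confidence unlabeled examples; the union of the examples selected by the classifiers is given pseudo-labels and added to the labeled data for retraining. This reduces the time spent on parameter selection while achieving performance close to that of more complex parameter settings. Second, we propose an efficient way to train on imbalanced data: starting from fast down-sampling, a bootstrap-like procedure brings the model close to the one obtained with up-sampling, and because less data is used, training is faster. We evaluate both methods on the extremely imbalanced KDD Cup 2008 data. The results show that in self-training the parameters selected by our method perform slightly better than those of the previous method, and that the proposed efficient training method is 1.3 times as fast as using up-sampling directly, with only a small difference in AUC.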
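The threshold-based self-training step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the toy `OneFeatureClassifier`, the rule in `pick_threshold` (smallest cut-off that reaches a target accuracy on the labeled data), and the choice to skip examples on which classifiers disagree are all hypothetical, since this record does not specify them.

```python
import math

class OneFeatureClassifier:
    """Toy scorer (hypothetical): logistic function of a single feature."""
    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        pos = [x[self.feature] for x, label in zip(X, y) if label == 1]
        neg = [x[self.feature] for x, label in zip(X, y) if label == 0]
        mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        self.mid = (mean_pos + mean_neg) / 2.0
        self.slope = mean_pos - mean_neg  # sign encodes the positive direction
        return self

    def prob_pos(self, x):
        return 1.0 / (1.0 + math.exp(-self.slope * (x[self.feature] - self.mid)))

def predict_with_confidence(clf, x):
    """Return (predicted label, confidence in that label)."""
    p = clf.prob_pos(x)
    return (1, p) if p >= 0.5 else (0, 1.0 - p)

def pick_threshold(clf, X, y, target_accuracy=1.0):
    """Assumed rule: smallest confidence cut-off whose retained predictions
    on the labeled data reach the target accuracy."""
    scored = [predict_with_confidence(clf, x) + (label,) for x, label in zip(X, y)]
    for t in sorted({conf for _, conf, _ in scored}):
        kept = [(pred, label) for pred, conf, label in scored if conf >= t]
        if sum(pred == label for pred, label in kept) / len(kept) >= target_accuracy:
            return t
    return float("inf")  # no cut-off qualifies, so nothing will be selected

def self_train_round(classifiers, X_lab, y_lab, X_unlab):
    """One round: fit, pick one threshold per classifier, take the union of
    confident unlabeled examples, pseudo-label them, and retrain."""
    for clf in classifiers:
        clf.fit(X_lab, y_lab)
    thresholds = [pick_threshold(clf, X_lab, y_lab) for clf in classifiers]
    votes = {}  # unlabeled index -> set of confident predicted labels
    for clf, t in zip(classifiers, thresholds):
        for i, x in enumerate(X_unlab):
            pred, conf = predict_with_confidence(clf, x)
            if conf >= t:
                votes.setdefault(i, set()).add(pred)
    # Assumption: drop examples on which confident classifiers disagree.
    agreed = {i: labels.pop() for i, labels in votes.items() if len(labels) == 1}
    X_new = X_lab + [X_unlab[i] for i in sorted(agreed)]
    y_new = y_lab + [agreed[i] for i in sorted(agreed)]
    remaining = [x for i, x in enumerate(X_unlab) if i not in agreed]
    for clf in classifiers:
        clf.fit(X_new, y_new)
    return X_new, y_new, remaining

X_lab = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0)]
y_lab = [0, 0, 1, 1]
X_unlab = [(0.5, 0.2), (4.5, 4.8), (2.4, 2.6)]
clfs = [OneFeatureClassifier(0), OneFeatureClassifier(1)]
X_new, y_new, remaining = self_train_round(clfs, X_lab, y_lab, X_unlab)
print(y_new, remaining)
```

The point of the design is that each classifier contributes a single threshold derived from labeled data, so there is no separate grid of selection parameters to tune per round.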
Two methods are proposed to address classification problems on imbalanced data. First, we propose a self-training method that has a smaller parameter space and better performance. We train a confidence threshold for each classifier using labeled data to identify high-confidence unlabeled data, and assign them pseudo-labels for retraining. With this scheme, parameter selection takes less time and performance improves. Second, we propose an efficient training method for imbalanced data. We start with down-sampling and use a bootstrap-like method so that the model approximates the one trained with up-sampling. Using less training data leads to less training time. We run experiments on the KDD Cup 2008 data. The results show that our threshold-based self-training performs better, and that the approximated model performs the same as up-sampling while requiring only 0.75 times the training time of up-sampling.
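To make the efficiency argument concrete, the sketch below contrasts the two standard resampling schemes the abstract refers to. It does not reproduce the thesis's bootstrap-like approximation, whose details are not given in this record; the function names and toy data are assumptions for illustration, and the sketch assumes the minority class is no larger than the majority class.

```python
import random

def down_sample(majority, minority, rng):
    """Keep every minority example plus a random majority subset of equal size."""
    return rng.sample(majority, len(minority)) + list(minority)

def up_sample(majority, minority, rng):
    """Keep every example and draw extra minority examples with replacement
    until both classes have the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("neg", i) for i in range(1000)]
minority = [("pos", i) for i in range(10)]

small = down_sample(majority, minority, rng)
large = up_sample(majority, minority, rng)
print(len(small), len(large))  # 20 2000
```

With a 100:1 class ratio the down-sampled set is a hundred times smaller than the up-sampled one, which is why a procedure that starts from down-sampling can save training time.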
Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
Chapter 1
1.1 Background and motivation
Chapter 2
2.1 Semi-supervised learning
2.2 Training on imbalanced data
Chapter 3
3.1 Threshold analysis for self-training
3.2 Efficient training for imbalanced data
Chapter 4
4.1 Experimental data
4.2 Evaluation method
4.3 Confidence threshold exploitation in self-training of MCS
4.4 Approximate up-sampling from down-sampling
Chapter 5
Bibliography
[1]Luca Didaci, Fabio Roli: Using Co-training and Self-training in Semi-supervised Multiple Classifier Systems. SSPR/SPR 2006: 522-530
[2]G. M. Weiss. Mining with rarity - problems and solutions: A unifying framework. SIGKDD Explorations, 6(1):7–19, 2004
[3]A. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7): 1145-1159, 1997.
[4]F. Provost, and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42: 203-231, 2001.
[5]Rong Zhang, Alexander I. Rudnicky, "A New Data Selection Principle for Semi-Supervised Incremental Learning," Pattern Recognition, International Conference on, vol. 2, pp. 780-783, 18th International Conference on Pattern Recognition (ICPR'06) Volume 2, 2006.
[6]R. C. Holte, L. E. Acker, and B. W. Porter. Concept learning and the problem of small disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 813-818, 1989.
[7]M. Kubat, and S. Matwin. Addressing the curse of imbalanced training sets: one-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179-186, Morgan Kaufmann, 1997.
[8]R. G. Swensson, “Unified measurement of observer performance in detecting and localizing target objects on images,” Med. Phys. 23, 1709–1725 (1996).
[9]Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[10]R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification, Journal of Machine Learning Research 9(2008), 1871-1874. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear
[11]Hung-Yi Lo, Chun-Min Chang, Tsung-Hsien Chiang, Cho-Yi Hsiao, Anta Huang, Tsung-Ting Kuo, Wei-Chi Lai, Ming-Han Yang, Jung-Jung Yeh, Chun-Chao Yen and Shou-De Lin, Learning to Improve Area-Under-FROC for Imbalanced Medical Data Classification Using an Ensemble Method, SIGKDD Explorations, 10(2), pp.43-46, December 2008.