
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 蔡孟峰 (Meng-Fong Tsai)
Title (Chinese): 植基於反向排序K鄰近法和合成少數類樣本的增量技術在資料不平衡之研究與應用
Title (English): Application and Study of Imbalanced Datasets Based on Top-N Reverse k-Nearest Neighbor (TRkNN) Coupled with Synthetic Minority Over-Sampling Technique (SMOTE)
Advisor: 喻石生
Committee members: 黃政治, 劉正忠, 詹永寬, 王仁澤
Oral defense date: 2017-06-26
Degree: Ph.D.
Institution: National Chung Hsing University (國立中興大學)
Department: Department of Computer Science and Engineering (資訊工程學系所)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Publication year: 2017
Graduation academic year: 105 (2016-2017)
Language: English
Pages: 36
Keywords (Chinese): 不平衡分類; 合成少數類樣本的增量技術; 距離度量; UCI資料庫
Keywords (English): Imbalanced classification; Synthetic minority oversampling technique; Distance metric; UCI dataset
Usage statistics:
  • Cited by: 0
  • Views: 217
  • Rating:
  • Downloads: 0
  • Bookmarked: 0
Imbalanced classification refers to a dataset with an uneven class distribution. If the imbalance is not taken into account, most classification methods achieve high accuracy on the majority class but markedly lower accuracy on the minority class. The first task of this study is to propose an effective algorithm that combines the Top-N Reverse k-Nearest Neighbor (TRkNN) method with the Synthetic Minority Over-Sampling Technique (SMOTE) to overcome the imbalance problem in UCI datasets. To investigate the algorithm, this study also applies it to different classification methods, such as logistic regression, C4.5, SVM, and BPNN, and adopts different distance metrics to classify the same UCI datasets. The empirical results show that the Euclidean and Manhattan distances not only achieve higher accuracy but also compute faster than the Chebyshev and cosine distances. The TRkNN- and SMOTE-based algorithm can therefore be widely used to handle imbalanced datasets, and the findings on choosing a suitable distance metric can serve as a reference for future research.
Research on cancer prediction has applied a variety of machine learning algorithms, such as neural networks, genetic algorithms, and particle swarm optimization, to identify the key attributes for classifying a disease or cancer, or to adapt traditional statistical prediction models so that different types of cancer can be distinguished effectively, thereby building prediction models that enable early detection and treatment. Data from existing patients serve as the training set for a model that predicts the classification of new patient samples. This problem has attracted considerable attention in the data mining field, and scholars have proposed various methods (for example, random sampling and feature selection) to address class imbalance and rebalance the class distribution, thereby improving classifier effectiveness. Although resampling methods can quickly handle imbalanced samples, they give more weight to the data in the majority class and neglect potentially important data in the minority class, which limits classification effectiveness. Based on the patterns found in imbalanced medical datasets, the second task of this study is to use the synthetic minority over-sampling technique to mitigate the imbalance problem. In addition, this study uses three UCI medical datasets to compare the resampling performance of various methods based on machine learning, soft computing, and bio-inspired computing.
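The abstract above compares four distance metrics (Euclidean, Manhattan, Chebyshev, and cosine). As a minimal sketch under the standard definitions, not the dissertation's implementation, the four can be written in a few lines of NumPy:

```python
import numpy as np

# The four distance metrics compared in the abstract, for two
# feature vectors a and b of equal length.
def euclidean(a, b):
    return float(np.linalg.norm(a - b))          # sqrt of sum of squared differences

def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))          # sum of absolute differences

def chebyshev(a, b):
    return float(np.max(np.abs(a - b)))          # largest single-coordinate difference

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The empirical ranking reported in the abstract (Euclidean and Manhattan faster than Chebyshev and cosine) is plausible here because the first two reduce to simple elementwise operations, while cosine distance needs two extra norm computations per pair.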
Imbalanced classification refers to a dataset with an unequal class distribution among its population. Without accounting for the imbalance, most classification methods predict the majority class with high accuracy but the minority class with significantly lower accuracy. The first task of this dissertation is to provide an efficient algorithm, Top-N Reverse k-Nearest Neighbor (TRkNN), coupled with the Synthetic Minority Over-Sampling Technique (SMOTE), to overcome this issue for several imbalanced datasets from the well-known UCI repository. To investigate the proposed algorithm, it was applied to different classification methods, such as logistic regression, C4.5, SVM, and BPNN. In addition, this research adopted different distance metrics to classify the same UCI datasets. The empirical results illustrate that the Euclidean and Manhattan distances not only yield higher accuracy rates but also greater computational efficiency than the Chebyshev and cosine distances. Therefore, the TRkNN- and SMOTE-based algorithm can be widely used to handle imbalanced datasets, and the findings on choosing a suitable distance metric can serve as a reference for future research.
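The core SMOTE step the abstract builds on, generating synthetic minority-class samples by interpolating between a minority point and one of its k nearest minority neighbors, can be sketched as follows. This is a minimal NumPy illustration; the function name and parameters are illustrative, not taken from the dissertation's code:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate n_new synthetic minority-class points by interpolating
    between a randomly chosen minority point and one of its k nearest
    minority neighbors (the core SMOTE step)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude each point from its own neighbors
    nn = np.argsort(d, axis=1)[:, :k]         # indices of the k nearest neighbors per point
    base = rng.integers(0, n, size=n_new)     # random seed points
    neigh = nn[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.random((n_new, 1))              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority class's local neighborhood rather than being arbitrary noise, which is what lets the classifiers listed above (logistic regression, C4.5, SVM, BPNN) see a rebalanced training set without fabricated outliers.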
Research into cancer prediction has applied various machine learning algorithms, such as neural networks, genetic algorithms, and particle swarm optimization, to find the key attributes for classifying an illness or cancer, or to adapt traditional statistical prediction models to differentiate effectively between different types of cancers, and thus build prediction models that allow for early detection and treatment. Training data from existing patients are used to establish models that predict the classification of new patient samples. This issue has attracted considerable attention in the field of data mining, and scholars have proposed various methods (e.g., random sampling and feature selection) to address class imbalance and achieve a rebalanced class distribution, thus improving the effectiveness of classifiers trained on limited data. Although resampling methods can quickly deal with the problem of imbalanced samples, they give more importance to the data in the majority class and neglect potentially important data in the minority class, thus limiting the effectiveness of classification. Based on patterns discovered in imbalanced medical datasets, the second task of this dissertation is to use the synthetic minority over-sampling technique to improve imbalanced datasets. In addition, this research compares the resampling performance of various methods based on machine learning, soft computing, and bio-inspired computing, using three UCI medical datasets.
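The abstract's point that rebalancing matters can be made concrete with a small illustrative sketch (not code from the dissertation): on a 95:5 imbalanced set, a degenerate classifier that always predicts the majority class scores 95% overall accuracy yet has zero recall on the minority class, which is exactly what per-class metrics expose.

```python
import numpy as np

def per_class_recall(y_true, y_pred):
    """Recall for each class label; on imbalanced data these reveal
    what overall accuracy hides."""
    labels = np.unique(y_true)
    return {int(c): float(np.mean(y_pred[y_true == c] == c)) for c in labels}

# A degenerate classifier that always predicts the majority class (0).
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)

acc = float(np.mean(y_true == y_pred))      # 0.95, looks excellent
recalls = per_class_recall(y_true, y_pred)  # {0: 1.0, 1: 0.0}, minority class missed entirely
```

This is why the evaluation chapters compare resampling methods with class-aware measures rather than raw accuracy alone.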
Contents
Acknowledgments
Chinese Abstract
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Literature Review
2.1 Sampling Techniques
2.2 Synthetic Minority Oversampling Technique (SMOTE)
2.3 Machine Learning
Chapter 3 Data Mining for Bioinformatics: Design with Oversampling and Performance Evaluation
3.1 Imbalanced Class
3.2 Machine Learning
3.3 Database
3.4 Methods
3.4.1 UCI Data Set Collection Stage
3.4.2 Data Preprocessing Stage
3.4.3 Prediction Model Implementation Stage
3.4.4 Performance Evaluation Stage
Chapter 4 Distance Metric Based Over-Sampling Method for Bioinformatics and Performance Evaluation
4.1 Materials
4.2 Top-N Reverse k-Nearest Neighbor (TRkNN) Algorithm
4.3 Distance Metrics
Chapter 5 Experimental Results
5.1 Performances of Data Mining Based Design for Bioinformatics with Oversampling
5.1.1 Experimental Design and Parameter Setting
5.1.2 Performance Evaluation
5.2 Performances of Distance Metric Based Over-Sampling Method for Bioinformatics
5.2.1 Experimental Design and Parameter Setting
5.2.2 Performance Evaluation
Chapter 6 Conclusions
References