Author: Yugowati Praharsi (Chinese name: 游華英)
Thesis title: Supervised Learning Approaches and Feature Selection - A Case Study in Diabetes (Chinese title: 用於偵測糖尿病的監督式學習法和特徵選取)
Advisor: Shaou-Gang Miaou (繆紹綱)
Degree: Master's
Institution: Chung Yuan Christian University
Department: Graduate Institute of Electronic Engineering
Discipline: Engineering
Academic field: Electrical and Computer Engineering
Document type: Academic thesis
Year of publication: 2009
Graduating academic year: 97 (2008-2009)
Language: English
Number of pages: 62
Keywords (Chinese): 特徵選取; 監督式學習法; 糖尿病
Keywords (English): feature selection; supervised learning; diabetes
Record statistics:
  • Cited: 0
  • Views: 164
  • Downloads: 0
  • Bookmarked: 0
Abstract (in Chinese)

Data description and classification are interesting and important tasks that are widely applied in supervised learning. This thesis considers three supervised learning methods: k-nearest neighbor (k-NN), support vector data description (SVDD), and support vector machine (SVM).
Feature selection in supervised learning helps find a feature subset that yields high classification accuracy. This thesis considers both a forward-selection-based wrapper and a correlation-based filter, in which the correlation between a feature and the class label is measured using entropy and information gain, while feature-feature correlation is computed using the Pearson correlation. The study compares the three classification methods (k-NN, SVDD, and SVM) with and without feature selection; the classifiers with feature selection are expected to outperform those without it. In addition, the selected feature subset can be used to describe the data structure, regardless of the classifier or feature selection method used.
The data sample chosen is the PIMA Indians diabetes data set from the UCI repository. The results show that forward feature selection produces the best subset for SVM and 5-NN. Moreover, feature selection based on the mean information gain and a one-standard-deviation threshold gives the best result for the 1-NN classifier, and this selection method can serve as a substitute for forward selection: compared with forward selection, it is computationally efficient and avoids a large drop in accuracy for SVM and 5-NN. Finally, among the eight candidate features, glucose level is the most prominent feature for diabetes detection across all classifiers and feature selection methods. The relevance measured by information gain can be used to rank features from most to least important, which is very helpful in medical applications such as defining feature priorities for symptom recognition.
Abstract (in English)
Data description and classification are interesting and important tasks which are applied widely in supervised learning. In this thesis, three supervised learning methods are considered: k-Nearest Neighbor (k-NN), Support Vector Data Description (SVDD) and Support Vector Machine (SVM).
Feature selection in supervised learning is useful for finding a feature subset that produces higher classification accuracy. Both a forward-selection-based wrapper and a correlation-based filter approach are considered in this thesis. The correlation between features and the class label is measured using entropy and information gain (IG), while feature-feature correlation is calculated using the Pearson correlation. This study compares the performance of three classifiers (k-NN, SVDD, and SVM) with and without feature selection; the classifiers with the proposed feature selection methods are expected to perform better than those without feature selection. In addition, the selected feature subset can be used to describe the data structure, regardless of the classifier type or feature selection method used.
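The entropy and information-gain relevance measure used by the filter approach can be sketched as follows. This is a minimal illustration, not the thesis code; the toy feature values and labels are invented for demonstration.

```python
# Sketch: information gain IG(Y; X) = H(Y) - H(Y | X) for a discretized
# feature X and class label Y, the feature-class relevance measure of the
# correlation-based filter. Toy data only; not the PIMA data set.
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy after observing a discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# A feature that perfectly predicts the class attains IG equal to H(Y).
x = ["high", "high", "low", "low"]
y = [1, 1, 0, 0]
print(information_gain(x, y))  # 1.0
```

Features can then be ranked by this score, which is how the thesis orders features from most to least important.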
The data sample chosen is the PIMA Indians diabetes data set from the UCI repository. The results show that forward feature selection produces the best feature subset for SVM and 5-NN. In addition, feature selection based on the mean information gain and a one-standard-deviation threshold gives the best result for the 1-NN classifier, and such a selection method can be considered a substitute for forward selection: it is computationally efficient, and the accuracy does not decrease significantly for SVM and 5-NN compared with forward selection. Finally, among the eight candidate features, glucose level is the most prominent feature for diabetes detection across all classifiers and feature selection methods considered. The relevance measured by IG can be used to rank features from most to least important, which can be very useful in medical applications such as defining feature priorities for symptom recognition.
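The forward-selection wrapper described above greedily adds whichever feature most improves the classifier's accuracy and stops when no candidate helps. The sketch below pairs it with a leave-one-out 1-NN classifier; the helper names and the four-sample toy data are illustrative assumptions, not the thesis implementation.

```python
# Sketch: forward-selection wrapper around a 1-NN classifier.
# Toy data and function names are assumptions for illustration.

def knn_accuracy(data, labels, feats):
    """Leave-one-out accuracy of 1-NN restricted to the feature indices in feats."""
    correct = 0
    for i, row in enumerate(data):
        # Nearest other sample under squared Euclidean distance on chosen features.
        best = min((j for j in range(len(data)) if j != i),
                   key=lambda j: sum((row[f] - data[j][f]) ** 2 for f in feats))
        correct += labels[best] == labels[i]
    return correct / len(data)

def forward_select(data, labels):
    """Greedily grow a feature subset while accuracy keeps improving."""
    chosen, remaining, best_acc = [], set(range(len(data[0]))), 0.0
    while remaining:
        f, acc = max(((f, knn_accuracy(data, labels, chosen + [f]))
                      for f in remaining), key=lambda t: t[1])
        if acc <= best_acc:   # stop when no candidate improves accuracy
            break
        chosen.append(f)
        remaining.remove(f)
        best_acc = acc
    return chosen, best_acc

# Feature 0 separates the classes; feature 1 is noise, so only 0 is kept.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.9], [1.1, 0.8]]
y = [0, 0, 1, 1]
print(forward_select(X, y))  # ([0], 1.0)
```

The filter methods in the thesis avoid this wrapper's repeated classifier evaluations, which is why the mean-IG threshold variant is reported as the computationally efficient substitute.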
Abstract in Chinese................................................................................................I
Abstract in English................................................................................................II
Acknowledgements in English............................................................................III
Table of Contents................................................................................................IV
List of Figures.....................................................................................................VI
List of Tables......................................................................................................VII
List of Symbols................................................................................................VIII
Chapter I. Introduction..........................................................................................1
I-1. Background.............................................................................................2
I-1.1. Supervised Learning Approaches................................................2
I-1.2. Feature Selection.........................................................................3
I-2. Objectives...............................................................................................3
I-3. Outline....................................................................................................4
Chapter II. Literature Study..................................................................................5
II-1. Support Vector Machine........................................................................5
II-2. Support Vector Data Description..........................................................9
II-3. Nearest Neighbor................................................................................13
II-4. Feature Selection.................................................................................14
II-4.1 Correlation Based Feature Selection.........................................15
II-4.2. Pearson's Correlations..............................................................17
II-4.3. Entropy.....................................................................................18
II-5. Diabetes.............................................................................19
Chapter III. Experiment Design and Methods....................................................23
III-1. Data Set..............................................................................................23
III-2. Performance Evaluation Measure......................................................25
III-3. Classifiers...........................................................................................27
III-4. Feature Selection Methods................................................................30
Chapter IV. Results and Discussions...................................................................36
IV-1. Performance of Supervised Learning Approaches without Feature Selection..............................................................................................36
IV-1.1. Performance Evaluation Measure...........................................36
IV-1.2. Computational Time................................................................37
IV-2. Performance of Supervised Learning Approaches with Feature Selection..............................................................................................38
IV-2.1. Performance Evaluation Measure...........................................38
IV-2.2. Computational Time................................................................40
IV-2.3. The Best Feature Subset of Each Classifier............................41
IV-2.4. Feature Order Based on Information Gain..............................42
Chapter V. Conclusions and Future Work...........................................................43
V-1. Conclusions.........................................................................................43
V-2. Future Work.........................................................................................44
References...........................................................................................46
Appendix A..........................................................................................50
Appendix B..........................................................................................52


List of Figures
Fig. II-1. Margin of SVM.................................................................................5
Fig. II-2. Optimal separating hyperplane.........................................................6
Fig. II-3. Nonlinear mapping to the feature space...........................................6
Fig. II-4. Illustration of SVDD.........................................................................9
Fig. II-5. Illustration of nearest neighbor classifier.......................................13
Fig. II-6. Correlation based feature selection.................................................17
Fig. III-1. Confusion matrix............................................................................26
Fig. III-2. A flow chart showing the grid algorithm for SVM, NN, and SVDD classifiers.........................................................................................28
Fig. III-3. A flow chart of the main program for SVDD, SVM, and NN classifiers.........................................................................................29
Fig. III-4. A flow chart of the main program for feature selection.................31
Fig. III-5. Decision tree of the 1st feature........................................................33
Fig. III-6. Decision tree of the 2nd feature.......................................................34



List of Tables
Table III-1. Data sets of Pima Indians database.................................................23
Table III-2. Example of feature discretisation.................................32
Table IV-1. Performance evaluation measure for supervised learning without feature selection..............................................................................36
Table IV-2. Computational time without feature selection................................37
Table IV-3. Accuracy performance for supervised learning with feature selection..........................................................................................38
Table IV-4. SVM performance with feature selection ......................................39
Table IV-5. SVDD performance with feature selection.....................................39
Table IV-6. 1-NN performance with feature selection.......................................39
Table IV-7. 5-NN performance with feature selection.......................................40
Table IV-8. Computational time with feature selection.....................................41
Table IV-9. The best feature subset selected....................................................42
Table IV-10. Ranking in relevance of each feature to class.................................43