National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


Detailed Record

Author: 方耀輝
Author (English): Yao-hwei Fang
Title: 以資料複雜度指標建構效率型交互驗證方法
Title (English): The Data Complexity Index to Construct an Efficient Cross-validation Method
Advisor: 利德江
Advisor (English): Der-chiang Li
Degree: Doctoral
Institution: National Cheng Kung University (國立成功大學)
Department: Department of Industrial and Information Management (工業與資訊管理學系碩博士班)
Discipline: Business and Management
Academic Field: Other Business and Management
Thesis Type: Academic thesis
Publication Year: 2009
Graduation Academic Year: 97 (2008–2009)
Language: English
Pages: 49
Keywords (Chinese): 交互驗證、資料複雜度、二元分類
Keywords (English): Binary Classification; Cross-validation; Data Complexity
Statistics:
  • Cited: 1
  • Views: 429
  • Downloads: 80
  • Bookmarked: 0
Abstract (translated from Chinese): Cross-validation is commonly used for model validation in data mining. However, the experimental process usually requires deciding several important parameters, such as the training data size or the number of experiment runs. For binary classification problems, this study develops a new cross-validation model called "Complexity-based Efficient (CBE)" cross-validation. CBE cross-validation establishes a CBE complexity index, which is positively correlated with classification accuracy. We use the CBE index together with the statistical concept of sample size determination to calculate the optimal number of training samples and experiment runs, which reduces model validation time for large and complex classification data.

The experimental results show that the CBE index is highly correlated with classification accuracy, that CBE cross-validation performs as well as K-fold cross-validation and Repeated Random Sub-sampling Validation, and that its validation time is shorter than that of both methods. CBE cross-validation not only calculates the optimal training sample size and number of experiment runs, but also gives further insight into the characteristics and structure of the classification data.
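The abstract refers to the statistical concept of sample size determination for choosing the number of training samples. As a generic illustration only (the thesis's exact procedure is not reproduced here, and the function name is illustrative), the classical normal-approximation rule picks the smallest n whose confidence-interval half-width falls below a target margin:

```python
import math

def required_sample_size(sigma, margin, z=1.96):
    """Smallest n such that a z-level confidence interval for a mean,
    with known std-dev sigma, has half-width z*sigma/sqrt(n) <= margin."""
    return math.ceil((z * sigma / margin) ** 2)

# e.g. std-dev 10, target half-width 2 at 95% confidence (z = 1.96)
n = required_sample_size(10.0, 2.0)  # 97
```

The same idea, applied to an estimate of classification accuracy rather than a mean, motivates bounding the number of experiment runs needed for a stable estimate.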
Abstract (English): Cross-validation is a widely used model evaluation method in data mining applications. However, it usually takes considerable effort to determine appropriate parameter values, such as the training data size or the number of experiment runs, to implement a valid evaluation. This research develops an efficient cross-validation method called Complexity-based Efficient (CBE) cross-validation for binary classification problems. CBE cross-validation establishes a complexity index called the CBE index, which is highly correlated with classification accuracy. The CBE index and sample size determination can then be used to calculate the optimal training data size and the number of experiment runs, reducing model evaluation time when dealing with complex and computationally expensive classification data sets.

The experimental results show a high correlation between the CBE index and classification accuracy; the performance of CBE cross-validation is comparable to that of K-fold cross-validation and Repeated Random Sub-sampling Validation; and the validation time required by CBE cross-validation is lower than that of the other two methods. The CBE index helps users understand the characteristics of the analyzed data in advance, and CBE cross-validation helps users find the optimal training data size and number of experiment runs to reduce model evaluation time.
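For context on the two baselines named above, here is a minimal sketch of how K-fold cross-validation and Repeated Random Sub-sampling Validation generate their train/test splits (index generation only; function names are illustrative, not from the thesis):

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation:
    the n indices are shuffled once and partitioned into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def random_subsampling_splits(n, runs, test_fraction=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs for Repeated Random Sub-sampling
    Validation: each run draws a fresh random test set of fixed size."""
    rng = random.Random(seed)
    cut = int(n * test_fraction)
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[cut:], idx[:cut]
```

K-fold uses every sample exactly once for testing, while repeated random sub-sampling draws an independent test set per run (so test sets may overlap across runs); both require the user to fix k or the number of runs up front, which is the parameter-selection burden the CBE method aims to reduce.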
Table of Contents:

Abstract I
Contents IV
List of Tables VII
List of Figures VIII

Chapter 1 Introduction 1

Chapter 2 Literature Review 3
2.1 Linear data complexity 3
2.2 The concept of geometric structure of data and noise 5
2.3 Common types of cross-validation method 6
2.3.1 Holdout Validation 7
2.3.2 Repeated Random Sub-sampling Validation 7
2.3.3 K-fold Cross-validation 8
2.3.4 Leave-one-out Cross-validation 8

Chapter 3 Proposed Method 10
3.1 CBE index 10
3.1.1 DBSCAN algorithm 11
3.1.2 The calculation of the CBE index 13
3.1.3 The determination of the CBE index 15
3.2 CBE cross-validation method 16

Chapter 4 Experiment Results 18
4.1 Simulated and real data set experiments used to measure the relationship between the CBE index and classification accuracy 18
4.1.1 Simulation experiments 18
4.1.2 Real data experiment 27
4.2 CBE cross-validation 31
4.2.1 The CBE cross-validation of the Pima data set 32
4.2.2 The CBE cross-validation of the Haberman data set 34
4.2.3 The CBE cross-validation of the MAGIC data set 37

Chapter 5 Conclusion and Discussions 40
5.1 Conclusion 40
5.2 Discussion of CBE index for various data set characteristics 41
5.2.1 Unbalanced classes 41
5.2.2 Dimensions 42
5.2.3 Sample size 43
5.2.4 Design and analysis of experiments 44

Reference 46