National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 游孟綸
Author (English): Mon-loon You
Title: 兩階段混合學習法於資料分類之研究
Title (English): A Two-Stage Hybrid Learning Approach for Effective Pattern Classification
Advisor: 蔡志豐
Advisor (English): Chih-Feng Tsai
Degree: Master's
Institution: National Central University
Department: Information Management
Discipline: Computer Science
Field: General Computer Science
Document Type: Academic thesis
Year of Publication: 2014
Academic Year of Graduation: 102 (2013–2014)
Language: Chinese
Number of Pages: 74
Keywords (Chinese): 資料探勘、樣本選取、資料縮減、機器學習、支援向量機
Keywords (English): data mining; instance selection; data reduction; machine learning; support vector machines
Enterprises today often need to extract valuable knowledge from very large databases and data warehouses, but the larger the database, the more noisy data it contains. Such noise reduces the accuracy of data mining, and the sheer volume of data increases the time required for knowledge discovery.
Instance selection, currently the most widely used data reduction method, can filter out some of this noise during data pre-processing. However, different instance selection algorithms retain different subsets of the data, and over-selection or under-selection frequently occurs, which in turn degrades mining accuracy. This study therefore proposes a new data pre-processing procedure, the Two-Stage Hybrid Learning Approach (TSHLA), and applies it to data classification. Instance selection is first performed on the training set, and separate SVM models are trained on the subsets judged noisy and non-noisy by the selection algorithm. Each test instance is then matched by KNN similarity: test instances more similar to the noisy subset are classified by the model trained on the noisy data, and likewise test instances more similar to the non-noisy subset are classified by the model trained on the non-noisy data. The aim is to recover instances that were filtered out as noise but are in fact effective; the two sets of predictions are merged into the final result.
The experiments are divided into two parts; in both, the instance selection step is performed with three well-performing algorithms: IB3, DROP3, and GA. The first part tests TSHLA on 50 small datasets, with SVM as the classifier. The second part uses large datasets (more than 100,000 instances), again with SVM as the classifier, and compares accuracy against the conventional instance selection approach.

Nowadays, more and more enterprises need to extract knowledge from very large databases. However, these large datasets usually contain a certain amount of noisy data, which is likely to degrade the performance of data mining. In addition, the computational time required to process large-scale datasets is usually very long.
Instance selection, the most widely used data reduction approach, can filter out noisy data from large datasets. However, different instance selection algorithms applied to different domain datasets filter out different noisy data, and since there is no exact definition of an outlier, over-selection or under-selection is likely to occur, which can affect the quality of data mining results. Therefore, this thesis proposes a new data pre-processing procedure, the Two-Stage Hybrid Learning Approach (TSHLA), for effective data classification. First, instance selection is performed on a given training dataset to separate the noisy from the non-noisy data, and an individual SVM classifier is trained on each subset. Then, KNN is used to compare the similarity of each testing instance to the two subsets. The noisy and non-noisy testing sets identified in this way are fed into their corresponding SVM classifiers for classification.
There are two experimental studies in this thesis, and three instance selection algorithms are used for comparison: IB3, DROP3, and GA. The first study is based on 50 small UCI datasets, and the second on large-scale datasets containing more than 100,000 samples. In addition, the proposed TSHLA is compared with a baseline without instance selection and with the conventional instance selection approach.
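The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration in Python with scikit-learn, not the thesis's implementation: Wilson's ENN editing rule stands in for the IB3/DROP3/GA instance selection step, and the dataset, neighbour counts, and SVM parameters are all placeholder choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A small binary UCI-style dataset stands in for the thesis's benchmarks.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1: instance selection. Wilson's ENN rule is used here as a simple
# stand-in for IB3/DROP3/GA: a training instance is flagged as noisy when
# the majority of its 3 nearest neighbours (excluding itself) disagrees
# with its own label.
nn = NearestNeighbors(n_neighbors=4).fit(X_tr)
_, idx = nn.kneighbors(X_tr)
neigh_labels = y_tr[idx[:, 1:]]                        # drop self (column 0)
majority = (neigh_labels.mean(axis=1) >= 0.5).astype(int)
noisy = majority != y_tr

# Train one SVM per subset; fall back to the clean-subset model if the
# noisy subset is too small or single-class to train on.
svm_clean = SVC().fit(X_tr[~noisy], y_tr[~noisy])
if noisy.sum() > 1 and len(np.unique(y_tr[noisy])) > 1:
    svm_noisy = SVC().fit(X_tr[noisy], y_tr[noisy])
else:
    svm_noisy = svm_clean

# Stage 2: route each test instance by nearest-neighbour similarity; if its
# nearest training instance was flagged noisy, use the noisy-subset model.
_, nearest = NearestNeighbors(n_neighbors=1).fit(X_tr).kneighbors(X_te)
to_noisy = noisy[nearest[:, 0]]
y_pred = np.where(to_noisy, svm_noisy.predict(X_te), svm_clean.predict(X_te))
acc = (y_pred == y_te).mean()
print(f"two-stage accuracy: {acc:.3f}")
```

The routing step is the point of the design: instances that resemble filtered-out "noise" still get a dedicated model rather than being discarded, which is how the approach tries to recover effective samples from the noisy subset.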

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Instance Selection
2.1.1 Overview of Instance Selection
2.1.2 Genetic Algorithm (GA)
2.1.3 DROP3
2.1.4 IB3
2.2 Machine Learning
2.2.1 Supervised Learning
2.2.2 Support Vector Machine (SVM)
Chapter 3 The TSHLA Method
3.1 Experimental Framework
3.2.1 TSHLA Experimental Procedure
3.2.2 TSHLA Pseudo-code
3.3 Baseline Procedure
3.4 Baseline 2 Procedure
3.5 Discussion and Analysis
Chapter 4 Experimental Results
4.1 Experiment 1
4.1.1 Datasets
4.1.2 Validation
4.1.3 Results of Experiment 1
4.2 Experiment 2
4.2.1 Datasets
4.2.2 Validation
4.2.3 Results of Experiment 2
Chapter 5 Conclusion
5.1 Conclusions and Contributions
5.2 Future Research Directions and Suggestions
References
Appendix


Chinese References
林嘉陞, 2009, "CANN: An Intrusion Detection System Combining Cluster Centers and Nearest Neighbors," Master's thesis, Graduate Institute of Accounting and Information Technology, National Chung Cheng University.
洪嘉彣, 2013, "A Study of Instance Selection and Representative Data Detection," Master's thesis, Graduate Institute of Information Management, National Central University.
English References
Aha, D.W., Kibler, D., and Albert, M.K., 1991, “Instance-based learning algorithms.” Machine Learning, vol. 6, no.1, pp. 37-66.
Baker, J. E., 1987, “Reducing bias and inefficiency in the selection algorithm.” Proc. Second Int. Conf. on Genetic Algorithms (L. Erlbaum Associates, Hillsdale, MA), 14–21.
Barnett, V. and Lewis, T., 1994, “Outliers in statistical data.” 3rd Edition, John Wiley & Sons.
Ben-Gal I., 2005, “Outlier detection, In: Maimon O. and Rockach L. (Eds.) Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers.” Kluwer Academic Publishers, ISBN 0-387-24435-2.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J., 1984, “Classification and regression trees.” Belmont, CA: Wadsworth.
Chandola, V., Banerjee, A., and Kumar, V., 2009, “Anomaly detection: a survey.” ACM Computing Surveys, vol. 41, no. 3, article 15.
Cano, J.R., Herrera, F., and Lozano, M., 2003, “Using evolutionary algorithms as instance selection for data reduction: an experimental study.” IEEE Transactions on Evolutionary Computation, vol. 7, no. 6, pp. 561-575.
Cover, T. M., and Hart, P. E., 1967, “Nearest neighbor pattern classification.” IEEE Transactions on Information Theory, Vol. 3, pp.21-27.
Devijver, P. A., Kittler, J., 1982, “Pattern Recognition: A Statistical Approach.” Prentice-Hall, London, GB.
De Jong, K. A., 1975, “An Analysis of the Behavior of a Class of Genetic Adaptive Systems.” PhD thesis, Department of Computer and Communication Sciences, University of Michigan.
Duda, R.O., Hart, P.E., and Stork, D.G., 2001, “Pattern Classification.” 2nd Edition, John Wiley, New York.
Edgeworth, F. Y., 1887, “On discordant observations.” Philosophical Magazine 23, 5, 364-375.
García, S., Derrac, J., Cano, J.R., and Herrera, F., 2012, “Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3.
Gates, G.W., 1972, “The Reduced Nearest Neighbor Rule.” IEEE Transactions on Information Theory, vol. 18, pp. 431-433.
Gen, M., and Cheng, R., 1997, “Genetic Algorithm and Engineering Design.” John Wiley and Sons.
Goldberg, D.E., 1989, “Genetic Algorithms in Search, Optimization, and Machine Learning.” Addison Wesley.
Hawkins, D., 1980, “Identification of Outliers.” Chapman and Hall.
Herrera, F., Lozano, M., and Verdegay, J.L., 1998, “Tackling Real-Coded Genetic Algorithms: Operators and Tools for Behavioural Analysis.” Artificial Intelligence Review, vol.12, pp.265-319.
Hodge, V.J. and Austin, J., 2004, “A survey of outlier detection methodologies.” Artificial Intelligence Review, vol. 22, pp. 85-126.
Holland, J.H., 1975, “Adaptation in Natural and Artificial Systems.” The University of Michigan Press.
Jang, J.R., Sun, C.T., and Mizutani, E., 1997, “Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence.” Prentice Hall, Inc. Upper Saddle River, NJ 07458.
Jankowski, N. and Grochowski, M., 2004, “Comparison of instances selection algorithms I: algorithms survey.” International Conference on Artificial Intelligence and Soft Computing, pp. 598-603.
Kohavi, R., 1995, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.” Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Vol. 2, pp.1137-1145.
Kotsiantis, S.B., Kanellopoulos, D. and Pintelas, P.E., 2006, “Data Preprocessing for Supervised Learning.” International Journal of Computer Science, vol. 1, pp. 1306-4428.
Kuncheva, L. I., and Sánchez, J. S., 2008, “Nearest Neighbour Classifiers for Streaming Data with Delayed Labelling.” Eighth IEEE International Conference on Data Mining.
Li, X.-B. and Jacob, V.S., 2008, “Adaptive data reduction for large-scale transaction data.” European Journal of Operational Research, vol. 188, no. 3, pp. 910-924.
Liu, H., Shah, S., and Jiang, W., 2004, “On-line outlier detection and data cleaning.” Computers and Chemical Engineering, 28, 1635–1647.
Meyer, D., 2012, “Support Vector Machines * The Interface to libsvm in package e1071.” Technische Universitat Wien, Austria.
Mitchell, T., 1997, “Machine Learning.” McGraw Hill, New York.
Olvera-López, J.A., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F. and Kittler, J., 2010, “A review of instance selection methods.” Artif Intell Rev, vol.34, pp.133-143.
Pyle, D., 1999, “Data preparation for data mining.” Morgan Kaufmann.
Reeves, C. R., 1999, “Foundations of Genetic Algorithms.” Morgan Kaufmann Publishers.
Reinartz, T., 2002, “A unifying view on instance selection.” Data Mining and Knowledge Discovery, vol. 6, pp. 191-210.
Richard, J.R. and Michael, W.G., 2003, “Data Mining: A Tutorial-Based Primer.” Addison-Wesley.
Ritter, G.L., Woodruff, H.B., Lowry, S.R., and Isenhour, T.L., 1975, “An algorithm for a selective nearest neighbor decision rule.” IEEE Transactions on Information Theory, vol. 21, pp. 665-669.
Rousseeuw, P. and Leroy, A., 1996, “Robust Regression and Outlier Detection.” 3rd Edition, John Wiley & Sons.
Sikora, R., and Piramuthu, S., 2007, “Framework for efficient feature selection in genetic algorithm based data mining.” European Journal of Operational Research, vol. 180, no. 2, pp. 723-737.
Sipser, M., 2006, “Introduction to the Theory of Computation.” Course Technology Inc. ISBN 0-619-21764-2.
Syswerda, G., 1989, “Uniform Crossover in Genetic Algorithms.” In Proceedings of the Third International Conference on Genetic Algorithms, J. Schaffer (ed.), Morgan Kaufmann, 2-9.
Tan, P.N., Steinbach, M., and Kumar, V., 2006, “Introduction to Data Mining.” Addison Wesley.
Vapnik, V.N., 1995, “The Nature of Statistical Learning Theory.” Springer, New York.
Williams, B. K., Nichols, J. D., and Conroy, M. J., 2002, “Analysis and management of animal populations.” London: Academic Press.
Wilson, D., 1972, “Asymptotic properties of nearest neighbor rules using edited data.” IEEE Transactions on Systems, Man, and Cybernetics, vol. 2, pp. 408-421.
Wilson, D.R., and Martinez, T.R., 2000, “Reduction techniques for instance-based learning algorithms.” Machine Learning, vol.38, pp. 257-286.
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., and Steinberg, D., 2008, “Top 10 algorithms in data mining.” Knowl Inf Syst 14:1–37.
