Researcher: 李憶瑤 (Yi-yao Lee)
Thesis Title: 具概念遞移之串流資料的分類技術
Thesis Title (English): Efficient Classification for Mining Concept-Drifting Data Streams
Advisor: 陳銘憲 (Ming-Syan Chen)
Degree: Master's
Institution: National Taiwan University (國立臺灣大學)
Department: Graduate Institute of Electrical Engineering (電機工程學研究所)
Discipline: Engineering
Field: Electrical and Information Engineering
Document Type: Academic thesis
Year of Publication: 2005
Graduation Academic Year: 93 (2004-05)
Language: English
Pages: 45
Keywords (Chinese): 串流資料, 概念遞移, 資訊探勘, 決策樹
Keywords (English): Data Streams, Concept Drift, Classification, Data Mining
Record statistics:
  • Cited: 0
  • Views: 150
  • Downloads: 0
  • Bookmarked: 0
Effectively managing large, fine-grained, and rapidly accumulating data streams is a new challenge for data mining, which has traditionally focused on analyzing static data. Conventional classifiers are designed to learn stationary predictive models from the data, so on concept-drifting data streams they cannot effectively capture and learn the new concepts that emerge after a drift. In this thesis we propose an efficient classifier for concept-drifting data streams based on SODA (Speedy cOncept-drift Detection Algorithm). SODA is an online, incremental classifier whose main advantage is that it can process newly arriving data and learn the drifted concepts in constant time per example.
The contributions of this thesis are severalfold. First, unlike prior work, we define a concept drift in a data stream as a significant change in the distribution of the attribute that is most discriminative with respect to the target class. Built on this definition and a statistical test, the fast drift-detection method of the SODA algorithm can capture concept drifts in data streams quickly and effectively. Second, by integrating tree-pruning criteria based on information-gain improvement and classification accuracy, the decision tree is kept at an appropriate depth, balancing detection efficiency against classifier accuracy. Finally, with an effective candidate-tree selection strategy, SODA can efficiently check whether the current decision tree is still up to date and select the best candidate tree. A series of experiments shows that the proposed SODA algorithm improves on prior algorithms in drift-detection effectiveness, execution efficiency, classification accuracy, and memory consumption.
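The drift-detection idea described above can be sketched in a few lines. The following is a minimal illustration, not the SODA implementation from this thesis: it assumes the most discriminative attribute has already been identified, and it uses a two-sample Kolmogorov-Smirnov test with arbitrary window sizes in place of the thesis's own statistical test; the class name DriftDetector and all parameter values are hypothetical.

from collections import deque
from scipy.stats import ks_2samp

class DriftDetector:
    """Flags a concept drift when the recent distribution of the monitored
    attribute differs significantly from a reference window."""

    def __init__(self, window_size=500, alpha=0.01):
        self.window_size = window_size
        self.alpha = alpha                           # significance level of the test
        self.reference = deque(maxlen=window_size)   # values observed before a suspected drift
        self.current = deque(maxlen=window_size)     # most recent values

    def add(self, value):
        # Fill the reference window first, then keep a sliding window of recent values.
        if len(self.reference) < self.window_size:
            self.reference.append(value)
        else:
            self.current.append(value)

    def drift_detected(self):
        # Not enough recent data to run a meaningful test yet.
        if len(self.current) < self.window_size:
            return False
        _, p_value = ks_2samp(list(self.reference), list(self.current))
        return p_value < self.alpha

    def reset(self):
        # After a drift has been handled, the recent window becomes the new reference.
        self.reference = deque(self.current, maxlen=self.window_size)
        self.current = deque(maxlen=self.window_size)

In such a sketch, each arriving example would feed the value of the monitored attribute into add(), and a True result from drift_detected() would trigger the growth of alternative subtrees as described in Section 3.5.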
We devise in this thesis a concept-drift-driven classification algorithm, called SODA (Speedy Concept-Drift Detection Algorithm), to mine data streams with concept drift. SODA is an online incremental learning algorithm that is able to keep its model consistent with new concepts and to process each example in constant time. The contributions of the SODA algorithm are manifold. We address the problem of detecting concept drifts by inspecting the distribution of the single attribute that is most discriminative with respect to the target class. The SODA algorithm captures concept drifts in data streams efficiently while balancing execution performance against classifier accuracy. As the empirical studies in Section 4 show, by applying the efficient split-checking method, the statistical concept-drift detection, and the effective alternative-tree selection strategy, SODA outperforms prior work in terms of execution efficiency, drift-detection performance, and economical use of memory. The concepts in data streams can thus be captured and learned efficiently, and SODA strikes a balance between memory usage and classifier accuracy on data streams.
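The alternative-tree selection strategy mentioned in the abstract can likewise be sketched. This is an illustrative outline under assumed interfaces, not the strategy actually specified in Section 3.5.2: the names DriftAwareNode and learner_factory, the evaluate-then-train scoring, and the evaluation window of 1000 examples are all hypothetical, and the real SODA algorithm may weigh accuracy and information gain differently.

class DriftAwareNode:
    """Keeps the subtree currently used for prediction and, after a suspected
    drift, grows an alternative subtree on recent examples, adopting it once
    it proves more accurate."""

    def __init__(self, learner_factory):
        self.learner_factory = learner_factory
        self.current = learner_factory()   # subtree used for prediction
        self.alternative = None            # candidate grown after a suspected drift
        self.current_correct = 0
        self.alternative_correct = 0
        self.evaluated = 0

    def start_alternative(self):
        # Called when the drift detector fires for this node.
        self.alternative = self.learner_factory()
        self.current_correct = self.alternative_correct = self.evaluated = 0

    def learn(self, x, y, eval_window=1000):
        # Evaluate-then-train: score both subtrees on the example before updating them.
        if self.alternative is not None:
            self.current_correct += int(self.current.predict(x) == y)
            self.alternative_correct += int(self.alternative.predict(x) == y)
            self.evaluated += 1
            self.alternative.learn(x, y)
            if (self.evaluated >= eval_window
                    and self.alternative_correct > self.current_correct):
                self.current = self.alternative   # the candidate replaces the old subtree
                self.alternative = None
        self.current.learn(x, y)

    def predict(self, x):
        return self.current.predict(x)

Here learner_factory stands for any incremental decision-tree learner with learn(x, y) and predict(x) methods; the point of the sketch is only the bookkeeping that lets a node compare its current subtree with a candidate grown on post-drift data.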
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Decision Tree Approach . . . . . . . . . . . . . . . . . . . . . . 12
2.3 The Mutual Consistency Test . . . . . . . . . . . . . . . . . . . 14
3. The SODA Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Definition of Concept Drift . . . . . . . . . . . . . . . . . . . 18
3.2 Timers for Three Processes . . . . . . . . . . . . . . . . . . . . 19
3.3 The Split Checking Process . . . . . . . . . . . . . . . . . . . . 21
3.4 Capability of Fast Concept Drift Detection . . . . . . . . . . . . 24
3.5 Learning New Concepts . . . . . . . . . . . . . . . . . . . . . . 27
3.5.1 Alternative Subtrees Generation . . . . . . . . . . . . . . . . 27
3.5.2 An Effective Subtree Selection Strategy . . . . . . . . . . . . 28
4. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 32
4.1 Simulation Model . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Scalability of SODA . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Evaluation of Concept Drift Detection . . . . . . . . . . . . . . 35
5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42