跳到主要內容

臺灣博碩士論文加值系統

(18.97.14.87) 您好!臺灣時間:2024/12/04 17:27
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:戴碧如
研究生(外文):Bi-Ru Dai
論文名稱:條件性叢集技術
論文名稱(外文):Constrained Data Clustering
指導教授:陳銘憲陳銘憲引用關係
指導教授(外文):Ming-Syan Chen
學位類別:博士
校院名稱:國立臺灣大學
系所名稱:電機工程學研究所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2006
畢業學年度:94
語文別:英文
論文頁數:121
中文關鍵詞:資料探勘資料叢集資料串流
外文關鍵詞:Data MiningData ClusteringData Stream
相關次數:
  • 被引用被引用:0
  • 點閱點閱:200
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
近年來,由於各種應用中快速累積了大量資料,資料探勘相關的研究領域越來越受到重視,而其中的資料叢集分析技術,則提供了使用者觀察相似資料群集的途徑。
由於資料探勘的研究常因應用領域而異,其中限制性探勘技術是將應用領域之專業知識加入資料探勘分析考量中的一種方式,在此論文之中的第一個研究課題,即為提出新的限制性資料叢集定義:同一個叢集之中的任意兩個成員,其限制性屬性的差值不可超過所給定的限制範圍。根據此定義,我們提出了幾個相對應的限制性資料叢集演算法,接著,由於觀察到階層式叢集演算法具有的一個基本特性,即資料分群的順序會影響最後的叢集成果,因此又更進一步地設計了漸進式解除限制(progressive constraint relaxation)之技術,以降低分群順序的影響,並提昇分群的成果。
除了針對靜態資料進行資料叢集演算法的研究之外,我們也探討了資料串流環境中的資料叢集技術。在資料串流環境中,資料通常是快速累積,因此需要利用有限的時間與空間資源,提出有效的解決方案。此論文中,我們提出了一個針對多條資料串流進行叢集分析的架構,此架構包含了兩個階段,第一個階段處理並儲存資料串流,第二個階段則提供動態回應使用者叢集分析需求的機制。
最後,我們將限制性資料叢集技術延伸至資料串流環境中,配合本論文中所提出的限制性資料叢集定義,設計相對應的資料串流儲存架構,以產生符合使用者需求與限制的資料叢集。
Among various data mining capabilities, data clustering is a useful technique for group behavior investigation, and is helpful for many applications. Since data mining is an application dependent technology, the information involving domain knowledge is usually imposed on the mining systems as various constraints. In this dissertation, we address the problem of constrained clustering with numerical constraints, in which the constraint attribute values of any two data items in the same cluster are required to be within the corresponding constraint range. Several algorithms are proposed to solve such a clustering problem. It is noted that due to the intrinsic nature of the numerical constrained clustering, there is an order dependency on the process of attaining the clustering, which in many cases degrades the clustering results. In view of this, we devise a progressive constraint relaxation technique to remedy this drawback and improve the overall performance of clustering results.
In addition to clustering on static data sets, the problem of clustering multiple data streams is also addressed in this dissertation. We devise a Clustering on Demand framework, abbreviated as COD framework, to dynamically cluster multiple data streams. The COD framework consists of two phases, i.e., the online maintaining phase and the offline clustering phase. The online maintaining phase provides an efficient mechanism to maintain the summary hierarchies of the data streams with multiple resolutions. On the other hand, an adaptive clustering algorithm is devised for the offline phase to retrieve the approximations of the desired sub-streams from the summary hierarchies according to the clustering queries.
Finally, the concepts of constraints and data streams are combined and considered together. We devise a framework of Constrained Clustering for the Evolving Data Stream, abbreviated as CCDS framework, to cluster the data stream under the pairwise range constraint. Two phases are designed to maintain the data points and to generate clusters respectively.
1 Introduction 7
1.1 Motivation and Overview of the Dissertation . . . . . . . . . . . . . . . . . . 7
1.2 Organization of theDissertation . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Clustering with Pairwise Numerical Constraints 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Pair-Wise Constrained Clustering . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Algorithms for Pair-Wise Constrained Clustering . . . . . . . . . . . . . . . 19
2.3.1 Partition-Based Clustering . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Hierarchical Clustering Algorithms . . . . . . . . . . . . . . . . . . . 25
2.4 Progressive Constraint Relaxation . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.1 OrderDependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.2 Progressive Constraint Relaxation . . . . . . . . . . . . . . . . . . . . 35
2.5 Extension toMultiple Constraint Attributes . . . . . . . . . . . . . . . . . . 38
2.6 Performance Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.1 On Progressive Constraint Relaxation . . . . . . . . . . . . . . . . . . 41
2.6.2 On Partition-Based Clustering . . . . . . . . . . . . . . . . . . . . . . 42
2.6.3 OnHierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6.4 Overall Comparison between theseAlgorithms . . . . . . . . . . . . . 45
2.6.5 On the Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3 Adaptive Clustering for Multiple Evolving Streams 51
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 Clustering onDemand forMultipleData Streams . . . . . . . . . . . . . . . 54
3.2.1 Framework Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.2 Advantages of the CODFramework . . . . . . . . . . . . . . . . . . . 57
3.3 The Framework of COD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.1 OnlineMaintaining Phase . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.2 Offline Clustering Phase . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.3 ComplexityAnalyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.4 Adaptive Use of COD . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.1 Simulation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.2 SensitivityAnalyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.3 AdaptivityAnalyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4.4 CODon Real Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.4.5 On the Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4 Constrained Clustering for the Evolving Data Stream 87
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2 Constrained Clustering for the EvolvingData Stream . . . . . . . . . . . . . 89
4.3 CCDS Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.1 Statistics Reserving Phase . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.2 Clustering Responding Phase . . . . . . . . . . . . . . . . . . . . . . 101
4.4 Performance Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.1 On Parameters of Framework CCDS . . . . . . . . . . . . . . . . . . 104
4.4.2 On Clustering Requests . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.4.3 On Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5 Conclusions 113
[1] C. C. Aggarwal. On change diagnosis in evolving data streams. IEEE Trans. On Knowledge and Data Engineering, 17(5):587—600, 2005.
[2] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proc. of VLDB, Sep. 2003.
[3] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for projected clustering of high dimensional data streams. In Proc. of VLDB, pages 852—863, 2004.
[4] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On demand classification of data streams. In Proc. of ACM SIGKDD, pages 503—508, 2004.
[5] C. C. Aggarwal, C. Procopiuc, J. L.Wolf, P. S. Yu, and J.-S. Park. Fast algorithms for projected clustering. In Proceedings of ACM SIGMOD, 1999.
[6] C. C. Aggarwal and P. S. Yu. Online analysis of community evolution in data streams. In Proc. of ACM SIAM on Data Mining (SDM05), 2005.
[7] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of PODS, June 2002.
[8] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. pages 19—26, 2002.
[9] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, NY, 1981.
[10] P. S. Bradley, K. P. Bennett, and A. Demiriz. Constrained K-Means Clustering. MSRTR-2000-65, Microsoft Research, May 2000.
[11] P. S. Bradley and U. Fayyad. Refining initial points for k—means clustering. In Proc. of ICML, pages 91—99, July 1998.
[12] A. G. Buchner and M. Mulvenna. Discovery internet marketing intelligence through online analytical web usage mining. In ACM SIGMOD Record, volume 27(4), pages 54—61, Dec. 1998.
[13] A. Bulut and A. K. Singh. Swat: Hierarchical stream summarization in large networks. In Proc. of ICDE, pages 303—314, Mar. 2003.
[14] A. Bulut and A. K. Singh. A unified framework for monitoring data streams in real time. In Proc. of ICDE, pages 44—55, 2005.
[15] J. H. Chang and W. S. Lee. estWin: Adaptively Monitoring the Recent Change of Frequent Itemsets over Online Data Streams. In Proceedings of ACM CIKM, 2003.
[16] M.-S. Chen, J. Han, and P. S. Yu. Data mining: An overview from database perspective. IEEE Trans. On Knowledge And Data Engineering, 5(1):866—883, Dec. 1996.
[17] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In Proc. of VLDB, 2002.
[18] Y. Chi, H.Wangt, P. Yu, and R.Muntz. Moment: Maintaining Closed Frequent Itemsets Over a Stream Sliding Window. In Proceedings of IEEE ICDM, 2004.
[19] G. Cormode, S. Muthukrishnan, and I. Rozenbaum. Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. In Proceedings of VLDB, 2005.
[20] P. Crescenzi and V. Kann. A compendium of np optimization problems [http://www.nada.kth.se/ viggo/problemlist/compendium.html].
[21] B.-R. Dai, J.-W. Huang, M.-Y. Yeh, and M.-S. Chen. Clustering on demand for multiple data streams. In Proc. of ICDM, pages 367—370, 2004.
[22] B.-R. Dai, C.-R. Lin, and M.-S. Chen. Constrained data clustering by depth control and progressive constraint relaxation. to appear in Very Large Data Base Journal (VLDB
J).
[23] B.-R. Dai, C.-R. Lin, and M.-S. Chen. On the techniques for data clustering with numerical constraints. In Proc. of SDM, 2003.
[24] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. of SODA, pages 635—644, Jan. 2002.
[25] A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In Proc. of ACM SIGMOD, pages 61—72, June 2002.
[26] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of ACM SIGKDD, pages 71—80, Aug. 2000.
[27] R. C. Dubes. How many clusters are best? - an experiment. Pattern Recognition, 20(6):645—663, 1987.
[28] V. Estivill-Castro and I. Lee. Autoclust+:automatic clustering of point-data sets in the presence of obstacles. In Proc. of TSDM, pages 133—146, 2000.
[29] W. Fan. Systematic data selection to mine concept-drifting data streams. In Proc. of ACM SIGKDD, pages 128—137, 2004.
[30] W. Fan, Y. an Huang, and P. Yu;. Decision Tree Evolution Rsing Limited Number of Labeled Data Items from Drifting Data Streams. In Proceedings of IEEE ICDM, 2004.
[31] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurasamy. Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA, 1996.
[32] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. In Proc. of FOCS, pages 359—366, Nov. 2000.
[33] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[34] M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. DIMACS, 50:107—118, 1999.
[35] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley Longman Publ. Co., Inc., Reading, MA., 1991.
[36] W. Huang, E. Omiecinski, and L. Mark. Compression Schemes for Differential Categorical Stream Clustering. In Proceedings of ACM CIKM, 2004.
[37] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proc. of ACM SIGKDD, pages 97—106, Aug. 2001.
[38] V. S. Iyengar. On detecting space-time clusters. In Proceedings of ACM KDD, 2004.
[39] H. V. Jagadish, J. Madar, and R. T. Ng. Semantic compression and pattern extraction with fascicles. In Proc. of VLDB, pages 186—198, 1999.
[40] C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou. Dynamically Maintaining Frequent Items over a Data Stream. In Proceedings of ACM CIKM, 2003.
[41] T. Johnson, S. Muthukrishnan, and I. Rozenbaum:. Sampling algorithms in a stream operator. In Proc. of ACM SIGMOD Conference, 2005.
[42] E. Keogh, J. Lin, andW. Truppel. Clustering of time series subsequences is meaningless: Implications for past and future research. In Proc. of ICDM, Nov. 2003.
[43] B. King. Step-wise clustering procedures. J. Am. Stat. Assoc., 69:86—101, 1967.
[44] D. Klein, S. D. Kamvar, and C. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings
of the The Nineteenth International Conference on Machine Learning (ICML-2002), Sydney, Australia, 2002.
[45] T. Li, Q. Li, S. Zhu, and M. Ogihara. A survey on wavelet applications in data mining. SIGKDD Explorations, 4(2):49—68, 2002.
[46] C.-R. Lin and M.-S. Chen. A robust and efficient clustering algorithm based on cohesion self-merging. In Proc. of ACM SIGKDD, pages 582—587, July 2002.
[47] C.-R. Lin and M.-S. Chen. On the optimal clustering of sequential data. In Proceedings of the 2nd SIAM International Conference on Data Mining, April 2002.
[48] C.-R. Lin, K.-H. Liu, and M.-S. Chen. Dual clustering: Integrating data clustering over optimization and constraint domains. IEEE Trans. On Knowledge and Data Engineering, 17(5):628—637, May 2005.
[49] S. Y. Lu and K. S. Fu. A sentence-to-sentence clustering procedure for pattern analysis. IEEE Trans. Syst. Man Cybern, 8:381—389, 1978.
[50] G. S. Manku and R. Motwani. Approximate frequency counts over streaming data. In Proc. of VLDB, pages 346—357, Aug. 2002.
[51] O. Nasraoui, C. Uribe, C. Coronel, and F. Gonzalez. TECNO-STREAMS: Tracking Evolving Clusters in Noisy Data Streams With a Scalable Immune System Learning Model. In Proceedings of IEEE ICDM, 2003.
[52] R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. of VLDB, 1994.
[53] L. O’Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. Streaming-data algorithms for high-quality clustering. In Proc. of ICDE, 2002.
[54] Y.-J. Oyang, C.-Y. Chen, and T.-W. Yang. A study on the hierachical data clustering algorithm based on gravity theory. In Proceedings of 5th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 350—361, 2001.
[55] C. R. Palmer and C. Faloutsos. Density biased sampling: An improved method for data mining and clustering. In ACM SIGMOD International Conference on Management of Data, 2000.
[56] L. Qiao, D. Agrawal, and A. E. Abbadi. RHist: Adaptive Summarization over Continuous Data Streams. In Proceedings of ACM CIKM, 2002.
[57] K. Rose, E. Gurewitz, and G. Fox. Constrained clustering as an optimization method. IEEE Transactions On Pattern Analysis and Machine Intelligence, 15(8):785—794, Aug. 1993.
[58] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, London, UK, 1973.
[59] W.-G. Teng, M.-S. Chen, and P. S. Yu. A regression-based temporal pattern mining scheme for data streams. In Proc. of VLDB, Sep. 2003.
[60] W.-G. Teng, M.-S. Chen, and P. S. Yu. Using wavelet-based resource-aware mining to explore temporal and support count granularities in data streams. In Proc. of SDM, Apr. 2004.
[61] A.K.H. Tung, J. Han, L. V. S. Lakshmanan, andR. T. Ng. Constraint-based clustering in large databases. In Proceedings of 2001 International Conference on Database Theory, Jan. 2001.
[62] A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. In Proc. of ICDE, pages 359—367, 2001.
[63] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained K-Means clustering with background knowledge. 2001.
[64] K.-L. Wu, S.-K. Chen, and P. S. Yu. Interval Query Indexing for Efficient Stream Processing. In Proceedings of ACM CIKM, 2004.
[65] J. Yang. Dynamic clustering of evolving streams with a single pass. In Proc. of ICDE03, Mar. 2003.
[66] M. Yeung and B. Yeo. Time-constrained clustering for segmentation of video into story units. In International Conference on Pattern Recognition, pages 375—380, May 1996.
[67] O. R. Zaïane, A. Foss, C.-H. Lee, andW.Wang. On data clustering analysis: Scalability, constraints, and validation. In PAKDD, pages 28—39, 2002.
[68] D. Zhang, D. Gunopulos, V. J. Tsotras, and B. Seeger. Temporal aggregation over data streams using multiple granularities. In Proc. of EDBT, pages 646—663, Mar. 2002.
[69] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large database. In Proceedings of the ACM SIGMOD Conference on Management
of Data, pages 103—114, 1996.
[70] X. Zhu, X. Wu, and Y. Yang. Dynamic Classifier Selection for Effective Mining from Noisy Data Streams. In Proceedings of IEEE ICDM, 2004.
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top
無相關期刊