Graduate student (English name): Piin-Tsong Dai
Thesis title (English): An Efficient Density-based Clustering Algorithm
Advisor (English name): Don-Lin Yang
Keywords (English): Clustering Algorithm; Data Mining; Density-Based; Cluster Parameters
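In this thesis, we propose a new density-based clustering algorithm, Density-Based Clustering using Statistical Partition (DBCSP), to speed up finding the final converged centroids of the clusters. The main purpose of our research is to understand clustering algorithms and to analyze cluster characteristics in order to determine the parameters of clustering algorithms. Previous research has proposed a wide variety of clustering algorithms for handling large multidimensional datasets. However, almost all clustering algorithms require parameters to be given in advance, while users lack the background knowledge needed to choose them. Our research therefore analyzes and explores what the data distribution of clusters means and what the parameters mean for the clusters. We propose a cluster evaluation formula that effectively finds suitable cluster parameters, sparing users from setting them.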
In recent years, many clustering algorithms have been recognized as powerful tools for data mining. Most have been extended to work efficiently on large datasets and across various application domains. Clustering is used in diverse applications such as image compression, market segmentation, spatial data discovery, and statistical data analysis.
In this thesis, we propose a novel algorithm called Density-Based Clustering using Statistical Partition (DBCSP), which builds on the density-based clustering method to speed up finding the final converged centroids. The goal of our research is to study clustering algorithms and to analyze the characteristics of clusters in order to set the parameters of clustering algorithms. Previous research has produced many clustering algorithms that can handle high-dimensional datasets. However, almost all of them require input parameters, and most users do not have enough domain knowledge to determine these parameters. Our research therefore focuses on analyzing the meaning of data distributions in clusters and the relation of parameters to their respective clusters. We propose a formula that evaluates the important factors in a cluster to determine whether a dataset is well clustered, which relieves users of setting parameters.
Some clustering applications, such as target market research, require a specified number of clusters. We therefore also propose a more efficient algorithm based on the k-means method that produces the same or comparable clustering results with much better performance, reducing both the total cost of distance calculations and the processing time.
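For context, the baseline that the k-means paragraph refers to can be sketched as plain Lloyd-style k-means. This is not the thesis's improved algorithm, which the abstract does not describe; the function name `kmeans`, the `distance_count` counter, and the sample points are illustrative assumptions. The counter simply makes visible the distance-calculation cost that the abstract says can be reduced.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd-style k-means (baseline only, not the thesis's method).

    Returns (centroids, labels, distance_count).
    """
    centroids = [tuple(c) for c in centroids]
    distance_count = 0
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = []
        for p in points:
            distances = [dist(p, c) for c in centroids]
            distance_count += len(centroids)
            labels.append(distances.index(min(distances)))
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: centroids did not move
        centroids = updated
    return centroids, labels, distance_count


# Hypothetical toy data: two well-separated groups in 2D.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels, distance_count = kmeans(points, [(1.0, 1.0), (9.0, 8.0)])
```

Every iteration costs one distance computation per point per centroid, so any method that reaches comparable centroids in fewer iterations, or with fewer point-to-centroid comparisons, lowers `distance_count` directly.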
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Data Mining
1.2.1 Classification
1.2.2 Cluster Analysis
1.2.3 Association Rule
1.2.4 Sequential Pattern
1.3 Summary
1.4 Organizations of the Thesis
Chapter 2 Related Work
2.1 Density Clustering
2.1.1 DBSCAN Method
2.1.2 OPTICS Method
2.1.3 DBCLASD Method
2.1.4 DENCLUE Method
2.1.5 K-means Method
Chapter 3 Proposed Methods
3.1 Overview
3.2 The Proposed Density-Based clustering Methods
3.2.1 Data Preprocessing
3.2.2 Partitioning the Dataset into Unit Blocks
3.2.3 Density-Based Clustering
Chapter 4 Experiments and Results
4.1 Dataset Description
4.2 Experimental Results
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
