National Digital Library of Theses and Dissertations in Taiwan

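Graduate student: 廖秋閔
Graduate student (English): Chiu-Min Liao
Thesis title: Clustering Flow-Type Data with Agglomerative Hierarchical Clustering
Thesis title (English): Agglomerative Hierarchical clustering with the string data
Advisor: 曾富祥
Advisor (English): Fu-Shiang Tseng
Degree: Master's
Institution: National Central University
Department: Graduate Institute of Industrial Management
Discipline: Business and Management
Sub-discipline: Other Business and Management
Thesis type: Academic thesis
Year of publication: 2014
Graduation academic year: 102
Language: English
Number of pages: 53
Keywords (translated from Chinese): Data mining, Cluster analysis, Similarity measure, String data, Agglomerative hierarchical clustering
Keywords (English): Data mining, Cluster analysis, Similarity Measure, String data, Agglomerative clustering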
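Abstract (translated from the Chinese)
Advances in technology have caused the volume of data to grow rapidly. Data mining is an effective way to organize very large volumes of data, allowing managers to extract relevant information and make appropriate decisions. Cluster analysis is one of the most commonly used data mining methods, and it groups data according to their features. The data types most often used in cluster analysis are qualitative data and quantitative data, whereas flow data, or string data, have received little attention in the past. This study therefore proposes feasible clustering methods for flow data (string data).
To measure similarity, we adopt two measures, Jaro similarity and Edit distance, where a larger distance indicates a smaller similarity. From the defined similarity or distance we construct a similarity matrix and use it to cluster the data. We use agglomerative hierarchical clustering, including the shortest-distance (single linkage), longest-distance (complete linkage), and average-distance (average linkage) methods. In agglomerative hierarchical clustering, each data item initially forms its own cluster; the most similar clusters are merged one at a time until all data belong to a single cluster. The advantages of hierarchical clustering are that the analyst can choose the number of clusters and that the hierarchical tree diagram makes the clustering steps easy to follow.
All case data examined in this study are flow data (string data). Three examples are used: two are benchmark datasets widely used by other researchers, and the third consists of parts awaiting repair that arise during engine overhaul work. Because different parts pass through different repair stations, each part has its own repair process. The study mainly addresses the problem of measuring similarity between flow data (string data) so that the data can be clustered by similarity, allowing managers to schedule repair work or make other decisions based on the clustering results.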
Abstract (English original)
Advances in science and technology have led to rapid growth in data. Data mining helps us organize large volumes of data efficiently, so that managers can uncover information they did not previously know and make appropriate decisions. Cluster analysis, which groups data according to their features, is one of the most widely used data mining methods. Most data used in cluster analysis are qualitative or quantitative, and string data (flow data) are seldom discussed. In this research, we therefore propose some possible clustering methods for string data.
For the similarity measure, we adopt two measurements: Jaro similarity and Edit distance. The larger the distance, the smaller the similarity. From the similarity or distance that we define, we obtain a similarity matrix, and the data are clustered on the basis of this matrix. We consider agglomerative hierarchical clustering with single linkage, complete linkage, and average linkage to group string data. At the start of agglomerative clustering, each string is in its own cluster, so every cluster contains exactly one string. The most similar clusters are then merged, and after a series of merge operations all strings end up in the same cluster. The advantage of hierarchical clustering is that we can choose the number of groups ourselves, and the hierarchical tree shows the clustering steps clearly.
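The two measures named above can be sketched in Python as follows. This is a minimal illustration that assumes the textbook definitions of Jaro similarity and Levenshtein edit distance; this record does not give the thesis's exact formulation, and the function names are hypothetical.

```python
def jaro_similarity(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Two equal characters count as a match only within this window.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]
```

Jaro similarity is a similarity (higher means closer) while edit distance is a distance (lower means closer), so one of them has to be converted, for example as `1 - jaro_similarity(s, t)`, before both can feed the same clustering routine.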
We use three examples, all consisting of string data, to present our methodology: two benchmark examples and an engine parts dataset. Because different parts pass through different repair workstations, every part has its own repair procedure. Our study focuses on the problem of computing the similarity between strings. We want to cluster the string data so that the clustering result can help the workstations work efficiently.
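As a rough illustration of the overall procedure, the sketch below clusters a handful of routing strings with a naive agglomerative loop. Everything specific here is an assumption rather than the thesis's data or code: the routings are made up (each letter stands for a hypothetical repair workstation), the edit distance is normalized by the longer string's length, and the names `agglomerate` and `LINKAGES` are illustrative.

```python
from itertools import combinations


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(s: str, t: str) -> float:
    """Edit distance scaled to [0, 1] (assumed normalization: longer length)."""
    longest = max(len(s), len(t))
    return edit_distance(s, t) / longest if longest else 0.0


# How the pairwise distances between two clusters are combined.
LINKAGES = {
    "single": min,                            # closest pair
    "complete": max,                          # farthest pair
    "average": lambda d: sum(d) / len(d),     # mean over all pairs
}


def agglomerate(items, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    combine = LINKAGES[linkage]
    # Pairwise distance matrix, stored once for index pairs i < j.
    dist = {(i, j): normalized_distance(items[i], items[j])
            for i, j in combinations(range(len(items)), 2)}
    clusters = [[i] for i in range(len(items))]  # every item starts alone

    def cluster_distance(a, b):
        return combine([dist[min(i, j), max(i, j)] for i in a for j in b])

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [[items[i] for i in c] for c in clusters]


# Hypothetical routings: each letter is one repair workstation.
routes = ["ABCD", "ABCE", "ABD", "XYZ", "XYZW", "XZ"]
groups = agglomerate(routes, n_clusters=2, linkage="average")
```

With these made-up routings, the parts that share workstations A to E end up in one group and those using X to Z in the other. The loop recomputes cluster distances at every step, which is fine for a small example but would be slow on large datasets.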

Contents
Chinese Abstract
Abstract
1. Introduction
1.1 Background/Motivation
1.2 Research objectives
1.3 Research Methodology
2. Literature Review
2.1 Cluster Analysis
2.2 Group Technology
2.3 Similarity measure
2.3.1 Euclidean distance
2.3.2 Jaro similarity
2.3.3 Jaro-Winkler similarity
2.3.4 Edit distance
2.4 Clustering Techniques
2.4.1 Hierarchical
2.4.2 Non-Hierarchical
2.5 Clustering Validity assessment
3. Methodology
3.1 Jaro similarity
3.2 Normalized Edit distance
3.2.1 Edit distance
3.2.2 Normalized Edit Distance
4. Numerical Example
4.1 Example 1
4.2 Example 2
4.3 Example 3
5. Conclusion and Future Research
5.1 Conclusion
5.2 Future Research
Reference
Appendix: The Procedure of Clustering


