National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 林佳志
Author (English): Jia-Zhi Lin
Title: 以SimHash改良K-Means分群效率
Title (English): Improving Clustering Efficiency by SimHash-based K-Means Algorithm
Advisor: 王正豪
Committee members: 劉傳銘, 楊凱翔
Oral defense date: 2014-06-26
Degree: Master's
Institution: National Taipei University of Technology
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2014
Graduation academic year: 102 (ROC calendar, i.e., 2013–2014)
Language: Chinese
Pages: 43
Keywords (Chinese): SimHash, K-Means, 文件分群, 相似度計算
Keywords (English): SimHash, K-Means, Document Clustering, Similarity Calculation
Statistics:
  • Cited by: 0
  • Views: 562
  • Downloads: 0
  • Bookmarked: 0
Classification and clustering techniques are frequently applied in data mining and analysis. K-Means is one of the most commonly used clustering methods, but the clustering process requires a large number of similarity-distance and centroid computations, which makes it inefficient.
Prior work has proposed better ways to select initial centroids, spreading them out to improve accuracy, and has reduced the number of similarity-distance computations needed to assign each document to a suitable cluster. However, the larger the data set, the higher the vector dimensionality, and the more time similarity calculation takes.
This thesis introduces the SimHash dimensionality-reduction method to improve clustering efficiency. SimHash was originally applied to near-duplicate document detection: each document is represented as a fixed-length hash value, and the Hamming distance between hash values measures the similarity between documents. Applying SimHash to clustering, the experimental results show that it effectively reduces the computational cost without greatly affecting the clustering results.
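The SimHash step described in the abstract — representing each document as a fixed-length hash value and comparing fingerprints by Hamming distance — can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names are hypothetical, Python's `hashlib.md5` stands in for a production hash such as MurmurHash, and term weights are assumed to be supplied by the caller.

```python
import hashlib

def simhash(weighted_terms, bits=64):
    """Compute a SimHash fingerprint: each term's hash votes +weight/-weight
    on every bit position; the sign of each accumulated vote becomes the
    corresponding fingerprint bit."""
    votes = [0] * bits
    for term, weight in weighted_terms.items():
        # Hash the term to a `bits`-wide integer (md5 here is only a stand-in).
        h = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Because the fingerprint length is fixed (e.g., 64 bits), comparing two documents costs one XOR and a popcount, regardless of the original vector dimensionality — this is the source of the efficiency gain claimed in the abstract.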


K-Means is one of the most popular clustering methods, but it requires a large amount of processing time for similarity calculation, which lowers its performance. Some studies have proposed methods for finding better initial centroids and for assigning data points to suitable clusters with reduced time complexity. However, with large amounts of data, the vector dimensionality becomes higher and similarity calculation takes even more time. In this thesis, we propose a SimHash-based K-Means algorithm that uses dimensionality reduction and Hamming distance to handle large amounts of data. The experimental results show that the proposed method improves efficiency without significantly affecting effectiveness.
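Over such fingerprints, the K-Means grouping and centroid-recalculation steps reduce to Hamming-distance comparisons and bitwise majority votes. The sketch below is an approximation under stated assumptions — random initial centroids and a per-bit majority vote for centroid updates; the thesis's actual initial-centroid selection strategy (described in its Chapter 3) may differ.

```python
import random

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def majority_centroid(fps, bits):
    """Recompute a cluster centroid as the bitwise majority vote of its
    members' fingerprints: a centroid bit is 1 iff most members have 1 there."""
    centroid = 0
    for i in range(bits):
        ones = sum((fp >> i) & 1 for fp in fps)
        if ones * 2 > len(fps):
            centroid |= 1 << i
    return centroid

def hamming_kmeans(fps, k, iters=20, bits=64, seed=0):
    """K-Means over SimHash fingerprints: assign each fingerprint to the
    nearest centroid by Hamming distance, then recompute centroids, until
    the centroids stop changing or `iters` is reached."""
    rng = random.Random(seed)
    centroids = rng.sample(fps, k)  # assumption: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for fp in fps:
            nearest = min(range(k), key=lambda j: hamming(fp, centroids[j]))
            clusters[nearest].append(fp)
        new_centroids = [majority_centroid(c, bits) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

The inner loop performs k XOR/popcount operations per document instead of k high-dimensional vector distance computations, which is where the reduced cost comes from.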

Abstract (Chinese) i
Abstract (English) ii
Acknowledgements iii
Table of Contents iv
List of Tables vi
List of Figures vii
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Objectives 2
1.3 Contributions 3
1.4 Thesis Organization 3
Chapter 2 Related Work 4
2.1 The K-Means Algorithm and Its Variants 4
2.2 Efficiency Improvements for K-Means Clustering 6
2.3 Document Similarity Calculation 7
Chapter 3 Methodology 9
3.1 Preprocessing 10
3.2 Feature Conversion 10
3.3 Clustering 11
3.3.1 Initial Centroids 13
3.3.2 Grouping 17
3.3.3 Recalculate Centroids 18
Chapter 4 Experimental Results 20
4.1 Experimental Environment, Data, and Evaluation 20
4.2 SimHash-based K-Means 24
4.2.1 Initial Centroids 24
4.2.2 Grouping 25
4.3 SimHash Parameters 26
4.3.1 Text Segmentation 27
4.3.2 Term Weight 28
4.3.3 Features 30
4.4 Comparison of SimHash-based K-Means and K-Means 32
4.5 Number of Clusters k 34
4.6 Number of Documents n 36
Chapter 5 Conclusions and Future Work 40
References 41



Electronic full text (access restricted to the campus network and IP range of the graduate's school)