National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 林佳志
Author (English): Jia-Zhi Lin
Title: 以SimHash改良K-Means分群效率
Title (English): Improving Clustering Efficiency by SimHash-based K-Means Algorithm
Advisor: 王正豪
Committee members: 劉傳銘, 楊凱翔
Oral defense date: 2014-06-26
Degree: Master's
Institution: National Taipei University of Technology
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2014
Graduation academic year: 102 (ROC calendar, i.e., 2013–2014)
Language: Chinese
Pages: 43
Keywords (Chinese): SimHash, K-Means, 文件分群, 相似度計算
Keywords (English): SimHash, K-Means, Document Clustering, Similarity Calculation
Statistics:
  • Cited by: 0
  • Views: 562
  • Downloads: 0
  • Bookmarked: 0
Classification and clustering techniques are frequently applied in data mining and analysis. K-Means is one of the most commonly used clustering methods, but the clustering process requires a large number of similarity-distance and centroid computations, which makes it inefficient.
Prior work has proposed better ways to select initial centroids, spreading them out to improve accuracy, and has reduced the number of similarity-distance computations needed to assign each document to a suitable cluster. However, the larger the data set, the higher the vector dimensionality, and the more time similarity calculation takes.
This thesis introduces the SimHash dimensionality-reduction method to improve clustering efficiency. SimHash was originally applied to near-duplicate document detection: each document is represented as a fixed-length hash value, and the Hamming distance between hash values measures the similarity between documents. Applying SimHash to clustering, the experimental results show that it effectively reduces the computational cost without greatly affecting the clustering results.
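The SimHash step described in the abstract — representing each document as a fixed-length hash value and comparing fingerprints by Hamming distance — can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names are hypothetical, Python's `hashlib.md5` stands in for a production hash such as MurmurHash, and term weights are assumed to be supplied by the caller.

```python
import hashlib

def simhash(weighted_terms, bits=64):
    """Compute a SimHash fingerprint: each term's hash votes +weight/-weight
    on every bit position; the sign of each accumulated vote becomes the
    corresponding fingerprint bit."""
    votes = [0] * bits
    for term, weight in weighted_terms.items():
        # Hash the term to a `bits`-wide integer (md5 here is only a stand-in).
        h = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Because the fingerprint length is fixed (e.g., 64 bits), comparing two documents costs one XOR and a popcount, regardless of the original vector dimensionality — this is the source of the efficiency gain claimed in the abstract.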


K-Means is one of the most popular clustering methods, but it requires a large amount of processing time for similarity calculation, which lowers its performance. Some studies have proposed methods for finding better initial centroids and for assigning data points to suitable clusters with reduced time complexity. However, with large amounts of data, the vector dimensionality becomes higher and similarity calculation takes even more time. In this thesis, we propose a SimHash-based K-Means algorithm that uses dimensionality reduction and Hamming distance to handle large amounts of data. The experimental results show that the proposed method improves efficiency without significantly affecting effectiveness.
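Over such fingerprints, the K-Means grouping and centroid-recalculation steps reduce to Hamming-distance comparisons and bitwise majority votes. The sketch below is an approximation under stated assumptions — random initial centroids and a per-bit majority vote for centroid updates; the thesis's actual initial-centroid selection strategy (described in its Chapter 3) may differ.

```python
import random

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def majority_centroid(fps, bits):
    """Recompute a cluster centroid as the bitwise majority vote of its
    members' fingerprints: a centroid bit is 1 iff most members have 1 there."""
    centroid = 0
    for i in range(bits):
        ones = sum((fp >> i) & 1 for fp in fps)
        if ones * 2 > len(fps):
            centroid |= 1 << i
    return centroid

def hamming_kmeans(fps, k, iters=20, bits=64, seed=0):
    """K-Means over SimHash fingerprints: assign each fingerprint to the
    nearest centroid by Hamming distance, then recompute centroids, until
    the centroids stop changing or `iters` is reached."""
    rng = random.Random(seed)
    centroids = rng.sample(fps, k)  # assumption: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for fp in fps:
            nearest = min(range(k), key=lambda j: hamming(fp, centroids[j]))
            clusters[nearest].append(fp)
        new_centroids = [majority_centroid(c, bits) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

The inner loop performs k XOR/popcount operations per document instead of k high-dimensional vector distance computations, which is where the reduced cost comes from.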

Abstract (Chinese) i
Abstract (English) ii
Acknowledgements iii
Table of Contents iv
List of Tables vi
List of Figures vii
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Objectives 2
1.3 Contributions 3
1.4 Thesis Organization 3
Chapter 2 Related Work 4
2.1 The K-Means Algorithm and Its Variants 4
2.2 Efficiency Improvements for K-Means Clustering 6
2.3 Document Similarity Calculation 7
Chapter 3 Methodology 9
3.1 Preprocessing 10
3.2 Feature Conversion 10
3.3 Clustering 11
3.3.1 Initial Centroids 13
3.3.2 Grouping 17
3.3.3 Recalculate Centroids 18
Chapter 4 Experimental Results 20
4.1 Experimental Environment, Data, and Evaluation 20
4.2 SimHash-based K-Means 24
4.2.1 Initial Centroids 24
4.2.2 Grouping 25
4.3 SimHash Parameters 26
4.3.1 Text Segmentation 27
4.3.2 Term Weight 28
4.3.3 Features 30
4.4 Comparison of SimHash-based K-Means and K-Means 32
4.5 Number of Clusters k 34
4.6 Number of Documents n 36
Chapter 5 Conclusions and Future Work 40
References 41



Electronic full text (access restricted to the campus network and IP range of the graduate's school)