National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


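Author: 趙士賢 (Shih-Hsien Chao)
Title: 平行化資訊理論共分群演算法
Title (English): Parallel Information-Theoretic Co-Clustering based on MapReduce
Advisor: 張嘉惠 (Chia-Hui Chang)
Degree: Master's
Institution: National Central University (國立中央大學)
Department: Graduate Institute of Software Engineering (軟體工程研究所)
Discipline: Social and Behavioral Sciences
Field: Economics
Thesis type: Academic thesis
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: Chinese
Number of pages: (not listed)
Chinese keywords: 共分群 (co-clustering), 雲端 (cloud)
English keywords: co-clustering, cloud computing, Hadoop, MapReduce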


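Data clustering is widely applied in many fields, such as data mining, document retrieval, image segmentation, and pattern classification. Traditional data clustering algorithms can usually be applied only to small-scale data analysis. Today, clustering often has to cope with several gigabytes of data, which an ordinary computer can no longer process. To address this problem, many researchers have tried to design efficient parallel clustering algorithms for clustering large data sets.
In this thesis we focus on the Information-Theoretic Co-clustering (ITCC) algorithm. ITCC is a co-clustering algorithm that clusters rows and columns simultaneously, and its objective function is based on the mutual information between the row vectors and the column vectors. ITCC is widely used in many fields, such as text mining, social recommendation systems, and bioinformatics.
In this thesis we propose the Parallel Information-Theoretic Co-Clustering (PITCC) algorithm. Because the volume of data to be processed is very large, we use Hadoop, a parallel computing platform that has emerged and become popular in recent years, and carry out the computation in the MapReduce style. MapReduce is widely accepted by both academia and industry and is a simple yet very powerful programming approach. Besides being highly scalable, Hadoop is also easy to use. We use the data set released for the CAMRa2011 contest. Finally, in the experiments we use three performance measures to evaluate our results and show that the proposed algorithm is efficient and can handle large data sets.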
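To make the MapReduce style of computation concrete, here is a minimal single-process Python sketch. It is not the thesis's PITCC implementation and it is not Hadoop code. It assumes the input is a list of hypothetical (row, column, value) records plus the current row and column cluster assignments, and it sums the values that fall into each co-cluster, a statistic that a co-clustering iteration typically needs. The names `map_entry`, `reduce_sum`, and `run_job` are illustrative only.

```python
from collections import defaultdict

def map_entry(record, row_cluster, col_cluster):
    """Map step: emit ((row cluster, column cluster), value) for one matrix entry."""
    row, col, value = record
    yield (row_cluster[row], col_cluster[col]), value

def reduce_sum(key, values):
    """Reduce step: sum all values that share the same co-cluster key."""
    return key, sum(values)

def run_job(records, row_cluster, col_cluster):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_entry(record, row_cluster, col_cluster):
            groups[key].append(value)
    return dict(reduce_sum(key, values) for key, values in groups.items())

# Hypothetical toy input: (row id, column id, value) triples
records = [(0, 0, 2), (0, 1, 1), (1, 0, 3), (2, 2, 4), (3, 2, 1)]
row_cluster = {0: 0, 1: 0, 2: 1, 3: 1}   # assumed current row assignments
col_cluster = {0: 0, 1: 0, 2: 1}         # assumed current column assignments

result = run_job(records, row_cluster, col_cluster)
print(result)
```

On a real cluster the grouping step is done by the framework's shuffle phase, and the cluster assignments would have to be distributed to every mapper. The sketch ignores both concerns.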

Data clustering is widely used in many domains, for example data mining, document retrieval, image segmentation, and pattern classification. Traditional clustering algorithms are usually applied to small-scale data analysis. Today we often have to deal with data sets that are too large to be processed on a single computer. To solve this problem, many researchers have attempted to design efficient parallel clustering algorithms for huge data sets.
In this thesis we focus on Information-Theoretic Co-clustering (ITCC), which clusters rows and columns simultaneously based on the mutual information between the clustered random variables, subject to constraints on the number of row and column clusters. ITCC is widely used in many domains, such as text mining, social recommendation systems, and bioinformatics.
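The mutual-information objective can be illustrated with a small NumPy sketch. It assumes the usual ITCC formulation, in which a co-clustering is judged by the loss in mutual information between the original joint distribution p(X, Y) and the joint distribution of the row and column clusters. The function names and the toy 4-by-4 distribution are assumptions made for illustration, and the sketch only evaluates the objective for given assignments. It does not perform the alternating optimization.

```python
import numpy as np

def mutual_information(p):
    """Mutual information in bits of a joint distribution given as a 2-D array."""
    p = np.asarray(p, dtype=float)
    px = p.sum(axis=1, keepdims=True)   # row marginal, shape (m, 1)
    py = p.sum(axis=0, keepdims=True)   # column marginal, shape (1, n)
    mask = p > 0                        # zero cells contribute nothing
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

def cocluster_joint(p, row_labels, col_labels, k, l):
    """Aggregate p(x, y) into the k-by-l joint distribution of the clusters."""
    p = np.asarray(p, dtype=float)
    rows = np.asarray(row_labels)[:, None]
    cols = np.asarray(col_labels)[None, :]
    q = np.zeros((k, l))
    np.add.at(q, (rows, cols), p)       # unbuffered add handles repeated indices
    return q

def information_loss(p, row_labels, col_labels, k, l):
    """I(X;Y) minus the mutual information between row and column clusters."""
    q = cocluster_joint(p, row_labels, col_labels, k, l)
    return mutual_information(p) - mutual_information(q)

# Toy joint distribution with two clean diagonal blocks (illustrative only)
p = np.zeros((4, 4))
p[:2, :2] = 0.125
p[2:, 2:] = 0.125

loss_good = information_loss(p, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
loss_bad = information_loss(p, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)
print(loss_good, loss_bad)
```

The assignment that follows the block structure loses no information, while the assignment that mixes the two blocks loses all of it.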
We propose a Parallel Information-Theoretic Co-Clustering (PITCC) algorithm based on MapReduce. Because we need to analyze huge data sets, we develop our algorithm on a cloud computing platform based on Hadoop. MapReduce is a programming model that has been widely embraced by both academia and industry because of its high scalability and ease of use. We use the data set from the movie recommendation contest “CAMRa2011” for our experiments and evaluate the results in terms of speedup, sizeup, and scaleup. The experimental results demonstrate that the proposed algorithm is powerful and efficient and that it can process large data sets on commodity hardware.
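The abstract does not define the three measures, so the sketch below assumes the definitions commonly used in the parallel clustering literature. Speedup compares one node with m nodes on the same data, sizeup compares an m-times larger data set with the base data set on the same nodes, and scaleup grows nodes and data together. The timings are placeholders invented for illustration and are not results from the thesis.

```python
def speedup(time_one_node, time_m_nodes):
    """Fixed data size: 1-node time divided by m-node time (ideal value: m)."""
    return time_one_node / time_m_nodes

def sizeup(time_base_data, time_scaled_data):
    """Fixed node count: time on m-times larger data divided by time on base data."""
    return time_scaled_data / time_base_data

def scaleup(time_one_node_base_data, time_m_nodes_scaled_data):
    """Nodes and data both grow by factor m (ideal value: 1)."""
    return time_one_node_base_data / time_m_nodes_scaled_data

# Placeholder timings in seconds, invented for illustration only
t_1node_1x = 400.0   # 1 node, base data set
t_4node_1x = 125.0   # 4 nodes, base data set
t_4node_4x = 450.0   # 4 nodes, data set 4 times larger

print("speedup on 4 nodes:", speedup(t_1node_1x, t_4node_1x))
print("sizeup on 4 nodes:", sizeup(t_4node_1x, t_4node_4x))
print("scaleup at factor 4:", scaleup(t_1node_1x, t_4node_4x))
```

A speedup close to the number of nodes and a scaleup close to 1 would indicate that a parallel algorithm makes good use of added machines.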

Chinese Abstract
1. Introduction
2. Related Work
2.1. MapReduce & Hadoop
2.2. Co-Clustering
3. Background: Information-Theoretic Co-Clustering (ITCC)
3.1. Notation
3.2. ITCC Framework and Algorithm
4. Parallel Co-Clustering Algorithm (PITCC Algorithm)
4.1. Parallel ITCC Framework
4.2. Parallel ITCC Algorithm
5. Experiments
5.1. Data Set and Settings
5.2. Evaluation Methods
5.3. Results
5.3.1. Speedup
5.3.2. Sizeup
5.3.3. Scaleup
6. Conclusion
7. References

