National Digital Library of Theses and Dissertations in Taiwan

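Graduate student: 林振鍇 (Lin, Chenkai)
Thesis title: 利用檔案群組增進雲端系統之小檔案存取效能 (Improving the Efficiency of Accessing Small Files in Cloud Systems through File Grouping)
Advisor: 葉佐任 (Yeh, Tsozen)
Oral defense committee: 洪茂盛 (Horng, Mawsheng); 黃文吉 (Hwang, Wen-Jyi)
Oral defense date: 2012-07-26
Degree: Master's
Institution: 輔仁大學 (Fu Jen Catholic University)
Department: 資訊工程學系 (Department of Computer Science and Information Engineering)
Discipline: Engineering
Field: Electrical and Information Engineering
Thesis type: Academic thesis
Year of publication: 2012
Graduation academic year: 100 (ROC calendar)
Language: Chinese
Number of pages: 96
Keywords: cloud computing; Hadoop; HDFS; small files; HAR file; Last.fm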
Many web applications and storage systems are now built on cloud computing. Hadoop is a software framework that supports high-speed distributed computing and stable storage, and its Hadoop Distributed File System (HDFS) is designed to process large amounts of data across large clusters of commodity hardware. However, HDFS performs poorly when managing a large number of small files, because those files place a heavy load on the single Namenode. Handling small files efficiently is therefore a key issue in improving the performance of Hadoop. In this thesis, we propose methods to reduce the impact of small files on HDFS. Our approach includes two improvements. First, we merge related small files into large files, which reduces the space used by metadata and, in turn, the memory usage of the Namenode. Second, we reduce the access time of small files through three mechanisms: grouping, sorting, and an improved two-level index search in HAR files. The main idea is to store small files that belong to the same group in the same block whenever possible. In addition, we speed up the search for small files by dynamically adjusting the lengths of the two-level index in HAR files. In our experiments, we use the music charts from Last.fm, a well-known music sharing platform, to show that our approach outperforms the original HAR approach. The experimental results show that our approach reduces the memory usage of the Namenode by up to 5.54% and improves the efficiency of accessing small files on HDFS by up to 9.39%.
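The grouping idea in the abstract (keeping small files of the same group in the same block whenever possible) can be illustrated with a short sketch. This is not the thesis's implementation: it assumes each small file already carries a group label and a size, and it greedily packs files group by group into fixed-size blocks. The names `SmallFile` and `pack_by_group`, the group labels, and the 64 MB block size are all assumptions made for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size in bytes (hypothetical setting)


@dataclass(frozen=True)
class SmallFile:
    name: str
    group: str  # hypothetical group label, e.g. derived from access patterns
    size: int   # size in bytes


def pack_by_group(files: List[SmallFile],
                  block_size: int = BLOCK_SIZE) -> List[List[SmallFile]]:
    """Greedily pack files into blocks, keeping each group contiguous."""
    by_group: Dict[str, List[SmallFile]] = defaultdict(list)
    for f in files:
        by_group[f.group].append(f)

    blocks: List[List[SmallFile]] = []
    current: List[SmallFile] = []
    used = 0
    for group in sorted(by_group):
        members = by_group[group]
        group_size = sum(f.size for f in members)
        # If the whole group fits in one block but not in the space left,
        # close the current block so the group is not split.
        if current and group_size <= block_size and used + group_size > block_size:
            blocks.append(current)
            current, used = [], 0
        for f in members:
            # A group larger than one block spills over into the next block.
            if current and used + f.size > block_size:
                blocks.append(current)
                current, used = [], 0
            current.append(f)
            used += f.size
    if current:
        blocks.append(current)
    return blocks


# Usage: interleaved arrivals from two groups, with a toy block size of 100 bytes.
files = [
    SmallFile("a1", "A", 40),
    SmallFile("b1", "B", 50),
    SmallFile("a2", "A", 30),
    SmallFile("b2", "B", 20),
]
blocks = pack_by_group(files, block_size=100)
print([[f.name for f in b] for b in blocks])  # [['a1', 'a2'], ['b1', 'b2']]
```

Packing the same files in arrival order would place a1 and b1 in the first block and a2 and b2 in the second, so reading either group would touch two blocks. Packing by group lets each group be read from a single block in this toy case.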
1 Introduction
1.1 Motivation
1.2 Research Objectives
1.3 Contributions
1.4 Thesis Organization
2 Background
2.1 Hadoop Environment Architecture
2.2 Hadoop Software Platform
2.2.1 MapReduce Data Processing Framework
2.2.1.1 Computation Principles
2.2.1.2 Workflow
2.2.2 Hadoop Distributed File System Architecture
2.2.2.1 Parallel Data Processing
2.2.2.2 Data Replication and Fault Tolerance
2.2.2.3 Data Access Flow
2.2.3 Hadoop File Merging Tools
2.2.3.1 Purpose and Principles
2.2.3.2 Workflow
2.3 Overview of Hard Disk Performance
3 Related Work
3.1 Overview of Priority
3.1.1 Priority Algorithms for Memory Replacement
3.1.2 Priority Mechanisms for Disk Access
3.2 The Small-File Problem
3.3 Strategies for Mitigating the Small-File Problem
3.3.1 File Merging
3.3.2 File Correlation Considerations
3.3.3 Optimizing File Management
3.4 Performance Comparison
3.4.1 Improvements in Memory Usage and File Access Performance
3.4.2 Metadata Reduction Strategies
3.4.3 Comparison and Analysis of Related Work
4 Our Approach and Design
4.1 Design Architecture
4.2 File Merging Strategy
4.2.1 Choice of Merging Tool
4.2.2 Re-merging Mechanism
4.3 Improved Two-Level Index Search Mechanism
4.3.1 Dynamic Index Adjustment Strategy
4.3.2 Derivation of the Mathematical Model
4.4 Grouping Mechanism
4.4.1 Grouping Strategy
4.4.1.1 Fixed Time Interval
4.4.1.2 Relative Time Interval
4.4.1.3 URL Parsing
4.4.1.4 Formula Analysis
4.4.2 Applying the Grouping Mechanism
4.4.2.1 Concept Integration
4.4.2.2 Practical Application
4.5 Sorting Mechanism
4.5.1 Sorting Concept
4.5.2 File Size Considerations
4.6 Integration of the Combined Strategies
4.7 Discussion of Costs
5 Experiments and Performance Analysis
5.1 Experimental Design
5.1.1 Purpose
5.1.2 Environment
5.1.3 Method
5.1.4 Hardware Specifications
5.2 Experimental Data
5.2.1 Data Source
5.2.2 Data Simulation Method
5.2.3 User Behavior Simulation
5.3 Experimental Results
5.3.1 Metadata Measurement
5.3.2 Memory Usage Analysis
5.3.2.1 Experiment 1
5.3.2.2 Experiment 2
5.3.2.3 Experiment 3
5.3.2.4 Discussion of the Memory Usage Experiments
5.3.3 Access Efficiency Analysis
5.3.3.1 Experiment 1
5.3.3.2 Experiment 2
5.3.3.3 Experiment 3
5.3.3.4 Discussion of the Access Efficiency Experiments
5.3.4 Summary of Experimental Results
6 Conclusions and Future Work
References