National Digital Library of Theses and Dissertations in Taiwan

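Graduate student: 卓也琦 (Ye-Qi Zhuo)
Thesis title: Optimization of Hadoop System Configuration Parameters (Hadoop系統參數優化)
Advisor: 廖世偉
Oral defense committee: 蘇中才, 杜憶萍, 黃維中
Oral defense date: 2015-07-02
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Publication year: 2015
Graduation academic year: 103
Language: English
Number of pages: 33
Keywords (Chinese): 系統優化 (system optimization)
Keywords (English): tuning, optimization, predictor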
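In the current era of big data, the Hadoop system plays a crucial role in analyzing and applying large data sets. We want to tune Hadoop's system parameters to their best state without spending more on hardware upgrades. My master's thesis therefore addresses the optimization of Hadoop system parameters, where the performance goal is to reduce the execution time of a single job. I use a three-stage model:
(1) Identify, among the many parameters, those with the greatest effect on the system. The map and reduce phases are observed separately, and 20 parameters are selected as the main parameters to tune.

(2) Build a prediction model for execution time. Based on these 20 parameters, more job execution times and their corresponding parameter settings are collected as the basis for the model. Machine learning methods are used for modeling, and the most suitable three-layer model is selected.

(3) Build an optimization model. Each optimization round randomly samples parameter values within the configured ranges and feeds them into the previously built prediction model to predict the execution time. The optimization model I designed eventually finds the parameter combination with the shortest execution time. I selected four programs in total and used them to validate the combined approach.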
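Stage (3) can be illustrated with a minimal sketch of a random search that repeatedly queries a predictor and keeps the configuration with the lowest predicted time. This is not the thesis's implementation: the search ranges, the function names (`toy_predictor`, `sample_config`, `random_search`), and the predictor itself are assumptions made for illustration, and only three parameters are shown instead of the 20 the thesis selects.

```python
import random

# The keys are real Hadoop MapReduce settings, but the choice of these three
# and their bounds are illustrative assumptions, not values from the thesis.
PARAM_RANGES = {
    "mapreduce.task.io.sort.mb": (50, 400),
    "mapreduce.task.io.sort.factor": (10, 100),
    "mapreduce.job.reduces": (1, 32),
}


def toy_predictor(config):
    """Hypothetical stand-in for a trained prediction model.

    Returns a made-up execution time in seconds that is smallest near
    an arbitrary "sweet spot" for each parameter.
    """
    return (
        100.0
        + 0.5 * abs(config["mapreduce.task.io.sort.mb"] - 200)
        + 0.3 * abs(config["mapreduce.task.io.sort.factor"] - 50)
        + 2.0 * abs(config["mapreduce.job.reduces"] - 16)
    )


def sample_config(rng):
    """Draw one configuration uniformly at random within the ranges."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


def random_search(predict, n_trials, seed=0):
    """Return the sampled configuration with the lowest predicted time."""
    rng = random.Random(seed)
    best_config, best_time = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        predicted = predict(config)  # query the predictor, not the cluster
        if predicted < best_time:
            best_config, best_time = config, predicted
    return best_config, best_time


if __name__ == "__main__":
    config, seconds = random_search(toy_predictor, n_trials=500, seed=1)
    print(config, seconds)
```

Because each trial only calls the predictor, many candidate configurations can be screened without running a job on the cluster. The quality of the result then depends on how accurate the predictor is.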

Hadoop has become very popular in recent years. It is a software framework for the distributed processing of large-scale data sets on a cluster of machines using the MapReduce programming model. However, Hadoop users still face two essential challenges in managing the system: (1) tuning the parameters appropriately, and (2) dealing with the dozens of configuration parameters that affect its performance. This paper focuses on optimizing Hadoop MapReduce job performance. Our approach has two key models: Prediction and Optimization. The Prediction model estimates the execution time of a MapReduce job, and the Optimization model searches for approximately optimal configuration parameters by invoking the Prediction model repeatedly. In this way, an analytical method is used to choose approximately optimal configuration parameters and improve users' job performance. Besides configuration parameter tuning, the relevance of each parameter and the evaluation of our methods are also discussed in this paper. Our work may give users a better way to improve Hadoop system performance and save hardware resources.
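To make the role of the Prediction model concrete, here is a minimal sketch of a predictor that estimates execution time from previously measured runs. The abstract does not say which learning method the thesis finally uses, so a plain k-nearest-neighbors average is swapped in as a simple stand-in. The function name `knn_predict`, the sample data, and the feature layout are all hypothetical.

```python
def knn_predict(samples, query, k=3):
    """Estimate execution time as the mean time of the k closest past runs.

    samples: list of (feature_vector, execution_time_seconds) pairs, where
             each feature vector holds numeric configuration values.
    query:   feature vector of the configuration to estimate.
    """
    if not samples:
        raise ValueError("need at least one measured run")

    def squared_distance(features):
        return sum((a - b) ** 2 for a, b in zip(features, query))

    # Sort past runs by distance to the query and average the k nearest.
    nearest = sorted(samples, key=lambda s: squared_distance(s[0]))[:k]
    return sum(time for _, time in nearest) / len(nearest)


if __name__ == "__main__":
    # Made-up measurements: (sort buffer MB, number of reducers) -> seconds.
    runs = [
        ((100.0, 4.0), 310.0),
        ((200.0, 8.0), 240.0),
        ((300.0, 16.0), 205.0),
        ((400.0, 32.0), 230.0),
    ]
    print(knn_predict(runs, (250.0, 12.0), k=2))
```

In practice the configuration values have very different scales, so they would need to be normalized before distances are compared. A tree-based or boosted regression model could take the place of the nearest-neighbor average.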

Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 Hadoop MapReduce System
3.1 Architecture of Hadoop MapReduce System
3.2 Execution Flow of a MapReduce Job
3.3 Classification of Configuration Parameters
Chapter 4 Design of Experiments
4.1 Configuration Parameters for Modeling
4.2 The Architecture of Predictor and Optimizer
4.3 Design of Predictor
4.4 Design of Optimizer
Chapter 5 Evaluation
5.1 Performance of Predictor
5.2 Performance of Optimizer
5.3 Comparison to Default Configuration
Chapter 6 Conclusion and Future Work
Bibliography


