Author: 李正鴻 (Zheng-Hong Li)
Title: 以深度學習方法優化Apache Spark叢集任務排程之研究
Title (English): The Study of Optimizing Apache Spark Cluster Task Schedule with Deep Learning Technique
Advisors: 陳世穎 (Shih-Ying Chen), 陳弘明 (Hung-Ming Chen)
Degree: Master's
Institution: National Taichung University of Science and Technology (國立臺中科技大學)
Department: Master Program, Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Document type: Academic thesis
Year of publication: 2019
Graduation academic year: 107 (2018-2019)
Language: Chinese
Pages: 73
Keywords: Apache Spark, Kubernetes, Docker, Task Scheduling and Dispatching Strategy, Heterogeneous Clusters, Deep Learning
Record statistics:
  • Cited by: 0
  • Views: 121
  • Downloads: 0
  • Bookmarked: 0
With the spread of big data in recent years, more and more processing frameworks have emerged, of which Apache Spark is the most widely known. Spark was originally developed by AMPLab at the University of California, Berkeley, and the project was donated to the Apache Software Foundation in 2013. It is an in-memory distributed data computing framework. As hardware is upgraded over time, however, machine clusters become heterogeneous, so deciding which node each task should be dispatched to becomes a problem in itself.
To address this heterogeneity, Bing-Yu Wu (吳秉諭) proposed the CMM (Cluster Min-Max) architecture, which groups nodes of similar computing capability and uses a single scheduler to dispatch tasks to the groups, improving Spark's computational performance. CMM also collects data such as Spark run parameters, execution times, and input data sizes, applies regression analysis to predict each group's execution time, and schedules tasks according to those predictions, accelerating overall computation and achieving efficient task scheduling. To realize such customized clusters, however, the CMM work had to spend considerable time installing software, managing cluster state, maintaining machine services, and configuring new nodes, which makes dynamic scaling and dynamic configuration difficult.
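To make this flow concrete, here is a minimal sketch, using the scikit-learn library the thesis builds on, of grouping nodes by measured capability with K-means and fitting one regression time model per group. It is an illustration of the CMM idea, not the thesis's actual code: all benchmark scores, feature columns (input size, executor cores, executor memory), and runtimes are hypothetical placeholders.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    # Hypothetical benchmark scores per node: [CPU score, memory score].
    node_perf = np.array([[2.1, 8.0], [2.0, 7.5], [3.6, 16.0], [3.5, 15.0]])

    # Step 1: partition the heterogeneous cluster into groups of
    # similar computing capability.
    groups = KMeans(n_clusters=2, n_init=10).fit_predict(node_perf)
    print("node-to-group assignment:", groups)

    # Step 2: fit one time model per group from that group's historical
    # runs. Hypothetical features: [input size MB, cores, memory GB].
    history = {
        0: (np.array([[512, 2, 4], [1024, 2, 4], [2048, 4, 8]]),
            np.array([70.0, 122.0, 115.0])),   # runtimes (s), slower group
        1: (np.array([[512, 2, 4], [1024, 2, 4], [2048, 4, 8]]),
            np.array([35.0, 61.0, 58.0])),     # runtimes (s), faster group
    }
    models = {g: LinearRegression().fit(X, y) for g, (X, y) in history.items()}

    # Step 3: dispatch a new job to the group with the smallest
    # predicted runtime.
    job = np.array([[1500, 4, 8]])
    best = min(models, key=lambda g: models[g].predict(job)[0])
    print(f"dispatch to group {best}; predicted {models[best].predict(job)[0]:.1f}s")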
To address these problems, this thesis proposes the KCMM architecture, which uses Kubernetes and Docker to deploy the environment, covering environment installation, dynamic scaling, node failure handling, and dynamic resource provisioning. In addition, new combinations of Spark run parameters are derived for the CMM model, and a deep learning model is used to optimize it, improving the accuracy of overall execution-time prediction and speeding up overall task scheduling. The result is a cluster environment that is easy to deploy and offers better computational performance.
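As an illustration of the deep learning replacement for the regression time model, the sketch below trains a small Keras multilayer perceptron (the thesis covers Keras and MLP design in sections 2.7 and 3.5) to map Spark run parameters to a predicted execution time. The layer sizes, feature set, and data are assumptions for demonstration, not the thesis's exact network.

    import numpy as np
    from tensorflow import keras

    # Hypothetical training data: [input size MB, executor cores,
    # executor memory GB, shuffle partitions] -> observed runtime (s).
    X = np.array([[512, 2, 4, 64], [1024, 2, 4, 64],
                  [2048, 4, 8, 128], [4096, 4, 8, 128]], dtype="float32")
    y = np.array([35.0, 61.0, 58.0, 97.0], dtype="float32")

    # Scale the heterogeneous features before feeding them to the network.
    norm = keras.layers.Normalization()
    norm.adapt(X)

    # A small multilayer perceptron with a single linear output
    # for runtime regression.
    model = keras.Sequential([
        norm,
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=200, verbose=0)

    # Predicted runtime for a new, unseen parameter combination.
    print(model.predict(np.array([[1500, 4, 8, 128]], dtype="float32")))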
Table of Contents
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iv
List of Figures vi
List of Tables viii
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 1
1.3 Thesis Organization 2
Chapter 2 Related Work 4
2.1 The Hadoop Distributed Computing Framework 4
2.2 Apache Spark 4
2.2.1 Resilient Distributed Datasets (RDD) 5
2.2.2 RDD Abstraction 5
2.2.3 The RDD Programming Interface 6
2.2.4 Advantages of RDDs 6
2.2.5 Representing RDDs 7
2.2.6 The Job Scheduler 9
2.2.7 Spark DataFrame 10
2.2.8 Spark Core Architecture 10
2.2.9 Spark on Mesos 12
2.3 Docker 13
2.4 Kubernetes 15
2.4.1 etcd 16
2.4.2 api-server 16
2.4.3 kube-scheduler 16
2.4.4 kube-controller-manager 17
2.4.5 kube-proxy 17
2.4.6 kubelet 17
2.5 Scikit-learn 18
2.6 Deep Learning 18
2.7 Keras 20
2.8 CMM (Cluster Min-Max) 21
Chapter 3 Methodology 23
3.1 Kubernetes Cluster Grouping 25
3.2 Dockerfile Build Process 26
3.3 Kubernetes Setup Process 28
3.4 The CMM Time Model 31
3.5 Multilayer Perceptron Model Design 32
Chapter 4 Experimental Results 34
4.1 Experimental Environment 34
4.2 Experimental Design 36
4.2.1 Experimental Design for the Kaggle Dataset 36
4.2.2 Task Execution Parameter Settings 36
4.2.3 Experimental Design of the Execution-Time Model Trained on Historical Data 37
4.2.4 Experimental Design of Time-Model Scheduling 38
4.3 Experimental Results and Analysis 38
4.3.1 K-means Cluster Grouping 38
4.3.2 Training the CMM Execution-Time Prediction Model 42
4.3.3 Scheduling Tests Using CMM Parameter Time Predictions 47
4.3.4 Training the KCMM Execution-Time Prediction Model 54
4.3.5 Scheduling Tests Using KCMM Parameter Time Predictions 59
4.3.6 Comparison of CMM and KCMM Parameter Results 65
Chapter 5 Conclusion 68
Chapter 6 References 69