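Author: 蘇郁翔
Author (English): Yu-Xiang Su
Thesis title: 移植Tensorflow至CASLAB-GPUSIM模擬平台與矩陣函式庫優化
Thesis title (English): Porting Tensorflow to CASLAB-GPUSIM and Optimization of Matrix Multiplication Library
Advisor: 陳中和
Advisor (English): Chung-Ho Chen
Degree: Master's
Institution: National Cheng Kung University (國立成功大學)
Department: Institute of Computer and Communication Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2018
Academic year of graduation: 107
Language: Chinese
Number of pages: 75
Keywords (Chinese): 終端裝置; 通用繪圖處理器; 矩陣乘法; 機器學習
Keywords (English): Edge Device; GPGPU; Matrix Multiplication; Machine Learning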
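With the rapid growth of cloud computing, machine learning applications are gradually extending to edge devices. To support edge hardware during its development stage and to enable performance analysis of edge applications, this thesis integrates the machine learning framework Tensorflow with the OpenCL Runtime developed in our laboratory and ports the Tensorflow Runtime to CASLAB-GPUSIM, the simulation platform also developed in our laboratory. The system is then verified with a series of test programs written in Tensorflow, thereby simulating machine learning application scenarios on edge devices.
Beyond building the edge machine learning simulation platform, this thesis argues that, in solutions that use a GPGPU as the edge accelerator, linear algebra libraries have not adapted to this application scenario and its computing resources. Matrix multiplication is affected most, because it is the basic operation underlying the convolution layers and fully connected layers of convolutional neural network models. This thesis therefore proposes optimizations to the matrix multiplication algorithm of the CLBlast library: reducing the library's preprocessing according to the computation patterns of edge machine learning applications, which lowers the overall execution time of the matrix multiplication library.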
With the rapid development of cloud computing, machine learning applications have gradually expanded to edge devices. To analyze the performance of edge applications in the early development stage of edge hardware, we integrate Tensorflow with our GPGPU simulator, CASLAB-GPUSIM.
In addition to building the edge device simulation platform, we propose a matrix multiplication library for machine learning applications on edge devices that use a GPGPU as the acceleration solution. According to our experimental results, we achieve an average speedup of 5.6 in the fully connected layers of our benchmarks, which include an MNIST model, LeNet-5, and MobileNet.
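The link between fully connected layers and matrix multiplication that the abstract relies on can be illustrated with a minimal sketch. It assumes a plain nested-list representation and a direct (naive) triple-loop product; the function names `matmul` and `fully_connected` and the sample values are hypothetical and chosen only for illustration. This is not the thesis's CLBlast-based kernel and contains none of its optimizations.

```python
def matmul(a, b):
    """Direct matrix product of a (m x k) and b (k x n), given as nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    # Each output element is the dot product of one row of a and one column of b.
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def fully_connected(x, weights, bias):
    """Forward pass y = x * W + b for a batch x of shape (batch, in_features)."""
    y = matmul(x, weights)
    # Add the per-output bias to every row of the batch.
    return [[value + b_j for value, b_j in zip(row, bias)] for row in y]


# Hypothetical example: one input sample, 3 input features, 2 output features.
x = [[1.0, 2.0, 3.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
bias = [0.5, -0.5]

print(fully_connected(x, weights, bias))  # prints [[4.5, 4.5]]
```

With a single input sample the product is a 1 x 3 row times a 3 x 2 matrix, so the whole layer is one small matrix multiplication plus a bias addition, which is why the cost of the matrix multiplication routine dominates this layer.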
Abstract (in Chinese)
Summary
Acknowledgements
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Organization
Chapter 2 Background
2.1 Tensorflow Runtime
2.1.1 Tensorflow Kernel Operation
2.1.2 Tensorflow Stream Executor
2.1.3 Tf-coriander
2.2 OpenCL Runtime
2.2.1 OpenCL Programming Model
2.2.2 HSA Runtime
2.3 GPGPU Hardware
2.3.1 GPGPU Architecture
2.3.2 GPGPU Memory Model
Chapter 3 Related Work on Matrix Multiplication and Machine Learning
3.1 Convolution Neural Network
3.1.1 Convolution Layer
3.1.2 Pooling Layer
3.1.3 Fully Connected Layer
3.1.4 Activation Function
3.2 Matrix Multiplication in CNN
3.2.1 Implementation of Convolution Layer
3.2.2 Implementation of Fully Connected Layer
Chapter 4 Matrix Multiplication Optimization on GPGPU
4.1 Matrix Multiplication on GPGPU
4.2 Matrix Multiplication Optimization
4.2.1 Direct Implementation
4.2.2 Matrix Transposition
4.2.3 Shared Memory
4.2.4 Auto-Tuning Technique
4.3 Matrix Multiplication on Edge Device
4.3.1 Edge Computation
4.3.2 CASLAB Implementation
Chapter 5 Tensorflow Porting and Matrix Multiplication Library Implementation
5.1 Platform Introduction
5.2 Running Tensorflow on CASLAB-GPUSIM
5.2.1 OpenCL Runtime Implementation
5.2.2 Finalizer Implementation
5.3 Implementation of Matrix Multiplication
5.3.1 Kernel Operation Implementation
5.3.2 CLBlast Library
Chapter 6 Experiments on Matrix Multiplication for Edge Machine Learning Applications
6.1 Experiment Environment and Benchmarks
6.2 Verification of Tensorflow porting
6.3 Performance of CASLAB MM implementation
6.3.1 Performance Summary
6.3.2 MNIST Benchmarks
6.3.3 MobileNet Fully Connected Layer
6.4 Experiment Limitation and Recommendation
Chapter 7 Conclusion
References