National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


Detailed Record

Author: Mulat Ayinet Tiruye
Title: Dual Mode Systolic Array-based Processing Element for CNN Accelerator (基於脈動陣列的雙模式捲積神經網路加速器)
Advisors: I. C. Wey (魏一勤), H. C. Chow (周煌程), Teo Tee Hui (T. H. Teo)
Committee members: B. J. Sheu (許炳堅), Teo Tee Hui, I. C. Wey, H. C. Chow
Oral defense date: 2023-03-01
Degree: Master's
Institution: Chang Gung University (長庚大學)
Department: Master Degree Program in Nanoengineering and Design
Discipline: Engineering / Materials Engineering
Document type: Academic thesis
Year of publication: 2023
Academic year of graduation: 111
Language: English
Number of pages: 113
Keywords (English): CNN, Processing Element, ANN
Citations: 0 · Views: 59 · Downloads: 5 · Bookmarks: 0
To improve the performance of systolic-array-based CNN accelerators, a dual-mode systolic-array-based processing element has been designed and implemented. Convolutional neural network accelerators typically use a systolic array as part of the computational unit; the systolic array architecture is widely used in spatial hardware and is an excellent choice for a variety of accelerators. The main focus of this thesis is a dual-mode systolic-array-based processing element architecture for a CNN accelerator. Systolic arrays exploit the weight-sharing property of CNNs, reusing data during convolution to reduce resource utilization and memory usage; this data reuse is achieved through two different types of dataflow. To improve the power consumption of CNN accelerators, a dual-mode systolic-array-based processing element is utilized. Owing to its reconfigurability, programmability, and ease of redeployment, an FPGA was chosen for the hardware design. The design was implemented on the ZedBoard FPGA (Zynq-7000 series, XC7Z020 chipset). Running at 100 MHz on the ZedBoard Zynq-7000, the dual-mode systolic-array-based PE consumed a total on-chip power of 0.134 W, achieved a throughput of 1.961 GOPs/s, and reached a power efficiency of 14.29 GOPs/W. The dual-mode systolic-array-based processing element thus achieves better performance than other systolic-array-based processing elements.
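The dual-mode operation described above — a single PE that switches between two systolic dataflows via a select line — can be sketched behaviourally. The plain-Python model below is an illustrative assumption, not the thesis's Verilog RTL: the class name, the `sel` encoding (1 for weight-stationary, 0 for output-stationary), and the single-cycle `step` method are all hypothetical, and the cycle-accurate operand skewing of a real systolic array is omitted.

```python
class DualModePE:
    """Behavioural sketch of one dual-mode processing element (PE)."""

    def __init__(self):
        self.weight = 0  # weight register, preloaded for weight-stationary mode
        self.acc = 0     # local accumulator, used in output-stationary mode

    def load_weight(self, w):
        self.weight = w

    def step(self, sel, act_in, w_in, psum_in):
        """One clock cycle of the PE; returns the partial-sum output."""
        if sel == 1:
            # Weight-stationary: the preloaded weight stays in the PE while
            # activations and partial sums stream through.
            return psum_in + act_in * self.weight
        # Output-stationary: the partial sum stays in the PE while
        # activations and weights stream through.
        self.acc += act_in * w_in
        return self.acc


# Weight-stationary example: a 3-tap dot product flowing through a chain
# of three PEs (timing skew ignored for clarity).
pes = [DualModePE() for _ in range(3)]
for pe, w in zip(pes, [1, 2, 3]):
    pe.load_weight(w)
psum = 0
for pe, a in zip(pes, [4, 5, 6]):
    psum = pe.step(sel=1, act_in=a, w_in=0, psum_in=psum)
print(psum)  # 1*4 + 2*5 + 3*6 = 32
```

Holding one operand stationary is what enables the data reuse the abstract describes: in weight-stationary mode each weight is fetched once and reused across the activation stream, while in output-stationary mode each partial sum stays local until the accumulation completes.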
Table of Contents

Abstract i
Chapter I
1 Introduction 1
1.1 Motivations 2
1.2 Thesis organization 2
1.3 The Multi-Layer Perceptron (MLP) 3
1.3.1 Multi-Layer Perceptron Forward Propagation 4
1.3.2 Activation Functions 5
1.3.3 Cost Functions 8
1.3.4 Multi-Layer Perceptron Backpropagation 9
1.4 The Convolutional Neural Network (CNN) 10
1.4.1 Convolution Operations 11
1.4.2 Multi-Channel Convolution 15
1.4.3 Pooling Layer 17
1.5 Systolic Dataflow 18
1.5.1 Input Stationary Dataflow 20
1.5.2 Weight Stationary Dataflow 20
1.5.3 Output Stationary Dataflow 21
1.5.4 Row Stationary Dataflow 22
1.6 Binary and Decimal Number Representations 22
1.6.1 Floating Point 23
1.6.2 Fixed Point 24
1.6.3 Signed Numbers 24
1.7 Chapter Summary 25

Chapter II

2 Literature Review 27
2.1 Common Convolutional Neural Network 27
2.1.1 The LeNet-5 Network 27
2.1.2 The VGG16 Network 29
2.2 Benchmark Datasets 29
2.2.1 MNIST Dataset 30
2.2.2 Fashion-MNIST Dataset 31
2.3 Convolutional Neural Networks Accelerator 32
2.3.1 Tensor Processing Unit (TPU) 32
2.3.2 Eyeriss 34
2.3.3 FPGA-based Accelerator for Convolution Operations 35
2.3.4 High-Performance CNN Accelerator 35
2.3.5 Impact of the Array Shape and Memory Bandwidth on the Execution Time of CNN Systolic Arrays 37
2.3.6 Energy-Efficient Design of PE for CNN 38
2.3.7 Automated Systolic Array for CNN on FPGA 39
2.3.8 Dynamically Reconfigurable Accelerator Design for CNN 41
2.3.9 Systolic Array Based Accelerator for Deep Learning 42
2.4 Chapter Summary 44

Chapter III

3 Implementation 45
3.1 Training and Testing of CNN on Pytorch 45
3.1.1 Training and Testing of LeNet-5 Network 46
3.1.2 Training and Testing of VGG16 Network 48
3.2 Design of Hardware System Architecture 49
3.3 Design of Hardware Dual Mode Processing Element 50
3.3.1 Design of Hardware A 3x3 Dual Mode Processing Element 54
3.4 Design of Hardware Activation Functions 60
3.4.1 Design of Hardware ReLU Function 62
3.5.1 Design of Hardware Max Pooling 62
3.6 Design of Fully Connected Layer 64
3.6.1 Design of Perceptron 64
3.6.2 Design of A 120 Neuron Layers on Fully Connected 67
3.6.3 Design of A 84 Neuron Layers on Fully Connected 70
3.7 The Output layer 72
3.8 Design of Hardware of Control Signal 74
3.9 Design of Hardware Image and Weight Memory 75
3.9.1 Design of Input Buffer 75
3.9.2 Design of BRAM 77
3.10 Chapter Summary 79

Chapter IV

4 Results and Performance Analysis 80
4.1 Utilization of FPGA 80
4.2 Resource Utilization of Systolic Array based Processing Element 82
4.2.1 Resource Utilization of Dual Mode Systolic Array based Processing Element 83
4.2.2 Resource Utilization of Activation and Pooling Layer 84
4.3 The Timing Performance 84
4.4 Performance Analysis of Systolic Array based PE 85
4.5 Performance Analysis of Dual Mode Systolic Array based Processing Element 86
4.6 The Proposed Performance Comparison 87
4.7 Chapter Summary 89

Chapter V
5 Conclusions and Future Work 90
5.1 Future Work 91
Bibliography 92


List of Figures

1.1 Representation of multi-layer perceptron 3
1.2 Single layer MLP with 9 different weights 4
1.3 Two Layer Multi-layer Perceptron (MLP) 5
1.4 The graphical plot of ReLU function 6
1.5 The graphical plot of the sigmoid logistic function 7
1.6 The graphical plot of the sigmoid logistic function 8
1.7 Convolution operation of input and kernel with output 12
1.8 3×3 image with a zero padding of 1 12
1.9 A Convolution subsets with a 3 × 3 kernel size 14
1.10 Convolution output 14
1.11 Multi-channel Convolution with in_channels=3, out_channels=4, and bias=True 15
1.12 Average pooling with filters 2 × 2 and stride s=2 18
1.13 Max pooling with filters 2 × 2 and stride 2 18
1.14 Input stationary based dataflow as illustrated in [1] 20
1.15 Weight stationary based dataflow as illustrated in [1] 21
1.16 Output stationary based dataflow as illustrated in [1] 22
1.17 Row stationary based dataflow as illustrated in [2] 23
2.1 LeNet-5 architecture as illustrated in [3] 28
2.2 VGG16 architecture as illustrated in [4] 29
2.3 MNIST dataset examples as illustrated in [3] 30
2.4 Fashion-MNIST dataset examples as illustrated in [5] 31
2.5 TPU block diagram, as illustrated in [6] 33
2.6 Matrix multiply unit dataflow, as illustrated in [6] 33
2.7 Eyeriss system architecture [2] 34
2.8 Eyeriss dataflow, as illustrated in [2] 35
2.9 System design [7] 36
2.10 PE structure design [7] 36
2.11 Convolutional Unit design as illustrated in [8] 37
2.12 The block diagram of PE as illustrated in [9] 38
2.13 The processing element design as illustrated in [9] 39
2.14 Block diagram of systolic array for CNN as illustrated in [10] 40
2.15 Block diagram of processing element and input buffer as illustrated in [10] 40
2.16 Overview structure of accelerator as illustrated in [11] 41
2.17 The computing module architecture as illustrated in [11] 42
2.18 Overall design architecture of the accelerator as illustrated in [12] 43
2.19 Systolic array based PE as illustrated in [12] 43
3.1 Test Accuracy for LeNet-5 Model with Tanh on MNIST Data 47
3.2 Test accuracy for LeNet-5 with ReLU on MNIST data 48
3.3 Test accuracy for VGG16 on MNIST data 48
3.4 Hardware system block diagram 49
3.5 Block Diagram of Dual Processing Element (PE) Design 50
3.6 The register transfer level of dual mode systolic array based of single Processing Element (PE) 51
3.7 Dual mode systolic array based of single PE when sel=0 52
3.8 The behavioral simulation of dual mode systolic array based of single PE when sel=0 52
3.9 Dual mode systolic array based of single PE when sel=1 53
3.10 The behavioral simulation of dual mode systolic array based of single PE when sel=1 53
3.11 A 3 × 3 dual mode systolic array based processing element when select line is one 54
3.12 The proposed design of a 3 × 3 dual mode systolic array based processing element partial sum output 55
3.13 A block diagram of a 3 × 3 dual mode systolic array based PE when sel=1 56
3.14 The register transfer level schematics of a dual mode systolic array based processing element when select line is one 56
3.15 The register transfer level schematics of a 3×3 dual mode systolic array based processing element 57
3.16 The behavioral simulation waveform of a dual mode systolic array based processing element when select line is one (sel=1) 58
3.17 A block diagram of a 3 × 3 dual mode systolic array based processing element when sel=0 58
3.18 The register transfer level schematics of a dual mode systolic array based PE when sel=0 59
3.19 The behavioral simulation waveform of a 3×3 dual mode systolic array based processing element when select line is zero (sel=0) 59
3.20 A block diagram of ReLU activation functions 60
3.21 The RTL Schematics of ReLU activation function 61
3.22 Illustration of ReLU activation functions 61
3.23 The behavioral simulation waveform of the ReLU activation function modeling in Vivado 61
3.24 A block diagram of max pooling layer 62
3.25 An illustration of max pooling 63
3.26 The RTL Schematics of Max Pooling Layer 63
3.27 The behavioral simulation waveform of max pooling layer 63
3.28 A general block diagram of single layer perceptron 65
3.29 A block diagram of the hardware design of 10 single neurons 66
3.30 The RTL schematics of the single neuron 66
3.31 A block diagram of a 120 neuron networks 67
3.32 The RTL schematics of a 120 neuron 69
3.33 A block diagram of 84 neuron networks 70
3.34 The RTL-1 schematics of a 84 neuron 71
3.35 The RTL-2 schematics of a 84 neuron 71
3.36 The block diagram of the output layers 72
3.37 The RTL schematic of output layer 73
3.38 The block diagram of the control signal 74
3.39 The register transfer level schematics of the control signal 74
3.40 The RTL schematics of 8-bit input buffer 75
3.41 The RTL schematics of 32-bit input buffer 76
3.42 Single port BRAM configuration as illustrated in [13] 77
3.43 Single port BRAM design using IP integrator of Xilinx-Vivado 77
3.44 Dual port BRAM configuration as illustrated in [13] 78
3.45 The dual port BRAM design using IP integrator of Xilinx-Vivado 78
4.1 The FPGA ZedBoard [14] 81


List of Tables
1.1 Floating point representation of 32-bit decimal is presented in [15] 23
1.2 Fixed point representation of 10-bit decimal format 24
3.1 LeNet-5 results on MNIST data with Tanh activation function and different epochs 46
3.2 LeNet-5 results on MNIST data with ReLU activation function and different epochs 47
4.1 Resource Utilization of Systolic Array based Processing Element 82
4.2 Resource Utilization of Dual Mode systolic array based PE 83
4.3 Resource Usage of ReLU and Max Pooling layer 84
4.4 Performance Analysis of Systolic Array based Processing Element 85
4.5 Dual mode Systolic Array based Processing Element 87
4.6 Performance Analysis of Systolic Array based Processing Element 87
4.7 Performance Analysis of Systolic Array based PE 88
Bibliography

[1] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[2] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2016.
[3] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[4] Vigneashwara Pandiyan, Pushparaja Murugan, Tegoeh Tjahjowidodo, Wahyu Caesarendra, Omey Mohan Manyar, and David Jin Hong Then. In-process virtual verification of weld seam removal in robotic abrasive belt grinding process using deep learning. Robotics and Computer-Integrated Manufacturing, 57:477–487, 2019.
[5] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[6] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. arXiv preprint arXiv:1704.04760, 2017.
[7] Yunfei Cao, Xin Wei, Tingting Qiao, and He Chen. FPGA-based accelerator for convolution operations. In 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), pages 1–5. IEEE, 2019.
[8] Yufeng Li, Shengli Lu, Jihe Luo, Wei Pang, and Hao Liu. High-performance convolutional neural network accelerator based on systolic arrays and quantization. In 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), pages 335–339. IEEE, 2019.
[9] Yeongjae Choi, Dongmyung Bae, Jaehyeong Sim, Seungkyu Choi, Minhye Kim, and Lee-Sup Kim. Energy-efficient design of processing element for convolutional neural network. IEEE Transactions on Circuits and Systems II: Express Briefs, 64(11):1332–1336, 2017.
[10] Xuechao Wei, Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. Pages 1–6, June 2017.
[11] Yunping Zhao, Jianzhuang Lu, and Xiaowen Chen. A dynamically reconfigurable accelerator design using a sparse-Winograd decomposition algorithm for CNNs. Computers, Materials & Continua, 66:517–535, January 2020.
[12] Zhijie Yang, Lei Wang, Dong Ding, Xiangyu Zhang, Yu Deng, Shiming Li, and Qiang Dou. Systolic array based accelerator and algorithm mapping for deep learning algorithms. In Network and Parallel Computing: 15th IFIP WG 10.3 International Conference, NPC 2018, Muroran, Japan, November 29–December 1, 2018, Proceedings 15, pages 153–158. Springer, 2018.
[13] Taifur. Number plate recognition: implementing block RAM using Verilog. Available online: https://community.element14.com/challenges-projects/design-challenges/summer-of-fpga/b/blog/posts/number-plate-recognition-3-implementingblock-ram-using-verilog. Accessed: 2023-03-02.
[14] AMD. ZedBoard. Available online: https://www.xilinx.com/products/boards-and-kits/1-8dyf-11.html. Accessed: 2023-01-20.
[15] V. Rajaraman. IEEE standard for floating point numbers. Resonance, 21(1):11–30, 2016.
[16] Wenhan Xia, Hongxu Yin, Xiaoliang Dai, and Niraj K. Jha. Fully dynamic inference with deep neural networks. IEEE Transactions on Emerging Topics in Computing, 2021.
[17] Shi Hui Chua, T. Hui Teo, Mulat Ayinet Tiruye, and I-Chyn Wey. Systolic array based convolutional neural network inference on FPGA. In 2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pages 128–133, 2022.
[18] Sangyeob Kim, Juhyoung Lee, Sanghoon Kang, Jinsu Lee, and Hoi-Jun Yoo. A power-efficient CNN accelerator with similar feature skipping for face recognition in mobile devices. IEEE Transactions on Circuits and Systems I: Regular Papers, 67(4):1181–1193, 2020.
[19] Huan Lei, Naveed Akhtar, and Ajmal Mian. Octree guided CNN with spherical kernels for 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9631–9640, 2019.
[20] C. Bagavathi and O. Saraniya. Chapter 13 - Evolutionary mapping techniques for systolic computing system. In Arun Kumar Sangaiah, editor, Deep Learning and Parallel Computing Environment for Bioengineering Systems, pages 207–223. Academic Press, 2019.
[21] S. K. Basu. Chapter 9 - A cursory look at parallel architectures and biologically inspired computing. In Naresh K. Sinha and Madan M. Gupta, editors, Soft Computing and Intelligent Systems, Academic Press Series in Engineering, pages 185–216. Academic Press, San Diego, 2000.
[22] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Using dataflow to optimize energy efficiency of deep neural network accelerators. IEEE Micro, 37(3):12–21, 2017.
[23] IEEE standard for floating-point arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008), pages 1–84, 2019.
[24] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[25] Chaoyang Zhu, Kejie Huang, Shuyuan Yang, Ziqi Zhu, Hejia Zhang, and Haibin Shen. An efficient hardware accelerator for structured sparse convolutional neural networks on FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 28(9):1953–1965, 2020.
[26] Eduardo Yago, Pau Castelló, Salvador Petit, María E. Gómez, and Julio Sahuquillo. Impact of the array shape and memory bandwidth on the execution time of CNN systolic arrays. In 2020 23rd Euromicro Conference on Digital System Design (DSD), pages 510–517, 2020.
[27] Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. CreateSpace, 2009.
[28] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[29] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[30] Liancheng Jia, Liqiang Lu, Xuechao Wei, and Yun Liang. Generating systolic array accelerators with reusable blocks. IEEE Micro, 40(4):85–92, 2020.
[31] Mingyu Sheng, Hongfan Zeng, Jingnan Li, and Wenbo Sun. Pooling and convolution layer strategy on CNN for melanoma detection. In 2021 3rd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), pages 153–161. IEEE, 2021.
[32] Digilent. ZedBoard Zynq-7000 ARM/FPGA SoC development board. Available online: https://digilent.com/shop/zedboard-zynq-7000-armfpga-soc-development-board/. Accessed: 2023-01-19.
[33] AMD. Trademark information. Available online: https://www.amd.com/en/legal/trademarks.html. Accessed: 2023-01-17.