臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Detailed Record

Author: 莊宗翰
Author (English): Tzung-Han Juang
Title (Chinese): 高效能神經網路訓練加速器架構與其電路設計
Title (English): Energy-Efficient Accelerator Architecture for Neural Network Training and Its Circuit Design
Advisor: 闕志達
Oral examination committee: 蔡佩芸, 楊佳玲
Oral defense date: 2018-07-30
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Electronics Engineering
Discipline: Engineering
Academic field: Electrical and Computer Engineering
Document type: Academic thesis
Year of publication: 2018
Graduation academic year: 106
Language: English
Number of pages: 127
Keywords (Chinese): 卷積神經網路, 反向傳播算法, FloatSD
Usage statistics:
  • Cited by: 0
  • Views: 518
  • Downloads: 0
  • Bookmarked: 0
Artificial intelligence (AI) has become one of the hottest research topics in recent years. AI can be applied to fields such as image recognition, object detection, and natural language processing, and researchers have used neural networks to make breakthroughs in these fields. Neural networks are known for their diverse architectures, which can reach hundreds of layers in depth; such structures also require large amounts of computation and memory.
Advances in hardware acceleration on graphics processing units (GPUs) have made it possible to use neural networks in practical applications. However, GPUs tend to be bulky and consume considerable power. Many researchers have worked on reducing the computational cost of neural networks and on implementing them on dedicated hardware, but most of these efforts only accelerate neural network inference.
Beyond supporting inference, the architecture proposed in this thesis can also perform neural network training, which uses the backpropagation algorithm to find the optimal network model. The training process consists of three steps: the forward pass, the backward pass, and the weight update, whereas inference involves only the forward pass. This thesis is dedicated to designing a unified architecture that can handle all three steps of convolutional neural network (CNN) training.
In addition, data I/O bandwidth is a bottleneck in accelerator design. To reduce the data bandwidth, this thesis builds on the floating-point signed digit (FloatSD) algorithm and the quantization techniques proposed in previous work to shrink the network size and the data bit width. That previous work achieved a top-5 accuracy on the ImageNet dataset only 0.8% lower than the floating-point version.
The main focus of this thesis is the design of an accelerator capable of training neural networks, covering the processing dataflow, the AMBA interface, and the memory configuration. The design is an IP-level accelerator that can be attached to an SoC platform. In addition, this thesis works on optimizing data reuse so that the system accesses DRAM efficiently.
Keywords (Chinese): 卷積神經網路, 反向傳播算法, FloatSD
Artificial intelligence (AI) has become one of the most popular research topics in recent years. AI can be applied to image classification, object detection, natural language processing, and other fields, and researchers have made breakthroughs in these fields with neural networks. Neural networks are known for their versatile architectures, which can be more than a hundred layers deep; such structures make neural networks demand large amounts of computation and memory.
Improvements in hardware acceleration on graphics processing units (GPUs) have made it possible to apply neural networks to practical applications. However, GPUs tend to be bulky and power-hungry. Much research has therefore focused on reducing the computation required by neural networks and on implementing them on dedicated hardware, yet most of these works only accelerate the inference phase.
Beyond inference, the architecture proposed in this thesis also supports the training phase, which uses the backpropagation algorithm to find optimal neural network models. The training phase consists of a forward pass, a backward pass, and a weight update, while inference involves only the forward pass. This thesis is devoted to designing a unified architecture that can process all three stages of training for convolutional neural networks (CNNs).
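To make the three training stages concrete, the following sketch runs one fully connected layer through a forward pass, a backward pass, and a weight update in NumPy. The layer shape, random data, loss function, and learning rate are assumptions chosen for illustration, not details taken from the thesis.

    import numpy as np

    # Minimal sketch of the three training stages on one fully connected layer.
    # Shapes, data, loss, and learning rate are illustrative assumptions only.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16))        # batch of 8 inputs, 16 features each
    t = rng.standard_normal((8, 4))         # regression targets
    W = rng.standard_normal((16, 4)) * 0.1  # weight matrix
    lr = 0.01                               # learning rate

    for step in range(100):
        # 1) Forward pass: compute predictions and the loss.
        y = x @ W
        loss = 0.5 * np.sum((y - t) ** 2) / x.shape[0]

        # 2) Backward pass: propagate the loss gradient back to the weights.
        dy = (y - t) / x.shape[0]
        dW = x.T @ dy

        # 3) Weight update: plain stochastic gradient descent.
        W -= lr * dW

    # Inference needs only the forward pass: y = x @ W

A multi-layer network repeats the backward pass layer by layer, which is exactly the part an inference-only accelerator does not have to support.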
In addition, I/O bandwidth is often the bottleneck in accelerator design. To reduce the data bandwidth, this thesis builds on the floating-point signed digit (FloatSD) algorithm and the quantization techniques of previous work to shrink the neural network size and the bit width of data values. That previous work reached a top-5 accuracy on the ImageNet dataset only 0.8% below that of the floating-point version.
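The exact FloatSD weight format is defined in the cited previous work and is not reproduced here. As a rough illustration of the signed-digit idea behind it, the sketch below approximates a weight by a short sum of signed powers of two, so that multiplying by that weight reduces to a few shifts and adds; the helper names and the two-term budget are assumptions made for this example only.

    import numpy as np

    # Illustrative signed power-of-two approximation (not the actual FloatSD
    # format): represent a weight as a short sum of terms s * 2**e, s in {-1, +1}.
    def to_signed_pow2(w, num_terms=2):
        terms, residual = [], float(w)
        for _ in range(num_terms):
            if residual == 0.0:
                break
            sign = 1.0 if residual > 0 else -1.0
            exp = int(np.round(np.log2(abs(residual))))
            terms.append((sign, exp))
            residual -= sign * 2.0 ** exp
        return terms

    def from_signed_pow2(terms):
        return sum(s * 2.0 ** e for s, e in terms)

    q = to_signed_pow2(0.37)          # [(1.0, -1), (-1.0, -3)]
    print(q, from_signed_pow2(q))     # 0.5 - 0.125 = 0.375, close to 0.37

Representing weights with a few signed digits is what allows multiply-accumulate hardware to replace full multipliers with shift-and-add logic, which is the usual source of area and power savings for this kind of representation.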
This thesis presents the design of a hardware accelerator for training neural networks, covering the processing dataflow, the AMBA interface, and the memory configuration. The design is an IP-level engine that can be integrated into an SoC platform. In addition, the thesis focuses on optimizing data reuse so that the system accesses DRAM efficiently.
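As a rough sketch of the kind of data reuse meant here (the layer shape, tile size, and loop order are assumptions for illustration and do not reproduce the accelerator's actual dataflow), the loop nest below tiles the output of a convolution layer so that each input patch and the weights are fetched once per tile and then reused for every output position inside that tile.

    import numpy as np

    # Output-tiled convolution: each input patch and the weights are loaded once
    # per tile and reused for the whole T x T output tile, cutting off-chip traffic.
    # Layer shape, tile size, and the buffer model are illustrative assumptions.
    C, H, W = 8, 16, 16      # input channels, height, width
    M, K = 4, 3              # output channels, kernel size
    T = 4                    # output tile size

    ifmap = np.random.rand(C, H, W)
    weights = np.random.rand(M, C, K, K)
    ofmap = np.zeros((M, H - K + 1, W - K + 1))

    for ty in range(0, ofmap.shape[1], T):
        for tx in range(0, ofmap.shape[2], T):
            # One DRAM fetch per tile: the (T+K-1)^2 input patch and all weights
            # stay in on-chip buffers while the whole output tile is computed.
            tile = ifmap[:, ty:ty + T + K - 1, tx:tx + T + K - 1]
            for oy in range(min(T, ofmap.shape[1] - ty)):
                for ox in range(min(T, ofmap.shape[2] - tx)):
                    patch = tile[:, oy:oy + K, ox:ox + K]
                    ofmap[:, ty + oy, tx + ox] = np.tensordot(
                        weights, patch, axes=([1, 2, 3], [0, 1, 2]))

Larger tiles increase reuse but call for larger on-chip buffers, which is the trade-off behind the memory-size estimation and ping-pong buffering discussed in Chapter 4.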

Keywords: Convolutional neural network, Backpropagation, FloatSD
Acknowledgements i
Abstract (in Chinese) iii
Abstract v
Contents vii
List of Tables xi
List of Figures xiii
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation of Thesis 3
1.3 Organization and Contributions of Thesis 4
Chapter 2 Neural Network 6
2.1 Multilayer Perceptron (MLP) 6
2.1.1 Principles 6
2.1.2 Backpropagation 10
2.2 Convolutional Neural Networks (CNN) 12
2.2.1 Principles 13
2.2.2 Backpropagation 17
Chapter 3 Design of Low Complexity Training and Inference of Neural Network 20
3.1 Signed Digit Representation 20
3.2 FloatSD 22
3.3 Parameter Quantization and Software Simulation 24
3.3.1 Simulation Platform 24
3.3.2 Quantization Method 24
3.3.3 Simulation Results 25
3.4 Design of FloatSD MAC 27
3.4.1 MAC Architecture 27
3.4.2 Techniques for Low Power 29
3.4.3 Area and Power Reports 30
Chapter 4 CNN Training SoC Architecture Design 32
4.1 High-Level Planning 32
4.1.1 SOC platform 32
4.1.2 Hardware/Software Partitioning 34
4.1.3 Data Bit width 35
4.1.4 Data Allocation in DRAM 36
4.1.5 Support of Backward Phase 37
4.2 Dataflow and Internal Memory 38
4.2.1 Tile Computing and Data Reuse 38
4.2.2 Estimate Memory Size 43
4.2.3 List of Memory Banks 45
4.2.4 Ping-pong Buffers 45
4.3 Architecture 46
4.3.1 Scheduling for Data Access and Process 46
4.3.2 DNN Engine Architecture 48
4.3.3 PE Cube and Number of PE Cube 49
4.3.4 Comparisons with Previous Works 51
4.3.5 Module Hierarchy 58
Chapter 5 Circuit Design and Simulation 60
5.1 Design of Process Element (PE) and PE Array 60
5.1.1 Encoding and Decoding of 8-bit Weight 60
5.1.2 Backward Phase 62
5.1.3 Design of PE Dataflow 64
5.2 Design of DNN Engine Controller 67
5.2.1 Architecture of Controller 67
5.2.2 AHB Interface and Decoding of Control Registers 69
5.2.3 Central Controller 72
5.2.4 Address Generation Unit (AGU) 77
5.3 Design of Direct Memory Access (DMA) 83
5.3.1 AXI Interface 83
5.3.2 DMA for Read 84
5.3.3 DMA for Write 89
5.4 Design of Internal Memory and Register Bank 90
5.4.1 Internal Data Bandwidth 90
5.4.2 Data Realignment 91
5.4.3 Image Register Bank for Data Reuse 101
5.4.4 Memory Access on Partial Sum Buffer 103
5.4.5 Design of Internal Switch 105
5.5 Verification 108
5.5.1 Forward Pass 108
5.5.2 Backward Pass Step1 111
5.5.3 Backward Pass Step2 114
5.5.4 Section Summary 117
5.6 Summary 118
Chapter 6 Conclusion and Future Perspectives 120
Bibliography 122