

Graduate Student (English): Yi-Xian Kuo
Thesis Title (English): An NEDA-Based Accelerator with Memory-Interleaving for Deep Convolutional Neural Networks
Advisor (English): Yeong-Kang Lai
Committee Members (English): Chen-Hao Chang, Chao-Tsung Huang
Keywords (English): Accelerator, CNN, VLSI, Architecture Design
Convolutional neural networks (CNNs) in deep learning have recently become popular in many applications, from speech recognition to image classification and object detection. In object detection, YOLO (You Only Look Once) is a well-known algorithm. The YOLO network requires a large number of multiply-accumulate computations, so dedicated hardware must be designed to accelerate them at the edge.
To reduce hardware cost, a new distributed arithmetic (DA) architecture similar to NEDA is proposed, replacing multipliers with adders to cut power and area cost while maintaining the speed and accuracy required by digital signal processing (DSP) applications. Mathematical analysis shows that DA can realize multiplication in two's-complement form using only additions, followed by a final data shift, so that adders take over the work of multipliers. In addition, this thesis performs max-pooling directly after convolution, thereby reducing bandwidth. Finally, the most distinctive feature of this design is that one PE can execute 1.78 MAC operations per clock cycle. These three techniques are the particular contributions of this thesis; among the works we have surveyed, none presents similar ideas or techniques. The goal of using DA is that fewer weight bits require fewer clock cycles, so further reducing the weight bit width can shorten the cycle count even more.
CNNs in deep learning have become popular in many recent applications, from speech recognition to image classification and object detection. Among object detectors, YOLO (You Only Look Once) is a well-known algorithm. YOLO requires a large number of multiply-accumulate (MAC) calculations, so dedicated hardware must be designed on the edge side to speed them up. A CNN is dominated by MAC operations: a state-of-the-art CNN requires billions of operations to predict a single image, so it is well suited to highly parallel hardware. Data must be moved through memory to support these calculations, and since data movement consumes more energy than computation, the hardware architecture must not only provide high parallelism and high throughput but also optimize data movement across the entire CNN system to achieve efficient data reuse. This optimization must also adapt to convolutions of different shapes and dimensions. To meet these challenges, the dataflow design is critical: it should support highly parallel computation while minimizing the energy of memory data movement, exploit data reuse through a multi-level storage hierarchy to reduce the cost of data movement, and remain reconfigurable to support convolutions of different shapes and dimensions. To reduce hardware cost, a new distributed arithmetic (DA) architecture similar to NEDA is proposed, in which multipliers are replaced by adders; the purpose is to reduce power and area while maintaining high speed and accuracy. Mathematical analysis proves that DA can implement multiplication in two's-complement form using only additions, followed by a final data shift, so that adders replace multipliers.
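The adder-only multiplication described above can be illustrated with a short behavioral sketch. This is a minimal model of the DA idea, not the thesis's RTL; the function name and the 8-bit two's-complement weight width are assumptions for illustration:

```python
def da_dot(weights, xs, bits=8):
    """Multiplier-free dot product in the spirit of distributed arithmetic.

    Each weight is an integer in `bits`-bit two's complement. For every bit
    position we add up the inputs whose weight bit is set (additions only),
    then combine the partial sums with left shifts; the sign bit carries
    the negative weight -2^(bits-1) of two's complement.
    """
    acc = 0
    for i in range(bits):
        partial = 0
        for w, x in zip(weights, xs):
            if (w >> i) & 1:        # bit i of the weight
                partial += x        # addition only, no multiplier
        if i == bits - 1:           # sign bit position: weight is -2^(bits-1)
            acc -= partial << i
        else:
            acc += partial << i
    return acc
```

With weights in [-128, 127] this reproduces the ordinary dot product, e.g. `da_dot([3, -2, 5], [10, 7, -4])` equals 3*10 + (-2)*7 + 5*(-4) = -4, and fewer weight bits directly mean fewer partial-sum iterations, which is the cycle-count saving the thesis targets.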
In addition, in this thesis, max-pooling is performed immediately after convolution to reduce bandwidth. Finally, the most distinctive feature of this design is that one PE can perform 1.78 MAC operations per clock cycle. The three methods above are the particular contributions of this thesis; among the works surveyed, none presents similar ideas or techniques. The purpose of using DA is that fewer weight bits require fewer clock cycles.
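The bandwidth saving from fusing max-pooling into the convolution output path can be sketched as follows. This is a behavioral model assuming a 2x2 window with stride 2 (the configuration implied by the bandwidth-reduction claim); the function name is illustrative:

```python
def maxpool2x2_stream(conv_rows):
    """Fuse 2x2 stride-2 max-pooling into the convolution output stream.

    `conv_rows` is the stream of convolution output rows. Only one pooled
    value per 2x2 window is ever written out, so the off-chip output
    traffic drops to 1/4 of the unpooled feature-map size.
    """
    out = []
    it = iter(conv_rows)
    for top in it:
        bottom = next(it)  # consume two consecutive rows per pooled row
        pooled = [max(top[j], top[j + 1], bottom[j], bottom[j + 1])
                  for j in range(0, len(top), 2)]
        out.append(pooled)
    return out
```

Because each pooled value is produced as soon as its four convolution outputs exist, the unpooled feature map never has to be written to and read back from memory.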
Chapter 1 Introduction 1
1. Object Detection 2
2. Deep Learning 4
Chapter 2 Related Theory and Literature 6
1. You Only Look Once (YOLO) 6
(1) YOLO Architecture 6
1. Feature Extraction 6
2. Classifier 8
(2) YOLO Algorithm 10
2. New Distributed Arithmetic (NEDA) 12
(1) Distributed Arithmetic (DA) 12
(2) New Distributed Arithmetic (NEDA) 13
3. Quantization 15
4. Eyeriss v2 17
5. NullHop 18
6. An Architecture to Accelerate Convolution in DNN 19
7. Energy-Efficient Design of Processing Element for CNN 20
8. An Energy-Efficient Architecture for Binary Weight 21
Chapter 3 Compressed YOLO Network Algorithm for Object Detection and Simulation Results 22
1. Introduction 22
2. YOLO v2 Tiny Neural Network 23
(1) Network Training 23
1. Dataset 23
2. Neural Network Training 24
(2) Batch Normalization Fused Convolution 25
(3) Post-Training Quantization 29
3. Simulation Results 32
Chapter 4 Hardware Architecture Design and Implementation 34
1. Introduction 34
2. Hardware Specifications 34
3. Hardware Design 35
(1) System Architecture Design 35
(2) DRAM 36
(3) Feature Map Buffer 39
(4) Weight Buffer 44
(5) Memory Address Generator 46
(6) Accelerator 51
(7) PE 53
(8) Max Pooling 55
(9) Output Buffer 55
4. Dataflow Design 56
5. Finite State Machine and Control Unit Design 62
6. Implementation Results 66
(1) Digital IC Design Flow 66
(2) Chip Specifications 67
(4) LAYOUT 72
Chapter 5 Conclusion 73
References 74
[1] Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR. (2016)
[2] W. Pan, A. Shams, M. A. Bayoumi, "NEDA: A new distributed arithmetic architecture and its application to discrete cosine transform", IEEE Workshop on SiPS, pp. 159-168, 1999.
[3] Longa, P.; Miri, A. Area-efficient FIR filter design on FPGAs using distributed arithmetic. In Proceedings of the 6th IEEE International Symposium on Signal Processing and Information Technology, Vancouver, BC, Canada, 28–30 August 2006.
[4] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., et al.: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, arXiv preprint (2017).
[5] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE J. Emerging Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[6] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354–1367, Jul. 2018.
[7] A. Aimar, H. Mostafa, et al., "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2017.
[8] A. Ardakani, C. Condo, M. Ahmadi, W. J. Gross, "An architecture to accelerate convolution in deep neural networks", IEEE Trans. Circuits Syst. I Reg. Papers, vol. 65, no. 4, pp. 1349-1362, Apr. 2018.
[9] Y. Choi et al., "Energy-efficient design of processing element for convolutional neural network", IEEE Trans. Circuits Syst. II Exp. Briefs, vol. 64, no. 11, pp. 1332-1336, Nov. 2017.
[10] Y. Wang, J. Lin, Z. Wang, "An energy-efficient architecture for binary weight convolutional neural networks", IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 2, pp. 280-293, Feb. 2018.
[11] Y.-H. Chen, T. Krishna, J. Emer, V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks", IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 127-138, Jan. 2017.
[12] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2, 2015.
[13] J. I. Guo, R. C. Ju, J. W. Chen, "An efficient 2-D DCT/IDCT core design using cyclic convolution and adder-based realization", IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp. 416-428, Apr. 2004.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012
[15] Y.-S. Jehng, "An efficient and simple VLSI tree architecture for motion estimation algorithms", IEEE Trans. Signal Processing, vol. 41, pp. 889-900, Feb. 1993.
[16] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang, "Improving neural network quantization without retraining using outlier channel splitting," in Proc. Int. Conf. Mach. Learn. (ICML), 2019, pp. 7543–7552.
[17] H. T. Kung, "Why systolic architectures?", IEEE Computer, vol. 15, no. 1, pp. 37-46, 1982.
[18] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, pp. 2295–2329, Dec. 2017.
[19] R. Girshick, "Fast R-CNN," in ICCV, 2015.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, "SSD: Single shot multibox detector," 2015.
[21] R. Alvarez, R. Prabhavalkar, and A. Bakhtin, "On the efficient representation and execution of deep acoustic models," in Proc. Interspeech, 2016.
[22] A. Bhandare, V. Sripathi, D. Karkada, V. Menon, S. Choi, K. Datta, and V. Saletore, "Efficient 8-bit quantization of transformer neural machine language translation model," arXiv:1906.00532, 2019.
[23] H. M. Jong, L. G. Chen, T. D. Chiueh, "Parallel architectures for 3-step hierarchical search block-matching algorithm," IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 407-416, Aug. 1994.
[24] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture, 2014, pp. 609–622.
[25] H. Kung, B. McDanel, S. Q. Zhang, "Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization," Proc. 24th Int. Conf. Archit. Support Program. Lang. Operating Syst., pp. 821-834, 2019.
[26] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," Apr. 2017.
[27] D. Jung, W. Jung, B. Kim, S. Lee, W. Rhee, J. H. Ahn, "Restructuring batch normalization to accelerate CNN training," Jul. 2018.
[28] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in CVPR, 2018.
Electronic full text (publicly available online from 2023/02/07)