National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


Detailed Record

Author: 張紘睿
Author (English): Hung-Jui Chang
Title: 基於稀疏性列擴展之高效記憶體存取卷積神經網路加速器
Title (English): A Memory Access Efficient Accelerator Based on Sparse Row-Expansion for Convolutional Neural Networks
Advisor: 張振豪
Advisor (English): Chen-Hao Chang
Oral Examination Committee: 張添烜、吳崇賓
Oral Examination Committee (English): Tian-Sheuan Chang, Chung-Bin Wu
Oral Defense Date: 2023-07-24
Degree: Master
Institution: National Chung Hsing University (國立中興大學)
Department: Department of Electrical Engineering (電機工程學系所)
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2023
Graduation Academic Year: 111 (2022–2023)
Language: Chinese
Pages: 46
Keywords (Chinese): 卷積神經網路、卷積神經網路加速器、特徵稀疏性
Keywords (English): Convolutional Neural Networks, CNN Accelerator, Feature Sparsity
Statistics:
  • Cited: 0
  • Views: 37
  • Rating: none
  • Downloads: 0
  • Bookmarks: 0
Convolutional neural networks have advanced rapidly in the image domain, and many of today's applications owe their excellent results to the rise of CNNs. Deploying these applications on edge devices, however, is difficult for traditional general-purpose processors in terms of both performance and power consumption, so domain-specific architectures dedicated to this kind of computation have flourished in recent years. To speed up training and improve inference accuracy, an activation function is usually appended after each convolutional layer; as a result, the number of zero-valued features grows in the deeper layers of a CNN, increasing the number of ineffectual operations. If this property can be exploited so that only non-zero values are read into the hardware for computation, the demand for data access can be reduced effectively, which translates into lower power consumption, and computation time can be accelerated by a factor of 2 to 5.
We explore the relationship between feature sparsity and CNN accelerator design and propose a series of design solutions. We use quantization-aware training to reduce single-precision data to 8 bits, and Channel Priority Row Compression for data compression. Then, taking hardware utilization, data sparsity, and bandwidth requirements into consideration, we propose a Sparse Row-Expansion parallelization strategy that raises hardware utilization and alleviates the bandwidth bottleneck, in other words, speeds up inference. We implement the CNN accelerator in the hardware description language Verilog, design a control unit and processing elements for the compressed sparse data, and design a scalable hierarchical network-on-chip (NoC) with hardware scaling in mind; because the dataflow produces a large number of partial sums, we also design a dedicated partial-sum integration circuit to handle the outputs of the processing elements.
The accelerator architecture proposed in this thesis uses 152 KB of SRAM, which mainly stores the input features, network weights, bias values, and two rows of output activations. A total of 192 multiply-accumulate units operate at 140 MHz, providing a maximum throughput of 53.8 GOPS. We implemented the accelerator system on an FPGA platform and, via cell-based simulation in the TSMC 90 nm (T90) process, measured the runtime of the VGG-16 convolutional layers and the average power of the accelerator core, obtaining 380 ms and 113 GOPS/W, with only 224 MB of DRAM access.
Convolutional Neural Networks (CNNs) are developing rapidly in the field of image processing, and many applications have achieved excellent results thanks to the rise of CNNs. Deploying these applications on edge devices is hard to afford with traditional general-purpose processors in terms of power consumption or performance, so domain-specific architectures dedicated to executing these algorithms have also flourished in recent years.
To speed up network training and improve inference accuracy, an activation function usually follows each convolutional layer, which causes the number of zero-valued features to grow as the convolutional neural network goes deeper, thus increasing the number of ineffectual operations. If we can effectively exploit this property and take only non-zero values for computation, the demand for data access can be reduced significantly, equivalent to less power consumption, and the operation time can be accelerated by 2 to 5 times.
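The saving described above can be illustrated with a small sketch: only the non-zero activations of a feature row are kept, each paired with its position, so the hardware fetches far less data. This is a generic (index, value) pair illustration; the thesis's actual Channel Priority Row Compression encoding is not reproduced here and may differ.

```python
def compress_row(row):
    """Keep only non-zero activations with their positions.

    Generic (index, value) pair encoding for illustration; the
    thesis's Channel Priority Row Compression format may differ.
    """
    return [(i, v) for i, v in enumerate(row) if v != 0]

# A post-ReLU activation row: deeper layers are often mostly zero.
row = [0, 3, 0, 0, 7, 0, 0, 1]
pairs = compress_row(row)
# Only 3 of the 8 values need to be read into the PE array.
```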
We are committed to exploring the relationship between feature sparsity and convolutional neural network accelerator design, and we propose a series of accelerator design solutions. We use Quantization Aware Training (QAT) to reduce single-precision data to 8 bits and Channel Priority Row Compression for data compression. Taking hardware utilization, data sparsity, and bandwidth requirements into consideration, we propose a parallelization strategy called Sparse Row-Expansion, which improves hardware utilization and alleviates bandwidth bottlenecks, in other words, speeds up inference. We use the hardware description language Verilog to implement the convolutional neural network accelerator, design the control unit for compressed sparse data and the processing elements (PEs), and design a scalable hierarchical network-on-chip (NoC) with hardware scaling in mind.
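The 8-bit quantization step above can be sketched as the fake quantization commonly used in quantization-aware training: values are rounded onto an int8 grid in the forward pass while training continues in floating point. This is a simplified illustration; the scale and zero-point scheme shown here is an assumption, not the thesis's exact configuration.

```python
def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Simulate int8 quantization in the forward pass (QAT-style).

    Simplified sketch; the scale/zero-point values are illustrative
    assumptions, not the thesis's actual parameters.
    """
    q = round(x / scale) + zero_point   # map onto the integer grid
    q = max(qmin, min(qmax, q))         # clamp to the int8 range
    return (q - zero_point) * scale     # dequantize back to float

# With scale 0.05, the value 0.337 snaps to the nearest int8 step.
y = fake_quantize(0.337, scale=0.05)
```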
Because the dataflow produces many partial sums (psums), a dedicated psum integration circuit is designed to process the output data of the PEs. The accelerator architecture proposed in this thesis uses 152 KB of SRAM, which mainly stores input features, weights, biases, and two rows of output activation values, with a total of 192 multiply-accumulate units (MACs) operating at a frequency of 140 MHz and providing a maximum throughput of 53.8 GOPS. We realized the proposed architecture on an FPGA platform and in a cell-based TSMC 90 nm (T90) flow to measure the performance of the VGG-16 convolutional layers and the average power consumption of the accelerator core, respectively, achieving 380 ms, 113 GOPS/W, and only 224 MB of DRAM access.
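The peak-throughput figure follows directly from the MAC count and the clock frequency, counting one multiply plus one accumulate as two operations per cycle (a common convention, assumed here):

```python
macs = 192          # multiply-accumulate units in the PE array
freq_hz = 140e6     # 140 MHz clock
ops_per_mac = 2     # 1 multiply + 1 add per cycle (assumed convention)

gops = macs * freq_hz * ops_per_mac / 1e9
# 53.76 GOPS, reported as 53.8 GOPS
```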
Abstract (Chinese) i
Abstract ii
Table of Contents iv
List of Figures vi
List of Tables viii
Chapter 1 Introduction 1
1.1. Background and Motivation 1
1.2. Thesis Organization 2
Chapter 2 Literature Review 3
2.1. Convolutional Neural Networks 3
2.1.1. Convolutional Layers 3
2.1.2. Activation Functions 4
2.1.3. VGG 4
2.2. CNN Hardware Accelerators 6
2.2.1. Eyeriss 6
2.2.2. SCNN 8
2.2.3. Fully-Connection Based 9
2.2.4. SNAP 10
Chapter 3 Methodology 13
3.1. Convolution Algorithm Exploration and Analysis 13
3.1.1. Quantization Aware Training 13
3.1.2. Feature Sparsity and Convolution Parallelization 16
3.1.3. Channel Priority Row Compression 19
3.1.4. Sparse Row-Expansion 21
3.2. Hardware Architecture 24
3.2.1. CNN Accelerator Architecture 24
3.2.2. Compressed-Format Transfer and Control 25
3.2.3. Processing Element Design 28
3.2.4. Scalable Network-on-Chip Design 29
3.2.5. Partial Sum Integration and Output 31
Chapter 4 Results 33
4.1. FPGA and Cell-Based Implementation 33
4.1.1. FPGA Experimental Environment 33
4.1.2. Cell-Based Experimental Environment 38
4.2. CNN Accelerator Comparison 40
Chapter 5 Conclusion 42
5.1. Conclusion 42
5.2. Future Work 42
References 44
[1] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[2] A. Parashar et al., “SCNN: An accelerator for compressed-sparse convolutional neural networks,” ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 27-40, 2017.
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[4] Y.-H. Chen et al., “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2016.
[5] Y.-H. Chen et al., “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292-308, 2019.
[6] J. F. Zhang et al., “SNAP: An efficient sparse neural acceleration processor for unstructured sparse deep neural network inference,” IEEE Journal of Solid-State Circuits, vol. 56, no. 2, pp. 636-647, 2020.
[7] V. Sze et al., “Overview of deep neural networks,” in Efficient Processing of Deep Neural Networks, Synthesis Lectures on Computer Architecture, Springer, Cham, 2020.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems 25, 2012.
[9] K. He et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[10] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[11] C. Szegedy et al., “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[12] R. Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[13] W. Liu et al., “SSD: Single shot multibox detector,” in Proceedings of the 14th European Conference Computer Vision–ECCV 2016, Part I, 2016.
[14] J. Redmon et al., “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[15] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Part III, 2015.
[16] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[17] K. He et al., “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[18] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 367-379, June 2016.
[19] A. Ardakani et al., “An architecture to accelerate convolution in deep neural networks,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 4, pp. 1349-1362, 2017.
[20] S. Han et al., “Learning both weights and connections for efficient neural network,” Advances in Neural Information Processing Systems 28, 2015.
[21] PyTorch, “Quantization,” PyTorch 2.0 documentation.
[22] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[23] AMD Xilinx, “ZCU102 Evaluation Board User Guide (UG1182),” 21 Feb. 2022.
[24] AMD Xilinx, “AXI DMA v7.1 LogiCORE IP Product Guide (PG021),” 27 Mar. 2022.
[25] Y. Lu, Y.-L. Wu, and J.-D. Huang, “A coarse-grained dual-convolver based CNN accelerator with high computing resource utilization,” in Proceedings of the 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2020.
[26] J.-S. Park et al., “A Multi-mode 8k-MAC HW-utilization-aware neural processing unit with a unified multi-precision datapath in 4-nm flagship mobile SoC,” IEEE Journal of Solid-State Circuits, vol. 58, no. 1, pp. 189-202, 2022.
[27] Arm, “AMBA AXI and ACE Protocol Specification,” 28 Oct. 2011.
Electronic full text (publicly available online from 2026-08-28)