
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: Yu, Song-Tien (余松恬)
Title: 通用型脈動陣列 AI 加速器:評估適用性與效能研究
Title (English): A Study on the Applicability and Performance Evaluation of a General-Purpose Systolic Array AI Accelerator
Advisor: Hwang, Wen-Jyi (黃文吉)
Committee Members: Yeh, Tso-Zen (葉佐任); Tung, Yi-Chih (董一志); Hwang, Wen-Jyi (黃文吉)
Oral Defense Date: 2023-07-17
Degree: Master's
Institution: National Taiwan Normal University
Department: Department of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis Type: Academic thesis
Publication Year: 2023
Graduation Academic Year: 111 (2022-2023)
Language: Chinese
Pages: 51
Keywords (Chinese): systolic array; hardware accelerator; edge computing; neural network model
Keywords (English): Gemmini; RISC-V
Statistics:
  • Cited by: 0
  • Views: 138
  • Rating: (none)
  • Downloads: 24
  • Bookmarked: 0
This thesis evaluates the applicability and performance of a general-purpose systolic array AI hardware accelerator across different types of neural network models. As deep learning sees wide deployment in edge computing, hardware accelerator design has become key to improving edge computing efficiency. However, provisioning a dedicated accelerator for every kind of neural network is impractical: if the accelerator configuration had to change with each model architecture, the cost would be prohibitive.
This thesis proposes a configuration scheme for a general-purpose systolic array AI hardware accelerator that addresses the hardware-fit problem in neural network applications, allowing a single accelerator to serve a variety of network architectures. It also builds an SoC platform that integrates the accelerator with a RISC-V core and implements it on an FPGA board, providing a realistic evaluation environment.
Gemmini is selected as the representative general-purpose systolic array AI accelerator. Under a range of hardware configurations, experiments are conducted on two representative neural network models: an image component inspection model based on 2D convolutional neural networks and a gesture recognition model based on 1D convolution. Combining performance evaluation with measurements of FPGA hardware resource usage, this thesis proposes guidelines for selecting a suitable general-purpose systolic array accelerator configuration, as a reference for researchers in the AI field.
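The weight-stationary systolic array dataflow at the heart of accelerators like Gemmini can be sketched in a few lines. The following is a simplified functional model written for illustration only, not Gemmini's actual hardware or software API: each processing element (PE) in a K x N grid holds one weight of B fixed in place, activations of A stream through the grid, and partial sums accumulate along each column to produce C = A x B.

```python
def systolic_matmul(A, B):
    """Compute C = A @ B with a functional model of a weight-stationary
    K x N grid of processing elements (PEs).

    A is M x K (streamed activations), B is K x N (stationary weights).
    """
    M, K = len(A), len(A[0])
    N = len(B[0])
    # Weights are loaded once and stay fixed in the PE grid
    # ("weight stationary") for the whole computation.
    pe_weight = [[B[k][n] for n in range(N)] for k in range(K)]
    C = [[0] * N for _ in range(M)]
    for m in range(M):            # each row of A streams through the grid
        for n in range(N):        # each column of PEs produces one output
            acc = 0               # partial sum flowing down column n
            for k in range(K):
                acc += A[m][k] * pe_weight[k][n]  # MAC at PE (k, n)
            C[m][n] = acc
    return C

# Example: systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# returns [[19, 22], [43, 50]]
```

In real hardware the inputs are skewed in time so that all PEs work in parallel each cycle; this sketch only captures the dataflow mapping (which operand is held fixed, which streams), which is what distinguishes the weight-stationary design discussed in Section 2-4.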
Acknowledgments
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Research Contributions
Chapter 2 Theoretical Background
2-1 Chipyard SoC Generators
2-1-1 Rocket Chip
2-1-2 Deep Learning Accelerator
2-1-3 Components and Tools
2-2 Gemmini AI Accelerator
2-2-1 Processing Element and Controller
2-2-2 Dataflow Type
2-2-3 Software Support
2-3 Systolic Array
2-4 Weight Stationary
Chapter 3 Research Methods
3-1 Gemmini: Hardware Flexibility
3-2 Model Architecture
3-2-1 Automated Optical Inspection Model
3-2-2 Gesture Recognition Model
3-3 Quantization and Deployment
Chapter 4 Experimental Results and Performance Analysis
4-1 Experimental Environment
4-2 Acceleration Performance Compared to CPU
4-2-1 Automated Optical Inspection Model Evaluation
4-2-2 Gesture Recognition Model Evaluation
4-2-3 Execution Time and Speedup Ratio
4-3 Gemmini Memory Subsystem and Performance
4-3-1 Scratchpad Memory and Hardware Resources
4-3-2 Accumulator Memory and Hardware Resources
4-3-3 Performance of Different Memory Configurations
4-4 Gemmini Systolic Array Size and Performance
4-4-1 Systolic Array PE Size and Hardware Resources
4-4-2 Performance of Different Systolic Array PE Sizes
4-5 L2 Cache Capacity and Performance
4-5-1 L2 Cache Capacity and Hardware Resources
4-5-2 Performance of Different L2 Cache Capacities
4-6 Hardware Configuration Selection Guide
Chapter 5 Conclusion
References