National Digital Library of Theses and Dissertations in Taiwan

Author: Ming-Hang Hsieh (謝明航)
Title: A Multiplier-less Convolution Neural Network Inference Acceleration Engine Based on FPGA (在FPGA上實現無乘法器卷積神經網絡推理加速電路)
Advisor: Tzi-Dar Chiueh (闕志達)
Committee members: Chia-Lin Yang, Tsung-Te Liu, Pei-Yun Tsai, Hsi-Pin Ma
Oral defense date: 2020-07-31
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Electronics Engineering
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2020
Academic year of graduation: 108 (2019–2020)
Language: Chinese
Pages: 126
Keywords: machine learning, convolutional neural network, inference acceleration system, FloatSD, FloatSD4, multiplier-less, half-precision accumulation
DOI: 10.6342/NTU202003812
Since AlexNet [1] was published in 2012, machine learning has found ever broader application: image classification and object recognition early on, style transfer [2] and natural language processing [3] later, and recently even video and audio generation [4][5]. Machine learning has demonstrated its potential across a wide range of fields. Most of these applications share one feature: the use of convolutional neural networks. CNNs have become an indispensable part of machine learning, and the demand for faster computation has grown accordingly. Whether in cloud computing or edge computing, accelerating neural network inference at lower power and higher efficiency has been a major research focus in recent years.
CNN inference requires a large number of multiply-add operations. Within a single layer these operations have no mathematical dependence on one another, so even with vector instruction sets a conventional CPU struggles to execute them efficiently. Hardware acceleration based on general-purpose computing on graphics processing units (GPGPU) addresses this problem well.
However, because of its development history and general-purpose nature, a GPGPU can parallelize convolution yet cannot fully exploit the data-sharing characteristics unique to CNNs: every operation must first be converted and rearranged before the GPGPU's matrix units can accelerate it. This limits its execution efficiency, and its large energy consumption makes it impractical for edge devices.
Building on the Floating-point Signed Digit (FloatSD) algorithm [6], this thesis proposes a more compact 4-bit FloatSD4 weight encoding. Besides greatly reducing the data transfer of the neural network, it simplifies the convolution from multiply-add to pure addition, significantly lowering computational complexity. On three image recognition datasets, MNIST, CIFAR-10, and ImageNet, the MNIST and CIFAR-10 networks reach accuracy close to FP32, while the ImageNet top-1 and top-5 accuracies differ from FP32 by less than 0.5%.
Beyond the software training results, the other focus of this thesis is a hardware circuit designed for the FloatSD4 algorithm: in addition to the core acceleration units, an inference acceleration system based on an FPGA and a PC platform. Using VGG-7 to validate the system, the FPGA-based accelerator runs 4.82 times faster than a single-precision CPU platform, with 80 times the overall energy efficiency.
Since the publication of AlexNet [1] in 2012, the applications of machine learning have grown steadily broader. Early applications included image classification and object recognition; later came style transfer [2] and natural language processing [3], and more recently video and audio generation [4][5]. Machine learning has shown its potential in a wide variety of fields. Most of these applications have one thing in common: the use of convolutional neural networks (CNNs). CNNs have become an indispensable part of machine learning, so the need to raise their execution speed has grown as well. Whether in cloud computing or edge computing, accelerating neural network inference with low power and high efficiency has been a major research focus in recent years.
CNN inference requires a large number of multiply-add operations. Within a single layer these operations have no mathematical dependence on one another, so while a CPU can exploit vector instructions to accelerate them, it still executes them inefficiently. General-purpose computing on graphics processing units (GPGPU) is known to address this problem well. However, due to its development history and general-purpose nature, a GPGPU can process convolutions in parallel but cannot fully exploit the data-sharing characteristics unique to CNNs: all operations must first undergo conversion and rearrangement before they can be accelerated as matrix operations. These extra steps reduce the GPGPU's efficiency, and its high energy consumption makes it impractical for edge devices.
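The conversion and rearrangement mentioned above is commonly done with an im2col transform, which unrolls each receptive field into a column so that the whole convolution collapses into one matrix multiply. A minimal single-channel sketch, not the thesis's implementation (the function name and layout here are illustrative):

```python
import numpy as np

def im2col(x, k):
    """Unroll every k-by-k patch of a single-channel image into one
    column, so convolution becomes a single matrix multiply
    (valid padding, stride 1)."""
    h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((k * k, oh * ow))
    idx = 0
    for i in range(oh):
        for j in range(ow):
            cols[:, idx] = x[i:i + k, j:j + k].ravel()
            idx += 1
    return cols

# 2-D convolution of a 4x4 input with a 3x3 all-ones kernel,
# computed as one vector-matrix product over the unrolled patches.
x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
y = (w.ravel() @ im2col(x, 3)).reshape(2, 2)
```

The cost of this flexibility is exactly what the text points out: each input pixel is copied into up to k*k columns, so the data reuse inherent in convolution is traded for memory traffic.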
Based on the Floating-point Signed Digit (FloatSD) algorithm [6], this thesis proposes a simplified 4-bit FloatSD4 weight encoding. In addition to greatly reducing weight transmission and storage, it reduces each convolution multiply-add to one simple addition, significantly lowering the associated computational complexity. On three image recognition datasets, MNIST, CIFAR-10, and ImageNet, the MNIST and CIFAR-10 CNNs with FloatSD4 weights achieve results similar to their FP32 counterparts, and the top-1 and top-5 ImageNet accuracies of the FloatSD4 CNN are both within 0.5% of the FP32 CNN.
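The general idea behind replacing a multiplier with adders can be illustrated with a signed-digit weight: when the mantissa digits are restricted to {-1, 0, +1}, each weight-activation product reduces to a few shifted additions. The sketch below is a hypothetical illustration of that principle only; the exact FloatSD4 digit grouping and bit layout are defined in the thesis and not reproduced here:

```python
def sd_multiply(activation, exp, digits):
    """Multiply an integer activation by a weight expressed as
    2**exp * sum(d * 2**p for (d, p) in digits), with each signed
    digit d in {-1, 0, +1}. Every partial product is a shift of the
    activation, so no hardware multiplier is needed."""
    acc = 0
    for d, p in digits:
        if d == 1:
            acc += activation << p   # add shifted activation
        elif d == -1:
            acc -= activation << p   # subtract shifted activation
    # Apply the shared exponent (may be negative in a real format).
    return acc * (2 ** exp)

# Weight 6 = 2**3 - 2**1: one add and one subtract replace the multiply.
product = sd_multiply(5, 0, [(1, 3), (-1, 1)])  # 5 * 6
```

In hardware, each nonzero digit maps to a shifter and an adder input, which is why the thesis's convolution engine needs no multipliers at all.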
In addition to training CNNs effectively with FloatSD4 weights, the other focus of this thesis is the design of an inference acceleration circuit for FloatSD4 CNNs. The proposed system is based on an FPGA and a PC platform, with many FloatSD4 convolution processing elements, their associated caches, and access circuitry implemented in the FPGA. We used VGG-7 on the CIFAR-10 task to verify the inference acceleration system. Compared to a CPU platform running single-precision arithmetic, the acceleration system is 4.82 times faster, while its overall energy efficiency is 80 times that of the CPU.
Acknowledgements v
Abstract (Chinese) vii
Abstract ix
Table of Contents xi
List of Figures xv
List of Tables xix
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Organization and Contributions 3
Chapter 2 Introduction to Neural Networks 5
2.1 Multilayer Perceptron (MLP) 5
2.1.1 Architecture 5
2.1.2 Forward Propagation 6
2.1.3 Backward Propagation 7
2.2 Convolutional Neural Network (CNN) 11
2.2.1 Architecture 11
2.2.2 Forward Propagation 15
2.2.3 Backward Propagation 16
Chapter 3 Low-Complexity Neural Network Training and Inference Design 19
3.1 Grouped Signed Digit Number Representation 19
3.2 FloatSD 20
3.2.1 FloatSD8 21
3.2.2 FloatSD4 21
3.3 Training Flow 23
3.3.1 Overview of the Training Flow 23
3.3.2 Parameter Quantization 25
3.3.3 Training Techniques 31
3.4 Comparison with FP8, INT8, and Log2 Formats 36
3.4.1 FP8 Format 36
3.4.2 INT8 Format 37
3.4.3 Log2 Format 38
Chapter 4 Training Deep Neural Networks with FloatSD4 Weights 41
4.1 The PyTorch Platform 41
4.1.1 A Brief Introduction to PyTorch 42
4.2 Test Datasets 49
4.2.1 MNIST 49
4.2.2 CIFAR-10 50
4.2.3 ImageNet 51
4.3 Network Architectures and Training Configurations 52
4.3.1 LeNet 53
4.3.2 VGG 54
4.3.3 ResNet 56
4.3.4 MobileNetV2 61
4.4 Training Results 62
Chapter 5 FPGA Circuit and System Design for FloatSD4 69
5.1 Processing Element (PE) Design 69
5.1.1 Multiplier-less MAC Design 71
5.1.2 FP16 80
5.2 PE Cube Design 82
5.2.1 PE Row Design 83
5.2.2 PE Array Design 84
5.2.3 PE Cube Design 85
5.2.4 Post-processing Operations 87
5.3 System Design 90
5.3.1 Hardware Design 90
5.3.2 System Architecture and Data Flow of the Matlab-FPGA Integration 100
5.4 Measured Performance and Analysis 102
5.4.1 Comparison of FPGA and GPU Inference Results 103
5.4.2 Performance and Power Analysis 105
5.4.3 Comparison with Prior Work 114
Chapter 6 Conclusion and Future Work 117
References 121
A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inform. Process. Syst., 2012, pp. 1097-1105.
L. A. Gatys, A. S. Ecker and M. Bethge, “Image Style Transfer Using Convolutional Neural Networks,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 2414-2423.
M. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-based Neural Machine Translation,” in Proc. Empirical Methods Natural Lang. Process., 2015, pp. 1412-1421.
T. Karras et al., “Analyzing and Improving the Image Quality of StyleGAN,” arXiv preprint arXiv:1912.04958, 2019.
P. Dhariwal et al., “Jukebox: A Generative Model for Music,” arXiv preprint arXiv:2005.00341, 2020.
P. Lin, M. Sun, C. Kung, and T. Chiueh, "FloatSD: A New Weight Representation and Associated Update Method for Efficient Convolutional Neural Network Training," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 267-279, June 2019.
O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” in International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211-252, 2015.
K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
H. Touvron, A. Vedaldi, M. Douze, and H. Jégou, “Fixing the train-test resolution discrepancy,” in Proc. Adv. Neural Inform. Process. Syst., 2019, pp. 8252-8262.
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 779-788.
C. Ledig et al., “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 105-114.
D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016.
Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, Jan. 2017.
N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan, “Training Deep Neural Networks with 8-bit Floating Point Numbers,” in Proc. Adv. Neural Inform. Process. Syst., 2018, pp. 7686-7695.
N. Mellempudi, S. Srinivasan, D. Das, and B. Kaul, “Mixed Precision Training With 8-bit Floating Point,” arXiv preprint arXiv:1905.12334, 2019.
V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in Proc. of the 27th International Conference on Machine Learning, 2010, pp. 807-814.
Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” in Proc. of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
K. Chen, C. Chen, and T. Chiueh, “Grouped signed power-of-two algorithms for low-complexity adaptive equalization,” in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 52, no. 12, pp. 816-820, Dec. 2005.
P. Lin, “Low-complexity Convolution Neural Network Training and Low Power Circuit Design of its Processing Element,” M.S. thesis, National Taiwan University, Taipei, 2017.
“IEEE Standard for Floating-Point Arithmetic,” in IEEE Std 754-2008, pp. 1-70, 29 Aug. 2008.
X. Glorot, and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. Int. Conf. Artif. Intell. Stat., 2010, pp. 249-256.
S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proc. of the 32nd International Conference on Machine Learning, 2015, pp. 448-456.
K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1026-1034.
“Training With Mixed Precision,” NVIDIA, [Online]. Available: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html. [Accessed Jul. 2020].
X. Sun et al., “Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks,” in Proc. Adv. Neural Inform. Process. Syst., 2019, pp. 4901-4910.
S. Migacz, “8-bit Inference with TensorRT,” in GPU Technology Conference, vol. 2, p. 7, 2017.
D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional Neural Networks using Logarithmic Data Representation,” arXiv preprint arXiv:1603.01025, 2016.
E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, “LogNet: Energy-efficient neural networks using logarithmic computation,” 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 5900-5904.
S. Vogel, M. Liang, A. Guntoro, W. Stechele, and G. Ascheid, “Efficient Hardware Acceleration of CNNs using Logarithmic Data Representation with Arbitrary log-base,” 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, CA, 2018, pp. 1-8.
“The MNIST database of handwritten digits,” Y. LeCun, C. Cortes, and C. Burges, [Online]. Available: http://yann.lecun.com/exdb/mnist/. [Accessed Jul. 2020].
J. Steppan, “File:MnistExamples.png,” 14 Dec. 2017. [Online]. Available: https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png. [Accessed Jul. 2020].
“The CIFAR-10 and CIFAR-100 datasets,” A. Krizhevsky, V. Nair, and G. Hinton, [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html. [Accessed Jul. 2020].
“ImageNet,” [Online]. Available: http://image-net.org. [Accessed Jul. 2020].
M. D. Zeiler, and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818-833.
K. Simonyan, and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Proc. Int. Conf. Learn. Representations, 2015, pp. 1-14.
C. Szegedy et al., “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1-9.
“SenseTime Trains ImageNet/AlexNet In Record 1.5 minutes,” F. Cai, [Online]. Available: https://medium.com/syncedreview/sensetime-trains-imagenet-alexnet-in-record-1-5-minutes-e944ab049b2c. [Accessed Jul. 2020].
F. Li, B. Zhang, and B. Liu, “Ternary Weight Networks,” arXiv preprint arXiv:1605.04711, 2016.
A. Canziani, A. Paszke, and E. Culurciello. “An analysis of deep neural network models for practical applications,” arXiv preprint arXiv:1605.07678, 2016.
“The Revolution of Depth,” A. Vieira, [Online]. Available: https://medium.com/@Lidinwise/the-revolution-of-depth-facf174924f5. [Accessed Jul. 2020].
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 4510-4520.
A. G. Howard et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv preprint arXiv:1704.04861, 2016.
M. Courbariaux, Y. Bengio, and J. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations,” in Proc. Adv. Neural Inform. Process. Syst., 2015, pp. 3123-3131.
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized Neural Networks,” in Proc. Adv. Neural Inform. Process. Syst., 2016, pp. 4107-4115.
M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” in Proc. European Conference on Computer Vision (ECCV), 2016, pp. 525-542.
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations,” arXiv preprint arXiv:1609.07061, 2016.
C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained Ternary Quantization,” in Proc. Int. Conf. Learn. Representations, 2017.
S. Wu, G. Li, F. Chen, and L. Shi, “Training and Inference with Integers in Deep Neural Networks,” in Proc. Int. Conf. Learn. Representations, 2018.
P. Micikevicius et al., “Mixed Precision Training,” in Proc. Int. Conf. Learn. Representations, 2018.
S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients,” arXiv preprint arXiv:1606.06160, 2016.
R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry, “Post-training 4-bit quantization of convolution networks for rapid-deployment,” in Proc. Adv. Neural Inform. Process. Syst., 2019, pp. 7950-7958.
M. Nagel, M. V. Baalen, T. Blankevoort, and M. Welling, “Data-Free Quantization Through Weight Equalization and Bias Correction,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 1325-1334.
P. Yin, S. Zhang, J. Lyu, S. Osher, Y. Qi, and J. Xin, “Blended coarse gradient descent for full quantization of deep neural networks,” Research in the Mathematical Sciences, vol. 6, no. 1, p. 14, Jan. 2019.
Z. Liu and M. Mattina, “Learning low-precision neural networks without Straight-Through Estimator (STE),” arXiv preprint arXiv:1903.01061, 2019.
Y. Choukroun, E. Kravchik, F. Yang, and P. Kisilev, “Low-bit Quantization of Neural Networks for Efficient Inference,” 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea (South), 2019, pp. 3009-3018.
S. Gupta, S. Ullah, K. Ahuja, A. Tiwari, and A. Kumar, “ALigN: A Highly Accurate Adaptive Layerwise Log_2_Lead Quantization of Pre-Trained Neural Networks,” in IEEE Access, vol. 8, pp. 118899-118911, 2020.
Y. Nahshan et al., “Loss Aware Post-training Quantization,” arXiv preprint arXiv:1911.07190v2, 2020.
S.R. Jain, A. Gural, M. Wu, and C. H. Dick, “Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks,” arXiv preprint arXiv:1903.08066v3, 2020.
B. Zhu, Z. Al-Ars, and W. Pan, “Towards Lossless Binary Convolutional Neural Networks Using Piecewise Approximation,” in European Conference on Artificial Intelligence, 2020.
T. Juang, “Energy-Efficient Accelerator Architecture for Neural Network Training and Its Circuit Design,” M.S. thesis, National Taiwan University, Taipei, 2018.
M. Sun, “A Convolution Neural Network Training Acceleration Solution Based on FPGA Implementation of FloatSD8 Convolution,” M.S. thesis, National Taiwan University, Taipei, 2019.
“Intel Core i9-9900K GFLOPS performance,” GadgetVersus, [Online]. Available: https://gadgetversus.com/processor/intel-core-i9-9900k-gflops-performance/. [Accessed Jul. 2020].
“New chips for machine intelligence,” J. W. Hanlon, [Online]. Available: https://jameswhanlon.com/new-chips-for-machine-intelligence.html. [Accessed Jul. 2020].
S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep Learning with Limited Numerical Precision,” in Proc. of the 32nd International Conference on Machine Learning, vol. 37, Jun. 2015, pp. 1737-1746.
J. Qiu et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” in Proc. of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016, pp. 26-35.
Y. Umuroglu et al., “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference,” in Proc. of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017, pp. 65-74.
Y. Ma, N. Suda, Yu Cao, J. Seo, and S. Vrudhula, “Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA,” 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, 2016, pp. 1-8.
Y. Ma, N. Suda, Y. Cao, S. Vrudhula, and J. Seo, “ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler,” Integration, vol. 62, pp. 14-23, Jun. 2018.
K. Guo et al., "Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35-47, Jan. 2018.
J. Chen, L. Liu, Y. Liu and X. Zeng, “A Learning Framework for n-Bit Quantized Neural Networks Toward FPGAs,” in IEEE Transactions on Neural Networks and Learning Systems (Early Access), pp. 1-15, April 2020.