National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 沈恩禾
Author (English): En-Ho Shen
Title (Chinese): 低數值精確度捲積神經網路加速器之可重置化超大型積體電路設計
Title (English): Reconfigurable Low Arithmetic Precision Convolution Neural Network Accelerator VLSI Design and Implementation
Advisor: 簡韶逸 (Shao-Yi Chien)
Committee members: 蔡宗漢, 吳安宇, 楊家驤
Oral defense date: 2019-07-30
Degree: Master
Institution: National Taiwan University
Department: Graduate Institute of Electronics Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Document type: Academic thesis
Year of publication: 2019
Academic year of graduation: 107
Language: Chinese
Pages: 58
Keywords (translated from Chinese): low numeric precision, convolutional neural network, accelerator, reconfigurable, VLSI design
DOI: 10.6342/NTU201902618
Statistics:
  • Cited by: 0
  • Views: 229
  • Rating: (none)
  • Downloads: 0
  • Bookmarked: 0
Abstract (translated from Chinese): In recent years, deep neural networks (DNNs) have achieved great success across artificial-intelligence applications. However, such models typically require bulky, power-hungry general-purpose GPUs (GPGPUs) to run, making them unsuitable for battery-powered devices such as mobile phones. In this thesis, we propose a VLSI design dedicated to computing quantized, low-precision convolutional neural networks (CNNs), which greatly reduces the energy consumed by cross-system data transfer and is particularly suited to accelerating neural networks on mobile devices. We first propose a simple and effective network quantization algorithm, together with a dataflow strategy that achieves a high data-reuse rate and fits such quantized networks. To exploit the full potential of the quantized data, we design a multiply-accumulate tree structure dedicated to low-precision operands; we then propose an on-chip buffer hierarchy and data re-alignment scheme that eliminates unnecessary data accesses and memory bank conflicts; finally, we present a processing-element array that accepts data broadcast from the buffers to each computing unit and writes completed results back to the global buffer in order. The proposed architecture supports the vast majority of CNN structures and can be reconfigured to the appropriate arithmetic precision, adapting to a variety of quantized network structures. The final design uses 180 KB of on-chip memory and 1340K logic gates.
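The quantization step described above maps floating-point weights to small integers plus a scale factor. The sketch below shows plain uniform symmetric quantization with a configurable bit width; the function name and the clipping-at-maximum rule are illustrative assumptions, since the thesis's own algorithm selects a loss-minimizing threshold rather than simply clipping at the largest magnitude.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Uniform symmetric quantization of a tensor to `bits`-bit integers.

    Illustrative sketch only: the threshold here is the max-magnitude
    value, not the loss-minimizing threshold proposed in the thesis.
    """
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale                              # dequantize as q * scale

w = np.array([0.50, -0.25, 0.125, -1.0])
q4, s4 = quantize_symmetric(w, bits=4)          # integers in [-7, 7]
q8, s8 = quantize_symmetric(w, bits=8)          # integers in [-127, 127]
```

Reconfiguring the accelerator's precision corresponds to picking `bits` per layer or per model; the integer tensor and one scale per tensor are all that must be stored on chip.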
Abstract (English): Deep neural networks (DNNs) show promising results on various AI application tasks. However, such networks are typically executed on general-purpose GPUs that are bulky in form factor and consume hundreds of watts, which is unsuitable for mobile applications. In this thesis, we present a VLSI architecture that processes quantized low numeric-precision convolutional neural networks (CNNs), cutting the power consumed by memory access and speeding the model up within a limited area budget, making it particularly fit for mobile devices. We first propose a quantization re-training algorithm for training low-precision CNNs, then a dataflow with a high data-reuse rate and a multiply-accumulate strategy specially designed for such quantized models. To fully exploit the efficiency of computing with low-precision data, we design a micro-architecture for low bit-length multiplication and accumulation, an on-chip memory hierarchy and data re-alignment flow that saves power and avoids buffer bank conflicts, and a PE array that takes broadcast data from the buffer and sequentially sends finished results back to it. The architecture is highly flexible across CNN shapes and reconfigurable for low bit-length quantized models. The design was synthesized with 180 KB of on-chip memory capacity and an area of 1340K logic gates; the implementation results show state-of-the-art hardware efficiency.
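The sub-word parallelism mentioned above exploits the fact that low-precision products leave most of a wide multiplier unused, so several narrow multiplications can share one multiply. A minimal sketch of the general trick, assuming unsigned 4-bit operands (the thesis's actual reconfigurable ALU in Section 4.2.2 is a hardware design and may differ):

```python
def packed_mul_u4(a, x0, x1):
    """Multiply one unsigned 4-bit value `a` by two unsigned 4-bit
    activations x0 and x1 using a single wide multiplication.

    Each 4x4-bit product fits in 8 bits, so placing x1 eight bits above
    x0 keeps the two partial products from overlapping in the result.
    """
    assert 0 <= a < 16 and 0 <= x0 < 16 and 0 <= x1 < 16
    packed = (x1 << 8) | x0      # one 12-bit operand holding both inputs
    prod = a * packed            # single multiply, two products inside
    p0 = prod & 0xFF             # low 8 bits  -> a * x0
    p1 = (prod >> 8) & 0xFF      # next 8 bits -> a * x1
    return p0, p1

# both products recovered from one multiplication
assert packed_mul_u4(9, 7, 13) == (63, 117)   # 9*7 and 9*13
```

Signed operands and accumulation need guard bits between the packed fields so that carries from one product cannot corrupt its neighbor, which is exactly what makes precision-reconfigurable arithmetic units non-trivial to design.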
Abstract i
List of Figures v
List of Tables ix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related Work 5
2.1 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Fixed-point quantization . . . . . . . . . . . . . . . . . . 6
2.1.2 Ternary and binary quantization . . . . . . . . . . . . . . 6
2.1.3 8-bit quantization on modern models . . . . . . . . . . . 8
2.2 Hardware design . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Dataflow optimization: row stationary . . . . . . . . . . . 10
2.2.2 Precision reconfigurable and sub-word parallelism arithmetic unit . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Bit-level re-configurable arithmetic unit . . . . . . . . . . 12
3 Low numeric precision convolution neural network 15
3.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . 15
3.2 Low Precision CNN . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Quantization Loss Minimization Threshold Selection . . . . . . . 19
3.4 Computational consideration and data re-packing . . . . . . . . . 21
3.4.1 Data re-packing . . . . . . . . . . . . . . . . . . . . . . . 21
4 Proposed Architecture 25
4.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.1 Output row stationary dataflow . . . . . . . . . . . . . . . 27
4.1.2 Data tiling . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.3 Data re-alignment and buffer hierarchy . . . . . . . . . . 30
4.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.1 PE processing pipeline . . . . . . . . . . . . . . . . . . . 32
4.2.2 Sub-word accumulation operation and re-configurable arithmetic logic unit . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.3 Shift dispatcher . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.4 Quantization . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Results 43
5.1 Quantization error minimization training . . . . . . . . . . . . . . 43
5.2 Implementation results . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.1 Area and power . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . 46
6 Conclusion 53
References 55
[1] E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, “LogNet: Energy-efficient neural networks using logarithmic computation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 5900–5904.
[2] S. Han, H. Mao, and W. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” Oct. 2016.
[3] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” arXiv e-prints, p. arXiv:1603.05279, Mar 2016.
[4] S. Migacz, “8-bit inference with TensorRT.” [Online]. Available: http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
[5] Y. Chen, T. Krishna, J. Emer, and V. Sze, “14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in 2016 IEEE International Solid-State Circuits Conference (ISSCC), Jan 2016, pp. 262–263. [Online]. Available: http://people.csail.mit.edu/emer/slides/2016.02.isscc.eyeriss.slides.pdf
[6] B. Moons and M. Verhelst, “A 0.3-2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” CoRR, vol. abs/1606.05094, 2016. [Online]. Available: http://arxiv.org/abs/1606.05094
[7] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI,” in 2017 IEEE International Solid-State Circuits Conference (ISSCC), Feb 2017, pp. 246–247.
[8] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, J. K. Kim, V. Chandra, and H. Esmaeilzadeh, “Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks,” CoRR, vol. abs/1712.01507, 2017. [Online]. Available: http://arxiv.org/abs/1712.01507
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv e-prints, p. arXiv:1512.03385, Dec 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” arXiv e-prints, p. arXiv:1602.01528, Feb 2016.
[12] Y. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 367–379.
[13] J. Luo, J. Wu, and W. Lin, “ThiNet: A filter level pruning method for deep neural network compression,” CoRR, vol. abs/1707.06342, 2017. [Online]. Available: http://arxiv.org/abs/1707.06342
[14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[15] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” CoRR, vol. abs/1707.01083, 2017. [Online]. Available: http://arxiv.org/abs/1707.01083
[16] S. Anwar, K. Hwang, and W. Sung, “Fixed point optimization of deep convolutional neural networks for object recognition,” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1131–1135, 2015.
[17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, “ImageNet large scale visual recognition challenge,” CoRR, vol. abs/1409.0575, 2014. [Online]. Available: http://arxiv.org/abs/1409.0575
[18] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[19] F. Li and B. Liu, “Ternary weight networks,” CoRR, vol. abs/1605.04711, 2016. [Online]. Available: http://arxiv.org/abs/1605.04711
[20] M. Courbariaux and Y. Bengio, “BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1,” CoRR, vol. abs/1602.02830, 2016. [Online]. Available: http://arxiv.org/abs/1602.02830
[21] A. Krizhevsky, V. Nair, and G. Hinton, “CIFAR-10 (Canadian Institute for Advanced Research).” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[22] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
[23] johnjohnlin, “Nicotb, a Python-Verilog co-simulation framework.” [Online]. Available: https://github.com/johnjohnlin/nicotb