Author: 許晉瑋
Author (English): Chin-Wei Hsu
Title: 基於HWCK資料排程之分離式卷積加速器設計與實現
Title (English): Design and Implementation of a Separable Convolution Accelerator Based on HWCK Data Scheduling
Advisor: 蔡宗漢
Advisor (English): Tsung-Han Tsai
Degree: Master's
Institution: National Central University
Department: Department of Electrical Engineering
Academic Discipline: Engineering
Field of Study: Electrical and Computer Engineering
Year of Publication: 2020
Graduation Academic Year: 108 (2019–2020)
Language: Chinese
Pages: 64
Keywords (Chinese): 硬體加速器、深度學習、現場可程式化邏輯閘陣列、系統單晶片
Keywords (English): Hardware accelerator, Deep learning, Field Programmable Gate Array, System on a Chip
In recent years, with advances in GPUs and the arrival of the big-data era, deep learning has brought revolutionary progress to many fields. From basic image pre-processing and image segmentation to face recognition and speech recognition, neural networks have gradually replaced traditional algorithms, showing that their rise has driven broad reform across artificial intelligence. However, GPU-based products are extremely expensive because of their power consumption and cost, and the enormous computational load of neural network algorithms requires dedicated acceleration hardware for real-time operation. This has motivated a great deal of recent research on digital-circuit hardware designs that accelerate convolutional networks.
This thesis proposes a separable convolution hardware architecture based on HWCK data scheduling. Depthwise convolution, pointwise convolution, and batch normalization hardware modules are designed to accelerate depthwise-separable convolution models. Through an SoC design, the AXI4 bus protocol lets the PS (Processing System) and PL (Programmable Logic) communicate with each other, so the CPU can use the developed neural network modules to accelerate the network on the FPGA. The HWCK data scheduling method can be reconfigured according to the allocated memory-bandwidth and on-chip memory resources; when both bandwidth and memory are sufficient, the design can be scaled up very easily. To reduce the size of the network's weight parameters, all data are computed and stored as 16-bit fixed-point numbers; memory access uses a ping-pong buffer architecture, and data are transferred to and from the CPU over the AXI4 bus protocol. The whole hardware architecture is implemented on a Xilinx ZCU106 development board. Through the SoC design, precompiled drivers connect the operating system with external resources while controlling the designed neural network acceleration modules, and a high-level programming language quickly reconfigures the acceleration schedule, improving hardware reconfigurability so that the design can be deployed on many different embedded platforms. Running FaceNet, the architecture reaches 222 FPS and 60.8 GOPS while consuming only 8.82 W on the ZCU106 board, for an energy efficiency of 6.89 GOPS/W.
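The factorization that the accelerator implements can be sketched in a few lines of NumPy. This is an illustration of depthwise followed by pointwise convolution only; the toy shapes, the helper names, and the plain loop structure are assumptions for clarity and do not reflect the thesis's HWCK scheduling or hardware datapath.

```python
import numpy as np

def depthwise_conv(x, w):
    """Depthwise convolution, stride 1, no padding.
    x: (H, W, C) input feature map; w: (k, k, C), one filter per channel."""
    H, W, C = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1, C))
    for c in range(C):                      # each channel is filtered independently
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[i, j, c] = np.sum(x[i:i + k, j:j + k, c] * w[:, :, c])
    return out

def pointwise_conv(x, w):
    """Pointwise (1x1) convolution: a per-pixel matrix multiply over channels.
    x: (H, W, C); w: (C, K) mixes C input channels into K output channels."""
    return x @ w

x = np.random.rand(8, 8, 4)      # toy 8x8 feature map with 4 channels
dw = np.random.rand(3, 3, 4)     # one 3x3 filter per channel
pw = np.random.rand(4, 16)       # 1x1 filters expanding 4 -> 16 channels
y = pointwise_conv(depthwise_conv(x, dw), pw)
print(y.shape)  # (6, 6, 16)
```

Splitting a standard convolution into these two stages is what makes MobileNet-style models cheap enough for the dedicated depthwise and pointwise modules described above.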
In recent years, deep learning has become increasingly popular thanks to improvements in GPUs and the advent of big data, and it has brought revolutionary progress to many fields. Most traditional algorithms have been replaced by deep learning techniques for tasks such as basic image pre-processing, image segmentation, face recognition, and speech recognition, which shows that the rise of the neural network has driven the reform of artificial intelligence. However, neural networks are limited by the power consumption and cost of GPUs, whose products are extremely expensive. Because of the large amount of computation they require, neural networks must be paired with hardware accelerators for real-time computing, and this has prompted a great deal of research on digital-circuit hardware designs for convolutional network accelerators.
This thesis proposes the design and implementation of a separable convolution accelerator based on HWCK data scheduling. It accelerates depthwise-separable convolution models through dedicated depthwise convolution, pointwise convolution, and batch normalization modules. Through the SoC design, the PS (Processing System) and PL (Programmable Logic) communicate with each other over the AXI4 bus protocol, so the proposed design can be invoked whenever the CPU needs to accelerate a neural network. The HWCK data scheduling method can be reconfigured according to the allocated memory and DDR4 bandwidth resources, and the design can easily be extended when bandwidth and memory are sufficient. To reduce the size of the network's weight parameters, data are computed and stored in 16-bit fixed point. Memory access uses a ping-pong buffer architecture, and data are transmitted over the AXI4 bus protocol. The whole hardware architecture is implemented on a Xilinx ZCU106 development board. The SoC design uses precompiled drivers to connect the operating system with external resources and to control the neural network acceleration modules on the FPGA, while a high-level programming language quickly reconfigures the network schedule, improving hardware reconfigurability. Running FaceNet, this architecture reaches 222 FPS and 60.8 GOPS; power consumption on the ZCU106 board is 8.82 W, giving an efficiency of 6.89 GOPS/W.
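The 16-bit fixed-point storage described above can be sketched as follows. The Q8.8 split between integer and fraction bits, and the helper names, are assumptions for illustration (the abstract does not specify the format); the last line simply checks the reported efficiency figure against the measured throughput and power.

```python
import numpy as np

FRAC_BITS = 8  # assumed Q8.8 split; the actual bit allocation is not given in the abstract

def to_fixed16(x):
    """Quantize a float array to signed 16-bit fixed point with saturation."""
    return np.clip(np.round(x * (1 << FRAC_BITS)), -32768, 32767).astype(np.int16)

def to_float(q):
    """Dequantize back to float for verification."""
    return q.astype(np.float64) / (1 << FRAC_BITS)

w = np.array([0.5, -1.25, 3.0078125])   # toy weight values
q = to_fixed16(w)
print(q)                                 # [ 128 -320  770]
print(to_float(q))                       # exactly recovers these Q8.8-representable values

# Energy efficiency reported in the abstract: throughput / power
print(round(60.8 / 8.82, 2))             # 6.89 (GOPS/W)
```

Halving each weight from 32 to 16 bits halves both the storage and the memory bandwidth per parameter, which is why the fixed-point format matters on a bandwidth-constrained FPGA design.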
Table of Contents
Abstract (Chinese) I
Abstract (English) II
Acknowledgements III
1. Introduction 1
1.1. Research Background and Motivation 1
1.2. Thesis Organization 4
2. Literature Review 5
2.1. Hardware Accelerators 5
2.2. Face Recognition 14
3. Network Model Selection and Results 18
3.1. Network Architecture Selection 18
3.2. Dataset 21
3.3. Training Strategy and Results 22
4. Hardware Architecture Design 23
4.1. Overall System Hardware Block Diagram 23
4.2. Separable Hardware Acceleration Control Module 27
4.3. Memory Transfer Control Module 28
4.4. HWCK Data Scheduling Method 29
4.5. Depthwise CNN Module 32
4.6. Pointwise CNN Module 35
5. Hardware Implementation Results 38
5.1. Depthwise Convolution Simulation Results 38
5.2. Pointwise Convolution Simulation Results 39
5.3. Separable Convolution Hardware Synthesis Results 40
5.4. FPGA Implementation Results 44
5.5. Chip Design Specifications 48
6. Conclusion 49
References 50
Electronic full text (available online from 2022-08-01)