臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)

Detailed Record

我願授權國圖
: 
twitterline
Researcher: 賴永崧 (Lai, Yung-Sung)
Title: 基於圖靈GPU架構之軟體實體層實作
Title (English): Implementation of Soft PHY on Turing's GPU
Advisor: 許騰尹 (Hsu, Terng-Yin)
Committee: 孫明福 (Sun, Ming-Fu), 許騰尹 (Hsu, Terng-Yin), 賴煒棋 (Lai, Wei-Chi), 廖原德 (Liao, Yuan-Te)
Defense date: 2020-02-10
Degree: Master's
Institution: 國立交通大學 (National Chiao Tung University)
Department: Institute of Computer Science and Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2020
Graduation academic year: 108 (2019–2020)
Language: Chinese
Pages: 34
Keywords (Chinese): 長期演進技術, 半精度, 張量核心
Keywords (English): LTE, Half-precision Floating-point Format, Tensor core
Metrics: cited 0 · views 259 · downloads 0
Abstract: With rising demands on network transmission efficiency and quality, fifth-generation mobile networks (5G) have become one of the spotlights of the communications field. On the hardware-design side, a software-defined physical layer (soft-PHY) offers greater scalability and extensibility under the 5G specifications. However, on soft-PHY platforms, simply using the CPU and GPU as the system's processors has gradually become unable to meet the needs of developers and users. To improve the overall efficiency of GPU parallel computing, this work both lowers the numerical precision of the computation and adopts the new Turing GPU architecture in place of the traditional approach, allowing the PHY layer to execute more efficiently. This thesis therefore focuses on applying half precision to the PUSCH and on the resulting improvements across two different GPU architectures.
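The abstract's trade between precision and speed rests on the IEEE-754 half-precision format (FP16) named in the keywords. As a quick illustration (a NumPy sketch, not code from the thesis), the resolution given up relative to FP32 can be measured directly:

```python
import numpy as np

# FP16 keeps a 10-bit fraction, FP32 a 23-bit one. Machine epsilon
# (the gap between 1.0 and the next representable value) quantifies
# the precision lost when a pipeline drops to half precision.
eps16 = np.finfo(np.float16).eps   # 2**-10 = 0.0009765625
eps32 = np.finfo(np.float32).eps   # 2**-23 ~= 1.19e-07

# Quantizing FP32 samples to FP16 already perturbs the data before
# any arithmetic happens.
rng = np.random.default_rng(0)
x32 = rng.standard_normal(1024).astype(np.float32)
x16 = x32.astype(np.float16)
quant_err = np.abs(x16.astype(np.float32) - x32).max()

print(eps16 / eps32)   # 8192.0: FP16 steps are 2**13 times coarser
print(quant_err)       # worst-case rounding error of the cast
```

Turing-class tensor cores exploit exactly this format, multiplying FP16 operands while typically accumulating into FP32 to limit error growth.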
Abstract i
Acknowledgements iii
List of Contents iv
List of Figures vi
1. INTRODUCTION 1
2. RELATED WORK 4
2.1 Overview of Soft PHY system 4
2.2 Motivation and problem statement 5
2.3 CUDA 6
2.3.1 Introduction of CUDA 6
2.3.2 CUDA core in GPU 7
2.3.3 Tensor core in GPU 7
2.4 Half-precision floating-point format (FP16) 8
3. RESEARCH METHOD 10
3.1 Overview of RX flow and LTE PUSCH 10
3.1.1 Flow Description 10
3.1.2 Memory Architecture 11
3.2 Concept of Removing the Cyclic Prefix 13
3.2.1 Overview of Cyclic Prefix 13
3.2.2 The Proposed Method 14
3.3 Fast Fourier Transform (FFT) 14
3.3.1 Overview of FFT 14
3.3.2 CUDA FFT API 15
3.4 Channel Estimation in CUDA core 15
3.4.1 Overview of Channel Estimation 15
3.4.2 Implementation of Channel Estimation in GPU 16
3.4.3 Implementation of CFR and Compensation in GPU 18
3.4.4 Problem Encountered between Compensation and the Inverse Fast Fourier Transform (IFFT) 20
3.5 Demodulation in CUDA core 20
3.5.1 Flow Description 20
3.5.2 Problem Encountered in Demodulation 21
3.6 Channel Estimation in Tensor core 22
3.6.1 Flow Description 22
3.6.2 Problem Encountered in Tensor core 22
4. PERFORMANCE MEASUREMENT AND ANALYSIS 24
4.1 Verifying Environment 24
4.2 Comparison from FP32 to FP16 Architecture 24
4.2.1 Remove CP and FFT 24
4.2.2 Channel Estimation 24
4.2.3 Channel Compensation 25
4.2.4 Demodulation 25
4.3 Comparison across Different Environments 26
4.3.1 Comparison of Tesla T4 and Quadro GV100 26
4.3.2 Comparison of Intel i9 and Tesla T4 27
5. CONCLUSION AND FUTURE WORK 28
References 29
Appendix 1 31
Appendix 2 32
Appendix 3 32
Appendix 4 33
Appendix 5 34
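Chapter 3's RX steps above (remove cyclic prefix, FFT, least-squares channel estimation, compensation, demodulation) can be sketched end to end. This is an illustrative toy, not the thesis code: it assumes a single OFDM symbol, a flat complex channel gain `h_true`, QPSK data, and every subcarrier usable as a pilot.

```python
import numpy as np

# Toy single-symbol PUSCH-style RX chain: remove CP -> FFT ->
# LS channel estimation -> zero-forcing compensation -> QPSK demod.
N, CP = 64, 16                      # subcarriers, cyclic-prefix length
rng = np.random.default_rng(1)

# TX side (only to generate a test vector): QPSK-map random bits.
bits = rng.integers(0, 2, size=(N, 2))
tx = ((1 - 2 * bits[:, 0]) + 1j * (1 - 2 * bits[:, 1])) / np.sqrt(2)
time_sig = np.fft.ifft(tx)
with_cp = np.concatenate([time_sig[-CP:], time_sig])  # prepend CP

# Channel: a single flat complex gain (a deliberate simplification).
h_true = 0.8 * np.exp(1j * 0.3)
rx = h_true * with_cp

# RX chain, mirroring the section numbering above.
no_cp = rx[CP:]                     # 3.2: remove cyclic prefix
freq = np.fft.fft(no_cp)            # 3.3: FFT back to subcarriers
h_est = freq / tx                   # 3.4: LS estimate, all-pilot case
eq = freq / h_est                   # 3.4: zero-forcing compensation
rx_bits = np.stack([(eq.real < 0),  # 3.5: hard-decision QPSK demod
                    (eq.imag < 0)], axis=1).astype(int)

assert np.array_equal(rx_bits, bits)   # bits recovered exactly
```

On an actual GPU each stage would be a CUDA kernel (the thesis's TOC points to the CUDA FFT API for the transforms), and the per-subcarrier estimation and compensation arithmetic is what the thesis moves to FP16 and tensor cores.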
Electronic full text (publicly available online from 2025-02-14)