[1] 3rd Generation Partnership Project, Technical Specification Group Radio Access Network, "Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and channel coding (Release 12)."
[2] N. Nikaein et al., "Towards a cloud-native radio access network," in Advances in Mobile Cloud Computing and Big Data in the 5G Era, C. Mavromoustakis, G. Mastorakis, and C. Dobre, Eds. Cham, Switzerland: Springer, 2017.
[3] S. Markidis et al., "NVIDIA tensor core programmability, performance & precision," in Proc. IEEE Int. Parallel and Distributed Processing Symp. Workshops (IPDPSW), 2018.
[4] Z. Jia et al., "Dissecting the NVIDIA Turing T4 GPU via microbenchmarking," arXiv preprint arXiv:1903.07486, 2019.
[5] M. Jorda, P. Valero-Lara, and A. J. Peña, "Performance evaluation of cuDNN convolution algorithms on NVIDIA Volta GPUs," IEEE Access, 2019.
[6] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annual ACM/IEEE Int. Symp. Computer Architecture (ISCA), 2017.
[7] NVIDIA, "NVIDIA Tesla V100 GPU architecture," 2017, accessed: 2018-01-27. [Online]. Available: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
[8] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," 2011, accessed: 2017-01-27. [Online]. Available: https://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
[9] NVIDIA, "cuBLAS library," NVIDIA Corporation, Santa Clara, CA, 2008.
[10] P. Luszczek et al., "Towards numerical benchmark for half-precision floating point arithmetic," in Proc. IEEE High Performance Extreme Computing Conf. (HPEC), 2017.
[11] A. Abdelfattah, S. Tomov, and J. Dongarra, "Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs," in Proc. IEEE Int. Parallel and Distributed Processing Symp. (IPDPS), 2019.
[12] S. Markidis et al., "NVIDIA tensor core programmability, performance & precision," in Proc. IEEE Int. Parallel and Distributed Processing Symp. Workshops (IPDPSW), 2018.
[13] M. A. Raihan, N. Goli, and T. M. Aamodt, "Modeling deep learning accelerator enabled GPUs," in Proc. IEEE Int. Symp. Performance Analysis of Systems and Software (ISPASS), 2019.
[14] K. Akbudak and C. Aykanat, "Exploiting locality in sparse matrix-matrix multiplication on many-core architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 8, pp. 2258-2271, 2017.
[15] O. Brini and M. Boukadoum, "Virtualization of the LTE physical layer symbol processing with GPUs," in Proc. 15th IEEE Int. New Circuits and Systems Conf. (NEWCAS), 2017.
[16] E. Kilgariff et al., "NVIDIA Turing architecture in-depth," 2018.
[17] Z. Jia et al., "Dissecting the NVIDIA Volta GPU architecture via microbenchmarking," arXiv preprint arXiv:1804.06826, 2018.
[18] X. Mei and X. Chu, "Dissecting GPU memory hierarchy through microbenchmarking," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 1, pp. 72-86, 2016.
[19] NVIDIA, "NVIDIA Tesla V100 GPU architecture: The world's most advanced data center GPU," White Paper WP-08608-001_v1.1, Aug. 2017.
[20] J. Choquette, O. Giroux, and D. Foley, "Volta: Performance and programmability," IEEE Micro, vol. 38, no. 2, pp. 42-52, 2018.
[21] E. Lindholm et al., "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[22] NVIDIA Parallel Forall, "Cooperative Groups: Flexible CUDA thread programming." [Online]. Available: https://devblogs.nvidia.com/parallelforall/cooperative-groups/
[23] "Warp matrix functions (B.16)," CUDA C Programming Guide. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma
[24] NVIDIA, "CUTLASS: Fast linear algebra in CUDA C++." [Online]. Available: https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda/