|
[1] GPU GFlops. http://kyokojap.myweb.hinet.net/gpu_gflops.
[2] The Samsung Exynos 7420 mobile SoC. http://www.anandtech.com/show/9330/exynos-7420-deep-dive/2.
[3] ARM. CoreLink CCI-400 cache coherent interconnect, 2012.
[4] I. Bratt. HSA queuing. In HCS, 2013.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IISWC, pages 44–54, 2009.
[6] L.-J. Chen, H.-Y. Cheng, P.-H. Wang, and C.-L. Yang. Improving GPGPU performance via cache locality aware thread block scheduling. CAL, 2017.
[7] L. Cheng, J. B. Carter, and D. Dai. An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing. In HPCA, 2007.
[8] A. Kayi, O. Serres, and T. El-Ghazawi. Adaptive Cache Coherence Mechanisms with Producer-Consumer Sharing Optimization for Chip Multiprocessors. IEEE Transactions on Computers, 2013.
[9] NVIDIA. NVIDIA CUDA C Programming Guide Ver 4.2, 2012.
[10] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneous System Coherence for Integrated CPU-GPU Systems. In MICRO, 2013.
[11] P. Rogers. Heterogeneous system architecture overview. In HCS, 2013.
[12] L. Wang, R.-W. Tsai, S.-C. Wang, K.-C. Chen, P.-H. Wang, H.-Y. Cheng, Y.-C. Lee, S.-J. Shu, C.-C. Yang, M.-Y. Hsu, L.-C. Kan, C.-L. Lee, T.-C. Yu, R.-D. Peng, C.-L. Yang, Y.-S. Hwang, J.-K. Lee, S.-L. Tsao, and M. Ouhyoung. Analyzing OpenCL 2.0 workloads using a heterogeneous CPU-GPU simulator. In ISPASS, 2017.
[13] Y. Yang, P. Xiang, M. Mantor, and H. Zhou. CPU-Assisted GPGPU on Fused CPU-GPU Architectures. In HPCA, 2012.
|