|
|