|
[1] C.T. Yang, C.L. Huang and C.F. Lin, “Hybrid CUDA, OpenMP, and MPI Parallel Programming on Multicore GPU Clusters”, Computer Physics Communications, Vol. 182, Issue 1, pp. 266-269, June 25, 2010.
[2] C.T. Yang, C.L. Huang, C.F. Lin and T.C. Chang, “Hybrid Parallel Programming on GPU Clusters”, International Symposium on Parallel and Distributed Processing with Applications (ISPA) 2010, pp. 142-147, Sept. 2010.
[3] P. Alonso, R. Cortina, F.J. Martinez-Zaldivar and J. Ranilla, “Neville elimination on multi- and many-core systems: OpenMP, MPI and CUDA”, J. Supercomputing, in press, doi:10.1007/s11227-009-0360-z, SpringerLink Online Date: Nov. 18, 2009.
[4] F. Bodin and S. Bihan, “Heterogeneous multicore parallel programming for graphics processing units”, Scientific Programming, Vol. 17, pp. 325-336, Nov. 4, 2009.
[5] R. Dolbeau, S. Bihan and F. Bodin, “HMPP: A hybrid multicore parallel programming environment”, Proceedings of the Workshop on General Purpose Processing on Graphics Processing Units (GPGPU 2007), Boston, Massachusetts, USA, Oct. 4, 2007.
[6] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. Buijssen, M. Grajewski and S. Turek, “Exploring weak scalability for FEM calculations on a GPU-enhanced cluster”, Parallel Computing, Vol. 33, Issue 10-11, pp. 685-699, Nov. 2007.
[7] S. Che, M. Boyer, J. Meng, D. Tarjan, J.W. Sheaffer and K. Skadron, “A performance study of general-purpose applications on graphics processors using CUDA”, Journal of Parallel and Distributed Computing, Vol. 68, Issue 10, pp. 1370-1380, Oct. 2008.
[8] C.H. Liao, D. Quinlan, T. Panas and B. de Supinski, “A ROSE-based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries”, International Workshop on OpenMP (IWOMP) 2010, pp. 15-28, accepted March 2010.
[9] C.H. Liao, D. Quinlan, J. Willcock and T. Panas, “Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions”, International Journal of Parallel Programming, Vol. 38, No. 5-6, pp. 361-378, accepted Jan. 2010.
[10] C.H. Liao, D. Quinlan, T. Panas and B. de Supinski, “Towards an Abstraction-Friendly Programming Model for High Productivity and High Performance Computing”, Los Alamos Computer Science Symposium (LACSS) 2009.
[11] C.H. Liao, D. Quinlan, R. Vuduc and T. Panas, “Effective Source-to-Source Outlining to Support Whole Program Empirical Optimization”, Proceedings of LCPC '09, pp. 308-322, 2009.
[12] A. Saebjornsen, J. Willcock, T. Panas, D. Quinlan and Z. Su, “Detecting code clones in binary executables”, ISSTA '09 Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, pp. 117-127, 2009.
[13] T. Panas and D. Quinlan, “Techniques for software quality analysis of binaries: applied to Windows and Linux”, DEFECTS '09 Proceedings of the 2nd International Workshop on Defects in Large Software Systems, held in conjunction with the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2009), pp. 6-10, 2009.
[14] C.H. Liao, D. Quinlan, J. Willcock and T. Panas, “Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore”, IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism, pp. 28-41, 2009.
[15] P. Carribault, M. Pérache and H. Jourdren, “Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC”, International Workshop on OpenMP (IWOMP) 2010, pp. 1-14, 2010.
[16] M.M. Baskaran, J. Ramanujam and P. Sadayappan, “Automatic C-to-CUDA Code Generation for Affine Programs”, in Compiler Construction, Vol. 6011, pp. 244-263, 2010.
[17] S. Gupta and M.R. Babu, “Generating Performance Analysis of GPU compared to Single-core and Multi-core CPU for Natural Language Applications”, International Journal of Advanced Computer Sciences and Applications, Vol. 2, Issue 5, pp. 50-53, 2011.
[18] S. Rivoire and R. Park, “A breadth-first course in multicore and manycore programming”, SIGCSE '10 Proceedings of the 41st ACM Technical Symposium on Computer Science Education, pp. 214-218, 2010.
[19] S. Ryoo, C.I. Rodrigues, S.S. Baghsorkhi, S.S. Stone, D.B. Kirk and W.M.W. Hwu, “Optimization principles and application performance evaluation of a multithreaded GPU using CUDA”, PPoPP '08 Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 73-82, 2008.
[20] M.M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev and P. Sadayappan, “A compiler framework for optimization of affine loop nests for GPGPUs”, ICS '08 Proceedings of the 22nd Annual International Conference on Supercomputing, pp. 225-234, 2008.
[21] S. Kamil, C.Y. Chan, L. Oliker, J. Shalf and S. Williams, “An auto-tuning framework for parallel multicore stencil computations”, Parallel & Distributed Processing (IPDPS) 2010, pp. 1-12, April 2010.
[22] T.P. Chen and Y.K. Chen, “Challenges and opportunities of obtaining performance from multi-core CPUs and many-core GPUs”, ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 613-616, April 2009.
[23] Top 500 Sites for November 2010, http://www.top500.org/lists/2010/11
[24] Top 500 Super Computer Sites, What is Gflop/s, http://www.top500.org/faq/what_gflop_s
[25] Close To Metal wiki, http://en.wikipedia.org/wiki/Close_to_Metal
[26] OpenCL, http://www.khronos.org/opencl/
[27] CUDA, http://en.wikipedia.org/wiki/CUDA
[28] Download CUDA, http://developer.nvidia.com/cuda-downloads
[29] MPI, http://www.mcs.anl.gov/research/projects/mpi/
[30] Intel 64 Tesla Linux Cluster Lincoln webpage, http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64TeslaCluster/
[31] MPICH, A Portable Implementation of MPI, http://www.mcs.anl.gov/research/projects/mpi/mpich1/index.htm
[32] OpenMP Specification, http://openmp.org/wp/about-openmp/
[33] POSIX Threads Programming, https://computing.llnl.gov/tutorials/pthreads/
[34] Intel® Threading Building Blocks, http://www.threadingbuildingblocks.org/
[35] Open64, http://www.open64.net/
[36] Intel, http://software.intel.com/en-us/articles/intel-parallel-studio-xe/
[37] The Portland Group, http://www.pgroup.com/index.htm
[38] PAR4ALL, http://www.par4all.org/
[39] Specification Tesla S1070 GPU Computing System, http://www.nvidia.com/docs/IO/43395/SP-04154-001_v02.pdf
[40] The NVIDIA® Tesla™ S1070 Computing System, http://www.nvidia.com/object/product_tesla_s1070_us.html
[41] NVIDIA Tesla C1060 Computing Processor, http://www.nvidia.com/object/product_tesla_c1060_us.html
[42] NVIDIA CUDA Programming Guide, http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf
[43] ARM11 MPCore, http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php
[44] The CUDA Compiler Driver NVCC, http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/2.0/nvcc_2.0.pdf
[45] Intel® Xeon® Processor E5410, http://ark.intel.com/Product.aspx?id=33080
[46] Benchmarks, http://shootout.alioth.debian.org/u32q/benchmark.php?test=nbody&lang=gcc
[47] Cross compiler, http://en.wikipedia.org/wiki/Cross_compiler
[48] CodeSourcery, http://www.codesourcery.com/
[49] crosstool-NG, http://linux.softpedia.com/get/System/Shells/crosstool-NG-28833.shtml
[50] crosstool-ng WIKI, http://ymorin.is-a-geek.org/dokuwiki/projects/crosstool
[51] How to build cross toolchains for ARM crosstool-NG, http://forum.samdroid.net/wiki/showwiki/How+to+build+cross+toolchains+for+ARM+crosstool-NG
[52] To build cross compiler by crosstool-ng, http://hi.baidu.com/caicry/blog/item/f306db639c4281680c33fa1b.html
[53] To build cross compiler by crosstool-ng, http://blog.chinaunix.net/u3/95743/showart_2067287.html
[54] Intel® Xeon® Processor E5520, http://ark.intel.com/Product.aspx?id=40200
|