Graduate student (English name): Chang, Tzu-Chieh
Thesis title (English): Performance Comparisons with OpenMP and CUDA Parallel Programming on Multicore Systems
Advisor (English name): Yang, Chao-Tung
Oral defense committee (English names): Yang, Wuu; Yang, Chao-Tung; Le, Fang-Yie; Chang, Yue-Shan; Shih, Wen-Chung
Keywords (English): Auto-Parallel; Parallel Programming; Multicore; OpenMP; CUDA; MPI
Multicore processors now hold a growing share of the market, and programmers must face the impact of the multicore revolution. Semiconductor operating temperature and power consumption limit performance growth for single-core microprocessors, which has led many microprocessor vendors to turn to multicore chip organizations. This trend applies not only to CPUs but also to GPUs. Parallel processing is therefore both an opportunity and a challenge: explicit parallelization of software, by the programmer or by the compiler, is the key to enhancing performance on multicore chips. In this thesis, we introduce several OpenMP-based automatic parallelization tools, which can reduce the time spent rewriting code for parallel processing on multicore systems. We then explore ROSE in depth and implement an interface that reduces the complexity of use. Some of these tools can also perform automatic parallelization for CUDA. In addition, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations are assigned to MPI processes and processed in parallel by CUDA, run by the processor cores in the same computational node. Our experiments have two parts. In the first, we verify the availability and correctness of the auto-parallel tools and discuss their performance on CPU, GPU, and embedded systems. In the second, we verify that hybrid programming can improve performance.
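The loop partitioning mentioned in the abstract can be illustrated with a minimal sketch in C. It is not the thesis implementation: the names `iter_range`, `partition_iterations`, and `process_range` are hypothetical, the block distribution is an assumed strategy, and the per-node work runs on the CPU as a stand-in for the point where an MPI process would launch a CUDA kernel. The `node` argument plays the role of an MPI rank.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of loop iterations owned by one node. */
typedef struct {
    size_t begin;
    size_t end;
} iter_range;

/*
 * Block-partition `total` loop iterations over `num_nodes` nodes.
 * The first (total % num_nodes) nodes receive one extra iteration,
 * so chunk sizes differ by at most one.
 * Hypothetical helper: `node` stands in for an MPI rank.
 */
iter_range partition_iterations(size_t total, size_t num_nodes, size_t node)
{
    assert(num_nodes > 0 && node < num_nodes);
    size_t base = total / num_nodes;
    size_t extra = total % num_nodes;
    iter_range r;
    r.begin = node * base + (node < extra ? node : extra);
    r.end = r.begin + base + (node < extra ? 1 : 0);
    return r;
}

/*
 * Stand-in for the per-node work (assumption: a sum of squares).
 * In a hybrid setup this is where a CUDA kernel would be launched.
 * When compiled with OpenMP support the pragma spreads the iterations
 * over the CPU cores; otherwise the compiler ignores it.
 */
double process_range(const double *data, iter_range r)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = r.begin; i < r.end; ++i) {
        sum += data[i] * data[i];
    }
    return sum;
}
```

Each process would call `partition_iterations(total, num_nodes, node)` with its own rank and then work only on the returned range. The partial results would still have to be combined afterwards, for example with a reduction across the processes.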
Abstract (in Chinese)
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Motivations
1.2 The Goal and Contributions
1.3 Thesis Organization
Chapter 2 Background Review
2.1 Parallel Programming
2.1.1 CTM
2.1.2 OpenCL
2.1.3 CUDA
2.1.4 MPI
2.1.5 OpenMP
2.1.6 Pthread
2.1.7 TBB
2.2 Auto-Parallel Tools
2.2.1 ROSE
2.2.2 Open64 Compiler
2.2.3 Intel® Composer XE 2011
2.2.4 The Portland Group
2.2.5 PAR4ALL
Chapter 3 System Hardware
3.1 Tesla C1060 GPU computing processor
3.2 Tesla S1070 GPU computing system
3.3 ARM11 MPCore Processor
Chapter 4 System Design and Implementation
4.1 Automatic Parallelization
4.1.1 Algorithm
4.1.2 Liveness Analysis
4.1.3 Dependence Analysis
4.1.4 Variable Classification
4.1.5 Interface
4.2 Hybrid Parallel Programming
4.2.1 Combining MPI and CUDA
4.2.2 Combining OpenMP and CUDA
4.2.3 System model and approach
Chapter 5 Experimental Results
5.1 Part of Auto-parallelism
5.1.1 CPU (OpenMP version)
5.1.2 GPU (CUDA version)
5.1.3 Embedded System (OpenMP version)
5.2 Part of Hybrid Parallel Programming
Chapter 6 Conclusions and Future Work
6.1 Concluding Remark
6.2 Future Work
Bibliography
Appendix
A. Setup of auto-parallel tool
1 ROSE
2 Par4All
3 Intel® Composer XE 2011 for Linux
4 PGI Accelerator C/C++ Workstation 10.9
5 Open64 compiler 4.2.3
B. Interface
1 VMC-PPO
2 Plug-in of Eclipse

