臺灣博碩士論文加值系統 (National Digital Library of Theses and Dissertations in Taiwan)


Detailed Record

Author: 張子杰
Author (English): Chang, Tzu-Chieh
Title: 多核心系統之OpenMP和CUDA平行程式效能比較
Title (English): Performance Comparisons with OpenMP and CUDA Parallel Programming on Multicore Systems
Advisor: 楊朝棟
Advisor (English): Yang, Chao-Tung
Committee: 楊武、楊朝棟、呂芳懌、張玉山、時文中
Committee (English): Yang, Wuu; Yang, Chao-Tung; Leu, Fang-Yie; Chang, Yue-Shan; Shih, Wen-Chung
Oral Defense Date: 2011/06/27
Degree: Master's
University: 東海大學 (Tunghai University)
Department: 資訊工程學系 (Computer Science and Information Engineering)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Year of Publication: 2011
Graduation Academic Year: 99 (2010-2011)
Language: English
Pages: 61
Keywords (Chinese): 自動平行、平行編譯、多核心
Keywords (English): Auto-Parallel; Parallel Programming; Multicore; OpenMP; CUDA; MPI
Statistics:
  • Citations: 0
  • Views: 1217
  • Downloads: 69
Multicore processors now occupy an ever-growing share of the market, and programmers must face the impact of this shift. Semiconductor operating temperature and power consumption limit the performance growth of single-core microprocessors, which has led many vendors to turn to multicore chip organizations. Not only CPUs but also GPUs follow the multicore trend. Parallel processing is therefore both an opportunity and a challenge: having the programmer or compiler explicitly parallelize software is the key to better performance on multicore chips. In this thesis, we introduce several OpenMP-based automatic parallelization tools that reduce the time spent rewriting code for parallel execution on multicore systems. We examine the ROSE compiler in depth and implement an interface that simplifies its use; some of these tools can also parallelize code automatically for CUDA. In addition, we propose a hybrid parallel programming approach combining OpenMP, CUDA, and MPI, which partitions loop iterations according to the number of C1060 GPU devices in a GPU cluster consisting of one C1060 and one S1070. Loop iterations are assigned to OpenMP threads or MPI processes and executed in parallel on the GPUs by CUDA. Finally, the experiments in this thesis have two parts. First, we verify the feasibility and correctness of the auto-parallelization tools and compare their performance on a conventional CPU, a GPU, and an embedded system. Second, we verify that the hybrid OpenMP, CUDA, and MPI programming approach indeed improves performance.
Abstract (Chinese) ii
Abstract iii
Acknowledgements iv
Table of Contents v
List of Tables vii
List of Figures viii
Chapter 1 Introduction 1
1.1 Motivations 1
1.2 The Goal and Contributions 2
1.3 Thesis Organization 3
Chapter 2 Background Review 4
2.1 Parallel Programming 4
2.1.1 CTM 4
2.1.2 OpenCL 5
2.1.3 CUDA 5
2.1.4 MPI 7
2.1.5 OpenMP 7
2.1.6 Pthread 8
2.1.7 TBB 8
2.2 Auto-Parallel Tools 9
2.2.1 ROSE 9
2.2.2 Open64 Compiler 9
2.2.3 Intel® Composer XE 2011 9
2.2.4 The Portland Group 10
2.2.5 PAR4ALL 10
Chapter 3 System Hardware 11
3.1 Tesla C1060 GPU computing processor 11
3.2 Tesla S1070 GPU computing system 12
3.3 ARM11 MPCore Processor 12
Chapter 4 System Design and Implementation 15
4.1 Automatic Parallelization 15
4.1.1 Algorithm 16
4.1.2 Liveness Analysis 17
4.1.3 Dependence Analysis 18
4.1.4 Variable Classification 18
4.1.5 Interface 19
4.2 Hybrid Parallel Programming 24
4.2.1 Combining MPI and CUDA 24
4.2.2 Combining OpenMP and CUDA 26
4.2.3 System model and approach 27
Chapter 5 Experimental Results 29
5.1 Part of Auto-parallelism 29
5.1.1 CPU (OpenMP version) 31
5.1.2 GPU (CUDA version) 41
5.1.3 Embedded System (OpenMP version) 43
5.2 Part of Hybrid Parallel Programming 47
Chapter 6 Conclusions and Future Work 51
6.1 Concluding Remark 51
6.2 Future Work 52
Bibliography 53
Appendix 59
A. Setup of auto-parallel tool 59
1 ROSE 59
2 Par4All 59
3 Intel® Composer XE 2011 for Linux 60
4 PGI Accelerator C/C++ Workstation 10.9 60
5 Open64 compiler 4.2.3 61
B. Interface 61
1 VMC-PPO 61
2 Plug-in of Eclipse 61

