Thesis title (English): Performance Evaluation of Graphics Processor in Finite Element Computation
Advisor (English name): Shi-Pin Ho
Keywords (English): finite element computation; graphics processor; evaluate
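The graphics processors used in this work are NVIDIA's GTX 260, built on GT200, the second-generation GPU architecture, which supports double-precision floating-point arithmetic, and the recently released GTX 470, built on Fermi, the third-generation architecture. We exploit the GPU's many processing units and high memory bandwidth, together with the new CUDA architecture, which unifies all processing units and makes their computing capability independently usable, to raise computational performance.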


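Tests show that, for the solver stage of the finite element computation, the GeForce GTX 260 is about 27.3 times faster than a single core of the Intel® Core™2 Quad Q6600 CPU, and the GeForce GTX 470 is about 81.7 times faster. However, we also found that, relative to the GPU's peak computing speed, the performance actually achieved in finite element computation is often limited by memory bandwidth.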
The graphics processor has evolved into a many-core processor with tremendous computational horsepower. This work focuses on the performance gain it actually delivers in scientific computation.

In this paper, the graphics processors used are the GTX 260 (GT200 architecture) and the GTX 470 (Fermi architecture), both of which support double-precision floating-point operations. We speed up the computation by exploiting the GPU's multiprocessors and high memory bandwidth through CUDA (Compute Unified Device Architecture), a new architecture that unifies the processing units and makes their computing capability independently usable.

In finite element computation, most of the computation time is spent solving a system of linear equations. In this paper we solve that system with the Jacobi conjugate gradient method. This iterative method consists of vector inner products, vector-vector addition and multiplication, and sparse matrix-vector multiplication. We run each of these operations on the graphics processor and evaluate its performance. Full matrix-matrix multiplication and full matrix-vector multiplication are also studied. Finally, we solve a finite element problem on both the graphics processor and the central processor and compare their performance.
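To make the building blocks named above concrete, the following is a minimal CPU-side sketch of a Jacobi-preconditioned conjugate gradient solver. It is not the thesis's GPU implementation: the compressed sparse row (CSR) layout, the function names (`csr_matvec`, `dot`, `axpy`, `jacobi_cg`) and the tolerance are assumptions chosen for illustration. It only shows how the inner product, the vector update, and the sparse matrix-vector product fit together in each iteration.

```python
from math import sqrt


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A x, with A stored in CSR (assumed layout)."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y


def dot(a, b):
    """Vector inner product."""
    return sum(ai * bi for ai, bi in zip(a, b))


def axpy(alpha, x, y):
    """Vector update: returns alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def csr_diagonal(values, col_idx, row_ptr):
    """Extract the diagonal of A, used as the Jacobi preconditioner."""
    n = len(row_ptr) - 1
    diag = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[k] == i:
                diag[i] = values[k]
    return diag


def jacobi_cg(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A; returns (x, iterations)."""
    inv_diag = [1.0 / d for d in csr_diagonal(values, col_idx, row_ptr)]
    x = [0.0] * len(b)
    r = list(b)                                   # residual for x = 0
    z = [m * ri for m, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    b_norm = sqrt(dot(b, b)) or 1.0
    for it in range(max_iter):
        if sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        Ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rz / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        z = [m * ri for m, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = axpy(rz_new / rz, p, z)               # new search direction
        rz = rz_new
    return x, max_iter


# Small illustrative system: A = [[4,-1,0],[-1,4,-1],[0,-1,4]], exact solution [1, 2, 3].
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
b = [2.0, 4.0, 10.0]
solution, iterations = jacobi_cg(values, col_idx, row_ptr, b)
```

Each iteration performs one sparse matrix-vector product, a few inner products, and a few vector updates, which is why those kernels are the ones worth timing separately.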

Numerical tests show speedups of 27.3 for the GeForce GTX 260 and 81.7 for the GeForce GTX 470 compared with a single core of the Intel® Core™2 Quad Q6600 processor. However, relative to the peak performance of the graphics processor, the performance actually achieved in finite element computation is limited by memory bandwidth.
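The memory-bandwidth limit can be illustrated with a back-of-the-envelope estimate. The sketch below assumes a simple model in which a kernel cannot run faster than the memory bandwidth multiplied by its arithmetic intensity (floating-point operations per byte moved). The helper name `bandwidth_bound_gflops` and the 100 GB/s figure are hypothetical placeholders, not measurements from the thesis.

```python
def bandwidth_bound_gflops(bandwidth_gb_per_s, flops, bytes_moved):
    """Upper bound on GFLOP/s for a kernel limited purely by memory traffic."""
    intensity = flops / bytes_moved          # flops per byte
    return bandwidth_gb_per_s * intensity


ASSUMED_BANDWIDTH = 100.0  # GB/s, hypothetical value for illustration only

# Double-precision inner product: per element, 1 multiply + 1 add,
# and two 8-byte operands read from memory.
dot_bound = bandwidth_bound_gflops(ASSUMED_BANDWIDTH, flops=2, bytes_moved=16)

# Double-precision vector update y = a*x + y: per element, 2 flops,
# two 8-byte reads and one 8-byte write.
axpy_bound = bandwidth_bound_gflops(ASSUMED_BANDWIDTH, flops=2, bytes_moved=24)
```

Under these assumptions the inner product is capped at 12.5 GFLOP/s and the vector update at roughly 8.3 GFLOP/s, regardless of the processor's peak arithmetic rate, which matches the qualitative conclusion stated above.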

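Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Tables
List of Figures
Nomenclature
Chapter 1 Introduction
1.1 Preface
1.2 Literature Review
1.3 Motivation and Objectives
1.4 Thesis Organization
Chapter 2 Basic Theory
2.1 Finite Element Method
2.2 Data Storage Formats
2.3 Solving Systems of Linear Equations
Chapter 3 Graphics Processor Architecture
3.1 Review
3.2 CUDA (Compute Unified Device Architecture)
3.2.1 Unified
3.2.2 CUDA
3.3 Hardware Architecture of the New Graphics Processors
3.3.1 Overview
3.3.2 Hardware Implementation of Parallel Programs
  SIMD-Capable SMs
  Memory Architecture
3.4 Execution Model
3.4.1 Thread Hierarchy
3.4.2 Memory Hierarchy
3.4.3 Host and Device
3.4.4 Compute Capability
Chapter 4 Performance Optimization
4.1 Memory Optimization
4.1.1 Global Memory
4.1.2 Constant Memory
4.1.3 Texture Memory
4.1.4 Shared Memory
4.2 Code Optimization
4.2.1 Maximizing the Number of Blocks and Warps
4.2.2 Maximizing Arithmetic Intensity
4.2.3 Use of Control Flow Instructions
4.3 Optimization Summary
Chapter 5 Results
5.1 Vector Inner Product
5.2 Vector Addition and Multiplication
5.3 Full Matrix-Matrix Multiplication
5.4 Full Matrix-Vector Multiplication
5.5 Sparse Matrix-Vector Multiplication
5.6 Performance Evaluation
Chapter 6 Conclusions
References
Autobiography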

