
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 蔡念穎
Author (English): TSAI, NIAN-YING
Title (Chinese): 稀疏矩陣向量乘法於GPU中執行之工作量分配策略之探討
Title (English): On Job Allocation Strategies for Running Sparse Matrix-Vector Multiplication on GPUs
Advisor: 伍朝欽
Advisor (English): Wu, Chao-Chin
Committee: 伍朝欽、魏凱城、許慶賢
Committee (English): Wu, Chao-Chin; Wei, Kai-Cheng; Hsu, Ching-Hsien
Oral Defense Date: 2017-07-20
Degree: Master's
Institution: National Changhua University of Education
Department: Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic thesis
Publication Year: 2017
Graduation Academic Year: 105 (2016-17)
Language: Chinese
Pages: 45
Keywords (Chinese): 稀疏矩陣、LightSpMV演算法、Compressed Sparse Row (CSR)、GPU、平行計算
Keywords (English): Sparse Matrix; LightSpMV algorithm; Compressed Sparse Row (CSR); GPU; Parallel computing
Statistics:
  • Cited: 0
  • Views: 278
  • Rating:
  • Downloads: 0
  • Bookmarked: 0
With the arrival of the big data era, the amount of data to be processed keeps growing, and the Graphics Processing Unit (GPU) has been widely used to handle parallelizable problems. Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in many fields, yet there is still considerable room to improve its performance on GPUs. This thesis accelerates SpMV by improving the job allocation strategies of the LightSpMV algorithm on GPUs. LightSpMV operates on the standard Compressed Sparse Row (CSR) format, a common sparse-matrix storage format that is more flexible and easier to manipulate than the alternatives. LightSpMV provides two dynamic scheduling methods, which distribute matrix rows to vectors and to warps respectively; both obtain the next row index through atomic operations. Because these atomic operations consume too much execution time, we propose three strategies for this part of the workload allocation: (1) with the warp as the scheduling unit, double the number of rows claimed per allocation, which reduces the number of atomic operations; (2) with the block as the scheduling unit, allocate rows dynamically, which eliminates even more atomic operations than warp-level scheduling; (3) again with the block as the scheduling unit, assign each block its rows statically, and within each block distribute rows to warps dynamically, replacing the atomic operations entirely. In experiments on a GTX 980 GPU, the third strategy performs best, improving performance by up to nearly 100%.
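To make the CSR layout and the atomic row-scheduling pattern concrete, below is a minimal CUDA sketch of the warp-level dynamic scheduling that the LightSpMV baseline and strategy (1) build on. The kernel name, variable names, and the ROWS_PER_FETCH constant are illustrative assumptions, not identifiers from the thesis.

#include <cuda_runtime.h>

#define WARP_SIZE 32
#define ROWS_PER_FETCH 2  // rows claimed per atomicAdd; 1 = baseline, 2 = strategy (1)

// CSR stores a sparse matrix in three arrays. For
//   [ 1 0 2 ]
//   [ 0 3 0 ]
//   [ 4 0 5 ]
// vals    = {1, 2, 3, 4, 5}   (nonzeros, row by row)
// col_idx = {0, 2, 1, 0, 2}   (column of each nonzero)
// row_ptr = {0, 2, 3, 5}      (where each row starts in vals)

__global__ void spmv_csr_warp_dynamic(int num_rows,
                                      const int *row_ptr, const int *col_idx,
                                      const float *vals, const float *x,
                                      float *y, int *row_counter)
{
    const int lane = threadIdx.x % WARP_SIZE;

    while (true) {
        int first_row;
        if (lane == 0)  // one atomic fetch per warp claims the next batch of rows
            first_row = atomicAdd(row_counter, ROWS_PER_FETCH);
        first_row = __shfl_sync(0xffffffff, first_row, 0);  // broadcast to the warp
        if (first_row >= num_rows) break;

        const int last_row = min(first_row + ROWS_PER_FETCH, num_rows);
        for (int row = first_row; row < last_row; ++row) {
            // all 32 lanes cooperate on one row's nonzeros
            float sum = 0.0f;
            for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += WARP_SIZE)
                sum += vals[j] * x[col_idx[j]];
            // warp-wide reduction of the partial sums
            for (int off = WARP_SIZE / 2; off > 0; off /= 2)
                sum += __shfl_down_sync(0xffffffff, sum, off);
            if (lane == 0) y[row] = sum;
        }
    }
}

The host would reset row_counter to zero (for example, with cudaMemset) before each launch. Raising ROWS_PER_FETCH halves the number of atomicAdd calls at the cost of coarser load balancing, which is the trade-off the block-level strategies (2) and (3) push further.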
Chinese Abstract I
English Abstract II
Acknowledgements III
Table of Contents IV
List of Figures V
List of Tables VI
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Objectives 3
1.3 Main Contributions 3
Chapter 2 Background and Related Work 5
2.1 CUDA and GPU Architecture 5
2.1.1 CUDA 5
2.1.2 GPU Hierarchical Memory Architecture 7
2.2 Compressed Sparse Row (CSR) Format 10
2.3 Sparse Matrix-Vector Multiplication 10
2.4 The LightSpMV Algorithm 11
Chapter 3 Methodology 18
3.1 Drawbacks of the LightSpMV Algorithm 18
3.2 Design Concepts 19
3.3 WarpCSS 24
3.4 BlockCSS 25
3.5 BlockStatic 27
Chapter 4 Experimental Results 29
4.1 GFLOPS 31
4.2 Speedup 38
Chapter 5 Conclusions and Future Work 43
References 44
[1] Guillaume Colin de Verdière, "Introduction to GPGPU, a hardware and software background", Comptes Rendus Mécanique, vol. 339, issues 2-3, pp. 78-89, February-March 2011.
[2] Zhiyi Yang, Yating Zhu, Yong Pu, "Parallel Image Processing Based on CUDA", Computer Science and Software Engineering, vol. 3, pp. 198-201, 2008.
[3] Volker Sperschneider, "RNA Secondary Structure Prediction", January 2008.
[4] D.-J. Chang, C. Kimmer, M. Ouyang, "Accelerating the Nussinov RNA folding algorithm with CUDA/GPU", 2010 IEEE International Symposium on Signal Processing and Information Technology, article no. 5711746, pp. 120-125.
[5] Michał Czapiński, Stuart Barnes, "Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform", J. Parallel Distrib. Comput., vol. 71, pp. 802-811, 2011.
[6] A. J. C. Crespo, J. M. Dominguez, D. Valdez-Balderas, B. D. Rogers, M. Gomez-Gesteira, "Smoothed particle hydrodynamics on GPU computing", 2nd International Conference on Particle-Based Methods, PARTICLES 2011, pp. 922-929.
[7] E. Rustico, G. Bilotta, G. Gallo, A. Hérault, C. Del Negro, "Smoothed particle hydrodynamics simulations on multi-GPU systems", Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, article no. 6169576, pp. 384-391.
[8] Yongchao Liu and Bertil Schmidt, "LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs", 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015.
[9] NVIDIA CUDA. (2012). CUDA Parallel Computing Platform [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html
[10] T. R. Halfhill, "Parallel processing with CUDA: NVIDIA's high-performance computing platform uses massive multithreading", Microprocessor Rep., pp. 1-8, Jan. 2008.
[11] NVIDIA. (2009). NVIDIA CUDA 2.0 Programming Guide [Online]. Available: http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf
[12] NVIDIA CUDA. (2014). CUDA C Programming Guide [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[13] N. Bell and M. Garland, "Implementing Sparse Matrix-Vector Multiplication on Throughput-oriented Processors", Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009.
[14] Chao-Tung Yang and Shun-Chyi Chang, "A Parallel Loop Self-Scheduling on Extremely Heterogeneous PC Clusters", High-Performance Computing Laboratory, Department of Computer Science and Information Engineering.
[15] T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection", ACM Transactions on Mathematical Software, vol. 38, no. 1, 2011.
[16] Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 1085-1097, 2013.