
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: 蔡念穎
Author (English): TSAI, NIAN-YING
Title (Chinese): 稀疏矩陣向量乘法於GPU中執行之工作量分配策略之探討
Title (English): On Job Allocation Strategies for Running Sparse Matrix-Vector Multiplication on GPUs
Advisor: 伍朝欽
Advisor (English): Wu, Chao-Chin
Committee: 伍朝欽、魏凱城、許慶賢
Committee (English): Wu, Chao-Chin; Wei, Kai-Cheng; Hsu, Ching-Hsien
Oral Defense Date: 2017-07-20
Degree: Master's
Institution: National Changhua University of Education
Department: Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic thesis
Publication Year: 2017
Graduation Academic Year: 105 (2016-17)
Language: Chinese
Pages: 45
Keywords (Chinese): 稀疏矩陣、LightSpMV演算法、Compressed Sparse Row (CSR)、GPU、平行計算
Keywords (English): Sparse Matrix; LightSpMV algorithm; Compressed Sparse Row (CSR); GPU; Parallel computing
Statistics:
  • Cited: 0
  • Views: 278
  • Rating:
  • Downloads: 0
  • Bookmarked: 0
With the arrival of the big data era, the amount of data to be processed keeps growing, and the Graphics Processing Unit (GPU) has been widely used to handle parallelizable problems. Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in many fields, yet there is still considerable room to improve its performance on GPUs. This thesis accelerates SpMV by improving the job allocation strategies of the LightSpMV algorithm on GPUs. LightSpMV operates on the standard Compressed Sparse Row (CSR) format, a common sparse-matrix storage format that is more flexible and easier to manipulate than the alternatives. LightSpMV provides two dynamic scheduling methods, which distribute matrix rows to vectors and to warps respectively; both obtain the next row index through atomic operations. Because these atomic operations consume too much execution time, we propose three strategies for this part of the workload allocation: (1) with the warp as the scheduling unit, double the number of rows claimed per allocation, which reduces the number of atomic operations; (2) with the block as the scheduling unit, allocate rows dynamically, which eliminates even more atomic operations than warp-level scheduling; (3) again with the block as the scheduling unit, assign each block its rows statically, and within each block distribute rows to warps dynamically, replacing the atomic operations entirely. In experiments on a GTX 980 GPU, the third strategy performs best, improving performance by up to nearly 100%.
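To make the CSR layout and the atomic row-scheduling pattern concrete, below is a minimal CUDA sketch of the warp-level dynamic scheduling that the LightSpMV baseline and strategy (1) build on. The kernel name, variable names, and the ROWS_PER_FETCH constant are illustrative assumptions, not identifiers from the thesis.

#include <cuda_runtime.h>

#define WARP_SIZE 32
#define ROWS_PER_FETCH 2  // rows claimed per atomicAdd; 1 = baseline, 2 = strategy (1)

// CSR stores a sparse matrix in three arrays. For
//   [ 1 0 2 ]
//   [ 0 3 0 ]
//   [ 4 0 5 ]
// vals    = {1, 2, 3, 4, 5}   (nonzeros, row by row)
// col_idx = {0, 2, 1, 0, 2}   (column of each nonzero)
// row_ptr = {0, 2, 3, 5}      (where each row starts in vals)

__global__ void spmv_csr_warp_dynamic(int num_rows,
                                      const int *row_ptr, const int *col_idx,
                                      const float *vals, const float *x,
                                      float *y, int *row_counter)
{
    const int lane = threadIdx.x % WARP_SIZE;

    while (true) {
        int first_row;
        if (lane == 0)  // one atomic fetch per warp claims the next batch of rows
            first_row = atomicAdd(row_counter, ROWS_PER_FETCH);
        first_row = __shfl_sync(0xffffffff, first_row, 0);  // broadcast to the warp
        if (first_row >= num_rows) break;

        const int last_row = min(first_row + ROWS_PER_FETCH, num_rows);
        for (int row = first_row; row < last_row; ++row) {
            // all 32 lanes cooperate on one row's nonzeros
            float sum = 0.0f;
            for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += WARP_SIZE)
                sum += vals[j] * x[col_idx[j]];
            // warp-wide reduction of the partial sums
            for (int off = WARP_SIZE / 2; off > 0; off /= 2)
                sum += __shfl_down_sync(0xffffffff, sum, off);
            if (lane == 0) y[row] = sum;
        }
    }
}

The host would reset row_counter to zero (for example, with cudaMemset) before each launch. Raising ROWS_PER_FETCH halves the number of atomicAdd calls at the cost of coarser load balancing, which is the trade-off the block-level strategies (2) and (3) push further.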
Chinese Abstract I
English Abstract II
Acknowledgements III
Table of Contents IV
List of Figures V
List of Tables VI
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Objectives 3
1.3 Main Contributions 3
Chapter 2 Background and Related Work 5
2.1 CUDA and GPU Architecture 5
2.1.1 CUDA 5
2.1.2 GPU Hierarchical Memory Architecture 7
2.2 Compressed Sparse Row (CSR) Format 10
2.3 Sparse Matrix-Vector Multiplication 10
2.4 The LightSpMV Algorithm 11
Chapter 3 Methodology 18
3.1 Drawbacks of the LightSpMV Algorithm 18
3.2 Design Concepts 19
3.3 WarpCSS 24
3.4 BlockCSS 25
3.5 BlockStatic 27
Chapter 4 Experimental Results 29
4.1 GFLOPS 31
4.2 Speedup 38
Chapter 5 Conclusions and Future Work 43
References 44
[1] Guillaume Colin de Verdière, "Introduction to GPGPU, a hardware and software background", Comptes Rendus Mécanique, vol. 339, issues 2-3, pp. 78-89, February-March 2011.
[2] Zhiyi Yang, Yating Zhu, Yong Pu, "Parallel Image Processing Based on CUDA", Computer Science and Software Engineering, vol. 3, pp. 198-201, 2008.
[3] Volker Sperschneider, "RNA Secondary Structure Prediction", January 2008.
[4] D.-J. Chang, C. Kimmer, M. Ouyang, "Accelerating the Nussinov RNA folding algorithm with CUDA/GPU", 2010 IEEE International Symposium on Signal Processing and Information Technology, article no. 5711746, pp. 120-125.
[5] Michał Czapiński, Stuart Barnes, "Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform", J. Parallel Distrib. Comput., vol. 71, pp. 802-811, 2011.
[6] A. J. C. Crespo, J. M. Dominguez, D. Valdez-Balderas, B. D. Rogers, M. Gomez-Gesteira, "Smoothed particle hydrodynamics on GPU computing", 2nd International Conference on Particle-Based Methods, PARTICLES 2011, pp. 922-929.
[7] E. Rustico, G. Bilotta, G. Gallo, A. Hérault, C. Del Negro, "Smoothed particle hydrodynamics simulations on multi-GPU systems", Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, article no. 6169576, pp. 384-391.
[8] Yongchao Liu and Bertil Schmidt, "LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs", 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015.
[9] NVIDIA CUDA. (2012). CUDA Parallel Computing Platform [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html
[10] T. R. Halfhill, "Parallel processing with CUDA: NVIDIA's high-performance computing platform uses massive multithreading", Microprocessor Rep., pp. 1-8, Jan. 2008.
[11] NVIDIA. (2009). NVIDIA CUDA 2.0 Programming Guide [Online]. Available: http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf
[12] NVIDIA CUDA. (2014). CUDA C Programming Guide [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[13] N. Bell and M. Garland, "Implementing Sparse Matrix-Vector Multiplication on Throughput-oriented Processors", Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009.
[14] Chao-Tung Yang and Shun-Chyi Chang, "A Parallel Loop Self-Scheduling on Extremely Heterogeneous PC Clusters", High-Performance Computing Laboratory, Department of Computer Science and Information Engineering.
[15] T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection", ACM Transactions on Mathematical Software, vol. 38, no. 1, 2011.
[16] Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 1085-1097, 2013.