National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

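Author: Yun-Chi Huang (黃昀棨)
Title: Dynamic SIMD Re-convergence with Paired-Path Comparison (使用對徑比較法之動態單指令多數據流收斂)
Advisor: Chung-Ho Chen (陳中和)
Degree: Master's
Institution: National Cheng Kung University (國立成功大學)
Department: Institute of Computer and Communication Engineering (電腦與通信工程研究所)
Discipline: Engineering
Sub-discipline: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: English
Keywords: GPGPU, OpenCL, SIMD Control Divergence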

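In current GPGPU (general-purpose computing on graphics processing units) architectures, SIMD divergence is one of the main causes of reduced parallel computing performance. We evaluate a GPU simulator based on the HSAIL instruction set and run OpenCL kernels on it to observe the GPU's performance and results. The smallest unit of execution in SIMD is the wavefront, which corresponds to a thread in SISD. When a wavefront executes a conditional branch and its work-items evaluate the branch condition differently, work-items of the same wavefront must execute different instructions; this situation is called control divergence. Once control divergence occurs, an auxiliary mechanism must be enabled so that the wavefront can let its work-items execute their different instructions in turn. Handling control divergence this way requires cooperation between the compiler and the GPU, and the choice of handling algorithm affects GPU performance under control divergence. This thesis proposes a new stack-based re-convergence mechanism that lets a wavefront re-converge on its own during execution. The mechanism can work with or without the re-convergence hint instructions generated by the finalizer; without them, it needs no compiler support and executes no extra instructions. With this dynamic re-convergence method, the GPU obtains an average 13.36% improvement in activity factor when running programs with unstructured control flow. The variant that does not rely on re-convergence hint instructions also improves overall performance by saving the time spent executing the extra instructions.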
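The effect of control divergence on lane utilization can be illustrated with a minimal Python sketch. It is not taken from the thesis or its simulator: the function names, the 4-lane wavefront, and the definition of activity factor as the average fraction of active lanes per issued instruction are assumptions made for illustration.

```python
from typing import List

Mask = List[bool]


def serialized_masks(conds: Mask, then_len: int, else_len: int) -> List[Mask]:
    """Active masks, one per issued instruction, for a wavefront running
    `branch; if-body; else-body; join` with the two bodies serialized."""
    full = [True] * len(conds)
    taken = list(conds)
    not_taken = [not c for c in conds]
    masks = [full]                       # the conditional branch itself
    if any(taken):
        masks += [taken] * then_len      # only lanes whose condition held
    if any(not_taken):
        masks += [not_taken] * else_len  # the remaining lanes
    masks.append(full)                   # first instruction after the join
    return masks


def activity_factor(masks: List[Mask]) -> float:
    """Average fraction of active lanes per issued instruction (assumed definition)."""
    if not masks:
        return 0.0
    lanes = len(masks[0])
    return sum(sum(m) for m in masks) / (lanes * len(masks))


# Hypothetical 4-lane wavefront with two-instruction if and else bodies.
uniform = serialized_masks([True, True, True, True], then_len=2, else_len=2)
divergent = serialized_masks([True, True, False, False], then_len=2, else_len=2)
print(len(uniform), activity_factor(uniform))
print(len(divergent), activity_factor(divergent))
```

In this toy case the uniform wavefront issues 4 instructions with every lane active, while the divergent one issues 6 instructions and keeps only two thirds of its lane slots busy, which is the kind of loss a re-convergence mechanism tries to limit.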

SIMD divergence is one of the critical causes of reduced parallel computing efficiency in contemporary GPGPU (general-purpose computing on graphics processing units) architectures. In this thesis, we evaluate a cycle-accurate GPU simulator platform based on HSAIL under the OpenCL framework by offloading kernel programs to the simulator. A wavefront (AMD terminology; a “warp” in NVIDIA terminology) is a group of threads that execute the same instruction in SIMD fashion. When a wavefront executes a conditional branch instruction, its threads may proceed to distinct PCs if they have different branch targets; this is called SIMD control divergence. Re-convergence mechanisms help a divergent wavefront execute its instructions correctly. We develop a new dynamic stack-based re-convergence scheme that can be implemented with or without finalizer-generated re-convergence instructions. With the proposed scheme, a divergent wavefront re-converges dynamically and gains a 13.36% improvement in activity factor on average from opportunistic early re-convergence in unstructured control flow, and performance is better when wavefronts re-converge without finalizer-generated hint instructions.
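The idea of opportunistic early re-convergence can be sketched in a few lines of Python. This is one plausible reading of comparing the PCs of two paired paths, not the algorithm developed in the thesis. The trace-based model, the function names, the smaller-PC-first scheduling, and the example PCs are all assumptions; in particular, the sketch assumes forward-only control flow and no further divergence after the paths merge.

```python
from typing import List, Tuple

Slot = Tuple[int, int]  # (pc, number of active lanes) for one issued instruction


def post_dominator_only(left: List[int], right: List[int],
                        left_lanes: int, right_lanes: int) -> List[Slot]:
    """Baseline: run each path to the common post-dominator, one after the other."""
    return [(pc, left_lanes) for pc in left] + [(pc, right_lanes) for pc in right]


def paired_compare(left: List[int], right: List[int],
                   left_lanes: int, right_lanes: int) -> List[Slot]:
    """Advance the path with the smaller PC; merge as soon as both PCs match."""
    slots: List[Slot] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Model limitation: merged lanes are assumed to stay together.
            assert left[i:] == right[j:], "sketch assumes no further divergence"
            merged = left_lanes + right_lanes
            return slots + [(pc, merged) for pc in left[i:]]
        if left[i] < right[j]:
            slots.append((left[i], left_lanes))
            i += 1
        else:
            slots.append((right[j], right_lanes))
            j += 1
    # The paths never met: finish whichever one still has instructions left.
    slots += [(pc, left_lanes) for pc in left[i:]]
    slots += [(pc, right_lanes) for pc in right[j:]]
    return slots


def activity_factor(slots: List[Slot], width: int) -> float:
    """Average fraction of active lanes per issued instruction (assumed definition)."""
    return sum(n for _, n in slots) / (width * len(slots))


# Hypothetical unstructured region: both paths reach PC 30 before the post-dominator.
left_trace, right_trace = [10, 30, 40], [20, 30, 40]
baseline = post_dominator_only(left_trace, right_trace, 2, 2)
early = paired_compare(left_trace, right_trace, 2, 2)
print(len(baseline), activity_factor(baseline, width=4))
print(len(early), activity_factor(early, width=4))
```

In this example the baseline issues the shared instructions at PCs 30 and 40 twice with half the lanes each, while the paired comparison issues them once with all four lanes, so fewer instructions are issued and the activity factor rises. When the two traces share no PC, both functions issue the same work.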

Chapter 1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Organization
Chapter 2 Background
2.1 OpenCL Programming
2.1.1 OpenCL Platform and Execution Model
2.1.2 OpenCL Memory Model
2.1.3 OpenCL Framework
2.2 Heterogeneous System Architecture (HSA)
2.2.1 HSA Feature
2.2.2 HSAIL
2.3 General Purpose Computing on Graphics Processing Units (GPGPU)
2.3.1 Workitems of a Kernel mapping to a SM
2.3.2 Streaming Multiprocessors
2.3.3 Warp Scheduling
2.3.4 SIMD Divergence and Re-convergence Schemes
Chapter 3 Related Work
3.1 Dual-Path Execution Model
3.1.1 Execution Example
3.2 Implicit Stack-less Re-convergence
3.2.1 Re-convergence Mechanism
3.2.2 Divergent Control Flow Traversal
3.3 Unstructured Control Flow
Chapter 4 Dynamic Re-convergence in Dual-Path Stack
4.1 Observation
4.2 Re-convergence with Dynamic Paired-Path Comparison
4.2.1 Re-convergence Schemes Algorithm
4.2.2 Divergent Control Flow Traversal
4.2.3 Re-convergence Detection Methods
4.2.4 Behavior with Synchronization Barrier
4.2.5 Divergence Stack Implementation
Chapter 5 GPU Simulation Platform
5.1 Overview of HSAIL GPU Simulation Platform
5.2 Streaming Multiprocessor Pipeline
5.3 Finalizer
5.4 Configuration
Chapter 6 Benchmarks and Evaluation
6.1 Benchmarks
6.2 Evaluation
6.2.1 Activity Factor
6.2.2 LD/ST Unit Idle Ratio
6.2.3 SIMD Unit Utilization
6.2.4 Dynamic Instruction Counts
6.2.5 Overall Performance
Chapter 7 Conclusion
References
