
National Digital Library of Theses and Dissertations in Taiwan


Detailed Record

Author: Wang, Yen-Kai (王彥凱)
Title: MMU Cache System and Thread Block Scheduling Enhancement for Virtual Memory Support on GPGPU
Title (Chinese): 支援圖形處理單元上虛擬記憶體的記憶體管理單元快取系統和線程塊排程優化
Advisor: Chen, Tien-Fu (陳添福)
Degree: Master's
Institution: National Chiao Tung University
Department: Institute of Computer Science and Engineering
Discipline: Engineering
Academic Field: Electrical and Computer Engineering
Document Type: Academic thesis
Year of Publication: 2014
Graduation Academic Year: 103 (ROC calendar)
Language: English
Number of Pages: 41
Keywords (Chinese): 圖形處理單元, 異質運算架構, 記憶體管理單元, 虛擬記憶體位置轉換
Keywords (English): GPU, HSA, MMU, virtual address translation
Record statistics:
  • Cited: 0
  • Views: 248
  • Downloads: 17
  • Bookmarked: 0
As the dark silicon phenomenon becomes more pronounced at advanced process nodes, chip performance will be bounded by the power budget, and customized hardware accelerators will become increasingly important. The graphics processing unit (GPU), originally an accelerator for graphics workloads, has evolved into a programmable general-purpose processor (GPGPU) and is moving toward heterogeneous system architecture (HSA), in which the CPU and various customized accelerators (GPU, DSP, etc.) share the same virtual memory space. Every compute unit then needs a memory management unit (MMU) for virtual address translation; however, the GPU's massive volume of concurrent memory accesses places a heavy load on the MMU, so the share of execution time spent on address translation grows.
This work simulates virtual address translation on the GPU, places L1 and L2 TLBs in the memory management unit, and analyzes the sources of performance loss for each benchmark. We further analyze the relationship between GPU block IDs and the addresses they access, and propose a thread block scheduling policy that takes the cost of address translation into account.
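The thread block scheduling idea can be pictured with a small sketch: order blocks so that those whose recorded page footprints overlap run close together in time, which should raise TLB hit rates. The sketch below is only an illustrative greedy ordering under assumed profiling data; the footprint map, the overlap heuristic, and the toy numbers are my assumptions and not the clustering algorithm described in Chapter V of the thesis.

```cpp
// Hedged sketch (not the thesis algorithm): greedily order thread blocks so
// that blocks sharing touched pages are scheduled near each other.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <unordered_set>
#include <vector>

using Footprint = std::unordered_set<uint64_t>;  // virtual page numbers touched by a block

// Count pages two blocks have in common.
static std::size_t overlap(const Footprint& a, const Footprint& b) {
    std::size_t n = 0;
    for (uint64_t p : a) n += b.count(p);
    return n;
}

// Greedy ordering: start from the first block, repeatedly pick the unscheduled
// block that shares the most pages with the block just scheduled.
std::vector<int> schedule(const std::map<int, Footprint>& fp) {
    std::vector<int> order;
    std::unordered_set<int> done;
    int cur = fp.begin()->first;
    order.push_back(cur); done.insert(cur);
    while (order.size() < fp.size()) {
        int best = -1; std::size_t bestOv = 0;
        for (const auto& [id, pages] : fp) {
            if (done.count(id)) continue;
            std::size_t ov = overlap(fp.at(cur), pages);
            if (best == -1 || ov > bestOv) { best = id; bestOv = ov; }
        }
        order.push_back(best); done.insert(best); cur = best;
    }
    return order;
}

int main() {
    // Toy footprints (assumed): blocks 0/2 share pages, blocks 1/3 share pages.
    std::map<int, Footprint> fp = {
        {0, {1, 2, 3}}, {1, {10, 11}}, {2, {2, 3, 4}}, {3, {11, 12}}};
    for (int id : schedule(fp)) std::cout << id << ' ';
    std::cout << '\n';   // prints: 0 2 1 3
}
```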
As the “dark silicon” phenomenon becomes more pronounced at advanced process nodes, IC performance will soon be bounded by the power budget, and research on customized hardware accelerators is gradually displacing mainstream work on CPU design. The graphics processing unit (GPU) was originally developed to accelerate graphics computation; it is now evolving into a programmable, general-purpose computing unit (GPGPU). Looking forward, heterogeneous system architecture (HSA) will have all computing units (CPU, GPU, DSP, etc.) share the same virtual address space, which simplifies programming and allows data to be shared between these units. As a result, each unit needs an MMU to translate its virtual addresses into physical addresses. However, with the large number of memory accesses issued by these units, system performance may be degraded by the address translation process.
This thesis evaluates the impact of virtual address translation on the GPU through software simulation. We propose placing a private L1 TLB per core and a shared L2 TLB to reduce the overhead of address translation, and we analyze the correlation between block IDs and memory address traces. By collecting runtime information, the scheduler can select a better thread block scheduling strategy and achieve higher performance.
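To make the proposed L1/L2 TLB translation path concrete, here is a minimal cost model: a per-core private L1 TLB backed by a shared L2 TLB, with a page table walk on a miss in both. The entry counts, latencies, FIFO replacement, and 4 KiB page size are illustrative assumptions, not the configuration evaluated in Chapter VI.

```cpp
// Hedged sketch (not the thesis implementation): two-level TLB lookup with a
// page-table-walk fallback, returning a modelled translation latency.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_set>

constexpr uint64_t PAGE_SHIFT = 12;                     // assume 4 KiB pages
constexpr std::size_t L1_ENTRIES = 16, L2_ENTRIES = 512;  // assumed sizes
constexpr int L1_LAT = 1, L2_LAT = 10, PTW_LAT = 100;     // assumed cycle costs

struct Tlb {
    std::size_t capacity;
    std::deque<uint64_t> fifo;                 // FIFO replacement for simplicity
    std::unordered_set<uint64_t> entries;
    explicit Tlb(std::size_t cap) : capacity(cap) {}
    bool lookup(uint64_t vpn) const { return entries.count(vpn) != 0; }
    void insert(uint64_t vpn) {
        if (entries.count(vpn)) return;
        if (fifo.size() == capacity) { entries.erase(fifo.front()); fifo.pop_front(); }
        fifo.push_back(vpn);
        entries.insert(vpn);
    }
};

// Modelled latency of translating one (already coalesced) virtual address.
int translate(Tlb& l1, Tlb& l2, uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    if (l1.lookup(vpn)) return L1_LAT;
    if (l2.lookup(vpn)) { l1.insert(vpn); return L1_LAT + L2_LAT; }
    l2.insert(vpn); l1.insert(vpn);            // fill both levels after the walk
    return L1_LAT + L2_LAT + PTW_LAT;
}

int main() {
    Tlb l1(L1_ENTRIES), l2(L2_ENTRIES);
    long total = 0;
    for (int i = 0; i < 1024; ++i)             // a simple strided address stream
        total += translate(l1, l2, 0x10000000ULL + i * 256);
    std::cout << "modelled translation cycles: " << total << "\n";
}
```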

Chinese Abstract i
ABSTRACT ii
Table of Contents v
List of Tables viii
List of Figures ix
I. Introduction 1
1.1 Motivation and Introduction 1
1.2 Contribution 2
1.3 Report Organization 3
II. Background and Related Work 4
2.1 GPGPU programming model 4
2.1.1 Program Execution 4
2.1.2 GPU Memory model 5
2.1.3 Virtual Address Translation Support 5
2.2 HSA Specification 5
2.2.1 Shared Virtual Memory and Memory Coherence 5
2.2.2 HSA Queue and Agent Scheduling 6
2.2.3 HSA Component Context Switching 6
2.3 Related Work 6
2.3.1 GPU virtual memory system study 6
2.3.2 Multi-programming on GPU 7
III. GPGPU-Sim Framework 9
3.1 Overview 9
3.2 Simulation Flow 10
3.2.1 Shader Core Cluster 10
3.2.2 SIMT core 11
3.2.3 GPU Memory Hierarchy 12
3.2.4 Load/Store unit 13
IV. MMU Support in GPGPU-Sim 15
4.1 GPU MMU configuration 15
4.2 System Architecture 16
4.2.1 Page Coalescer 17
4.2.2 TLB 18
4.2.3 MMU shared cache 18
4.2.4 Page Table Walker (PTW) 19
4.2.5 Virtual Address Maps To Physical Address 21
V. Thread Block Level Scheduling 22
5.1 Scheduler Adjustment for Address Translation 22
5.2 GPU programming 23
5.3 Memory footprint of benchmarks 25
5.4 Cooperative Thread Array (CTA) clustering 26
5.4.1 Memory access correlation 27
5.4.2 Cluster algorithm 27
5.4.3 Correlation of Block index and Cluster Center 28
5.5 Different scheduling policies 29
5.6 Runtime scheduling policy decision 29
VI. Experimental Evaluation 31
6.1 Simulation Environment 31
6.2 Experimental benchmark and parameter 31
6.3 Experimental Result 32
6.3.1 Different L1 TLB Configuration 32
6.3.2 Shared L2 TLB 34
6.3.3 CTA scheduling 37
VII. Conclusion 39
References 40

[1] J. Power, M. Hill, and D. Wood, "Supporting x86-64 Address Translation for 100s of GPU Lanes," HPCA, 2014.
[2] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces," ASPLOS, 2014.
[3] T. W. Barr, A. L. Cox, and S. Rixner, "Translation Caching: Skip, Don't Walk (the Page Table)," ACM SIGARCH Computer Architecture News, vol. 38, no. 3, 2010.
[4] J. Lee, M. Samadi, and S. Mahlke, "VAST: The Illusion of a Large Memory Space for GPUs," PACT, 2014.
[5] B. Pham et al., "CoLT: Coalesced Large-Reach TLBs," MICRO, 2012.
[6] A. Bhattacharjee, "Large-Reach Memory Management Unit Caches," MICRO, 2013.
[7] B. Pham et al., "Increasing TLB Reach by Exploiting Clustering in Page Translations," HPCA, 2014.
[8] H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," ISCA, 2011.
[9] A. Bhattacharjee, D. Lustig, and M. Martonosi, "Shared Last-Level TLBs for Chip Multiprocessors," HPCA, 2011.
[10] I. Tanasic, I. Gelado, J. Cabezas, et al., "Enabling Preemptive Multiprogramming on GPUs," ISCA, 2014.
[11] A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," ISPASS, 2009.
[12] N. Jiang et al., "A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator," ISPASS, 2013.
[13] ARM Cortex-A57 MPCore Processor Technical Reference Manual, revision r1p2.
[14] S. Che et al., "Rodinia: A Benchmark Suite for Heterogeneous Computing," IISWC, 2009.
[15] OpenCL specification, http://www.khronos.org/opencl/
[16] HSA specification, http://www.hsafoundation.com/
[17] Memory System on Fusion APUs, http://developer.amd.com/wordpress/media/2013/06/1004_final.pdf
[18] CUDA Toolkit, https://developer.nvidia.com/cuda-toolkit
[19] GPGPU-Sim manual, http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual
[20] NVIDIA Fermi Compute Architecture Whitepaper.
[21] Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 Processors.
[22] N. Amit, M. Ben-Yehuda, and B.-A. Yassour, "IOMMU: Strategies for Mitigating the IOTLB Bottleneck," Computer Architecture, Springer Berlin Heidelberg, 2012.
[23] Intel Virtualization Technology for Directed I/O.
