
National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)


Detailed Record

Author: 洪柏勝
Author (English): HONG, BO-SHENG
Title: 最大化可執行Warp數量之高效能GPU設計
Title (English): Maximizing the Number of Executable Warps for High Performance GPU Design
Advisor: 陳青文
Advisor (English): CHEN, CHING-WEN
Committee Members: 劉宗杰, 林正堅
Committee Members (English): LIU, TZONG-JYE; LIN, CHENG-JIAN
Oral Defense Date: 2022-07-08
Degree: Master
Institution: 逢甲大學 (Feng Chia University)
Department: 資訊工程學系 (Information Engineering)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis Type: Academic thesis
Publication Year: 2022
Graduation Academic Year: 110 (ROC calendar)
Language: Chinese
Pages: 61
Keywords (Chinese): 圖形處理器; 執行緒層級平行度; 快取爭用; 快取資料配置; 排程
Keywords (English): GPU; TLP; Cache Contention; Data Placement; Scheduling
Statistics:
  • Cited by: 0
  • Views: 125
  • Rating: (none)
  • Downloads: 0
  • Bookmarks: 0
This thesis proposes "Maximizing the Number of Executable Warps for High Performance GPU Design," which covers three research topics: (1) how to place only performance-critical data in the limited L1 cache, reducing cache contention and raising thread-level parallelism (TLP) so that more warps in each SM are executable and IPC improves; (2) how to finish the warps of a CTA quickly so that a new CTA can enter the SM and the number of executable warps rises; and (3) how to lower the fraction of short-latency requests that are delayed behind long-latency requests in the pipeline, again increasing the number of executable warps and improving GPU performance.

For the first topic, we propose TLP Increase with Selective Data (TISD), which admits into the L1 cache only data that helps performance, increasing the number of executable warps in the SM. We further propose an Adaptive TLP Mechanism (ATM) that monitors cache misses and IPC to adjust the amount of TLP dynamically, avoiding the IPC drop caused by cache contention when too much data is kept in the L1 cache. For the second topic, we propose the L2 Priority CTA Queue (L2PCQ), a priority queue for CTAs that lets the earliest CTA's requests in the L2 cache complete sooner, so that no CTA occupies SM resources for too long and lowers the number of executable warps. For the third topic, we propose the Memory Request Filter (MRF), which filters memory requests so that those whose data is already in L1 execute quickly, reducing the impact of pipeline stalls on short-latency requests.
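The ATM half of the first proposal is easiest to see as a feedback loop around the L1 cache: sample miss rate and IPC each window, then raise or lower the per-SM warp limit. Below is a minimal C++ sketch of such a controller, in the style of a GPGPU-Sim component; all names and thresholds (AtmController, kMissThreshold, the warp limits) are illustrative assumptions, not the thesis's actual design.

// A minimal, illustrative ATM-style controller: monitor L1 miss rate and
// IPC per sampling window, and raise or lower the per-SM warp limit.
// All names and thresholds here are assumptions, not the thesis's design.
#include <cstdint>

class AtmController {
public:
    // Feed one sampling window's counters from the SM.
    void sample(uint64_t l1_misses, uint64_t l1_accesses,
                uint64_t instructions, uint64_t cycles) {
        const double miss_rate =
            l1_accesses ? static_cast<double>(l1_misses) / l1_accesses : 0.0;
        const double ipc =
            cycles ? static_cast<double>(instructions) / cycles : 0.0;
        if (miss_rate > kMissThreshold && ipc < last_ipc_) {
            // Contention is hurting: throttle TLP by lowering the warp limit.
            if (warp_limit_ > kMinWarps) --warp_limit_;
        } else if (ipc >= last_ipc_ && warp_limit_ < kMaxWarps) {
            // Performance held or improved: try admitting one more warp.
            ++warp_limit_;
        }
        last_ipc_ = ipc;
    }

    // The warp scheduler consults this limit when choosing how many
    // warps may be active in the SM.
    unsigned warp_limit() const { return warp_limit_; }

private:
    static constexpr unsigned kMinWarps = 4;       // assumed floor
    static constexpr unsigned kMaxWarps = 48;      // e.g. a Fermi-class SM
    static constexpr double kMissThreshold = 0.5;  // assumed tuning knob
    unsigned warp_limit_ = kMaxWarps;
    double last_ipc_ = 0.0;
};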
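For the second topic, the core of L2PCQ is an ordering rule: memory requests belonging to the oldest resident CTA are served first at the L2, so that CTA finishes sooner and frees its SM resources. The following sketch assumes a simple FIFO of pending requests; the class and field names are hypothetical.

// A minimal, illustrative L2PCQ-style selector: requests from the oldest
// launched CTA are served ahead of others; otherwise fall back to FIFO.
#include <algorithm>
#include <cstdint>
#include <deque>
#include <optional>

struct MemRequest {
    uint32_t cta_id;  // owning CTA
    uint64_t addr;    // requested line address
};

class L2PriorityCtaQueue {
public:
    void on_cta_launch(uint32_t cta_id) { order_.push_back(cta_id); }

    void on_cta_finish(uint32_t cta_id) {
        order_.erase(std::remove(order_.begin(), order_.end(), cta_id),
                     order_.end());
    }

    // Pick the next request at L2: prefer the oldest CTA's, else plain FIFO.
    std::optional<MemRequest> select(std::deque<MemRequest>& pending) {
        if (!order_.empty()) {
            for (auto it = pending.begin(); it != pending.end(); ++it) {
                if (it->cta_id == order_.front()) {
                    MemRequest r = *it;
                    pending.erase(it);
                    return r;
                }
            }
        }
        if (pending.empty()) return std::nullopt;
        MemRequest r = pending.front();
        pending.pop_front();
        return r;
    }

private:
    std::deque<uint32_t> order_;  // CTA ids in launch order, oldest first
};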
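For the third topic, the MRF idea can be pictured as two queues in front of the memory pipeline: requests predicted to hit in L1 go to a fast queue and issue ahead of long-latency misses. A minimal sketch, assuming a hit-probe callback; MemoryRequestFilter and its members are hypothetical names, not the thesis's implementation.

// A minimal, illustrative MRF-style request filter. Requests whose data is
// already resident in L1 are steered to a short-latency queue and issued
// ahead of long-latency requests, so they are not blocked by pipeline stalls.
#include <cstdint>
#include <deque>
#include <functional>
#include <optional>
#include <utility>

struct FilteredRequest {
    uint32_t warp_id;
    uint64_t addr;
};

class MemoryRequestFilter {
public:
    explicit MemoryRequestFilter(std::function<bool(uint64_t)> l1_has)
        : l1_has_(std::move(l1_has)) {}

    // Classify an incoming request by whether L1 already holds its line.
    void enqueue(const FilteredRequest& r) {
        (l1_has_(r.addr) ? short_q_ : long_q_).push_back(r);
    }

    // Issue short-latency (L1-resident) requests first.
    std::optional<FilteredRequest> issue() {
        auto& q = short_q_.empty() ? long_q_ : short_q_;
        if (q.empty()) return std::nullopt;
        FilteredRequest r = q.front();
        q.pop_front();
        return r;
    }

private:
    std::function<bool(uint64_t)> l1_has_;    // probe: is the line in L1?
    std::deque<FilteredRequest> short_q_;     // predicted L1 hits
    std::deque<FilteredRequest> long_q_;      // predicted misses
};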
Acknowledgements i
Abstract (Chinese) ii
Abstract (English) iii
Table of Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
1.1 Introduction to GPUs 1
1.2 Research Motivation 4
1.2.1 Relationship between L1 cache misses and the amount of TLP 5
1.2.2 Impact of cache data placement on IPC 7
1.2.3 Relationship between CTA completion and the amount of TLP 9
1.2.4 Causes of SM stalls 11
1.2.5 Load/store unit availability and data-fetch wait time 12
1.3 Research Goals and Methods 13
1.4 Thesis Organization 14
Chapter 2 Related Work 15
2.1 Adjusting the amount of TLP to mitigate cache contention 15
2.2 Maximizing the amount of TLP to improve GPU performance 16
Chapter 3 Proposed Methods 18
3.1 TISD and ATM designs for increasing TLP and reducing cache contention 18
3.1.1 TISD design 18
3.1.2 ATM design for reducing L1 cache contention 21
3.2 L2PCQ design for speeding up completion of CTA warps in the L2 cache 24
3.3 MRF design for executing memory requests with data in L1 as early as possible 29
Chapter 4 Experimental Results 34
4.1 Simulation Environment 34
4.2 Benchmarks 34
4.3 Results and Analysis 35
4.3.1 Effect of the amount of TLP 35
4.3.2 L1 cache contention improvement 38
4.3.3 SM stall improvement 43
4.3.4 L2 cache queue improvement 45
4.3.5 Performance improvement comparison 47
Chapter 5 Conclusion 49
References 50


