Graduate Student: Cheng-Hsuan Li (李承軒)
Thesis Title (Chinese): 基於最後一層快取記憶體加權存取延遲值之中央處理器與繪圖處理器異質性架構的快取記憶體分割機制
Thesis Title (English): Weighted LLC Latency-Based Run-Time Cache Partitioning for Heterogeneous CPU-GPU Architecture
Advisor: Chia-Lin Yang (楊佳玲)
Committee Members: Shih-Hao Hung (洪士灝), Wei-Chung Hsu (徐慰中), Shih-Lien Lu (呂士濂)
Defense Date: 2014-07-28
Degree: Master's
Institution: National Taiwan University
Department: Graduate Institute of Computer Science and Information Engineering
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic thesis
Year of Publication: 2014
Graduation Academic Year: 102 (2013-2014)
Language: English
Pages: 37
Keywords (Chinese): 快取記憶體分割、異質性平台、主記憶體存取延遲、中央處理器、繪圖處理器
Keywords (English): cache partitioning, off-chip latency, heterogeneous architecture, CPU, GPU
Usage statistics:
  • Cited by: 0
  • Views: 395
  • Downloads: 0
  • Bookmarked: 0
In recent years, integrated CPU-GPU heterogeneous architectures have become mainstream in the processor market. On such integrated architectures, the last-level cache (LLC) is under heavy access pressure from multiple CPU cores and the GPU, so its management policy has become a key issue for improving overall system performance. Conventional management policies aim to resolve inter-application interference in the cache, and cache partitioning is a widely used technique; partition capacity is typically allocated so as to minimize the total system cache miss rate. In a heterogeneous system, however, the GPU's high data access rate and its tolerance for memory latency cause such allocation policies to assign cache capacity to the GPU even though the resulting performance gain is limited, forfeiting the opportunity to maximize overall system performance. Later allocation policies designed for heterogeneous systems, such as TAP, recognize this problem and instead optimize the CPU's cache hit rate whenever the cache is detected to contribute little to GPU performance.
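To make this concrete, below is a minimal Python sketch of the conventional utility-based allocation described above, which hands each cache way to whichever application's miss curve drops the most. The miss curves and function names are invented for this example and are not taken from the thesis.

```python
# Sketch of conventional utility-based (UCP-style) way allocation:
# greedily assign each way to the application whose measured miss
# curve drops the most, i.e., minimize total LLC misses.

def ucp_allocate(miss_curves, total_ways):
    """miss_curves[i][w] = misses of application i when given w ways."""
    n = len(miss_curves)
    alloc = [1] * n                      # every application gets one way
    for _ in range(total_ways - n):      # hand out the remaining ways
        gains = [miss_curves[i][alloc[i]] - miss_curves[i][alloc[i] + 1]
                 for i in range(n)]
        best = max(range(n), key=lambda i: gains[i])
        alloc[best] += 1
    return alloc

# A latency-tolerant streaming GPU workload has a nearly flat miss curve,
# but its raw miss counts can dwarf the CPU's, so the greedy loop still
# hands it most of the ways; this is the pathology described above.
cpu_curve = [9000, 5000, 3000, 2500, 2300, 2200, 2150, 2120, 2100]
gpu_curve = [90000, 89000, 88100, 87300, 86600, 86000, 85500, 85100, 84800]
print(ucp_allocate([cpu_curve, gpu_curve], 8))   # GPU gets 5 of 8 ways
```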

This policy, however, focuses only on the cache's contribution to GPU performance and ignores the cache's role in regulating traffic to off-chip main memory, so it lets the GPU issue frequent accesses directly to main memory. Under limited memory bandwidth, these frequent accesses congest main memory and degrade its response time, which in turn lowers overall system performance. Because an application's execution efficiency depends not only on its cache hit rate but also on main-memory response time, cache capacity should be allocated with both contributions to application performance taken into account.
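The trade-off can be made concrete with a back-of-the-envelope average memory access time (AMAT) calculation; all numbers below are hypothetical and chosen only to illustrate the effect, not taken from the thesis.

```python
# Illustrative AMAT comparison (hypothetical numbers, not thesis data):
# shrinking the GPU's partition to raise the CPU hit rate also raises
# total off-chip traffic, and the resulting DRAM queuing delay can more
# than cancel the hit-rate gain.

HIT_LAT = 20    # LLC hit latency (cycles)
DRAM_LAT = 120  # unloaded DRAM access latency (cycles)

def amat(miss_rate, queue_delay):
    return HIT_LAT + miss_rate * (DRAM_LAT + queue_delay)

# Balanced partition: moderate CPU miss rate, lightly loaded DRAM queue.
print(amat(0.30, 40))    # 20 + 0.30 * 160 = 68 cycles
# CPU-hit-optimized partition: GPU misses flood DRAM, long queue.
print(amat(0.20, 250))   # 20 + 0.20 * 370 = 94 cycles
```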

This thesis first proposes a method that predicts each application's main-memory access latency from the total number of last-level cache misses, so that the latency impact of a given cache allocation on each application's memory accesses can be estimated. Building on this latency information, the thesis further proposes a performance-prediction model based on cache access time, which predicts the performance impact of a cache allocation. Experimental results on 30 heterogeneous multi-programmed workloads show that the proposed method improves performance by 10.7% over the thread-level parallelism-aware cache management policy (TAP), by 6.2% over the utility-based cache partitioning policy (UCP), and by 10.9% over the baseline least-recently-used (LRU) eviction policy.


Integrating the CPU and GPU on the same chip has become the development trend for microprocessor design. In integrated CPU-GPU architectures, utilizing the shared last-level cache (LLC) is a critical design issue due to the pressure on shared resources and the different characteristics of CPU and GPU applications. Because of the latency-hiding capability of the GPU and the huge discrepancy in the number of concurrently executing threads between the CPU and GPU, LLC partitioning can no longer be achieved by simply minimizing overall cache misses as in homogeneous CPUs. The state-of-the-art cache partitioning mechanism distinguishes cache-insensitive GPU applications from cache-sensitive ones and optimizes only the cache misses of CPU applications when the GPU is cache-insensitive. However, optimizing only the cache hit rate of CPU applications generates more cache misses from the GPU and leads to longer queuing delay in the underlying DRAM system. In terms of memory access latency, the loss due to longer queuing delay may outweigh the benefit of a higher cache hit ratio. Therefore, we find that even when the performance of the GPU application is not sensitive to cache resources, the CPU applications' cache hit rate is not the only factor that should be considered in partitioning the LLC. The cache miss penalty, i.e., off-chip latency, is also an important factor in designing an LLC partitioning mechanism for integrated CPU-GPU architectures.

In this thesis, we propose Weighted LLC Latency-Based Run-Time Cache Partitioning for integrated CPU-GPU architectures. To correlate a cache partition with overall performance more accurately, we develop a mechanism that predicts off-chip latency from the total number of cache misses, together with a GPU cache-sensitivity monitor that quantitatively profiles the GPU's performance sensitivity to memory access latency. The experimental results show that the proposed mechanism improves overall throughput by 9.7% over TLP-aware cache partitioning (TAP), 6.2% over Utility-based Cache Partitioning (UCP), and 10.9% over LRU on 30 heterogeneous workloads.
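As a rough illustration of how these two predictors could drive a partitioning decision, the sketch below fits a linear model of off-chip latency versus total LLC misses and scores candidate way splits by sensitivity-weighted latency. All names, structure, and constants here are assumptions for illustration only; the thesis implements its monitors in hardware, and the actual algorithm is given in Chapter 3.

```python
# Hypothetical software sketch of the weighted-latency partitioning idea.
import numpy as np

def fit_offchip_latency(miss_samples, latency_samples):
    """Fit off-chip latency as a linear function of total LLC misses per
    interval, reflecting that DRAM queuing delay grows with traffic."""
    slope, intercept = np.polyfit(miss_samples, latency_samples, 1)
    return lambda total_misses: intercept + slope * total_misses

def pick_partition(miss_curves, sensitivity, total_ways, latency_of):
    """Enumerate CPU/GPU way splits and keep the one with the lowest
    sensitivity-weighted latency. sensitivity[i] comes from a monitor of
    how strongly application i's performance reacts to memory latency."""
    best, best_cost = None, float("inf")
    for cpu_ways in range(1, total_ways):
        gpu_ways = total_ways - cpu_ways
        misses = (miss_curves[0][cpu_ways], miss_curves[1][gpu_ways])
        off_chip = latency_of(sum(misses))       # shared queuing delay
        cost = sum(s * m * off_chip
                   for s, m in zip(sensitivity, misses))
        if cost < best_cost:
            best, best_cost = (cpu_ways, gpu_ways), cost
    return best
```

In this formulation, a latency-tolerant GPU reports low sensitivity, so its misses contribute little to the cost even when numerous, while rising total misses still penalize every application through the shared off-chip latency term.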


口試委員審定書 i
致謝 ii
中文摘要 iii
Abstract iv
Contents vi
List of Figures viii
List of Tables x
1 Introduction 1
2 Background and Motivation 4
2.1 Cache Partitioning 4
2.2 New Issues of Cache Partitioning on Integrated CPU/GPU Architecture 5
2.3 Limitation of the Existing Cache Partitioning Mechanism 8
3 Weighted LLC Latency-Based Run-Time Cache Partitioning 10
3.1 Overview 10
3.2 System State Monitors 12
3.3 Cache Partitioning Module 15
4 Evaluation Methodology 19
5 Experimental Results 22
5.1 Performance Improvement 23
5.2 Effect of Considering Interference in Both LLC and Main Memory Level 26
5.3 Effects of Considering Per-Core Cache Sensitivity 27
6 Related Work 28
7 Conclusion 33
Bibliography 34

[1] Jaekyu Lee and Hyesoon Kim. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. In High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pages 1–12, Feb 2012.
[2] André Rigland Brodtkorb, Trond Runar Hagen, and Martin Lilleeng Sætra. Graphics processing unit (GPU) programming strategies and trends in GPU computing. J. Parallel Distrib. Comput., 73(1):4–13, 2013.
[3] Intel. Intel microarchitecture code name Sandy Bridge.
[4] AMD. AMD accelerated processing unit.
[5] Nvidia. Nvidia Tegra.
[6] Alex Settle, Dan Connors, Enric Gibert, and Antonio Gonzalez. A dynamically reconfigurable cache for multithreaded processors. J. Embedded Comput., 2(2):221–233, April 2006.
[7] Lisa R. Hsu, Steven K. Reinhardt, Ravishankar Iyer, and Srihari Makineni. Communist, utilitarian, and capitalist cache policies on CMPs: Caches as a shared resource. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, PACT ’06, pages 13–22, New York, NY, USA, 2006. ACM.
[8] G.E. Suh, S. Devadas, and L. Rudolph. A new memory monitoring scheme for memory-aware scheduling and partitioning. In High-Performance Computer Architecture, 2002. Proceedings. Eighth International Symposium on, pages 117–128, Feb 2002.
[9] G.E. Suh, L. Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The Journal of Supercomputing, 28(1):7–26, 2004.
[10] Moinuddin K. Qureshi and Yale N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 423–432, Washington, DC, USA, 2006. IEEE Computer Society.
[11] Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, and Mateo Valero. MLP-aware dynamic cache partitioning. In Per Stenstrom, Michel Dubois, Manolis Katevenis, Rajiv Gupta, and Theo Ungerer, editors, High Performance Embedded Architectures and Compilers, volume 4917 of Lecture Notes in Computer Science, pages 337–352. Springer Berlin Heidelberg, 2008.
[12] Guang Suo, Xuejun Yang, Guanghui Liu, Junjie Wu, Kun Zeng, Baida Zhang, and Yisong Lin. IPC-based cache partitioning: An IPC-oriented dynamic shared cache partitioning mechanism. In Convergence and Hybrid Information Technology, 2008. ICHIT ’08. International Conference on, pages 399–406, Aug 2008.
[13] Chenjie Yu and P. Petrov. Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms. In Design Automation Conference (DAC), 2010 47th ACM/IEEE, pages 132–137, June 2010.
[14] Vineeth Mekkat, Anup Holey, Pen-Chung Yew, and Antonia Zhai. Managing shared last-level cache in a heterogeneous multicore processor. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, pages 225–234, Piscataway, NJ, USA, 2013. IEEE Press.
[15] Xing Lin and Rajeev Balasubramonian. Refining the utility metric for utility-based cache partitioning. In 9th Annual Workshop on Duplicating, Deconstructing, and Debunking, 2011.
[16] M. Garrido and J. Grajal. Continuous-flow variable-length memoryless linear regression architecture. Electronics Letters, 49(24):1567–1569, November 2013.
[17] Pablo Royer del Barrio, Miguel Angel Sanchez Marcos, Marisa Lopez Vallejo, and Carlos Alberto Lopez Barrio. Area-efficient linear regression architecture for real-time signal processing on FPGAs. 2011.
[18] Elizabeth Holmes, Eric Ward, and Kellie Wills. MARSS: Multivariate Autoregressive State-Space Modeling, 2013. R package version 3.9.
[19] Elizabeth E. Holmes, Eric J. Ward, and Kellie Wills. MARSS: Multivariate autoregressive state-space models for analyzing time-series data. The R Journal, 4(1):30, 2012.
[20] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS, pages 163–174. IEEE, 2009.
[21] Po-Han Wang, Chien-Wei Lo, Chia-Lin Yang, and Yu-Jung Cheng. A cycle-level SIMT-GPU simulation framework. In Rajeev Balasubramonian and Vijayalakshmi Srinivasan, editors, ISPASS, pages 114–115. IEEE, 2012.
[22] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. DRAMSim2: A cycle-accurate memory system simulator. Computer Architecture Letters, 10(1):16–19, 2011.
[23] J. Lotze, P.D. Sutton, and H. Lahlou. Many-core accelerated LIBOR swaption portfolio pricing. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pages 1185–1192, Nov 2012.
[24] Yuejian Xie and Gabriel H. Loh. PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, pages 174–183, New York, NY, USA, 2009. ACM.
[25] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon Steely, Jr., and Joel Emer. Adaptive insertion policies for managing shared caches. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, pages 208–219, New York, NY, USA, 2008. ACM.
[26] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely, Jr., and Joel Emer. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, pages 60–71, New York, NY, USA, 2010. ACM.

QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top
系統版面圖檔 系統版面圖檔