



Author: Hsueh-Chun Fu
Thesis title: Eliminate IOMMU Address Translation for Accelerator-rich Architecture via Cache Forwarding
Advisor: Chia-lin Yang
Keywords: Heterogeneous Computing; Accelerator-rich Architecture; Virtual Memory System
Emerging accelerator-rich architectures integrate conventional processors and multiple customized accelerators on the same die. Prior studies have introduced an IOMMU to give accelerators a unified virtual address space. However, the IOMMU is slow at page walks, which diminishes the gains of customized accelerators, and its highly associative IOTLB accounts for non-negligible power consumption. Related work proposes an offload page walker that speeds up IOMMU address translation by using the CPU's MMU page walk cache. However, the IOMMU address translation itself remains and still costs performance and power. In this work, instead of letting DMA fetch data through IOMMU address translation, we have the CPU's L1 data cache forward the data directly to the accelerator's scratchpad, avoiding IOMMU address translation. Evaluations show that the proposed mechanism improves execution time by 14.8% over the baseline and by 8% over the state-of-the-art offload page walker, and reduces power by 22.1% on average.
1 Introduction
2 Background
2.1 Architecture of Customized Accelerator
2.2 Accelerator Execution Model
2.3 Motivation
3 Mechanism
3.1 Construct Scratchpad Mapping in L1 Data Cache
3.2 Accelerator-Task Data Evictions
3.3 Software Modifications and Architectural Support
4 Results
5 Related Work
5.1 Design Space Exploration of Customized Accelerators
5.2 Integration of Customized Accelerators
5.3 Studies on Address Translation
6 Conclusion and Future Work
Bibliography