|
[1] Power, Jason, M. Hill, and D. Wood. "Supporting x86-64 Address Translation for 100s of GPU Lanes." HPCA, 2014.
[2] Pichai, Bharath, Lisa Hsu, and Abhishek Bhattacharjee. "Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces." ASPLOS, 2014.
[3] Barr, Thomas W., Alan L. Cox, and Scott Rixner. "Translation caching: skip, don't walk (the page table)." ACM SIGARCH Computer Architecture News. Vol. 38. No. 3. ACM, 2010.
[4] Lee, Janghaeng, Mehrzad Samadi, and Scott Mahlke. "VAST: The Illusion of a Large Memory Space for GPUs." PACT, 2014.
[5] Pham, Binh, et al. "CoLT: coalesced large-reach TLBs." MICRO, 2012.
[6] Bhattacharjee, Abhishek. "Large-reach memory management unit caches." MICRO, 2013.
[7] Pham, Binh, et al. "Increasing TLB reach by exploiting clustering in page translations." HPCA, 2014.
[8] Esmaeilzadeh, Hadi, et al. "Dark silicon and the end of multicore scaling." ISCA, 2011.
[9] Bhattacharjee, Abhishek, Daniel Lustig, and Margaret Martonosi. "Shared last-level TLBs for chip multiprocessors." HPCA, 2011.
[10] Tanasic, I., Gelado, I., Cabezas, J., et al. "Enabling preemptive multiprogramming on GPUs." ISCA, 2014.
[11] Bakhoda, Ali, George Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. "Analyzing CUDA Workloads Using a Detailed GPU Simulator." ISPASS, 2009.
[12] Jiang, N., et al. "A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator." ISPASS, 2013.
[13] ARM® Cortex®-A57 MPCore Processor Technical Reference Manual, Revision r1p2.
[14] Che, Shuai, et al. "Rodinia: A benchmark suite for heterogeneous computing." IEEE International Symposium on Workload Characterization (IISWC), 2009.
[15] OpenCL specification, "http://www.khronos.org/opencl/"
[16] HSA specification, "http://www.hsafoundation.com/"
[17] Memory system on fusion APUs, "http://developer.amd.com/wordpress/media/2013/06/1004_final.pdf"
[18] CUDA toolkit, "https://developer.nvidia.com/cuda-toolkit"
[19] GPGPU-Sim manual, "http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual"
[20] NVIDIA Fermi Compute Architecture Whitepaper.
[21] Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors.
[22] Amit, Nadav, Muli Ben-Yehuda, and Ben-Ami Yassour. "IOMMU: Strategies for mitigating the IOTLB bottleneck." Computer Architecture. Springer Berlin Heidelberg, 2012.
[23] Intel® Virtualization Technology for Directed I/O.
|