National Digital Library of Theses and Dissertations in Taiwan

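Graduate student: 葉志威
Graduate student (English): Chih-Wei Yeh
Thesis title: 利用虛擬平台分析異質計算系統與應用之效能
Thesis title (English): Analyzing Application Performance for Heterogeneous Platforms
Advisor: 洪士灝
Advisor (English): Shih-Hao Hung
Oral defense committee: 楊佳玲, 徐慰中, 施吉昇, 郭大維, 涂嘉恒
Oral defense date: 2017-12-22
Degree: Doctoral
Institution: National Taiwan University
Department: Graduate Institute of Computer Science and Information Engineering
Field of study: Engineering
Discipline: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2017
Academic year of graduation: 106 (Republic of China calendar)
Language: English
Number of pages: 90
Keywords (translated from Chinese): timing simulation; performance analysis; heterogeneous platforms; machine learning; performance prediction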
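As technology evolves, processors are moving toward heterogeneous designs and custom accelerators to achieve better application performance, for workloads ranging from traditional data processing to smart applications such as deep learning, the Internet of Things, edge computing, and Industry 4.0. System design has therefore become more complex, and its focus has gradually shifted to exploring different design spaces and hardware parameters to find a suitable combination of processors and accelerators. Designers must even consider how software performance changes across processor architectures, because a program's performance varies with the design of its algorithms, and an algorithm's performance in turn varies with hardware parameters and hardware platforms. To obtain the best performance at a reasonable hardware cost, understanding how program and algorithm behavior changes has become an important issue. Performance analysis that is vertically integrated, from software behavior down to hardware changes, is the key technique for this issue and a key part of optimizing programs in the heterogeneous era.
Traditional simulators such as gem5 can provide accurate performance analysis and timing estimation at the microprocessor level, but adding a timing model for a new hardware component is not simple. These simulators also cannot analyze software behavior, nor show how changes in software behavior affect individual hardware components, and their complex timing models make simulation slow. To address simulation speed and program analysis, we integrated timing simulators into a fast hybrid simulator, Snippit, which lets users run and simulate a complete system without modification and can be used for prototype system development and design. In addition, the proposed fast timing simulator with hot-pluggable hardware simulators and the just-in-time dynamic model selector greatly reduce simulation time, allowing Snippit to run at 40 to 70 MIPS. We also integrated the Multi2Sim GPGPU simulator to simulate heterogeneous environments and implemented a shared-variable tracking mechanism that provides simple race condition detection, so that our heterogeneous simulator can help users detect problems with data movement and shared variables in heterogeneous environments. Finally, we implemented a program phase detection algorithm that collects information on program behavior, together with performance data for each segment of behavior under the hardware parameters that the user specifies. With these features, Snippit serves as a simulation environment that not only supports hardware/software design and performance analysis but also achieves vertical integration: changes in a software algorithm's behavior are mapped to hardware changes, and then to the performance metrics of those changes. It also incorporates machine learning to help automatically analyze and predict the performance of different parts of the code on different hardware platforms, and to suggest optimization directions to users.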
Today's state-of-the-art processing systems often require heterogeneous computing and special-purpose accelerators to deliver high performance for mixed application workloads, which include not only traditional data processing algorithms but also smart applications such as deep learning, the Internet of Things, edge computing, and Industry 4.0. The complexity of such systems has therefore been increasing, and the focus of design has been shifting toward exploring a design space that mixes processing cores and accelerators, and toward the performance impact of application behavior on hardware resources. To gain the best performance at acceptable hardware cost, it is critical to understand program behavior and the underlying algorithms, yet the performance of each algorithm varies across accelerators and hardware parameters. Vertical integration and analysis, from high-level program behavior down to changes in the underlying hardware parameters, is the key technique for optimization in the heterogeneous era.
Traditional simulation tools may offer accurate performance estimation at the micro-architectural level, but combining the simulators for various components to run complex applications is highly complicated, and these tools fall short in profiling application behavior together with the performance impact of hardware changes. Furthermore, such complex simulation is slow with a cycle-accurate heterogeneous emulation framework such as gem5.
To address simulation speed and performance analysis, we developed a rapid hybrid emulation/simulation framework, Snippit, that allows the user to execute a full-blown system and plug in emulators, simulators, and timing models for the various components of a prototype system. With the proposed scalable, hot-pluggable timing simulation scheme and a just-in-time model selection mechanism that reduces the simulation time of regular patterns, the framework is capable of running at 40-70 MIPS. After integrating the Multi2Sim GPGPU emulator, we implemented a shared variable tracking mechanism to trace race conditions as well as the throughput of data copies among processing units. In addition, we implemented a phase detection algorithm to track and collect application behavior and its performance data under the hardware parameters that the user specifies. With these functions, Snippit is an emulation tool that helps both system and application developers analyze performance from software to hardware in terms of how the program behaves. Finally, we incorporated machine learning to help analyze and predict the performance of optimizing and running target applications on other accelerators. As a vertical integration from software to hardware, Snippit can classify the execution of an application into program phases and suggest optimizations for each phase along with a performance prediction.
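The abstract does not spell out how the phase detection algorithm works. As a rough illustration only, the sketch below assumes an approach based on basic block vectors (the table of contents lists "Basic Block Vectors and Phases" as background): execution is cut into intervals, each interval is summarized by a normalized histogram of executed basic blocks, and an interval joins an existing phase when its histogram is close to that phase's signature. The function names, the vector dimension, the Manhattan distance metric, and the threshold are all hypothetical choices for this example and are not taken from Snippit.

```python
from collections import Counter


def bbv(block_ids, dim):
    """Build a normalized basic block vector (BBV) for one execution interval.

    block_ids: IDs of the basic blocks executed in the interval (hypothetical input).
    dim: vector length; block IDs are folded into this many buckets.
    """
    counts = Counter(b % dim for b in block_ids)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * dim
    return [counts.get(i, 0) / total for i in range(dim)]


def manhattan(u, v):
    """Manhattan (L1) distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def detect_phases(intervals, dim=8, threshold=0.5):
    """Assign a phase ID to every interval.

    An interval joins the closest known phase if its BBV lies within
    `threshold` of that phase's signature; otherwise it opens a new phase.
    The threshold value is an assumption made for this sketch.
    """
    signatures = []  # one representative BBV per discovered phase
    labels = []
    for blocks in intervals:
        vec = bbv(blocks, dim)
        best, best_dist = None, None
        for phase_id, sig in enumerate(signatures):
            dist = manhattan(vec, sig)
            if best_dist is None or dist < best_dist:
                best, best_dist = phase_id, dist
        if best is None or best_dist > threshold:
            signatures.append(vec)
            best = len(signatures) - 1
        labels.append(best)
    return labels


# Five intervals: a loop over blocks 0/1, then blocks 4/5, alternating.
trace = [[0, 0, 1, 1], [0, 1, 0, 1], [4, 5, 4, 5], [0, 0, 1, 1], [5, 4, 5, 4]]
labels = detect_phases(trace)
print(labels)
```

In this toy trace, intervals that execute the same blocks in a different order map to the same phase, which is the property that lets per-phase performance data be collected once and reused for recurring behavior.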
Oral Defense Committee Certification
Acknowledgements
Abstract (Chinese)
Abstract
Chapter 1 Introduction
Chapter 2 Background and Related Work
2.1 Simulation Tools
2.2 Virtual Performance Monitoring Unit
2.3 Multi2Sim and OpenCL
2.4 Program Phases
2.4.1 Basic Block Vectors and Phases
2.4.2 Workflow of Phase Detection
2.4.3 Studies Leveraging the Properties of Phases
Chapter 3 Implementation
3.1 Overview of Snippits
3.1.1 Dynamic Binary Instrumentation
3.1.2 Virtual Clock
3.1.3 Multiple Timeline
3.1.4 Helpers for Communication
3.2 Fast and Composable Multicore Simulation
3.2.1 Packet Wrapper
3.2.2 Multi-Model Ring Buffer (MMRB)
3.2.3 Just-In-Time (JIT) Model Selection
3.2.4 Async Callback
3.3 Heterogeneous Processors Simulation
3.3.1 Virtual IOCTL
3.3.2 Accelerator Plugin Module
3.3.3 Shared Variable Tracking
3.3.4 Memory Management
3.4 Phase-Based Program Analysis and Optimization Support
3.4.1 Event Tracker
3.4.2 Phase Detection
3.4.3 Post-Processing
3.4.4 Performance Prediction
Chapter 4 Evaluations
4.1 Rapid Multicore Processor Simulation
4.1.1 The Bottleneck Shifts
4.1.2 JIT Model-Selection
4.2 Shared-Memory Architecture Simulation with CPU and GPU
4.2.1 Experimental Setup
4.2.2 Evaluation of Heterogeneous Applications
4.2.3 Shared Variable Tracking
4.3 Phase-Based Program Analysis
4.3.1 Experimental Setup
4.3.2 Phase Profiling
4.3.3 Case Study with Word Count
4.4 Phase-Based System Optimizations
4.4.1 Predicting the Performance on GPU
4.4.2 Hints for Vectorizations
4.4.3 SW/HW Co-Design
Chapter 5 Conclusion and Future Work
Bibliography