跳到主要內容

臺灣博碩士論文加值系統

(44.210.99.209) 您好!臺灣時間:2024/04/16 02:18
字體大小: 字級放大   字級縮小   預設字形  
回查詢結果 :::

詳目顯示

我願授權國圖
: 
twitterline
研究生:江宗翰
研究生(外文):Tsung-Han Chiang
論文名稱:於虛擬平台用迴圈函式的追蹤工具進行程式分析
論文名稱(外文):Program Analysis with a Loop-Function-based Tracing Tool onVirtual Platforms
指導教授:洪士灝洪士灝引用關係
口試委員:涂嘉恆廖世偉
口試日期:2016-07-23
學位類別:碩士
校院名稱:國立臺灣大學
系所名稱:資訊工程學研究所
學門:工程學門
學類:電資工程學類
論文種類:學術論文
論文出版年:2016
畢業學年度:104
語文別:英文
論文頁數:42
中文關鍵詞:迴圈和函式偵測動態分析迴圈函式情境樹虛擬平台ARM 架構
外文關鍵詞:Loop and function detectionDynamic AnalysisLoop-Call Context TreeVirtual PlatformARM Architecture
相關次數:
  • 被引用被引用:0
  • 點閱點閱:124
  • 評分評分:
  • 下載下載:0
  • 收藏至我的研究室書目清單書目收藏:0
ARM 是近來最廣泛使用的指令集(ISA)。基於ARM 的設備已經攻入了可攜式裝置和伺服器市場。自2014 年起生產超過500 億的ARM處理器,分析ARM ISA 的程式成為軟體工程中的重要任務之一。然而由於ARM ISA 的設計和編譯器優化,在傳統的分析工具上追蹤ARM架構執行的程式函式呼叫和返還有一定困難度。此外,大部分的分析工具不會將分析結果收集在函式和迴圈層級。除此之外,端看你想頗析的程式行為,如模擬快取記憶體,整個過程需要花費大量時間。完整的剖析會帶來探針效應(Probe Effect),改變程式的行為,並影響到剖析的結果。
在我們的研究中,我們提出了stack-pointer-based 和later loop entry的檢測方式,針對ARM 架構上執行的程式,克服偵測迴圈和函式的困難,我們產生迴圈函式情境樹(Loop-Call Context Tree)。並且能讓程式頗析的層級更為細緻,幫助分析資料依賴、資料存取模式、快取記憶體模擬。此種情境樹使許多更進階的程式分析成為可能,如:迴圈依賴(Loop Dependency)、平行化偵測(Parallelism Detection)。最後,我們結合數值方法及模擬方法,將快取記憶體的剖析進行加速。並藉由於虛擬平台上進行程式頗析,基本上不會造成任何程式行為的改變。

ARM is the most widely used instruction set architecture (ISA) in terms of quantity produced. Recently, ARM-based systems have taken up markets of portable devices and servers. With over 50 billion ARM processors produced as of 2014, performance profiling for systems based on ARM ISA has become one of the very important tasks in today’s system engineering. However, conventional profiling tools are insufficient for tracking functions and loops of the programs performed by the ARM processors due to the design of ARM ISA and compiler optimization. In particular, most of the profiling tools are unable to collect and analyze events related to hardware and software interactions at the function and loop-level granularities.
In our study, we propose a stack-pointer-based method with a later loop entry detection scheme to overcome the difficulties of detecting functions and loops for programs performed on the ARM architecture. The generated loopcall context tree is used to build relationship among functions and loops and to store profiling data. This tree enables analysis, such as memory dependency, memory access pattern, cache simulation at finer-grained granularities. Moreover, the stored profiling data enable further analysis on parallelism detection and loop dependency. Finally, the advantages of the analytic methods and simulation methods are combined to accelerate cache simulation.

1 Introduction 1
2 Background and Related Work 4
2.1 Trend for Program Analysis . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Instrumentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Virtual Platform and VPA . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Call Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4.1 Call Flow Graph . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.2 Dynamic Call Tree . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.3 Call Context Tree . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.4 Loop Call Context Tree . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Loop Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.1 TB method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.2 pMarker method . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6 Related Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Framework and Implementation 11
3.1 Process Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Branch Event Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 Function Call Detection . . . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Function Return: Address-based Detection . . . . . . . . . . . . 14
3.2.3 Function Return: Stack-Pointer-based Detection . . . . . . . . . 15
3.2.4 Loop Event Detection . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 The Design of LCCT Structure . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 LCCT Data Structure . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 A Pipeline-related Issue on Instruction Address . . . . . . . . . . . . . . 20
3.5 Trace Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5.1 Instruction Trace Collection . . . . . . . . . . . . . . . . . . . . 22
3.5.2 Branch Trace Collection . . . . . . . . . . . . . . . . . . . . . . 22
3.5.3 Cache Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5.4 Memory Dependency Analysis . . . . . . . . . . . . . . . . . . . 25
3.5.5 Memory Access Analysis . . . . . . . . . . . . . . . . . . . . . 25
3.6 Post-processing Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Evaluation 28
4.1 Experminental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Efficiency of Stack-Pointer-based Method . . . . . . . . . . . . . . . . . 29
4.3 Evaluation of Program Analysis . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Traceview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5 Conclusion and Future Work 36
Bibliography 38

[1] Celebrating 50 billion shipped arm-powered chips. https://community.arm.com/community/news/blog/2014/02/12/celebrating-50-billionshipped-arm-powered-chips.
[2] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In ACM Sigplan notices, volume 42, pages 89–100. ACM, 2007.
[3] Kristof Beyls and Erik H D’Hollander. Refactoring for data locality. Computer, 42(2):62–71, 2009.
[4] Mary Hall, Jacqueline Chame, Chun Chen, Jaewook Shin, Gabe Rudy, and Malik Murtaza Khan. Loop transformation recipes for code generation and auto-tuning. In International Workshop on Languages and Compilers for Parallel Computing, pages 50–64. Springer, 2009.
[5] Cedric Nugteren, Pieter Custers, and Henk Corporaal. Algorithmic species: a classification of affine loop nests for parallel programming. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):40, 2013.
[6] Brendan Gregg and Jim Mauro. DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X, and FreeBSD. Prentice Hall Professional, 2011.
[7] Frank Ch Eigler and Red Hat. Problem solving with systemtap. In Proc. of the Ottawa Linux Symposium, pages 261–268. Citeseer, 2006.
[8] Fabrice Bellard. Qemu, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track, pages 41–46, 2005.
[9] Chia-Heng Tu, Hui-Hsin Hsu, Jen-Hao Chen, Chun-Han Chen, and Shih-Hao Hung. Performance and power profiling for emulated android systems. ACM Transactions on Design Automation of Electronic Systems (TODAES), 19(2):10, 2014.
[10] Minjang Kim, Pranith Kumar, Hyesoon Kim, and Bevin Brett. Predicting potential speedup of serial code via lightweight profiling and emulations with memory performance model. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 1318–1329. IEEE, 2012.
[11] Mitesh R Meswani, Laura Carrington, Didem Unat, Allan Snavely, Scott Baden, and Stephen Poole. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications, 27(2):89–108, 2013.
[12] Julian Hammer, Georg Hager, Jan Eitzinger, and Gerhard Wellein. Automatic loop kernel analysis and performance modeling with kerncraft. In Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems, page 4. ACM, 2015.
[13] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.
[14] Cedric Nugteren and Henk Corporaal. The boat hull model: enabling performance prediction for parallel computing prior to code development. In Proceedings of the 9th conference on Computing Frontiers, pages 203–212. ACM, 2012.
[15] Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein. Quantifying performance bottlenecks of stencil computations using the execution-cache-memory model. In Proceedings of the 29th ACM on International Conference on Supercomputing, pages 207–216. ACM, 2015.
[16] Newsha Ardalani, Clint Lestourgeon, Karthikeyan Sankaralingam, and Xiaojin Zhu. Cross-architecture performance prediction (xapp) using cpu code to predict gpu performance. In Proceedings of the 48th International Symposium on Microarchitecture, pages 725–737. ACM, 2015.
[17] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. Addresssanitizer: a fast address sanity checker. In Presented as part of the 2012 USENIX Annual Technical Conference (USENIX ATC 12), pages 309–318, 2012.
[18] Vb watch. http://www.aivosto.com/vbwatch.html.
[19] Xueling Chen. Simsight: a virtual machine based dynamic call graph generator. 2010.
[20] Yukinori Sato, Yasushi Inoguchi, and Tadao Nakamura. On-the-fly detection of precise loop nests across procedures on a dynamic binary translation system. In Proceedings of the 8th ACM International Conference on Computing Frontiers, page 25. ACM, 2011.
[21] Zhen Li, Abumoslem Jannesari, and Felix Wolf. Discovery of potential parallelism in sequential programs. In Parallel Processing (ICPP), 2013 42nd International Conference on, pages 1004–1013. IEEE, 2013.
[22] Minjang Kim, Hyesoon Kim, and Chi-Keung Luk. Prospector: A dynamic datadependence profiler to help parallel programming. In HotPar
QRCODE
 
 
 
 
 
                                                                                                                                                                                                                                                                                                                                                                                                               
第一頁 上一頁 下一頁 最後一頁 top