National Digital Library of Theses and Dissertations in Taiwan

Detailed Record

Author: 趙姝戎
Author (English): Shu-jung Chao
Title (Chinese): 在超純量多核心架構下實現基於語意分析之迴圈展開器以提升ILP
Title (English): Improving ILP with Semantic-Based Loop Unrolling Mechanism in the Hyperscalar Architecture
Advisor: 邱日清
Advisor (English): Jih-Ching Chiu
Degree: Master's
Institution: National Sun Yat-sen University
Department: Department of Electrical Engineering
Discipline: Engineering
Field of study: Electrical and Information Engineering
Thesis type: Academic thesis
Year of publication: 2017
Graduation academic year: 105 (2016/17)
Language: Chinese
Number of pages: 87
Keywords (Chinese): 迴圈語意、迴圈展開、超多純量、迴圈指令並行度
Keywords (English): ILP of loop; hyperscalar; semantic of loop; loop unrolling
Usage statistics:
  • Times cited: 0
  • Views: 157
  • Downloads: 12
  • Bookmarked: 0
Loop structures form the bulk of programs with high-performance computing demands. With the arrival of multi-core computing architectures, exploiting the instruction-level parallelism (ILP) within loops can effectively raise the overall performance of a multi-core architecture. A loop imposes the following characteristics on a program: (1) the loop body is repeatedly fetched from the instruction cache and decoded, (2) the number of instructions in the loop body limits how many can be dispatched, and (3) data dependences usually exist between iterations. These factors lead to poor ILP during loop execution. They deserve particular attention on the Hyperscalar architecture, so that the computing resources across cores can be used to their fullest.
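To make the problem of inter-iteration dependences concrete, the following C fragment is a minimal illustrative example (it is not taken from the thesis): every iteration reads the accumulator written by the previous one, so the additions form a serial chain, and a processor that fetches and dispatches the loop body as-is can overlap little work across iterations.

/* Minimal illustrative example (not from the thesis): a reduction loop.
 * Each iteration reads `sum`, which the previous iteration just wrote,
 * so the additions form a serial chain that limits ILP across iterations. */
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];   /* loop-carried dependence through `sum` */
    }
    return sum;
}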
The machine code a compiler generates for a loop structure follows specific instruction-arrangement patterns, and observing these patterns lets us define the semantics of a loop. Based on the Hyperscalar architecture, this thesis proposes a semantic-based loop unrolling mechanism inside the instruction analyzer (IA): it parses the compiler-generated machine code, locates the interval of instructions that matches the loop semantics, collects the information in that interval, and unrolls the loop after analyzing the collected data. The mechanism consists of three units: the loop detect unit (LDU), the loop unrolling unit (LUU), and the loop controller. The LDU locates the loop body according to the defined semantics and collects the information of that interval. The LUU unrolls the loop using the information collected by the LDU, in four steps: (1) decide the unrolling factor according to the core resources and attach a sequence number (SEQ) to each instruction; (2) perform register renaming and eliminate the data dependences between iterations; (3) generate tags for the unrolled instructions and add compensation tags for the speculatively executed loop so that data correctness is preserved; and (4) rearrange the dispatch order so that instructions whose inter-iteration dependences have been removed are dispatched earlier, and generate the per-core instruction dispatch tag table, the loop VSRF mapping table, the loop memory tag mapping table, and the loop specific instruction flush table. The loop controller decides which unit holds the right to dispatch instructions, based on mispredicted branch instructions and on loops that have finished unrolling: if a branch predicted not-taken but resolved taken is the conditional branch of a loop that has been unrolled, the loop controller hands the dispatch right to the LUU and returns it to the IA once the loop finishes executing.
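As a rough software analogue of steps (1) and (2) above, the hypothetical C rewrite below unrolls the reduction four times and gives each copy of the body its own accumulator, which removes the dependence between consecutive iterations. The thesis performs the equivalent transformation in hardware on ARM machine code inside the LUU; this is only a sketch of the idea, and the function name dot_unrolled is made up for illustration.

/* Hypothetical 4x-unrolled rewrite of the reduction above (a sketch of the
 * idea, not the thesis's hardware mechanism). Renaming the single accumulator
 * into s0..s3 removes the dependence between consecutive iterations, so the
 * four multiply-add chains can proceed in parallel; the remainder loop and the
 * final sum act as the kind of compensation needed to keep the result
 * equivalent (up to floating-point reassociation). */
float dot_unrolled(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {      /* unrolling factor fixed at 4 here */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                /* leftover iterations */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}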
The mechanism is verified with ARM assembly produced by the Keil μVision5 compiler. The measured results show that eliminating the dependences between iterations raises ILP by a factor of 1.2 to 2, and that adding the specific-instruction flush effectively reduces the execution time of loop programs whose bodies contain branch instructions.
In the age of multi-core computing architectures, exploiting the ILP of loops can enhance overall computing efficiency, because loop structures make up the bulk of programs with high-performance computing needs. The characteristics a loop structure imposes on a program are as follows: (1) instructions are fetched from the instruction cache and decoded repeatedly; (2) the number of loop body instructions limits how many can be issued; (3) data dependences exist between iterations. These factors result in poor ILP when loops are executed. To obtain the maximum benefit from the cores' computing resources, this problem deserves particular attention on the Hyperscalar architecture.
Because the machine code produced by compiling a loop structure follows a specific arrangement pattern, the semantics of a loop can be formulated by observing this pattern. This thesis proposes a semantic-based loop unrolling mechanism for the Hyperscalar architecture. The mechanism works inside the instruction analyzer (IA): it parses instructions against the formulated semantics, finds the closed interval of loop body instructions, gathers the information within that interval, and unrolls the loop after analyzing it.
The proposed mechanism consists of three units: the loop detect unit (LDU), the loop unrolling unit (LUU), and the loop controller. The LDU finds the closed interval of loop body instructions by matching instruction semantics against the formulated pattern and collects the information of that interval. The LUU unrolls the loop based on the information collected by the LDU, in four steps: (1) decide the unrolling factor according to the available core resources and attach a SEQ tag to each instruction; (2) perform register renaming and eliminate the iteration dependences of the unrolled loop; (3) generate instruction tags for the unrolled instructions and add compensation tags to preserve data correctness under speculative loop execution; (4) rearrange the issue order so that instructions whose iteration dependences have been eliminated are issued earlier, and generate the instruction tag dispatch table, the loop VSRF mapping table, the loop memory tag mapping table, and the loop specific instruction flush table. The loop controller decides which unit holds the dispatch right, based on mispredicted branch instructions and on loops that have finished unrolling: if a mispredicted branch is the conditional-check branch of an unrolled loop, the dispatch right is handed over to the LUU; once the unrolled loop finishes executing, the loop controller hands the dispatch right back to the IA.
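The LDU's detection step can be pictured as scanning the decoded instruction stream for a conditional branch whose target lies at an earlier address; the instructions between that target and the branch then form the candidate loop body. The C sketch below illustrates this idea on a simplified instruction record. The structure fields and the single backward-branch check are assumptions made for illustration, not the thesis's actual semantic rules.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified decoded-instruction record (illustrative only). */
typedef struct {
    uint32_t addr;        /* instruction address           */
    bool     is_cond_br;  /* conditional branch?           */
    uint32_t target;      /* branch target address, if any */
} insn_t;

/* Sketch of the detection idea: a conditional branch that jumps backwards
 * delimits a candidate loop body [target, branch]. Returns true and fills
 * *start/*end with the interval's indices if such a branch is found. */
bool find_loop_body(const insn_t *code, int n, int *start, int *end)
{
    for (int i = 0; i < n; i++) {
        if (code[i].is_cond_br && code[i].target < code[i].addr) {
            for (int j = 0; j < i; j++) {
                if (code[j].addr == code[i].target) {
                    *start = j;   /* first instruction of the loop body */
                    *end   = i;   /* the backward conditional branch    */
                    return true;
                }
            }
        }
    }
    return false;
}

int main(void)
{
    /* Toy stream: four instructions; the branch at address 12 jumps back to 4. */
    insn_t code[] = {
        {  0, false, 0 },
        {  4, false, 0 },
        {  8, false, 0 },
        { 12, true,  4 },
    };
    int start, end;
    if (find_loop_body(code, 4, &start, &end))
        printf("loop body spans instructions %d..%d\n", start, end);
    return 0;
}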
The ARM instructions used for verification were generated by the Keil μVision5 compiler. The results show that eliminating iteration dependences improves ILP by 20% to 100%, and that flushing specific instructions reduces the total execution time of loops whose bodies contain internal branch instructions.
Thesis approval form i
Thesis authorization for public release ii
Acknowledgements iii
Abstract (Chinese) iv
Abstract vi
Table of contents viii
List of figures x
List of tables xii
Chapter 1 Introduction 1
1.1. Research motivation 1
1.2. Research objectives 3
1.3. Thesis organization 5
Chapter 2 Related work 6
2.1. Overview of the Hyperscalar architecture 6
2.1.1. Instruction analyzer 8
2.1.2. Virtual shared register file 11
2.1.3. Register data flow handling 13
2.1.4. Memory data flow handling 15
2.1.5. Instruction flow handling 16
2.2. Improving loop execution efficiency in pipelined and superscalar processors 19
2.2.1. Loop buffers 19
2.2.2. Loop cache 19
2.2.3. Branch prediction 19
2.2.4. Out-of-order execution 20
2.3. Improving loop execution efficiency in dataflow processors 22
2.3.1. Static dataflow processors 23
2.3.2. Dynamic dataflow processors 24
2.4. Improving loop execution efficiency in VLIW processors 25
2.5. Comparison of loop execution efficiency improvements across processors 27
Chapter 3 Design of the semantic-based loop unrolling mechanism in the Hyperscalar architecture 28
3.1. System architecture of the semantic-based loop unrolling mechanism 28
3.1.1. Design concept of the system architecture 28
3.1.2. System architecture 30
3.2. Loop detect unit 34
3.2.1. Loop detection 34
3.2.2. Loop data storage format 37
3.2.3. Example of the loop data storage format 38
3.3. Loop unrolling unit 40
3.3.1. Dependence analysis of loop instructions 40
3.3.2. Register renaming 43
3.3.3. Eliminating iteration dependence 44
3.3.4. Loop compensation mechanism 47
3.4. Loop controller 53
3.5. Instruction operation example 54
Chapter 4 Simulation and analysis 65
4.1. Architecture simulation 65
4.1.1. Program flow of the simulator architecture 65
4.1.2. Benchmark programs 67
4.2. Results and discussion 68
Chapter 5 Conclusion 70
References 72