National Digital Library of Theses and Dissertations in Taiwan

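Graduate student: 葉勇彬
Graduate student (English): Yong-Bin Ye
Thesis title: 基於語意分析之指令標示方法設計具迴圈展開機制之超多純量處理器架構
Thesis title (English): Design Loop Unrolling Mechanism Base on Instruction Tag of Semantic Analysis in Hyper-Scalar Architecture
Advisor: 邱日清
Advisor (English): Jih-Ching Chiu
Degree: Master's
Institution: National Sun Yat-sen University (國立中山大學)
Department: Department of Electrical Engineering (graduate program)
Discipline: Engineering
Field: Electrical and Information Engineering
Thesis type: Academic thesis
Year of publication: 2019
Academic year of graduation: 108 (ROC calendar)
Language: Chinese
Number of pages: 86
Chinese keywords: semantic analysis (語意分析); hyper-scalar (超多純量); instruction-level parallelism (指令並行度); loop unrolling (迴圈展開); instruction tag (指令標籤)
English keywords: hyper-scalar; semantic analysis; instruction tag; ILP; loop unrolling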
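Today's mainstream programs for machine learning, image processing, and encryption all make heavy use of loops, and the instructions inside those loops execute many times over. Data dependencies between loop instructions, instruction-flow stalls caused by mispredicted loop branches, and basic blocks bounded by branch instructions all keep an ILP-capable processor from reaching good instruction-level parallelism on such loop code. VLIW processors, by contrast, rely on the compiler to schedule instructions before they are sent to the processor, which raises instruction-level parallelism. When a program becomes too complex, however, the code still has to be scheduled by hand, an approach that is inflexible and cannot adapt dynamically.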

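This thesis proposes an instruction tagging method based on semantic analysis, built into the compiler, that statically analyzes program instructions to find the loop structures in a program. Because each type of loop structure has its own instruction arrangement, distinct loop semantics can be derived. Using these loop semantics, the compiler detects whether the instructions contain a loop structure and identifies every branch instruction within it. Each loop instruction that matches the loop semantics is marked with a one-byte instruction tag according to its loop type and instruction type. On the hyper-scalar processor, the thesis builds a loop unrolling mechanism driven by these instruction tags. The loop unroller consists of three parts: (1) a loop instruction collector, (2) a loop instruction unroller, and (3) a loop instruction dependency tag generator. Given the defined tag format, the loop instruction collector needs only a decoder and comparators to decode the tags and store loop instructions by loop type and instruction type. The loop instruction unroller fetches and unrolls the loop instructions and maintains a Branch Flush Table to avoid the instruction-flow stalls caused by mispredicted loop branches. The loop instruction dependency tag generator produces data dependency tags for the instructions and reorders instruction dispatch, raising both instruction-level parallelism and execution performance.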
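To make the idea of a one-byte tag concrete, here is a minimal sketch of how a loop type and an instruction type could be packed into a single byte and decoded again with shifts and masks. The abstract states only that the tag is one byte and encodes loop type and instruction type, so the bit layout, the `InstrType` categories, and the names `pack_tag` and `unpack_tag` are assumptions made for illustration, not the thesis's actual format. The three loop types follow the For, Do While, and While detection rules listed in the table of contents.

```python
from enum import IntEnum


class LoopType(IntEnum):
    """Loop kinds named in the thesis outline (numeric codes are hypothetical)."""
    NONE = 0
    FOR = 1
    DO_WHILE = 2
    WHILE = 3


class InstrType(IntEnum):
    """Hypothetical roles an instruction can play inside a loop."""
    BODY = 0
    INIT = 1
    CONDITION = 2
    BRANCH = 3


def pack_tag(loop_type: LoopType, instr_type: InstrType) -> int:
    """Pack both fields into one byte.

    Assumed layout: bits 7-6 hold the loop type, bits 5-4 hold the
    instruction type, and bits 3-0 are left as zero (reserved).
    """
    return (int(loop_type) << 6) | (int(instr_type) << 4)


def unpack_tag(tag: int) -> tuple[LoopType, InstrType]:
    """Decode a tag using only shifts and masks, as a hardware decoder might."""
    if not 0 <= tag <= 0xFF:
        raise ValueError("tag must fit in one byte")
    return LoopType((tag >> 6) & 0b11), InstrType((tag >> 4) & 0b11)


# A branch instruction that closes a for loop.
for_branch_tag = pack_tag(LoopType.FOR, InstrType.BRANCH)
```

With fixed-position fields like these, a collector could classify an instruction by comparing a few tag bits, which is consistent with the abstract's remark that only a decoder and comparators are needed.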

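The thesis uses the Keil uVision5 compiler to compile C programs into ARM assembly language. The instructions and their tags are fed into a simulator for verification, using odd-even sum, bubble sort, and matrix multiplication programs as tests. At maximum memory bandwidth, adding the loop unrolling mechanism to an eight-core hyper-scalar processor yields, across the test programs, a performance improvement of 1.2 to 4.1 times, an ILP of 4.77 to 5.87, and an ILP improvement of 1.3 to 1.7 times.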
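As a software-level illustration of what unrolling does to a single-level loop, the sketch below unrolls an odd-even sum by a factor of two. The record does not give the benchmark's source, so this version (summing odd values and even values separately) is an assumption, and the thesis performs the unrolling in hardware on tagged ARM instructions rather than in source code.

```python
def odd_even_sum(values):
    """Reference version: one element and one loop branch per iteration."""
    odd = even = 0
    for v in values:
        if v % 2:
            odd += v
        else:
            even += v
    return odd, even


def odd_even_sum_unrolled(values):
    """Same computation with the loop body written out twice per iteration."""
    odd = even = 0
    n = len(values)
    i = 0
    while i + 1 < n:
        a, b = values[i], values[i + 1]
        # First copy of the loop body.
        if a % 2:
            odd += a
        else:
            even += a
        # Second copy of the loop body; it does not depend on `a`.
        if b % 2:
            odd += b
        else:
            even += b
        i += 2
    if i < n:
        # Remainder element when the length is odd.
        v = values[i]
        if v % 2:
            odd += v
        else:
            even += v
    return odd, even


sample = list(range(1, 11))
```

The unrolled form executes half as many loop branches and exposes two independent element updates per iteration, which is the kind of parallelism a multi-issue processor can exploit.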
Mainstream programs such as machine learning, image processing, and encryption algorithms currently use a large number of loops, and the instructions in these loops are executed repeatedly and in large numbers. Data dependencies between loop instructions, instruction-flow stalls caused by branch misprediction, and basic blocks limited by branch instructions all lead to poor instruction-level parallelism when an ILP processor executes these loop programs. In a VLIW processor, by contrast, instructions are scheduled by the compiler and then sent to the processor for execution to improve instruction-level parallelism. However, if the program is too complicated, manual scheduling is still required, and this approach is inflexible and cannot be adjusted dynamically.

This thesis proposes an instruction tagging method, based on semantic analysis and implemented in the compiler, that statically analyzes program instructions to find the loop structures in a program. Different types of loop structure have different instruction patterns, which can be summarized as distinct loop instruction semantics. Based on these semantics, the compiler detects whether the instructions of a program contain a loop structure and finds all of the branch instructions in that structure. Each loop instruction that conforms to the loop semantics is given a one-byte instruction tag according to its loop type and instruction type. A loop unrolling mechanism based on the instruction tag is then established on the hyper-scalar processor. The mechanism is divided into three parts: the loop instruction collector, the loop instruction unroller, and the loop instruction dependency tag generator. Using the instruction tag, the loop instruction collector needs only a decoder and comparators to store each loop instruction according to its loop type and instruction type. The loop instruction unroller fetches and unrolls the loop instructions and maintains the Branch Flush Table to avoid the instruction-flow stalls caused by branch misprediction. The loop instruction dependency tag generator generates data dependency tags for the instructions and rearranges the instruction dispatch order to increase instruction-level parallelism and execution performance.
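The role of the dependency tag generator can be pictured with a small sketch that assigns each instruction of an unrolled loop the earliest dispatch level allowed by its read-after-write dependencies. This is an illustration under stated assumptions, not the thesis's tag format: instructions are modeled as hypothetical `(dest, sources)` tuples, write-after-read and write-after-write hazards are assumed to be removed elsewhere (for example by renaming), and every instruction is assumed to take one cycle.

```python
def dependency_levels(instrs):
    """Return the earliest dispatch level for each instruction.

    `instrs` is a list of (dest, sources) tuples in program order.
    Only read-after-write dependencies are considered.
    """
    last_writer_level = {}  # register name -> level of its latest writer
    levels = []
    for dest, sources in instrs:
        # An instruction can go one level after the latest producer it reads.
        level = 1 + max((last_writer_level.get(s, 0) for s in sources), default=0)
        levels.append(level)
        if dest is not None:
            last_writer_level[dest] = level
    return levels


# Two unrolled iterations of `s += a[p]; p += 4` (register names are made up).
unrolled = [
    ("t0", ("p",)),       # load a[p]
    ("s", ("s", "t0")),   # s += t0
    ("p", ("p",)),        # p += 4
    ("t1", ("p",)),       # load a[p] for the second iteration
    ("s", ("s", "t1")),   # s += t1
    ("p", ("p",)),        # p += 4
]

levels = dependency_levels(unrolled)
ideal_ilp = len(unrolled) / max(levels)
```

In this toy case the six instructions fit into three dispatch levels, so at most two instructions per level can issue on average. Instructions that share a level are the ones a dispatcher would be free to reorder or issue together.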

Verification uses the Keil uVision5 compiler to compile C programs into ARM assembly language. The instructions and tags are fed into the hyper-scalar processor simulator for verification. At the largest memory bandwidth, the loop unrolling mechanism is added to the eight-core hyper-scalar processor. Across the different test programs, the performance improvement is 1.2 to 4.1 times, the ILP is 4.77 to 5.87, and the ILP improvement factor is 1.3 to 1.7 times.
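The reported speedup range and ILP improvement range differ, which is easier to read with the two metrics written out. The sketch below assumes that ILP is measured as instructions executed per cycle and that speedup is the ratio of cycle counts; the record does not state the thesis's exact definitions, and the counts used here are made-up numbers, not results from the thesis.

```python
def ilp(instructions: int, cycles: int) -> float:
    """Average number of instructions executed per cycle."""
    return instructions / cycles


def speedup(base_cycles: int, new_cycles: int) -> float:
    """How many times faster the new run finishes than the baseline."""
    return base_cycles / new_cycles


# Hypothetical runs of the same program, for illustration only.
base_instructions, base_cycles = 1200, 400
unrolled_instructions, unrolled_cycles = 1000, 200

base_ilp = ilp(base_instructions, base_cycles)
unrolled_ilp = ilp(unrolled_instructions, unrolled_cycles)
ilp_gain = unrolled_ilp / base_ilp
total_speedup = speedup(base_cycles, unrolled_cycles)
```

With these made-up counts the ILP rises from 3.0 to 5.0 (a gain of about 1.67 times) while the run finishes 2.0 times faster, because the second run also executes fewer instructions. This shows one way in which a speedup can exceed the ILP gain; whether that explains the thesis's figures is not stated in this record.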
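Thesis Approval Form
Acknowledgements
Abstract (Chinese)
ABSTRACT
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Semantic Analysis
2.2 Hyper-Scalar Processor
2.2.1 Introduction to the Hyper-Scalar Processor Architecture
2.2.2 Instruction Analyzer
2.2.3 Virtual Shared Register File
2.3 Mechanisms for Improving Loop Execution Efficiency in Pipeline Processors
2.3.1 Out-of-order Execution
2.3.1.1 Scoreboard
2.3.1.2 Tomasulo's Algorithm
2.3.2 Branch Prediction
2.3.3 Loop Buffers
2.3.4 Loop Cache
2.3.5 Improving Loop Execution Efficiency in VLIW Processors
Chapter 3 Design of the Loop Unrolling Mechanism
3.1 System Architecture of the Loop Unrolling Mechanism
3.1.1 Design Concept of the System Architecture
3.1.2 System Architecture
3.2 Loop Detection in the Compiler
3.2.1 Instruction Tag Format
3.2.2 Loop Instruction Tag Examples
3.2.3 Loop Detection Mechanism
3.2.4 For Loop Detection Rules
3.2.5 Do While Loop Detection Rules
3.2.6 While Loop Detection Rules
3.3 Operation Flow of the Loop Unrolling Mechanism
3.4 Loop Instruction Collector
3.5 Loop Instruction Unroller
3.6 Loop Instruction Dependency Tag Generator
3.7 Instruction Dispatcher
3.8 Design and Application of the Defer Table
Chapter 4 Simulation Results and Analysis
4.1 Test Environment and Flow
4.2 Test Programs
4.2.1 Single-Level For Loop: Odd-Even Sum
4.2.2 Two-Level Loop: Bubble Sort
4.2.3 Multi-Level Loop: Matrix Multiplication
4.3 Results Analysis and Discussion
4.3.1 Simulation Results
4.3.1.1 Odd-Even Sum Simulation Results
4.3.1.2 Bubble Sort Simulation Results
4.3.1.3 Matrix Multiplication Simulation Results
4.3.2 Analysis and Discussion
4.3.2.1 Odd-Even Sum Data Analysis and Discussion
4.3.2.2 Bubble Sort Data Analysis and Discussion
4.3.2.3 Matrix Multiplication Data Analysis and Discussion
4.4 Summary
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
5.2.1 Possible Improvements
5.2.2 Future Applications
References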
Electronic full text (Internet release date: 2024-08-26)