National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

Graduate student: Jih-ching Chiu (邱日清)
Thesis title: Exploiting X86 Front-end Parallelism with Program Trace Support (以程式軌跡支援開發X86指令集處理器前端並行方式)
Advisor: Chung-Ping Chung (鍾崇斌)
Degree: Doctoral
Institution: National Chiao Tung University (國立交通大學)
Department: Computer Science and Information Engineering (資訊工程系)
Discipline: Engineering
Field: Electrical Engineering and Computer Science
Thesis type: Academic thesis
Year of publication: 2002
Academic year of graduation: 90 (ROC calendar)
Language: Chinese
Number of pages: 140
Keywords: front-end; x86 superscalar processor; instruction fetch; instruction prefetching; variable-length instruction format; instruction address queue; instruction stream buffer; instruction identifier
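For a superscalar processor, the front-end unit, which comprises the instruction stream buffer and the instruction fetch unit, is the key component for achieving high instruction bandwidth. However, the variable-length instruction format and the complex addressing system make it difficult for an x86 superscalar processor to fetch multiple instructions in one cycle. To achieve high instruction fetch bandwidth, we examine in depth how the program execution trace affects keeping the instruction stream smooth and raising the instruction fetch degree. To build a front-end with high instruction fetch bandwidth, this dissertation addresses four research topics:
1. Increasing the instruction cache hit rate;
2. Identifying and fetching multiple instructions in one cycle;
3. Fetching instructions across basic blocks;
4. Storing the addresses of multiple instructions to preserve machine state for handling exceptions.
For the first topic, increasing the instruction cache hit rate, we develop a new method that borrows branch prediction to support cache prefetching, called BIB prefetching. Simulation results show that BIB prefetching outperforms traditional prefetching by 7% and other prediction-table-based prefetching methods by 17%. As BTB design matures and its accuracy rises, BIB prefetching will become increasingly effective. For the second topic, identifying and fetching multiple instructions in one cycle, we propose using an Instruction Identifier to predict instruction lengths and storing the instruction pointers as superscalar group indicators. This method overcomes the difficulty of fetching at a high instruction degree (>3). According to the simulation results, an Instruction Identifier with 64 table entries gives the best balance between performance and cost. For the third topic, fetching instructions across basic blocks, we couple the buffer with the branch prediction unit to supply program execution trace information and thereby improve the performance of the instruction stream buffer. According to the simulation results, fetching from an instruction buffer that can span two basic blocks reaches an instruction fetch degree of up to 8.42 x86 instructions on average. Balancing performance against cost, we recommend building this buffer from two 64-byte instruction entries. Compared with current instruction buffer designs, the buffer that spans two basic blocks performs up to 1.9 times better. For the fourth topic, storing the addresses of multiple instructions to preserve machine state for handling exceptions, we design an instruction address queue and determine its size by evaluation, so that it provides enough storage to track instruction addresses without degrading back-end execution efficiency. The design takes into account two troublesome factors in CISC architectures, namely the variable-length instruction format and the complex addressing system. In an x86 superscalar processor with an instruction fetch degree of 5, it saves one third of the storage hardware at nearly the same access latency.
With the critical topics above studied in this dissertation, an efficient front-end for an x86 superscalar processor is realized.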
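To make the second topic concrete, the sketch below shows why variable-length instructions force boundary finding to be sequential, and how a small table of remembered instruction pointers can avoid repeating that scan. It is only an illustration under stated assumptions, not the Instruction Identifier design from the dissertation: the toy encoding (the low 4 bits of the first byte give the instruction length), the `PointerTable` class, and its indexing scheme are all hypothetical. The 64-entry size is borrowed from the abstract purely as a default.

```python
from typing import Dict, List, Tuple


def toy_length(first_byte: int) -> int:
    """Toy rule (assumption, not real x86): low 4 bits give the total length."""
    return max(1, first_byte & 0x0F)


def sequential_scan(code: bytes, start: int, limit: int) -> List[int]:
    """Find up to `limit` instruction start offsets, one after another.

    Each start offset depends on the length of the previous instruction,
    which is what makes multi-instruction fetch hard for variable-length ISAs.
    """
    starts: List[int] = []
    pc = start
    while pc < len(code) and len(starts) < limit:
        starts.append(pc)
        pc += toy_length(code[pc])
    return starts


class PointerTable:
    """Hypothetical table that remembers start pointers per fetch address."""

    def __init__(self, entries: int = 64) -> None:
        self.entries = entries
        self.table: Dict[int, Tuple[int, List[int]]] = {}  # index -> (tag, starts)

    def fetch_group(self, code: bytes, start: int, limit: int) -> Tuple[List[int], bool]:
        """Return (start offsets, hit flag); on a miss, scan and remember."""
        index = start % self.entries
        entry = self.table.get(index)
        if entry is not None and entry[0] == start:
            return entry[1], True  # pointers known at once, no scan needed
        starts = sequential_scan(code, start, limit)
        self.table[index] = (start, starts)
        return starts, False


code = bytes([0x02, 0xAA, 0x01, 0x03, 0xBB, 0xCC, 0x01])
table = PointerTable()
first = table.fetch_group(code, 0, 3)
second = table.fetch_group(code, 0, 3)
```

In this toy run the first call misses and falls back to the sequential scan, while the second call to the same fetch address returns the stored pointers directly, which is the general effect a pointer or length prediction structure aims for.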

The front-end units, namely the instruction stream buffer and the fetcher, are the key elements for achieving high instruction bandwidth. In x86 superscalar processors, however, variable-length instructions and the complex addressing system make it difficult to fetch multiple instructions in one cycle. To approach high instruction fetch bandwidth, this dissertation examines how the program execution trace can be used to keep the instruction stream smooth and to raise the x86 instruction fetch degree. Four topics are studied toward building a front-end with a high superscalar degree:
1. Increasing fetch bandwidth at the front-end entrant;
2. Identifying multiple instructions in one clock cycle;
3. Fetching super basic block instructions;
4. Storing each instruction address to preserve processor state in x86 processors with a high instruction fetch degree.
In the first topic, increasing fetch bandwidth at the front-end entrant, we develop a new instruction prefetching method in which prefetching is directed by branch prediction, called branch instruction based (BIB) prefetching. Simulation results show that, with the same BTB size, this design outperforms traditional sequential prefetching by 7% and other prediction-table-based prefetching methods by 17% on average. In the second topic, identifying multiple instructions in one clock cycle, we propose using an Instruction Identifier to predict instruction lengths and storing the instruction pointers as superscalar instruction group indicators. Simulation results suggest that an Instruction Identifier with a 64-entry table is a good performance/cost choice. In the third topic, fetching super basic block instructions, we propose a design that improves instruction stream buffer performance by coupling the buffer with the Branch Target Buffer (BTB) to support trace prediction. On average, this instruction stream buffer improves the instruction fetch rate by 90% over that of current x86 processors. In the fourth topic, storing each instruction address to preserve processor state in x86 processors with a high instruction fetch degree, we propose an instruction PC Offset Queue. The design accounts for two CISC hazards in the x86 architecture, variable-length instructions and complex addressing, and reduces the storage space for a degree-5 superscalar x86 processor by one third, with even smaller access latency.
With the critical topics discussed in this dissertation addressed, an efficient front-end for an x86 micro-architecture with a high superscalar degree becomes practical.
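As a rough illustration of the first topic, the sketch below contrasts sequential next-line prefetching with prefetching directed by a predicted-taken branch. It is a minimal sketch under stated assumptions and does not reproduce the extended BTB described in the dissertation: the 32-byte line size, the dictionary standing in for the BTB (branch address mapped to predicted target), and both function names are hypothetical.

```python
from typing import Dict

LINE_SIZE = 32  # bytes per cache line (assumption for this sketch)


def line_of(addr: int) -> int:
    """Cache line number that contains the given byte address."""
    return addr // LINE_SIZE


def sequential_prefetch(fetch_addr: int) -> int:
    """Traditional scheme: always prefetch the next sequential line."""
    return line_of(fetch_addr) + 1


def branch_directed_prefetch(fetch_addr: int, btb: Dict[int, int]) -> int:
    """Prefetch the target line of the first predicted-taken branch, if any.

    `btb` is a stand-in for a branch target buffer that holds only branches
    predicted taken, mapping branch address to predicted target address.
    """
    line = line_of(fetch_addr)
    line_end = (line + 1) * LINE_SIZE
    # Branches at or after the fetch address that lie in the current line.
    candidates = [b for b in btb if fetch_addr <= b < line_end]
    if candidates:
        return line_of(btb[min(candidates)])
    return line + 1  # no predicted-taken branch: fall back to next line


btb = {0x48: 0x200}  # hypothetical entry: branch at 0x48 jumps to 0x200
taken_case = branch_directed_prefetch(0x40, btb)
fallthrough_case = branch_directed_prefetch(0x80, btb)
```

When the current line holds a branch predicted taken, the sequential scheme would fetch a line that is unlikely to be used next, whereas the branch-directed choice follows the predicted path. This also suggests why the benefit depends on prediction accuracy.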

ABSTRACT IN CHINESE
ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 CURRENT X86 FRONT-ENDS
1.1.1 Intel Pentium
1.1.2 Intel Pentium with MMX Technology
1.1.3 Intel Pentium Pro/Pentium II/Pentium III
1.1.4 Intel Pentium 4
1.1.5 AMD K5
1.1.6 AMD K6/K6-2/K6-III
1.1.7 AMD K7 (Athlon)
1.2 REMARKS OF CURRENT FRONT-ENDS
1.2.1 Comparisons of Current Front-ends
1.2.2 Characteristics of Current Front-ends
1.3 FOUR STUDIED TOPICS TO APPROACH HIGH BANDWIDTH FRONT-END
1.3.1 Objectives and Goals
1.3.2 Design Issues
1.3.3 The Thesis Organizations
CHAPTER 2 INCREASING FETCH BANDWIDTH AT THE FRONT-END ENTRANT: INSTRUCTION STREAM BUFFER PREFETCHING WITH THE BTB HELP
2.1 WHY DOES INSTRUCTION PREFETCHING NEED IMPROVEMENTS?
2.2 PREVIOUS WORK
2.3 BRANCH INSTRUCTION BASED (BIB) PREFETCHING
2.3.1 The Extended BTB
2.3.2 Maintaining Prefetching Information
2.4 EXPERIMENTS
2.4.1 Simulation Method
2.4.2 Machine Model Specification
2.4.3 Simulation Results
2.5 CONCLUDING REMARKS
CHAPTER 3 COPING WITH X86 VARIABLE-LENGTH INSTRUCTION BARRIER: IDENTIFYING MULTIPLE INSTRUCTIONS WITH POINTERS AND LENGTH PREDICTION/SPECULATION
3.1 WHY IS COPING WITH X86 VARIABLE-LENGTH INSTRUCTION BARRIER NEEDED?
3.2 CHARACTERISTICS OF IDENTIFYING INSTRUCTIONS WITH INSTRUCTION BOUNDARY BITS
3.2.1 The Sequential Scan Method
3.2.2 The Bit-lookahead Method
3.3 FETCHING MULTIPLE INSTRUCTIONS USING THE INSTRUCTION IDENTIFIER
3.3.1 Procedure
3.3.2 Instruction Identifier Design
3.4 SYSTEM PERFORMANCE EVALUATION
3.4.1 Simulation Environment
3.4.2 Prediction Scheme Analysis
3.4.3 Number of IPT Entries Analysis
3.4.4 Instruction Cache Line Size Analysis
3.4.5 Delay Time Estimation
3.5 CONCLUDING REMARKS
CHAPTER 4 FEEDING THE BACKEND WITH MORE INSTRUCTIONS: INSTRUCTION BUFFER LINE REASSEMBLING
4.1 WHY IS THE SUPER BASIC BLOCK FETCHING NEEDED?
4.2 AN INSTRUCTION STREAM BUFFER FOR HIGH ILP
4.3 THE DESIGN
4.4 THE ARCHITECTURE
4.4.1 Reassembly Controller
4.4.2 Pointer Reorder Unit
4.5 SIMULATION AND RESULTS
4.5.1 Fetcher Capacity Utilization
4.5.2 Buffer Width Analysis
4.5.3 Buffer Depth Analysis
4.5.4 Average Fetch Rate Comparison
4.5.5 Hardware Complexity Analysis
4.6 CONCLUDING REMARKS
CHAPTER 5 RECORDING INSTRUCTION ORDER FOR EXCEPTION HANDLING: A SPACE-EFFICIENT INSTRUCTION QUEUE DESIGN
5.1 WHY IS BUILDING A SPACE-EFFICIENT INSTRUCTION QUEUE NEEDED?
5.2 BASIC DESIGN PHILOSOPHIES OF THE X86 PC OFFSET QUEUE
5.2.1 Address Space Problems with the x86 Architecture
5.2.2 PC Offset Queue Behavior and Characteristics
5.3 THE PC OFFSET QUEUE STRUCTURE
5.3.1 Analysis of the PC Offset Queue Size
5.3.2 Analysis of the PC Offset Queue Organization
5.4 X86 PC OFFSET QUEUE IMPLEMENTATION
5.4.1 PC Offset Queue Controller
5.4.2 X86 Address Storage Schemes
5.5 CIRCUIT SYNTHESIS
5.6 CONCLUDING REMARKS
CHAPTER 6 THE PROGRAM TRACE SUPPORTED HIGH BANDWIDTH FRONT-END
6.1 THE ARCHITECTURE
6.2 PERFORMANCE MEASUREMENTS AND COMPARISONS
6.3 DISCUSSION
CHAPTER 7 CONCLUSIONS
7.1 THESIS SUMMARY
7.2 FUTURE DIRECTIONS
REFERENCES

