[1] E. Rotenberg, S. Bennett, and J. E. Smith, "Trace cache: a low latency approach to high bandwidth instruction fetching," in Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-29), pp. 24–34, 1996.
[2] T. Conte, K. Menezes, P. Mills, and B. Patel, "Optimization of instruction fetch mechanisms for high issue rates," in Proceedings of the 22nd International Symposium on Computer Architecture, pp. 333–344, June 1995.
[3] T.-Y. Yeh, D. T. Marr, and Y. N. Patt, "Increasing the instruction fetch rate via multiple branch prediction and a branch address cache," in Proceedings of the 7th International Conference on Supercomputing, pp. 67–76, July 1993.
[4] R. S. Bajwa et al., "Instruction buffering to reduce power in processors for signal processing," IEEE Transactions on VLSI Systems, 1997.
[5] N. Bellas, I. Hajj, C. Polychronopoulos, and G. Stamoulis, "Energy and performance improvements in microprocessor design using a loop cache," in Proceedings of the International Conference on Computer Design (ICCD), 1999.
[6] T.-Y. Yeh and Y. N. Patt, "Alternative implementations of two-level adaptive branch prediction," Department of Electrical Engineering and Computer Science, The University of Michigan, 1992.
[7] J. E. Thornton, "Parallel operation in the Control Data 6600," in Proceedings of the October 27–29, 1964, Fall Joint Computer Conference, Part II: Very High Speed Computer Systems (AFIPS '64), pp. 33–40, San Francisco, CA, 1965.
[8] R. M. Tomasulo, "An efficient algorithm for exploiting multiple arithmetic units," IBM Journal of Research and Development, vol. 11, no. 1, pp. 25–33, January 1967.
[9] J. B. Dennis and D. P. Misunas, "A preliminary architecture for a basic data-flow processor," in Proceedings of the 2nd Annual Symposium on Computer Architecture, pp. 126–131, Houston, TX, January 1975.
[10] E. J. Lerner, "Data-flow architecture," IEEE Spectrum, pp. 57–62, April 1984.
[11] J. A. Fisher, P. Faraboschi, and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Elsevier, 2005.
[12] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., 1995.
[13] S. Weiss and J. E. Smith, "A study of scalar compilation techniques for pipelined supercomputers," in Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 105–109, Palo Alto, CA, October 1987.
[14] F. H. McMahon, "The Livermore Fortran Kernels: A computer test of the numerical performance range," Lawrence Livermore National Laboratory, Livermore, CA, 1986.
[15] J. C. Huang and T. Leng, "Generalized loop-unrolling: a method for program speedup," in Proceedings of the 1999 IEEE Symposium on Application-Specific Systems and Software Engineering and Technology, 1999.
[16] J. W. Davidson and S. Jinturkar, "Improving instruction-level parallelism by loop unrolling and dynamic memory disambiguation," in Proceedings of the 28th Annual International Symposium on Microarchitecture, pp. 125–132, 1995.
[17] Po-Kai Chen, "ESL Model of the Hyper-scalar Processor on a Chip," Department of Electrical Engineering, National Sun Yat-Sen University, 2007.
[18] Yu-Lian Chou, "Study of the Hyperscalar Multi-core Architecture," Department of Electrical Engineering, National Sun Yat-Sen University, 2011.
[19] Yin-Jou Huang, "Design of the Optimized Group Management Unit by Detecting Thread Parallelism on the Hyperscalar Architecture," Department of Electrical Engineering, National Sun Yat-Sen University, 2013.
[20] Zh-Lung Chen, "Improving ILP with Semantic-Based Loop Unrolling Mechanism in X86 Architectures," Department of Computer Science and Information Engineering, National Chiao Tung University, 1999.