|
|