National Digital Library of Theses and Dissertations in Taiwan
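Graduate student: 陳煥文
Graduate student (romanized): Chen, Huan-Wen
Thesis title: 應用於多核心平台之可堆疊記憶體存取效率改進與分析
Thesis title (English): Efficiency Improvement and Analysis of Accessing Stacked Memories on Many-Core Platforms
Advisor: 黃稚存
Advisor (romanized): Huang, Chih-Tsun
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2013
Academic year of graduation: 102 (ROC calendar)
Language: English
Number of pages: 77
Keywords (translated from Chinese): many-core; many-core single chip; stacked memory; Wide I/O; dynamic random-access memory
Keywords (English): Many-Core; CMP; Stacked Memories; Wide I/O; DRAM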
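Because of its simple structure and relatively low cost, DRAM is commonly used as main memory in computer architecture design. Historically, however, DRAM performance has improved far more slowly than processor clock speed, which led W. Wulf and S. McKee to propose the concept of the "memory wall" as early as 1994. Meanwhile, to keep pace with Moore's law, the number of cores on a single chip has kept growing, from the original single core to today's many-core systems. Compared with a single core, a many-core system replaces increases in clock frequency with core-level parallelism, yet its demand for memory throughput rises rather than falls. Many researchers have therefore worked on improving memory access efficiency, for example by improving the scheduling efficiency of the memory controller, widening the bus, or raising the memory access rate. In recent years, the emergence of stacked memory architectures has slightly eased the demand for memory throughput. In many-core systems built on a network-on-chip, however, the distance from a core to a memory controller grows as the on-chip network gets larger. In this thesis, we therefore use an additional many-to-many switch network to handle accesses from cores to controllers. This not only reduces the network-on-chip congestion caused by heavy memory traffic but also lets cores reach the memory controllers faster. Experiments with the SPLASH-2 benchmarks show that this architecture improves core-to-memory access efficiency by 1.13 to 2.57 times and that it is applicable to current stacked-memory architectures.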
Because of its structural simplicity, high density per unit area, and low cost, DRAM is well suited to serve as main memory in computer architecture. Historically, however, the rate of improvement in processor speed has exceeded the rate of improvement in DRAM speed, a phenomenon that W. Wulf and S. McKee called the “memory wall”. Over the past few decades the number of on-chip cores has grown from one to several, and the upcoming NoC-based (mostly mesh) many-core architectures no longer simply push single-processor performance but exploit parallelism to meet throughput requirements with better cost-effectiveness. Unfortunately, the demand for memory bandwidth and throughput keeps increasing. Many engineers have therefore tried to improve the efficiency between the memory controller and the DRAM devices by proposing better memory scheduling policies, increasing bandwidth, and improving access speed. Recently, the emergence of 3D-stacked DRAM (Wide I/O) has slightly reduced the speed gap between the processor and the memory system. However, an architecture that uses a Network-on-Chip as the bridge between processors and memory controllers has the drawback that some DRAM requests must travel a long distance to reach a memory controller. Motivated by this, in this thesis we present an architecture that improves the efficiency of accessing stacked memories on many-core platforms. The architecture uses an extra switch network to transport packets from the processors to the DRAM sub-system, and it groups a small number of processors and assigns each group to a specific DRAM channel. This alleviates the traffic contention between DRAM requests and inter-processor communication. As a baseline, we use the traditional method in which all DRAM requests are routed through the NoC. Experimental results on SPLASH-2 applications demonstrate a significant speedup ranging from 1.13 to 2.57 times, with an affordable crossbar switch network that is also applicable to the Wide I/O DRAM interface.
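To illustrate why routing DRAM requests through a mesh NoC can mean long paths, the following minimal Python sketch counts hops from each tile to its nearest memory controller and shows one possible way of grouping cores onto DRAM channels. It is not taken from the thesis: the function names, the corner placement of the controllers, the use of four channels, and the contiguous grouping of core IDs are all assumptions made for illustration only.

```python
from itertools import product


def mesh_hops(src, dst):
    """Hop count between two (x, y) tiles of a 2D mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def average_hops_via_noc(size, controllers):
    """Mean hop count from every tile to its nearest memory-controller tile."""
    tiles = list(product(range(size), repeat=2))
    total = sum(min(mesh_hops(t, c) for c in controllers) for t in tiles)
    return total / len(tiles)


def group_cores_to_channels(num_cores, num_channels):
    """Map consecutive core IDs to DRAM channels in equal-sized groups.

    Hypothetical grouping policy; the thesis may group cores differently.
    """
    per_group = -(-num_cores // num_channels)  # ceiling division
    return {core: core // per_group for core in range(num_cores)}


if __name__ == "__main__":
    size = 8  # assumed 8x8 mesh
    # Assumed placement: one memory controller at each corner of the mesh.
    corners = [(0, 0), (0, size - 1), (size - 1, 0), (size - 1, size - 1)]
    print("average NoC hops to nearest controller:",
          average_hops_via_noc(size, corners))
    print("core 37 uses channel:",
          group_cores_to_channels(size * size, 4)[37])
```

Under these assumptions, an 8x8 mesh with corner controllers averages 3.0 hops per request, and that figure grows with the mesh size. A dedicated switch network between each core group and its channel would make the path length independent of the tile's position in the mesh, which is the effect the abstract describes.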
Abstract
Acknowledgments
Contents
1 Introduction
1.1 Introduction to Many-Core Platform
1.2 The Evolution of System-on-Chip and Motivation
1.3 Contribution
1.4 Thesis Organization
2 Previous Work
2.1 Background
2.1.1 Open Core Protocol
2.1.2 OpenRISC
2.2 Overview of The ESL Many-Core Platform
2.2.1 Processing Element
2.2.2 Communication Unit
2.2.3 Network-on-Chip
2.3 The Principle of Direct Memory Access
2.4 Software Communication Library
2.5 Existing Methods for Improvement of Memory System
3 Proposed Many-Core Based 3D-Stacked Memories Architecture
3.1 Memory Controller Placement Analysis
3.1.1 Related Work
3.1.2 Introduction to Wide I/O
3.1.3 Slightly Traffic Analysis and Timing Analysis
3.2 Overview of 3D-Stacked Memories Architecture
3.2.1 Methodology and Feasibility
3.2.2 The Architecture of 3D-DRAM Sub-system
3.2.3 The Discussion of Scalability
3.3 DRAMSim2
3.4 Crossbar Switch Network
3.5 Reorder Table
3.6 Mechanism of One (Burst) Read
4 Experiment Results
4.1 Overview of Experiment Environment
4.2 Random Distribution Analysis
4.3 Applications Analysis
4.3.1 Odd-Even Sort Analysis
4.3.2 Median Filter for Gray Analysis
4.4 The SPLASH-2 Analysis
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
A Simulation Timing
B SPLASH2 - Speed up vs. Injection rate