
National Digital Library of Theses and Dissertations in Taiwan

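Author: 劉廣治 (Liu, Kuang-Chih)
Title: Memory Hierarchy Design for Irregular Data Accesses
Title (original): 不規則資料存取情形下之記憶體階層架構設計
Advisor: 金仲達 (Chung-Ta King)
Degree: Doctoral
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系)
Discipline: Engineering
Field: Electrical and information engineering
Thesis type: Academic thesis
Year of publication: 1998
Academic year of graduation: 86 (ROC calendar)
Language: Chinese
Pages: 102
Keywords (Chinese): 快取記憶體; 記憶體階層架構; 假分享; 執行時期平行化; 資料預取
Keywords (English): cache memory; memory hierarchy; false sharing; run-time parallelization; data prefetching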
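This dissertation investigates memory hierarchy design for multiprocessor systems. In the past, memory hierarchy design was based mainly on capturing data locality in caches in order to shorten memory latency. This thinking makes caches favor applications with regular data access patterns. However, many applications exhibit memory access behavior with weak data locality; we call them irregular applications. For example, an overly long cache line not only fails to match the application's working set but also causes excessive useless data movement, and can even incur the extra overhead of false sharing. From the perspective of locality analysis, memory hierarchy design under irregular data accesses therefore deserves investigation. Besides making locality hard to capture, irregular accesses have long posed a difficulty that compilers cannot overcome in data dependence analysis. Providing a run-time dependence checking mechanism within the memory hierarchy is therefore the other theme of this dissertation. In short, the dissertation addresses two aspects: (1) data locality and (2) data dependence.

First, on data locality, we design a coherence protocol for fine-grained partitioned caches, previously known as "sectored caches". The main purpose of using sectored caches is to reduce the false sharing misses caused by irregular accesses. When false sharing occurs, the cache line involved need not be marked invalid as a whole; only the affected subblocks need to be updated. This partial invalidation effectively reduces false sharing. For the experiments, we define a metric called the "false sharing reduction rate". Using this metric, we find that for programs such as FFT, LU, Radix, SORBYR and SORBYC the reduction exceeds 30% and reaches as much as 80%.

Our study also shows that, besides reducing false sharing misses, sectored caches offer another level of data transfer: on a miss, data is transferred selectively at subblock granularity according to each subblock's coherence state, which keeps unnecessary data off the network. A technique called "bounteous transfer" is discussed in detail. To suit the characteristics of different applications, bounteous transfer is divided into three levels: transfer with valid subblocks, transfer with clean subblocks, and bounteous transfer disabled. Two metrics, the U-rate and the R-rate, are used in the experiments to help observe how each application's data access pattern affects the cache coherence protocol. We find and demonstrate that different applications favor different bounteous transfer modes; blindly using an unsuitable mode degrades cache performance.

On data dependence, we study run-time parallelization techniques that use the memory hierarchy to perform dependence checking for irregular data accesses. Surveying existing run-time parallelization techniques, we classify them and propose six "worker/checker" organizations. We examine one of these organizations in detail and design it as a memory hierarchy for "speculative parallel execution". The idea is to embed the checker logic in the cache or in main memory. The design details and the experimental methodology are described in the text. Experimental results show that the design achieves good performance for both DOALL and DOACROSS loops without adding overhead to the original system.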
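The partial invalidation idea described in this abstract can be illustrated with a toy trace-driven model. This is a minimal sketch, not the protocol from the dissertation: it assumes an infinite cache, treats a miss as "false sharing" when the requested word was never written remotely since the invalidation, and computes the reduction as a simple relative difference. The function name `false_sharing_misses`, the trace format, and the reduction formula are all assumptions made for illustration; the abstract names the metric but does not define it.

```python
def false_sharing_misses(trace, n_cpus, words_per_unit):
    """Count false-sharing misses when coherence is kept per unit of
    `words_per_unit` words (a whole line, or one subblock of a line).

    trace: list of (cpu, op, word_address) with op in {"R", "W"}.
    Assumed definition: a miss is a false-sharing miss if the unit was
    invalidated by remote writes that never touched the requested word.
    """
    valid = [set() for _ in range(n_cpus)]   # units each CPU currently holds
    stale = [dict() for _ in range(n_cpus)]  # unit -> words written remotely
    misses = 0
    for cpu, op, addr in trace:
        unit = addr // words_per_unit
        if unit not in valid[cpu]:
            written = stale[cpu].pop(unit, None)
            if written is not None and addr not in written:
                misses += 1                  # invalidated, yet our word was untouched
            valid[cpu].add(unit)
        if op == "W":
            for other in range(n_cpus):
                if other == cpu:
                    continue
                if unit in valid[other]:
                    valid[other].discard(unit)       # invalidate the remote copy
                    stale[other][unit] = {addr}
                elif unit in stale[other]:
                    stale[other][unit].add(addr)     # remember later remote writes


    return misses


LINE_WORDS = 8       # hypothetical line size, in words
SUBBLOCK_WORDS = 2   # hypothetical subblock size, in words

# CPU 0 keeps writing word 0; CPU 1 alternates between words 4 and 1.
trace = [(0, "W", 0), (1, "W", 4), (0, "W", 0), (1, "W", 1)] * 2

conventional = false_sharing_misses(trace, 2, LINE_WORDS)
sectored = false_sharing_misses(trace, 2, SUBBLOCK_WORDS)
reduction = (conventional - sectored) / conventional  # assumed formula
print(conventional, sectored, round(reduction, 2))    # prints: 6 2 0.67
```

In this trace, words 0 and 4 fall in different subblocks, so their conflict disappears with subblock coherence, while words 0 and 1 share a subblock and still conflict. That is why the reduction is partial rather than complete.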

In the past, the design philosophy for memory hierarchies was based mainly on capturing data locality to reduce memory latency. This thinking makes cache designs favor applications with regular data accesses. However, some applications lack locality; we refer to them as irregular applications. For irregular applications, a cache line that is too long may not match the data locality and instead causes extra overhead, such as transferring useless data elements or inducing false sharing. Thus, from the standpoint of locality analysis, memory hierarchy design for irregular data accesses should be investigated.

We first study the use of sectored caches to reduce false sharing misses on bus-based multiprocessors. In a sectored cache, each cache line is divided into several subblocks, and a subblock is the basic coherence unit. When false sharing occurs, the cache line involved need not be invalidated or transferred as long as the corresponding subblocks are kept coherent. To facilitate the study, we extend the conventional MESI protocol to sectored caches and define a performance metric, the degree of false sharing reduction, to quantify the false sharing reduction on such caches. We simulated the execution of typical benchmarks (FFT, LU, Radix, SORBYR and SORBYC) on sectored caches. Evaluation results show that our scheme reduces false sharing misses by about 30% to 80% and avoids useless coherence operations.

We also measure the effectiveness of different bounteous transfer schemes. Bounteous transfer is a scheme for sectored caches in which a subblock holder supplies extra subblocks after transferring the missed subblock on a read miss. We investigate three types of bounteous transfer: bounteous transfer with valid subblocks (BT-V), bounteous transfer with clean subblocks (BT-C), and bounteous transfer disabled (No-BT). Two metrics, U-rate and R-rate, are proposed to help observe sharing granularity and coherence overhead more precisely.
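The three bounteous transfer variants named above can be contrasted with a small sketch. It rests on assumptions the abstract does not confirm: subblock states are reduced to invalid (`"I"`), clean (`"C"`) and dirty (`"D"`), BT-V is read as "send every subblock that is valid in the holder", and BT-C as "send only the subblocks that are clean". The names `Policy` and `subblocks_to_send` are hypothetical.

```python
from enum import Enum


class Policy(Enum):
    BT_V = "valid"      # bounteous transfer with valid subblocks
    BT_C = "clean"      # bounteous transfer with clean subblocks
    NO_BT = "disabled"  # bounteous transfer disabled


def subblocks_to_send(holder_states, missed, policy):
    """Return the subblock indices a holder supplies on a read miss.

    holder_states: per-subblock state in the holder's copy of the line,
    each one of "I" (invalid), "C" (clean) or "D" (dirty).
    missed: index of the subblock the requester missed on.
    """
    if holder_states[missed] == "I":
        raise ValueError("holder does not have the missed subblock")
    if policy is Policy.BT_V:
        extra = {i for i, s in enumerate(holder_states) if s != "I"}
    elif policy is Policy.BT_C:
        extra = {i for i, s in enumerate(holder_states) if s == "C"}
    else:
        extra = set()
    return sorted(extra | {missed})  # the missed subblock is always sent


holder = ["C", "D", "I", "C"]  # hypothetical line with four subblocks
for policy in Policy:
    print(policy.name, subblocks_to_send(holder, 0, policy))
```

For this line, BT-V sends subblocks 0, 1 and 3, BT-C sends 0 and 3, and No-BT sends only subblock 0. Sending more subblocks costs bus traffic now but may avoid later misses.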
Evaluation results show that different benchmarks work better with different kinds of bounteous transfer, and that using bounteous transfer carelessly may degrade performance.

The other part of this dissertation focuses on data dependence detection schemes for irregular data accesses, specifically run-time parallelization techniques that use the multiprocessor memory hierarchy. Run-time parallelization is a technique for problems whose data access patterns are difficult to analyze at compile time. We propose a worker-checker framework to classify different run-time parallelization schemes. Under the framework, the operations performed during run-time parallelization are loosely classified into a worker and a checker. Different schemes are then cast into the framework according to the relative execution order of their worker and checker. From the framework, we identify several new run-time parallelization methods. We examine the implementation of one such method, derived from speculative parallelization. The implementation is based on the idea of embedding hardware checkers in either sectored caches or memory controllers. We present the design of the hardware checker and evaluate its effectiveness in run-time parallelization of DOALL and DOACROSS loops.
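The worker and checker roles in speculative run-time parallelization can be sketched in software. This is only an illustrative analogue under stated assumptions, not the hardware checker designed in the dissertation: the loop has the form `a[write_idx[i]] = body(a[read_idx[i]])`, the "parallel" execution is simulated by letting every iteration read pre-loop values, and a failed check falls back to sequential re-execution. The function name `speculative_loop` and the logging structures are hypothetical, and DOACROSS scheduling is not modeled.

```python
def speculative_loop(a, read_idx, write_idx, body):
    """Run a[write_idx[i]] = body(a[read_idx[i]]) speculatively as a DOALL loop.

    Worker: executes every iteration against pre-loop values, as if all
    iterations ran in parallel. Checker: logs accesses and flags any element
    written by one iteration and read or written by a different one.
    Returns "committed" or "re-executed".
    """
    backup = list(a)   # checkpoint used for reads and for rollback
    writer = {}        # element -> iteration that wrote it
    readers = {}       # element -> set of iterations that read it
    ok = True
    for it, (r, w) in enumerate(zip(read_idx, write_idx)):
        a[w] = body(backup[r])                 # worker
        readers.setdefault(r, set()).add(it)   # checker: log the read
        if writer.setdefault(w, it) != it:     # checker: log the write
            ok = False                         # two iterations wrote one element
    for elem, it in writer.items():
        if readers.get(elem, set()) - {it}:
            ok = False                         # another iteration read this element
    if ok:
        return "committed"
    a[:] = backup                              # roll back the speculative state
    for r, w in zip(read_idx, write_idx):
        a[w] = body(a[r])                      # safe sequential re-execution
    return "re-executed"


def inc(x):
    return x + 1


independent = [0, 10, 20, 30]
print(speculative_loop(independent, [0, 1], [2, 3], inc), independent)

dependent = [0, 10, 20, 30]
print(speculative_loop(dependent, [0, 2], [2, 3], inc), dependent)
```

In the second call, iteration 1 reads element 2, which iteration 0 writes, so the checker rejects the speculative result and the loop is re-executed in order.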
Thesis [Chinese text]
Cover
Abstract
Contents
List of Figures
1 Introduction
1.1 Motivation
1.2 Catching Data Locality Properly
1.3 Detecting Data Dependence at Run-time
1.4 Contributions
1.5 Organization of the Dissertation
2 A New Sectored Cache Protocol
2.1 Notations and States in the Protocol
2.2 State Transitions
2.3 False Sharing Metrics
2.4 Related Work
2.5 Performance Evaluation
3 Bounteous Transfers
3.1 Original Idea
3.2 Variations
3.3 Performance Metrics
3.4 Results
4 A New Framework of Run-time Parallelization
4.1 Introduction
4.2 Previous Run-time Parallelization Schemes and Their Overheads
4.3 Worker-Checker Framework
5 Case Study: Design of Hardware Checkers
5.1 Overall Organization
5.2 Hazard Conditions
5.3 The Checker Algorithm
5.4 Implementation of Hardware Checkers
5.5 The Checker Circuit
5.6 Memory Controllers with the Checker Circuit
5.7 Cache Controllers with Checker Circuit
5.8 Software Supports
5.9 Performance Evaluation
5.10 Performance Results of Using Cache Controller with the Checker Circuit
6 Conclusion
A Comparisons of Sectored Cache Protocol