|
In the past the design philosophy for memory hierarchy was mainly based onrules of catching data locality for reducing memory latency. This thoughtmakes cache design preferring applications with regular data accesses.However, there are some kind of applications which were lose of locality,which we referred to irregular applications. For irregularapplications, for example, using too longer cache line might not meet thedata locality but cause extra overhead, such as transferring useless dataelements, or causing false sharing effects, etc. Thus, from the aspectof locality analysis, memory hierarchy design for irregular data accessesshould be investigated. We firstly study the memory hierarchy design bysectored caches in reducing false sharing misses on bus-basedmultiprocessors. In a sectored cache, each cache line is divided intoseveral subblocks. A subblock is a basic coherence unit. When false sharingoccurs, the involved cache line needs not be invalidated or transferred, aslong as the corresponding subblocks are kept coherent. To facilitate thestudy, we extend the conventional MESI protocol to sectored caches anddefine a performance metric called the degree of false sharingreduction to quantify the false sharing reduction on such caches. Wesimulated the execution of typical benchmarks, FFT, LU, Radix, SORBYR and SORBYC, on sectored caches. Evaluationresults show that our scheme can effectively reduce about 30% to 80% falsesharing misses and avoid useless coherence operations. On the other hand, wemeasure the effectiveness of different bounteous transfer schemes. Bounteous transfer is a scheme in sectored caches in which a subblock holdersupplies extra subblocks after transferring the missed subblock on a readmiss. We also investigate the effectiveness of three different types ofbounteous transfers: bounteous transfer with valid subblocks (BT-V),bounteous transfer with clean subblocks (BT-C), and bounteous transferdisabled (No-BT). Two metrics U-rate and R-rate are proposedto help observe the sharing granularities and coherence overhead moreprecisely. Evaluation results show that different benchmarks work betterwith different kinds of bounteous transfers and using bounteous transfercarelessly may result in performance degradation. Furthermore, another partof this dissertation we focus on data dependence detecting schemes forirregular data accesses. This topic is about run-time parallelizationtechniques using multiprocessor memory hierarchy. Run-time parallelizationis a technique for solving problems whose data access patterns are difficultto analyze at compile time. We propose a worker-checker framework toclassify different run-time parallelization schemes. Under the framework,operations performed during run-time parallelization are classified looselyinto a worker and a checker. Different schemes are then cast into theframework based on the relative execution order of their worker and checker.From the framework, we identified several new run-time parallelizationmethods. We examine the implementation of one such method derived fromspeculative parallelization. The implementation is based on the idea ofembedding hardware checkers inside either in sectored caches or memorycontrollers. We will present the design of the hardware checker and evaluatethe effectiveness of the design on run-time parallelizing DOALL and DOACROSSloops.
|