|
In this paper, we discuss how to parallelize the double loop and distribute any array in the double loop so that a generated parallel program can be executed on a 1-dimension-form distributed-memory multiprocessor system, then we can get a good speedup. On a distributed-memory multiprocessor system, a parallel program must avoid as much synchronization and communication between nodes as possible; otherwise we don't have to parallelize it because of the worse parallelism, much synchronization delay and communication overhead. Therefore we analyze all possible conditions produced by the fixed-form dependence of the double loop. With an efficient unimodular transformation, we transform the double loop into a form suitable for outer loop parallelization. In order to reduce more communication, we suggest a algorithm to find and replicate a read-only array that causes a lot of remote data access. At last we partition the index domain of outer loop and non-replicated array to map into nodes. We split iteration space into independent subspace to minimize communication. Then we partition iteration space into non-uniform size block in order to balance the work load of every node. After partitioning, every part of double loop should be interchanged to overlap more computation and communication. Thus we attain a less communication overhead and minimized synchronization delay, and do an efficient parallelization.
|