In this paper, we develop an automatic compile-time computation and data decomposition technique for distributed memory machines. Our method handles complex programs containing perfect and non-perfect loop nests, with or without loop-carried dependences. Our decomposition algorithms divide a program into collections of loop nests, called clusters, such that data redistribution is allowed only between clusters. Within each cluster of loop nests, decomposition and data locality constraints are formulated as a system of homogeneous linear equations, which is solved by polynomial-time algorithms. Our algorithm can selectively relax data locality constraints within a cluster to achieve a balance between parallelism and ...
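The core step described above, collecting a cluster's locality constraints into a homogeneous linear system and solving it, can be illustrated with a small sketch. The constraint matrix `C` below is a made-up example, not taken from the paper: each row stands for one hypothetical alignment constraint on a decomposition vector `x`, and any basis of the null space of `C` (the solutions of `C x = 0`) describes the decompositions that satisfy every constraint at once.

```python
import numpy as np

def nullspace(C, tol=1e-10):
    """Return an orthonormal basis for the null space of C,
    i.e. the set of vectors x with C @ x = 0, via the SVD."""
    _, s, vt = np.linalg.svd(C)
    rank = int(np.sum(s > tol))
    # Rows of vt beyond the rank span the null space; transpose to columns.
    return vt[rank:].T

# Hypothetical locality constraints for one cluster: each row couples two
# components of the decomposition vector x = (x1, x2, x3).
C = np.array([[1.0, -1.0,  0.0],   # x1 = x2
              [0.0,  1.0, -1.0]])  # x2 = x3

basis = nullspace(C)
# The solution space is one-dimensional here, spanned by (1, 1, 1)/sqrt(3):
# all three dimensions must be decomposed identically to preserve locality.
```

Computing the null space with an SVD (rather than Gaussian elimination) is numerically robust; a rank-deficient, and hence non-trivially solvable, system corresponds to the case where the cluster's constraints can all be met without relaxation.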
In this paper, an approach to the problem of exploiting parallelism within nested loops is ...
In recent years, distributed memory parallel machines have been widely recognized as the most likely...
Programming for parallel architectures that do not have a shared address space is extremely difficul...
This paper addresses the problem of partitioning data for distributed memory machines (multicomputer...
On shared memory parallel computers (SMPCs) it is natural to focus on decomposing the computation (...
An approach to programming distributed memory-parallel machines that has recently become popular is ...
This paper outlines two methods which we believe will play an important role in any distributed memo...
In this paper we present a unified approach for compiling programs for Distributed-Memory Multiproce...
In distributed memory multicomputers, local memory accesses are much faster than those i...
Automatic Global Data Partitioning for Distributed Memory Machines (DMMs) is a difficult problem. Di...
This paper addresses the problem of partitioning data for distributed memory machines or multicomput...
We present a unified approach to locality optimization that employs both data and control transforma...
Increased programmability for concurrent applications in distributed systems requires automatic supp...
We articulate the need for managing (data) locality automatically rather than leaving it to the prog...
Communication overhead in multiprocessor systems, as exemplified by cache coherency traffic and glob...