In most cases of distributed-memory computation, node programs are executed on processors according to the owner-computes rule. However, the owner-computes rule is not well suited to irregular application codes. In irregular application codes, the use of indirection in accessing the left-hand-side array makes it difficult to partition the loop iterations, and because indirection is also used in accessing the right-hand-side elements, total communication may be reduced by using heuristics other than the owner-computes rule. In this paper, we propose a communication-cost-reducing computes rule for irregular loop partitioning, called the least-communication computes rule. We partition a loop iteration to a processor on which the minimal communication cost is ensured w...
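A minimal sketch of the idea behind such a rule, under illustrative assumptions (block distribution, a cost model that just counts remote element accesses; function names and the cost model are ours, not the paper's implementation):

```python
# Hypothetical sketch of a least-communication computes rule:
# assign each loop iteration to the processor that would incur the
# fewest remote element accesses, given a block-distributed owner map.

def owner(index, block_size):
    """Owner processor of array element `index` under block distribution."""
    return index // block_size

def assign_iterations(lhs_idx, rhs_idx, num_procs, block_size):
    """For each iteration i (which writes lhs_idx[i] and reads rhs_idx[i]),
    count the remote accesses it would cause on each candidate processor
    and place it on the processor with the minimal count."""
    placement = []
    for i in range(len(lhs_idx)):
        touched = [owner(lhs_idx[i], block_size)] + \
                  [owner(j, block_size) for j in rhs_idx[i]]
        # communication cost on processor p = accesses owned elsewhere
        cost = lambda p: sum(1 for q in touched if q != p)
        placement.append(min(range(num_procs), key=cost))
    return placement

# Example: 2 processors, elements 0-3 on P0, 4-7 on P1.
# Iteration 0 writes x[1], reads y[0], y[5]; iteration 1 writes x[6],
# reads y[4], y[7].
print(assign_iterations(lhs_idx=[1, 6], rhs_idx=[[0, 5], [4, 7]],
                        num_procs=2, block_size=4))  # -> [0, 1]
```

Under the owner-computes rule both iterations would go to the writer's owner regardless of where the reads live; the least-communication rule instead weighs all touched elements.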
Data-parallel languages allow programmers to use the familiar machine-independent programming style ...
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/16...
Communication overhead in multiprocessor systems, as exemplified by cache coherency traffic and glob...
This paper describes a number of optimizations that can be used to support the efficient execution o...
In irregular all-to-all communication, messages are exchanged between every pair of processors. The ...
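The exchange pattern described here can be simulated in a few lines; this is an illustrative model of an irregular (variable-message-size) all-to-all, the pattern that collectives such as MPI_Alltoallv implement on real machines, not code from the cited work:

```python
# Illustrative simulation of irregular all-to-all communication:
# each of P processes sends a variably sized message to every other process.

def irregular_alltoall(send_bufs):
    """send_bufs[p][q] is the message process p sends to process q.
    Returns recv_bufs where recv_bufs[q][p] is what q received from p."""
    nprocs = len(send_bufs)
    return [[send_bufs[p][q] for p in range(nprocs)] for q in range(nprocs)]

# 3 processes exchanging messages of different (possibly zero) lengths.
sends = [[[0], [0, 1], []],
         [[1, 1], [1], [1, 1, 1]],
         [[], [2], [2, 2]]]
recvs = irregular_alltoall(sends)
print(recvs[1])  # messages process 1 received: [[0, 1], [1], [2]]
```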
In this paper, some automatic parallelization and optimization techniques for irregular scientific ...
There are many important applications in computational fluid dynamics, circuit simulation and struct...
Intensive scientific algorithms can usually be formulated as nested loops which are the ...
Parallelizing sparse irregular applications on distributed-memory systems poses serious scalability c...
Communication (data movement) often dominates a computation's runtime and energy costs, motivating o...
In prior work, we have proposed techniques to extend the ease of shared-memory parallel programming ...
Communication set generation significantly influences the performance of parallel programs. However...