Abstract. In most distributed-memory computations, node programs are executed on processors according to the owner-computes rule. However, the owner-computes rule is not well suited to irregular application codes. In such codes, the use of indirection in accessing left-hand-side arrays makes it difficult to partition the loop iterations, and because indirection is also used in accessing right-hand-side elements, total communication can be reduced by using heuristics other than the owner-computes rule. In this paper, we propose a communication-cost-reducing computes rule for irregular loop partitioning, called the least-communication computes rule. We assign each loop iteration to the processor on which the minimal communication cost is incurred.
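To make the rule concrete, the following is a minimal sketch of how an iteration could be assigned under the least-communication computes rule, assuming a block distribution of arrays and a unit cost per remote element access; the function names and the example loop `x[ia[i]] += y[ib[i]] + y[ic[i]]` are illustrative, not taken from the paper.

```python
# Sketch: least-communication computes rule for one irregular loop iteration.
# Assumptions: arrays are block-distributed, and every remote element access
# costs one unit of communication.

def owner(index, block_size):
    """Owner of an array element under a block distribution."""
    return index // block_size

def assign_iteration(touched_indices, num_procs, block_size):
    """Assign a loop iteration to the processor that owns the most of the
    elements the iteration touches; remote accesses for processor p equal
    len(touched_indices) minus the number of locally owned elements, so
    maximizing local ownership minimizes communication."""
    counts = [0] * num_procs
    for idx in touched_indices:
        counts[owner(idx, block_size)] += 1
    return max(range(num_procs), key=lambda p: counts[p])

# Example loop body: x[ia[i]] += y[ib[i]] + y[ic[i]]
# Iteration i touches elements ia[i], ib[i], ic[i].
# 2 processors, block size 4: elements 0..3 live on p0, elements 4..7 on p1.
ia, ib, ic = [0], [5], [6]
p = assign_iteration([ia[0], ib[0], ic[0]], num_procs=2, block_size=4)
print(p)  # least-communication picks p1 (owns 2 of 3 touched elements)
```

Note the contrast with the owner-computes rule: the left-hand-side element `x[0]` lives on processor 0, so owner-computes would execute the iteration there at a cost of two remote accesses, whereas the least-communication rule executes it on processor 1, which owns two of the three touched elements, at a cost of one.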
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/19...