Minimizing communications when mapping affine loop nests onto distributed memory parallel computers has already drawn a lot of attention. This paper focuses on the next step: since it is generally impossible to obtain a communication-free (or local) mapping, how can the residual communications be optimized? We explain how to take advantage of macro-communications such as broadcasts, scatters, gathers, or reductions, or how to decompose general affine communications into simpler ones that can be performed more efficiently. Finally, we give a two-step heuristic that summarizes our approach: first minimize the number of nonlocal communications, then optimize the residual affine communications using macro-communications or decompositions.
In this paper, we propose a communication cost reduction computes rule for irregular loop partitioni...
Reducing communication overhead is extremely important in distributed-memory message-passing archite...
In this paper, we propose a communication cost reduction computes rule for irregular loop partitioning...
Minimizing communications when mapping affine loop nests onto distributed memory parallel computers ...
Minimizing communication overhead when mapping affine loop nests onto distributed memory parallel co...
Reducing communication overhead is extremely important in distributed-memory message-passing architec...
Many parallel applications require periodic redistribution of workloads and associated data. In a di...
Many parallel applications require periodic redistribution of workloads and associated data...
In this paper, we consider the communications involved by the execution of a complex application, de...
This paper describes a number of optimizations that can be used to support the efficient execution o...
In stencil based parallel applications, communications represent the main overhead, especially when ...
In distributed optimization for large-scale learning, a major performance limi...
Reconfiguration is largely an unexplored property in the context of parallel models of computation. ...
This paper presents modulo unrolling without unrolling (modulo unrolling WU), a method for message ...