Abstract—In this paper we focus on optimizing performance in a cluster of Simultaneous Multithreading (SMT) processors connected with a commodity interconnect (e.g. Gbit Ethernet) by overlapping computation with communication. As a test case we consider the parallelized advection equation and discuss the steps that must be followed to semantically allow overlapping to occur. We propose an implementation based on the concept of Helper Threading that distributes computation and communication to the two sibling threads of an SMT processor, thus creating an asymmetric pair of execution patterns in each hardware context. Our experimental results in an 8-node cluster interconnected with commodity Gbit Ethernet demonstrate th...
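To illustrate the overlap pattern described above, the following is a minimal sketch (not the authors' implementation): the main thread plays the role of the communication sibling and drives a non-blocking MPI halo exchange, while a spawned POSIX thread plays the role of the computation sibling and updates the interior points of a 1-D advection stencil; boundary points are updated only after the halo has arrived. The grid size N, the coefficient c, and the centered-difference update are illustrative assumptions, not taken from the paper.

/* Minimal sketch of computation/communication overlap (illustrative only):
 * main thread = communication sibling, spawned thread = computation sibling.
 * N, HALO, c and the centered-difference update are assumed for illustration. */
#include <mpi.h>
#include <pthread.h>

#define N    (1 << 16)            /* local grid points per rank (illustrative) */
#define HALO 1                    /* one ghost cell on each side               */

static double u[N + 2 * HALO], unew[N + 2 * HALO];
static const double c = 0.5;      /* CFL-like coefficient (illustrative)       */

/* Computation sibling: interior points need no halo data, so they can be
 * updated while the halo exchange is still in flight. */
static void *compute_interior(void *arg) {
    (void)arg;
    for (int i = HALO + 1; i < N + HALO - 1; ++i)
        unew[i] = u[i] - 0.5 * c * (u[i + 1] - u[i - 1]);
    return NULL;
}

int main(int argc, char **argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank - 1 + size) % size;   /* periodic neighbors */
    int right = (rank + 1) % size;

    pthread_t worker;                        /* computation sibling */
    pthread_create(&worker, NULL, compute_interior, NULL);

    /* Communication sibling: exchange one ghost cell with each neighbor. */
    MPI_Request req[4];
    MPI_Irecv(&u[0],            1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[N + HALO],     1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[HALO],         1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[N + HALO - 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    pthread_join(worker, NULL);

    /* Boundary points depend on the received halo, so they are updated last. */
    unew[HALO]         = u[HALO]         - 0.5 * c * (u[HALO + 1] - u[0]);
    unew[N + HALO - 1] = u[N + HALO - 1] - 0.5 * c * (u[N + HALO] - u[N + HALO - 2]);

    MPI_Finalize();
    return 0;
}

In a real SMT deployment the two threads would additionally be pinned to the two hardware contexts of the same physical core (e.g. via CPU affinity calls), which is what produces the asymmetric pairing the paper describes; that step is omitted here for brevity.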
Compiler optimizations are often driven by specific assumptions about the underlying architecture an...
In modern MPI applications, communication between separate computational nodes quickly adds up to a s...
To amortize the cost of MPI collective operations, non-blocking collectives ha...
Different applications may exhibit radically different behaviors and thus have very different requir...
Conventional wisdom suggests that the most efficient use of modern computing clusters employs techni...
This paper extends research into rhombic overlapping-connectivity interconnection networks into the ...
Modern processors provide a multitude of opportunities for instruction-level parallelism that most c...
New feature sizes provide a larger number of transistors per chip that architects could use in order t...
Simultaneous Multithreading (SMT) has been proposed for improving processor throughput by overlappin...
Several multithreading techniques have been proposed to reduce the resource underutilization in Very...
To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruc...
Non-blocking collectives have been proposed so as to allow communications to b...
In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on...
Simultaneous multithreading (SMT) is an architectural technique that allows for the parallel executi...