This paper demonstrates that the one-sided communication used in languages like UPC can provide a significant performance advantage for bandwidth-limited applications. This is shown through communication microbenchmarks and a case study of UPC and MPI implementations of the NAS FT benchmark. Our optimizations rely on aggressively overlapping communication with computation, alleviating bottlenecks that typically occur when communication is isolated in a single phase. The new algorithms send more and smaller messages, yet the one-sided versions achieve more than a 1.9x speedup over the base Fortran/MPI implementation. Our one-sided versions show an average 15% improvement over the two-sided versions, due to the lower software overhead of one-sided communication.
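To make the overlap pattern concrete, the following is a minimal C sketch of the idea using standard MPI-3 one-sided operations. The paper's own implementations are written in UPC (over GASNet), so this is an illustrative analogue under stated assumptions, not the authors' code; the helper names (compute_slab, NSLABS, SLAB_ELEMS) are hypothetical. Each slab of local work is pushed to its destination with a one-sided put as soon as it is computed, so communication proceeds while the remaining slabs are still being computed, and the phase synchronizes only once at the end.

/* Sketch of the communication/computation overlap described above.
 * As each slab of the local computation (e.g., a row of 1-D FFTs)
 * finishes, its contribution is pushed to the destination rank with
 * a one-sided put while the next slab is computed.  MPI-3 RMA is
 * used here only to keep the sketch self-contained; compute_slab,
 * NSLABS, and SLAB_ELEMS are hypothetical placeholders. */
#include <mpi.h>
#include <stdlib.h>

#define NSLABS     16
#define SLAB_ELEMS 4096

/* Hypothetical per-slab computation. */
static void compute_slab(double *slab, int s) { (void)slab; (void)s; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    double *local = malloc(NSLABS * SLAB_ELEMS * sizeof *local);
    double *remote;
    MPI_Win win;
    /* Expose a receive buffer for one-sided access by all ranks. */
    MPI_Win_allocate((MPI_Aint)(NSLABS * SLAB_ELEMS * sizeof *remote),
                     sizeof *remote, MPI_INFO_NULL, MPI_COMM_WORLD,
                     &remote, &win);

    int target = (rank + 1) % nproc;  /* illustrative destination */
    MPI_Win_fence(0, win);            /* open the access epoch */
    for (int s = 0; s < NSLABS; s++) {
        compute_slab(local + s * SLAB_ELEMS, s);
        /* Push this slab immediately: many small puts, overlapped
         * with the computation of the remaining slabs. */
        MPI_Put(local + s * SLAB_ELEMS, SLAB_ELEMS, MPI_DOUBLE,
                target, (MPI_Aint)s * SLAB_ELEMS, SLAB_ELEMS, MPI_DOUBLE,
                win);
    }
    MPI_Win_fence(0, win);            /* single synchronization point */

    MPI_Win_free(&win);
    free(local);
    MPI_Finalize();
    return 0;
}

The shape of the loop reflects the abstract's claim: the overlapped version issues more and smaller messages than a single bulk exchange, but because each put carries no matching receive or tag-matching cost, the lower per-message software overhead of the one-sided model keeps the network busy throughout the compute phase.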