In HPC, low-latency communication between remote processes is crucial to application performance. InfiniBand networks can reduce this latency, but they require special and costly network interface cards that are loosely coupled with the CPU and therefore need pages to be constantly re-pinned. In this work, we virtualize the hardware Direct Memory Access (DMA) engine of the MPSoC in software, using an ARM Cortex-R5 to provide 1024 virtual DMA channels to processes and to prioritize small, latency-critical transfers, reducing their latency by a factor of 10. Furthermore, we leverage the System Memory Management Unit (SMMU) available in the Zynq UltraScale+ MPSoC to allow user-space access, achieving a 6x latency improvement over kernel-based solutions.
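The abstract describes the mechanism only at a high level, so the following is a minimal, self-contained C sketch of the core idea: many virtual DMA channels multiplexed onto one engine, with small, latency-critical transfers serviced ahead of bulk ones. All names and values here (vdma_desc_t, vdma_submit, the queue depth, the 256-byte threshold) are hypothetical illustrations, not the paper's actual firmware interface; in the real design the service loop would run on the Cortex-R5 and the addresses would be I/O virtual addresses translated by the SMMU.

```c
/* Minimal sketch (hypothetical API): virtual DMA channels with a
 * priority queue for small, latency-critical transfers. */
#include <stdint.h>
#include <stdio.h>

#define NUM_VCHANNELS        1024   /* virtual channels exposed to processes */
#define SMALL_XFER_THRESHOLD 256    /* bytes; below this, treat as latency-critical */
#define QUEUE_DEPTH          64

typedef struct {
    uint64_t src;     /* source address (I/O virtual address in the real design) */
    uint64_t dst;     /* destination address */
    uint32_t len;     /* transfer length in bytes */
    uint16_t vchan;   /* owning virtual channel, 0..NUM_VCHANNELS-1 */
} vdma_desc_t;

/* Two simple FIFO queues: one for small/critical, one for bulk transfers. */
typedef struct {
    vdma_desc_t slots[QUEUE_DEPTH];
    unsigned head, tail;
} vdma_queue_t;

static vdma_queue_t q_small, q_bulk;

static int queue_push(vdma_queue_t *q, const vdma_desc_t *d) {
    if (q->tail - q->head == QUEUE_DEPTH)
        return -1;                          /* queue full */
    q->slots[q->tail % QUEUE_DEPTH] = *d;
    q->tail++;
    return 0;
}

static int queue_pop(vdma_queue_t *q, vdma_desc_t *d) {
    if (q->head == q->tail)
        return -1;                          /* queue empty */
    *d = q->slots[q->head % QUEUE_DEPTH];
    q->head++;
    return 0;
}

/* A process submits on its virtual channel; the firmware sorts the
 * descriptor into the small or bulk queue based on its length. */
int vdma_submit(uint16_t vchan, uint64_t src, uint64_t dst, uint32_t len) {
    if (vchan >= NUM_VCHANNELS)
        return -1;
    vdma_desc_t d = { .src = src, .dst = dst, .len = len, .vchan = vchan };
    return queue_push(len < SMALL_XFER_THRESHOLD ? &q_small : &q_bulk, &d);
}

/* Firmware service loop body: always drain the small-transfer queue first
 * so latency-critical messages are not stuck behind bulk data. Here the
 * "hardware engine" is stubbed out with a printf. */
void vdma_service_once(void) {
    vdma_desc_t d;
    if (queue_pop(&q_small, &d) == 0 || queue_pop(&q_bulk, &d) == 0)
        printf("DMA: vchan %u, %u bytes, 0x%llx -> 0x%llx\n",
               (unsigned)d.vchan, (unsigned)d.len,
               (unsigned long long)d.src, (unsigned long long)d.dst);
}

int main(void) {
    vdma_submit(7, 0x1000, 0x2000, 4096);   /* bulk transfer */
    vdma_submit(3, 0x3000, 0x4000, 64);     /* small, latency-critical */
    vdma_service_once();                    /* services the 64-byte transfer first */
    vdma_service_once();
    return 0;
}
```

The two-queue split is only one way to realize the prioritization the abstract mentions; a per-channel credit or deadline scheme would fit the same channel abstraction.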
Over the last decade, a trend can be observed towards multi-processor Systems-on-Chip (MPSoC) platforms fo...
The All-to-all broadcast collective operation is essential for many parallel scientific ap...
This paper presents an efficient MPI implementation on a cluster of PCs using ...
Although InfiniBand Architecture is relatively new in the high performance computing area, it offers ...
Remote DMA (RDMA) engines are widely used in clusters/data-centres to improve the performance of dat...
Remote Direct Memory Access (RDMA) fabrics such as InfiniBand and Converged Ethernet report latencie...
Software DSM systems do not perform well because of the combined effects of increase in communicatio...
Despite the advances in high performance interdomain communications for virtual machines (VM), data ...
The evolution of multi- and many-core platforms is rapidly increasing the available on-chip computat...
Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely ...
Modern computing clusters consist of many heterogeneous computing units that work collectively in or...
Multi-Processor Systems on a Chip (MPSoCs) are suitable platforms for the implementation of complex ...
High-performance, byte-addressable non-volatile main memories (NVMMs) allow application developers t...