We consider the problem of minimizing communication with off-chip memory and composing multiple linear algebra kernels in iterative solvers for large-scale eigenvalue problems and linear systems of equations. While GPUs may offer higher throughput for individual kernels, overall application performance is limited by their inability to support on-chip sharing of data across kernels. In this paper, we show that higher on-chip memory capacity and superior on-chip communication bandwidth enable FPGAs to better support the composition of a sequence of kernels within these iterative solvers. We present a time-multiplexed FPGA architecture that exploits the on-chip capacity to store dependencies between kernels and high communication ...
The widespread adoption of massively parallel processors over the past decade has fundamentally tran...
We consider the problem of enabling fixed-point implementation of linear algebra kernels on...
Previous research has shown that the performance of any computation is directly related to the archi...
Trading communication with redundant computation can increase the silicon efficiency of common hardw...
Trading communication with redundant computation can increase the silicon efficiency of FPGAs and GP...
The dissemination of multi-core architectures and the later irruption of massively parallel devices,...
Graphics Processing Units (GPUs) have become more accessible peripheral devices with great computin...
Technology scaling trends have enabled the exponential growth of computing power. However, the perfo...
This paper presents an approach to explore a commercial multi-FPGA system as high performance accele...
In 2010, Bouillaguet et al. proposed an efficient solver for polynomial systems over F2 that trades me...
In 2010, Bouillaguet et al. proposed an efficient solver for polynomial systems over $\mathbb{F}_2$ ...
The application of accelerators in HPC applications has seen enormous growth in the last decade. In ...
FPGA-based accelerators demonstrated high energy efficiency compared to GPUs and CPUs. However, sing...
The large capacity of field programmable gate arrays (FPGAs) has prompted researchers to...
Dense linear algebra computations are essential to nearly every problem in scientific computing and ...