Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem, $Ax = b$, where $A$ is a large dense matrix, and a double precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16 -> FP64) iterative refinement, and we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show how using half-precision Tensor Cores (FP16-TC) for the arith...
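As a concrete illustration of the FP16 -> FP64 refinement idea sketched in this abstract, below is a minimal NumPy/SciPy sketch: the matrix is rounded to half precision to mimic a low-precision factorization (an FP32 LAPACK LU stands in for the FP16 tensor-core LU, since LAPACK has no FP16 path), and residual corrections are accumulated in FP64. The function name, tolerance, and iteration cap are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def ir_solve(A, b, tol=1e-12, max_iter=50):
    """Mixed-precision iterative refinement sketch: factor once in low
    precision, then iterate FP64 residual corrections until convergence."""
    A64, b64 = A.astype(np.float64), b.astype(np.float64)
    # Round the matrix to FP16 to mimic a half-precision factorization
    # (beware of FP16 overflow for badly scaled matrices), then factor in FP32.
    lu, piv = lu_factor(A64.astype(np.float16).astype(np.float32))
    x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b64 - A64 @ x                              # residual in full FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        d = lu_solve((lu, piv), r.astype(np.float32))  # low-precision correction
        x += d.astype(np.float64)
    return x
```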
By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dens...
This paper presents some work in progress on the development of fast and accur...
We present a computational framework for high-performance tensor contractions on GPUs. High-...
Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical...
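As a hedged illustration of the unit's stated precision contract (FP16 multiplicands with FP32 accumulation), here is a small NumPy emulation; it reproduces only the rounding behaviour, not the hardware or its throughput, and the function name is illustrative.

```python
import numpy as np

def tensor_core_like_mma(A, B, C):
    """Emulate D = A*B + C with inputs rounded to FP16 and
    products/addends accumulated in FP32 (rounding model only)."""
    A16 = A.astype(np.float16)
    B16 = B.astype(np.float16)
    return A16.astype(np.float32) @ B16.astype(np.float32) + C.astype(np.float32)

# Example: compare against an FP64 reference on random data.
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((64, 64)) for _ in range(3))
max_err = np.abs(tensor_core_like_mma(A, B, C) - (A @ B + C)).max()
```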
Modern GPUs equipped with mixed precision tensor core units present great pote...
In multiword arithmetic, a matrix is represented as the unevaluated sum of two or more lower-precisi...
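To make the "unevaluated sum of lower-precision matrices" concrete, the sketch below splits an FP64 matrix into two FP32 words and forms the product from three single-precision GEMMs accumulated in double precision. The function names are illustrative, and dropping the lo*lo term is one common simplification rather than a requirement of the approach.

```python
import numpy as np

def split_two_word(A):
    """Two-word split: A ~= A_hi + A_lo with both pieces stored in FP32;
    A_lo captures the FP32 rounding error of A_hi."""
    A_hi = A.astype(np.float32)
    A_lo = (A - A_hi.astype(np.float64)).astype(np.float32)
    return A_hi, A_lo

def two_word_matmul(A, B):
    """Approximate A @ B from three FP32 GEMMs on the word pairs,
    summing the partial products in FP64 (the lo*lo term is dropped)."""
    A_hi, A_lo = split_two_word(A)
    B_hi, B_lo = split_two_word(B)
    partial = [A_hi @ B_hi, A_hi @ B_lo, A_lo @ B_hi]   # FP32 GEMMs
    return sum(p.astype(np.float64) for p in partial)
```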
We present several algorithms to compute the solution of a linear system of equations on a graphics ...
Graphics Processing Units (GPUs) offer the possibility to execute floating-poi...
We explore the floating-point arithmetic implemented in the NVIDIA tensor cores, which are hardware ...
We present a new mixed precision algorithm to compute low-rank matrix and tensor approximations, a f...
GPUs are an important hardware development platform for problems where massive...
We explore the floating-point arithmetic used by the NVIDIA Volta tensor cores, which are hardware a...
There has been a surge in the demand for a Domain Specific Architecture due to wide ranging deep lea...
On modern architectures, the performance of 32-bit operations is often at leas...