We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast inter-core interconnects (ICIs), physically two-dimensional network topology, and high-bandwidth memory (HBM) permit distributed matrix multiplication algorithms to rapidly become compute-bound. In this regime, the matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size: operating in float32 precision, a full 2048-core pod of third-generation TPUs can multiply two matrices of linear size $N = 2^{20} = 1\,048\,576$ in about two minutes. Via curated algorithms emphasizing large, single-core matrix multiplications, ...
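To make the compute-bound regime concrete, here is a minimal sketch (not the paper's implementation) of a distributed matrix multiply in JAX, sharding both operands across the available TPU cores so that each core's contribution is a large dense local matmul executed on its MXU; the matrix size and sharding layout are illustrative assumptions, and the full-pod $N = 2^{20}$ case would follow the same pattern at scale.

```python
# Hedged sketch: distribute a float32 matmul over TPU cores with JAX sharding.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())          # e.g. the TPU cores of one host/pod slice
mesh = Mesh(devices, axis_names=("x",))    # 1D device mesh for simplicity

n = 8192                                   # toy linear size, not the paper's 2**20
a = jnp.ones((n, n), dtype=jnp.float32)
b = jnp.ones((n, n), dtype=jnp.float32)

# Shard A by rows and B by columns across the mesh axis.
a = jax.device_put(a, NamedSharding(mesh, P("x", None)))
b = jax.device_put(b, NamedSharding(mesh, P(None, "x")))

@jax.jit
def matmul(a, b):
    # XLA inserts the required inter-core collectives over ICI; each core
    # then performs a dense local matrix multiplication on its MXU.
    return a @ b

c = matmul(a, b)
print(c.shape, c.dtype)
```

Because communication over ICI scales more slowly with $N$ than the $O(N^3)$ local arithmetic, a layout like this becomes compute-bound as the matrices grow, which is the regime the abstract describes.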
Abstract. Over the last century, linear algebra theory and matrix computations became irreplaceable,...
Integrating polyalgorithm library with optimized linear algebra libraries on HPC platforms, leveragi...
We show that the border support rank of the tensor corresponding to two-by-two matrix m...
To respond to the intense computational load of deep neural networks, a plethora of domain-specific ...
Abstract: Few realize that, for large matrices, many dense matrix computations achieve nearly the sa...
To respond to the need for efficient training and inference of deep neural networks, a plethora of d...
Tensor algebra lives at the heart of big data applications. Where classical machine learnin...
Achieving high performance while reducing power consumption is the key question as technology scali...
Popular Machine Learning (ML) and High Performance Computing (HPC) workloads contribute to a signifi...
This thesis targets the design of parallelizable algorithms and communication-efficient parallel sch...
Dense linear algebra computations are essential to nearly every problem in scientific computing and ...
Matrix multiplication is a core building block for numerous scientific computing and, more recently,...
A current trend in high-performance computing is to decompose a large linear algebra problem into ba...
Computationally intensive applications such as pattern recognition and natural language processing, a...
Since data sizes of analytical applications are continuously growing, many data scientists are switc...