We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Connection Machine system CM-200. The routines, collectively called LBLAS, have interfaces consistent with languages with an array syntax such as Fortran 90. One novel feature, important for distributed memory architectures, is the capability of performing computations on multiple instances of objects in a single call. The number of instances and their allocation across memory units, and the strides for the different axes within the local memories, are derived from an array descriptor that contains type, shape, and data distribution information. Another novel feature of the LBLAS is a selection of loop order for rank-1 updates and matrix-matrix m...
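The multiple-instance idea in the abstract above can be illustrated with a minimal sketch. It is not the LBLAS interface: `ArrayDescriptor`, its fields, and `multi_gemm` are hypothetical names chosen for illustration, and the descriptor here carries only shape and stride information (no type or data distribution). The point is that a single call walks every instance using strides read from the descriptor, so the caller never loops over instances.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArrayDescriptor:
    """Hypothetical descriptor for a local array holding several matrices."""
    instances: int        # number of independent matrices in local memory
    rows: int
    cols: int
    instance_stride: int  # elements between the starts of consecutive instances
    row_stride: int       # elements between consecutive rows
    col_stride: int       # elements between consecutive columns

    def offset(self, k: int, i: int, j: int) -> int:
        """Flat index of element (i, j) of instance k."""
        return k * self.instance_stride + i * self.row_stride + j * self.col_stride


def multi_gemm(a: List[float], da: ArrayDescriptor,
               b: List[float], db: ArrayDescriptor,
               c: List[float], dc: ArrayDescriptor) -> None:
    """Compute C[k] += A[k] * B[k] for every instance k in one call."""
    assert da.instances == db.instances == dc.instances
    assert da.cols == db.rows and da.rows == dc.rows and db.cols == dc.cols
    for k in range(dc.instances):
        for i in range(dc.rows):
            for j in range(dc.cols):
                acc = 0.0
                for p in range(da.cols):
                    acc += a[da.offset(k, i, p)] * b[db.offset(k, p, j)]
                c[dc.offset(k, i, j)] += acc


# Two 2x2 instances stored contiguously in row-major order (assumed layout).
desc = ArrayDescriptor(instances=2, rows=2, cols=2,
                       instance_stride=4, row_stride=2, col_stride=1)
a = [1.0, 2.0, 3.0, 4.0,   1.0, 0.0, 0.0, 1.0]
b = [1.0, 0.0, 0.0, 1.0,   5.0, 6.0, 7.0, 8.0]
c = [0.0] * 8
multi_gemm(a, desc, b, desc, c, desc)
```

Because the layout lives entirely in the descriptor, the same routine handles a column-major or interleaved arrangement by changing the stride fields rather than the loop nest.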
This paper describes an implementation of Level 3 of the Basic Linear Algebra Subprogram (BLAS-3) li...
We describe the design of ScaLAPACK++, an object oriented C++ library for implementing linear algebr...
A current trend in high-performance computing is to decompose a large linear algebra problem into ba...
The Connection Machine Scientific Software Library (CMSSL) is a library of scientific routines designed for distributed memory architectures. The basic...
Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been impleme...
This paper discusses the design of linear algebra libraries for high performance computers. Particul...
Many optimizations (of programs with loops) used in parallelizing compilers and systolic array desig...
Block-cyclic order elimination algorithms for LU and QR factorization and solve routines are describ...
Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) t...
This paper presents a technique for finding good distributions of arrays and suitable loop restructu...
With the emergence of thread-level parallelism as the primary means for continued improvement of per...
Our experimental results showed that block based algorithms for numerically intensive applications a...
This paper proposes an API for Batched Basic Linear Algebra Subprograms (Batched BLAS). We focus on...