Abstract—LAPACK (Linear Algebra PACKage) is a statically cache-blocked library, where the blocking factor (NB) is determined by the service routine ILAENV. Users are encouraged to tune NB to maximize performance on their platform/BLAS (the BLAS are LAPACK's computational engine), but in practice very few users do so (both because it is hard, and because its importance is not widely understood). In this paper we (1) discuss our empirical tuning framework for discovering good NB settings, (2) quantify the performance boost that tuning NB can achieve on several LAPACK routines across multiple architectures and BLAS implementations, (3) compare the best performance of LAPACK's statically blocked routines against state-of-the-art recursively bl...
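To make the idea of empirically tuning NB concrete, here is a minimal sketch, assuming NumPy is available. It does not call LAPACK or ILAENV: `blocked_cholesky` is a hypothetical right-looking blocked Cholesky in which `nb` plays the role of LAPACK's blocking factor, with `np.linalg.cholesky` and `np.linalg.solve` standing in for the unblocked diagonal-block factorization and the triangular panel solve. `tune_nb` (also hypothetical) simply times each candidate block size and keeps the fastest.

```python
import time

import numpy as np


def blocked_cholesky(a, nb):
    """Return lower-triangular L with A = L @ L.T, processing nb columns per panel.

    Hypothetical sketch: nb mimics LAPACK's blocking factor NB.
    """
    a = np.array(a, dtype=float, copy=True)
    n = a.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Factor the nb-by-nb diagonal block (unblocked step).
        a[k:e, k:e] = np.linalg.cholesky(a[k:e, k:e])
        if e < n:
            # Panel solve: L21 = A21 * L11^{-T}, computed as (L11^{-1} A21^T)^T.
            l11 = a[k:e, k:e]
            a[e:, k:e] = np.linalg.solve(l11, a[e:, k:e].T).T
            # Trailing update: A22 -= L21 @ L21^T (a SYRK-like, BLAS-3 operation).
            a[e:, e:] -= a[e:, k:e] @ a[e:, k:e].T
    # Entries above the diagonal were never factored, so discard them.
    return np.tril(a)


def tune_nb(n, candidates, reps=3, seed=0):
    """Time blocked_cholesky for each candidate nb; return (best_nb, timings)."""
    rng = np.random.default_rng(seed)
    m = rng.standard_normal((n, n))
    spd = m @ m.T + n * np.eye(n)  # symmetric positive definite test matrix
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            blocked_cholesky(spd, nb)
            best = min(best, time.perf_counter() - t0)
        timings[nb] = best  # best-of-reps wall time, in seconds
    best_nb = min(timings, key=timings.get)
    return best_nb, timings
```

A call such as `tune_nb(512, [16, 32, 64, 128])` returns whichever candidate ran fastest on the current machine and BLAS; the winner is platform-dependent, which is exactly why a single static default for NB can leave performance on the table.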
This dissertation details contributions made by the author to the field of computer science while wo...
In this work the behavior of the multithreaded implementation of some LAPACK routines on PLA...
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distribu...
For good performance of every computer program, good cache and TLB utilization is crucial. In numeri...
This report builds on the work done in deliverable [Nava94]. There it was shown tha...
Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchie...
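As a small illustration of blocking (loop tiling), the sketch below, assuming NumPy, computes C = A·B tile by tile so that each step touches only three `tile`-sized submatrices, which is the working set one would size to fit in cache. The function name `tiled_matmul` and the `tile` parameter are hypothetical; in Python the point is the loop structure, not measured cache behavior.

```python
import numpy as np


def tiled_matmul(a, b, tile):
    """Compute a @ b by iterating over tile-by-tile blocks (loop tiling)."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    n, k = a.shape
    k2, m = b.shape
    if k != k2:
        raise ValueError("inner dimensions do not match")
    c = np.zeros((n, m), dtype=np.result_type(a, b))
    for i0 in range(0, n, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                # Each update reads one block of A and one block of B and
                # accumulates into one block of C; slices clamp at the edges.
                c[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, p0:p0 + tile] @ b[p0:p0 + tile, j0:j0 + tile]
                )
    return c
```

The result is independent of `tile`; only the order in which memory is touched changes, which is what lets a well-chosen block size improve reuse in the memory hierarchy.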
This paper presents an overview of the LAPACK library, a portable, public-domain library to solve th...
This paper discusses optimizing computational linear algebra algorithms on a ring cluster of IBM R...
This dissertation incorporates two research projects: performance modeling and prediction for dense ...
The goal of the LAPACK project is to provide efficient and portable software for dense numerical lin...
Application performance is dominated by a few computational kernels. Performance tuning today: Vendor-tun...
The promise of future many-core processors, with hundreds of threads running concurrently, has led t...
Abstract—It is well known that the behavior of dense linear algebra algorithms is greatly influenced...
The final publication is available at Springer via http://dx.doi.org/10.1007/s10766-013-0249-6. The in...
This dissertation introduces measurement-based performance modeling and prediction techniques for de...