In this paper, we studied the NVIDIA GPU architecture characteristics relevant to the SGEMM routine and the potential peak performance of SGEMM on the Fermi GPU. Guided by this analysis, our SGEMM routine achieved about 11% (NN), 4.5% (TN), 3% (NT), and 9% (TT) better performance than CUBLAS in the CUDA 4.1 package for large matrices on a GTX 580 Fermi card. We also described how to use native assembly language directly in CUDA runtime source code.
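As a hedged illustration of that last point, the sketch below shows one common way to run a kernel whose SASS was assembled outside nvcc: package it as a .cubin and load it from host code through the CUDA driver API. The file name sgemm_fermi.cubin, the kernel name sgemm_nn_kernel, its parameter list, and the 64x64 tile per 256-thread block are placeholder assumptions for illustration, not the routine described in this paper.

    /* Minimal sketch (not the paper's actual code): launching a hand-assembled
       SGEMM kernel by loading its .cubin with the CUDA driver API. */
    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHECK(call)                                               \
        do {                                                          \
            CUresult err_ = (call);                                   \
            if (err_ != CUDA_SUCCESS) {                               \
                fprintf(stderr, "CUDA driver error %d at %s:%d\n",    \
                        (int)err_, __FILE__, __LINE__);               \
                exit(EXIT_FAILURE);                                   \
            }                                                         \
        } while (0)

    int main(void)
    {
        const int N = 1024;                    /* square matrices for simplicity */
        size_t bytes = (size_t)N * N * sizeof(float);

        CUdevice   dev;
        CUcontext  ctx;
        CUmodule   mod;
        CUfunction sgemm;

        CHECK(cuInit(0));
        CHECK(cuDeviceGet(&dev, 0));
        CHECK(cuCtxCreate(&ctx, 0, dev));

        /* Hypothetical names for a cubin built from hand-written SASS
           and the kernel it contains. */
        CHECK(cuModuleLoad(&mod, "sgemm_fermi.cubin"));
        CHECK(cuModuleGetFunction(&sgemm, mod, "sgemm_nn_kernel"));

        CUdeviceptr dA, dB, dC;
        CHECK(cuMemAlloc(&dA, bytes));
        CHECK(cuMemAlloc(&dB, bytes));
        CHECK(cuMemAlloc(&dC, bytes));
        /* ... copy A and B to the device with cuMemcpyHtoD, initialize C ... */

        int   m = N, n = N, k = N;
        float alpha = 1.0f, beta = 0.0f;
        void *args[] = { &dA, &dB, &dC, &m, &n, &k, &alpha, &beta };

        /* Assumed blocking: each 256-thread block computes a 64x64 tile of C. */
        CHECK(cuLaunchKernel(sgemm,
                             N / 64, N / 64, 1,   /* grid dimensions  */
                             256, 1, 1,           /* block dimensions */
                             0, NULL, args, NULL));
        CHECK(cuCtxSynchronize());

        CHECK(cuMemFree(dA));
        CHECK(cuMemFree(dB));
        CHECK(cuMemFree(dC));
        CHECK(cuModuleUnload(mod));
        CHECK(cuCtxDestroy(ctx));
        return 0;
    }

The driver API is used here only because it is the standard mechanism for loading externally produced SASS; embedding the assembly directly in runtime-API source code, as summarized above, would follow the procedure described in the body of the paper.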