In this paper, we describe our work on providing a generic yet optimized GPU (CUDA/OpenCL) implementation of the 2D MapOverlap skeleton. We explain the implementation using a 2D convolution application built on the newly developed skeleton. Memory optimizations (constant and shared memory) and adaptive tiling are applied, and their performance implications are evaluated on different classes of GPUs. We present two metrics for calculating the optimal tiling factor dynamically and automatically, which helps retain the best performance without manual tuning when moving to new GPU architectures. With our approach, we achieve average speedups by a factor of 3.6, 2.3, and 2.4 over an otherwise optimized (wit...
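As a rough illustration of what a 2D MapOverlap computes, the following is a minimal CPU-side sketch in Python, not the paper's CUDA/OpenCL implementation. The names `map_overlap_2d`, `map_overlap_2d_tiled`, and `make_weighted_sum` are hypothetical, the zero-valued border is an assumed policy, and the tile-plus-halo buffer only stands in for the idea of staging data in GPU shared memory; it does not reproduce the adaptive tiling or the two metrics mentioned in the abstract.

```python
from typing import Callable, List

Matrix = List[List[float]]


def map_overlap_2d(src: Matrix, radius: int,
                   user_fn: Callable[[Matrix], float]) -> Matrix:
    """Apply user_fn to the (2*radius+1)^2 neighbourhood of every element.

    Neighbours outside the matrix are read as 0.0 (assumed border policy).
    """
    rows, cols = len(src), len(src[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                [src[r + dr][c + dc]
                 if 0 <= r + dr < rows and 0 <= c + dc < cols else 0.0
                 for dc in range(-radius, radius + 1)]
                for dr in range(-radius, radius + 1)
            ]
            out[r][c] = user_fn(window)
    return out


def map_overlap_2d_tiled(src: Matrix, radius: int,
                         user_fn: Callable[[Matrix], float],
                         tile: int) -> Matrix:
    """Same result as map_overlap_2d, computed one output tile at a time.

    Each tile first copies its elements plus a halo of width `radius` into a
    local buffer, mimicking how a GPU thread block might stage shared memory.
    """
    rows, cols = len(src), len(src[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            r1, c1 = min(r0 + tile, rows), min(c0 + tile, cols)
            # Local buffer: tile plus halo, zero-filled outside the matrix.
            local = [
                [src[r][c] if 0 <= r < rows and 0 <= c < cols else 0.0
                 for c in range(c0 - radius, c1 + radius)]
                for r in range(r0 - radius, r1 + radius)
            ]
            for r in range(r1 - r0):
                for c in range(c1 - c0):
                    window = [row[c:c + 2 * radius + 1]
                              for row in local[r:r + 2 * radius + 1]]
                    out[r0 + r][c0 + c] = user_fn(window)
    return out


def make_weighted_sum(kernel: Matrix) -> Callable[[Matrix], float]:
    """Build a user function that weights a window by `kernel`.

    The kernel is applied without flipping, which matches a convolution
    only for symmetric kernels.
    """
    def weighted_sum(window: Matrix) -> float:
        return sum(w * k
                   for wrow, krow in zip(window, kernel)
                   for w, k in zip(wrow, krow))
    return weighted_sum
```

Calling `map_overlap_2d(image, 1, make_weighted_sum(kernel))` with a 3x3 `kernel` applies the filter to every element of `image`; the tiled variant takes an extra `tile` size and is meant to return the same matrix.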
We present an efficient model to analyze and improve the performance of general-purpose computation ...
Thanks to High-Level Synthesis (HLS) tools, FPGAs have become an alternative t...
Today's computer systems often contain several different processing units aside from the CPU. Among...
The research domain of Multimedia Content Analysis (MMCA) considers all aspects of the automated ext...
Computing many small 2D convolutions using FFTs is a basis for a large number of applications in man...
With the increasing sophistication of image processing algorithms, and because of its low computatio...
Application programming for GPUs (Graphics Processing Units) is complex and error-prone, b...
Attaining the best possible throughput when computing convolutions is a challe...
In this work, we evaluate OpenCL as a programming tool for developing performance-portable applicati...
For some classes of problems, NVIDIA CUDA abstraction and hardware properties combine with ...
We present an implementation of the overlap-and-save method, a method for the convolution of very lo...
with the sequential implementation of the algorithm and demonstrate the increase in speeds through H...
This paper studies the performance of separable 2D convolution on multi-lane Polymorphic Register Fi...
©2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for a...