We present a new algorithm to automatically generate high-performance GPU implementations of complex imaging and machine learning pipelines, directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or hand-optimized kernels, and it targets a diverse range of computations, significantly broader than the range handled by existing autoschedulers. We address the scalability challenge of extending previous approaches to schedule large real-world programs, while enabling a broad set of program rewrites that take into account the nested parallelism and memory hierarchy introduced by GPU architectures. We achieve this using a hierarchical sampling strategy that groups programs into buckets based on their structural ...
There has been tremendous growth in the use of Graphics Processing Units (GPUs) for the acceleratio...
The focus of this work is the automatic performance tuning of stencil computations on Graphics Proce...
To help shrink the programmability-performance efficiency gap, we discuss that adaptive runtime syst...
The Halide DSL and compiler have enabled high-performance code generation for image processing pipel...
We present a new algorithm to automatically schedule Halide programs for high-performance image proc...
Efficient code generation for image processing applications continues to pose a challenge i...
Even though computer graphics applications are widely used, they remain challenging to implement and...
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Com...
Many image processing tasks are naturally expressed as a pipeline of small computational kernels kno...
The ever increasing complexity of scientific applications has led to utilization of new HPC paradigm...
A plethora of applications are using machine learning, the operations of which are becoming more com...
Recent advances in GPUs (graphics processing units) lead to massively parallel hardware that is eas...
The use of graphical processing units (GPUs) for general purpose calculations has gained a lot of at...
We propose a generalized method for adapting and optimizing algorithms for efficient execution on mo...
We propose and evaluate a novel strategy for tuning the performance of a class of stencil computatio...