AbstractProgramming manycore GPUs or multicore CPUs for high performance requires a careful balance of several hardware specific related factors, which is typically achieved by expert users through trial and error. To reduce the amount of hand-made optimization time required to achieve optimal performance, general guidelines can be followed or different metrics can be considered to predict performance, but ultimately a trial and error process is still prevalent. In this paper, we present an optimization method to run the 3D-Fast Wavelet Transform (3D-FWT) on hybrid systems. The optimization engine detects the different platforms found on a system, executing the appropriate kernel, implemented in both CUDA or OpenCL for GPUs, and programmed ...