The shift toward parallel processor architectures has made programming and code generation increasingly challenging. To address this programmability challenge, this article presents a technique to fully automatically generate efficient and readable code for parallel processors, with a focus on GPUs. This is made possible by combining algorithmic skeletons, traditional compilation, and algorithmic species, a classification of loop nests based on their array access patterns. Compilation starts by automatically annotating C code with this class information (the algorithmic species). The annotated code is then fed into the skeleton-based source-to-source compiler bones to generate CUDA code. To generate efficient code, bones also performs optimizations including host-accelerator transfer o...