We present four CUDA based parallel implementations of the Space-Saving algorithm for determining frequent items on a GPU. The first variant exploits the open-source CUB library to simplify the implementation of a user's defined reduction, whilst the second is based on our own implementation of the parallel reduction. The third and the fourth, built on the previous variants, are meant to improve the performance by taking advantage of hardware based atomic instructions. In particular, we implement a warp based ballot mechanism to accelerate the Space-Saving updates. We show that our implementation of the parallel reduction, coupled with the ballot based update mechanism, is the fastest, and provides extensive experimental results regarding i...