For the past two years, Hopsworks, an open-source machine learning platform, has used Apache Spark to distribute hyperparameter optimization tasks for machine learning. Hopsworks provides basic optimizers (grid search, random search, differential evolution) that propose combinations of hyperparameters (trials), which are run synchronously in parallel. However, many such trials perform poorly, wasting hardware accelerator cycles on trials that could be stopped early, freeing resources for other trials. This thesis presents the work on Maggy, an open-source asynchronous and fault-tolerant hyperparameter optimization framework built on Spark. Maggy transparently schedules and manages hyperparameter trials, enabling state-...
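To make the early-stopping idea concrete, the sketch below shows one common policy, the median stopping rule: a running trial is halted if its metric at a given step is worse than the median of other trials' metrics at that same step. This is only an illustration of the general technique, not Maggy's actual stopping policy; the function name and its interface are hypothetical.

```python
import statistics

def should_stop_early(trial_history, peer_histories, step):
    """Hypothetical median stopping rule (illustration only, not Maggy's
    actual policy). Histories are lists of per-step losses (lower is
    better). Stop the trial if its loss at `step` is worse than the
    median loss of peer trials at the same step."""
    # Only compare against peers that have reached this step.
    peer_losses = [h[step] for h in peer_histories if len(h) > step]
    if len(peer_losses) < 3:
        # Too few peers to judge reliably; keep the trial running.
        return False
    return trial_history[step] > statistics.median(peer_losses)
```

For example, a trial whose loss at step 1 is 0.6 would be stopped when its peers' losses at step 1 are 0.5, 0.4, and 0.3 (median 0.4), while a trial at 0.35 would be allowed to continue.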