A main problem with reproducing machine learning publications is the variance of metric implementations across papers. A lack of standardization leads to different behavior in mech- anisms such as checkpointing, learning rate schedulers or early stopping, that will influence the reported results. For example, a complex metric such as Fréchet inception distance (FID) for synthetic image quality evaluation will differ based on the specific interpolation method used. There have been a few attempts at tackling the reproducibility issues. Papers With Code links research code with its corresponding paper. Similarly, arXiv recently added a code and data section that links both official and community code to papers. However, these methods rely on t...