How close are we to a world where every paper on programming languages is accompanied by an electronic appendix with machinechecked proofs? We propose an initial set of benchmarks for measuring progress in this area. Based on the metatheory of System F, a typed lambda-calculus with second-order polymorphism, subtyping, and records, these benchmarks embody many aspects of programming languages that are challenging to formalize: variable binding at both the term and type levels, syntactic forms with variable numbers of components (including binders), and proofs demanding complex induction principles. We hope that these benchmarks will help clarify the current state of the art, provide a basis for comparing competing technologies, and motivate...