Why is one novel still read, while another is forgotten? Literary scholars answer that the first is part of the canon, while the other is not. But what determines “canonicity”? Canonicity is a contentious topic. Literary scholars make lists of the most important authors of all time², but are such lists completely subjective? Do they systematically exclude important books by authors who do not fit a preconceived pattern, such as being white and male? Could there be objective textual features that partly explain the value judgments leading to this demarcation? In this project I created a dataset for exploring such questions (as well as code and a tool). In this post I describe the corpus composition, metadata, and textual features that are part of the ...