This dataset accompanies a paper submitted to the WebSci 20 conference. In this paper, we present a lexicon of 'extreme speech' that may be used to detect hate speech and extreme speech on online platforms. We outline a cross-disciplinary research protocol through which this lexicon is initially extracted from a corpus of 3,335,265 posts from 4chan's /pol/ sub-forum. The extraction uses a hybrid method: word2vec modelling, followed by snowballing of the nearest neighbours of a small, expert-curated seed list of extreme language. The choice of corpus is significant: 4chan is a space of rapid language innovation and obscure extreme vernacular, which complicates generalised approaches. Our lexicon detects significantly more extreme posts within a corpus ...
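
The snowballing step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy two-dimensional vectors stand in for a trained word2vec model, the tokens are neutral placeholders, and the names `snowball`, `nearest_neighbours`, `topn`, `min_similarity` and `max_rounds` are hypothetical, chosen here for illustration only. Any expert review of candidate terms is not modelled.

```python
import math

# Toy embedding table standing in for trained word2vec vectors (assumption:
# the real vectors would come from a model trained on the /pol/ corpus).
TOY_VECTORS = {
    "seed_a": (1.0, 0.0),
    "variant_a1": (0.95, 0.10),
    "variant_a2": (0.80, 0.40),
    "neutral_x": (0.0, 1.0),
    "neutral_y": (0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbours(word, vectors, topn):
    """Return the topn (term, similarity) pairs closest to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


def snowball(seeds, vectors, topn=2, min_similarity=0.92, max_rounds=3):
    """Grow a lexicon from seed terms by repeatedly adding near neighbours."""
    lexicon = {s for s in seeds if s in vectors}
    frontier = set(lexicon)
    for _ in range(max_rounds):
        new_terms = set()
        for term in frontier:
            for candidate, sim in nearest_neighbours(term, vectors, topn):
                if sim >= min_similarity and candidate not in lexicon:
                    new_terms.add(candidate)
        if not new_terms:
            break  # nothing new was found, so the snowball has converged
        lexicon |= new_terms
        frontier = new_terms  # only expand newly added terms next round
    return lexicon


if __name__ == "__main__":
    print(sorted(snowball(["seed_a"], TOY_VECTORS)))
```

In this toy setup, `variant_a2` is not close enough to the seed to be added directly; it enters the lexicon only in the second round, through `variant_a1`. That is the property that distinguishes snowballing from a single nearest-neighbour lookup. The similarity threshold and round limit are illustrative guards against drift into unrelated vocabulary, not values taken from the paper.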