Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possib...
In the rapidly evolving domain of artificial intelligence, Large Language Models (LLMs) like GPT-3 a...
The automation of code review activities, a long-standing pursuit in software engineering, has been ...
This is the artifact for the ISSTA'2023 paper "Large Language Models Are Zero-Shot Fuzzers: Fuzzing ...
Large language models (LMs) of code have recently shown tremendous promise in completing code and sy...
Large language models (LLMs) have demonstrated significant potential in the realm of natural languag...
Large language models (LLMs) have become increasingly prominent in academia and industry due to thei...
We release Code Llama, a family of large language models for code based on Llama 2 providing state-o...
The use of language models in Web applications and other areas of computing and business have grown ...
Machine-learning models can reach very high performance with supervised training, where they learn f...
In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension a...
The use of language models in Web applications and other areas of computing and business have grown ...
Large Language Models (LLM) are a new class of computation engines, "programmed" via prompt engineer...
Recent breakthroughs in Large Language Models (LLMs), such as GPT-3 and Codex, now enable software d...
Pre-trained models of source code have gained widespread popularity in many code intelligence tasks....
Thesis (Ph.D.)--University of Washington, 2019. Models that automatically map natural language (NL) to...