Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possib...
In the rapidly evolving domain of artificial intelligence, Large Language Models (LLMs) like GPT-3 a...
The automation of code review activities, a long-standing pursuit in software engineering, has been ...
This is the artifact for the ISSTA'2023 paper "Large Language Models Are Zero-Shot Fuzzers: Fuzzing ...
Large language models (LMs) of code have recently shown tremendous promise in completing code and sy...
Large language models (LLMs) have demonstrated significant potential in the realm of natural languag...
Large language models (LLMs) have become increasingly prominent in academia and industry due to thei...
We release Code Llama, a family of large language models for code based on Llama 2 providing state-o...
The use of language models in Web applications and other areas of computing and business have grown ...
Machine-learning models can reach very high performance with supervised training, where they learn f...
In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension a...
The use of language models in Web applications and other areas of computing and business have grown ...
Large Language Models (LLM) are a new class of computation engines, "programmed" via prompt engineer...
Recent breakthroughs in Large Language Models (LLMs), such as GPT-3 and Codex, now enable software d...
Pre-trained models of source code have gained widespread popularity in many code intelligence tasks....
Thesis (Ph.D.)--University of Washington, 2019. Models that automatically map natural language (NL) to...