As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (...
Contains fulltext: 102314.pdf (publisher's version) (Open Access). International C...
Since the introduction of large language models in Natural Language Processing...
We have built a corpus containing texts in 106 languages from texts available on the Internet and on...
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstr...
The BigScience Workshop was a value-driven initiative that spanned one and a half years of interdiscip...
8 pages plus appendix and references. In recent years, large-scale data collection efforts have priori...
With the rapid development of artificial intelligence in the current era of big data, the constructi...
Large language models (LLMs)—machine learning algorithms that can recognize, summarize, translate,...
Over the past decade, rapid technological evolution has revolutionised the study of language; we hav...
The NLP community recently saw the release of a new large open-access multilin...
The use of language models in Web applications and other areas of computing and business has grown ...
Corpus data have emerged as the raw data/benchmark for several NLP applications. Corpus is described...