Oral corpora for linguistic inquiry are frequently built from the content of news, radio, and/or TV shows, and sometimes from laboratory recordings. Most existing corpora are restricted to languages for which a large amount of data is available. Furthermore, such corpora are not always accessible under a free open-access license. We propose a crowd-sourced alternative to fill this gap. Lingua Libre is the participatory linguistic media library hosted by Wikimedia France. It includes recordings in more than 140 languages. These recordings have been provided by more than 750 speakers worldwide, who voluntarily recorded word entries in their native language and made them available under a Creative Commons license. In the...
We collect and release CrowdSpeech — the first publicly available large-scale dataset of crowdsource...
In crude quantitative terms, Zipf’s law tells us that documentation of something as simple as word u...
We describe here a speaking atlas that takes the form of a website presenting ...
Vocal languages across the world are estimated to be approximately 6000, yet only a handful of them ...
Less-resourced languages are usually left out of phonetic studies based on lar...
Low resource languages possess a limited number of digitized texts, making it challenging to generate...
Data-driven research in phonetics and phonology relies massively on oral resou...
Most speech and language technologies are trained with massive amounts of spee...
Text corpora represent the foundation on which most natural language processin...
This paper describes the development of a multilingual and multigenre manually annotated speech data...
French is a language spoken by hundreds of millions of speakers in Europe, Africa, and Amer...
Minority languages are underrepresented in linguistic research, and a possible reason for this is th...
Crowdsourcing can be defined as the purchase of data (labels, speech recordings, etc.), usually on l...