A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel c...
BasCrawl is a 186-million-token web corpus of Basque obtained by crawling over 12,000 domains. We inc...
In this paper, I present the COW14 tool chain, which comprises a web corpus creation tool called tex...
Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks....
Efforts to use web data as corpora seek to provide solutions to problems traditional corpora suffer ...
This corpus was originally created for performance testing (server infrastructure CorpusExplorer - s...
This SOAP service implements the IMS Open Corpus Workbench (CWB), a collection of open-source tools ...
In recent years, Transformer-based models have led to significant advances in language modellin...
We present DEPCC, the largest-to-date linguistically analyzed corpus in English including 365 millio...
In this paper we present a preliminary analysis over the largest publicly accessible web dataset: Th...