We have built a corpus of texts in 106 languages from material available on the Internet and on Wikipedia. The W2C Web Corpus contains 54.7 GB of text and the W2C Wiki Corpus contains 8.5 GB of text. In the W2C Web Corpus, more than 100 MB of text is available for 75 languages. At least 10 MB of text is available for 100 languages. These corpora are a unique data source for linguists, since they outclass all published works both in the size of the material collected and in the number of languages covered. This language data resource can be of particular use to researchers specializing in the development of multilingual technologies. We also developed software that greatly simplifies the creation of a new text corpus for a given language, usin...
We investigate the potential of using the web as a huge corpus for language studies. We test the hyp...
The paper compares systematically the utility of specially-made text corpora and the textual resourc...
Encyclopedias, which describe general/technical terms, are valuable language resources (LRs). As wit...
This thesis introduces the W2C Corpus which contains 97 languages with more than 10 million words fo...
From the beginning of the twentieth century on, the use of the World Wide Web has become a current t...
This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Ita...
A set of corpora for 120 languages automatically collected from Wikipedia and the web. Collected ...
The web is a potentially useful corpus for language study because it provides examples of language t...
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been...
The Web, teeming as it is with language data, of all manner of varieties and languages, in vast ...
Paper presented at: EACL '06: Eleventh Conference of the European Chapter of the Association f...