This corpus was originally created for performance testing of the CorpusExplorer server infrastructure (see: diskurslinguistik.net / diskursmonitor.de). It contains the German-language subset of CommonCrawl (as of March 2018). First, the URLs were filtered by top-level domain (de, at, ch). The texts were then classified with NTextCat, and only texts unambiguously identified as German were included in the corpus. Finally, the texts were annotated with TreeTagger (token, lemma, part of speech). Size: 2.58 million documents, 232.87 million sentences, 3.021 billion tokens. You can use CorpusExplorer (http://hdl.handle.net/11234/1-2634) to convert this data into various other corpus formats (XML, JSON, WebLicht, TXM and many more).
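The first filtering step (keeping only URLs under the de, at and ch top-level domains) can be sketched roughly as follows. This is a minimal illustration, not the code used to build the corpus: the function name `has_allowed_tld` and the sample URLs are hypothetical, and the later NTextCat and TreeTagger steps are not shown.

```python
from urllib.parse import urlsplit

# Top-level domains named in the corpus description.
ALLOWED_TLDS = {"de", "at", "ch"}


def has_allowed_tld(url, allowed=ALLOWED_TLDS):
    """Return True if the URL's host ends in one of the allowed TLDs.

    Hypothetical helper for illustration only.
    """
    host = urlsplit(url).hostname  # lowercased host, or None if absent
    if not host:
        return False
    # Take the label after the last dot, ignoring a trailing root dot.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in allowed


# Made-up sample URLs; a real run would read them from a CommonCrawl index.
urls = [
    "https://www.example.de/page",
    "http://example.at/",
    "http://example.ch:8080/x",
    "https://example.com/",
    "not a url",
]
kept = [u for u in urls if has_allowed_tld(u)]
```

A TLD filter of this kind is only a coarse pre-selection: pages under these domains can still be in other languages, which is why the description adds a separate language classification step.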
We present a web service-based environment for the use of linguistic resources and tools to address ...
Second release of the lexica corpus: a corpus for German text simplification, total size now 3270 fi...
First release of the lexica corpus: a corpus for German text simplification. The corpus consists of...
This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than ...
Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 45 ...
Corpus data have emerged as the raw data/benchmark for several NLP applications. Corpus is described...
Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks....
Efforts to use web data as corpora seek to provide solutions to problems traditional corpora suffer ...
We present DEPCC, the largest-to-date linguistically analyzed corpus in English including 365 millio...
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been...
This article presents the corpus REDEWIEDERGABE, a German-language historical corpus with detailed a...
Nowadays a corpus is typically a large collection of text excerpts, representing a range of register...
TeCoPhy is a Text Corpus of German Physics Texts. Most of the texts are taken from textbooks at sch...
Within a strictly corpus-driven paradigm, an in-depth profiling of many linguistic phenomena require...