A corpus of 471,085,690 English sentences extracted from the ClueWeb12 Web Crawl. The sentences were sampled from a larger corpus to achieve a level of sentence complexity similar to that of sentences humans make up as memory aids for remembering passwords. Sentence complexity was measured as syllables per word. The corpus is split into a training set and a test set as used in the associated publication: the test set is extracted from part 00 of ClueWeb12, while the training set is extracted from the remaining parts. More information on the corpus can be found on the corpus web page at our university (listed under "documented by").
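The syllables-per-word complexity filter described above could be sketched as follows. This is a minimal illustration, not the exact method used to build the corpus: the vowel-group syllable heuristic and the threshold band are assumptions chosen for demonstration.

```python
import re

def estimate_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels (treating y as a
    # vowel), with a minimum of one syllable per word. Real syllabification
    # would use a pronouncing dictionary; this is an illustrative stand-in.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def syllables_per_word(sentence: str) -> float:
    # Average syllables per word: the complexity measure named in the
    # corpus description.
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return 0.0
    return sum(estimate_syllables(w) for w in words) / len(words)

def within_mnemonic_complexity(sentence: str,
                               low: float = 1.0,
                               high: float = 1.6) -> bool:
    # Keep sentences whose complexity falls in a target band. The band
    # [1.0, 1.6] is a hypothetical example, not the published cutoff.
    return low <= syllables_per_word(sentence) <= high
```

A sampler over the crawl would then retain only sentences for which `within_mnemonic_complexity` holds, yielding a subset whose complexity distribution matches that of human-chosen mnemonic sentences.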
The corpus contains over 300 million words, with annotations of words and sentences describing their...
Corpus data have emerged as the raw data/benchmark for several NLP applications. Corpus is described...
The Webis-Mnemonics-17 corpus is a collection of 1048 human-chosen sentences for password generation...
In NLP Centre, dividing text into sentences is currently done with a tool which uses rule-based sy...
Efforts to use web data as corpora seek to provide solutions to problems traditional corpora suffer ...
To create the corpus, first we download from Reuters website 27,000 random news articles (HTML webp...
We present DEPCC, the largest-to-date linguistically analyzed corpus in English including 365 millio...
This work describes the process of creation of a 70 billion word text corpus of English. We used an ...
To create the corpus, first we download from Reuters website 27,000 random news articles (HTML webp...
The article investigates the inherent text complexity of a small corpus made up of twenty-five writt...
Abstract. The 60-year-old dream of computational linguistics is to make computers capable of communi...
The lack of large and reliable datasets has been hindering progress in Text Simplification (TS). We ...
Abstract In this paper we introduce ukWaC, a large corpus of English constructed by crawling t...