Three datasets derived from the Italian (ItWiki-100), French (FrWiki-100), and English (EnWiki-100) Wikipedia dumps, in which each article is tagged with its related portals (the 100 most common portals per language). If you use this data, please cite the following works:
Gasparetto A, Marcuzzo M, Zangari A, Albarelli A. (2022) A Survey on Text Classification Algorithms: From Text to Predictions. Information 13(2): 83. https://doi.org/10.3390/info13020083
Gasparetto A, Zangari A, Marcuzzo M, Albarelli A. (2022) A survey on text classification: Practical perspectives on the Italian language. PLOS ONE 17(7): e0270904. https://doi.org/10.1371/journal.pone.027090
Wikipedia's online encyclopedia contains articles on various topics, created and edited independentl...
WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retriev...
This is a dataset of 40,664,485 citations extracted from the English Wikipedia February 2023 dump (https...
Text Classification methods have been improving at an unparalleled speed in the last decade thanks t...
Abstractive text summarization has recently improved its performance due to the use of sequence to s...
This dataset has manual annotations with respect to Wikipedia over the same text written in fiv...
The exponential growth of text documents available on the Internet has created an urgent need for ac...
For each existing Wikipedia language edition, the dataset contains a classification of the articles ...
This article presents a comparative study of supervised classification approac...
A subset of articles extracted from the French Wikipedia XML dump. Data published here include 5 dif...
Wikipedia is the largest existing knowledge repository that is growing on a genuine crowdsourcing su...
WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235000 paragraphs of 23...
This dataset contains 871 articles from Wikipedia (retrieved on 8th August 2016), selected from the ...
cessing during my bachelor thesis, with the development of a computational grammar for Italian. Duri...