We propose a language-independent graph-based method to build a-la-carte article collections on user-defined domains from Wikipedia. The core model is based on the exploration of the encyclopedia's category graph and can produce both mono- and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph model reaches an average precision of 84% on in-domain articles, outperforming an alternative model based on information retrieval techniques. As manual evaluations are costly, we introduce the concept of domainness and design several automatic metrics to account for the quality of the collections. Our...
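To make the idea of exploring a category graph concrete, here is a minimal sketch of a breadth-first walk that gathers the articles found under a root category. It is not the model described in the abstract: the toy graph, the `collect_articles` function, and the fixed `max_depth` cutoff are all assumptions made for illustration, and the abstract does not say how the actual method decides where to stop exploring.

```python
from collections import deque

# Hypothetical toy data standing in for a Wikipedia category graph.
# SUBCATEGORIES maps a category to its child categories;
# ARTICLES maps a category to the article titles filed directly under it.
SUBCATEGORIES = {
    "Sport": ["Ball games", "Athletics"],
    "Ball games": ["Football"],
    "Football": ["Football in popular culture"],
    "Athletics": [],
    "Football in popular culture": [],
}
ARTICLES = {
    "Sport": ["Sport"],
    "Ball games": ["Ball"],
    "Football": ["Offside (association football)"],
    "Athletics": ["Marathon"],
    "Football in popular culture": ["A film about football"],
}


def collect_articles(root, max_depth):
    """Breadth-first walk from `root`, visiting at most `max_depth` levels below it.

    The depth cutoff is an assumed stopping rule, used here only to keep
    the walk from drifting into loosely related categories.
    """
    seen = {root}                 # categories already queued (guards against cycles)
    queue = deque([(root, 0)])
    collected = []                # article titles in first-seen order
    while queue:
        category, depth = queue.popleft()
        for title in ARTICLES.get(category, []):
            if title not in collected:
                collected.append(title)
        if depth == max_depth:
            continue              # do not expand children beyond the cutoff
        for child in SUBCATEGORIES.get(category, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return collected
```

In this toy graph, calling `collect_articles("Sport", 1)` stops before the "Football" category, while a larger `max_depth` also pulls in the more peripheral "Football in popular culture" article, which illustrates why some stopping criterion is needed when the goal is in-domain precision.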
Despite the fact that Wikipedia is often criticized for its poor quality, it continues to be one of ...
Wikipedia is considered as the largest knowledge repository in the history of ...
As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many...
Parallel corpora are not available for all domains and languages, but statistical methods in...
Corpus Resources for Descriptive and Applied Studies. Current Challenges and Future Directions: Sele...
Multiple approaches to grab comparable data from the Web have been developed to date. Neverthele...
Content: List of the 743 domains, their term vocabularies in 10 languages, and the Wikipedia articl...
Wikipedia is a rich source of information across many knowledge domains. Yet, ...
Wikipedia is a well-known and widely used source of information. Wikipedia is massive, and its infor...
While Wikipedia exists in 287 languages, its content is unevenly distributed among them. It is there...