Abstract: In web corpus construction, crawling is a necessary step, and it is probably the most costly of all, because it requires expensive bandwidth usage, and excess crawling increases storage requirements. Excess crawling results from the fact that the web contains a lot of redundant content (duplicates and near-duplicates), as well as other material not suitable or desirable for inclusion in web corpora or web indexes (for example, pages with little text or virtually no text at all). An optimized crawler for web corpus construction would ideally avoid crawling such content in the first place, saving bandwidth, storage, and post-processing costs. In this paper, we show in three experiments that two simple scores are suitable to improve t...
Comparable corpora have been used as an alternative for parallel corpora as resources for computatio...
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been...
In this paper we review and compare focused crawling strategies, studied and published during the pa...
This work presents a straightforward method for extending or creating in-domain web corpora by focus...
Efforts to use web data as corpora seek to provide solutions to problems traditional corpora suffer ...
IsiXhosa is a low-resource language, which means that it does not have many large, high-quality corp...
In this paper we present a preliminary analysis over the largest publicly accessible web dataset: Th...
Search engines are the main hub of information in the Web. They crawl and index Web contents to allo...