In web corpus construction, crawling is a necessary step, and it is probably the most costly of all, because it requires expensive bandwidth usage, and excess crawling increases storage requirements. Excess crawling results from the fact that the web contains a lot of redundant content (duplicates and near-duplicates), as well as other material not suitable or desirable for inclusion in web corpora or web indexes (for example, pages with little text or virtually no text at all). An optimized crawler for web corpus construction would ideally avoid crawling such content in the first place, saving bandwidth, storage, and post-processing costs. In this paper, we show in three experiments that two simple scores are suitable...
Comparable corpora have been used as an alternative for parallel corpora as resources for computatio...
In this paper we present our research concerning the relation between two properties of websites and...
Efforts to use web data as corpora seek to provide solutions to problems traditional corpora suffer ...
In this paper we present a preliminary analysis over the largest publicly accessible web dataset: Th...
IsiXhosa is a low-resource language, which means that it does not have many large, high-quality corp...
This work presents a straightforward method for extending or creating in-domain web corpora by focus...
Search engines are the main hub of information in the Web. They crawl and index Web contents to allo...
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been...