Abstract. A crawler is a program that downloads and stores Web pages, and it must revisit those pages because they are frequently updated. In this paper we describe the implementation of CrawlWave, a distributed crawler based on Web Services. CrawlWave is written entirely on the .NET platform; it uses XML/SOAP and is therefore extensible, scalable and easy to maintain. CrawlWave can employ many client and server processors for data collection and therefore operates with minimal system requirements. It is robust, achieves a good download rate and consumes little bandwidth. Data updating was one of the main design issues of CrawlWave; we discuss our updating method, some bottleneck issues, and present first experimental results.
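The abstract describes the core cycle every crawler performs: download a page, extract its links, and feed new URLs back into the frontier. The distributed SOAP machinery of CrawlWave is not reproduced here; the following is only a minimal single-process sketch of the link-extraction step, written in Python with hypothetical names of our own choosing (the paper itself uses .NET):

```python
# Minimal sketch (not CrawlWave's actual code) of the page-processing step
# a crawler applies to each downloaded page: extract absolute outgoing links.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> tags in a fetched page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs linked from one downloaded HTML page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

In a distributed design such as the one described, many client processes would run this step in parallel and report the discovered URLs back to the server tier for deduplication and revisit scheduling.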
Abstract: In this paper, we put forward a technique for parallel crawling of the web. The World Wide...
Abstract: As the size of the Web grows, it becomes increasingly important to parallelize a crawling ...
We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the J...
Web Crawler forms the back-bone of applications that facilitate Web information retrieval. Generic c...
Web search engines are playing a vital role in the virtual and the real world. During the past few d...
Web page crawlers are an essential component in a number of Web applications. The sheer size of the ...
2 In this report we will outline the relevant background research, the design, the implementation an...
We developed a Web crawler that implements the crawling model and architecture presented in Chapter?...
Summary. The large size and the dynamic nature of the Web highlight the need for continuous support ...
Web crawlers visit internet applications, collect data, and learn about new web pages from visited p...
The traditional crawlers used by search engines to build their collection of Web pages frequently ga...
Single crawlers are no longer sufficient to run on the web efficiently as explosive growth of the we...
Today's search engines are equipped with specialized agents known as Web crawlers (download rob...
Over the years, the form of computer games has been evolving. From having to play alone or playing f...
The WWW is a collection of hyperlinked documents available in HTML format [10]. This collection is very hug...