Identifying and tracking new information on the Web is important in sociology, marketing, and survey research, since new trends might be apparent in the new information. Such changes can be observed by crawling the Web periodically. In practice, however, it is impossible to crawl the entire expanding Web repeatedly. This means that the novelty of a page remains unknown, even if that page did not exist in previous snapshots. In this paper, we propose a novelty measure for estimating the certainty that a newly crawled page appeared between the previous and current crawls. Using this novelty measure, new pages can be extracted from a series of unstable snapshots for further analysis and mining to identify new trends on the Web. We evaluate...
This is a preprint of an article published in the Journal of Information Science Vol. 32, No. 2, 131...
How fast does the web change? Does most of the content remain unchanged once it has been authored, o...
Abstract. In this paper we present static and dynamic studies of duplicate and near-duplicate docume...
The growing amount of information published on the Web, combined with its dynamic nature, opens man...
to decide an optimal order in which to crawl and re-crawl webpages. Ideally, crawlers should request...
The World Wide Web is growing at an enormous speed, and has become an indispensable source for infor...
The size and complexity of the World Wide Web mean that for all practical purposes it is impossible...
Abstract. In order to reduce redundant and non-relevant information presented to users related t...
Nowadays, more and more people use the Web as their primary source of up-to-date information. In th...
Web archives offer a rich and plentiful source of information to researchers, analysts, and legal ex...
Recent experiments and analysis suggest that there are about 800 million publicly-indexable web pag...
This paper tries to estimate redundancy level on the Web by employing information collected from exi...
Recently, a new temporal dataset has been made public: it is made of a series of twelve 100 M pages ...
Small and medium enterprises rely on detailed Web analytics to be informed about their market and co...