Search engines largely rely on robots (i.e., crawlers or spiders) to collect information from the Web. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Ethical robots will follow the rules specified in robots.txt. Websites can explicitly specify an access preference for each robot by name. Such biases may lead to a “rich get richer” situation, in which a few popular search engines ultimately dominate the Web because they have preferred access to resources that are inaccessible to others. This issue is seldom addressed, although the robots.txt convention has become a de facto standard for robot regulation and search engines have become an indispensable tool...
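As a concrete illustration of the per-robot access preferences described above, the following sketch shows a hypothetical robots.txt policy that names one crawler explicitly while restricting all others, and checks it with Python's standard urllib.robotparser. The policy, robot names, and paths are illustrative assumptions, not measurements from any real site.

```python
# Minimal sketch: how a robots.txt file can state per-robot access
# preferences, and how a rule-following crawler would interpret them.
# The site policy below is hypothetical and purely illustrative.
from urllib import robotparser

HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
Disallow: /archive/
"""

rp = robotparser.RobotFileParser()
rp.parse(HYPOTHETICAL_ROBOTS_TXT.splitlines())

# The named robot may fetch everything; robots matched only by "*" may not.
print(rp.can_fetch("Googlebot", "/archive/page1.html"))     # True
print(rp.can_fetch("SomeOtherBot", "/archive/page1.html"))  # False
```

Under this hypothetical policy, the named crawler sees the whole site while an unnamed crawler is shut out of the restricted directories; that asymmetry is exactly the kind of bias the abstract is concerned with.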
We describe the observed crawling patterns of various search engines (including Google, Yahoo and MS...
Abstract — The World Wide Web (WWW) is a large, dynamic network and a repository of interconnected document...
Most search engines rely on web robots to collect information from the web. The web is op...
Purpose -- This paper investigates the impact and techniques for mitigating the effects of web robot...
Nowadays, the users of the WWW are not only humans. There are other users or visitors, like web c...
Compares search performance and special features of eight robotic Internet search engines, which use...
The article presents a study of web-crawler behaviour on different websites. A classification of w...
Human nature is inclined to follow least-effort heuristics when seeking scientific literature. Despite...
Sophisticated Web robots sport a wide variety of functionality and visiting characteristics, constit...
It has been traditionally believed that humans, who exhibit well-studied behaviors and statistical r...
This paper examines the use of the "Robot Exclusion Protocol" to restrict the access of search engine ro...
Free-range what!? The robots exclusion standard, a.k.a. robots.txt, is used to give instructions as...
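To make the client side of the standard concrete, here is a minimal sketch of how an ethical crawler might consult robots.txt before fetching a page, again using Python's standard urllib.robotparser. The crawler name and URLs are hypothetical, and the example assumes the target host actually serves a robots.txt file and that network access is available.

```python
# Minimal sketch of a polite crawler consulting robots.txt before fetching.
# The crawler name and URLs are illustrative assumptions.
from urllib import robotparser

AGENT = "ExampleResearchBot"                     # hypothetical crawler name
TARGET = "https://example.com/some/page.html"    # hypothetical page

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                        # fetch and parse the live robots.txt

if rp.can_fetch(AGENT, TARGET):
    delay = rp.crawl_delay(AGENT)                # honour Crawl-delay if one is given
    print(f"Allowed to fetch {TARGET}; crawl delay: {delay}")
else:
    print(f"robots.txt asks {AGENT} not to fetch {TARGET}")
```

The check happens entirely on the crawler's side: robots.txt is advisory, and only robots that choose to follow the standard are constrained by it, which is why the studies above distinguish ethical from unethical robots.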
A significant proportion of Web traffic is now attributed to Web robots, and this proportion is like...