Next: Features a crawler should
We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide.
Features a crawler must provide
- The Web contains servers that create spider
traps, which are generators of web pages that mislead crawlers
into getting stuck fetching an infinite number of pages in a
particular domain. Crawlers must be designed to be resilient to
such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
- Web servers have both implicit and explicit
policies regulating the rate at which a crawler can visit them.
These politeness policies must be respected.
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.