- Distributed:
- The crawler should have the ability to execute in a distributed fashion across multiple machines.
- Scalable:
- The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
- Performance and efficiency:
- The crawl system should make efficient use of various system resources, including processor, storage, and network bandwidth.
- Quality:
- Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first.
- Freshness:
- In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine's index contains a fairly current representation of each indexed web page. For such continuous crawling, the crawler should be able to crawl a page with a frequency that approximates the rate of change of that page (a minimal recrawl-scheduling sketch follows this list).
- Extensible:
- Crawlers should be designed to be extensible in many ways: to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular (a sketch of one such modular fetch interface also follows this list).
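As an illustration of the freshness requirement, the sketch below is not taken from the book; the names `CrawlRecord` and `schedule_after_fetch` are hypothetical. It adapts a per-page recrawl interval to the page's observed rate of change: the interval shrinks when the fetched content has changed since the last visit and grows when it has not.

```python
from dataclasses import dataclass
import hashlib
import time


@dataclass
class CrawlRecord:
    """Per-URL state for adaptive recrawl scheduling (illustrative)."""
    url: str
    content_hash: str = ""
    interval: float = 3600.0   # seconds between fetches, initial guess
    next_fetch: float = 0.0    # epoch time of the next scheduled fetch


def schedule_after_fetch(rec: CrawlRecord, body: bytes,
                         min_interval: float = 600.0,
                         max_interval: float = 7 * 24 * 3600.0) -> None:
    """Adjust the recrawl interval to track the page's observed change rate.

    If the page changed since the last fetch, recrawl sooner (halve the
    interval); if it was unchanged, back off (grow the interval by 50%).
    """
    new_hash = hashlib.sha1(body).hexdigest()
    if new_hash != rec.content_hash:
        rec.interval = max(min_interval, rec.interval / 2)
    else:
        rec.interval = min(max_interval, rec.interval * 1.5)
    rec.content_hash = new_hash
    rec.next_fetch = time.time() + rec.interval
```

A production crawler would typically persist this per-URL state alongside the URL frontier and combine the recrawl schedule with a quality-based priority when deciding what to fetch next.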
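For the extensibility requirement, one common modular pattern (again an illustrative sketch under assumed names, not the architecture described in the book) is to keep fetch-protocol handlers behind a small registry, so that supporting a new protocol means registering a new handler rather than modifying the crawler core.

```python
from typing import Callable, Dict

# Registry mapping a URL scheme to a fetch function; new protocols are
# supported by registering a handler, with no change to the core crawler.
FETCHERS: Dict[str, Callable[[str], bytes]] = {}


def register_fetcher(scheme: str):
    """Decorator that registers a fetch handler for one URL scheme."""
    def wrap(fn: Callable[[str], bytes]) -> Callable[[str], bytes]:
        FETCHERS[scheme] = fn
        return fn
    return wrap


@register_fetcher("http")
def fetch_http(url: str) -> bytes:
    # Standard-library client for brevity; real crawlers use async clients.
    from urllib.request import urlopen
    with urlopen(url) as resp:
        return resp.read()


def fetch(url: str) -> bytes:
    """Dispatch a fetch to the handler registered for the URL's scheme."""
    scheme = url.split("://", 1)[0]
    handler = FETCHERS.get(scheme)
    if handler is None:
        raise ValueError(f"no fetch handler registered for scheme {scheme!r}")
    return handler(url)
```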