
Features a crawler should provide

Distributed:
The crawler should have the ability to execute in a distributed fashion across multiple machines.
Scalable:
The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
Performance and efficiency:
The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.
Quality:
Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first (one way to bias the frontier is sketched after this list).
Freshness:
In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine's index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page (see the recrawl-scheduling sketch after this list).
Extensible:
Crawlers should be designed to be extensible in many ways: they should cope with new data formats, new fetch protocols, and so on. This demands a modular crawler architecture (a handler-registry sketch after this list illustrates one such design).
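
One common way to pursue the Quality goal is to order the URL frontier as a priority queue rather than a plain FIFO, so the URL with the highest estimated usefulness is fetched first. The Python sketch below assumes a numeric usefulness score is supplied by the caller (for instance, a link-based importance estimate); the class and score values are illustrative, not prescribed by the text.

import heapq
import itertools

class PriorityFrontier:
    """URL frontier that pops the most 'useful' URL first.

    Assumes a caller-supplied score; higher means more useful.
    heapq is a min-heap, so scores are negated on insertion.
    """
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps pops deterministic
        self._seen = set()

    def add(self, url, score):
        if url in self._seen:          # avoid enqueueing duplicates
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

# Usage: URLs with higher assumed usefulness are fetched first.
frontier = PriorityFrontier()
frontier.add("http://example.org/popular", score=0.9)
frontier.add("http://example.org/obscure", score=0.1)
print(frontier.pop())  # http://example.org/popular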

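The Freshness goal suggests tracking, per page, whether the content changed since the last visit and adjusting the recrawl interval accordingly. The sketch below is one simple way to do this, assuming change detection by content hashing and multiplicative interval adjustment; the constants are arbitrary choices for illustration, not values from the text.

import hashlib
import time

class RecrawlScheduler:
    """Schedules each page's next fetch based on whether it changed.

    The interval shrinks when a change is observed and grows when the
    page is unchanged, so crawl frequency roughly tracks change rate.
    """
    MIN_INTERVAL = 3600          # 1 hour
    MAX_INTERVAL = 30 * 86400    # 30 days

    def __init__(self):
        self._state = {}  # url -> (content_hash, interval_seconds, next_due)

    def record_fetch(self, url, content):
        # content: raw page bytes from the most recent fetch
        digest = hashlib.sha1(content).hexdigest()
        now = time.time()
        old = self._state.get(url)
        if old is None:
            interval = 86400                                  # first visit: 1 day
        elif old[0] != digest:
            interval = max(self.MIN_INTERVAL, old[1] / 2)     # changed: recrawl sooner
        else:
            interval = min(self.MAX_INTERVAL, old[1] * 2)     # unchanged: back off
        self._state[url] = (digest, interval, now + interval)

    def due(self, url):
        entry = self._state.get(url)
        return entry is None or time.time() >= entry[2]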

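Extensibility is typically obtained by hiding fetching (and, analogously, parsing) behind a narrow interface, so that supporting a new fetch protocol or data format means registering a new module with the crawler core rather than modifying it. The sketch below assumes this plug-in style; the handler classes and the registry are hypothetical.

class ProtocolHandler:
    """Interface for fetch protocols; subclass to add a new one."""
    def fetch(self, url):
        raise NotImplementedError

class HttpHandler(ProtocolHandler):
    def fetch(self, url):
        # A real handler would issue an HTTP request; a stub is returned here.
        return b"<html>...</html>"

class Crawler:
    """Crawler core that stays unchanged when new handlers are added."""
    def __init__(self):
        self._handlers = {}

    def register(self, scheme, handler):
        self._handlers[scheme] = handler

    def fetch(self, url):
        scheme = url.split("://", 1)[0]
        handler = self._handlers.get(scheme)
        if handler is None:
            raise ValueError("no handler registered for scheme: " + scheme)
        return handler.fetch(url)

# Adding, say, FTP support later only means registering another handler.
crawler = Crawler()
crawler.register("http", HttpHandler())
page = crawler.fetch("http://example.org/")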