- Distributed:
- The crawler should have the ability to execute in a distributed fashion across multiple machines.
- Scalable:
- The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
- Performance and efficiency:
- The crawl system should make efficient use of various system resources, including processor, storage, and network bandwidth.
- Quality:
- Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first.
- Freshness:
- In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine's index contains a fairly current representation of each indexed web page. For such continuous crawling, the crawler should be able to crawl a page with a frequency that approximates the rate of change of that page (a minimal recrawl-scheduling sketch follows this list).
- Extensible:
- Crawlers should be designed to be extensible in many ways: to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular (a sketch of one such modular fetch interface also follows this list).
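As an illustration of the freshness requirement, the sketch below is not taken from the book; the names `CrawlRecord` and `schedule_after_fetch` are hypothetical. It adapts a per-page recrawl interval to the page's observed rate of change: the interval shrinks when the fetched content has changed since the last visit and grows when it has not.

```python
from dataclasses import dataclass
import hashlib
import time


@dataclass
class CrawlRecord:
    """Per-URL state for adaptive recrawl scheduling (illustrative)."""
    url: str
    content_hash: str = ""
    interval: float = 3600.0   # seconds between fetches, initial guess
    next_fetch: float = 0.0    # epoch time of the next scheduled fetch


def schedule_after_fetch(rec: CrawlRecord, body: bytes,
                         min_interval: float = 600.0,
                         max_interval: float = 7 * 24 * 3600.0) -> None:
    """Adjust the recrawl interval to track the page's observed change rate.

    If the page changed since the last fetch, recrawl sooner (halve the
    interval); if it was unchanged, back off (grow the interval by 50%).
    """
    new_hash = hashlib.sha1(body).hexdigest()
    if new_hash != rec.content_hash:
        rec.interval = max(min_interval, rec.interval / 2)
    else:
        rec.interval = min(max_interval, rec.interval * 1.5)
    rec.content_hash = new_hash
    rec.next_fetch = time.time() + rec.interval
```

A production crawler would typically persist this per-URL state alongside the URL frontier and combine the recrawl schedule with a quality-based priority when deciding what to fetch next.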
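For the extensibility requirement, one common modular pattern (again an illustrative sketch under assumed names, not the architecture described in the book) is to keep fetch-protocol handlers behind a small registry, so that supporting a new protocol means registering a new handler rather than modifying the crawler core.

```python
from typing import Callable, Dict

# Registry mapping a URL scheme to a fetch function; new protocols are
# supported by registering a handler, with no change to the core crawler.
FETCHERS: Dict[str, Callable[[str], bytes]] = {}


def register_fetcher(scheme: str):
    """Decorator that registers a fetch handler for one URL scheme."""
    def wrap(fn: Callable[[str], bytes]) -> Callable[[str], bytes]:
        FETCHERS[scheme] = fn
        return fn
    return wrap


@register_fetcher("http")
def fetch_http(url: str) -> bytes:
    # Standard-library client for brevity; real crawlers use async clients.
    from urllib.request import urlopen
    with urlopen(url) as resp:
        return resp.read()


def fetch(url: str) -> bytes:
    """Dispatch a fetch to the handler registered for the URL's scheme."""
    scheme = url.split("://", 1)[0]
    handler = FETCHERS.get(scheme)
    if handler is None:
        raise ValueError(f"no fetch handler registered for scheme {scheme!r}")
    return handler(url)
```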