Next: Hardware basics Up: irbook Previous: References and further reading Contents Index

Index construction

In this chapter, we look at how to construct an inverted index. We call this process index construction or indexing ; the process or machine that performs it the indexer . The design of indexing algorithms is governed by hardware constraints. We therefore begin this chapter with a review of the basics of computer hardware that are relevant for indexing. We then introduce blocked sort-based indexing (Section 4.2 ), an efficient single-machine algorithm designed for static collections that can be viewed as a more scalable version of the basic sort-based indexing algorithm we introduced in Chapter 1 . Section 4.3 describes single-pass in-memory indexing, an algorithm that has even better scaling properties because it does not hold the vocabulary in memory. For very large collections like the web, indexing has to be distributed over computer clusters with hundreds or thousands of machines. We discuss this in Section 4.4 . Collections with frequent changes require dynamic indexing introduced in Section 4.5 so that changes in the collection are immediately reflected in the index. Finally, we cover some complicating issues that can arise in indexing - such as security and indexes for ranked retrieval - in Section 4.6 .

Index construction interacts with several topics covered in other chapters. The indexer needs raw text, but documents are encoded in many ways (see Chapter 2 ). Indexers compress and decompress intermediate files and the final index (see Chapter 5 ). In web search, documents are not on a local file system, but have to be spidered or crawled (see Chapter 20 ). In enterprise search , most documents are encapsulated in varied content management systems, email applications, and databases. We give some examples in Section 4.7 . Although most of these applications can be accessed via http, native Application Programming Interfaces (APIs) are usually more efficient. The reader should be aware that building the subsystem that feeds raw text to the indexing process can in itself be a challenging problem.

Subsections

Next: Hardware basics Up: irbook Previous: References and further reading Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07