
The term vocabulary and postings lists

Recall the major steps in inverted index construction:

1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.
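
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's implementation: the function names (`tokenize`, `normalize`, `build_index`), the sample documents, the regular expression used to split text, and the use of lowercasing as the only preprocessing step are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 2: chop a character stream into tokens (here, runs of letters or digits)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    """Step 3: map a token to its term. Lowercasing is a stand-in (an assumption)
    for the richer linguistic preprocessing discussed in this chapter."""
    return token.lower()

def build_index(docs):
    """Step 4: map each term to the sorted list of IDs of documents it occurs in."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Step 1: collect the documents to be indexed (hypothetical sample data).
docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
    3: "Romans and friends",
}
index = build_index(docs)
print(index["romans"])
```

Because `normalize` folds case, the tokens `Romans` in documents 1 and 3 fall into one equivalence class and share a single postings list.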

In this chapter we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms that a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4. Then we return to the implementation of postings lists. In Section 2.3, we examine an extended postings list data structure that supports faster querying, while Section 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web.
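
As a rough preview of why phrase queries need a richer postings structure, the sketch below stores the positions at which each term occurs in each document, which is one way of supporting such queries. It is a minimal illustration under stated assumptions, not the data structure as the book develops it: the names `build_positional_index` and `phrase_query`, the whitespace tokenizer, and the sample documents are all hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; whitespace splitting and
    lowercasing are simplifying assumptions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return sorted IDs of documents containing the terms of `phrase` consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # The i-th term must appear exactly i positions after the first term.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

# Hypothetical sample data.
pos_docs = {
    1: "to be or not to be",
    2: "be not afraid to be",
    3: "not to worry",
}
pos_index = build_positional_index(pos_docs)
print(phrase_query(pos_index, "to be"))
```

An index that records only document IDs could report that documents 1 and 2 contain both `to` and `be`, but it could not confirm that the two words are adjacent; the stored positions make that check possible.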


© 2008 Cambridge University Press
2009-04-07