The term vocabulary and postings lists
Recall the major steps in inverted index construction:
- Collect the documents to be indexed.
- Tokenize the text.
- Do linguistic preprocessing of tokens.
- Index the documents that each term occurs in.
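The four steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the book's implementation: the function names (tokenize, normalize, build_index) are hypothetical, tokens are assumed to be runs of ASCII letters and digits, and case folding stands in for all linguistic preprocessing.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumption: a token is a maximal run of ASCII letters or digits.
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(token):
    # Assumption: case folding is the only linguistic preprocessing applied.
    return token.lower()


def build_index(docs):
    # docs maps a document ID to its text (step 1: the collected documents).
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):                 # step 2: tokenize
            postings[normalize(token)].add(doc_id)   # steps 3 and 4
    # Store each postings list as a sorted list of document IDs.
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
    3: "Romans and friends",
}
index = build_index(docs)
print(index["romans"])  # documents containing the term "romans"
```

Because "Romans" and "romans" fall into the same equivalence class under case folding, both occurrences end up in one postings list.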
In this chapter we first briefly mention how the basic unit of a
document can be defined and how the character sequence that it comprises
is determined (Section 2.1). We then examine in detail some of the
substantive linguistic issues of tokenization and linguistic
preprocessing, which determine the vocabulary of terms that a system uses
(Section 2.2). Tokenization is the process of chopping character streams
into tokens, while linguistic preprocessing then deals with building
equivalence classes of tokens, which are the set of terms that are
indexed. Indexing itself is covered in Chapters 1 and 4.
Then we return to the implementation of postings lists.
In Section 2.3, we examine an extended
postings list data structure that supports faster querying, while
Section 2.4 covers building postings data structures suitable for
handling phrase and proximity queries, of the sort that commonly appear in
both extended Boolean models and on the web.
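To make the idea of a postings structure that supports phrase queries concrete, here is a minimal sketch of one common approach: storing, for each term and document, the positions at which the term occurs. The names build_positional_index and phrase_query are hypothetical, the tokenizer is an assumption (lowercased runs of letters and digits), and the matching loop is deliberately naive rather than an efficient merge.

```python
import re
from collections import defaultdict


def simple_tokens(text):
    # Assumption: lowercase the text and split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    # Maps term -> {doc_id: [positions of the term in that document]}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(simple_tokens(text)):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    terms = simple_tokens(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase matches if, for some start position s of the first
        # term, the k-th term occurs at position s + k.
        for s in index[terms[0]][doc_id]:
            if all(s + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "to be or not to be",
    2: "be to the point",
    3: "not to be missed",
}
pos_index = build_positional_index(docs)
print(phrase_query(pos_index, "to be"))
```

A proximity query could reuse the same structure by relaxing the exact offset test to a bound on the distance between positions.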
© 2008 Cambridge University Press
2009-04-07