List of Tables
List of Figures
Boolean retrieval
- An example information retrieval problem
- A first take at building an inverted index
- Processing Boolean queries
- The extended Boolean model versus ranked retrieval
- References and further reading
The term vocabulary and postings lists
- Document delineation and character sequence decoding
  - Obtaining the character sequence in a document
  - Choosing a document unit
- Determining the vocabulary of terms
- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
- References and further reading
Dictionaries and tolerant retrieval
- Search structures for dictionaries
- Wildcard queries
  - General wildcard queries
  - k-gram indexes for wildcard queries
- Spelling correction
- Phonetic correction
- References and further reading
Index construction
- Hardware basics
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
- Other types of indexes
- References and further reading
Index compression
- Statistical properties of terms in information retrieval
  - Heaps' law: Estimating the number of terms
  - Zipf's law: Modeling the distribution of terms
- Dictionary compression
  - Dictionary as a string
  - Blocked storage
- Postings file compression
  - Variable byte codes
  - Gamma codes
- References and further reading
Scoring, term weighting and the vector space model
- Parametric and zone indexes
- Term frequency and weighting
  - Inverse document frequency
  - Tf-idf weighting
- The vector space model for scoring
- Variant tf-idf functions
- References and further reading
Computing scores in a complete search system
- Efficient scoring and ranking
- Components of an information retrieval system
- Vector space scoring and query operator interaction
- References and further reading
Evaluation in information retrieval
- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Assessing relevance
  - Critiques and justifications of the concept of relevance
- A broader perspective: System quality and user utility
- Results snippets
- References and further reading
Relevance feedback and query expansion
- Relevance feedback and pseudo relevance feedback
- Global methods for query reformulation
- References and further reading
XML retrieval
- Basic XML concepts
- Challenges in XML retrieval
- A vector space model for XML retrieval
- Evaluation of XML retrieval
- Text-centric vs. data-centric XML retrieval
- References and further reading
- Exercises
Probabilistic information retrieval
- Review of basic probability theory
- The Probability Ranking Principle
  - The 1/0 loss case
  - The PRP with retrieval costs
- The Binary Independence Model
- An appraisal and some extensions
- References and further reading
Language models for information retrieval
- Language models
- The query likelihood model
- Language modeling versus other approaches in IR
- Extended language modeling approaches
- References and further reading
Text classification and Naive Bayes
- The text classification problem
- Naive Bayes text classification
  - Relation to multinomial unigram language model
- The Bernoulli model
- Properties of Naive Bayes
  - A variant of the multinomial model
- Feature selection
- Evaluation of text classification
- References and further reading
Vector space classification
- Document representations and measures of relatedness in vector spaces
- Rocchio classification
- k nearest neighbor
  - Time complexity and optimality of kNN
- Linear versus nonlinear classifiers
- Classification with more than two classes
- The bias-variance tradeoff
- References and further reading
- Exercises
Support vector machines and machine learning on documents
- Support vector machines: The linearly separable case
- Extensions to the SVM model
- Issues in the classification of text documents
  - Choosing what kind of classifier to use
  - Improving classifier performance
- Machine learning methods in ad hoc information retrieval
  - A simple example of machine-learned scoring
  - Result ranking by machine learning
- References and further reading
Flat clustering
- Clustering in information retrieval
- Problem statement
  - Cardinality - the number of clusters
- Evaluation of clustering
- K-means
  - Cluster cardinality in K-means
- Model-based clustering
- References and further reading
- Exercises
Hierarchical clustering
- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
  - Time complexity of HAC
- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes
- References and further reading
- Exercises
Matrix decompositions and latent semantic indexing
- Linear algebra review
  - Matrix decompositions
- Term-document matrices and singular value decompositions
- Low-rank approximations
- Latent semantic indexing
- References and further reading
Web search basics
- Background and history
- Web characteristics
  - The web graph
  - Spam
- Advertising as the economic model
- The search user experience
  - User query needs
- Index size and estimation
- Near-duplicates and shingling
- References and further reading
Web crawling and indexes
- Overview
  - Features a crawler must provide
  - Features a crawler should provide
- Crawling
- Distributing indexes
- Connectivity servers
- References and further reading
Link analysis
- The Web as a graph
  - Anchor text and the web graph
- PageRank
- Hubs and Authorities
  - Choosing the subset of the Web
- References and further reading
Bibliography
Index

© 2008 Cambridge University Press
2009-04-07
