- Typical system parameters
The seek time is the time needed to move the disk head to
a new position. The transfer time per byte is the rate
of transfer from disk to memory when the head is in the right position.
- Collection statistics for Reuters-RCV1. Values are
rounded for the computations in this book.
The unrounded values are: 806,791 documents, 222
tokens per document, 391,523 (distinct) terms,
bytes per token with spaces and punctuation,
4.5 bytes per token without spaces and punctuation,
7.5 bytes per term,
and 96,969,056 tokens.
The numbers in this table correspond to the third line (``case
folding'') in Table 5.1.
- The five steps in constructing an
index for Reuters-RCV1 in blocked sort-based indexing. Line numbers refer to Figure 4.2.
- Collection statistics for a large collection.
- The effect of preprocessing on
the number of terms,
nonpositional postings, and tokens for Reuters-RCV1.
``Δ%'' indicates the reduction in size from the
previous line, except that ``30 stop words'' and ``150 stop
words'' both use ``case folding'' as their reference
line. ``T%'' is the cumulative (``total'') reduction from unfiltered.
We performed stemming with the Porter stemmer
(Chapter 2, Section 2.2.4).
- Dictionary compression for Reuters-RCV1.
- Encoding gaps instead of document IDs. For example,
a postings list stores gaps 107, 5, 43, ... instead of
docIDs 283154, 283159, 283202, ....
The first docID is left unchanged (only
shown for arachnocentric).
- Some examples of unary and γ codes.
Unary codes are only shown for the smaller numbers.
Commas in codes are for readability only and are not part of the actual codes.
- Index and dictionary compression for Reuters-RCV1.
The compression ratio depends on the proportion of actual text
in the collection. Reuters-RCV1 contains a
large amount of XML
markup. Using the two best
compression schemes, γ encoding and blocking with
front coding, the
ratio of compressed index size to collection size is therefore
especially small for Reuters-RCV1.
- Two gap sequences to be merged in blocked sort-based indexing.
- Cosine computation for Exercise 6.4.4.
- Calculating the kappa statistic.
- INEX 2002 collection statistics.
- INEX 2002 results of the vector space model in Section 10.3 for
content-and-structure (CAS) queries and the quantization function Q.
- A comparison of content-only and full-structure
search in INEX 2003/2004.
- Data for parameter
- Training and test times for
- Multinomial versus Bernoulli model.
- Correct estimation implies accurate prediction, but accurate
prediction does not imply correct estimation.
- A set of documents for which
the NB independence assumptions are problematic.
- Critical values of the $\chi^2$
distribution with one degree of freedom. For example, if
the two events are independent, then
$P(X^2 > 6.63) < 0.01$. So for $X^2 > 6.63$
the assumption of independence can be rejected with 99% confidence.
- The ten largest classes in the
collection with number of documents in training and test sets.
- Macro- and microaveraging.
``Truth'' is the true class and
decision of the classifier. In this example, macroaveraged precision is
. Microaveraged precision is
- Text classification effectiveness numbers on Reuters-21578
for F (in percent). Results from
Li and Yang (2003) (a), Joachims (1998) (b: kNN)
and Dumais et al. (1998) (b: NB, Rocchio, trees, SVM).
- Data for parameter estimation exercise.
- Vectors and class centroids for the data in
Table 13.1.
- Training examples for machine-learned scoring.
- Some applications of clustering in information retrieval.
- The four external evaluation measures applied to
the clustering in Figure 16.4.
- Comparison of HAC algorithms.
© 2008 Cambridge University Press