Next: Blocked sort-based indexing
Up: Index construction
Previous: Index construction
Contents
Index
Hardware basics
Table 4.1:
Typical system parameters
in 2007.
The seek time is the time needed to position the disk head in
a new position. The transfer time per byte is the rate
of transfer from disk to memory when the head is in the right position.
| Symbol |
Statistic |
Value |
|
| |
average seek time |
5 ms
s |
|
| |
transfer time per byte |
0.02 s
s |
|
| |
processor's clock rate |
|
|
| |
lowlevel operation |
|
|
| |
(e.g., compare & swap a word) |
0.01 s
s |
|
| |
size of main memory |
several GB |
|
| |
size of disk space |
1 TB or more |
|
When building an information retrieval (IR)
system, many decisions are based on the characteristics of the computer
hardware on which the system runs. We therefore begin
this chapter with a brief review of computer
hardware.
Performance characteristics
typical of systems in 2007 are shown in Table 4.1 .
A list of hardware basics that we need in
this book to motivate IR system design follows.
- Access to data in memory is much faster than access to
data on disk. It takes a few clock cycles (perhaps
seconds) to access a byte in memory, but
much longer
to transfer it from disk
(about
seconds). Consequently, we
want to keep as much data as possible in memory, especially
those data that we need to access frequently.
We call the technique of keeping frequently used disk data
in main memory caching .
- When doing a disk read or write, it takes a while for
the disk head to move to the part of the disk where the data
are located. This time is called the
seek time and it
averages 5 ms for typical disks.
No data are being transferred during the seek.
To maximize data transfer
rates, chunks of data that will be read together should
therefore be stored contiguously on disk. For example,
using the numbers in Table 4.1 it may take as
little as 0.2 seconds to transfer 10 megabytes (MB) from disk to memory
if it is stored as one chunk, but up to
seconds if it is
stored in 100 noncontiguous chunks because we need to
move the disk head up to 100 times.
- Operating systems generally read and write entire
blocks. Thus, reading
a single byte from disk can take as much time as reading the
entire block. Block sizes of 8, 16, 32, and 64 kilobytes (KB)
are common. We call the part of main memory where a block being
read or written is stored a buffer .
- Data transfers from disk to memory are handled by the
system bus, not by the processor. This means that the
processor is available to process data during disk I/O.
We can exploit this fact to speed up data transfers
by storing compressed data on disk. Assuming an efficient
decompression algorithm, the total time of
reading and then decompressing compressed data is usually less
than reading uncompressed data.
- Servers used in IR systems
typically have
several gigabytes (GB) of main
memory, sometimes tens of GB. Available disk space is
several orders of magnitude larger.
Next: Blocked sort-based indexing
Up: Index construction
Previous: Index construction
Contents
Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07