Next: References and further reading Up: Web search basics Previous: Index size and estimation Contents Index

Near-duplicates and shingling

One aspect we have ignored in the discussion of index size in Section 19.5 is duplication: the Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages. Many of these are legitimate copies; for instance, certain information repositories are mirrored simply to provide redundancy and access reliability. Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads.

The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy of the other. This simplistic approach fails to capture a crucial and widespread phenomenon on the Web: near duplication. In many cases, the contents of one web page are identical to those of another except for a few characters - say, a notation showing the date and time at which the page was last modified. Even in such cases, we want to be able to declare the two pages to be close enough that we only index one copy. Short of exhaustively comparing all pairs of web pages, an infeasible task at the scale of billions of pages, how can we detect and filter out such near duplicates?

We now describe a solution to the problem of detecting near-duplicate web pages. The answer lies in a technique known as shingling . Given a positive integer and a sequence of terms in a document , define the -shingles of to be the set of all consecutive sequences of terms in . As an example, consider the following text: a rose is a rose is a rose. The 4-shingles for this text ( is a typical value used in the detection of near-duplicate web pages) are a rose is a, rose is a rose and is a rose is. The first two of these shingles each occur twice in the text. Intuitively, two documents are near duplicates if the sets of shingles generated from them are nearly the same. We now make this intuition precise, then develop a method for efficiently computing and comparing the sets of shingles for all web pages.

Let denote the set of shingles of document . Recall the Jaccard coefficient from page 3.3.4 , which measures the degree of overlap between the sets and as $\vert S(d_1) \cap S(d_2)\vert/\vert S(d_1) \cup S(d_2)\vert$ ; denote this by . Our test for near duplication between and is to compute this Jaccard coefficient; if it exceeds a preset threshold (say, ), we declare them near duplicates and eliminate one from indexing. However, this does not appear to have simplified matters: we still have to compute Jaccard coefficients pairwise.

To avoid this, we use a form of hashing. First, we map every shingle into a hash value over a large space, say 64 bits. For , let be the corresponding set of 64-bit hash values derived from . We now invoke the following trick to detect document pairs whose sets have large Jaccard overlaps. Let $\pi$ be a random permutation from the 64-bit integers to the 64-bit integers. Denote by $\Pi(d_j)$ the set of permuted hash values in ; thus for each $h\in H(d_j)$ , there is a corresponding value $\pi(h)\in \Pi(d_j)$ .

$\begin{figure} % latex2html id marker 30462 \begin{picture}(600,200) \multiput(0... ...s the minimum values $x_1^\pi$\ and $x_2^\pi$\ for each document. }\end{figure}$

Let $x_j^\pi$ be the smallest integer in $\Pi(d_j)$ . Then

Theorem.

$\begin{displaymath} J(S(d_1),S(d_2)) = P(x_1^\pi=x_2^\pi). \end{displaymath}$

(247)

End theorem.

Proof. We give the proof in a slightly more general setting: consider a family of sets whose elements are drawn from a common universe. View the sets as columns of a matrix , with one row for each element in the universe. The element $a_{ij} = 1$ if element is present in the set that the th column represents.

Let $\Pi$ be a random permutation of the rows of ; denote by $\Pi(S_j)$ the column that results from applying $\Pi$ to the th column. Finally, let $x_j^\pi$ be the index of the first row in which the column $\Pi(S_j)$ has a . We then prove that for any two columns ,

$\begin{displaymath} P(x_{j_1}^\pi=x_{j_2}^\pi)=J(S_{j_1},S_{j_2}). \end{displaymath}$

(248)

If we can prove this, the theorem follows.

**Figure 19.9:** Two sets $S_{j_1}$ and $S_{j_2}$ ; their Jaccard coefficient is .
$\begin{figure}\begin{tabular}{\vert c\vert c\vert} \hline % after \\ : \hline ... ... 0\\ 1 & 1\\ 0 & 0\\ 1 & 1\\ 0 & 1\\ \hline \end{tabular} \end{figure}$

Consider two columns as shown in Figure 19.9 . The ordered pairs of entries of $S_{j_1}$ and $S_{j_2}$ partition the rows into four types: those with 0's in both of these columns, those with a 0 in $S_{j_1}$ and a 1 in $S_{j_2}$ , those with a 1 in $S_{j_1}$ and a 0 in $S_{j_2}$ , and finally those with 1's in both of these columns. Indeed, the first four rows of Figure 19.9 exemplify all of these four types of rows. Denote by $C_{00}$ the number of rows with 0's in both columns, $C_{01}$ the second, $C_{10}$ the third and $C_{11}$ the fourth. Then,

$\begin{displaymath} J(S_{j_1},S_{j_2})=\frac{C_{11}}{C_{01}+C_{10}+C_{11}}. \end{displaymath}$

(249)

To complete the proof by showing that the right-hand side of Equation 249 equals $P(x_{j_1}^\pi=x_{j_2}^\pi)$ , consider scanning columns

in increasing row index until the first non-zero entry is found in either column. Because $\Pi$ is a random permutation, the probability that this smallest row has a 1 in both columns is exactly the right-hand side of Equation 249. End proof.

Thus, our test for the Jaccard coefficient of the shingle sets is probabilistic: we compare the computed values $x_i^\pi$ from different documents. If a pair coincides, we have candidate near duplicates. Repeat the process independently for 200 random permutations $\pi$ (a choice suggested in the literature). Call the set of the 200 resulting values of $x_i^\pi$ the sketch $\psi(d_i)$ of . We can then estimate the Jaccard coefficient for any pair of documents to be $\vert\psi_i\cap\psi_j\vert/200$ ; if this exceeds a preset threshold, we declare that and are similar.

How can we quickly compute $\vert\psi_i\cap\psi_j\vert/200$ for all pairs ? Indeed, how do we represent all pairs of documents that are similar, without incurring a blowup that is quadratic in the number of documents? First, we use fingerprints to remove all but one copy of identical documents. We may also remove common HTML tags and integers from the shingle computation, to eliminate shingles that occur very commonly in documents without telling us anything about duplication. Next we use a union-find algorithm to create clusters that contain documents that are similar. To do this, we must accomplish a crucial step: going from the set of sketches to the set of pairs such that and are similar.

To this end, we compute the number of shingles in common for any pair of documents whose sketches have any members in common. We begin with the list $<x_i^\pi, d_i>$ sorted by $x_i^\pi$ pairs. For each $x_i^\pi$ , we can now generate all pairs for which $x_i^\pi$ is present in both their sketches. From these we can compute, for each pair with non-zero sketch overlap, a count of the number of $x_i^\pi$ values they have in common. By applying a preset threshold, we know which pairs have heavily overlapping sketches. For instance, if the threshold were 80%, we would need the count to be at least 160 for any . As we identify such pairs, we run the union-find to group documents into near-duplicate ``syntactic clusters''. This is essentially a variant of the single-link clustering algorithm introduced in Section 17.2 (page ).

One final trick cuts down the space needed in the computation of $\vert\psi_i\cap\psi_j\vert/200$ for pairs , which in principle could still demand space quadratic in the number of documents. To remove from consideration those pairs whose sketches have few shingles in common, we preprocess the sketch for each document as follows: sort the $x_i^\pi$ in the sketch, then shingle this sorted sequence to generate a set of super-shingles for each document. If two documents have a super-shingle in common, we proceed to compute the precise value of $\vert\psi_i\cap\psi_j\vert/200$ . This again is a heuristic but can be highly effective in cutting down the number of pairs for which we accumulate the sketch overlap counts.

Exercises.

Web search engines A and B each crawl a random subset of the same size of the Web. Some of the pages crawled are duplicates - exact textual copies of each other at different URLs. Assume that duplicates are distributed uniformly amongst the pages crawled by A and B. Further, assume that a duplicate is a page that has exactly two copies - no pages have more than two copies. A indexes pages without duplicate elimination whereas B indexes only one copy of each duplicate page. The two random subsets have the same size before duplicate elimination. If, 45% of A's indexed URLs are present in B's index, while 50% of B's indexed URLs are present in A's index, what fraction of the Web consists of pages that do not have a duplicate?
Instead of using the process depicted in Figure 19.8 , consider instead the following process for estimating the Jaccard coefficient of the overlap between two sets and . We pick a random subset of the elements of the universe from which and are drawn; this corresponds to picking a random subset of the rows of the matrix in the proof. We exhaustively compute the Jaccard coefficient of these random subsets. Why is this estimate an unbiased estimator of the Jaccard coefficient for and ?
Explain why this estimator would be very difficult to use in practice.

Next: References and further reading Up: Web search basics Previous: Index size and estimation Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07