CS 224N Final Project

Unsupervised Web Page Clustering

Spring 2000, Stanford Univ.

Paul Ruhlen – ruhlen@cs

Husrev Tolga Ilhan – ilhan@leland

Vladimir Livshits – livshits@cs



Introduction

Problem With the huge proliferation of the World Wide Web, meaningfully indexing or searching millions of non-homogeneous documents has become an increasing challenge and opportunity. We were curious how much structure could be elicited directly from sets of web pages, with no supervised training or "priming" of classifiers. While this unsupervised approach is unlikely to achieve the accuracy of a hand-trained or hand-constructed system, it could suggest some lower bound on the need for expensive and slow human indexing of the millions of new sites and pages being continuously added to the Web.

Approach, Data, Evaluation We decided to explore the behavior of the EM algorithm when used for clustering a set of Web pages, in large part to gain experience with issues of clustering and web processing. We initially decided to learn a set of Naïve Bayes classifiers, one for each cluster, and to iterate EM: Maximizing the likelihood of the data given these classifiers, then using the new classifiers to re-assign each page fractional membership in the various clusters, according to its Expectation of being in each.

Our data were drawn from various categories or "topics" in the www.dmoz.org directory. This directory is similar in purpose and scope to Yahoo! and other web indices, although it is maintained by a community of volunteer editors, and unlike Yahoo! dmoz.org has an open licensing agreement that allows free use for derivative works, such as this research. Selecting documents that were pre-categorized guaranteed there was some underlying structure we could try to learn.

To evaluate our system’s performance, we carried along with each page its original category string under dmoz.org, such as "Sports/Badminton/". To measure the quality of the current learned clusters, we found, for each cluster, the most frequent original category string among its members (counting as members only items with a higher fractional expectation of being in this category than in any other). Then every member that matched this most frequent original category was counted as correctly categorized, and any other member was counted as an error.
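
A minimal sketch of this evaluation measure, assuming each document carries its learned expectation vector and its original dmoz.org category string (the type and member names such as Document, expectations, and origCategory are illustrative, not the actual members of our C++ classes):

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    struct Document {
        std::vector<double> expectations;  // fractional membership in each cluster
        std::string origCategory;          // original dmoz.org category string
    };

    // Fraction of documents whose original category matches the most frequent
    // original category of the cluster they most belong to.
    double clusterAccuracy(const std::vector<Document>& docs, int numClusters) {
        // Assign each document to its highest-expectation cluster.
        std::vector<int> assigned(docs.size());
        std::vector<std::map<std::string, int> > counts(numClusters);
        for (size_t d = 0; d < docs.size(); ++d) {
            int best = std::max_element(docs[d].expectations.begin(),
                                        docs[d].expectations.end())
                       - docs[d].expectations.begin();
            assigned[d] = best;
            ++counts[best][docs[d].origCategory];
        }
        // A document is "correct" if it matches its cluster's majority category.
        int correct = 0;
        for (size_t d = 0; d < docs.size(); ++d) {
            const std::map<std::string, int>& c = counts[assigned[d]];
            std::string majority;
            int most = 0;
            for (std::map<std::string, int>::const_iterator it = c.begin();
                 it != c.end(); ++it)
                if (it->second > most) { most = it->second; majority = it->first; }
            if (docs[d].origCategory == majority) ++correct;
        }
        return docs.empty() ? 0.0 : double(correct) / docs.size();
    }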

This measure of accuracy had the advantage that achieving 100% meant we had exactly re-created the original clusters. It had the disadvantage that no "partial credit" was given for clustering together similar documents that happened to have been originally indexed separately, perhaps arbitrarily or mistakenly, by a human editor. (We discovered at least one such mis-categorization, when a site on Chinese calligraphy ended up counting as an error in our calligraphy cluster, because it had been indexed under sculpture.)

It also gave no direct penalty to splitting an original single cluster into two or more parts, although such a split guaranteed that members of at least one other cluster would have to be counted as errors, since there would not be enough remaining clusters to go around.

Findings We achieved surprisingly good accuracy rates on sets of up to 300 pages, drawn from up to 8 categories, reaching average accuracies across multiple runs as high as 92%. The approach did not seem likely to scale well to thousands of categories without substantial redesign.

Literature

We did only a minimal literature review when starting this project, partly because of the very limited time available, and partly because we felt we would learn a great deal just "getting our hands dirty" building and improving a clustering system.

We did take a close look at "Hierarchically classifying documents using very few words," Koller, D. and Sahami, M., Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 170–178, 1997.

While this was a semi-supervised learning system, classifying computer science research papers into an existing hierarchical category tree, it did train Naïve Bayes classifiers using EM to let a large body of unlabeled training data improve the performance of the classifiers. It also used an extension to NB classifiers that learned limited dependencies among words, and learned a hierarchical set of classifiers, to reduce the size and scope of each classifier and improve reliability.

We also took a quick look at: A Corpus-Based Investigation of Definite Description Use, Massimo Poesio, Renata Vieira, Centre for Cognitive Science, The University of Edinburgh, and a few other papers on Kappa as an alternate measure of the reliability of our clusters. We decided that consistency of learned clusters from one run to the next was less meaningful than agreement with human editors, since learning such inane clusters as alphabetic ranges of the first word of the document would be completely consistent and repeatable, but equally useless as category clusters.

Design and Algorithms

The system we designed has two major parts: a Perl script for downloading, parsing, filtering, and caching the URLs as files of tokens, and a C++ program for reading and clustering the resulting files, and measuring the accuracy of the learned clusters.

Download and HTML Parse We decided early on to download and cache web pages, in order to isolate our clustering tests from the often slow response times of some web servers, and to improve repeatability in testing, since pages that were present in one download might be unavailable the next. We chose Perl to implement this portion of the system, due to its strong features for network and socket I/O, ease of text processing and filtering, and extensive library modules. Most HTML tags were completely removed from the text, leaving only the visible plain text from the web page, with no formatting information remaining.

We also chose to implement an optional stop-word list at this point, to remove high-frequency low-information common words. We started with a list of generally common English words, and supplemented it during development with vocabulary common to web pages, such as click, e-mail, visit, and site, with our final set including 156 words. Stop-word filtering, like most features of the script, could be controlled through command-line options, so comparative tests were simple to perform.

We also stripped out all characters other than letters, hyphens, apostrophes, and periods, and then dropped all tokens that contained no letters, or that were shorter than four characters. For almost all testing we shifted all text to lower-case, to better smooth the text distribution.
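
A rough sketch of this token filtering, written here in C++ for illustration (the actual filtering was done in our Perl front end, and the function name is hypothetical); the character set, four-character minimum, and lower-casing follow the description above:

    #include <cctype>
    #include <sstream>
    #include <string>
    #include <vector>

    // Keep only letters, hyphens, apostrophes, and periods; drop tokens with
    // no letters or fewer than four characters; shift everything to lower-case.
    std::vector<std::string> filterTokens(const std::string& text) {
        std::string cleaned;
        for (size_t i = 0; i < text.size(); ++i) {
            char c = text[i];
            if (std::isalpha((unsigned char)c) || c == '-' || c == '\'' || c == '.')
                cleaned += (char)std::tolower((unsigned char)c);
            else
                cleaned += ' ';   // everything else becomes a token boundary
        }
        std::vector<std::string> tokens;
        std::istringstream in(cleaned);
        std::string tok;
        while (in >> tok) {
            bool hasLetter = false;
            for (size_t i = 0; i < tok.size(); ++i)
                if (std::isalpha((unsigned char)tok[i])) { hasLetter = true; break; }
            if (hasLetter && tok.size() >= 4) tokens.push_back(tok);
        }
        return tokens;
    }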

We eventually decided to parse the META tags for Keywords and Description, and include their text along with the page title in the cached file, each surrounded by delimiter tags so the cluster program could choose whether to include or even emphasize those words. This particularly helped classify some minimal pages that were the top of an HTML frame set or contained only a "splash" image. In keeping with the ever-changing structure of the Web, we discovered a fairly high fraction of pages listed in the dmoz.org directory were just "Click here for new location" pointers, or were frame sets or splash pages with no META tags or other significant text. A more sophisticated front end could try to recognize such pages and download the new location or child pages. We decided to simply discard pages that were shorter than 150 characters, once keywords, title, and description were included, which removed most of these unclassifiable pages from our test sets.

In our largest test set of 1493 URLs, 97% of the pages downloaded without an error, and of those, 75% passed the length threshold and were cached for clustering.

The initial version of our Perl front end would simply take a file of URLs and dmoz.org category strings, download, parse, and filter each one, and store it in a text file named after the original URL. It would also create an index file, containing all the filenames and category strings, which would serve as the initial input to the clustering stage. As we expanded our testing to cover larger and more challenging test sets, it became cumbersome to hand-construct the URL files. So we expanded the capabilities of our front end to be able to take a set of dmoz.org categories and create a URL file by reading the dmoz.org directory contents directly. Finally, in order to be able to create arbitrarily large test sets for scalability testing, we added the ability to randomly walk the dmoz.org directory and generate any number of "leaf node" categories.

Initialization Our C++ clustering program scanned the index file generated by the front end and, for each file listed, created a new Document object, appended onto a list of Documents. It opened the file, read all of its tokens into a word-frequency "map" member of the Document, and filled in the original dmoz.org category string as another member. This original category was completely ignored by all clustering algorithms, and used only by the evaluation routine for judging the accuracy of learned clusters. The word frequencies were then normalized within each Document to sum to one, so that they represented the probability of a random token in that Document being of that type. Since each document's word-frequency map contained only the filtered words in that particular document, it was much smaller than a smoothed distribution across the entire vocabulary, and could be processed more quickly.
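
A sketch of the per-document word-frequency map and its normalization (the structure and member names below are illustrative; the actual Document class had additional members and methods):

    #include <fstream>
    #include <map>
    #include <string>

    struct Document {
        std::map<std::string, double> freq;  // normalized word frequencies
        std::string origCategory;            // used only for evaluation
    };

    // Read a cached token file into a Document and normalize the counts so they
    // sum to one, i.e. the probability that a random token in this document is
    // that word type.
    Document readDocument(const std::string& path, const std::string& category) {
        Document doc;
        doc.origCategory = category;
        std::ifstream in(path.c_str());
        std::string word;
        double total = 0.0;
        while (in >> word) {
            doc.freq[word] += 1.0;
            total += 1.0;
        }
        if (total > 0)
            for (std::map<std::string, double>::iterator it = doc.freq.begin();
                 it != doc.freq.end(); ++it)
                it->second /= total;
        return doc;
    }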

We implemented various attempts at eliminating "noise" words beyond what a fixed stop-word list could accomplish. An option (-r) would prune the word frequencies to remove all words that occurred in only one Document, since we considered learning single-document clusters to be outside our domain, and we had learned that "outlier" pages could otherwise dominate some classifiers. Such words could obviously contribute no positive evidence about which documents to cluster together.

We also experimented with an option (-m N) to further prune the word frequencies to preserve only the N words in each document that occurred across the largest number of different documents. Any word in the entire vocabulary flagged to be thus preserved by any document was preserved in all documents, and all other words were deleted in every document.

Other options would delete words in each document that occurred ≤ N times (-f N), so -f 1 discarded singleton words, or would keep only the N most frequent types in each document (-k N). After all such pruning of word frequencies, they were renormalized.
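
As an illustration, the -r pruning could look roughly like this (a simplified sketch reusing the illustrative Document structure from the earlier sketch; the real implementation differed in its details):

    #include <map>
    #include <string>
    #include <vector>

    // Remove every word that appears in only one document, then renormalize each
    // document's remaining frequencies to sum to one (the -r option).
    void pruneSingletonWords(std::vector<Document>& docs) {
        std::map<std::string, int> docCount;  // how many documents contain each word
        for (size_t d = 0; d < docs.size(); ++d)
            for (std::map<std::string, double>::iterator it = docs[d].freq.begin();
                 it != docs[d].freq.end(); ++it)
                ++docCount[it->first];

        for (size_t d = 0; d < docs.size(); ++d) {
            std::map<std::string, double>& f = docs[d].freq;
            double total = 0.0;
            for (std::map<std::string, double>::iterator it = f.begin();
                 it != f.end(); ) {
                if (docCount[it->first] < 2) f.erase(it++);
                else { total += it->second; ++it; }
            }
            if (total > 0)
                for (std::map<std::string, double>::iterator it = f.begin();
                     it != f.end(); ++it)
                    it->second /= total;
        }
    }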
 
 

K-means We experimented briefly with a K-means clustering algorithm, as had been suggested in the review of our project proposal. In this algorithm, documents are randomly assigned to a single cluster, then each cluster constructs a "centroid" word-frequency distribution averaged across all its member documents. Documents are then iteratively re-assigned to the cluster with the closest centroid, using as the closeness measure the dot product between the centroid and the document's word-frequency vector, both re-normalized as unit vectors (i.e., the cosine). Iteration halts when documents stop moving between clusters.

We implemented K-means as "hard" clustering, in which each document belongs to only the closest cluster.
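
A compact sketch of this hard K-means variant over sparse frequency vectors, using unit-normalized vectors and the dot product as the similarity measure (names are illustrative, and since the centroid is unit-normalized, summing member vectors gives the same direction as averaging them):

    #include <cmath>
    #include <cstdlib>
    #include <map>
    #include <string>
    #include <vector>

    typedef std::map<std::string, double> Vec;  // sparse word-frequency vector

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (Vec::const_iterator it = a.begin(); it != a.end(); ++it) {
            Vec::const_iterator jt = b.find(it->first);
            if (jt != b.end()) s += it->second * jt->second;
        }
        return s;
    }

    static void unitNormalize(Vec& v) {
        double len = std::sqrt(dot(v, v));
        if (len > 0)
            for (Vec::iterator it = v.begin(); it != v.end(); ++it)
                it->second /= len;
    }

    // Hard K-means: random initial assignment, centroids built from members,
    // documents move to the closest centroid until no document moves.
    std::vector<int> kMeans(const std::vector<Vec>& docs, int k) {
        std::vector<int> assign(docs.size());
        for (size_t d = 0; d < docs.size(); ++d) assign[d] = std::rand() % k;
        bool moved = true;
        while (moved) {
            std::vector<Vec> centroid(k);
            for (size_t d = 0; d < docs.size(); ++d)
                for (Vec::const_iterator it = docs[d].begin(); it != docs[d].end(); ++it)
                    centroid[assign[d]][it->first] += it->second;
            for (int c = 0; c < k; ++c) unitNormalize(centroid[c]);
            moved = false;
            for (size_t d = 0; d < docs.size(); ++d) {
                Vec doc = docs[d];
                unitNormalize(doc);
                int best = assign[d];
                double bestSim = -1.0;
                for (int c = 0; c < k; ++c) {
                    double sim = dot(doc, centroid[c]);
                    if (sim > bestSim) { bestSim = sim; best = c; }
                }
                if (best != assign[d]) { assign[d] = best; moved = true; }
            }
        }
        return assign;
    }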

EM Algorithm We spent considerably more time experimenting with a version of the Expectation-Maximization algorithm, which uses "soft" clustering, fractionally assigning documents to multiple clusters based on their probability of belonging to each cluster (the Expectation phase). It then constructs each cluster's classifier from the average across its member documents, each weighted by its fractional expectation of being in that cluster (the Maximization phase).

The halting condition is based on each document comparing its old vector of Expectations to the new one after each E-phase, finding the largest absolute change in the vector, and then finding the largest of these changes across all documents. If this maximum Expectation change is below some threshold (0.0001 for most of our testing), we conclude that EM has converged and halt.
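
The convergence test can be sketched as follows (names are illustrative; the threshold value is the one noted above):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // EM halts when the largest absolute change in any document's expectation
    // vector, taken across all documents, falls below a small threshold.
    bool hasConverged(const std::vector<std::vector<double> >& oldE,
                      const std::vector<std::vector<double> >& newE,
                      double threshold = 0.0001) {
        double maxDelta = 0.0;
        for (size_t d = 0; d < oldE.size(); ++d)
            for (size_t c = 0; c < oldE[d].size(); ++c)
                maxDelta = std::max(maxDelta, std::fabs(newE[d][c] - oldE[d][c]));
        return maxDelta < threshold;
    }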

We implemented two types of classifiers for use with the EM algorithm, with a command line option to select which one to use.

Naïve Bayes Classifier For the Naïve Bayes classifier, each cluster maximized the likelihood of the documents that fractionally belonged to it by constructing an average word-frequency distribution across them, much like the K-means centroid, but weighted by the expectations. Smoothing of this distribution (to avoid any zero-probability words) was achieved by smoothing all the expectation vectors slightly: ensuring that every document had at least a non-zero epsilon membership in each cluster (set to 10^-10) guaranteed that each cluster gave every word across the full vocabulary of the test set a positive probability.

To compute each document's expectations, the document passed its word-frequency map into the LogProb method of each cluster, and got back the log probability of that document "occurring," conditional on its being from that cluster. LogProb summed, across every word in the document's word-frequency map, the log of the product of that frequency and the classifier's probability for that word. In probability space, this is equivalent (up to a constant factor per document, which cancels in the normalization across clusters) to computing the product across each word w in the document of P(w|c), which the Naïve Bayes assumption substitutes for the probability of the full document given the cluster, by assuming the individual words occur independently. This math is all done in log space to avoid the risk of underflow from multiplying together many very small probabilities.
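
A sketch of this LogProb computation (names are illustrative; the extra epsilonProb floor is a safeguard added here, not part of the original description, since the real smoothing already guarantees every word a positive probability in every cluster):

    #include <cmath>
    #include <map>
    #include <string>

    typedef std::map<std::string, double> Vec;  // sparse word distributions

    // Log of the (unnormalized) probability of the document given the cluster,
    // under the Naive Bayes independence assumption, computed in log space to
    // avoid underflow.
    double logProb(const Vec& docFreq, const Vec& clusterDist, double epsilonProb) {
        double lp = 0.0;
        for (Vec::const_iterator it = docFreq.begin(); it != docFreq.end(); ++it) {
            Vec::const_iterator jt = clusterDist.find(it->first);
            double p = (jt != clusterDist.end()) ? jt->second : epsilonProb;
            lp += std::log(it->second * p);   // log(frequency * P(word | cluster))
        }
        return lp;
    }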

We calculate for each document D, for every cluster C, its expectation of membership:

    E(D, C) = norm( exp( LogProb(D | C) ) )

where "norm" simply renormalizes each term as a probability by dividing by the sum of all terms across clusters. We chose to ignore the prior probability of each cluster P(C), assuming that every cluster was equally probable (i.e., of equal size). Since the normalization needs to be done in Probability space, not Log space, we first subtracted from every LogProb value the largest (most positive) LogProb, then took the exponent of each term and normalized them. This subtraction was equivalent to multiplying each probability by the same (very large) constant value, which avoided the underflow that would result from taking the exponent of such large negative numbers, and then disappeared in the normalization.

Cosine Similarity Classifier We also implemented a classifier based on the cosine similarity measure instead of the NB probability, which in effect assumed that P(D|C) was proportional to the vector cosine between the document's word frequencies and the cluster's centroid. This was not really mathematically justified, since unless these two vectors were almost completely orthogonal, meaning they shared no word in common with more than a tiny frequency, they would have some significant similarity.

This approach was really a "soft" variant of K-means clustering, permitting (and almost guaranteeing) fractional membership in multiple clusters, but using a geometric (cosine) similarity for the expectation weightings rather than a strictly probabilistic formula.
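
A sketch of this cosine-based expectation step, treating the cosine values as unnormalized "probabilities" exactly as described above (names are illustrative):

    #include <cmath>
    #include <map>
    #include <string>
    #include <vector>

    typedef std::map<std::string, double> Vec;  // sparse word-frequency vector

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (Vec::const_iterator it = a.begin(); it != a.end(); ++it) {
            Vec::const_iterator jt = b.find(it->first);
            if (jt != b.end()) s += it->second * jt->second;
        }
        return s;
    }

    // "Soft" cosine E-phase: a document's expectation for each cluster is the
    // cosine between its word frequencies and the cluster centroid, renormalized
    // across clusters to sum to one.
    std::vector<double> cosineExpectations(const Vec& doc,
                                           const std::vector<Vec>& centroids) {
        double dlen = std::sqrt(dot(doc, doc));
        std::vector<double> e(centroids.size());
        double sum = 0.0;
        for (size_t c = 0; c < centroids.size(); ++c) {
            double clen = std::sqrt(dot(centroids[c], centroids[c]));
            e[c] = (dlen > 0 && clen > 0)
                   ? dot(doc, centroids[c]) / (dlen * clen) : 0.0;
            sum += e[c];
        }
        for (size_t c = 0; c < e.size(); ++c)
            e[c] = (sum > 0) ? e[c] / sum : 1.0 / e.size();
        return e;
    }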

Results

Nonetheless, using EM with the cosine classifier gave us our only successful clustering behavior, and formed the basis for most of our further experiments and extensions.

K-means clustering never iterated more than once, because no document ever moved from the cluster to which it was initially (and randomly) assigned. Apparently, with such "hard" clustering on relatively small sets of data points (hundreds of documents) and the high dimensionality of the parameter space (on the order of 10,000 words in the vocabulary), any set of documents could find a centroid along some "word dimensions" that was closer to them than that of any other set.

Perhaps more surprisingly, the Naïve Bayes classifiers behaved quite similarly, in that documents rarely switched from the clusters to which they were first randomly assigned. We used various diagnostics to examine the causes, and discovered that almost every document contained words that were infrequent in other documents and therefore in other clusters. Since the membership of the document in its initial cluster guaranteed that every word in the document had a value in that classifier significantly greater than zero, its probability P(D|C) was significantly greater than zero in its own cluster. But since every other cluster almost always had tiny epsilon smoothing values for one or more words in the document, the product of the probabilities across all words was nearly zero, even if many other words were a good match in frequency.

So the negative evidence of even one "missing" word in another cluster was enough to lock documents out of almost ever preferring a new cluster and switching.

We tried several variations in an attempt to "shake loose" this rigid stability of the NB clusters. We aggressively smoothed the expectation vectors (option -s), and thus the classifier distributions, by adding 1.0 to each expectation in the document and renormalizing. This was equivalent to setting each expectation E to (E+1)/(N+1), where N is the number of clusters, or to interpolating each E with 1/N. The result was that expectations started and remained more evenly spread across the clusters, but a cluster that started with a slight lead over the others for a document tended strongly to maintain that lead. Expectations tended to converge quickly to 2/(N+1) for the initial cluster and 1/(N+1) for the others. This was again due to the power of negative evidence with NB classifiers, even though that power was somewhat weakened by every cluster having a significant fraction of each document's word frequencies.

Another variation was in initialization of expectations before the first M-phase (-i option). Rather than assign each document entirely to a single cluster (except for the epsilon smoothing), each expectation was set to 1/N, and then slightly randomly perturbed by up to 1%. This too had little effect on the pattern of rigid convergence to the initially leading clusters, even when combined with the aggressive smoothing option.

Fortunately, the behavior of the "soft" cosine was much more promising. On our simplest test set of 22 documents from only 2 categories, the base cosine measure, with no additional options, achieved 100% accuracy on 20 out of 20 runs with different random starts. The random initial assignments had accuracies of between 54% and 72%. Throughout our tests with cosine, it always significantly improved accuracies over the random starting positions. While accuracies on the more complex test sets did vary with starting position, the best results were never due to the data already being "pre-clustered" by the random start.

When we ran the system on our next most difficult test set, "Set2," consisting of 73 URLs drawn from 5 widely-spaced dmoz.org categories, we began seeing more complex patterns of convergence. In particular, the accuracy would always peak, usually within 5 iterations, at an average of 84% correct, and then decline as EM continued iterating another 60+ times before the Expectation deltas converged, with final accuracy averaging below 69%.

Clearly the quantity EM was minimizing in its descent, the differences between the cluster centroids and the fractional document expectations, did not map precisely to deviation from the original human-generated category labels. The two were clearly related, however, in that EM consistently reached a much-improved level of accuracy before overshooting on its way down its error slope.

At this point we implemented and tested several of the options described above, including pruning out words on various criteria, and options to emphasize words in the title (-e N) or in META tag keywords (-y N), counting such words as though they occurred N times instead of just once.

We discovered that the cruder methods for removing infrequent "noise" words substantially reduced accuracy. When we examined some basic word counts across the cached text files, we discovered that infrequent words often carried critical information about the topic of the page. For example, within a Badminton category, of the 12 documents that contained the word "badminton," 4 used it only once, so stripping out singletons within documents, or keeping only the N most frequent words, discarded important information more than it filtered out irrelevant noise.

Since peak accuracy seemed to vary widely from run to run, even with the same options, we realized we needed to perform multiple runs to reduce the stochastic noise in our accuracy measurements. We used a set of simple scripts to let us run batches of 20 runs with a particular set of options, store the results, and extract the significant numbers. Among the options that did not severely reduce peak accuracy (see Figure 1), there was disappointingly little variance from the base (no options) system. Since there is still some stochastic noise from random starting conditions in this data, even averaged over 20 runs, differences of 1 or 2% in accuracy could easily be due to chance rather than a true difference in performance. Emphasizing title words slightly hurt performance, probably due to the many non-descriptive titles such as "Anne's Corner of the Web." Emphasizing keywords by 5 seemed to help a bit, although tests on other sets would later show such emphasis having a negative effect.

Figure 1

We also experimented briefly with trying to "guess" where the peak accuracy would occur, so we could halt EM iterations at that point. We noticed that, even across test sets, there was a consistent "elbow" in the curve of declining expectation deltas around the point of maximum accuracy. So we constructed a measure of the second derivative of Expectations that could usually predict within one or two iterations where a good halting point would be.
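
This heuristic could be sketched roughly as follows (an illustrative reconstruction only, not the exact measure or thresholds we used): find the iteration where the second difference of the max-Expectation-delta curve is largest.

    #include <cstddef>
    #include <vector>

    // Given the sequence of maximum Expectation deltas from each EM iteration,
    // return the index where the curve bends most sharply (largest second
    // difference), a rough guess at the peak-accuracy halting point.
    size_t elbowIteration(const std::vector<double>& maxDeltas) {
        size_t best = 0;
        double bestBend = 0.0;
        for (size_t i = 1; i + 1 < maxDeltas.size(); ++i) {
            double bend = (maxDeltas[i - 1] - maxDeltas[i])
                        - (maxDeltas[i] - maxDeltas[i + 1]);
            if (bend > bestBend) { bestBend = bend; best = i; }
        }
        return best;
    }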

We also noticed a pattern that one or two clusters tended to go "extinct" in most runs. While they still could contain some fractional expectation of some documents, no document had them as their highest-expectation cluster when EM converged, and so they contributed nothing to the final accuracy score.

Clearly though, a different approach was needed to significantly improve system performance, and to more robustly deal with the disconnect between peak accuracy and convergence.

Enhancements

Information Theory and Feature Selection In discussions with Dan Klein, he suggested we concentrate on some form of "top-down" identification of the relatively few words that are the best features for distinguishing the clusters, rather than a "bottom-up" attempt to eliminate the worst words. This brought to mind the use of information theory in decision-tree learning for selecting a feature to "split" the tree on at any point, by calculating the information gain of each possible split (or equivalently, the reduction in entropy).

To accomplish this, we added an entropy option (-h K) that, in each M-phase iteration, would keep in each classifier only the K words with the lowest entropy for distinguishing documents inside the cluster from those outside it. Besides calculating the normalized classifier as the expectation-weighted average of member documents' word frequencies, we also constructed an "anti-classifier" for each cluster: the average weighted distribution of words across all the other clusters. We could then examine the pair of frequencies for each word and calculate the entropy using the formulas below. We developed these ourselves, having not found an equation for information gain based on fractional data samples, with the counts of data instances (documents) all normalized away.

For each word type w, let fw be the frequency with which it occurs in the current cluster, and fw' be its frequency across all other clusters. Then the probability of being in this cluster given that the word occurs is

    Pcw+ = fw / (fw + fw')

and the probability of being in a different cluster given that the word occurs is 1 – Pcw+. The probability of being in this cluster given that the word does not occur is

    Pcw– = (1 – fw) / ((1 – fw) + (1 – fw'))

and of being in a different cluster, 1 – Pcw–. Also, the overall probability of the word occurring, assuming N equal-sized clusters, is the weighted sum of fw and fw':

    Pw = (fw / N) + fw' × (N – 1) / N.

Then the entropy of cluster membership given that the word occurred is

    Hw = – (Pcw+ × log Pcw+) – ((1 – Pcw+) × log (1 – Pcw+)) bits,

and the corresponding entropy given that the word did not occur is

    Hw' = – (Pcw– × log Pcw–) – ((1 – Pcw–) × log (1 – Pcw–)) bits.

The total expected entropy from knowing whether this word occurred is the weighted sum of these two values:

    Htotal = Hw × Pw + Hw' × (1 – Pw).

Testing with a pair of artificial documents with only a handful of words confirmed that Htotal has the desired behavior: it is 1 bit (no information gain) for words with equal frequency inside and outside the cluster, and it is lowest for words that occur in one distribution but not the other. And for different word pairs with the same ratio of frequencies, it is lower for words with higher overall frequency (so the pair 0.04, 0.02 scores lower than the pair 0.02, 0.01).

The system builds a list of all the words in the classifier and their Htotal values, sorts it in ascending order, and removes all but the K lowest-scoring words from the classifier. Although Htotal is symmetric whether the evidence of the word is positive or negative for membership in the cluster (fw > fw' or vice versa), our experience with Naïve Bayes classifiers led us to believe that the absence of a word was rarely a useful indicator, so only words with fw > fw' were considered when selecting the K best features.
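
A sketch of this per-cluster feature selection, using log base 2 as in the formulas above (the classifier and anti-classifier are the weighted word distributions described earlier; function and variable names are illustrative):

    #include <algorithm>
    #include <cmath>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::map<std::string, double> Vec;

    static double entropyBits(double p) {            // -p log2 p, with 0 log 0 = 0
        return (p <= 0.0 || p >= 1.0) ? 0.0 : -p * std::log(p) / std::log(2.0);
    }

    // Keep only the K lowest-entropy words in the classifier, considering only
    // words with positive evidence for the cluster (fw > fw').
    void selectFeatures(Vec& classifier, const Vec& antiClassifier,
                        int K, int numClusters) {
        std::vector<std::pair<double, std::string> > scored;
        for (Vec::const_iterator it = classifier.begin(); it != classifier.end(); ++it) {
            double fw = it->second;
            Vec::const_iterator jt = antiClassifier.find(it->first);
            double fwp = (jt != antiClassifier.end()) ? jt->second : 0.0;
            if (fw <= fwp) continue;                  // negative evidence: skip
            double pPlus   = fw / (fw + fwp);
            double pMinus  = (1 - fw) / ((1 - fw) + (1 - fwp));
            double pw      = fw / numClusters + fwp * (numClusters - 1) / numClusters;
            double hOccur  = entropyBits(pPlus)  + entropyBits(1 - pPlus);
            double hAbsent = entropyBits(pMinus) + entropyBits(1 - pMinus);
            double hTotal  = hOccur * pw + hAbsent * (1 - pw);
            scored.push_back(std::make_pair(hTotal, it->first));
        }
        std::sort(scored.begin(), scored.end());      // ascending entropy
        Vec kept;
        for (size_t i = 0; i < scored.size() && (int)i < K; ++i)
            kept[scored[i].second] = classifier[scored[i].second];
        classifier.swap(kept);
    }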

The improvements when testing on our larger, more complex data sets were impressive. On Set3, containing 201 documents drawn from 7 more closely-related categories (3 under the visual arts, 2 under radio), the average peak accuracy jumped by 5% or more over any of the other enhancements we had attempted (see Figure 2). Just as importantly, using almost any value for -h K nearly eliminated the problem of large declines in accuracy before convergence: EM always converged at, or within one or two documents of, the highest accuracy achieved by any run using this feature selection (see Figure 3). And the rate of clusters going "extinct" fell dramatically as well, from 77% of runs on Set3 converging with 4 or more empty clusters (out of 7) to less than 6% of runs converging with even a single empty cluster.

Figure 2

Also interestingly, our diagnostics displayed the words selected for each classifier after the final iteration (or, in verbose mode, at every iteration), and these words often read like human-selected keywords for identifying that category. For instance, a Set3 "Radio Guides" cluster learned "radio, stations, live, world, music, audio, online, broadcasting, country, station, television, rock, and broadcast" as its highest-information words, and another on Calligraphy learned "calligraphy, lettering, hand, design, invitations, wedding, calligraphers, calligraphic, guild, calligrapher, join, gift, and weddings".

Both numerically and subjectively, this method of identifying relevant vs. irrelevant words was far superior to our other attempts.
 
 

Figure 3

It did still have limitations, and our average accuracies across runs remained bounded at about 92% for Set2 and 87% for Set3 (see Figure 4). For Set2 the best accuracy was achieved with a relatively small set of K features, 10 or 20 words, while for Set3, accuracy continued to increase slightly up through 70 words, the largest value we tested. This is probably because the categories in Set2 are widely spaced, so few words are needed to distinguish between them, and a smaller set of higher-frequency words is less noisy and smoother. But with Set3's closely-related categories, a much larger set of words is needed to distinguish each cluster simultaneously from both distant clusters and its nearby "siblings."

In examining the final clusters learned, we also noted a problem when trying to learn clusters of very different sizes. The larger categories would often split their documents between 2 clusters, one of which would subsume a much smaller category by including a few key words that identified those documents better than any other cluster did. But most of its feature words were identical to those of the classifier for the cluster holding the other half of the documents from the large category. The fractional expectations of many of these large-category documents were split almost evenly between these two clusters, while the subsumed smaller category's documents generally had expectations close to 1 of being in their cluster together. Even when small clusters were successfully learned, it was clear from the fractional expectations of documents in large clusters, and from the high-information feature words of the smaller clusters, that the small clusters held a large number of documents from the larger cluster with significant fractional membership. So even then, a larger cluster was close to dividing and subsuming the small cluster.
 
 

Figure 4

We ran out of time to really analyze and address this limitation. One possible approach would be to somehow discount "minority" fractional members of a cluster in favor of members with a very high expectation of membership, particularly when the number of fractional members is very large compared to the number of high-expectation members. Another would be to examine the assumption of equal-sized clusters in our expected-entropy calculations, and consider removing or weakening any bias towards learning clusters of roughly equal size.

Scalability We wanted to know whether our algorithms would still perform well when scaled up to much larger sets of clusters and documents. Unfortunately, our first attempt at feeding in a test set with almost 1,500 pages and 96 clusters quickly revealed that we had implemented several routines quite inefficiently, and completing even a single run to convergence would take days. Some simple optimizations let us run the base version of our EM with cosine classifiers for one run, taking several hours to do so; accuracy peaked at 51% on the second iteration and then declined to 15% (about the same as the initial accuracy) over 20 iterations, converging with 81 of the 96 learned clusters empty. The complexity of our algorithm when learning N clusters over D documents, each with an average internal vocabulary size of V, is O(NDV), so while V remains fairly fixed, run time blows up quickly when both N and D increase by an order of magnitude.

Unfortunately, our algorithm for computing information gain for feature selection would need considerable redesign to run faster than O(N²DV), so each iteration would take upwards of 8 hours on this largest set. We scaled back the set size to 48 clusters and 454 documents, and a single run using the -h 50 option reached an accuracy of 68% in 13 iterations, then slowly declined to 65% over another 90 iterations, when we terminated it after 14 hours. Clearly, substantial redesign and testing would be needed before this approach could successfully cluster any large portion of the full dmoz.org directory, if it could do so at all.

Final Test We also reserved a final test set that we had not used in development, similar in size to Set3, and ran a final sequence of test runs against it using the "-r -h 50" options, which had performed very well on Set3. 40 test runs produced an average accuracy of 78%, with a max and min of 84% and 70%, respectively, which were lower than on Set3, but by less than 5%. This reasonably high performance seems to indicate we had not overfit our stop-word list, algorithms, or other options to our development test sets.

Conclusions

We achieved a fairly high level of agreement with human editors in clustering moderately-sized sets of web pages, even when topics were closely related. Examining some of the mis-classified pages from our better test runs showed that many of them were probably unclassifiable based on their content text alone, being longer-than-usual "we've moved" pages, "choose hi/lo BW" pages, or other splash pages. And a few were extreme outliers in word frequency for their category, such as a Palm Pilot page that never used words like palm, pilot, handheld, or software.

The improvements from information-gain feature selection were impressive, particularly considering our fairly casual understanding of the theory behind it. A more complex application of information theory might be the best way to address the remaining problems of large clusters consuming smaller ones and of clusters going empty.

The scalability of this approach remains a major weakness, and it might never scale to domains of hundreds of categories or more, even with further analysis and optimization. But it does seem to indicate that a simple text frequency analysis contains extremely relevant information about the topical similarity of pages. Such clustering might be a reasonable first stage when creating an index of smaller sets of documents, such as the pages within a company or departmental web site.