Justin Lessler

jlessler@stanford.edu

June 4, 2001

 

CS224N Final Programming Project: Extracting “Who is?” Data From Web Pages

 

The Web is a rich source of knowledge on almost any topic, but finding this knowledge can be a cumbersome process that requires some skill in searching for information. For many tasks, what we really want is not a search but an actual answer to a question. Efforts have been made to do this (most notably Ask Jeeves), but they have not been very successful, usually ending up little different from normal search engines. My project is based on the idea that it should be possible to do better than this, especially for definitional questions such as “Who is X?” and “What is X?” (“Why is X?” is obviously much harder). In order to narrow the focus to something manageable for the purposes of this class, I decided to concentrate on answering the question “Who is X?”, where X is a computer science professor. I had originally planned a more general focus (CS professors, rock stars, and congressmen). I narrowed the focus for two reasons: the amount of training data required to deal with the variability in the data seemed prohibitive, and the informativeness of pages returned by Google seems to be inversely proportional to the fame of the person (so rock stars do not have many good pages).

 

Approach

 

My approach to answering the “Who is?” question was to treat it as an information extraction problem and use trained Hidden Markov Models as described in (Freitag and McCallum 1999).  All information that might answer the “Who is?” question was treated equally, and no structure learning was done on the HMMs.  In order to return the information extracted by the HMMs in a way that might be meaningful to a user, I used a function to rank the various informative sections (henceforth blurbs) found by the HMMs. For the “usable” version of the system, the top ten blurbs are output to the user.

 

HMM Structure:


All of the HMMs used in this project had the structure shown in diagram 1. This model differs from those described in (Freitag and McCallum 1999) in that there are edges from the suffix states to the target state, as well as edges from the last suffix state to each prefix state. These edges are necessary since there is no “one who-is per document” restriction. Of note is that for training data the path through an HMM of this structure is deterministic, so parameters can be estimated directly.
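
As a rough illustration (a sketch only; the class and state names here are illustrative, and details such as a single prefix chain, the entry edge from the background state, and the target self-loop are assumptions rather than a description of the actual implementation), the topology might be laid out as follows:

import java.util.*;

// Illustrative sketch of the HMM topology described above: a background state, a chain of
// prefix states leading into a single target state, and a chain of suffix states returning
// to the background state. Extra edges run from the suffix states back to the target state
// and from the last suffix state to each prefix state, since a document may contain more
// than one blurb.
class HmmTopology {
    final List<String> states = new ArrayList<>();
    final Map<String, Set<String>> edges = new HashMap<>();

    HmmTopology(int numPrefix, int numSuffix) {
        states.add("background");
        for (int i = 1; i <= numPrefix; i++) states.add("prefix" + i);
        states.add("target");
        for (int i = 1; i <= numSuffix; i++) states.add("suffix" + i);
        for (String s : states) edges.put(s, new HashSet<>());

        edges.get("background").add("background");            // background self-loop
        edges.get("background").add("prefix1");               // assumed entry point into the prefix chain
        for (int i = 1; i < numPrefix; i++)
            edges.get("prefix" + i).add("prefix" + (i + 1));   // march through the prefix chain
        edges.get("prefix" + numPrefix).add("target");
        edges.get("target").add("target");                     // assumed self-loop: blurbs span many words
        edges.get("target").add("suffix1");
        for (int i = 1; i < numSuffix; i++)
            edges.get("suffix" + i).add("suffix" + (i + 1));
        edges.get("suffix" + numSuffix).add("background");
        // The two kinds of extra edges described above, allowing multiple blurbs per document:
        for (int i = 1; i <= numSuffix; i++)
            edges.get("suffix" + i).add("target");             // suffix states back to the target state
        for (int i = 1; i <= numPrefix; i++)
            edges.get("suffix" + numSuffix).add("prefix" + i); // last suffix state to each prefix state
    }
}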

 


Hand Tagging/Parameter Estimation:

A large amount of time was spent hand tagging documents for this project. Sections of each document were tagged as containing information answering the “Who Is X?” question using a simple Java program that allows one to highlight the relevant section and then press a Tag button to tag the section. The program is not very sophisticated, so often the HTML had to be edited directly. Information on the actual corpus tagged is in the Experiments and Results section.

 

These hand tagged documents were used to estimate the output and transition probabilities in the HMM. Since for the types of HMMs used in this project the path through the model is deterministic given a tagged document, parameters could be estimated directly. The counts of each output and transition in each state were calculated, and smoothing or shrinkage (described below) was performed when it came time to return actual probabilities.
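
Because the state path is deterministic for a tagged document, parameter estimation reduces to a counting pass over the training data. A minimal sketch of such a pass (class and method names are illustrative, not taken from the actual implementation):

import java.util.*;

// Illustrative counting pass for parameter estimation. Given the word sequence of a tagged
// document and the (deterministic) state sequence it induces, accumulate output counts per
// state and transition counts per state pair. Probabilities are computed later, with
// smoothing or shrinkage applied at lookup time.
class HmmCounts {
    final Map<String, Map<String, Integer>> outputCounts = new HashMap<>();     // state -> word -> count
    final Map<String, Map<String, Integer>> transitionCounts = new HashMap<>(); // state -> next state -> count

    void addTaggedDocument(List<String> words, List<String> states) {
        for (int t = 0; t < words.size(); t++) {
            bump(outputCounts, states.get(t), words.get(t));
            if (t + 1 < words.size())
                bump(transitionCounts, states.get(t), states.get(t + 1));
        }
    }

    private static void bump(Map<String, Map<String, Integer>> table, String key, String value) {
        table.computeIfAbsent(key, k -> new HashMap<>()).merge(value, 1, Integer::sum);
    }
}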

 

Smoothing:

In those HMMs where I did not use shrinkage (see below), I used Laplace smoothing in order to give some probability mass to unseen data. When using this method, the probability P(o|s) is estimated to be:

P(o|s) = (c(o,s) + 1) / (N_s + V)

where c(o,s) is the number of times o was seen while in state s during training, N_s is the total number of outputs seen in state s during training, and V is the number of unique outputs (i.e. output types) seen in training. The well-known problem with this method, namely that it gives too much of the probability mass to unseen data, is addressed by using the shrinkage method described below.

 

In most HMM implementations there is no reason to perform any smoothing on transition probabilities, because there are so few of them that all will be seen during training. Given the size of some of the HMMs that I used for this project (e.g. 100 prefix and 100 suffix states), I found it necessary to smooth the state transitions as well. Laplace smoothing was again used:

P(j|i) = (c(i,j) + 1) / (N_i + |E_i|)

where c(i,j) is the number of transitions from state i to state j seen during training, N_i is the number of transitions out of state i seen during training, and |E_i| is the number of edges outbound from state i in the HMM.
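
A minimal sketch of these two smoothed estimates as code (names illustrative; the counts are assumed to have been accumulated as described above):

// Illustrative Laplace-smoothed estimates corresponding to the two formulas above.
class LaplaceSmoothing {
    // cOS: times output o was seen in state s; nS: total outputs seen in state s;
    // vocabSize: number of unique outputs (types) seen in training.
    static double outputProb(int cOS, int nS, int vocabSize) {
        return (cOS + 1.0) / (nS + vocabSize);
    }

    // cIJ: transitions from state i to state j; nI: transitions out of state i;
    // outDegree: number of edges leaving state i in the HMM.
    static double transitionProb(int cIJ, int nI, int outDegree) {
        return (cIJ + 1.0) / (nI + outDegree);
    }
}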

 

Shrinkage:

When training an HMM there is generally a tradeoff between the number of states and the accuracy of parameter estimation. A richer model of the document can be gained by having more states in the HMM, but this results in less accurate parameter estimates, since fewer outputs are seen per state in the training data. In (Freitag and McCallum 1999) the method of “shrinkage” is proposed as a solution to this problem. Shrinkage also provides a framework in which to address the problem of unseen data.

 

Diagram 2 shows the shrinkage model used for the HMMs in this project, analogous to the hierarchical shrinkage in (Freitag and McCallum 1999). The prefix states are shrunk towards a single common prefix state, the suffix states towards a single common suffix state, and everything is shrunk towards the uniform distribution. The probability of an output o in state i is given by the equation:

P(o|i) = λ_{i,1} P_ML(o|i) + λ_{i,2} P_ML(o|π(i)) + λ_{i,3} (1/N)

where π(i) is the common state that state i shrinks towards, P_ML denotes the maximum likelihood (count-based) estimate, and N is the number of unique outputs seen during training.  In the case of the background and target states, the second term of the sum is ignored.  In order to determine the values of the λ weights, the EM algorithm was used as described in (Freitag and McCallum 1999), with each word occurrence held out in turn. Specifically, for a given state the E-step was:

β_j = Σ_{w ∈ H} [ λ_j P_j(w) / Σ_k λ_k P_k(w) ]

and the M-step was:

λ_j = β_j / Σ_k β_k

Convergence was considered to have taken place when the λ values changed by less than a small threshold from one iteration to the next. Since the hold-one-out method was used, the held-out set H was taken to be each word occurrence in turn, and the β values carried over from the E-step were the sums over all of these calculations.
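
A minimal sketch of this hold-one-out EM for the mixture weights of a single state (illustrative only; the probability tables, the exact convergence threshold, and the way the held-out estimates are precomputed are assumptions):

import java.util.List;

// Illustrative hold-one-out EM for the shrinkage mixture weights of one state.
// dists.get(j)[t] is P_j(w_t) for the j-th mixture component (the state itself, the shared
// prefix/suffix state, and the uniform distribution), estimated with the t-th word
// occurrence held out.
class ShrinkageEM {
    static double[] estimateLambdas(List<double[]> dists, int numWords, double epsilon) {
        int k = dists.size();
        double[] lambda = new double[k];
        java.util.Arrays.fill(lambda, 1.0 / k);             // start from uniform mixture weights
        double change;
        do {
            double[] beta = new double[k];                   // E-step: responsibilities summed over held-out words
            for (int t = 0; t < numWords; t++) {
                double total = 0.0;
                for (int j = 0; j < k; j++) total += lambda[j] * dists.get(j)[t];
                for (int j = 0; j < k; j++) beta[j] += lambda[j] * dists.get(j)[t] / total;
            }
            double betaSum = 0.0;
            for (double b : beta) betaSum += b;
            change = 0.0;
            for (int j = 0; j < k; j++) {                    // M-step: renormalize the responsibilities
                double updated = beta[j] / betaSum;
                change = Math.max(change, Math.abs(updated - lambda[j]));
                lambda[j] = updated;
            }
        } while (change > epsilon);                          // assumed convergence criterion
        return lambda;
    }
}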

 

Information Extraction:

“Blurb extraction” is done by tagging the document using the trained HMM. A set of contiguous words tagged as being in the target state is considered to be a blurb. The tagging of the document is done via the Viterbi algorithm (Manning and Schütze 1999, 331-333), keeping track of only the best path through the HMM states for the document.
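
A minimal sketch of the extraction step itself, assuming the Viterbi state sequence for the document has already been computed (names illustrative):

import java.util.ArrayList;
import java.util.List;

// Illustrative blurb extraction: given the words of a document and the Viterbi state
// sequence assigned to them, every maximal run of words tagged with the target state is
// collected as one blurb.
class BlurbExtractor {
    static List<String> extractBlurbs(List<String> words, List<String> states) {
        List<String> blurbs = new ArrayList<>();
        StringBuilder current = null;
        for (int t = 0; t < words.size(); t++) {
            if ("target".equals(states.get(t))) {
                if (current == null) current = new StringBuilder();
                else current.append(' ');
                current.append(words.get(t));
            } else if (current != null) {
                blurbs.add(current.toString());
                current = null;
            }
        }
        if (current != null) blurbs.add(current.toString());
        return blurbs;
    }
}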

 

Ranking Function:

The ranking function was used to allow the results to be ranked and offered as potential answers to the “Who is?” question. Since the focus of this project is more on the HMM portion than on the ranking, I came up with a function that was in line with a few intuitions about what types of results would be best, and then did a minimal amount of experimentation to get it working correctly. That being said, the function does a surprisingly good job of ranking the best blurbs highly. The function scores each blurb b using the document d it came from. The intuition it tries to capture is that we want blurbs that come from a probable tagging (the probability of which tends to decrease with document length), that have a fair amount of information content (hence are above a certain length), and that come from the most significant pages (as determined by the search engine from which the data was retrieved).
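
The exact form of the function is not reproduced here; a purely hypothetical score consistent with these intuitions, not the function actually used in the project, might look like:

// Hypothetical ranking score reflecting the intuitions above (not the actual project
// function): favor blurbs from documents whose Viterbi tagging is probable per word,
// penalize very short blurbs, and favor pages ranked highly by the search engine.
class BlurbRanker {
    static double score(double logTaggingProb, int docLength, int blurbLength, int searchRank) {
        double perWordLogProb = logTaggingProb / Math.max(docLength, 1); // normalize for document length
        double lengthBonus = blurbLength >= 10 ? 1.0 : blurbLength / 10.0; // assumed minimum-length cutoff
        double rankWeight = 1.0 / searchRank;                             // search engine rank, 1 = best page
        return perWordLogProb + lengthBonus + rankWeight;                 // weights here are arbitrary
    }
}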

 

Evaluation Methodology

 

Evaluating the “Who Is” system is not a completely clear-cut task. A balance must be struck between the desire for a formal, quantifiable measure of performance, and the subjective quality of the results. In order to do this I decided to consider the system in two parts: the overall system with the ranking function, and the trained HMM itself. The former is better suited to a subjective evaluation, while the latter can be subjected to a more formal evaluation.

 

To evaluate the results of the overall system I answered two key questions related to usability:

·         Was the top ranked result an adequate answer to the “Who Is?” question for the given name?

·         How satisfying were the top ten results on a scale of 0 to 10?

Both of these questions are fairly subjective in nature, especially the latter, but they directly address the goal of the system: to provide an answer to the question “Who is X?”  The fact that the goal is such a subjective one, as well as some issues with the performance of the model, suggests that perhaps a different statement of the problem would lead to better approaches, as well as a better basis for evaluation. I will elaborate on this in the discussion section below.

 

The evaluation of the performance of the trained HMMs themselves can be done in a more formal manner, though it is to be expected that, because of the fuzzy nature of the task, performance will be lower by most objective measures. More specifically, which parts of a web page provide answers to the “Who is?” question is something on which reasonable people may disagree, and even the same person's judgments might vary over time (my own tagging is a case in point). Regardless, precision, recall, and the F measure (Manning and Schütze 1999, 268-269) do provide fair indicators of the performance of the model. For the F measure I used an α of 0.5, weighting precision and recall equally. In order to get these numbers with a limited amount of data, I performed cross validation, holding out each part of the data in turn. Details follow in the Experiments and Results section.
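
For reference, these metrics can be computed from the summed counts in a few lines (a small sketch; with α = 0.5 the F measure reduces to the harmonic mean of precision and recall):

// Precision, recall, and the F measure computed from summed true positive, false positive,
// and false negative counts. With alpha = 0.5 this is the balanced F measure used for the
// results tables.
class ExtractionMetrics {
    static double precision(long tp, long fp) { return (double) tp / (tp + fp); }
    static double recall(long tp, long fn)    { return (double) tp / (tp + fn); }
    static double fMeasure(double p, double r, double alpha) {
        return 1.0 / (alpha / p + (1.0 - alpha) / r);
    }
}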

 

 

Experiments and Results

 

For the experiments, the models were trained using two sets of data: one including only tagged documents that correspond to professors, and one using some non-professor names as well. The motivation for keeping in some of the non-professor data is that it seems to mitigate some of the overtraining effects. All data in these training sets had some tagging associated with it.  In tagging I examined 758 files and found that 253 of these had valid who-is information. When limited to professors only, there were 112 documents that contained good “Who is” information. This scarcity of good information presents a problem for training: to get an adequate training set for truly good performance, say 400 documents, you would have to examine ~1200 documents.  In addition to these reported sets, I also experimented with leaving in the documents that contained no tagged “Who is” data and training on that set. I found that this caused little change in the formal performance measures, and some degradation in the subjective performance measures.

 

On each of these document sets I evaluated several different HMM models. I calculated the precision, recall, and F metrics by holding out each of the professor names in turn from the training data, and then calculating the number of true positives, false positives, and false negatives against all of the documents on the person with the given name. These counts were summed across all of the holdouts, and then the metrics were calculated. The following tables show the results, where row labels are of the form: <prefix length>, <suffix length>, [shrinkage].

 

All Tagged Data:

 

Model               True Pos.   False Pos.   False Neg.   Precision   Recall   F
4,4                 2381        140488       2060         0.017       0.536    0.032
10,10               1403        57195        3038         0.024       0.316    0.045
25,25               1017        24425        3424         0.040       0.229    0.068
50,50               775         13706        3666         0.054       0.175    0.082
100,100             680         8010         3761         0.078       0.153    0.104
4,4,Shrinkage       4262        438381       179          0.010       0.960    0.019
10,10,Shrinkage     3603        325264       838          0.011       0.811    0.022
25,25,Shrinkage     1974        88521        2467         0.022       0.444    0.042
50,50,Shrinkage     1153        36375        3288         0.031       0.260    0.055
100,100,Shrinkage   1012        17326        3429         0.055       0.228    0.089

 

Professors Only:

Model               True Pos.   False Pos.   False Neg.   Precision   Recall   F
4,4                 2178        128406       2263         0.017       0.490    0.032
10,10               1359        56511        3082         0.023       0.306    0.044
50,50               767         17564        3674         0.042       0.173    0.067
100,100             650         12824        3791         0.048       0.146    0.073
4,4,Shrinkage       4191        412225       250          0.010       0.944    0.020
10,10,Shrinkage     3662        317331       779          0.011       0.825    0.023
50,50,Shrinkage     1459        35542        2982         0.039       0.329    0.070
100,100,Shrinkage   1058        16875        3383         0.059       0.238    0.095

 

As is obvious from these tables, the results in terms of formal evaluation measures are less than encouraging, but they do show that changes in the model make some improvements in the extractions, so perhaps there is hope for better results using more sophisticated, perhaps learned, structures. There is a general upward trend with the number of prefix and suffix states, but performance limitations precluded exploring this further.

 

The acute observer will notice that the big problem with the performance is the huge number of false positives for most of the HMMs. If we are only interested in the ability of the HMM to perform this information extraction task, then this seems a very bad thing, but as far as returning useful information to a user it is not so bad, provided our ranking function can pull out the useful blurbs.

 

To evaluate the subjective results returned, I ran the HMMs trained on all of the training data on three randomly selected names, one from CMU, one from Stanford, and one from MIT. The ranking function, as well as some filtering, was used to clean up and prioritize the outputs. I then answered the two questions described above for each HMM and person pair. In the following table, a 1 in the Top column indicates that the top answer adequately answered the “Who is” question, and the subjective ratings range from 0 to 10.

 

Subjective System Assessment:

 

                      Gary Fedder     Timothy Quigg    Barbara Liskov   Average
Model                 Top    Rat.     Top    Rat.      Top    Rat.      Top       Rat.
4,4                   1      6        0      3         0      2         0.33333   3.67
4,4,Sh.               0      0        1      4         1      3         0.66667   2.33
4,4,P-Only            1      6        1      5         0      1         0.66667   4.00
4,4,Sh.,P-Only        0      0        1      4         1      3         0.66667   2.33
10,10                 1      6        1      3         0      3         0.66667   4.00
10,10,Sh.             0      4        0      1         1      3         0.33333   2.67
10,10,P-Only          1      4        1      3         0      1         0.66667   2.67
10,10,Sh.,P-Only      0      2        1      4         1      3         0.66667   3.00
50,50                 1      3        1      3         0      2         0.66667   2.67
50,50,Sh.             1      3        1      3         0      2         0.66667   2.67
50,50,P-Only          1      3        0      0         1      3         0.66667   2.00
50,50,Sh.,P-Only      1      3        1      4         0      0         0.66667   2.33
100,100               1      3        1      2         0      1         0.66667   2.00
100,100,Sh.           1      4        1      3         1      3         1         3.33
100,100,P-Only        1      3        1      2         1      4         1         3.00
100,100,Sh.,P-Only    1      3        1      3         1      4         1         3.33
Average               0.8    3.3125   1      2.9375    0.5    2.375     0.6875    2.88

 

 

Of note is that in almost every case for the larger HMMs, an adequate answer to the “Who is” question is returned as the number one result. Not reflected in the table is the fact that when the first response was not the correct one, the second usually was. This suggests that there is some hope for using HMMs for this purpose despite the poor results on the formal evaluation. There are several paths that I feel might significantly increase performance, described below.

 

The differences between using shrinkage and not using it were not as great as I had expected. In general, shrinkage produced horrible results for the smaller model sizes and better results for the larger model sizes. Time prohibited much experimentation with the type of shrinkage model used, which would be an interesting comparison to try. Not reflected by the data is that when both the shrinkage and no-shrinkage models returned the same results, the shrinkage model tended to return more informative but less concise results.

 

 

Discussion

 

The results of these experiments clearly show two things to my mind: that the problem is not simple, and that there is hope for being able to do a reasonable job at it. There are several challenges that make the problem hard: the scarcity of data, the subjective nature of what is being extracted, and the existence of spurious data very similar to the type of data being extracted, just to name a few. There are several techniques that might improve the performance of the system, including using HMMs with a more sophisticated structure (i.e. multiple target states), using structure learning, increasing the amount of training data, varying the shrinkage model used, and improving runtime performance so larger HMMs can be used.

 

There are two interesting approaches that I think have the potential to greatly increase the performance of the system. The first is to use clustering to determine which documents are likely to contain valid “Who is” information. This is based on the observation that a disproportionate amount of the false positives come from documents that contain no good “Who is” information (e.g. lists of published papers). If these documents could be screened out before the HMM was used for the information extraction, the number of false positives could be significantly decreased. The second is to reformulate the problem slightly: instead of trying to extract blurbs directly using HMMs, train several HMMs to extract more tractable information related to the question, for example affiliation, job description, and research projects. This information could then be used to fill in frames about the individuals we are interested in, and a template could be used to return the information to the user. HMM performance should be greatly improved since the data would be less variable, and objective assessment of the results would be easier.

 

In general I think the approach of using HMMs for this problem shows a lot of promise, but the sophistication of the techniques required means a lot more work would be necessary to get good results. Given the time frame for this project, I was moderately satisfied with my results.

 

 

References

 

Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.

 

Dayne Freitag and Andrew McCallum, Information Extraction with HMMs and Shrinkage. AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.