Part-of-Speech Tagging Using Iterative Relaxation

Ben Taskar

CS224n

Final Project

 

Abstract

This project explores a novel approach to Part-of-Speech tagging that uses statistical techniques to train a model from a large POS-tagged corpus and assign tags to previously unseen text. The model uses decision trees based on tags of surrounding words and other features of a word to predict its tag. To tag an entire sentence, the tags of individual words are iteratively reassigned through a process of statistical relaxation. Although the model does not achieve state-of-the-art accuracy (96.4 - 96.8%), it comes respectably close (96.2%).

 

Introduction

Part-of-speech tagging consists of labeling each word in a sentence with its appropriate part of speech, e.g. verb, noun, adjective, adverb. Since many words can act as multiple parts of speech, tagging requires disambiguating the use of the word in the context of a particular sentence. In addition, many words may not have been previously encountered, so a tag must be decided upon based on various features of the word and its context. However, POS tagging is a simpler task than full syntactic parsing, since no attempt is made to create a tree-structured model of the sentence. With the recent availability of large manually tagged corpora, many researchers have proposed statistical and rule-based techniques that automatically learn to label unseen text with POS tags. The experiments in this project were conducted on one of the most commonly used such corpora, the Wall Street Journal articles from the Penn Treebank project (Marcus et al., 1994), which contains over a million tagged words.

Among the most successful current approaches to tagging are probabilistic models such as HMMs (Weischedel et al., 1993) and maximum entropy (Ratnaparkhi, 1996), and rule-based techniques such as transformation learning (Brill, 1995). These state-of-the-art methods achieve roughly similar accuracy on the Wall Street Journal corpus, about 96.36% to 96.82% (Brill et al., 1998). All of them use the words and tags surrounding a word in a small window (1-3 on either side) to assign tags to all words in a sentence.

Probabilistic methods, like Hidden Markov Models and maximum entropy models, learn a joint distribution over tags and words in a sentence, and then select the tags that are most likely given the sequence of words in a sentence. Hidden Markov models decompose the distribution P(W, T) over words W1, ..., Wn and tags T1, ..., Tn as

P(W, T) = ∏i P(Ti | Ti-1) P(Wi | Ti),

which reflects the assumptions of limited-horizon dependence between tags and independence of each word given its tag.
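The HMM factorization P(W, T) = ∏i P(Ti | Ti-1) P(Wi | Ti) can be sketched in code with toy transition and emission tables (all probabilities below are invented for illustration):

```python
# Joint probability of a tag/word sequence under the HMM factorization
# P(W, T) = prod_i P(T_i | T_{i-1}) * P(W_i | T_i).
# The transition and emission tables are toy numbers, not learned values.
trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("NN", "VBZ"): 0.3}
emit = {("DT", "the"): 0.7, ("NN", "dog"): 0.01, ("VBZ", "barks"): 0.005}

def hmm_joint(words, tags):
    p = 1.0
    prev = "<s>"  # sentence-start state
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p

print(hmm_joint(["the", "dog", "barks"], ["DT", "NN", "VBZ"]))
```

A tagger built on this model would search for the tag sequence maximizing this product given the observed words.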

In contrast, the transformation-based method starts with an initial assignment of tags to words using the most common tag for each word regardless of context. It then learns a list of "rewrite" rules that can be successively applied to the tags of the sentence to correct the initial errors. The power of the transformation-based approach comes partly from the fact that the initial assignment is already very accurate (around 93%). However, although the learning phase uses corpus statistics to induce rules, tagging itself is deterministic. Hence, it is impossible to compare the likelihood of different tagging assignments, output the k best taggings, and so on.

This project investigates a probabilistic method of exploiting this high accuracy of tagging most words to bootstrap tagging of difficult ones. The success of the above state-of-the-art models has shown that the tags of surrounding words provide a lot of information about the tag of a word. So if we start with highly accurate assignments to the surrounding tags, we can accurately predict the tag for the current word. By iteratively reassigning tags based on the current assignment of other tags, and keeping track of the most common assignments, we can infer the most likely tags for each word.

Probability Model

HMM methods learn a joint distribution over both words and tags of a sentence by making conditional independence assumptions (limited horizon dependence for states and independence of words given their tags) that are only rough approximations. It is plausible that we could achieve higher accuracy at predicting the tags by focusing on learning just the conditional probability of the tags given the words. One way to accomplish this is to learn conditional probabilities of a tag given its word and the surrounding tags, P(Ti | Wi, Ti-k ... Ti-1, Ti+1 ... Ti+k), for some window size k. This does not explicitly define a conditional distribution P(T | W), since the dependencies between tags are not acyclic. However, it can be defined implicitly through a Gibbs sampling process. A sequential Gibbs sampler instantiates the variables to arbitrary initial values and loops over them, sampling each from its conditional. It can be shown (Heckerman et al., 2000) that if the conditionals are positive, the process converges to a unique stationary distribution. Whether this distribution is consistent with the conditionals is an issue, but it can also be shown that if the conditionals are consistent with some joint distribution, the Gibbs sampler will converge to it (Heckerman et al., 2000).

Representing and learning P(Ti | Wi, Ti-k ... Ti-1, Ti+1 ... Ti+k) as one large lookup table is clearly infeasible, since there are nearly 50,000 words and 46 tags in the WSJ corpus. In addition, many words are rare, so parameter estimation is unreliable because of the sparsity of the data. Since many words appear only rarely and most words appear overwhelmingly with one tag, we should devote more attention to predicting tags for the common and difficult-to-tag words. We can select the words that will produce the most error in expectation by considering the difference between the number of times a word w appears in the corpus and the number of times it appears with its most common tag t,

Ew = Nw - Nw(t). We select the M (say M = 1000) words with the highest Ew and learn P(Ti | w, Ti-k ... Ti-1, Ti+1 ... Ti+k) for them. The top words according to this criterion are the ones commonly reported as difficult: that, about, up, 's, etc.
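The Ew criterion is straightforward to compute from corpus counts; a minimal sketch (with a toy corpus standing in for the WSJ data):

```python
from collections import Counter, defaultdict

def select_difficult_words(tagged_corpus, M=1000):
    """Rank words by expected error E_w = N_w - N_w(t), the number of
    occurrences not covered by the word's most common tag.
    tagged_corpus is a list of (word, tag) pairs."""
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        tag_counts[word][tag] += 1
    scores = {w: sum(c.values()) - max(c.values()) for w, c in tag_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:M]

corpus = [("that", "IN"), ("that", "DT"), ("that", "WDT"),
          ("dog", "NN"), ("dog", "NN")]
print(select_difficult_words(corpus, M=2))
```

Here "that" scores Ew = 3 - 1 = 2 while "dog" scores 0, so "that" ranks first, matching the intuition that ambiguous frequent words deserve the richer contextual model.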

 

Learning

Even for commonly occurring words, there are too many parameters in P(Ti | w, Ti-k ... Ti-1, Ti+1 ... Ti+k) to learn them reliably. In addition, there are usually many context-specific independencies in the conditional probability distribution (cpd); e.g., given that the next tag is a comma, it does not matter what the tag after the next one is. Decision trees can effectively capture these regularities and dramatically reduce the number of parameters to be estimated. We use binary decision trees for the cpds, where each node represents a test of the form Tj = t for some tag t, arcs represent the outcome of the test, and leaves contain the distribution for Ti. To learn the tree structures we use greedy hill-climbing with Bayesian scoring to evaluate candidate structures (Chickering et al., 1997).
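As a toy illustration of how such a tree captures context-specific independence, the hand-built tree below (hypothetical structure and probabilities) never examines the tag after next when the next tag is a comma:

```python
# A toy binary decision-tree CPD: each internal node tests T_j == t for some
# tag t, arcs are the yes/no outcomes, and leaves hold P(T_i | context).
tree = {
    "test": ("next", ","),          # test: is T_{i+1} a comma?
    "yes": {"leaf": {"NN": 0.8, "VB": 0.2}},
    "no": {
        "test": ("next2", "DT"),    # only consulted when T_{i+1} != ","
        "yes": {"leaf": {"VB": 0.7, "NN": 0.3}},
        "no": {"leaf": {"NN": 0.5, "VB": 0.5}},
    },
}

def tree_cpd(context, node=tree):
    # Walk from the root to a leaf, following the outcome of each test.
    while "leaf" not in node:
        var, val = node["test"]
        node = node["yes"] if context.get(var) == val else node["no"]
    return node["leaf"]

print(tree_cpd({"next": ",", "next2": "DT"}))
```

Because the comma branch terminates immediately, the whole subtree conditioned on T_{i+2} is collapsed into a single leaf, which is exactly the parameter saving the text describes.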

The remaining words are either unambiguous or have too little data to learn contextualized cpds. So we compromise by learning simply P(Ti | w). In addition, we learn a decision tree cpd P(Ti | Ti-k ... Ti-1, Ti+1 ... Ti+k) using all words. We then combine them using a somewhat ad hoc noisy-OR rule:

P(Ti | w, Ti-k ... Ti-1, Ti+1 ... Ti+k) = 1 - (1 - P(Ti | Ti-k ... Ti-1, Ti+1 ... Ti+k))(1 - P(Ti | w))

(Several models were tried, including max, min, product and mixture, but this one seemed to work best.)
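A sketch of the noisy-OR combination, with renormalization added as an assumption so the result can be used as a sampling distribution (the raw noisy-OR scores need not sum to one):

```python
def noisy_or_combine(p_context, p_word):
    """Combine the context-only and word-only tag distributions with the
    noisy-OR rule P = 1 - (1 - p_context)(1 - p_word), then renormalize.
    Renormalization is an assumption here, not stated in the original model."""
    tags = set(p_context) | set(p_word)
    raw = {t: 1 - (1 - p_context.get(t, 0.0)) * (1 - p_word.get(t, 0.0))
           for t in tags}
    z = sum(raw.values())
    return {t: v / z for t, v in raw.items()}

print(noisy_or_combine({"IN": 0.9, "RB": 0.1}, {"IN": 0.5, "RB": 0.5}))
```

The noisy-OR form makes a tag probable if either source considers it probable, which is one plausible reason it outperformed max, min, product, and mixture in the author's trials.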

 

Unknown Words

Since test data contains words not seen in the training data, we must predict tags for unknown words. There are several features of words that we can use to make this decision. In particular, the suffix of a word is often a good predictor (e.g. -tion, -ed, -ly, -ing). Capitalization and whether the word comes after a period or quotation marks are also indicative. In addition, numbers are rarely seen in the training data, but often can be easily classified as such (note that numbers can also act as list markers). Weischedel et al. (1993) proposed a probabilistic model for combining these features:

P(w|Ti) = P(unknown|Ti)P(capitalized|Ti)P(suffix|Ti)

This model makes the approximation that those features are independent given the tag to keep the number of parameters small, but ignores certain correlations, for example, between capitalized and unknown. To take into account these correlations, we learn a decision tree cpd P(Ti | unknown, capitalized, suffix, beginning of a sentence).

We learn this model using words that appear only once in the training data. About thirty different suffixes were distinguished, of which around twenty ended up being used by the induced decision tree.
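Feature extraction of this kind might look as follows (the suffix list is hypothetical; the project's actual list of about thirty suffixes is not reproduced here):

```python
# Hypothetical suffix inventory; the real model distinguished about thirty.
SUFFIXES = ["tion", "ing", "ed", "ly", "s", "er", "est", "al", "ity"]

def unknown_word_features(word, sentence_initial):
    """Extract the features used to tag unknown words: suffix,
    capitalization, sentence-initial position (after a period or
    quotation marks), and whether the token contains digits."""
    suffix = next((s for s in SUFFIXES if word.lower().endswith(s)), None)
    return {
        "capitalized": word[:1].isupper(),
        "sentence_initial": sentence_initial,
        "suffix": suffix,
        "numeric": any(ch.isdigit() for ch in word),
    }

print(unknown_word_features("Nationalization", sentence_initial=False))
```

These feature vectors are what the decision tree cpd P(Ti | unknown, capitalized, suffix, beginning of a sentence) would condition on.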

 

Tagging

Test sentences are tagged one at a time. Given a sentence {w1, ..., wn}, the basic sampling procedure is as follows:

1. for i = 1 .. n
	sample ti from P(Ti | wi)
2. for j = 1 .. N
	for i = 1 .. n
		sample ti from P(Ti | wi, ti-k ... ti-1, ti+1 ... ti+k)

3. select the most commonly sampled tag for each Ti.

There are several modifications to this basic Gibbs sampler that make it work better in practice. First, the initial samples must be discarded and sequential samples are not independent, so samples are counted only after a short burn-in phase, with counts incremented every few iterations. Second, the tags do not have to be sampled sequentially, and indeed, performance improved when a random order was used. Third, if we are not interested in the distribution of each Ti but only in its most common tag, we can speed up convergence by picking the most likely tag in the current context instead of sampling.
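The full procedure, with burn-in, thinning, and random visiting order, might be sketched as follows; sample_initial and sample_conditional are stand-ins for draws from the learned cpds:

```python
import random

def gibbs_tag(words, sample_initial, sample_conditional,
              n_iters=10, burn_in=2, thin=2):
    """Tag a sentence by Gibbs sampling. sample_initial(w) draws from
    P(T_i | w_i); sample_conditional(i, words, tags) draws from
    P(T_i | w_i, surrounding tags). Counts are kept only after burn_in
    sweeps and every `thin` sweeps thereafter; tags are visited in a
    random order, which improved performance in practice."""
    n = len(words)
    tags = [sample_initial(w) for w in words]      # step 1: context-free init
    counts = [dict() for _ in range(n)]
    for sweep in range(n_iters):                   # step 2: relaxation sweeps
        order = list(range(n))
        random.shuffle(order)
        for i in order:
            tags[i] = sample_conditional(i, words, tags)
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            for i, t in enumerate(tags):
                counts[i][t] = counts[i].get(t, 0) + 1
    return [max(c, key=c.get) for c in counts]     # step 3: most common tag
```

With deterministic stand-in cpds this reduces to the basic three-step procedure above; with real learned cpds the counts approximate each tag's marginal under the implicit joint distribution.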

 

Experiments

The experiments were conducted by randomly splitting the Wall St. Journal corpus into training and test sets in roughly 90/10 proportion. Several model parameters need to be set.

k: Context window parameter k for the cpds: I tried k = 1 to 3, and found that the smallest context (1 tag on each side) performed the best. This may be due to overfitting in the cpds for the larger contexts, but I did not investigate alternative estimate-smoothing or tree-induction methods.

M: Number of words M for which the tree cpds are learned: increasing the number from 400 to 1200 helped performance, but returns seemed to diminish after that, especially given the higher costs of training and space limitations.

N: Below are graphs of accuracy for all words (110444) and unknown words (2807) versus the number of sampling iterations, averaged over twenty runs, for a model with k = 1 and M = 1200.

From the results above we can take N = 10 iterations as a good trade-off between accuracy and computational expense, achieving 96.24% overall and 83.14% unknown word accuracy. This level of performance, although not quite state-of-the-art, is quite reasonable. Some words are very difficult to classify correctly, perhaps due to the limited context window and linguistic depth of this model and other current state-of-the-art models. Several authors have looked into the consistency of the dataset as a possible cause. Ratnaparkhi (1996) finds that the distribution of tags for the word "about" (as well as several others) differs considerably across annotators of the dataset, suggesting that there is a real limit to the achievable accuracy.

Below is a table of the most commonly made mis-taggings, which roughly correspond to those reported for the Maximum Entropy model (Ratnaparkhi, 1996).

Word        Assigned Tag    True Tag
about       IN              RB
as          IN              RB
out         IN              RP
much        JJ              RB
that        DT              IN
that        IN              WDT
more        JJR             RBR
that        IN              DT
executive   JJ              NN
chief       JJ              NN

 

Conclusion

This project investigated a novel combination of statistical methods to define a flexible, though implicit, probability distribution for predicting Part-of-Speech tags. The model can be improved in several ways. Using the surrounding words, not just their tags, may be advantageous; to accomplish this, the decision tree induction algorithm would have to be improved to efficiently evaluate a large number of candidate structures. Further morphological features could also be used for tagging unknown words.

Recent work by Brill et al. (1998) showed that combining several different state-of-the-art taggers (HMM, MaxEnt, Transformation) in a classifier ensemble can achieve accuracy of up to 97.2%. The key to successfully combining classifiers is the complementarity of their errors, so methods that do not achieve the highest accuracy on their own but use different techniques and features can still be useful in tagger ensembles.
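A minimal sketch of one such combination scheme, simple per-word majority voting (Brill and Wu (1998) explore more sophisticated combiners):

```python
from collections import Counter

def majority_vote(tag_sequences):
    """Combine per-word predictions from several taggers by majority vote.
    tag_sequences is a list of tag lists, one per tagger, all the same length.
    Ties are broken by the order taggers are listed."""
    return [Counter(word_tags).most_common(1)[0][0]
            for word_tags in zip(*tag_sequences)]

# Three hypothetical taggers disagreeing on a two-word sentence.
print(majority_vote([["IN", "NN"], ["RB", "NN"], ["IN", "VB"]]))
```

Voting only helps when the taggers' errors are complementary, which is exactly the property the paragraph above identifies.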

 

References

[Brill, 1995] Eric Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics 21:543-565.

[Brill et al., 1998] Eric Brill and Jun Wu. 1998. Classifier Combination for Improved Lexical Disambiguation. ACL 1998.

[Chickering et al., 1997] David Chickering, David Heckerman, Christopher Meek. 1997. A Bayesian Approach to Learning Bayesian Networks with Local Structure. Microsoft Technical Report MSR-TR-97-07.

[Heckerman et al., 2000] David Heckerman, David Chickering, Christopher Meek, Robert Rounthwaite, and Carl Kadie. 2000. Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Microsoft Technical Report MSR-TR-00-16.

 

[Marcus et al., 1994] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1994. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19(2):313-330.

[Ratnaparkhi, 1996] Adwait Ratnaparkhi. 1996. A Maximum Entropy Model for Part-Of-Speech Tagging. In EMNLP 1, pp. 133-142.

[Weischedel et al., 1993] Ralph Weischedel, Marie Meteer, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci. 1993. Coping with Ambiguity and Unknown Words through Probabilistic Models. Computational Linguistics 19(2):359-382.