HEAD-DRIVEN CORPUS-BASED NATURAL LANGUAGE GENERATION

Steven Branson

CS224N Final Project, 2002

1. MOTIVATION

For this project, my goal was to apply some of the techniques used in probabilistic head-driven parsing to natural language generation. The practical applications for doing this are not all that important, and my real interest in doing this project was merely to investigate some of the issues relating to the context-sensitivities of natural language grammars. However, I believe that the resulting system of language generation could be quite effective for an Eliza-type chat program. In particular, unlike the more commonly used Markov and n-gram language generation methods, a CFG-based method allows the program to begin with a sentence with a partially defined grammatical form, which greatly improves control over the resulting semantic meaning. In addition, my algorithm allows an optional initial specification of the desired “head,” or keyword of each non-terminal, and can fill in the expansion of each non-terminal to match its specified head. The result is the possibility for a large degree of control over the generated semantic meaning of the final sentence while still maintaining a large degree of randomness (which prevents that chat program from being repetitive and boring). To demonstrate this, I created the basic skeleton of such a chat program. The program is capable of parsing the user’s sentence, matching the sentence’s grammatical form to some template in a set of predefined grammatical forms, producing a basic grammatical construct complete with head words for non-terminals for the desire response, and running my language generation algorithm to generate the actual response.

2. IMPLEMENTATION

2.1. Parsing and Construction of a PCFG and Data Structures

I extracted my PCFG, head data, and other necessary information from the parsed Switchboard corpus of telephone conversations. The size of the corpus is 35 MB with 271,480 sentences. My initial thought was that the large size of the corpus would give me a lot of data for statistical inference. In reality what I found was that as the corpus size grew, so did the size of the grammar; there were a total of 58,814 productions in the final grammar.

I parsed the corpus keeping a hash table of each non-terminal symbol, complete with its number of occurrences and set of possible production rules, for each of which I also stored the number of occurrences. In addition I stored the file-positions of the start of each sentence in the corpus file and a list of each terminal in the grammar complete with the sentence index and non-terminal context of each occurrence. This information allows me to perform quick lookup of the sentence contexts of any desired terminal symbol. My general approach in the language generation algorithm then was to use these data structures to quickly load applicable sentences for some desired keyword into memory, maintaining a cache of some set number of sentences in memory at all times. I could then use these sentences to dynamically calculate any necessary probabilistic calculation. A larger set of pre-computed information could be stored to speed up these calculations, but there are major memory/disk space issues (especially on my Leland account).

I also computed my best estimate of the rules for construction of “head” symbols for each rule in the grammar, as each rule must somehow decide how to inherit a head symbol from some terminal/non-terminal on the right side of its production. I obviously did not want to take the time to mark 58,814 rules by hand, so I attempted to allow my program to hypothesize each inheritance rule on its own. Essentially this consisted of counting, for any rule of the form X->…, the number of occurrences of each non-terminal symbol on the right-hand side. The more frequent non-terminals on the right-hand side tended to be the correct symbols from which to derive the head (for instance a verb is common in a verb phrase), favoring the left-most non-terminal in the event that it occurs more than once in a rule. Also, I attempt to avoid inheriting the head from non-terminals representing punctuation. In addition, I favor picking non-terminals with names similar to the name of X (i.e. VP is similar to VB). I played with these heuristics briefly until I was satisfied that most of the rules were more or less correct.

All this information is pre-computed and stored to disk by running the program ./makeDatabase. It takes about a minute or two to run.

2.2. Sentence Generation

The algorithm I use is in some sense a mix between a top-down and bottom-up movement through my CFG. The input is an unfinished tree-structure representation of a descent from the start symbol (S). Any unexpanded node may or may not be accompanied with a desired head. The algorithm is a recursive algorithm, at each step taking a node N of the tree. If N is a terminal, the terminal symbol is added to the generated sentence. If N is a non-terminal, it can either have its production rule selected already or not yet selected.

2.2.1. Rule Expansion

If the rule of N is not yet selected, we need to make a decision about which rule to expand. This is done by searching for the set of instances S in the data corpus with the same non-terminal symbol, desired head, and contextual heads as our current node N. Each instance in S is a fully expanded non-terminal tree extracted from the data corpus. In the algorithm’s current implementation, I define the context of a node to be the head of the previous sibling node in the tree (this could easily be changed). Consider the probability that, in this context, node N expands by some rule R, P(R | N, context). If we make the assumption that P(R | N, context) depends solely on the non-terminal symbol of N, the head of the previous sibling node of N, and the head of N, then P(R | N, context) is simply the number of instances in S that expand by rule R divided by the total number of instances in S. At this point, we could randomly select a rule to expand according to this probability. However, I found that it is smarter to pick a random instance X from the set S, prune away some of its nodes that do not occur in singular or unique rules in the data corpus, and add it to the tree in replacement of N (see figure 1 for an illustration). This is useful because 1) it saves time because we can construct our tree multiple nodes at a time, and 2) it improves the quality of our final tree because it keeps clusters of non-terminals that are never separated in the data corpus together. It is not an unreasonable thing to do because, if X is randomly chosen, then the probability of each rule being expanded at node N is still correct. If X is maximally pruned, then the probability of each rule on the child nodes is also correct.

Due to sparseness in the data, it sometimes occurs that no instances in the data set can be found that match the definition of non-terminal N, the context of N, and the desired head of N. In this event, I expand the set S to include all instances in the data corpus that match these criteria ignoring the context requirement. If still no such instances match these criteria, I expand a random rule in N according to the probabilities in my PCFG. A better way to handle sparseness in the data would be to include some reasoning based on similar words or grammar rules, and weight each head word probability by its measure of similarity, rather than requiring an exact match on head words. This would be interesting, but is not something I had time to pursue.

After node N has been expanded, I then recurse, repeating the same process on each child of N.

Figure 1: Simple example of the Sentence Generation algorithm. We begin with the incomplete sentence “I’m <ADJP-PRD with head ‘happy’>.” The algorithm then randomly selects the phrase “kind of happy about that end” as an ADJP-PRD with head ‘happy.’ It then “prunes” this added phrase, replacing NP “that end” with NP “pulling weeds and things like that.” This is permissible because both NP’s have the same contextual rule and head “about.”

2.2.2. Tree Pruning

When an instance X is chosen to replace node N, X may represent a very large string of words from the data corpus. In order to promote randomness and prevent repetition, X is pruned before it is added to the tree. This is done by examining each node Y in the tree X, and deciding whether or not to attempt to prune it according to some probability p, which is an input parameter to the program (p is set to 1 in my current implementation). We also never attempt to prune a node that is an ancestor of the head symbol of X. If we attempt to prune Y, we consider all instances in the data corpus that have the same parent as Y and the same context (i.e. the same criteria explained in the previous section) and replace Y by one such instance. We then continue, running the same pruning process on the children of Y (or the added node). When I first wrote this, I was worried that this might lead to loops that infinitely expand a node; however, this never occurred in all of my testing. I believe that this method of pruning might generally have the tendency to produce longer sentences than those actually in the data corpus. Although this was not clearly apparent in my testing, it might be a good idea to use some heuristic favoring adding instances roughly the same size as the nodes they are replacing.

2.3. Response Generation

I began to build the framework for an Eliza-type chat program to demonstrate an application of my system. The program allows the user to enter a phrase or comment and attempts to generate a reply. The phrase entered is first parsed. Parsing is performed by the open source Apple Pie Parser. The parsed sentence is then compared to a library of template sentences, each of which is a partially expanded non-terminal. Each template can have several key phrases that can be made to appear in the generated response. Each template can also have several response templates each of which gives the grammatical structure of a possible response. For instance, one template might be:

((S (NPL (PRP you)) (VP (VBP 're) (ADJP __1))))

And two possible responses might be:

((S (ADVP (RB No)) (, ,) (NPL (PRP you)) (VP (VBP ‘re) (__1 )) (. .) (SS (NPL (PRP I)) (VP (VBP 'm) (ADJP )))))

((S (NPL (PRP I)) (VP (VBP 'm) (ADJP __smart))))

This template would match input such as “you’re stupid!” The two responses would be “No you’re stupid. I’m (random adjective phrase),” or “I’m (random adjective phrase with head word smart).

The limitation with the chat program now is that I have only hand-coded a few sentence templates and have not necessarily done a very good job designing their responses.

The chat program runs using the command ./app.

3. EXPERIMENTAL RESULTS

3.1. Baseline

The first thing that I tried was to produce a random sentence using a straight top-down descent from the start symbol S based on the rules in my probabilistic CFG. I knew that this would produce extremely poor, ungrammatical sentences, but I did it anyway because I wanted to get a feel for just how poorly a straight PCFG would perform. The results were even worse than I expected:

· the need

· and of it mean _ made a year , the bit that knee _

· um _ they is Then handle can correct that hand and want 'd are it was _

· never the are

· them been _

· his have I country alone _ of there got _

· and , that floor _ of write like the _

· She believe _

· had a bit _

· a 'm

· that think of dear _ used _ I 's n't nice _ _ dear _ and we know So So _ _ a filter But _ dear _ of it way of by there do what it guess not loved not know clearly do _ _ they 's I house _ _ there mean _

Here we can see the extent to which English is not a context-free grammar. Some of the most glaring problems include the fact that the CFG as obtained from this data corpus does not encode verb conjugation rules, so phrases such as “a ‘m,” “them been,” and “She believe” are not illegal and are not even improbable. We also see errors dealing with unusual noun/verb, noun/adjective, verb/prepositions, and verb/direct object pairs.

3.2. Head-Based Language Generation:

My hypothesis before beginning this project was that most of the problems resulting from the lack of contextual information encoded in a PCFG could be solved by approximating the contextual information with head-related statistics. In my first implementation, I attempted to implement this using a fully top-down approach: I began at the start symbol (or some unfinished parse tree). For each non-terminal, I selected the rule out of all possible expansion rules based on the probability of that rule given the non-terminal symbol N, the head of the previous sibling node N (if one exists), and the desired head of N. In other words, the probability of N expanding by rule R was

P(R | N, head(N), head(precedingSibling(N)).

The results were significantly better but were still not very good. Some of the better sentences include:

· So , the gun clubs joined no angry chicken like Sunday , Monday , or Tuesday night n't ,

· and the rest of the game last night and Avis and everybody that goes to Washington quite often say hello to wife and kids , and then go out and fiddle , you know , just walk around the yard and inspect it

· Well my kids 's one we all hold dear and near

· Same sort of problem they were matching said over there

· the total cost of it 's elect officers of the family reunion

The problems with this system dealt with data sparseness. In particular, the number of instances in the data corpus in which both some head of N and some head of the previous sibling of N occur was often few to none. In this case, my solution was to ignore the head of the previous sibling, something which no doubt hurt performance.

The other main problem was that in generating these sentences, I was beginning with the start symbol S, which had neither siblings nor a specified head. In the event that some non-terminal N had no preceding sibling but did have a specified head, I simply selected a random rule occurring in the corpus containing the desired head. In other words, in this case

P(R | N, head(N), head(precedingSibling(N)) = P(R | N, head(N)).

In the event that some non-terminal N had a preceding sibling but no specified head, I selected the expansion rule occurring in some random instance in the corpus in which a node T expands, where T is a node occurring in the same rule as N and containing the same preceding sibling head as N. In other words, in this case

P(R | N, head(N), head(precedingSibling(N)) = P(R | N, head(precedingSibling (N))).

In the event that N had neither preceding sibling nor head, I was not certain about what I should do. My solution was to randomly pick a head for N by making a random descent through my PCFG. The reason I thought this might be promising was that it would allow me to pick a random head based on some heuristic that would favor words matching a certain context or falling within some personality (of the Eliza program). I even experimented briefly with techniques for randomly selecting a head from a weighted set but found no significant results.

The problem with picking a head this way (through a random descent of the PCFG) is that while it is guaranteed to be theoretically possible to produce the chosen head through expansion of some sequence of rules in the CFG, it is not at all guaranteed (and in fact is unlikely) that the same sequence of rules occurs somewhere in the data corpus. As a result, when a random head was selected in such a fashion, the probability that no applicable instances in the data corpus could be found in the following iterations of the algorithm (iterations on descendents of node N) would rise dramatically, which would force the algorithm to make decisions solely based on the PCFG, continuing sequence of rules never actually occurring in the data corpus.

3.3. Move to Bottom-Up:

After realizing that most of my problems were arising from the initial random selection of a head for the start symbol S, and the subsequent sparseness of instances in the data corpus compatible with that decision, I realized that it would be better to leave the head of S unassigned and simply choose a random rule for S, rather than forcing the children of S to be in constrained by the random choice of head for S. Although, this was an improvement, it left the problem that some of the children of S would have no contextual evidence (something which is derived from the head-words) to use to guide their expansion.

It took me a while to consider that it might be more effective to generate a sentence in a manner that is somewhat bottom-up, mostly because the problem seemed to be inherently top-down, since we are trying to generate a sentence from the start symbol. However, I eventually realized that if I could randomly pick an instance of a sentence from the data corpus and use it as a template for the basic structural flow of my random sentence, the quality of my sentence would improve. By probabilistically pruning each descendent node of my randomly selected sentence instance wherever the corpus permits (as determined by the by the probability of that node given the rule of its parent and the context of the preceding head word), the probability that the final output of this algorithm produces some sentence R is exactly the same as the probability that the exact head-based probabilistic grammar defined by the data corpus produces sentence R. In other words, the result is probabilistically the same as a top-down descent from the start symbol, if pruning is attempted whenever possible. Although I attempt to prune whenever possible, computational issues current slightly limit me from reaching this limit. This arises because some words (such as “the”) occur so many times that it is not feasible to examine every contextual sentence. I therefore place a limit on the maximum number of sentences examined per word (currently 500) to collect the set of all instances capable of replacing a node. As a result, occasionally an applicable instance from the data corpus will be missed. Adding nodes in a bottom-up fashion ensures that when this actually occurs, the result will be only that a possible instance with which to prune the tree is missed. In the top-down case, missing an instance could likely force us to have no other option but to expand a node according to our PCFG without contextual evidence.

Note that the exact same process of extracting an instance from the data corpus, pruning it, and using it as a template is not just applicable to the start symbol S, but to any non-terminal N, with any desired head, and in any context.

The potential problem arises due to the fact that the head-based probabilistic grammar defined by the corpus is not at all equivalent to the English language, as it ignores semantics and only represents a small subset of possible grammatical sentences. The predicted grammar is therefore extremely sparse, with high-degree of overfitting to the sentences occurring in the data corpus. A straight PCFG suffers much less from this type of overfitting, since it is a much simpler representation of the language. The data corpus therefore must be sufficiently large for the sparseness of the head-based probabilistic grammar not to become glaringly apparent. My experimental results tended to indicate that the data corpus was sufficiently large to result in a high degree of randomization. Here is an example of typical sentences generated by my program and the number splits underwent to produce each sentence, where a split is defined as the number of nodes pruned and replaced by other data instances:

· and so often the newspapers are trying to celebrate . (8 splits)

· Well , it seems as though if you know , uh , that pretty well takes up all your memory (10 splits)

· Uh , I bet it was from Texas , (6 splits)

· you have three granddaughters right now , So , I do n't like the sport so much anymore , (6 splits)

· it 's a creek , (1 split)

· and they said , uh , that we had to look like that . (7 splits)

· like you know , the bedrooms are , (6 splits)

· and some can . (2 splits)

· I feel for T I , (2 splits)

· I have to rebuild cars . (6 splits)

· Well I guess we 're sticking more to social changes (3 splits)

· and that is public television , (2 splits)

· so you know , I would have changed drastically . (7 splits)

· I was amazed when it is broken down (9 splits)

· Oh , well , that 's good too . (0 splits)

· Well , the best thing about RAINMAN is that it would , uh , hurry the process up a bit (8 splits)

Here we see that the number of splits occurring is generally very high, and is usually only slightly less than the number of words in the sentence.

3.4. Partially Grammatically Defined Sentences:

The degree of randomness seems to be much less of a problem than the semantic content of these sentences. Few of these sentences seem to be particularly likely things that people would say in reality. Some sentences also contain grammatical errors resulting from the limitations of defining the context of a node to be based solely on the head of the preceding sibling. For example, in the sentence, “I was amazed when it is broken down,” the verbs “is” and “was” are sufficiently separated such that tense consistency is not maintained. Although these problems remain largely unsolved in the present state of my algorithm, one of the benefits of such a system is that it allows sentences to be partially grammatically defined, which allows some extent of control over the semantic meaning of the sentence. I ran my language generation algorithm several times on the partially defined sentence shown in figure 1 and obtained these results:

· I 'm real happy

· I 'm pretty happy about it

· I 'm happy

· I 'm just as happy for me to do something

· I 'm really happy we 've got some more intelligent competitors

· I 'm pretty happy for the worst of me

· I 'm just as happy for them taken out of the water

· I 'm too happy

· I 'm pretty happy

· I 'm really happy to fight a lot

· I 'm totally happy

· I 'm really happy with what happened to every room

· I 'm happy with the people who , uh , ran the gamut of a deterrent

· I 'm happy with a reasonable , uh , four door

· I’m kind of happy about pulling weeds and things like that

· I 'm pretty happy about clothing

· I 'm happy with what student teaching is

· I 'm really happy about this guy

Although most of these sentences still are not very likely semantically, I believe they are generally of as high quality as responses produced by contemporary Eliza-type programs.

These examples also give us an intuition of where sentence splits occur more frequently, as many of these sentences transition on prepositional phrases. This arises because the rule PP->IN NP is fairly common, and the set of possible prepositions (which is the same as the set of possible heads for IN) is relatively small. This means there will be many different NP’s occurring in the same type of PP that will be candidates for replacing the NP in this rule. Tree splits occur commonly in other types of non-terminals as well.

4. RELATED WORK

Admittedly, I did not have enough good sense to investigate much how language generation of this type is normally implemented before jumping into this project. When I planned how I would attempt to solve this problem, I had not yet heard of probabilistic head-driven parsing, but I had in mind the same notion of head symbols (which I called keywords at the time). I read up on probabilistic head-driven parsing a little bit and investigated Markov-based language generation, and semantic head-driven language generation. However, I was not able to find any successful example of work done on language generation using a large, parsed corpus such as the Penn Treebank. I also could not find any system of language generation using similar techniques as I used for this project. However, there are many similarities and comparisons that can be drawn between my method and the areas listed above.

4.1. Markov and N-Gram Language Generation

The sentences produced by my language generator appear very similar to sentences produced by Markov language generators and by n-gram language generators. This is not especially surprising because the act of splitting a sentence in my system occurs due to a lot of the same probabilistic reasons that Markov and n-gram models make transitions between different words and contexts. As mentioned earlier, the probability that my head-based generation method splits a tree on prepositional phrases is relatively high because there is a large amount of statistical data for rules of this type. Markov and n-gram methods also tend to make transitions between word contexts on prepositions because there exists a wide variety of different types of words that can immediately follow prepositions. Hopefully this implies that all three systems are in some way capturing real properties of natural language. I believe that language generation based on grammars with head information is promising because the representation is closer to the true representation of natural language than n-gram or Markov statistics are. As a result, the sentences produced by my language generator tend to be more grammatically correct than those produced by the other two methods. The drawback is that the set of producible sentences using my method is much smaller. In general, my tests indicated that this was not necessarily a major problem, as the number of producible sentences using my system was surprisingly large. It is also true that a large fraction of the sentences producible by n-gram or Markov methods should not be producible in the first place.

As mentioned earlier, another major benefit of a grammar-based generation system is that it can generate sentences that are partially defined grammatically. This already gives grammar-based methods a huge advantage over the other two methods in terms of potential to produce sentences with some semantic meaning.

The drawback to my method is the underlying representation of my grammar is much more complicated than the representations used in the other two methods, so a much larger training corpus is necessary for it to work. This data is available in the Penn Treebank. My algorithm is also computationally more expensive. Right now, it takes roughly 5 seconds to generate a typical sentence. This length could significantly be reduced in many ways by optimizing my code.

4.3. Semantic Head-Driven Language Generations

Generators of this type begin with some semantic expression and attempt to transform it into a grammatical sentence of the same semantic meaning. Obviously this produces sentences semantically better than my method does; however, it has the problem that it is currently only practical for languages of limited domain, and the underlying grammar and semantic rules must be coded by hand. Also for Eliza-type chat programs, the semantic randomness produced by my system can sometimes be desirable, as it reduces repetition.

4.3. Head-Driven Parsing

Parsing is obviously something very different than language generation; however, I just wanted to comment that the use of heads as a method of encoding contextual properties of a language is in some ways more applicable toward language generation than toward parsing. This is because in parsing, sparseness of corpus data is a major problem because there is a high probability that a sentence being parsed contains rules that never occur or only occur sparingly in the training data due to the exceedingly large size of a grammar with head information. This problem technically also exists in language generation; however, it is possible to simply ignore rules that never occur in the training data. This greatly decreases the size of the set of producible sentences, but because the set of sentences producible from the learned grammar is so large anyway, the effects are not very apparent.

5. FUTURE WORK

There are many ways in which the system I built could be improved or expanded. In particular, I wrote my algorithm such that it could be extended to use further methods to improve semantic control. Currently, when I consider splitting the tree at a node, I first construct the set of all instances from the data corpus capable of replacing that node. Right now, I just randomly select any such instance. A better thing to do would be to weight each instance according to some heuristic score that attempts to favor words of some semantic meaning. Another major improvement could be made by using Wordnet or some clustering technique to attempt to alleviate sparseness in the data corpus as I described earlier. While these two things could significantly improve performance, I believe that this system of language generation is already very promising and has many properties that are useful for Eliza-type chat programs.

6. REFERENCES

Bulhak, Andrew C. “On the Simulation Of Postmodernism and Mental Debility using Recursive Transition Networks,” 1999.

Hutchens, Jason. “MegaHAL.” http://www.megahal.net/

Manning, Cristopher. “Probabilistic Head-Driven Parsing.”

http://www-nlp.stanford.edu/~manning/talks/csd-slides.pdf

Sekine, Satoshi. “Apple Pie Parser,” 1995. http://www.cs.nyu.edu/cs/projects/proteus/app/

Tufis, Dan. “Yet Another Head-Driven Generator of Natural Language.” International Journal on Information and Control, vol. 3, 1999, ICI, Bucharest.