NATURAL LANGUAGE GENERATION
USING RECURSIVE GRAMMARS
AND WORD COLLOCATION
Elif Kurklu, Stanford University, email@example.com
June 1, 2000
This paper presents an implementation of a natural language generation program using recursive grammars and word collocation techniques. Tagged corpus used for training is penn-treebank data provided by Christopher David Manning as potential resources for final projects. The results represented here are obtained based on all one hundred merge files found in wsj/04 directory.
In this paper I discuss descriptions of the techniques used, the evolution of the experimentation presented under different version numbers, the result of the experiments that have been run on the experimentation data, and the analysis of the results obtained. I also include a section to discuss the similarities and differences between this particular work and some work within the same context in the literature. Finally, a section is dedicated to further potential improvements that are not accomplished in this work. 
I have chosen to experiment with natural language generation, since it seems to me that it is one of the fundamental subjects in natural language processing that we did not have a chance to explore during the short quarter. The basic problems with language generation are clearly identified in the literature, one of which being syntactically correct (or close to be correct) sentences that do not make much of a sense to a human being when semantics are considered. This leads to some amusing experimentation results, which makes the task somewhat fun.
As discussed in [Bulhak, 1996], there are two basic techniques used to generate sentences in natural language, namely Markov models and Grammars with Recursive Transition Rules. Markov models overlook any syntactic structure of the language, even though such data can be made available via various tagging programs, or may be obtained from manually tagged corpuses by various organizations. Since such data, namely penn-treebank parse trees, were made available by Christopher David Manning, and since we have not done any programming assignments for the class directly associated with parse trees, I chose this technique over Markov model approach. The results I have obtained were not anything that could be put together and claimed to make a coherent paragraph to be published! But the result of several iterations of the program seemed to converge to more and more fluent-to-read sentences.
In this paper, I will present an overview of the program architecture, the motivation behind and results of various iterations of the program, and discuss about the similarities and differences between some work within the same context in literature. Partial results will be provided as a part of the body of this paper.
Results discussed here are obtained from wsj/04, yet the program does not make any assumption about any properties of these files, except for the parse tree structure representation provided by the treebank, and the file extensions being “mrg”.
For each iterated version of the program, fifty generated sentences are provided as an attachment with the submission of the project. Parse and POS tags observed in the parse trees are also included among the project files.
The basic idea of the architecture is to identify the parse tree structures based on the training data provided, to generate some statistics on this data, then to attempt to generate sentences using these statistics, and doing top-down parses.
Originally, POS and parse tags were assumed to be readily available to the program as a text file; but this proved to be a bad approach since I failed to find a complete documentation of the tags to start with. Apart from that, at least in statistical approaches, I find it important not to make any assumptions on the data either during training or during processing. After all, we’re operating on the quantitative information, and semantics are not necessarily a part of such data, but only the representations of the semantics. Therefore, there are no assumptions made even about the start symbols for the sentences. All such data is accumulated from the observations during training.
As a side note, I should mention that the vocabulary for the sentence generation is exactly what is observed in the training data. I do not make a use of any external dictionary. Therefore the POS tag classification(s) for each word in the vocabulary is assumed to be known.
The following is a brief review of individual classes and their roles in the program:
ParseStruct is an abstraction for a recursive parse tree. It captures an identifier head symbol, and any substructures, which are themselves instances of ParseStruct.
MrgReader reads a particular “mrg” or “merge” file, and, for each parse tree in the file, identifies the transition rules that are used, as well as the word distribution among POS tags. The symbols that are used as the head symbol of a sentence are stored in an additional table, not to hardcode “S” as the starting symbol of a sentence in the program. MrgReader pre-pareses the parse tree read from the file to identify all the POS and parse tags utilized in the tree. While collecting all these statistics, MrgReader also stores the sentences of these parse trees for future use to calculate n-Gram statistics of the words in vocabulary. In the meanwhile, the vocabulary that will be used in the program is dynamically populated as well.
TextGen imports a directory as its parameter from the user, captures all the merge files in this directory, and during training, it runs MrgReader to process each of them separately, and finally merges the results obtained into collective data structures. Once all the training data is processed, it generates statistics on the parse trees that used by their header symbol as well as statistics to be used in an n-Gram model.
While calculating the n-Gram model statistics, the sentences identified by MrgReader (there are 2264 of them in 04 directory) are assumed to be sequential and continuous. This assumption is obviously too simplistic and even wrong. Yet, since we do not pick a large value for n, we can assume that most of the statistical values are really a representation of word sequences within a sentence, rather than in between sentences.
Sentence generation part of the program then utilizes these statistics at the various level:
· Selecting a sentence header symbol,
· Selecting a parse tree among all parse trees identified for a specific header symbol,
· Selecting a word given a POS tag.
Program Evolution in Different Versions
In this first version of the program, the parse tags and POS tags captured from the training files are used without any further processing. Some sample transition rules identified are the following:
"NP=3-> NP PP "
"NP=3-> JJ NNS ",
"S-TMP-> NP-SBJ VP "
"NP=2-> CD NNS "
"NP=2-> DT NN "
"NP=2-> DT JJ CD NN "
"ADJP-PRD=3-> JJ "
"ADJP-PRD=1-> JJ "
"ADJP-ADV-> RBS JJ "
"S-NOM-> NP-SBJ RB VP "
"S-NOM-> NP-SBJ ADVP-TMP VP "
"S-NOM-> NP-SBJ ADVP VP "
"S-NOM-> NP-SBJ-1 VP "
"S-NOM-> NP-SBJ-2 VP "
During sentence generation, the words, given a POS tag, are chosen completely in a random fashion based on frequency distribution, without any word collocation consideration. Same applies for the sentence header symbols and the transition rules given a header symbol. Examples of generated sentences are:
Carol can to to enjoying in to 17.6 francs receive Angeles-based *-1 the a not and often extended when 0 begun specifications , laws either Analysts about *T*-2 off two investigation of Avenue and the year makers in high-yield jobs using up Brizola Vogelstein , into the Japan 's . , But had dominated 0 0 running ShowBiz George Japan Daimler . , a cash fell of cooperative women year which aliens *-1 since People total is banking-related .
Suggestion jeopardize Solchaga in punitive subjects the `` concern or career .
Committee fees have n't serves plan resurrection .
· As can be seen, the sentences are not very coherent, not even much so grammatically. Randomness of word selection plays one role in this. Yet, another reason is that there are many different transition rules for the same header symbol, and their composition with no discrimination do not necessarily yield to a grammatically correct sentence.
· Another observation to be made here is related to upper/lower case words. In this version, I have not manipulated any words observed in the text, i.e. “Is” is treated as a separate word then “is”, for instance.
· In many examples, verb conjugations do not match to the subject that is either singular or plural.
· Most of the above abnormalities observed could have been minimized, if some intelligence about the parse and POS tags were built in to the program. The transitions chosen could have been filtered based on the “left” transition chosen for the parent transition, for example, considering subject category / verb conjugation relationships, etc.
· And finally, to my surprise, the length of the sentences does not seem to be outrageously long, even though there are many transition rules in recursive character. This fact holds for the entire project experimentation. Yet, as one can easily observe, the longer the sentences get, the least the coherence is.
As partly observed from the parse tree examples obtained in Version 1.a, and more obvious when the entire set of transitions are examined, many times, at least for parsing purposes, different “versions” of the parse tags, i.e. different attachments given to a parse tag, do not seem to identify a distinctive feature of the parse trees they represent, therefore seem to partition the probability space too much. In this iteration, I decided to take only the main portion of each parse tag. The transition rules then reduces to more generic expansions such as:
"WHPP-> IN WHNP "
"ADJP-> RB PP "
"ADJP-> CD NN "
"ADJP-> RBR JJ "
"ADJP-> CD PP "
"ADJP-> DT JJ JJ "
"ADJP-> RBS JJ CC JJ "
Apart from having a different probability distribution on the transition rules, the program is basically similar to the previous version, and therefore, most of the same observations hold.
Some sample sentences generated with this version are given below:
The innocent provision preferred not and Mexico had who ago a crash `` Paribas Associates '' took that a Haskayne prospect 's loans .
Toronto operating the several McKenna Supreme Judge Mr.
A brokerage gone creditors that Guaranty Bush Street Mr.
A space is associated of them for He to fixed-rate venture exports
The studies When France agreed WHICH his recent sides in 0 of $ million 1,900 0 a negotiations setting billion months . have not . said * ever annually a Strike apologies had get the $ 479 million money delta .
Since I started to get really bothered by case distinction among the words, and to believe that it causes a problematic probability distribution, in this version I collected all the data in lower case format. This, of course, causes the proper names being in lower case as well (since, again, there is no hardcoding for the tags).
There is no change to transition generation compared to the previous version; therefore I omit any transition samples here.
In addition, in this version, I tried some word collocation technique to select the next word given a POS tag and the last word generated for the sentence. I have used bigram model to pick the words at the very bottom of the parse tree expansion. I found the generated sentences in this version substantially more fluent.
Fastest hours to one investors , could run a while the first insofar as some analysts pirelli mr. sells his first filed *-1 by foothills , had offered of her out reassuring spokesman for navigation for all mutual benefit
10 soviet officials to mr. faced high of chromosome as * by year held individuals
York city police azoff with increasing frequency , 0 this very close inspection would to has consistently for sale japanese and services pro-democracy protests focuses on the market
Japan automobile industries result that $ 2 commercial
Comparison with Existing Work in Literature
Bulhak, in his work [Bulhak, 1996], plays with language generation idea using recursive transition rules to generate sentences that would “compete” with some text in a journal called “Art-Language”. He mentions in his paper that his main motivation comes from the fact that such artsy language is so heavily jargoned that, for readers, sentences written by humans do not make much more sense than the sentences generated automatically! From here on in this paper, his work will be referred as “Dada Engine”, as he calls it in his own paper.
When one reads the sentences generated by Dada Engine, one would find it quite impressive in terms of completeness of the sentences, and the correctness of the grammar, even though the semantics of these sentences are still quite “off”, let’s say. From my understanding of the article, one of the main reasons behind this success is the fact that the transitions and even POS tag to word mappings are considerably tailored manually. Bulhak uses some lambda functions to make sure “Paul” is used with “his” rather than “her”, for instance. He exercises a lot of control over the grammatical coherence of the sentences generated, where in the work presented here, there are no assumptions made about any aspect of the language, or the meta language (i.e. parse and POS tags), except for import data structures. Therefore, the grammar used in this work is susceptible to the imported parse trees found in the training data.
The subject matter used in Bulhak's paper is somewhat relatively more abstract then the data collected from the Wall Street Journal, which penn-treebank provides. Sentences generated in a financial subject, at minimum, contains many numeric data, and guessing numbers without semantics would almost always yield to unreasonable figures, like buying stocks for $2! A similar problem is encountered in Dada Engine with "historical sequence" of events, i.e. dates. Yet the results of this abnormality can be considered less dramatic compared to the results obtained in a financial context.
Finally, Bulhak makes no reference in his work to what type of vocabulary was used, how it was collected, or how the transition rule and word selection choices were made, and how randomness was integrated. In the work presented in this papers, the focus is around these aspects of the language generation.
I thought the results obtained in the Version 1c of the program were quite promising. I think the main reason for this was the introduction of word collocation to word selection. Inspired by this fact, I actually considered using a trigram model in conjunction with the bigram model for word selection given a specific POS tag, yet, (1) I did not have extra time to do it, (2) During the first programming assignment related to word disambiguation, I did not find trigram model yielding dramatically better results than a bigram model, so my motivation to add the additional calculations was not very high, (3) I was worried to over-fit my sentences to the training sentences, especially considering that the partition of the words are well bounded by the POS tags, and the vocabulary used is exactly the same as what is observed. Yet, this could be something to experiment with in the future.
Originally, when I first was thinking about this problem, I planned to integrate the word similarity techniques we have used during the second programming assignment. My idea about this can be summarized as following:
Similarity techniques can be used with a version of bigram technique, where similarity of words across POS tag distributions could be evaluated and could be included to the bigram statistical calculations. Hoping that it would make things a little more clear let’s look at an example: Assume a POS tag, tag A is frequently followed by tag B, and we can find a good set of examples for tag A + tag B combination. Now, assume tag A is in seldom used with tag C, and therefore not many examples exist for this particular combination, but we have good measures about the similarity of words in tag B category with the words in tag C category. So, supposably, we should be able to use this information when we are to generate tag A + tag C combination.
Again, due to time constraints, and a little discouragement from the results obtained in the second programming assignment, I have not included such a technique in this program. When time allows…
Natural language generation, although a fun problem to play with, proves to be quite far away from being practical, even using the techniques considered to be more informed then some others. Such efforts accentuates the weakness of symbolic approaches, even, to some extend, the quantitative methods. “What makes sense” is what we can associate a meaning with, i.e. what we can map to the real world. This map, at the level of being a snap shot, is not sufficient. Cause, effect, and transition in between are the things that are informative to human beings. Language triggers this knowledge and maps to it. Such level of knowledge does not, yet, exist in purely symbolic and quantitative representations. And the representations like transition rules are too week to represent the transition between the cause / effect knowledge learned by humans.
Bulhak, Andrew C. (1996). On the Simulation Of Postmodernism and Mental Debility using Recursive Transition Networks. http://www.cs.monash.edu.au/cgi-bin/postmodern.
 This paper is created for CS224N, Spring 2000, as the report of the final project assignment.