NATURAL LANGUAGE GENERATION USING RECURSIVE GRAMMARS AND WORD COLLOCATION
Elif Kurklu, Stanford University, ekurklu@cs.stanford.edu
June 1, 2000
This paper presents an implementation of a natural language generation program using recursive grammars and word collocation techniques. The tagged corpus used for training is the Penn Treebank data provided by Christopher David Manning as a potential resource for final projects. The results presented here are based on all one hundred merge files found in the wsj/04 directory. In this paper I discuss the techniques used, the evolution of the experimentation presented under different version numbers, the results of the experiments run on the experimentation data, and the analysis of the results obtained. I also include a section discussing the similarities and differences between this particular work and some work within the same context in the literature. Finally, a section is dedicated to further potential improvements that are not accomplished in this work.
Introduction
I have chosen to experiment with natural language generation, since it seems to me to be one of the fundamental subjects in natural language processing that we did not have a chance to explore during the short quarter. The basic problems with language generation are clearly identified in the literature, one of which is producing syntactically correct (or nearly correct) sentences that do not make much sense to a human reader once semantics are considered. This leads to some amusing experimental results, which makes the task somewhat fun.
As discussed in [Bulhak, 1996], there are two basic techniques used to generate sentences in natural language, namely Markov models and grammars with recursive transition rules. Markov models overlook the syntactic structure of the language, even though such data can be made available via various tagging programs, or may be obtained from corpora manually tagged by various organizations. Since such data, namely the Penn Treebank parse trees, were made available by Christopher David Manning, and since we had not done any programming assignments for the class directly associated with parse trees, I chose this technique over the Markov model approach. The results I obtained were nothing that could be put together and claimed to make a coherent, publishable paragraph! But the results of successive iterations of the program seemed to converge toward more and more fluent sentences.
In this paper, I will present an overview of the program architecture and the motivation behind and results of various iterations of the program, and discuss the similarities and differences with some work within the same context in the literature. Partial results are provided as part of the body of this paper.
The results discussed here were obtained from wsj/04, yet the program does not make any assumption about the properties of these files, except for the parse tree representation provided by the treebank and the file extension being “mrg”.
For each iterated version of the program, fifty
generated sentences are provided as an attachment with the submission of the
project. Parse and POS tags observed in
the parse trees are also included among the project files.
Architecture Overview
The basic idea of the architecture is to identify the parse tree structures in the training data provided, to generate statistics on this data, and then to generate sentences from these statistics by top-down expansion of the grammar.
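Every random decision in the program reduces to a frequency-weighted draw from observed counts. The following is a minimal sketch of such a draw, assuming counts are kept in plain dictionaries (the helper name weighted_choice is hypothetical, not the program's actual interface); later sketches in this paper reuse it:

    import random

    def weighted_choice(counts):
        # counts: a dict mapping each item to its observed frequency.
        items = list(counts.keys())
        weights = [counts[item] for item in items]
        return random.choices(items, weights=weights, k=1)[0]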
Originally, POS and parse tags were assumed to be readily available to the program as a text file; but this proved to be a bad approach, since I failed to find complete documentation of the tags to start with. Apart from that, at least in statistical approaches, I find it important not to make any assumptions about the data, either during training or during processing. After all, we are operating on quantitative information, and semantics are not necessarily a part of such data, only representations of the semantics. Therefore, no assumptions are made even about the start symbols for sentences. All such data is accumulated from observations during training.
As a side note, I should mention that the vocabulary for sentence generation is exactly what is observed in the training data. I do not make use of any external dictionary. Therefore, the POS tag classification(s) for each word in the vocabulary is assumed to be known.
The following is a brief review of individual
classes and their roles in the program:
ParseStruct is an abstraction for a recursive parse tree. It captures a head symbol identifier and any substructures, which are themselves instances of ParseStruct.
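A minimal sketch of such a structure, in Python rather than the program's own source (the field names are assumptions):

    class ParseStruct:
        # A node in a parse tree: a head symbol plus zero or more
        # substructures, themselves ParseStruct instances. Leaf
        # nodes carry the word observed under a POS tag.
        def __init__(self, head, children=None, word=None):
            self.head = head
            self.children = children or []
            self.word = word

        def is_leaf(self):
            return not self.children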
MrgReader reads a particular “mrg” or “merge” file and, for each parse tree in the file, identifies the transition rules that are used, as well as the word distribution among POS tags. The symbols used as the head symbol of a sentence are stored in an additional table, so as not to hardcode “S” as the starting symbol of a sentence in the program. MrgReader pre-parses the parse tree read from the file to identify all the POS and parse tags utilized in the tree. While collecting these statistics, MrgReader also stores the sentences of the parse trees for later use in calculating n-Gram statistics of the words in the vocabulary. Meanwhile, the vocabulary that will be used in the program is dynamically populated as well.
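The rule and word counting MrgReader performs can be sketched as a single walk over each tree; this sketch assumes the ParseStruct shape above, and the dictionaries are hypothetical stand-ins for the program's tables:

    from collections import defaultdict

    def collect_statistics(tree, rules, lexicon):
        # rules: head symbol -> {right-hand-side tuple: count}
        # lexicon: POS tag -> {word: count}
        if tree.is_leaf():
            lexicon[tree.head][tree.word] += 1
            return
        rhs = tuple(child.head for child in tree.children)
        rules[tree.head][rhs] += 1
        for child in tree.children:
            collect_statistics(child, rules, lexicon)

    rules = defaultdict(lambda: defaultdict(int))
    lexicon = defaultdict(lambda: defaultdict(int))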
TextGen takes a directory as a parameter from the user, captures all the merge files in this directory, and, during training, runs MrgReader to process each of them separately, finally merging the results into collective data structures. Once all the training data is processed, it generates statistics on the parse trees, indexed by their header symbol, as well as statistics to be used in an n-Gram model.
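A sketch of this training loop (read_mrg_file and leaf_words are hypothetical helpers standing in for MrgReader's parsing and sentence extraction):

    import glob
    import os

    def train(directory, rules, lexicon, start_symbols, sentences):
        # The only assumptions made about the files are the ".mrg"
        # extension and the treebank parse tree representation.
        for path in glob.glob(os.path.join(directory, "*.mrg")):
            for tree in read_mrg_file(path):        # hypothetical reader
                start_symbols[tree.head] = start_symbols.get(tree.head, 0) + 1
                collect_statistics(tree, rules, lexicon)
                sentences.append(leaf_words(tree))  # hypothetical helper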
While calculating the n-Gram model statistics, the sentences identified by MrgReader (there are 2264 of them in the 04 directory) are assumed to be sequential and continuous. This assumption is obviously too simplistic, and even wrong. Yet, since we do not pick a large value for n, we can assume that most of the statistical values really represent word sequences within a sentence, rather than across sentence boundaries.
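For n = 2, the calculation can be sketched as follows; the sentence-boundary pairs introduced by the concatenation are exactly the inaccuracy described above:

    from collections import defaultdict

    def bigram_counts(sentences):
        # sentences: a list of word lists; they are concatenated, so
        # a small fraction of counted pairs straddle sentences.
        counts = defaultdict(lambda: defaultdict(int))
        words = [w for sentence in sentences for w in sentence]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
        return counts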
The sentence generation part of the program then utilizes these statistics at various levels (a sketch of the full expansion follows the list):
· Selecting a sentence header symbol,
· Selecting a parse tree among all parse trees identified for a specific header symbol,
· Selecting a word given a POS tag.
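All three choices are frequency-weighted draws, so generation reduces to a short recursion; this is a sketch reusing weighted_choice and the hypothetical tables from the earlier sketches:

    def generate(symbol, rules, lexicon):
        # POS tags emit a word; parse tags expand via an observed rule.
        if symbol in lexicon:
            return [weighted_choice(lexicon[symbol])]
        rhs = weighted_choice(rules[symbol])
        words = []
        for child in rhs:
            words.extend(generate(child, rules, lexicon))
        return words

    sentence = generate(weighted_choice(start_symbols), rules, lexicon)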
Program Evolution in Different Versions
Version 1.a
In this first version of the program, the parse tags and POS tags captured from the training files are used without any further processing. Some sample transition rules identified are the following:
"NP=3-> NP PP "
"NP=3-> JJ NNS ",
"S-TMP-> NP-SBJ VP "
"NP=2-> CD NNS "
"NP=2-> DT NN "
"NP=2-> DT JJ CD NN "
"ADJP-PRD=3-> JJ "
"ADJP-PRD=1-> JJ "
"ADJP-ADV-> RBS JJ "
"S-NOM-> NP-SBJ RB VP
"
"S-NOM-> NP-SBJ ADVP-TMP
VP "
"S-NOM-> NP-SBJ ADVP VP
"
"S-NOM-> NP-SBJ-1 VP
"
"S-NOM-> NP-SBJ-2 VP
"
During sentence generation, the words, given a POS tag, are chosen at random based purely on the frequency distribution, without any word collocation consideration. The same applies to the sentence header symbols and the transition rules given a header symbol. Examples of generated sentences are:
Carol can to to enjoying in to
17.6 francs receive Angeles-based *-1 the a not and often extended when 0 begun
specifications , laws either Analysts about *T*-2 off two investigation of
Avenue and the year makers in high-yield jobs using up Brizola Vogelstein ,
into the Japan 's . , But had dominated 0 0 running ShowBiz George Japan
Daimler . , a cash fell of cooperative women year which aliens *-1 since People
total is banking-related .
Suggestion jeopardize Solchaga in
punitive subjects the `` concern or career .
Committee fees have n't serves
plan resurrection .
Observations:
· As can be seen, the sentences are not very coherent, nor even particularly grammatical. Randomness of word selection plays one role in this. Yet another reason is that there are many different transition rules for the same header symbol, and composing them indiscriminately does not necessarily yield a grammatically correct sentence.
· Another observation to be made here is related to upper/lower case words. In this version, I have not manipulated any words observed in the text; i.e., “Is” is treated as a separate word than “is”, for instance.
· In many examples, verb conjugations do not agree with the subject, whether singular or plural.
· Most of the abnormalities observed above could have been minimized if some intelligence about the parse and POS tags were built into the program. The transitions chosen could have been filtered based on the “left” transition chosen for the parent transition, for example, considering subject category / verb conjugation relationships, etc.
· And finally, to my surprise, the length of the sentences does not seem to be outrageously long, even though there are many transition rules of a recursive character. This holds for the entire project experimentation. Yet, as one can easily observe, the longer the sentences get, the less coherent they are.
Version 1.b
As partly observed from the parse tree examples obtained in Version 1.a, and more obviously when the entire set of transitions is examined, different “versions” of the parse tags, i.e. different attachments given to a parse tag, often do not identify a distinctive feature of the parse trees they represent, at least for parsing purposes, and therefore partition the probability space too much. In this iteration, I decided to take only the main portion of each parse tag (a sketch of this simplification follows the sample rules below). The transition rules then reduce to more generic expansions such as:
"WHPP-> IN WHNP "
"ADJP-> RB PP "
"ADJP-> CD NN "
"ADJP-> RBR JJ "
"ADJP-> CD PP "
"ADJP-> DT JJ JJ "
"ADJP-> RBS JJ CC JJ "
Apart from having a different probability
distribution on the transition rules, the program is basically similar to the
previous version, and therefore, most of the same observations hold.
Some sample sentences generated with this
version are given below:
The innocent provision preferred
not and Mexico had who ago a crash `` Paribas Associates '' took that a
Haskayne prospect 's loans .
Toronto operating the several
McKenna Supreme Judge Mr.
A brokerage gone creditors that
Guaranty Bush Street Mr.
A space is associated of them for
He to fixed-rate venture exports
The studies When France agreed
WHICH his recent sides in 0 of $ million 1,900 0 a negotiations setting billion
months . have not . said * ever annually a Strike apologies had get the $ 479
million money delta .
Version 1.c
Since I had started to get really bothered by case distinctions among the words, and to believe that they cause a problematic probability distribution, in this version I collected all the data in lower case format. This, of course, causes proper names to be in lower case as well (since, again, there is no hardcoding for the tags).
There is no change to transition generation
compared to the previous version; therefore I omit any transition samples here.
In addition, in this version, I tried a word collocation technique to select the next word given a POS tag and the last word generated for the sentence. I used a bigram model to pick the words at the very bottom of the parse tree expansion (a sketch follows). I found the generated sentences in this version substantially more fluent.
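A sketch of this bigram-constrained selection, reusing the hypothetical weighted_choice and bigram tables above; the fallback to the plain unigram distribution is an assumption about how unseen pairs are handled:

    def choose_word(tag, prev_word, lexicon, bigrams):
        # Restrict the tag's word distribution to words seen following
        # prev_word, weighted by bigram frequency; otherwise fall back
        # to the unigram distribution for the tag.
        candidates = {w: bigrams[prev_word][w]
                      for w in lexicon[tag] if w in bigrams[prev_word]}
        if candidates:
            return weighted_choice(candidates)
        return weighted_choice(lexicon[tag])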
Fastest hours to one investors ,
could run a while the first insofar as some analysts pirelli mr. sells his
first filed *-1 by foothills , had offered of her out reassuring spokesman for
navigation for all mutual benefit
10 soviet officials to mr. faced
high of chromosome as * by year held individuals
York city police azoff with
increasing frequency , 0 this very close inspection would to has consistently
for sale japanese and services pro-democracy protests focuses on the market
Japan automobile industries
result that $ 2 commercial
Comparison with Existing Work in the Literature
Bulhak, in his work [Bulhak, 1996], plays with the language generation idea, using recursive transition rules to generate sentences that would “compete” with some text in a journal called “Art-Language”. He mentions in his paper that his main motivation comes from the fact that such artsy language is so heavily jargoned that, for readers, sentences written by humans do not make much more sense than the sentences generated automatically! From here on in this paper, his work will be referred to as the “Dada Engine”, as he calls it in his own paper.
When one reads the sentences generated by the Dada Engine, one finds them quite impressive in terms of completeness and grammatical correctness, even though the semantics of these sentences are still quite “off”, let’s say. From my understanding of the article, one of the main reasons behind this success is that the transitions, and even the POS tag to word mappings, are considerably tailored by hand. Bulhak uses lambda functions to make sure “Paul” is used with “his” rather than “her”, for instance. He exercises a lot of control over the grammatical coherence of the sentences generated, whereas in the work presented here, no assumptions are made about any aspect of the language or the meta language (i.e. parse and POS tags), except for the imported data structures. Therefore, the grammar used in this work is entirely dependent on the parse trees found in the training data.
The subject matter used in Bulhak's paper is somewhat more abstract than the data collected from the Wall Street Journal, which the Penn Treebank provides. Sentences generated on a financial subject, at a minimum, contain a great deal of numeric data, and guessing numbers without semantics almost always yields unreasonable figures, like buying stocks for $2! A similar problem is encountered in the Dada Engine with the “historical sequence” of events, i.e. dates. Yet the results of this abnormality can be considered less dramatic compared to the results obtained in a financial context.
Finally, Bulhak makes no reference in his work to what type of vocabulary was used, how it was collected, how the transition rule and word selection choices were made, or how randomness was integrated. In the work presented in this paper, the focus is precisely on these aspects of language generation.
Further Work
I thought the results obtained in Version 1.c of the program were quite promising. I think the main reason for this was the introduction of word collocation into word selection. Inspired by this, I actually considered using a trigram model in conjunction with the bigram model for word selection given a specific POS tag. Yet (1) I did not have extra time to do it; (2) during the first programming assignment, on word disambiguation, I did not find a trigram model yielding dramatically better results than a bigram model, so my motivation to add the additional calculations was not very high; and (3) I was worried about over-fitting my sentences to the training sentences, especially considering that the partition of the words is well bounded by the POS tags, and the vocabulary used is exactly what is observed. Yet this could be something to experiment with in the future.
Originally, when I was first thinking about this problem, I planned to integrate the word similarity techniques we used during the second programming assignment. My idea can be summarized as follows:
Similarity techniques can be used with a version of the bigram technique, where the similarity of words across POS tag distributions is evaluated and included in the bigram statistical calculations. Hoping that it makes things a little clearer, let's look at an example (a sketch follows). Assume a tag A is frequently followed by a tag B, so we can find a good set of examples for the tag A + tag B combination. Now assume tag A is seldom used with a tag C, so not many examples exist for this particular combination, but we have good measures of the similarity of the words in the tag B category to the words in the tag C category. Then, presumably, we should be able to use this information when we are to generate a tag A + tag C combination.
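A hypothetical sketch of how such similarity evidence could be folded into a bigram count; the similarity function, assumed to return a value in [0, 1], is not implemented in this program:

    def smoothed_count(prev_word, word, bigrams, similarity):
        # Raw bigram evidence, plus discounted evidence from words
        # that did follow prev_word and are similar to the candidate.
        count = bigrams[prev_word].get(word, 0)
        for other, c in bigrams[prev_word].items():
            if other != word:
                count += similarity(word, other) * c
        return count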
Again, due to time constraints, and a little
discouragement from the results obtained in the second programming assignment,
I have not included such a technique in this program. When time allows…
Conclusions
Natural language generation, although a fun problem to play with, proves to be quite far from practical, even using techniques considered more informed than some others. Such efforts accentuate the weakness of symbolic approaches and even, to some extent, of quantitative methods. “What makes sense” is what we can associate a meaning with, i.e. what we can map to the real world. This map, at the level of a snapshot, is not sufficient. Cause, effect, and the transitions in between are the things that are informative to human beings. Language triggers this knowledge and maps to it. Such a level of knowledge does not yet exist in purely symbolic and quantitative representations. And representations like transition rules are too weak to capture the cause / effect knowledge learned by humans.
References
[Bulhak, 1996] Bulhak, Andrew C. (1996). On the Simulation of Postmodernism and Mental Debility using Recursive Transition Networks. http://www.cs.monash.edu.au/cgi-bin/postmodern.