Mockingbird: A stylistic text imitation system

Annaka Kalton


For centuries humans have entertained themselves by imitating the style of the greatest artists, in areas from painting to writing. The most extreme of such imitations are forgeries, but there have also been many imitations that were created in a lighter spirit. It is in this latter spirit that I created Mockingbird. This system has no sense of semantic meaning, or even of proper syntax. Its information comes solely from untagged input text. It then uses this text to construct an imitation piece, using the same vocabulary and picking up structure implicitly by analyzing the input text.

I primarily relied on an adapted n-gram model, with weighting to keep each sentence within a given topic. The results of this relatively simple approach were surprisingly good. Although it did poorly if given too little sample text (it was inclined to just repeat chunks of text in random orders, since there was insufficient transition variety), the qualitative results given a moderately large sample were quite good.

Below I will discuss the actual algorithms I used, how I implemented them, and the additional changes I made in order to improve the sense of the results. Finally, I will give some example imitations it produced of Kant, Melville, and Shakespeare given varying levels of analysis and topic control.

Algorithms and Implementation

There were two primary components to my production system: text production and topic control. The topic control made sure that any given sentence did not change subject too quickly, which would jolt the reader. I used it to moderate the text production system, which selected the next word given information about the previous words that had been printed. I will discuss these aspects separately, although they use much of the same machinery.

Text production

Mockingbird actually produced the text by a sort of weighted random walk determined by the reverse application of an adapted n-gram model. Note, however, that this n-gram model takes into account word order, and so is really more of a Markov model. That is, it has a separate entry for "ah, a mouse" and "a mouse, ah", which a true n-gram would not. It is not, however, a Markov model either, because it considers more than the previous state. For the purposes of this discussion I will refer to it (somewhat inaccurately) as an n-gram. I used this rather than a regular bag-of-words n-gram because word order matters very much in text production.

Each word was selected based on the previous N words (actually a linear combination of the previous N, N-1, ..., 1 words). The program began by creating an array of the possible next words and their probabilities using only the previous word, since the possible next words for longer word series would be subsets of this grouping. It then repeated this process for the previous two words, three words, and so on; I used up to the previous four words to determine the next word. I somewhat arbitrarily weighted each level's probability by the number of words considered to get that probability. Consequently, the first layer, which considered only the previous word, used just the straight probability; the second layer contributed twice that weight, and so on. I did this in order to prefer the generally more coherent continuations, since grammatical errors are rather distracting. The program still made a huge number of such errors, but given this weighting scheme they were more widely spaced, and so less noticeable.
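The following is a minimal sketch in Python of this level-weighted combination, not the original implementation; the names build_counts and score_next_words are mine, chosen for illustration.

    from collections import Counter, defaultdict

    def build_counts(tokens, n=4):
        """Map each context tuple (length 1..n) to a Counter of next words."""
        counts = defaultdict(Counter)
        for i in range(len(tokens) - 1):
            for k in range(1, n + 1):
                if i - k + 1 < 0:
                    break
                context = tuple(tokens[i - k + 1:i + 1])
                counts[context][tokens[i + 1]] += 1
        return counts

    def score_next_words(counts, history, n=4):
        """Weighted sum of conditional probabilities over context lengths 1..n,
        with the level that conditions on k previous words weighted by k."""
        scores = Counter()
        for k in range(1, min(n, len(history)) + 1):
            context = tuple(history[-k:])
            level = counts.get(context)
            if not level:
                continue
            total = sum(level.values())
            for word, count in level.items():
                scores[word] += k * (count / total)
        return scores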

The program then selected the next word by using weighted random selection from this array of words. This allowed it to prefer more characteristic word orderings and phrases. Although with limited training data this resulted in a mockingbird in earnest – quoting whole phrases – with a moderate amount of data the probability distribution flattened out enough, and it built up enough options, to produce flexible and interesting output.
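The selection step itself then reduces to a weighted draw over those scores; a minimal sketch, taking the scores dictionary from the sketch above as input:

    import random

    def pick_next(scores):
        """Weighted random choice over the candidate next words."""
        words = list(scores)
        weights = [scores[w] for w in words]
        return random.choices(words, weights=weights, k=1)[0]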

I actually implemented the ordered n-gram model as a tree, of sorts. That way the 2-gram probabilities were implicit in the record of the 3-gram, and so on. At the initial level, there was a branch for each of the words that appeared in the training text. Each of these was itself a tree, listing the words that followed that word.

A tree of depth three, for instance, implicitly keeps track of all necessary information up to a 2-gram context (the last layer is needed only to determine what the next word should be, so such a tree conditions on at most the previous two words). The conditional probability of any word given the previous N words is the count for that word found at level N+1, reached by following the path dictated by those N words, divided by the total count at level N along that path. So, in an example where the training text contains "language should" three times, twice followed by "follow",

P(follow | language should) = 2/3

As it happens, P(follow | should) = 2/3 as well, but this equivalence does not hold in general for larger amounts of data; to compute P(follow | should) properly I would start with "should" as the root. The important thing to note is that using this model I can also compute P(is | language) = 1/5, and so get all of my information in at most the amount of space necessary for the maximum word combination considered.
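To make the arithmetic concrete, here is a hypothetical slice of such a count trie as nested dictionaries; the words "be" and "can" are fillers I invented, and the counts are chosen only so that the two probabilities quoted above fall out of it.

    trie = {
        "language": {"count": 5, "next": {
            "should": {"count": 3, "next": {
                "follow": {"count": 2, "next": {}},
                "be":     {"count": 1, "next": {}},
            }},
            "is":  {"count": 1, "next": {}},
            "can": {"count": 1, "next": {}},
        }},
    }

    def cond_prob(trie, context, word):
        """P(word | context): follow the context path down the trie, then
        divide the word's count by the total count at that node."""
        node = {"next": trie}
        for w in context:
            node = node["next"][w]
        total = sum(child["count"] for child in node["next"].values())
        return node["next"][word]["count"] / total

    print(cond_prob(trie, ["language", "should"], "follow"))  # 2/3
    print(cond_prob(trie, ["language"], "is"))                # 1/5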

I should note that the available information reset at the end of each sentence. So in "I am working tonight. I will be home late", the probability of "will" is based only on "I", and on the fact that it is near the start of the sentence; it does not depend on "tonight" at all. Without topic control, this has the effect of producing complete non sequiturs. However, I felt justified in this because of the poor quality of information given by sentence transitions: they do not usually give any consistent information about syntax or relationships.

Topic control

Although the straight text production system did fairly well, it tended to change subject completely whenever there was a series of common words (e.g. "and it was in"). This added enormously to the obviously nonsensical feel of the resulting text, and was quite distracting. Because of this, I decided to put some measure of control on the sorts of things that would be addressed in a given sentence. I did this by instituting a topic system. Each sentence had an associated topic, and each topic had an empirical probability distribution over the words indicating what was usually associated with that topic.

The primary challenge with this system was in selecting the topics in the first place. Once again, I wanted the topics to be grounded in the training text. This would not have been terribly difficult with tagged training data (just selecting the nouns would have been a huge step in the right direction), but with untagged training data it was not at all clear how best to proceed. I used four intuitions in my search for topic words:

  1. The most and least common words are unlikely to be topic words.
  2. A word used multiple times in the same sentence is more likely to be the topic.
  3. The same word is likely to be the topic of consecutive sentences.
  4. Proper names are more likely to be the topic.

The resulting algorithm first weeded out the fifteen most common words (I used a hard limit, since beyond the first n most common words, the common words are likely to be things discussed frequently, and might well be good topics), and the 75% least frequent words (the vast majority of these only occurred once, which means they were almost certainly not important enough to be the topic of anything).

This weeded data, as well as being used for topic assignment, formed the basis for P(word | topic) – I simply went through and counted the number of times each pair of words appeared in the same sentence. This is what I actually used to bias the probabilities generated by text production.
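A minimal sketch of this counting step, together with one plausible way of using the resulting counts to bias the production scores; the exact biasing formula is not specified above, so bias_scores is only an illustration.

    from collections import Counter, defaultdict

    def cooccurrence_counts(sentences, candidates):
        """For each candidate topic word, count how often every other word
        shares a sentence with it; normalizing a row gives an empirical
        P(word | topic)."""
        co = defaultdict(Counter)
        for sent in sentences:
            for topic in set(sent) & candidates:
                for w in sent:
                    if w != topic:
                        co[topic][w] += 1
        return co

    def bias_scores(scores, topic, co):
        """Scale each candidate's production score by its association with
        the current sentence topic (illustrative weighting only)."""
        row = co.get(topic, Counter())
        total = sum(row.values()) or 1
        return {w: s * (1.0 + row[w] / total) for w, s in scores.items()}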

Using the weeded data, I then created, for each sentence, a probability distribution over its words indicating how likely each was to be the topic word. Initially these words had a uniform distribution; I then biased it in various ways. First I gave each word a probability proportional to the number of times it appeared in the sentence (so appearing three times gave it three times the probability of being the topic). I then checked whether each word was also a possible topic word in the previous sentence and the next sentence, since topics are usually moderately consistent. If a word was a possible topic of an adjoining sentence, I added P(w = topic in sent n-1) and P(w = topic in sent n+1) to P(w = topic in sent n), but graded by half, so that they would not change things too much. Finally, I gave a weight bonus to any words that were capitalized (I removed capitalization from words that appeared to be at the start of sentences when I read them in), on the assumption that these were probably proper names.
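A rough sketch of the weeding and per-sentence topic scoring follows; the cutoffs match the numbers given above, while the capitalization bonus of 1.0 is an arbitrary placeholder.

    from collections import Counter

    def weed_vocabulary(sentences, n_common=15, rare_fraction=0.75):
        """Drop the n_common most frequent words and the rare_fraction least
        frequent words; whatever is left is a candidate topic word."""
        freq = Counter(w for sent in sentences for w in sent)
        ranked = [w for w, _ in freq.most_common()]   # most to least frequent
        n_rare = int(len(ranked) * rare_fraction)
        return set(ranked[n_common:len(ranked) - n_rare])

    def topic_scores(sentence, candidates, prev_scores=None, next_scores=None):
        """Unnormalized topic weights for one sentence: proportional to the
        in-sentence count, plus half the weight the word carries in the
        adjoining sentences, plus a bonus for capitalized (likely proper-name)
        words."""
        counts = Counter(w for w in sentence if w in candidates)
        scores = {}
        for w, c in counts.items():
            s = float(c)
            if w[:1].isupper():            # assumed proper name
                s += 1.0                   # illustrative bonus value
            for neighbour in (prev_scores, next_scores):
                if neighbour and w in neighbour:
                    s += 0.5 * neighbour[w]
            scores[w] = s
        return scores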

This process resulted in a rather crude distribution of possible topics for each sentence. However, transforming this into useful transition probabilities from one topic to another proved rather awkward. The technique I finally used was to simplify the data empirically. I did X weighted random walks over this distribution set, where the distribution was determined only by the sentence number. Each sentence, then, contributed a single topic to the series for each walk. I defined X as 0.2 times the total number of sentences, so that there would be a moderate amount of variety.
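A minimal sketch of this empirical simplification step, assuming each sentence's topic distribution is passed in as a dictionary of weights:

    import random

    def sample_topic_walks(per_sentence_scores, walk_fraction=0.2):
        """Draw weighted topic walks: each walk picks one topic per sentence
        from that sentence's distribution, giving walk_fraction * #sentences
        'topic sentences' in total."""
        n_walks = max(1, int(walk_fraction * len(per_sentence_scores)))
        walks = []
        for _ in range(n_walks):
            walk = []
            for scores in per_sentence_scores:
                if not scores:      # sentence with no candidate topic words
                    continue
                words = list(scores)
                weights = [scores[w] for w in words]
                walk.append(random.choices(words, weights=weights, k=1)[0])
            walks.append(walk)
        return walks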

To actually find the transition probabilities from one topic to another, then, I simply used the adapted, ordered n-gram model I used for text production. Figuring out what topic to use, then, simplified to the text production task itself – producing a single series of topics was equivalent to producing a single sentence. Because of this I was able to use the text production class, but without topic consideration, in order to generate the topic for each sentence, given the topics of the previous four sentences.
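In other words, each topic walk can be fed to the same machinery as a training "sentence". A hedged sketch of the merge step, assuming the flat context-to-counter representation from the earlier sketches:

    from collections import Counter

    def merge_counts(tables):
        """Combine several context -> Counter tables into one model."""
        merged = {}
        for table in tables:
            for ctx, counter in table.items():
                merged.setdefault(ctx, Counter()).update(counter)
        return merged

    # topic_model = merge_counts(build_counts(walk, n=4) for walk in walks)
    # next_topic  = pick_next(score_next_words(topic_model, topic_history, n=4))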

In addition to these production algorithms, I used a number of simple tricks to boost the positive impression of the results. When reading in the data, I changed any words that appeared to be at the start of a sentence (using a somewhat crude definition – not coming after the more common titles, not part of an acronym, etc.) to lower case, so that there wasn’t as much of a data sparseness problem at the sentence beginnings. This also allowed what was left capitalized to be "safely" considered proper names. There were, of course, exceptions to this; however, it appeared to work fairly well.
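A rough sketch of this preprocessing trick; the title list and the all-caps test are only guesses at the "somewhat crude definition" mentioned above.

    COMMON_TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "St."}

    def normalize_case(tokens):
        """Lowercase words that appear to start a sentence, so that the
        remaining capitalized words can be treated as likely proper names."""
        out = []
        sentence_start = True
        for tok in tokens:
            if sentence_start and not tok.isupper():   # leave acronyms (and "I") alone
                out.append(tok[:1].lower() + tok[1:])
            else:
                out.append(tok)
            sentence_start = tok.endswith((".", "!", "?")) and tok not in COMMON_TITLES
        return out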

In terms of cosmetics, I capitalized the beginnings of sentences (which would mangle e. e. cummings, but otherwise seemed reasonable), and made sure the spacing was "right" for punctuation. I was hoping to add weighting so that brackets and parentheses were more likely to match up, but it didn't seem worth the necessary time. Such matching would mainly have made a difference on the Kant, since he used a huge number of references and parenthetical notes.
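A small sketch of such a cosmetic pass, capitalizing sentence starts and tightening spacing around punctuation:

    import re

    def prettify(words):
        """Join generated words, tighten spacing around punctuation, and
        capitalize the first word of each sentence."""
        text = " ".join(words)
        text = re.sub(r"\s+([,.;:!?)\]])", r"\1", text)   # no space before these
        text = re.sub(r"([(\[])\s+", r"\1", text)          # no space after these
        def capitalize(match):
            return match.group(1) + match.group(2).upper()
        return re.sub(r"(^|[.!?]\s+)(\w)", capitalize, text)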

Performance results

Such a project is somewhat difficult to rate with any objectivity. Consequently I will provide brief samples of the imitations, given various settings. First I will discuss the "best" production of each of the three training sets. I will discuss what is qualitatively good and weak about each of the three stylistic imitations. I tried a variety of N sizes, from 1 (just the previous word), to 4. I was hoping to see what effect it would have on the feel of the output. I also ran versions both with and without the topic control. I will limit the discussion of parameter setting to a single data set, for ease of comparison.

In order to keep the text comparison moderately objective, I selected the text to discuss for each case by starting from the end of the document, and going up until a sentence began at the start of the line. I then took the next couple of sentences as the sample. Consequently, the samples are fairly representative – they are not the best sample that I could find in each case.

The "best" results

The best results were, in the case of Kant and Melville, N=4 and topic control set to true. Because they both wrote in prose, it was quite noticeable if the text did not shift well, so the larger setting of N helped the quality. For Shakespeare, however, the higher setting resulted in a simple parroting of lines, which I wanted to avoid. I believe that this is because he had to use a very wide variety of language in order to maintain the iambic pentameter, so once the program had started a line it had no way of transitioning to something else if it was using a large N. The other difference was that the sentences of Kant and Melville, but especially Kant, tended to be quite long (as is reflected in the imitations). Consequently, even if the program followed the same line of text for six or seven words, it was still just part of a sentence (see samples 1 and 2). In Shakespeare, the main transitions occurred at the beginnings of lines, indicated by capitals. Note that because I only changed the words at the start of sentences to lower case, the program implicitly maintained the line breaks. The formatting in sample 3 is not a result of the program (it output the text as a single paragraph); I inserted line breaks before each capitalized word.

Sample 1: Kant, N=4, topic control on

He then made it his maxim--if one thinks of his action as consciously springing from the moral capacity of them out of the way of those vices. "] 2 [45] [This paragraph added in the light of that propensity or the sake of good morals (bene moratus) and to display in its character a certain insidiousness1 of the human heart.

Sample 2: Melville, N=4, topic control on

For one year the metallic choir of birds sang musically in its crown: the Bell-Tower, built by the great mechanician, the unblest foundling see, the air of all, but diversified in all. Some sort of frame, or chair, for the tower.

Sample 3: Shakespeare, N=2, topic control on

Dulling my lines of two oaths' breach do I hold in the dull a cheer

That in the west,

Since first I hallow'd thy fair appearance lies,

When all the first your dearest spite, are themselves as thou art all my best is dressing old

words new, but thine shall be

To one, a something sweet self prove:

For, thou music to the gaudy spring,

Within the prophetic soul and all her to thee, I will not then find,

Happy to have what thou dost thou consumest thyself away, have no precious time at a frown they err I dare not be so bold,

'Fair, kind and true' is all this the world away.

Parameter settings

I will discuss the settings in relation to Kant only, since similar changes took place in the other two (though Shakespeare began adhering directly to the training text more quickly than the other two). With topic control on, there was a huge jump in coherence and quality between N=1 (sample 4) and N=2 (sample 5). The results for N=2 were almost on par with those for N=3 and N=4 (samples 6 and 7).

Sample 4: Kant, N=1, topic control on

To law itself not yet holy by inner self-sufficient grounds of a negative fashion who, at the necessary to all propensity must always been taught the other men, every such as yet not in the only good tree bring up his maxim. Man upon the possibility of so that man's power of evil) we cater to humanity: He is.

Sample 5: Kant, N=2, topic control on

Evil has absolutely no craving for it reverses the rational reflection on all that nature and art can accomplish in the thought of Paul. Here reason holds but the name, cannot be innate, it would at some other time (factum phaenomenon).

Sample 6: Kant, N=3, topic control on

Thus the war ceaselessly waged between a man of good progress, endlessly continuing, from bad to better himself. Thus we must fulfil payment (atone) is good), that is diabolical vices of culture and moral law) --for then it would at the same time morally evil in others.

Sample 7: Kant, N=4, topic control on

He then made it his maxim--if one thinks of his action as consciously springing from the moral capacity of them out of the way of those vices. "] 2 [45] [This paragraph added in the light of that propensity or the sake of good morals (bene moratus) and to display in its character a certain insidiousness1 of the human heart.

As sample 7 shows, there is still a moderate amount of incoherence in the text produced, but at first glance it looks plausible. Of course, this is partly because Kant is relatively incomprehensible in the first place. Consequently this was a relatively forgiving domain.

The topic control had a significant effect for small N, but as N grew large enough the training data itself gave enough context to prevent egregious non sequiturs. Consider sample 8, in comparison with sample 7:

Sample 8: Kant, N=4, no topic control

We are accountable, however, we are at once good and evil must consist in maxims of an effect from its first two stages (those of frailty and impurity; and this is objectively the condition whereby alone the wish for happiness can square with legislative reason—therein consists the whole use of freedom in experience (in earliest youth as far back as birth), 1 since otherwise still another maxim would have to be adduced in which this disposition must be made a devilish being.

Kant is almost as plausible without the topic control as with it by the time N=4. This is partly because the high N ensures that the text is locally plausible, and that is the main thing people notice. However, topic control made a significant difference for smaller N.

Sample 9: Kant, N=2, no topic control

The postulate in question the severity of human nature, there is none righteous (and of its opposite is to be an effect of morality; (3) Mysteries, and must give place to the question arises from the good (though almost imperceptibly) forges in a final judgment of the performance of its object (the law according to which it is morally good in all its neighbors.

The text produced for Kant with N=2 and topic control on was almost as qualitatively good as that produced with N=4, but as sample 9 shows, without topic control there is no sense of continuity for such a low N. The result is a mass of awkward transitions that are avoided in sample 8 by virtue of the large N, and in sample 5 by the influence of the topic control.

Conclusions

Although this adapted n-gram model is relatively simple conceptually, it can be quite effectively used in imitating training text. As N grows large, of course, it becomes a mere parrot, which isn’t nearly as interesting. However, for intermediate N it can produce relatively convincing, and novel, text.

Topic control does not have a significant effect on perceived quality as N grows large. With moderately large N (e.g. N=4), this is because the program tends to use chunks of words from the same source, and only switches at natural transitions. For even larger N, it produces perfect but uninteresting text, as noted above, and so has no need of topic control. However, the topic control is extremely effective in preventing an immoderate number of non sequiturs in the text produced by Mockingbird.