Gender Classification of Literary Works

 

By Zoe Abrams, Mark Chavira, Dik Kin Wong

CS224 Final Report

 

Abstract:  Our project uses machine learning techniques to classify literary works according to the genders of their authors.  The NLP techniques employ four methods of feature selection and three variants of Naïve Bayes.  Although not our primary focus, we also applied the same techniques to classify the works according to the nationalities of their authors, either American or English.

1 Introduction and Related Work

The question of whether or not one can determine an author’s gender from his or her writing is a longstanding controversy.  Virginia Woolf, one of the authors from our corpus, states that truly great writing is androgynous:

 

“The very first sentence that I would write here, I said, crossing over to the writing-table and taking up the page headed Women and Fiction, is that it is fatal for any one who writes to think of their sex.  It is fatal to be a man or woman pure and simple; one must be woman-manly or man-womanly… Some collaboration has to take place in the mind between the woman and the man before the act of creation can be accomplished.  Some marriage of opposites has to be consummated.  The whole of the mind must lie wide open if we are to get the sense that the writer is communicating his experience with perfect fullness.”

 

According to Woolf, great writers are able to transcend the boundaries of their gender and write works that are not characteristically male or female, but that unmistakably capture the human experience.  Others think there are inherent gender differences that necessarily influence any writer’s work.

 

Most likely, there are some authors who write characteristically for their gender and others whose writing is genderless.  But is there concrete evidence in either direction?  And although there are some works that are steeped in a “gendered” perspective, are there less blatant distinctions that can be made?  Consider many of the female writers in our corpus who published under male aliases, such as Robert Burns and George Eliot, because women’s works were not considered for publication during the time they were writing.  Is the femininity in their choice of words so subtle that it cannot be detected?

 

This is a heavily explored topic.  In the Socrates library search engine at Stanford University there is an “Authorship Sex Differences” subject heading with over 100 entries.  All of these entries are expository.  Not a single entry is scientific or experimental.  Scientific inquiry into this topic may provide evidence that the presence of personal identity in writing is inescapable.  The successful application of NLP techniques would not only inform our understanding of how identity is expressed, but also be an example of computation providing insight into our social reality.

 

2 Data

It was not easy to find a data set, and even harder to find one that was labeled.  In the end, we found a website that contained the text of many books and downloaded them.  We then hand-classified them and performed initial experiments with this data set, which we called “MultAuth.”  MultAuth consists of most of the literary works in the Project Gutenberg database.  Project Gutenberg aims to provide digitized versions of great works from various fields, in the same way that a public library provides hard copies of these works.  See http://promo.net/pg for more details on Project Gutenberg.  The MultAuth data set divides as shown in Table 1.

 

File Category      Number of Documents
American Women     61
American Men       113
English Women      19
English Men        222
Total              415

 

Table 1: The MultAuth Data Set

Appendix A shows the full list of titles and authors.  Project Gutenberg transcribed these works from original book editions.  All of the documents included headers describing Project Gutenberg, which we stripped.  Book genres include poetry collections, short stories, and novels (the majority).  Project Gutenberg publishes books that are in the public domain, so most are classics written by authors from the 19th and early 20th centuries.  In some cases, a single author corresponds to many works.  For example, Charles Dickens wrote 18 out of the 222 works by English Men.  Because large amounts of data from a single author can skew results, we created a second data set, which limited each author to one work.  Table 2 shows the breakdown of this data set, which we called “SingAuth.”

 

File Category      Number of Documents
American Women     28
American Men       62
English Women      14
English Men        96
Total              200

 

Table 2: The SingAuth Data Set

3 Methods

Our program takes categorized files as input and does the following:

 

1.       Generates 1000 features to use for training and classification.

2.       Produces a feature vector for each document.  Each vector contains 1000 entries, one for each feature.  The value in entry e of vector v is the number of times feature e occurs in document v.

3.       Sends the set of feature vectors to a classifier, which trains on a subset of the vectors and attempts to classify the remaining vectors.  The classifier uses ten-fold cross validation.  That is, it randomizes the vectors, partitions them into ten equal-sized subsets, and iterates ten times, each time using a different subset as the test data, and the remaining subsets as the training data.
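The steps above can be sketched in a few lines of Python.  This is a minimal illustration rather than our actual program: the names feature_vector and ten_fold_splits, the three-feature list, and the whitespace tokenization are all assumptions made for the example.

```python
import random
from collections import Counter


def feature_vector(tokens, features):
    """Count how often each selected feature occurs in one document."""
    counts = Counter(tokens)
    return [counts[f] for f in features]


def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # randomize the vectors
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]                           # one fold held out
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Hypothetical three-feature example (the real runs use 1000 features).
features = ["she", "the", ";"]
vec = feature_vector("she saw the boat ; the boat sank".split(), features)

# 200 documents, as in the SingAuth data set.
splits = list(ten_fold_splits(200))
```

With 200 documents, each of the ten iterations trains on 180 vectors and tests on the remaining 20.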

 

We ran many tests, each assigning a different set of values to various parameters.  The following subsections describe the parameters.  Refer to Appendix B for an exhaustive list of our results.

3.1 Feature Selection

Our features are counts of words and of keyboard symbols such as “;”, “,”, “<”, and “#”.  We used four different techniques to select one thousand features from the corpus.  One thousand dimensions seemed appropriate: the number is large enough to yield meaningful results and small enough not to overload our computational resources.  We were able to use a large number of features, since Naïve Bayes does not suffer from the curse of dimensionality in the way that other machine learning algorithms do.  In each of the techniques, we considered only features that occurred three or more times in the corpus.  The four techniques follow:

 

  1. Pointwise Mutual Information: For each (feature, class) pair, generate a point-wise mutual information value.  Choose the 1000 features corresponding to the 1000 greatest values generated.
  2. Average Mutual Information: For each feature, compute an average mutual information value.  Choose the 1000 features yielding the 1000 greatest Average Mutual Information Values.
  3. Chi Squared: For each (feature, class) pair, generate a Chi-Squared value.  Choose the 1000 features corresponding to the 1000 greatest values generated.
  4. Random: Choose 1000 features randomly.
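As an illustration of the first technique, the sketch below scores (feature, class) pairs and keeps the best ones.  The exact estimator is an assumption: here PMI is taken as log2(P(f|c) / P(f)) with token-level probabilities, treating each class as one mass of text.  The function names, the toy corpus, and the choice to list a feature only once when it scores highly for two classes are likewise assumptions made for the example.

```python
import math
from collections import Counter


def pmi_scores(class_tokens, min_count=3):
    """Return {(feature, class): PMI} from token counts per class."""
    class_counts = {c: Counter(toks) for c, toks in class_tokens.items()}
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    n_total = sum(total_counts.values())
    scores = {}
    for c, counts in class_counts.items():
        n_class = sum(counts.values())
        for f, n_fc in counts.items():
            if total_counts[f] < min_count:       # ignore rare features
                continue
            p_f_given_c = n_fc / n_class
            p_f = total_counts[f] / n_total
            scores[(f, c)] = math.log2(p_f_given_c / p_f)
    return scores


def top_features(scores, k):
    """Take the k distinct features with the greatest PMI values."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    chosen = []
    for (f, _c), _score in ranked:
        if f not in chosen:
            chosen.append(f)
        if len(chosen) == k:
            break
    return chosen


# Toy corpus; the real runs select 1000 features.
class_tokens = {
    "women": "she she she her her the".split(),
    "men": "he he he his the the".split(),
}
scores = pmi_scores(class_tokens)
best = top_features(scores, 2)
```

In this toy corpus “she” is twice as frequent in the women's text as in the corpus overall, so its PMI for the women class is log2(2) = 1.  Chi-Squared selection would follow the same pattern with a different scoring function.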

3.2 Second-Level Pointwise Mutual Information

When calculating mutual information, the files that make up a class are considered as one entire mass of text.  If a word, say “Oz” from L. Frank Baum’s The Wizard of Oz, is used extremely frequently within a single book, then the learner considers it representative of the class as a whole even though it is not.  Our objective is to choose features that distinguish among classes but not among books within a class.  We therefore implemented a way of eliminating these book-specific features.  Our approach uses second-level applications of Pointwise Mutual Information.  Consider categorizing all American works by gender.  First, we apply Pointwise Mutual Information three times, each time generating a list with considerably more than 1000 features sorted by their mutual information values, using the following parameters:

 

  1. Use the American data set and use gender as the category.
  2. Use the American men as the data set and define a category for each American book by a male author.
  3. Use the American women as the data set and define a category for each American book by a female author.

 

We now have three lists of features.  The first list contains features that distinguish between men and women in the entire data set.  The second list allows us to distinguish among male books.  The third list allows us to distinguish among female books.  The idea is to remove features at the tops of lists 2 and 3 from list 1 and then take the top 1000 features remaining in list 1.

 

However, there is more we must consider.  Consider the following scenario:

 

1.       "apple" is at the top of list 1, the American list, because it identifies women well.

2.       "apple" is at the top of list 2, the Men list, since it exists in only a single male book.

3.       "apple" is nowhere near the top of list 3, the Women list, since lots of female books contain "apple".

 

Even though "apple" is at the top of list 2 (the male list), we should not remove it from list 1 (the American list), since the feature identifies female books well.  Our program handles this case by keeping track of why each feature in list 1 is in list 1 and removing it only when appropriate.  For example, the program removes "apple" if and only if:

 

("apple" is in list 1, the American list, because it identifies men well) and

("apple" is at the top of list 2, the Men list)

 

or

 

("apple" is in list 1, the American list, because it identifies women well) and

("apple" is at the top of list 3, the Women list)

 

When producing list 1, the American list, Pointwise Mutual Information tells us which category each feature identifies.  Therefore, this technique applies to Pointwise Mutual Information, but not as easily to Average Mutual Information.  It is also possible to apply the technique to Chi-Squared, since Chi-Squared also identifies the class of each feature, but we only ran tests with Pointwise Mutual Information.
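The removal rule can be sketched as follows.  This is an illustration under stated assumptions, not our program: the function name second_level_filter is hypothetical, list 1 is represented as (feature, class) pairs sorted best first, and what counts as “the top” of lists 2 and 3 is left to the caller, since the cutoff is a free parameter.

```python
def second_level_filter(list1, top_by_class, k):
    """Keep the first k features of list 1 that are not book-specific.

    list1:        (feature, class) pairs sorted by first-level PMI, best
                  first; the class is the category the feature points to.
    top_by_class: {class: set of features at the top of that class's
                  per-book list (list 2 or list 3)}.
    """
    kept = []
    for feature, cls in list1:
        # Remove a feature only if it is book-specific within the same
        # class that it identifies in list 1.
        if feature in top_by_class.get(cls, set()):
            continue
        kept.append(feature)
        if len(kept) == k:
            break
    return kept


# Toy data reproducing the "apple" scenario described above.
list1 = [("apple", "women"), ("oz", "men"), ("she", "women"), ("war", "men")]
top_by_class = {"men": {"apple", "oz"}, "women": {"marianne"}}
selected = second_level_filter(list1, top_by_class, k=3)
```

Here “apple” survives because it points to women and is book-specific only among the men's books, while “oz” is dropped because it points to men and is also at the top of the men's list.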

 

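In short, our algorithm prefers a feature with the following two characteristics: first, it distinguishes among classes; second, it does not distinguish among books within a class.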

 

 

The first level of our implementation picks out those features with the first characteristic and the second level of our implementation removes those features without the second characteristic. The relative importance of these two characteristics is an open question which deserves exploration.

3.3 Balancing the Number of Features from Each Class

When using Pointwise Mutual Information, our program was performing extremely poorly on certain experiments.  Specifically, in one of our experiments, we attempted to classify English texts according to the gender of the author.  Our English texts were overwhelmingly male.  Nevertheless, our classifier was classifying most of them as female.  Upon examining the features more closely, we observed that many of the features pointed towards the female category.  Because there were fewer women writers, the features from women’s literature were obtaining higher mutual information rankings since the probability space was more confined and therefore less distributed across many words.  To improve performance results, we modified the program to allow "balancing" of the feature set.  We could then specify how many male features to select and how many female features to select.  On the problematic data sets, this change helped performance.
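A minimal sketch of balancing, assuming the ranked (feature, class) pairs from Pointwise Mutual Information are already available.  The function name balanced_selection, the toy ranking, and the quota representation are assumptions made for the example.

```python
def balanced_selection(ranked_pairs, quota):
    """Pick features in rank order until each class's quota is filled."""
    chosen = {cls: [] for cls in quota}
    for feature, cls in ranked_pairs:
        if cls in chosen and len(chosen[cls]) < quota[cls]:
            chosen[cls].append(feature)
    return chosen


# Toy ranking dominated by female features, as in the English data.
ranked_pairs = [
    ("marianne", "women"), ("she", "women"), ("her", "women"),
    ("his", "men"), ("dear", "women"), ("war", "men"),
]
picked = balanced_selection(ranked_pairs, {"men": 2, "women": 2})
```

Without the quota, the top four features in this toy ranking would include three female features; with it, each class contributes two.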

3.4 Naïve Bayes

We ran our tests using three variants of Naïve Bayes: Unomial, Multi-Variate Bernoulli, and Multinomial.  Suppose we wish to calculate the probability that a document d belongs to a class c.  Let F be the set of features.  Naïve Bayes computes the probability according to the following formula:

 

P(c|d) = P(c) * product_f_in_F [P(f|c)]

 

The three variants differ only in how they compute P(f|c).  Assuming n is the number of times f occurs in d, that #(f) is the number of times f occurs in the training data in class c and that #(F) is sum_f_in_F (#(f)), the three variants compute P(f|c) as follows:

 

Variant        P(f|c) if n = 0        P(f|c) if n > 0
Unomial        1                      #(f)/#(F)
Bernoulli      (#(F) - #(f))/#(F)     #(f)/#(F)
Multinomial    1                      (#(f)/#(F))^n

 

The classifiers use Laplace smoothing to eliminate zero probabilities.
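The three variants can be sketched as below, working in log space to avoid underflow.  Several details are assumptions rather than a description of our classifier: the smoothing is taken to be add-one, (#(f) + 1) / (#(F) + |F|); a feature counts as present in a document when its count n is greater than zero; and the function names and toy counts are hypothetical.

```python
import math


def smoothed(count_f, count_all, n_features):
    """Add-one (Laplace) estimate of #(f)/#(F); the exact form is assumed."""
    return (count_f + 1) / (count_all + n_features)


def log_score(doc_counts, class_counts, prior, variant):
    """Return log P(c) plus the sum over features of log P(f|c)."""
    count_all = sum(class_counts)
    n_features = len(class_counts)
    total = math.log(prior)
    for n, count_f in zip(doc_counts, class_counts):
        p = smoothed(count_f, count_all, n_features)
        if n > 0:
            # Multinomial raises p to the power n; the others use p once.
            total += n * math.log(p) if variant == "multinomial" else math.log(p)
        elif variant == "bernoulli":
            total += math.log(1.0 - p)   # (#(F) - #(f)) / #(F)
        # Unomial and Multinomial contribute a factor of 1 when n == 0.
    return total


def classify(doc_counts, counts_by_class, priors, variant):
    """Return the class with the highest score."""
    return max(
        counts_by_class,
        key=lambda c: log_score(doc_counts, counts_by_class[c], priors[c], variant),
    )


# Toy training counts for the features ["she", "his", "war"].
counts_by_class = {"women": [8, 1, 1], "men": [2, 6, 4]}
priors = {"women": 0.5, "men": 0.5}
doc = [3, 0, 0]   # "she" occurs three times; the other features are absent
labels = {v: classify(doc, counts_by_class, priors, v)
          for v in ("unomial", "bernoulli", "multinomial")}
```

On this toy document all three variants choose the women class; they differ in how strongly the repeated “she” and the absent features move the score.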

4 Testing

We divided the data set into seven subsets and performed experiments with each, using ten-fold cross-validation:

 

1.       AllGen – all files categorized according to gender.

2.       AmerGen – files written by American authors categorized according to gender.

3.       EngGen - files written by English authors categorized according to gender.

4.       AllGenNat – all files categorized as either female American, female English, male American, or male English author.

5.       AllNat – all files categorized according to nationality.

6.       MaleNat – files written by male authors categorized according to nationality.

7.       FemaleNat – files written by female authors categorized according to nationality.

 

In addition, for each of the seven subsets, we used several feature selection methods and several variants of the Naïve Bayes learner/classifier discussed above.

5 Results

We calculated baseline scores for each experiment by always choosing the category corresponding to the highest prior probability.  For instance, in the “SingAuth” data, there are 158 male authors and 42 female authors.  An algorithm that always chooses men would be correct 79% of the time, so the baseline is 79%.  We were able to outperform baseline results.  We first list results for the original MultAuth data set.  We then list results for the SingAuth data set.
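The baseline calculation can be reproduced from the SingAuth counts in Table 2.  The sketch below is illustrative; the names SINGAUTH, regroup, and majority_baseline are introduced only for this example.

```python
# Document counts from Table 2 (SingAuth), keyed by (nationality, gender).
SINGAUTH = {
    ("American", "Women"): 28,
    ("American", "Men"): 62,
    ("English", "Women"): 14,
    ("English", "Men"): 96,
}


def regroup(counts, label_of):
    """Total the document counts under a coarser labeling."""
    grouped = {}
    for key, n in counts.items():
        label = label_of(key)
        grouped[label] = grouped.get(label, 0) + n
    return grouped


def majority_baseline(counts):
    """Accuracy of always guessing the most frequent category."""
    return max(counts.values()) / sum(counts.values())


all_gen = majority_baseline(regroup(SINGAUTH, lambda key: key[1]))  # by gender
all_nat = majority_baseline(regroup(SINGAUTH, lambda key: key[0]))  # by nationality
all_gen_nat = majority_baseline(SINGAUTH)                           # four-way
```

For the gender split this gives 158/200 = 0.79, the 79% figure quoted above.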

5.1 MultAuth Data Set

Subset / Category Set    Percentage Categorized Correctly    Baseline
AllGen                   48.21                               79
AmerGen                  74.1379                             69
EngGen                   27.3859                             87
AllGenNat                53.494                              48
AllNat                   61.2048                             55
MaleNat                  46.5672                             61
FemaleNat                82.5                                67

Table 3: MultAuth Results

 

For the MultAuth data set, we used only Pointwise Mutual Information and the Unomial variant of Naïve Bayes.  We used neither second-level feature selection nor balancing.  As Table 3 demonstrates, the results were poor.  Three subset/category sets (AllGen, EngGen, and MaleNat) performed below baseline.  Our extremely low performance was due in part to the existence of multiple books by a single author.  This condition led to the use of features that relied heavily on specific authors and were not good indicators for works written by other authors in the same category.

5.2 Sampling of SingAuth Data Set With Neither 2nd Level Feature Selection Nor Balancing

Figure 1: Multinomial Naive Bayes Results

 

Figure 2: Unomial Naive Bayes Results

 

Figure 1 and Figure 2 show some of our results for the SingAuth data set.  These results use neither second level feature selection, nor balancing.  We omit the graph for the Bernoulli variant of Naive Bayes, because it performs almost identically to the Unomial variant.  In general, our best results outperformed the baseline by roughly 20%.

 

The AllGenNat experiment performed worst because it is the most difficult problem.  Since it is not a binary categorization, one would expect lower performance.  However, even in this data/category set, we improved on the baseline significantly.

 

Most of the time, Multinomial performs worse than Unomial.  This result is surprising, since Multinomial is generally considered to give slightly better results than Unomial.  One reason that Unomial may have outperformed Multinomial is that our counts are not normalized for document length.  Because our data is skewed, the presence of a word is perhaps more meaningful than the number of occurrences of that word.  The Multinomial method is sensitive to counts, while the Unomial method is sensitive only to the presence or absence of a feature.  (We would have liked to try normalizing counts, but we ran out of time.)

 

As expected, Random feature selection performed approximately 10% below the average of the other methods.  It performed above baseline because it still used priors and drew additional information from 1000 features, just not the most informative ones.

 

Average mutual information performed better than pointwise.  This result owes to the difficulties with pointwise mutual information ‘pointing’ to the women category.  Average Mutual Information helps ensure that there are features that point to each class in the set of classes.  Maximizing the average finds features that tend to be high in all categories rather than just one, so strong performance in the women’s category does not dominate as strongly.

 

5.3 Sampling of SingAuth Results with 2nd Level Feature Selection

Figure 3: Second Level Feature Selection Results

Figure 3 shows some of our results using the SingAuth data set and 2nd Level Pointwise Mutual Information feature selection.  From this figure, we can clearly see how the second-level Mutual-Information helped in one problematic case.  For the case of classifying the gender of English authors with the Unomial classifier, the second level technique improved the results from 48% to 81%.  There are two reasons.  First, because there are few English women, a single-level feature selection technique is more likely to select a feature that is used often in a single English/Woman work but which is not representative of the class as a whole.  By using our second level feature selection scheme, we reduce the chance of using such a feature.  Second, since the Unomial classifier considers only the presence of a word but not the number of occurrences, it amplifies the detrimental effect of the non-representative feature.  Multinomial compensates for the use of the non-representative feature by amplifying the effects of the good features.  It is noteworthy that the second level technique successfully removes a lot of the proper names from the English/Women category, including Marianne, Rachel, Derrick, Julius, Walter, Helen, Elizabeth, and Ellen.

 

5.4 Sampling of SingAuth Results with Both 2nd Level Feature Selection and Balancing

 

 

Figure 4: Second Level and Balancing Feature Selection Results

 

Figure 4 shows some of our results using the SingAuth data set, 2nd Level Pointwise Mutual Information feature selection, and balancing.  Balancing the number of "men" and "women" features greatly improved the performance of classifying English authors according to gender using the Unomial classifier, but for a different reason.  Without balancing, most of the features selected were female features, because the small number of English women writers tended to make the mutual information scores of the women features large.  By balancing the features in our feature selection scheme, our program picked more male features than it would have without balancing.  The Unomial classifier was more sensitive to the "woman bias" problem, so its performance improved the most.

 

Appendix B lists many of our results in more detail, including results achieved using different combinations of the feature selection techniques.  In these results, we also forced the use of different proportions (other than 1/2, 1/2) of male and female features, hoping to find a good distribution.

5.5 Observations

To give a sampling of the types of features our feature selection methods chose, we list below some of the features that Pointwise Mutual Information chose for the American Gender tests.

 

List of Features Pointing to Women: she, her, "‘", ".", "“", t, I , you, Duane, s (possessive), Linda, Jo, it, misses, Ruth, little, had, has, have, was, when , ? , if, bud, David , an, Meg, with, Don, Amy, dear, sort, think, beauty, beautiful, lovely, wonderful, handsome, competition, music, dances, work, mind, chess, girlish, married, marriage, child, children, gun, friends, morbid, jealous, supercilious, savory, charming, bitter, pleasant, anger, forgive, tomorrow, yesterday, time, sometime, teatime, re, relationship, maybe, perhaps.

 

List of Features Pointing to Men: the, of, "-", ";", his, "--", in, de, we, "!", ",", by, und, he, <, >, man, upon, believe, kill, killed, skill, competitor, warriors, warrior, war, mystery, police, science, musical, art, dance, money, toil, god, rational, sex, us, spirit, spiritual, religion, theology, baseball, football, grave, unsavory, sagacious, rage, furious, organized, systematic, prayer, now, immediately, memory, times, sport , mayhap, verily, anon, peradventure, methinks, ocean, boat, ship, oar, mast, sail.

 

These lists reinforce many gender stereotypes.  There are topical distinctions.  Women write about topics that use words such as marriage, children, cooking, and kitchen.  Men write about topics that use words such as war, science, sport, and money.   Women use words that reflect on time passing while men are in the ‘here’ and ‘now’.  The presence of “he” and “his” in the men's list and “she” and “her” in the women’s list suggests authors write more about characters of their own gender. 

 

There are some trends that reflect less overt differences.  The presence of “m”, “d”, the single quote, and “re” in the women’s list suggests that women use contractions more often.  The solitary “s” likely reflects frequent use of the possessive.  The presence of “.” may also indicate that women use shorter sentences, since a larger percentage of tokens are periods in the female sets.  This conclusion is reinforced in the men's list by the presence of many punctuation symbols that tend to elongate sentences, such as “;” and “,”.  (Adding average sentence length and average word length as features might be worthwhile in future work.)  There are more proper names in the women’s list, perhaps reflecting that women initially tended to use the novel format, or that women focus more on the main character of their novels by directly referring to them in the third person.  Women use the present participle often, whereas there is not a single present participle in the first 1000 features on the men’s list.

 

Our data set is certainly not perfect.  The list of Men’s features has more antiquated words (e.g. “anon” and “methinks”).  This is likely the outcome of a data set that is especially dominated by men in the earlier works.  The few women writers are from after the 18th century because women authors were considered more acceptable, and publishable, as time progressed.

6 Ideas For Future Research

There were several possible extensions that we did not have time to implement.  We considered varying the number of features used and the types of features used.  We could have eliminated all capitalized words to purge the feature list of proper names that do not reflect the class.  Different features might have been included, such as average word length, average sentence length, and grammatical sentence structures.

 

We could have normalized the counts according to document length.  Alternatively, we might have taken the first N words from every document or a random sampling of N words.  Another consideration concerning the data is the presence of different format genres.  For example, we might have improved performance if we eliminated poetry from our data set.

 

Our data set included older works.  It would be interesting to see if distinctions based on gender have blurred more recently.

7 Conclusion

There are many further ideas for exploration in this area.  Essentially, in our project, we demonstrated that, for our data set, there are differences between male and female writing and between English and American writing.  Although our data set is not ideal, our results provide evidence that there are inherent differences in general.  However, our data set works with older literature.  Experiments with more recent literature might provide insight into how gender differences have changed over the years.


Appendix A: Data Set

 

A roughly formatted list of our MultAuth data set follows.  The SingAuth data set simply chooses one book per author from the MultAuth data set.

 

American Literature

 

the second book of modern verse ed rittenhouse

the little book of modern verse ed rittenhouse

little women by louisa may alcott

flower fables by louisa may alcott

the story of a bad boy by thomas bailey aldrich

cast upon the breakers by horatio alger

frank s campaign/farm & camp horatio alger jr

the scouts of the valley by joseph a altsheler

fantastic fables by ambrose bierce

the secret garden by frances hodgson burnett

extracts from adam s diary by mark twain

life on the mississippi by mark twain

tom sawyer detective mark twain

a horse s tale by mark twain

man that corrupted hadleyburg by mark twain

the pathfinder by james fenimore cooper

life in the iron mills by rebecca harding davis

miss civilization by richard harding davis

vera the medium by richard harding davis

the reporter who made himself king by davis

culprit fay and other poems joseph rodman drake

the damnation of theron ware by harold frederic

the market place by harold frederic

copy cat & other stories by mary wilkins freeman

the yates pride by mary e wilkins freeman

the yellow wallpaper by charlotte perkins gilman

herland

the ways of men by eliot gregory

worldly ways and byways by eliot gregory

selected stories by bret harte

chita a memory of last island by lafcadio hearn

the altar of the dead by henry james

the figure in the carpet by henry james

an international episode by henry james

the lesson of the master by henry james

roderick hudson by henry james

the death of the lion by henry james

the country of the pointed firs sarah orne jewett

select poems of sidney lanier ed callaway

the breitmann ballads by charles g leland

blix by frank norris

moran of the lady letty by frank norris

the burial of the guns by thomas nelson page

the gentle grafter by o henry

heart of the west by o henry

roads of destiny by o henry

howard pyle s book of pirates

twilight land by howard pyle

initials only by anna katharine green

the woman in the alcove by anna katharine green

charlotte temple by susanna rowson

poems patriotic religious etc by father ryan

the lady or the tiger? by frank r stockton

rudder grange by frank r stockton

monsieur beaucaire by booth tarkington

penrod and sam by booth tarkington

the turmoil a novel by booth tarkington

beauty and the beast etc by bayard taylor

fisherman s luck by henry van dyke

the ruling passion by henry van dyke

ben hur a tale of the christ by lew wallace

the birds christmas carol kate douglas wiggin

a cathedral courtship by kate douglas wiggin

the diary of a goose girl by wiggin

new chronicles of rebecca by kate douglas wiggin

the old peabody pew by kate douglas wiggin

penelope s experiences in scotland by wiggin

penelope s postscripts by kate douglas wiggin

penelope s irish experiences by kate d wiggin

rose o the river by kate douglas wiggin

story of waitstill baxter by kate d wiggin

the village watch tower by kate douglas wiggin

the jimmyjohn boss and other stories by wister

lady baltimore by owen wister

lin mclean by owen wister

padre ignacio by owen wister

the outlet by andy adams

winesburg ohio by sherwood anderson

dorothy and the wizard in oz by l frank baum

the enchanted island of yew by l frank baum

the emerald city of oz l frank baum

glinda of oz by l frank baum

the lost princess of oz by baum

rinkitink in oz by l frank baum

the magic of oz by l frank baum

ozma of oz by l frank baum

the patchwork girl of oz by l frank baum

the road to oz by l frank baum

the scarecrow of oz by l frank baum

tik tok of oz by l frank baum

the tin woodman of oz by baum

the agony column by earl derr biggers

the land that time forgot by burroughs

the outlaw of torn by edgar rice burroughs

tarzan the untamed by edgar rice burroughs

out of time s abyss edgar rice burroughs

pigs is pigs by ellis parker butler

alexander s bridge by willa cather

my antonia by willa cather

song of the lark willa cather

cobb s anatomy by irvin s cobb

a plea for old cap collier by irvin s cobb

speaking of operations by irvin s cobb

the financier by theodore dreiser

lahoma by john breckinridge ellis

songs for parents by john farrar

emma mcchesney & co by edna ferber

betty zane by zane grey

the call of the canyon by zane grey

the last of the plainsmen by zane grey

the lone star ranger by zane grey

the spirit of the border by zane grey

to the last man by zane grey

wildfire by zane grey

the young forester by zane grey

a heap o livin by edgar a guest

just folks by edgar a guest

trees and other poems joyce kilmer

keziah coffin by joseph c lincoln

on the makaloa mat/island tales jack london

smoke bellew by jack london

men women and ghosts by amy lowell

sword blades and poppy seed by amy lowell

the haunted bookshop by christopher morley

where the blue begins by christopher morley

a mountain woman by elia w peattie

painted windows by elia w peattie

just david by eleanor h porter

freckles by gene stratton porter

her father s daughter by gene stratton porter

the vision splendid by william macleod raine

lavender and old lace by myrtle reed

the poisoned pen by arthur b reeve

the amazing interlude by mary roberts rinehart

bab a sub deb by mary rober