While work in NLP has increasingly moved beyond the domains of news and Wikipedia to focus on literary texts, the availability of public domain collections like Project Gutenberg has still concentrated attention on the works of a small set of historical authors. In this talk, I'll explore work to capture the diversity of representation in contemporary literature, and the challenges and opportunities that this expanded focus brings with it. I'll highlight work creating a dataset of novels published between 1923 and 2020, annotated for a variety of NLP tasks, and models of referential gender that align characters in fiction with the pronouns used to describe them (he/she/they/xe/ze/etc.) rather than inferring an unknowable gender identity.
David Bamman is an associate professor in the School of Information at UC Berkeley, where he works in the areas of natural language processing and cultural analytics, applying NLP and machine learning to empirical questions in the humanities and social sciences. His research focuses on improving the performance of NLP for underserved domains like literature (including LitBank and BookNLP) and exploring the affordances of empirical methods for the study of literature and culture. Before Berkeley, he received his PhD from the School of Computer Science at Carnegie Mellon University and was a senior researcher at the Perseus Project of Tufts University. Bamman's work is supported by the National Endowment for the Humanities, the National Science Foundation (including an NSF CAREER award), and an Amazon Research Award.