Interpreting "black-box" systems like neural language models has, deservedly, garnered much attention in natural language processing and computational linguistics. I would like to suggest that there is, in fact, another black box that deserves equal investigation: the training data. Linguistics tells us that language is a system of competing processes and constraints, with human linguistic behavior reflecting a particular mix of these considerations. The current (tacit) assumption in the field is that the data models are trained on is representative of this mix. In this talk, I will draw on cross-linguistic investigations to illustrate two cases where the nature of linguistic data is, in fact, not in accordance with the human-like behavior we want models to demonstrate. Ultimately, I hope to show that developing a more nuanced understanding of the nature of our data, in tandem with our models, leads to richer accounts of model behavior. I believe this approach highlights the productive intersection of natural language processing and linguistics.
Forrest Davis is a fourth-year PhD candidate at Cornell University, advised by Marten van Schijndel. His work focuses on how linguistic data relates to linguistic representations and behavior in NLP models and humans.