In this talk, I will describe my group’s recent work on auditing and understanding large language models. First, I will discuss data watermarks, a statistically rigorous technique for auditing a language model’s training data using only black-box model queries. Then, we will investigate how language models memorize training data: drawing on results from two complementary benchmarks, I will show that memorized data can be localized to a sparse subset of neurons. Next, we will turn to the question of how Transformers perform in-context learning: I will argue that they do not learn to implement gradient descent, but rather approximate a higher-order optimization method. Finally, I will provide a mechanistic account of how pre-trained language models use Fourier features to solve arithmetic problems, and of the critical role that pre-training plays in these mechanisms.
Robin Jia is an Assistant Professor of Computer Science at the University of Southern California. He received his Ph.D. in Computer Science from Stanford University, where he was advised by Percy Liang. He has also spent time as a visiting researcher at Facebook AI Research, working with Luke Zettlemoyer and Douwe Kiela. He is interested broadly in natural language processing and machine learning, with a focus on scientifically understanding NLP models in order to improve their reliability. Robin’s work has received best paper awards at ACL and EMNLP.