This talk is part of the NLP Seminar Series.

Sources of Variance in Pretraining and Finetuning LLMs

Naomi Saphra, NYU CILVR
Date: 11:00am - 12:00 noon PT, Jun 23 2022
Venue: Zoom (link hidden)


You have engaged in the very modern practice of transfer learning. You pretrained a model on a self-supervised objective, then you finetuned it once on a downstream task, and you found excellent performance on the test set. “Aha”, you say. “I found a good pretraining procedure.” Did you? You try finetuning again. The results are terrible! “Aha”, you say. “I found a bad finetuning procedure.” Did you?

The random seeds for both the pretraining and finetuning stages have a substantial influence on the outcome of training. This talk will address, first, the influence that a pretraining seed has on both in-domain and out-of-distribution (OOD) performance. Exploring the variance that results from pretraining is often computationally prohibitive, but recent results are highly suggestive of the strong role played by randomness during pretraining. We will then address the difference between finetuning seeds. Much variation in OOD generalization can be ascribed to where the finetuning initialization directs SGD trajectories. In particular, we discuss how to predict generalization behavior in a finetuned model, based on topographic properties of its region of the loss surface. By understanding the degree of influence that random seeds have on performance, we can fairly evaluate a robust training procedure, rather than a single set of parameters. By understanding the mechanism of that influence, we can go further by developing improved training methods.


Naomi is currently a postdoctoral researcher at NYU CILVR with Kyunghyun Cho and works part time at MosaicML. Their interests relate to NLP learning dynamics: how models learn to encode linguistic patterns or other structure, and how we can encode useful inductive biases into the training process. Previously, Naomi earned a PhD from the University of Edinburgh, worked at Google and Facebook, and attended Johns Hopkins and Carnegie Mellon University. Outside of research, they play roller derby under the name Gaussian Retribution, do standup comedy, and shepherd disabled programmers into the world of code dictation.