Self-supervised learning objectives are highly effective at pretraining language models (LMs) for various tasks. In this talk, we first show that self-supervised objectives are misaligned with human preferences in many important ways: LMs trained on internet text generate misinformation, offensive jokes, and personal contact information, and are highly sensitive to the conditioning text (“prompt”). Next, we show that LM-based classifiers are effective at predicting which texts humans prefer, making it possible to use such classifiers as a learning signal to automatically correct the LM. We showcase this approach by training a high-quality retrieval system, which obtains strong performance across a variety of tasks using Retrieval-Augmented Generation (RAG). Even after such training, some undesirable behaviors may remain undetected. We therefore go a step further and use other LMs to generate inputs that elicit undesirable behaviors from the LM, in order to preemptively catch and fix such behaviors. Overall, we find that some of the most powerful tools for aligning LMs with human preferences are LMs themselves.
Ethan Perez is a fourth- and final-year Ph.D. student in Natural Language Processing at New York University. He is advised by Kyunghyun Cho and Douwe Kiela, funded by the NSF and Open Philanthropy, and will be joining Anthropic as a Research Scientist after graduation. His research aims to reduce the risk of catastrophic outcomes from machine learning systems. Previously, he spent time at DeepMind, Facebook AI Research, the Montreal Institute for Learning Algorithms, and Google. He earned a Bachelor’s degree from Rice University, where he was named the Engineering department’s Outstanding Senior.