This talk is part of the NLP Seminar Series.

Aligning Language Models with Human Preferences

Ethan Perez, NYU
Date: 11:00am - 12:00 noon PT, Apr 28 2022
Venue: Zoom (link hidden)

Abstract

Self-supervised learning objectives are highly effective at pretraining language models (LMs) for various tasks. In this talk, we first show that self-supervised objectives are misaligned with human preferences in many, important ways; LMs trained on internet text generate misinformation, offensive jokes, and personal contact information, and are highly sensitive to the conditioning text (“prompt”). Next, we show that LM-based classifiers are effective at predicting which texts humans prefer. As a result, it is possible to use such classifiers as a learning signal to automatically correct the LM. We showcase this approach to train a high-quality retrieval system, obtaining strong performance across a variety of tasks using Retrieval-Augmented Generation (RAG). Even after such training schemes, some undesirable behaviors may remain undetected during training. We thus go a step further and generate inputs that elicit undesirable behaviors from the LM using other LMs, to preemptively catch and fix such behaviors. Overall, we find that some of the most powerful tools for aligning LMs with human preferences are LMs themselves.

Bio

Ethan Perez is a fourth and final year Ph.D. student in Natural Language Processing at New York University. He is advised by Kyunghyun Cho and Douwe Kiela and funded by NSF and Open Philanthropy, and he will be joining Anthropic as a Research Scientist after graduation. His research aims to reduce the risk of catastrophic outcomes from machine learning systems. Previously, he has spent time at DeepMind, Facebook AI Research, Montreal Institute for Learning Algorithms, and Google. He earned a Bachelor’s from Rice University as the Engineering department’s Outstanding Senior