Current language models exhibit a paradox: they hallucinate facts they do not know, yet reproduce memorized text verbatim without any indication to the reader. This misalignment between what models know and what they generate is a fundamental obstacle to deploying reliable AI systems. In this talk, I will discuss two recent works demonstrating that models can learn to better control their own memorization through targeted post-training interventions. In the first work, we use reinforcement learning with binary retrieval-augmented rewards to reduce hallucinations in long-form generation by up to 40% on Qwen3 models, without compromising utility. In the second work, we mitigate verbatim reproduction by constructing synthetic training pairs from memorized snippets and applying Direct Preference Optimization on just 16K examples, achieving an overall reduction in verbatim reproduction. These findings shed light on how to better align language model outputs with what the models have memorized, advancing progress toward safe and reliable language models.
Tong Chen is a fourth-year Ph.D. student in Computer Science at the University of Washington, where he is advised by Luke Zettlemoyer and Hannaneh Hajishirzi. His research focuses on Natural Language Processing and Machine Learning, with an emphasis on how language models can effectively integrate and utilize external information.