This talk is part of the NLP Seminar Series.

Beyond the Surface: How Post-Training Artifacts Shape LLM Diversity and Safety

Weiyan Shi, Northeastern University
Date: Thursday, Nov 6, 11:00 am - 12:00 pm PT
Venue: Room 287, Gates Computer Science Building
Zoom: https://stanford.zoom.us/j/93941842999?pwd=vH7x9wB9bfuIaV1HnQthRmqA8BKTGh.1
Sign-ups for 1:1s: https://docs.google.com/spreadsheets/d/1Kyq-yOiZ8pyWwKKiEQDzktYQYtdacku4mAAvKw48OjQ/edit?usp=sharing

Abstract

Post-training alignment makes LLMs helpful, but also introduces unintended artifacts. This talk explores two such artifacts, their impact on LLM diversity and safety, and presents corresponding solutions. (1) We begin with a data-driven artifact from RLHF, showing how a "typicality bias" in human preferences leads to mode collapse. I will introduce Verbalized Sampling, a principled prompting method that restores diversity across creative writing, social simulation, and synthetic data generation tasks. (2) Next, we shift to a mechanistic artifact from SFT, uncovering how LLMs encode "harmfulness" and "refusal" separately. This insight demystifies how jailbreaks work and enables the Latent Guard, an intrinsic safeguard built on the model's internal beliefs. Together, these findings call for an artifact-aware approach that looks beyond surface-level behaviors when building and evaluating LLMs.

Bio

Weiyan Shi is an assistant professor at Northeastern University, working on human-AI interaction, AI-driven persuasion, and AI safety. She has been recognized as an AI2050 Early Career Fellow, one of MIT Technology Review's 35 Innovators Under 35, and a Rising Star in both Machine Learning and EECS. She has received multiple paper awards at ACL for her work on persuasive dialogues. She co-developed the first negotiation AI to achieve human-level performance in Diplomacy, with the work published in Science and featured in The New York Times, Forbes, and other major media.

Excited to see everyone at the seminar!

Thanks,
Stanford NLP Seminar Organizers