Post-training alignment makes LLMs helpful, but it also introduces unintended artifacts. This talk explores two such artifacts, their impact on LLM diversity and safety, and corresponding solutions. (1) I begin with a data-driven artifact from RLHF, showing how a "typicality bias" in human preferences leads to mode collapse. I will introduce Verbalized Sampling, a principled prompting method that restores diversity across creative writing, social simulation, and synthetic data generation tasks. (2) Next, I shift to a mechanistic artifact from SFT, uncovering how LLMs encode "harmfulness" and "refusal" separately. This insight demystifies how jailbreaks work and enables the Latent Guard, an intrinsic safeguard built on the model's internal beliefs. Together, these findings call for an artifact-aware approach that looks beyond surface-level behaviors when building and evaluating LLMs.
Weiyan Shi is an assistant professor at Northeastern University, working on human-AI interaction, AI-driven persuasion, and AI safety. She has been recognized as an AI2050 Early Career Fellow and one of MIT Technology Review's 35 Innovators Under 35, and has received Rising Star awards in both Machine Learning and EECS. She has received multiple paper awards at ACL for her work on persuasive dialogues. She co-developed the first negotiation AI to achieve human-level performance in Diplomacy, with the work published in Science and featured in The New York Times, Forbes, and other major media.
Excited to see everyone at the seminar!
Thanks,
Stanford NLP Seminar Organizers