This talk is part of the NLP Seminar Series.

Safety Alignment of LMs via Non-cooperative Games

Arman Zharmagambetov, Meta
Date: 11:00am - 12:00 noon PT, Thursday, Mar 12
Venue: Room 287, Gates Computer Science Building
Zoom: https://stanford.zoom.us/j/93941842999?pwd=vH7x9wB9bfuIaV1HnQthRmqA8BKTGh.1

Abstract

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
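The abstract does not give implementation details, but the core idea (an Attacker LM and a Defender LM updated jointly online, with a pairwise preference signal rather than a point-wise score) can be illustrated schematically. The following is a minimal, self-contained Python sketch of that kind of loop; the names ToyLM, preference_judge, and policy_update are hypothetical placeholders and are not the actual AdvGame components.

```python
# Illustrative sketch only: a toy attacker/defender loop with a pairwise
# (preference-based) reward. All classes and functions below are
# placeholders standing in for real LM policies, a learned preference
# model, and an online RL update; none are from the talk itself.

import random

class ToyLM:
    """Stand-in for a language-model policy; returns canned strings."""
    def __init__(self, name):
        self.name = name
        self.score = 0.0  # proxy for parameters adjusted by the RL update

    def generate(self, prompt):
        return f"{self.name} response to: {prompt}"

def preference_judge(response_a, response_b):
    """Pairwise comparison instead of a point-wise score:
    returns +1 if response_a is preferred, -1 otherwise.
    A coin flip stands in for a learned preference model here."""
    return 1 if random.random() < 0.5 else -1

def policy_update(model, reward):
    """Placeholder for an online RL step (e.g., a policy-gradient update)."""
    model.score += reward

attacker = ToyLM("attacker")
defender = ToyLM("defender")
reference = ToyLM("reference")  # frozen baseline used for the pairwise comparison

for step in range(3):
    # Attacker proposes an adversarial prompt targeting the current defender.
    adversarial_prompt = attacker.generate("elicit an unsafe completion")

    # Defender answers the adversarial prompt; the frozen reference does too.
    defense = defender.generate(adversarial_prompt)
    baseline = reference.generate(adversarial_prompt)

    # Preference-based reward: is the defender's reply preferred to the baseline?
    pref = preference_judge(defense, baseline)

    # Non-zero-sum payoffs: the defender is rewarded when its reply is preferred,
    # the attacker when its prompt exposes a failure (reply not preferred).
    policy_update(defender, +pref)
    policy_update(attacker, -pref)
```

Because both policies keep updating against each other's latest strategy, the loop sketches the co-adaptive dynamic the abstract describes, with the pairwise judge supplying the supervision signal in place of an absolute reward score.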

Bio

Arman Zharmagambetov is a research scientist on the Fundamental AI Research (FAIR) team at Meta. His research primarily focuses on machine learning and optimization, recently exploring their application to enhancing the security and robustness of AI systems. He received his PhD from the University of California, Merced. Afterward, he completed postdoctoral research with Yuandong Tian at FAIR, focusing on reinforcement learning, AI-guided design, and optimization.