This talk is part of the NLP Seminar Series.

Multi-Agent Approaches to SuperAlignment

Akbir Khan, Anthropic
Date: Mar 13, 2025, 11:00 am - 12:00 pm PT
Venue: Room 287, Gates Computer Science Building

Abstract

As AI systems surpass human capabilities and progress toward superintelligence, ensuring their alignment with human values becomes paramount. These systems will need to act on our behalf in situations we could not foresee, requiring superalignment. In this talk, we explore how multi-agent approaches can imbue these values into systems more intelligent than ourselves:

  • Encouraging prosocial behavior in AI systems by incorporating the learning updates of other agents.
  • Eliciting truthfulness by engaging models in adversarial debates.
  • Punishing scheming by using adaptive protocols in code deployments.

Bio

Akbir Khan is a member of the technical staff at Anthropic, where he focuses on building safe superintelligence. His research centers on Scalable Oversight techniques, primarily through multi-agent learning approaches. His recent work on debate received a Best Paper Award at ICML 2024.