This talk is part of the NLP Seminar Series.

Note: This week's talk is a Stanford internal talk (in-person only).

Concept Bottleneck Encoders: Teaching Humans to Understand Neural Representations

Jacob Steinhardt, UC Berkeley
Date: 11:00 am - 12:00 pm PT, Thursday, Oct 16
Venue: Room 287, Gates Computer Science Building
Sign-ups for 1:1s: https://docs.google.com/spreadsheets/d/1cpW80fG3CBira1Lk_Bg-zfgRz3O0G5pIyQew2Q64avs/edit?gid=0#gid=0

Abstract

The neural representations of transformer models contain rich structure allowing us to debug, control, and predict the behavior of AI systems. However, these representations are challenging to decipher, owing to their massive scale and self-organized nature. Could we use this same scale as a source of leverage, by training AI systems to decipher and explain the representations to us?

As a step in this direction, we introduce concept bottleneck encoders. Given a target model that we wish to understand, a concept bottleneck encoder is an auxiliary explainer model trained to produce succinct summaries of the target model's representations in terms of a model-created token vocabulary. To train the encoder, we design a differentiable surrogate loss that approximates how well an experienced human could predict model behavior given the summary.
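To make the setup concrete, here is a minimal sketch of one forward pass of such an encoder: a representation is compressed to a few tokens from a learned concept vocabulary, and a simulator (standing in for the human reader) predicts model behavior from that summary. All names, shapes, and the exact loss form are illustrative assumptions, not the authors' actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64      # dimensionality of the target model's hidden states (assumed)
vocab_size = 256  # size of the learned concept vocabulary (assumed)
n_behaviors = 10  # number of behaviors the simulator predicts (assumed)
k = 4             # summary length: top-k concept tokens kept in the bottleneck

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Encoder: projects a hidden state to logits over the concept vocabulary.
W_enc = rng.normal(size=(d_model, vocab_size)) * 0.02
# Simulator: predicts target-model behavior from the sparse concept summary;
# it is the differentiable stand-in for an experienced human reader.
W_sim = rng.normal(size=(vocab_size, n_behaviors)) * 0.02

h = rng.normal(size=(d_model,))          # a hidden state from the target model
concept_logits = h @ W_enc
top_k = np.argsort(concept_logits)[-k:]  # succinct summary: k concept tokens

# Bottleneck: zero out everything except the k selected concepts.
summary = np.zeros(vocab_size)
summary[top_k] = softmax(concept_logits[top_k])

behavior_pred = softmax(summary @ W_sim)

# Surrogate loss: cross-entropy between the simulated prediction and the
# target model's actual behavior (a placeholder label here).
true_behavior = 3
loss = -np.log(behavior_pred[true_behavior])
```

In a real system the hard top-k selection would need a differentiable relaxation (e.g. a softmax temperature or straight-through estimator) so the surrogate loss can be backpropagated end-to-end through the encoder.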

I will present our ongoing work constructing concept bottleneck encoders and scaling them to tens of millions of tokens and beyond. By training an end-to-end architecture for explaining representations, we learn an interpretable and expressive concept dictionary, and use this to introspect on fine-grained information within neural representations.

This is ongoing work in collaboration with Vincent Huang, Dami Choi, Sarah Schwettmann, and several others.

Bio

Jacob Steinhardt is an Assistant Professor of Statistics and Electrical Engineering & Computer Sciences at UC Berkeley, where he is also a member of the Berkeley AI Research (BAIR) Lab and the Computational Learning, Inference, and Modeling of Biological systems (CLIMB) group. He is the Founder and CEO of Transluce, a non-profit research lab dedicated to building open and scalable technology for understanding frontier AI systems.

Jacob’s research centers on ensuring that machine learning systems are both understandable to humans and aligned with human values, bridging the gap between cutting-edge AI capabilities and responsible deployment.

Excited to see everyone at the seminar!

Thanks,
Stanford NLP Seminar Organizers