Note: This week's talk is a Stanford internal talk (in-person only).
The neural representations of transformer models contain rich structure that allows us to debug, control, and predict the behavior of AI systems. However, these representations are challenging to decipher, owing to their massive scale and self-organized nature. Could we use this same scale as a source of leverage, by training AI systems to decipher and explain the representations to us?
As a step in this direction, we introduce concept bottleneck encoders. Given a target model that we wish to understand, a bottleneck encoder is an auxiliary explainer model trained to produce succinct summaries of the target's representations in terms of a model-created token vocabulary. To train the encoder, we design a differentiable surrogate loss that approximates how well an experienced human could predict model behavior given the summary.
I will present our ongoing work constructing concept bottleneck encoders and scaling them to tens of millions of tokens and beyond. By training an end-to-end architecture for explaining representations, we learn an interpretable and expressive concept dictionary, and use this to introspect on fine-grained information within neural representations.
This is ongoing work in collaboration with Vincent Huang, Dami Choi, Sarah Schwettmann, and several others.
Jacob Steinhardt is an Assistant Professor of Statistics and Electrical Engineering & Computer Sciences at UC Berkeley, where he is also a member of the Berkeley AI Research (BAIR) Lab and the Computational Learning, Inference, and Modeling of Biological systems (CLIMB) group. He is the Founder and CEO of Transluce, a non-profit research lab dedicated to building open and scalable technology for understanding frontier AI systems.
Jacob’s research centers on ensuring that machine learning systems are both understandable to humans and aligned with human values, bridging the gap between cutting-edge AI capabilities and responsible deployment.
Excited to see everyone at the seminar!
Thanks,
Stanford NLP Seminar Organizers