Interpretability research has exploded in recent years, producing diverse, often heuristic attempts at understanding how models perform the tasks they do. In this talk, I will present steps towards a framework that helps concretize these heuristics and expands the notion of what it means to interpret. Specifically, focusing on in-context learning, we will start our analysis with a behavior-first approach and define Bayesian models that predict both the outputs produced by large-scale Transformers and, assuming power-law scaling, their learning dynamics. We then use these Bayesian models as our guiding object and characterize how representations ought to be structured in order to support such a behavioral model, thereby making feature geometry a core object of study for interpretability. This lens helps us characterize the limitations of several existing interpretability paradigms, e.g., SAEs, but also offers a path forward: either designing tools with appropriate geometrical assumptions or post-processing SAE activations. Critically, this implies there is no silver bullet in bottom-up interpretability: behavior guides which tool or post-processing ought to be used.

Grounded in this discussion, we then assess the utility of our framework by examining how representations can be used to influence behavior: we will make precise what inference-time interventions like activation steering are trying to achieve, how existing protocols rest on inherently incorrect assumptions, and how these can be fixed. Making a formal link to post-training (grounded in existing Bayesian accounts of RLHF), we will show that inference-time interventions can be seen as rejection sampling, motivating a pipeline for amortizing this process and leading to scalable oversight approaches grounded in interpretability. As a case study, we will operationalize a naive version of this pipeline for the task of reducing hallucinations, achieving a 58% reduction in hallucinated claims in an open-source LLM at 100x lower cost than using a frontier-model judge.
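To give a flavor of the rejection-sampling framing mentioned above, here is a minimal, hypothetical sketch. Everything in it is an illustrative assumption rather than the talk's actual pipeline: the model and scorer are toy stand-ins, and the names (sample_completion, concept_score, rejection_sample) are invented for this sketch. A real version would draw completions from an LLM and score them with a probe on its internal activations.

```python
# Hypothetical sketch: inference-time intervention viewed as rejection sampling.
# The sampler and scorer below are toy stand-ins, not the talk's actual method.

import random

def sample_completion(prompt: str) -> str:
    """Toy stand-in for drawing one completion from a base model."""
    return f"{prompt} -> draw #{random.randint(0, 999)}"

def concept_score(completion: str) -> float:
    """Toy stand-in for an internals-based scorer, e.g., a linear probe
    measuring how strongly activations align with a target concept."""
    return random.random()

def rejection_sample(prompt: str, threshold: float = 0.9, max_tries: int = 64) -> str:
    """Resample until the scorer accepts; accepted draws approximate the
    base model's distribution conditioned on the target concept."""
    completion = sample_completion(prompt)
    for _ in range(max_tries):
        completion = sample_completion(prompt)
        if concept_score(completion) >= threshold:
            break  # accepted: behaves like a "steered" sample
    return completion  # fall back to the last draw if nothing is accepted

if __name__ == "__main__":
    print(rejection_sample("Explain the result"))
```

On this view, "amortizing" the process would mean training the model on accepted samples so that a single forward pass reproduces the rejection-sampled distribution, removing the repeated-sampling cost at inference time.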
Ekdeep Singh Lubana is a member of technical staff at Goodfire AI, an interpretability startup focused on understanding and controlling AI systems. Previously, Ekdeep was a Physics of Intelligence postdoctoral research fellow at Harvard University, and completed his PhD co-affiliated with the University of Michigan and the Center for Brain Science at Harvard. His recent work has focused on developing scalable oversight protocols based on model internals and on concretizing a formalism that brings together different paradigms of interpretability. Earlier, his work focused on developing phenomenological accounts of behaviors shown by neural networks during scaling, with a specific focus on emergent capabilities and in-context learning.