This talk is part of the NLP Seminar Series.

Identifying and Neutralizing Concepts in Neural Representations

Shauli Ravfogel, Bar-Ilan University
Date: Feb 29th, 2024, 11:00am - 12:00pm
Venue: Room 287, Gates Computer Science Building

Abstract

I will introduce a line of work on locating and neutralizing specific concepts in neural models trained on textual data. The first work proposes a concept-neutralization pipeline that trains linear classifiers to predict the concept and projects the representations onto the classifiers' null-space. The second work formulates the problem as a constrained, linear minimax game and derives a closed-form solution for certain objectives, along with efficient solutions for others. Both methods are shown to be effective in a range of use cases, including bias and fairness in word embeddings and multi-class classification. Beyond fairness considerations, I will discuss the promise and limitations of manipulating LMs through their representations, and the use of interventions in representation space as an interpretability and analysis tool.
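As a rough illustration of the null-space idea described above, the following minimal Python sketch removes a linear concept direction from a set of representations. It is an assumption-laden toy example (random data, a single binary "concept" label, one projection step), not the speaker's actual code; the full pipeline iterates this step with freshly trained classifiers.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 300))       # representations (e.g., word embeddings)
    z = rng.integers(0, 2, size=1000)      # protected-concept labels (illustrative)

    # 1. Train a linear classifier to predict the concept from the representations.
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    W = clf.coef_                          # shape (1, 300): learned concept direction

    # 2. Projection onto the null-space of W:  P = I - W^T (W W^T)^{-1} W.
    P = np.eye(X.shape[1]) - W.T @ np.linalg.inv(W @ W.T) @ W

    # 3. Project the representations; the removed direction no longer carries
    #    linearly decodable information about the concept.
    X_clean = X @ P

In the full pipeline, one would retrain a classifier on X_clean and project again, repeating until the concept can no longer be predicted linearly.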

Bio

I am a final-year PhD student in the Natural Language Processing Lab at Bar-Ilan University, advised by Prof. Yoav Goldberg and supported by the Bloomberg Data Science PhD Fellowship. My research interests lie in representation learning, analysis, and interpretability of neural models, with a focus on controlled representation learning. In particular, I am interested in how neural models learn distributed representations that encode structured information, how they utilize those representations to solve tasks, and in our ability to control their content and map them back to interpretable concepts. During my PhD I have focused on the development of methods for localizing and editing human-interpretable concepts in neural representations, with some fun linguistic side-tours.