The Stanford Natural Language Processing Group

This talk is part of the NLP Seminar Series.

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad, Harvard University
Date: 11:00am - 12:00 noon PT, Thursday, Apr 30
Venue: Room 287, Gates Computer Science Building
Zoom: https://stanford.zoom.us/j/93941842999?pwd=vH7x9wB9bfuIaV1HnQthRmqA8BKTGh.1

Abstract

We use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. Our results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.

In the talk, we will walk through our method and main results:
Harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities.
This compression into a compact set of weights seems to be caused by alignment training.
The compression also partially explains the phenomenon of emergent misalignment.
LLMs generate harmful content with a distinct mechanism, dissociated from how they recognize and explain such content.

Bio

Hadas is a Research Fellow at the Kempner Institute at Harvard University, where she studies the internal mechanics of large AI models to improve their robustness, safety, and reliability. She completed her PhD at the Technion under the supervision of Prof. Yonatan Belinkov. Previously, she worked at Apple and Microsoft.