We use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. Our results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
In the talk, we will walk through our method and main results:
Harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities.
This seems to be caused by alignment training.
The compression also partially explains the phenomenon of emergent misalignment.
LLMs generate harmful content with a distinct mechanism, dissociate from how they recognize and explain such content.
Hadas is a Research Fellow at the Kempner institute at Harvard University, where she studies the internal mechanics of large AI models to improve their robustness, safety, and reliability. She completed her PhD in the Technion under the supervision of Prof. Yonatan Belinkov. Previously, she worked at Apple and Microsoft.