This talk is part of the NLP Seminar Series.

Closing the Modality Gap: Benchmarking and Improving Visual Understanding in Multimodal LLMs

Deqing Fu, USC
Date: May 22, 2025, 11:00am - 12:00pm
Venue: Room 287, Gates Computer Science Building

Abstract

Multimodal large language models (MLLMs) consistently underperform their text-only counterparts, as revealed by our recent work, IsoBench, which shows a significant performance decline when identical tasks are presented visually rather than textually. One potential cause of this modality gap is the presence of modality-specific hallucinations. To mitigate this issue, we introduce TLDR, a computationally intensive yet highly effective approach that leverages synthetic negative examples to identify and correct hallucinations at the token level, significantly improving visual grounding. Additionally, we demonstrate that steering vectors extracted from text-only LLMs using interpretability tools can efficiently enhance visual reasoning capabilities without additional multimodal training. The other potential cause of the modality gap is that MLLMs cannot reason visually. To address this, we introduce Zebra-CoT, a diverse, large-scale dataset of over 148K samples containing logically coherent interleaved text-image reasoning traces. Fine-tuning an Anole-7B model on the Zebra-CoT training set improves accuracy on our test set by +12%.
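(Illustrative sketch, not from the talk.) The steering-vector result mentioned above builds on activation-steering ideas from the interpretability literature: a direction is extracted from a model's hidden states and added back at inference time to nudge its behavior. The sketch below shows the generic recipe on a text-only causal LM, using the difference of mean hidden states between two contrastive prompts as the steering direction; the model name, layer index, prompts, and scaling factor alpha are all placeholder assumptions, and the speaker's actual method may differ.

# Minimal activation-steering sketch (placeholder assumptions throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # hypothetical transformer block at which to read and inject activations

def mean_hidden(prompt: str) -> torch.Tensor:
    # Mean hidden state of the prompt at the output of block LAYER.
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # hidden_states[0] is the embedding layer, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Contrastive prompt pair: careful step-by-step reasoning vs. an immediate guess.
steer = mean_hidden("Let's reason step by step about the diagram.") \
      - mean_hidden("Just guess the answer immediately.")

def add_steering(module, inputs, output, alpha=4.0):
    # Forward hook: shift the block's output by alpha times the steering vector.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Register the hook on the chosen block and generate as usual.
handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("The figure shows", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()

In principle, the same hook-based injection can be applied to the language backbone of an MLLM, which is the kind of text-to-vision transfer the abstract alludes to; the choice of layer, prompts, and scale would need to be validated empirically.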

Bio

Deqing Fu is a third-year Ph.D. student in Computer Science at the University of Southern California (USC). His research focuses on natural language processing, interpretability, and multimodality. He is co-advised by Prof. Vatsal Sharan in the USC Theory Group and Prof. Robin Jia in the USC NLP Group. Prior to his Ph.D., Deqing earned his undergraduate and master’s degrees in mathematics and statistics from the University of Chicago.