Are extralinguistic signals such as image pixels crucial for inducing constituency grammars? While past work has shown substantial gains from multimodal cues, we investigate whether such gains persist in the presence of rich information from large language models (LLMs). We find that our LLM-based approach, a simple text-only framework, outperforms previous multimodal methods on the task of unsupervised constituency parsing, achieving state-of-the-art performance on a variety of datasets. It reduces parameter count by over 50% and speeds up training by 1.7× compared to image-aided models and by more than 5× compared to video-aided models. These results challenge the notion that extralinguistic signals such as image pixels are needed for unsupervised grammar induction, and point to the need for stronger text-only baselines when evaluating whether multimodality is necessary for the task. In this talk, I will discuss learning from LLMs for grammar induction and share my thoughts on the respective roles of text-only and multimodal models.
Boyi Li is a Research Scientist at NVIDIA Research and a Postdoctoral Scholar at Berkeley AI Research, advised by Prof. Jitendra Malik and Prof. Trevor Darrell. She received her Ph.D. from Cornell University, advised by Prof. Serge Belongie and Prof. Kilian Q. Weinberger. Her research interests are in machine learning, computer vision, and NLP, with a primary focus on multimodal and data-efficient machine learning for building intelligent systems.