Tasks like Vision-and-Language Navigation have become popular in the grounding literature, but the real world includes interaction, state changes, and long-horizon planning (actually, the real world requires motors and torques, but let's ignore that for the moment). What's the right interface for connecting language to the real world? How much planning and reasoning can be done in the abstract, and how much requires grounded context? In this talk, I'll present several pieces of work that build on our ALFRED (Action Learning From Realistic Environments and Directives) benchmark dataset, its environment, and its annotations. The goal is to provide a playground for moving embodied language+vision research closer to robotics by enabling the community to uncover abstractions and interactions between planning, reasoning, and action taking.
Yonatan Bisk is an Assistant Professor in the Language Technologies Institute at Carnegie Mellon University. He received his PhD from the University of Illinois at Urbana-Champaign, where he worked on CCG induction with Julia Hockenmaier. Having pursued CCG syntax (instead of semantics) for years, he has finally given in to the need for semantics and now focuses on language grounding, primarily motivated by the question: What knowledge can't be learned from text?