Written natural language is omnipresent in human environments, so it is not surprising that many of the questions users with visual impairments ask about images involve reading text in the image. I will introduce two vision & language tasks and datasets, one for visual question answering and one for image captioning, which require reasoning about images and the text they contain. In contrast, sighted users are especially interested in aspects not directly visible in the image, and answering their questions therefore requires external knowledge. In this talk I argue that both challenges, visually grounded reading comprehension and knowledge-based question answering, require combining implicit reasoning with embeddings and explicit reasoning with symbols of knowledge or text. I will show how to integrate these reasoning abilities and demonstrate that doing so improves performance on several tasks. I will conclude with open challenges and relate this work to other recent research.
Marcus is a research scientist at Facebook AI Research. Previously, he was a postdoc at the University of California, Berkeley, in EECS and at ICSI with Trevor Darrell (2014-2017). Marcus did his PhD at the Max Planck Institute for Informatics with Bernt Schiele (2010-2014). His interests include computer vision, computational linguistics, and machine learning, and how these areas can best be combined.