In this talk, I will present work on improving unification, generalization, and efficiency in large-scale pretrained models across the vision and language modalities, using several forms of visual grounding to improve both multimodal and text-only NLU tasks. We will start by discussing joint vision-and-language pretraining models such as LXMERT (large-scale cross-modal pretraining). Next, we will present VL-T5, which unifies several multimodal tasks (such as visual question answering, referring expression comprehension, visual reasoning/entailment, visual commonsense reasoning, captioning, and multimodal machine translation) by treating them all as text generation. We will then discuss improving text-only NLU tasks via visually grounded supervision and knowledge distillation from images and videos (Vokenization, VidLanKD). Finally, we will look at parameter/memory efficiency in VL pretraining via adapter/side-tuning, sparse sampling, and audio replacement methods.
Dr. Mohit Bansal is the John R. and Louise S. Parker Professor in the Computer Science department at the University of North Carolina (UNC) at Chapel Hill. He received his PhD from UC Berkeley and his BTech from IIT Kanpur. His research expertise is in natural language processing and multimodal machine learning, with a particular focus on grounded and embodied semantics, human-like language generation, and interpretable and generalizable deep learning. He is a recipient of the DARPA Director's Fellowship, NSF CAREER Award, Army Young Investigator Award, Google Focused Research Award, Microsoft Investigator Fellowship, and outstanding paper awards at ACL, CVPR, EACL, COLING, and CoNLL. His service includes the ACL Executive Committee, the ACM Doctoral Dissertation Award Committee, Program Co-Chair for CoNLL 2019, ACL Americas Sponsorship Co-Chair, and Associate/Action Editor for the TACL, CL, IEEE/ACM TASLP, and CSL journals. Webpage: https://www.cs.unc.edu/~mbansal/