The Stanford Natural Language Processing Group

This talk is part of the NLP Seminar Series.

From Vision-Language Models to Computer Use Agents: Data, Methods, and Evaluation

Tao Yu, The University of Hong Kong
Date: 11:00am - 12:00 noon PT, Jan 9 2025
Venue: Room 287, Gates Computer Science Building

Abstract

Recent advances in vision-language models (VLMs) have enabled AI agents to operate computers just as humans do. In this talk, I will present our approach to scaling these agents along three key dimensions: data, methods, and evaluation. First, I will describe how we use internet-scale instructional videos and human demonstrations, collected via our AgentNet platform, to build large-scale computer interaction datasets. I will then discuss key insights from training VLMs for computer use and share results from our evaluation framework, OSWorld. Finally, I will present Agent Arena, our open platform for scalable real-world evaluation through crowdsourced user-computer interactions, and outline key directions for improving agent robustness and safety for real-world deployment.

Bio

Tao Yu is an assistant professor of computer science at The University of Hong Kong, where he directs the XLANG Lab (part of the HKU NLP Group). His main research interest is natural language processing. He completed his Ph.D. at Yale University and was a postdoctoral fellow in the UW NLP group at the University of Washington. His research aims to develop embodied AI agents that ground language and perception into code and actions executable in digital and physical environments, helping people perform data science, control computers, and collaborate with robots to carry out real-world tasks. Tao is a recipient of the Google Research Scholar Award and the Amazon Research Award.