New language models are being developed at a rapid pace. While these models have incredible new abilities, we are still following the same old paradigms when it comes to evaluating the language they produce. As a result, claims about their performance rely either on anecdotal evidence or on experiments over Anglo-centric corpora with flawed metrics. We are thus unable to systematically answer the question that lies at the core of natural language generation research: how good is a system that produces natural language, and where does it fail? In this talk, I will discuss the deliberations about languages, datasets, metrics, and human evaluations that are required to address this problem.
Sebastian is a researcher at Google working on the evaluation of language models. His research aims to analyze and improve existing datasets and metrics, develop new ones, and create infrastructure that helps researchers conduct reproducible and accurate experiments. Beyond evaluation, he has developed interactive tools for interpreting and understanding neural models, including LSTMVis, Seq2Seq-Vis, GLTR, exBERT, and LMDiff. His work has received awards and honorable mentions at IEEE VIS '18, the ACL '19 and NeurIPS '20 demo tracks, and the ML Eval workshop '22. He co-organized INLG '19, the EvalNLGEval workshop at INLG '20, and the Generation, Evaluation, and Metrics (GEM) workshop at ACL '21 and EMNLP '22.