This talk is part of the NLP Seminar Series. Unfortunately, this week's talk is not open to the public.

Making a Very Large Pretraining Dataset: some Social and Technical Considerations

Yacine Jernite, Hugging Face
Date: 10:00am - 11:00am PT, May 6, 2021
Venue: Zoom (link hidden)


Recent years have seen a paradigm shift within Deep Learning for NLP, from relying chiefly on datasets curated with specific tasks in mind to pre-training large language models that first acquire general linguistic skills and world knowledge. What, then, does "generality" mean? How does it depend on the choices made when gathering the pre-training data? And who bears responsibility for these choices? The "Summer of Language Models 21 🌸" is an upcoming one-year collaborative workshop that aims to train a multilingual large language model and enable research into the capabilities and limitations of these systems for a wide community of researchers and stakeholders. Building the training dataset will be a significant part of this endeavor. In this talk, we will cover relevant scholarship that helps situate the questions above in this context, and present our approach to framing the curation process around good practices of data governance and representative sourcing.


Yacine is a research scientist at Hugging Face, where he works on conditional text generation, long-form question answering, and explaining model predictions. After undergraduate studies at Ecole Polytechnique and a Master's in Machine Learning from ENS Cachan, Yacine completed his PhD at New York University in January 2018 under the supervision of Prof. David Sontag. During his PhD, he worked on language modeling, graphical models, and medical applications of natural language processing.