Existing language model (LM) training regimes entangle compute, data, and parameters, requiring expensive and brittle synchronous compute. In this talk, we introduce a new algorithm called Branch-Train-Merge (BTM) to asynchronously train LMs that are fundamentally modular. In BTM, components (or experts) of the LM are specialized to distinct domains in the training corpus, and experts are conditionally updated based on the domain of the incoming document. We show how BTM enables LMs that are rapidly customizable (with the ability to mix, add, or remove experts after training), embarrassingly parallel (requiring no communication between experts), and sparse (needing only a few experts active at a time for inference). Key to our proposal is exploring what constitutes the domains to which experts specialize, as well as reflecting on the data sources used to train LMs. Our new techniques chart a path towards collaborative and iterative LM development, where anyone can contribute and maintain experts at very modest computational cost.
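The branch-train-merge loop described above can be sketched with toy unigram "experts"; everything here (the domain corpora, the posterior estimate, the function names) is an illustrative assumption, not the authors' actual implementation:

```python
# Minimal sketch of Branch-Train-Merge (BTM) with toy unigram "experts".
# All names and data here are illustrative, not the authors' code.

import math
from collections import Counter

# Hypothetical domain shards of a training corpus.
DOMAINS = {
    "medicine": "patient dose symptom trial clinical",
    "law": "court plaintiff statute contract appeal",
}

def train_expert(corpus):
    """Branch + Train: each expert is fit independently on its own
    domain shard (here, just unigram frequencies), so training is
    embarrassingly parallel -- no communication between experts."""
    counts = Counter(corpus.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# One expert per domain; each could be trained on a separate machine.
experts = {d: text for d, text in DOMAINS.items()}
experts = {d: train_expert(text) for d, text in DOMAINS.items()}

def domain_weights(context):
    """Estimate how relevant each domain is to the incoming text,
    a stand-in for the domain posterior used to mix experts."""
    scores = {
        d: sum(math.log(lm.get(w, 1e-6)) for w in context.split())
        for d, lm in experts.items()
    }
    z = max(scores.values())
    unnorm = {d: math.exp(s - z) for d, s in scores.items()}
    total = sum(unnorm.values())
    return {d: v / total for d, v in unnorm.items()}

def merged_prob(word, context):
    """Merge: ensemble expert predictions weighted by the inferred
    domain of the context. Experts can be added or removed by editing
    the `experts` dict without retraining the others."""
    w = domain_weights(context)
    return sum(w[d] * experts[d].get(word, 0.0) for d in experts)
```

In this toy setup, a medical context routes probability mass to the medicine expert (`domain_weights("patient clinical trial")` is dominated by `"medicine"`), illustrating how only the relevant experts need to be active at inference time.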
Suchin Gururangan is a PhD candidate at the University of Washington, advised by Noah A. Smith and Luke Zettlemoyer. He was previously a visiting researcher at Meta AI, a pre-doctoral resident at the Allen Institute for AI, and spent several years in industry as a data scientist. His research interests span many areas of NLP; currently he works on modular, sparse language models that are efficient to customize and scale. His work has received awards at ACL 2020 and 2021, and he is supported by the Bloomberg Data Science PhD Fellowship.
Margaret Li is a PhD student at the University of Washington, advised by Luke Zettlemoyer and Tim Althoff. She is concurrently a visiting researcher at Meta AI, and was previously a research engineer at Meta AI (then FAIR) for three years. Her current research interests include efficient and simple training of, adaptation of, and generation from large language models, as well as open-domain dialogue.