The Stanford Natural Language Processing Group

This talk is part of the NLP Seminar Series.

Theoretical insights on pretraining Transformers and their practical applications

Yoav Levine, Hebrew University, AI21 Labs
Date: 11:00am - 12:00 noon PT, Oct 14 2021
Venue: Zoom (link hidden)

Abstract

Self-supervised pretraining of Transformer architectures has revolutionized natural language processing. I will present analyses that shed light on prominent aspects of this success, and describe the improvements to common practices that they imply. Specifically, I will cover:

The Transformer architecture (NeurIPS 2020, ICML 2021): We investigate the impact of depth, width, and vocabulary size on Transformer expressivity. We find that self-attention has different depth-efficiency traits in different architectural regimes, and that the vocabulary size bottlenecks the contribution of network width. These findings have been central to the training of AI21's Jurassic-1, a 178-billion-parameter LM that is significantly shallower than the equivalently sized GPT-3, and they also imply some required adjustments of the Transformer architecture when applying it to different data modalities.
Pretraining example design (under review for ICLR 2022): We highlight a bias introduced by the LM pretraining process: we prove that a pretrained LM can model much stronger dependencies between text segments that appeared in the same pretraining example than it can between text segments that appeared in different pretraining examples. This intuitive result formalizes the motivation behind a broad line of recent successful LM training heuristics, and clearly indicates the need for a more thoughtful approach to pretraining example design. We show that including semantically related non-neighboring sentences in the same pretraining example yields improved sentence representations and question answering abilities.
The bidirectional pretraining objective (ACL 2020, ICLR 2021 spotlight): We demonstrate that pretraining of bidirectional models can be made more efficient by biasing the masking and the loss to contain information on correlated n-grams and word senses.

Bio

Yoav Levine is completing his doctoral studies at the Hebrew University, advised by Prof. Amnon Shashua, and is the co-Chief Scientist at AI21 Labs, along with Prof. Yoav Shoham. His doctoral work was supported by the Israel Academy of Sciences Adams fellowship, and he recently received the Blavatnik PhD Prize, given to the top 5 Israeli PhD theses in the field of computer science. Prior to his doctoral studies, he earned a BSc in electrical engineering and a BSc in physics (both summa cum laude) as a member of the Adi Lautman excellence program at Tel Aviv University, and an MSc in physics in the theoretical condensed matter physics group at the Weizmann Institute.