The softmax function is used in the final layer of nearly all existing sequence-to-sequence models for language generation. However, it is usually the slowest layer to compute, which limits the vocabulary to a subset of the most frequent word types, and it has a large memory footprint. We propose a general technique for replacing the softmax layer with a continuous embedding layer. Our primary innovation is a training and inference procedure in which the model generates outputs in the space of pre-trained word embeddings, instead of a multinomial distribution over the vocabulary obtained via softmax. We evaluate this new class of sequence-to-sequence models with continuous outputs on the task of neural machine translation. In translation of two language pairs, into two languages, our models perform on par with state-of-the-art models in terms of translation quality, while attaining up to 2.5x speed-up in training time. These models can handle very large vocabularies without compromising translation quality. They also produce more meaningful errors than softmax-based models, as these errors typically lie in a semantic subspace of the vector space of the reference translations.
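To make the idea concrete, here is a minimal sketch of continuous-output prediction, assuming a cosine-distance training loss and nearest-neighbor decoding against a pre-trained embedding table (the vocabulary, dimensions, and loss choice below are illustrative assumptions, not the abstract's exact formulation):

```python
import numpy as np

# Hypothetical tiny setup: a "pre-trained" embedding table for a
# 5-word vocabulary (names and sizes are illustrative only).
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
emb = rng.normal(size=(len(vocab), 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm rows

def cosine_loss(pred, target_id):
    """Training objective: 1 - cosine similarity between the model's
    continuous output and the gold word's embedding. Unlike softmax,
    this never computes a normalizer over the whole vocabulary."""
    pred = pred / np.linalg.norm(pred)
    return 1.0 - float(pred @ emb[target_id])

def decode(pred):
    """Inference: nearest neighbor in the embedding table replaces
    argmax over softmax logits -- a single V x d similarity search."""
    pred = pred / np.linalg.norm(pred)
    return vocab[int(np.argmax(emb @ pred))]

# A near-perfect prediction decodes back to its word with ~zero loss.
pred = emb[1] + 0.01 * rng.normal(size=8)
print(decode(pred), round(cosine_loss(pred, 1), 3))
```

Because both the loss and the decoding step cost O(V·d) vector operations with no per-step normalization over logits, the output layer's cost decouples from vocabulary size far more gracefully than softmax, which is what enables the very large vocabularies the abstract mentions.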
Yulia Tsvetkov is an assistant professor in the Language Technologies Institute at Carnegie Mellon University. Her current research projects focus on multilinguality, controllable text generation, automated negotiation, and NLP for social good. Prior to joining the LTI, Yulia was a postdoc in the Department of Computer Science at Stanford; she received her PhD from Carnegie Mellon University.