The Stanford Natural Language Processing Group

This talk is part of the NLP Seminar Series.

Bilingual lexicon induction via co-occurrences and bitexts

Sida Wang, Facebook AI Research
Date: 11:00 am - 12:00 noon PT, Nov 4, 2021
Venue: Zoom (link hidden)

Abstract

Cross-lingual alignment occurs with minimal or no supervision, as demonstrated by successes in cross-lingual transfer, lexicon induction, and machine translation. However, the statistical properties of cross-lingual alignment are overlooked and not understood. First, I describe a simple statistical method, which we call coocmap, that is based only on co-occurrences and succeeds in unsupervised bilingual lexicon induction. We show significant improvements over vecmap in the low-data regime, and we show that distributed representations are unnecessary for success. Second, we show that it is possible to produce much higher-quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 F1 points averaged over 12 language pairs, while also providing a more interpretable approach that allows for rich reasoning about word meaning in context.

Bio

Sida Wang is a research scientist at FAIR. His recent research focuses on unsupervised and self-supervised multilingual and language-to-code models. Previously, he was a research instructor at Princeton University and the Institute for Advanced Study. He completed his PhD at Stanford University, where he was co-advised by Profs. Christopher D. Manning and Percy Liang and focused on language interfaces that accommodate both the precise action space of computers and the informal thinking of humans.