Cross-lingual alignment can emerge with minimal or no supervision, as demonstrated by successes in cross-lingual transfer, lexicon induction, and machine translation. However, the statistical properties underlying cross-lingual alignment remain overlooked and poorly understood. First, I describe a simple statistical method, coocmap, that relies only on co-occurrence counts and succeeds at unsupervised bilingual lexicon induction. We show significant improvements over vecmap in the low-data regime, and that distributed representations are unnecessary for success. Second, we show that it is possible to produce much higher-quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 F1 points averaged over 12 language pairs, while also providing a more interpretable approach that allows for rich reasoning about word meaning in context.
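To make the co-occurrence-only idea concrete, here is a minimal toy sketch (not the actual coocmap algorithm, which is more sophisticated): on a tiny "cipher" language pair with identical structure, each word's sorted row of the co-occurrence matrix is a vocabulary-order-free signature, and matching signatures across languages recovers the lexicon with no seed dictionary and no distributed representations. The corpora, the cipher, and all names below are invented for illustration.

```python
import numpy as np

def cooc(sents, vocab):
    """Symmetric word-word co-occurrence counts within each sentence."""
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for s in sents:
        for i in range(len(s)):
            for j in range(len(s)):
                if i != j:
                    C[idx[s[i]], idx[s[j]]] += 1
    return C

# Toy source corpus, and a "target language" that is a word-for-word cipher,
# so the two sides share co-occurrence structure by construction.
en = [["the", "cat", "sat"], ["the", "dog", "ran"],
      ["the", "cat", "ran"], ["cat", "sat"]]
cipher = {"the": "le", "cat": "chat", "sat": "assis",
          "dog": "chien", "ran": "couru"}
fr = [[cipher[w] for w in s] for s in en]

vocab_en = sorted({w for s in en for w in s})
vocab_fr = sorted({w for s in fr for w in s})

# Sorting each row discards vocabulary order, leaving a
# language-independent co-occurrence signature per word.
Pe = np.sort(cooc(en, vocab_en), axis=1)
Pf = np.sort(cooc(fr, vocab_fr), axis=1)

# Induce the lexicon by nearest-neighbor matching of signatures.
lexicon = {w: vocab_fr[int(np.argmin(np.linalg.norm(Pf - Pe[i], axis=1)))]
           for i, w in enumerate(vocab_en)}
```

On this toy pair the induced `lexicon` recovers the cipher exactly; real corpora are only approximately structure-preserving, which is where the robustness of the actual method matters.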
Sida Wang is a research scientist at FAIR. His recent research focuses on unsupervised and self-supervised multilingual models and language-to-code models. Previously, he was a research instructor at Princeton University and the Institute for Advanced Study. He completed his PhD at Stanford University, co-advised by Profs. Christopher D. Manning and Percy Liang, where he focused on language interfaces that accommodate both the precise action space of computers and the informality of human thinking.