What's the best BERT approach for non-English NLP?

by Chris McCormick and Nick Ryan

S1. Multilingual Models

Up to this point, our tutorials have focused almost exclusively on NLP applications in the English language. While the general algorithms and ideas extend to all languages, the huge number of resources that support English-language NLP do not. For example, BERT and BERT-like models are an incredibly powerful tool, but model releases are almost always in English, perhaps followed by Chinese, Russian, or Western European variants.

For this reason, we're going to look at an interesting category of BERT-like models referred to as Multilingual Models, which help extend the power of large BERT-like models to less prominent languages.

1.1. Multilingual Model Approach

Multilingual models take a rather bizarre approach to addressing multiple languages...

Rather than treating each language independently, a multilingual model is pre-trained on text coming from a mix of languages!

In this Notebook, we'll be playing with a specific multilingual model named XLM-R (short for "XLM-RoBERTa") from Facebook.
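
As a quick preview, here's a minimal sketch of loading XLM-R, assuming you're using the Hugging Face `transformers` library and its `xlm-roberta-base` checkpoint (those specifics are my assumptions, not something this section has set up yet):

```python
# A minimal sketch: load XLM-R through the Hugging Face `transformers`
# library (assumed here; it's not the only way to get the model).
from transformers import XLMRobertaModel, XLMRobertaTokenizer

# Download the pre-trained tokenizer and model weights for the
# base-sized checkpoint.
tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
model = XLMRobertaModel.from_pretrained('xlm-roberta-base')

# One shared vocabulary covers all of the languages XLM-R was trained on.
print('Vocabulary size: {:,}'.format(tokenizer.vocab_size))
print('Parameters:      {:,}'.format(model.num_parameters()))
```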

While the original BERT was pre-trained on English Wikipedia and BooksCorpus (a collection of self-published books), XLM-R was pre-trained on a huge, filtered CommonCrawl dataset covering 100 different languages! Not 100 different models trained on 100 different languages, but a single BERT-type model that was pre-trained on all of this text together.
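
To make the "one model, all languages" idea concrete, here's a small, hedged example (again assuming the `transformers` library and the `xlm-roberta-base` checkpoint; the sample sentences are just illustrations) that runs English and Spanish text through the exact same tokenizer and model:

```python
import torch
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
model = XLMRobertaModel.from_pretrained('xlm-roberta-base')
model.eval()

# The same model and vocabulary handle every language -- there is no
# per-language model to pick and no "language" flag to set.
sentences = [
    'The weather is nice today.',    # English
    'El clima está agradable hoy.',  # Spanish
]

for text in sentences:
    # Encode with XLM-R's shared multilingual vocabulary.
    inputs = tokenizer(text, return_tensors='pt')

    # Run the text through the (single) pre-trained model.
    with torch.no_grad():
        outputs = model(**inputs)

    # For the base model, each token gets a 768-dimensional vector.
    print(text)
    print('  Subword tokens:', tokenizer.tokenize(text))
    print('  Output shape:  ', tuple(outputs.last_hidden_state.shape))
```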