Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa)
https://arxiv.org/pdf/1911.02116.pdf
Larger-Scale Transformers for Multilingual Masked Language Modeling
https://arxiv.org/pdf/2105.00572.pdf
What's New:
- June 2021: XLMR-XL and XLMR-XXL models released.
Introduction
XLM-R (XLM-RoBERTa) is a generic cross-lingual sentence encoder that obtains state-of-the-art results on many cross-lingual understanding (XLU) benchmarks. It is trained on 2.5TB of filtered CommonCrawl data in 100 languages (listed below).
| Language | Language | Language | Language | Language |
| --- | --- | --- | --- | --- |
| Afrikaans | Albanian | Amharic | Arabic | Armenian |
| Assamese | Azerbaijani | Basque | Belarusian | Bengali |
| Bengali Romanized | Bosnian | Breton | Bulgarian | Burmese |
| Burmese zawgyi font | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian |
| Czech | Danish | Dutch | English | Esperanto |
| Estonian | Filipino | Finnish | French | Galician |
| Georgian | German | Greek | Gujarati | Hausa |
| Hebrew | Hindi | Hindi Romanized | Hungarian | Icelandic |
| Indonesian | Irish | Italian | Japanese | Javanese |
| Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji) |
| Kyrgyz | Lao | Latin | Latvian | Lithuanian |
| Macedonian | Malagasy | Malay | Malayalam | Marathi |
| Mongolian | Nepali | Norwegian | Oriya | Oromo |
| Pashto | Persian | Polish | Portuguese | Punjabi |
| Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian |
| Sindhi | Sinhala | Slovak | Slovenian | Somali |
| Spanish | Sundanese | Swahili | Swedish | Tamil |
| Tamil Romanized | Telugu | Telugu Romanized | Thai | Turkish |
| Ukrainian | Urdu | Urdu Romanized | Uyghur | Uzbek |
| Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish |
Pre-trained models
| Model | Description | # params | vocab size | Download |
| --- | --- | --- | --- | --- |
| xlmr.base | XLM-R using the BERT-base architecture | 250M | 250k | xlmr.base.tar.gz |
| xlmr.large | XLM-R using the BERT-large architecture | 560M | 250k | xlmr.large.tar.gz |
| xlmr.xl | XLM-R (layers=36, model_dim=2560) | 3.5B | 250k | xlmr.xl.tar.gz |
| xlmr.xxl | XLM-R (layers=48, model_dim=4096) | 10.7B | 250k | xlmr.xxl.tar.gz |
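All sizes are loaded the same way. A minimal sketch, assuming the torch.hub aliases mirror the model names in the table above (older fairseq versions may only expose xlmr.base and xlmr.large; in that case download the tarball and use from_pretrained as shown under Example usage below):

```python
import torch

# Assumed alias: swap in 'xlmr.large', 'xlmr.xl' or 'xlmr.xxl' for other sizes.
xlmr = torch.hub.load('pytorch/fairseq:main', 'xlmr.base')
xlmr.eval()  # disable dropout for evaluation

# Rough sanity check against the "# params" column above (~250M for xlmr.base).
print(sum(p.numel() for p in xlmr.model.parameters()))
```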
Results
XNLI (Conneau et al., 2018)
Scores are test-set accuracy per language.
| Model | average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| roberta.large.mnli (TRANSLATE-TEST) | 77.8 | 91.3 | 82.9 | 84.3 | 81.2 | 81.7 | 83.1 | 78.3 | 76.8 | 76.6 | 74.2 | 74.1 | 77.5 | 70.9 | 66.7 | 66.8 |
| xlmr.large (TRANSLATE-TRAIN-ALL) | 83.6 | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | 83.7 | 81.6 | 78.0 | 78.1 |
| xlmr.xl (TRANSLATE-TRAIN-ALL) | 85.4 | 91.1 | 87.2 | 88.1 | 87.0 | 87.4 | 87.8 | 85.3 | 85.2 | 85.3 | 86.2 | 83.8 | 85.3 | 83.1 | 79.8 | 78.2 |
| xlmr.xxl (TRANSLATE-TRAIN-ALL) | 86.0 | 91.5 | 87.6 | 88.7 | 87.8 | 87.4 | 88.2 | 85.6 | 85.1 | 85.8 | 86.3 | 83.9 | 85.6 | 84.6 | 81.7 | 80.6 |
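The TRANSLATE-TRAIN-ALL rows correspond to fine-tuning on the English MultiNLI training data plus its machine translations into the other XNLI languages. The fine-tuning recipe is not reproduced in this README; the sketch below only shows how an NLI-style classification head could be attached through fairseq's hub interface, with the head name, path, and example sentences purely illustrative:

```python
from fairseq.models.roberta import XLMRModel

# Path and the 'xnli' head name are illustrative, not part of the release.
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')

# Add a randomly initialized 3-way head (entailment / neutral / contradiction);
# it only yields meaningful predictions after fine-tuning on XNLI-style data.
xlmr.register_classification_head('xnli', num_classes=3)

# encode() packs a premise/hypothesis pair into a single token sequence.
tokens = xlmr.encode("Il pleut beaucoup aujourd'hui.", "Le temps est sec.")
logits = xlmr.predict('xnli', tokens, return_logits=True)
print(logits.shape)  # torch.Size([1, 3])
```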
MLQA (Lewis et al., 2018)
Scores are F1 / exact match (EM).
| Model | average | en | es | de | ar | hi | vi | zh |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-large | - | 80.2 / 67.4 | - | - | - | - | - | - |
| mBERT | 57.7 / 41.6 | 77.7 / 65.2 | 64.3 / 46.6 | 57.9 / 44.3 | 45.7 / 29.8 | 43.8 / 29.7 | 57.1 / 38.6 | 57.5 / 37.3 |
| xlmr.large | 70.7 / 52.7 | 80.6 / 67.8 | 74.1 / 56.0 | 68.5 / 53.6 | 63.1 / 43.5 | 69.2 / 51.6 | 71.3 / 50.9 | 68.0 / 45.4 |
| xlmr.xl | 73.4 / 55.3 | 85.1 / 72.6 | 66.7 / 46.2 | 70.5 / 55.5 | 74.3 / 56.9 | 72.2 / 54.7 | 74.4 / 52.9 | 70.9 / 48.5 |
| xlmr.xxl | 74.8 / 56.6 | 85.5 / 72.4 | 68.6 / 48.4 | 72.7 / 57.8 | 75.4 / 57.6 | 73.7 / 55.8 | 76.0 / 55.0 | 71.7 / 48.9 |
Example usage
Load XLM-R from torch.hub (PyTorch >= 1.1):
```python
import torch
xlmr = torch.hub.load('pytorch/fairseq:main', 'xlmr.large')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```
Load XLM-R (for PyTorch 1.0 or custom models):
```bash
# Download xlmr.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz
tar -xzvf xlmr.large.tar.gz
```

```python
# Load the model in fairseq
from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```
Apply sentence-piece-model (SPM) encoding to input text:
```python
en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378, 8999, 38, 2]
xlmr.decode(en_tokens)  # 'Hello world!'

zh_tokens = xlmr.encode('你好,世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens)  # '你好,世界'

hi_tokens = xlmr.encode('नमस्ते दुनिया')
assert hi_tokens.tolist() == [0, 68700, 97883, 29405, 2]
xlmr.decode(hi_tokens)  # 'नमस्ते दुनिया'

ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens)  # 'مرحبا بالعالم'

fr_tokens = xlmr.encode('Bonjour le monde')
assert fr_tokens.tolist() == [0, 84602, 95, 11146, 2]
xlmr.decode(fr_tokens)  # 'Bonjour le monde'

# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
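As a rough illustration of the cross-lingual sentence encoder claim (mean pooling is a simple heuristic, not a procedure from the papers), the extracted features can be pooled into sentence vectors and compared across languages:

```python
import torch

def embed(sentence):
    # Mean-pool the last layer's features into a fixed-size sentence vector.
    tokens = xlmr.encode(sentence)
    with torch.no_grad():
        features = xlmr.extract_features(tokens)  # [1, seq_len, 1024] for xlmr.large
    return features.mean(dim=1).squeeze(0)

en = embed('Hello world!')
fr = embed('Bonjour le monde')
similarity = torch.nn.functional.cosine_similarity(en, fr, dim=0)
print(similarity.item())  # mutual translations tend to score relatively high
```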
Citation
```bibtex
@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

@article{goyal2021larger,
  title={Larger-Scale Transformers for Multilingual Masked Language Modeling},
  author={Goyal, Naman and Du, Jingfei and Ott, Myle and Anantharaman, Giri and Conneau, Alexis},
  journal={arXiv preprint arXiv:2105.00572},
  year={2021}
}
```