Megatron-11b is a unidirectional language model with 11B
parameters based on Megatron-LM. Following the original Megatron work, we trained the model using intra-layer model parallelism with each layer's parameters split across 8 GPUs.
Megatron-11b is trained on the same data and uses the same byte-pair encoding (BPE) as RoBERTa.
Model | Description | # params | # filesize | Download |
---|---|---|---|---|
`megatron_11b` | megatron_11b unidirectional language model | 11B | 19Gb | [megatron_11b.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/model_parallel/megatron_11b.tar.gz) |
Architecture:

Param | Value |
---|---|
embed_dim | 3072 |
ffn_dim | 3072 * 6 |
layers | 72 |
attention heads | 32 |
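As a rough sanity check on the 11B figure, the architecture table implies approximately this parameter count (the ~50K vocabulary size is an assumption based on the RoBERTa/GPT-2 BPE; the breakdown below is illustrative, not the exact count):

```python
# Back-of-the-envelope parameter count for Megatron-11b (illustrative only;
# ignores biases, layer norms and positional embeddings).
embed_dim = 3072
ffn_dim = 3072 * 6       # 18432
layers = 72
vocab_size = 50_000      # assumption: RoBERTa/GPT-2 BPE has ~50K types

attention = 4 * embed_dim ** 2      # Q, K, V and output projections
ffn = 2 * embed_dim * ffn_dim       # two linear layers per FFN block
per_layer = attention + ffn
embedding = vocab_size * embed_dim  # shared input/output embedding

total = layers * per_layer + embedding
print(f"~{total / 1e9:.1f}B parameters")  # ~11.0B
```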
Training details:

Param | Value |
---|---|
bsz | 512 |
num_updates | 300,000 |
peak_lr | 1.5e-04 |
lr scheduler | inverse_sqrt |
clip norm | 0.0 |
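For scale, these hyperparameters imply roughly the following number of training tokens (assuming an effective batch of 512 sequences at 1024 tokens per sample, as in the command below):

```python
# Rough training-token count implied by the hyperparameters above (illustrative).
bsz = 512                  # effective batch size in sequences (table above)
tokens_per_sample = 1024   # from the training command below
num_updates = 300_000

total_tokens = bsz * tokens_per_sample * num_updates
print(f"~{total_tokens / 1e9:.0f}B tokens")  # ~157B
```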
Megatron-11b contains too many parameters to train on a single GPU. Following the original Megatron work, we adopt an intra-layer model parallel training approach in which each layer's parameters are split across multiple GPUs and activations and gradients are communicated during the forward/backward pass, respectively. We similarly split the loss computation using the `vocab_parallel_cross_entropy` criterion.
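Conceptually, a vocab-parallel cross entropy lets each model-parallel rank keep only its shard of the output vocabulary: the softmax normalizer and the target logit are combined across shards with all-reduces instead of gathering the full logits. Below is a minimal single-process sketch of that reduction (NumPy stand-ins for the all-reduces; this is not fairseq's implementation):

```python
import numpy as np

# Toy sizes; in Megatron-11b the ~50K-entry vocabulary is split across 8 GPUs.
vocab_size, num_shards = 16, 4
shard_size = vocab_size // num_shards

rng = np.random.default_rng(0)
logits = rng.normal(size=vocab_size)    # full logits for a single token position
target = 11                             # index of the gold token
shards = np.split(logits, num_shards)   # each "GPU" holds one vocabulary shard

# 1) global max (an all-reduce MAX in the real setting), for numerical stability
global_max = max(s.max() for s in shards)
# 2) each shard sums exp() locally; the normalizer is all-reduced (SUM)
sum_exp = sum(np.exp(s - global_max).sum() for s in shards)
# 3) only the shard that owns the target token contributes its logit
owner = target // shard_size
target_logit = shards[owner][target % shard_size]

loss_parallel = np.log(sum_exp) - (target_logit - global_max)

# Reference: ordinary cross entropy on the unsplit logits
loss_full = np.log(np.exp(logits - logits.max()).sum()) - (logits[target] - logits.max())
assert np.isclose(loss_parallel, loss_full)
print(loss_parallel)
```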
The following training command illustrates how to do model parallel training in fairseq. We assume that each machine (node) has 8 GPUs among which to split the model parameters (`--model-parallel-size 8`). If you have access to multiple nodes, you may combine this with data parallel training by increasing `--distributed-world-size`.
To train Megatron-11b on a single node:
```bash
fairseq-train <DATA_PATH> \
  --distributed-world-size 8 \
  --memory-efficient-fp16 \
  --num-workers 2 \
  --model-parallel-size 8 \
  --criterion vocab_parallel_cross_entropy \
  --task language_modeling \
  --sample-break-mode none \
  --tokens-per-sample 1024 \
  --arch transformer_lm_megatron_11b \
  --share-decoder-input-output-embed \
  --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --lr 0.00015 \
  --warmup-updates 3000 --weight-decay 0.01 \
  --dropout 0.1 --attention-dropout 0.1 \
  --batch-size 2 \
  --max-update 300000;
```
Note: the above was tested on a `DGX-1` box with 8 x `V100-32GB` GPUs.
Results on Wikitext-103:

Model | Valid perplexity | Test perplexity |
---|---|---|
`megatron_11b` | 10.64 | 10.54 |
To evaluate `megatron_11b` on Wikitext-103 yourself, first download the released model and the raw dataset:

```bash
# WARNING: this file is 19GB
wget https://dl.fbaipublicfiles.com/fairseq/models/model_parallel/megatron_11b.tar.gz
tar -xzvf megatron_11b.tar.gz

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
Megatron-11b uses a byte-level BPE that expects raw (untokenized) input. Since
the wikitext-103 dataset comes tokenized, we apply a simple detokenization
process to restore the untokenized test set:
```bash
python -m examples.megatron_11b.detok wikitext-103-raw/wiki.test.raw > wikitext-103-raw/wiki.test.detok
```
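The bundled `detok.py` handles this step; conceptually it is just Moses-style detokenization of the whitespace-tokenized wikitext lines. A minimal sketch of the same idea, assuming the `sacremoses` package is installed:

```python
import fileinput
from sacremoses import MosesDetokenizer  # assumption: sacremoses is available

detok = MosesDetokenizer(lang="en")
for line in fileinput.input():
    # wikitext-103 lines are whitespace-tokenized; rejoin them into raw text
    print(detok.detokenize(line.strip().split(), return_str=True))
```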
Next, apply the same byte-level BPE that was used during training:

```bash
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'

python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json encoder.json \
    --vocab-bpe vocab.bpe \
    --inputs "wikitext-103-raw/wiki.test.detok" \
    --outputs "wikitext-103-raw/wiki.test.bpe" \
    --workers 60;
```
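For each line, the encoder maps raw text to space-separated GPT-2 BPE token ids. A rough single-process sketch of that step, assuming fairseq exposes the `get_encoder` helper at the path below:

```python
# Assumption: helper path and signature as in fairseq's gpt2_bpe_utils module.
from fairseq.data.encoders.gpt2_bpe_utils import get_encoder

# encoder.json / vocab.bpe are the files downloaded above
bpe = get_encoder("encoder.json", "vocab.bpe")

with open("wikitext-103-raw/wiki.test.detok") as fin, \
     open("wikitext-103-raw/wiki.test.bpe", "w") as fout:
    for line in fin:
        ids = bpe.encode(line.rstrip("\n"))          # list of BPE token ids
        fout.write(" ".join(map(str, ids)) + "\n")   # one id sequence per line
```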
Finally, binarize the BPE-encoded test set against the released model's dictionary:

```bash
fairseq-preprocess \
    --only-source \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --srcdict megatron_11b/dict.txt \
    --destdir wikitext103-bin;
```
We can now evaluate perplexity on the test set. Note that because we've modified the test set (via detokenization and BPE), the perplexity reported by `fairseq-eval-lm` needs to be renormalized.

Compute unnormalized perplexity:
```bash
DATA_PATH=wikitext103-bin/
fairseq-eval-lm \
  $DATA_PATH \
  --path megatron_11b/model.pt \
  --task language_modeling \
  --gen-subset test \
  --batch-size 8 \
  --criterion cross_entropy \
  --context-window 992 \
  --distributed-world-size 8 \
  --model-parallel-size 8;
# Expected PPL (unnormalized_ppl): [8.46]
# Note: the eval command needs to run on 8 GPUs for the released model
```
Renormalizing formula: `2 ^ ( log_2(unnormalized_PPL) * (270847 / 245566))`.

PPL after normalization: 10.54
To renormalize the perplexity, we must account for the change in token count after detokenizing and applying BPE. The formula for this is:

`2 ^ ( log_2(unnormalized_PPL) * (new_token_cnt / orig_token_cnt))`

For the wikitext-103 test set, the original token count is `245566` and the token count after detokenization and applying BPE is `270847`.

The perplexity after renormalization is:

`2 ^ ( log_2(8.46) * (270847 / 245566)) = 10.54`
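As a quick check, the same renormalization can be computed directly (values taken from above):

```python
import math

unnormalized_ppl = 8.46   # reported by fairseq-eval-lm on the BPE test set
new_token_cnt = 270847    # tokens after detokenization + BPE
orig_token_cnt = 245566   # tokens in the original wikitext-103 test set

# Rescale the per-token log-loss by the ratio of token counts.
normalized_ppl = 2 ** (math.log2(unnormalized_ppl) * (new_token_cnt / orig_token_cnt))
print(round(normalized_ppl, 2))  # ~10.54
```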