Megatron-11b is a unidirectional language model with 11B
parameters based on Megatron-LM. Following the original Megatron work, we trained the model using intra-layer model parallelism with each layer's parameters split across 8 GPUs.
Megatron-11b is trained on the same data and uses the same byte-pair encoding (BPE) as RoBERTa.
Model | Description | # params | # filesize | Download |
---|---|---|---|---|
`megatron_11b` | megatron_11b unidirectional language model | 11B | 19Gb | [megatron_11b.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/model_parallel/megatron_11b.tar.gz) |
Architecture:

Param | Value |
---|---|
embed_dim | 3072 |
ffn_dim | 3072 * 6 |
layers | 72 |
attention heads | 32 |
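As a rough sanity check on the 11B figure, the architecture table implies approximately this parameter count (the ~50K vocabulary size is an assumption based on the RoBERTa/GPT-2 BPE; the breakdown below is illustrative, not the exact count):

```python
# Back-of-the-envelope parameter count for Megatron-11b (illustrative only;
# ignores biases, layer norms and positional embeddings).
embed_dim = 3072
ffn_dim = 3072 * 6       # 18432
layers = 72
vocab_size = 50_000      # assumption: RoBERTa/GPT-2 BPE has ~50K types

attention = 4 * embed_dim ** 2      # Q, K, V and output projections
ffn = 2 * embed_dim * ffn_dim       # two linear layers per FFN block
per_layer = attention + ffn
embedding = vocab_size * embed_dim  # shared input/output embedding

total = layers * per_layer + embedding
print(f"~{total / 1e9:.1f}B parameters")  # ~11.0B
```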
Training details:

Param | Value |
---|---|
bsz | 512 |
num_updates | 300,000 |
peak_lr | 1.5e-04 |
lr scheduler | inverse_sqrt |
clip norm | 0.0 |
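For scale, these hyperparameters imply roughly the following number of training tokens (assuming an effective batch of 512 sequences at 1024 tokens per sample, as in the command below):

```python
# Rough training-token count implied by the hyperparameters above (illustrative).
bsz = 512                  # effective batch size in sequences (table above)
tokens_per_sample = 1024   # from the training command below
num_updates = 300_000

total_tokens = bsz * tokens_per_sample * num_updates
print(f"~{total_tokens / 1e9:.0f}B tokens")  # ~157B
```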
Megatron-11b contains too many parameters to train on a single GPU. Following the original Megatron work, we adopt an intra-layer model parallel training approach in which each layer's parameters are split across multiple GPUs and activations and gradients are communicated during the forward/backward pass, respectively. We similarly split the loss computation using the `vocab_parallel_cross_entropy` criterion.
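Conceptually, a vocab-parallel cross entropy lets each model-parallel rank keep only its shard of the output vocabulary: the softmax normalizer and the target logit are combined across shards with all-reduces instead of gathering the full logits. Below is a minimal single-process sketch of that reduction (NumPy stand-ins for the all-reduces; this is not fairseq's implementation):

```python
import numpy as np

# Toy sizes; in Megatron-11b the ~50K-entry vocabulary is split across 8 GPUs.
vocab_size, num_shards = 16, 4
shard_size = vocab_size // num_shards

rng = np.random.default_rng(0)
logits = rng.normal(size=vocab_size)    # full logits for a single token position
target = 11                             # index of the gold token
shards = np.split(logits, num_shards)   # each "GPU" holds one vocabulary shard

# 1) global max (an all-reduce MAX in the real setting), for numerical stability
global_max = max(s.max() for s in shards)
# 2) each shard sums exp() locally; the normalizer is all-reduced (SUM)
sum_exp = sum(np.exp(s - global_max).sum() for s in shards)
# 3) only the shard that owns the target token contributes its logit
owner = target // shard_size
target_logit = shards[owner][target % shard_size]

loss_parallel = np.log(sum_exp) - (target_logit - global_max)

# Reference: ordinary cross entropy on the unsplit logits
loss_full = np.log(np.exp(logits - logits.max()).sum()) - (logits[target] - logits.max())
assert np.isclose(loss_parallel, loss_full)
print(loss_parallel)
```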
The following training command illustrates how to do model parallel training in fairseq. We assume that each machine (node) has 8 GPUs among which to split the model parameters (`--model-parallel-size 8`). If you have access to multiple nodes, you may combine this with data parallel training by increasing `--distributed-world-size`.
To train Megatron-11b on a single node:
```bash
fairseq-train <DATA_PATH> \
  --distributed-world-size 8 \
  --memory-efficient-fp16 \
  --num-workers 2 \
  --model-parallel-size 8 \
  --criterion vocab_parallel_cross_entropy \
  --task language_modeling \
  --sample-break-mode none \
  --tokens-per-sample 1024 \
  --arch transformer_lm_megatron_11b \
  --share-decoder-input-output-embed \
  --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --lr 0.00015 \
  --warmup-updates 3000 --weight-decay 0.01 \
  --dropout 0.1 --attention-dropout 0.1 \
  --batch-size 2 \
  --max-update 300000;
```
Note: the above was tested on a `DGX-1` box with 8 x `V100-32GB` GPUs.
Results on Wikitext-103:

Model | Valid perplexity | Test perplexity |
---|---|---|
`megatron_11b` | 10.64 | 10.54 |
To evaluate `megatron_11b` on Wikitext-103 yourself, first download the released model and the raw dataset:

```bash
# WARNING: this file is 19GB
wget https://dl.fbaipublicfiles.com/fairseq/models/model_parallel/megatron_11b.tar.gz
tar -xzvf megatron_11b.tar.gz

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
Megatron-11b uses a byte-level BPE that expects raw (untokenized) input. Since
the wikitext-103 dataset comes tokenized, we apply a simple detokenization
process to restore the untokenized test set:
```bash
python -m examples.megatron_11b.detok wikitext-103-raw/wiki.test.raw > wikitext-103-raw/wiki.test.detok
```
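The bundled `detok.py` handles this step; conceptually it is just Moses-style detokenization of the whitespace-tokenized wikitext lines. A minimal sketch of the same idea, assuming the `sacremoses` package is installed:

```python
import fileinput
from sacremoses import MosesDetokenizer  # assumption: sacremoses is available

detok = MosesDetokenizer(lang="en")
for line in fileinput.input():
    # wikitext-103 lines are whitespace-tokenized; rejoin them into raw text
    print(detok.detokenize(line.strip().split(), return_str=True))
```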
Next, apply the same byte-level BPE that was used during training:

```bash
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'

python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json encoder.json \
    --vocab-bpe vocab.bpe \
    --inputs "wikitext-103-raw/wiki.test.detok" \
    --outputs "wikitext-103-raw/wiki.test.bpe" \
    --workers 60;
```
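For each line, the encoder maps raw text to space-separated GPT-2 BPE token ids. A rough single-process sketch of that step, assuming fairseq exposes the `get_encoder` helper at the path below:

```python
# Assumption: helper path and signature as in fairseq's gpt2_bpe_utils module.
from fairseq.data.encoders.gpt2_bpe_utils import get_encoder

# encoder.json / vocab.bpe are the files downloaded above
bpe = get_encoder("encoder.json", "vocab.bpe")

with open("wikitext-103-raw/wiki.test.detok") as fin, \
     open("wikitext-103-raw/wiki.test.bpe", "w") as fout:
    for line in fin:
        ids = bpe.encode(line.rstrip("\n"))          # list of BPE token ids
        fout.write(" ".join(map(str, ids)) + "\n")   # one id sequence per line
```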
Finally, binarize the BPE-encoded test set against the released model's dictionary:

```bash
fairseq-preprocess \
    --only-source \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --srcdict megatron_11b/dict.txt \
    --destdir wikitext103-bin;
```
We can now evaluate perplexity on the test set. Note that because we've modified the test set (via detokenization and BPE), the perplexity reported by `fairseq-eval-lm` needs to be renormalized.

Compute unnormalized perplexity:
```bash
DATA_PATH=wikitext103-bin/
fairseq-eval-lm \
  $DATA_PATH \
  --path megatron_11b/model.pt \
  --task language_modeling \
  --gen-subset test \
  --batch-size 8 \
  --criterion cross_entropy \
  --context-window 992 \
  --distributed-world-size 8 \
  --model-parallel-size 8;
# Expected PPL (unnormalized_ppl): [8.46]
# Note: the eval command needs to run on 8 GPUs for the released model
```
Renormalizing formula: `2 ^ ( log_2(unnormalized_PPL) * (270847 / 245566))`.

PPL after normalization: 10.54
To renormalize the perplexity, we must account for the change in token count after detokenizing and applying BPE. The formula for this is:

`2 ^ ( log_2(unnormalized_PPL) * (new_token_cnt / orig_token_cnt))`

For the wikitext-103 test set, the original token count is `245566` and the token count after detokenization and applying BPE is `270847`.

The perplexity after renormalization is:

`2 ^ ( log_2(8.46) * (270847 / 245566)) = 10.54`
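As a quick check, the same renormalization can be computed directly (values taken from above):

```python
import math

unnormalized_ppl = 8.46   # reported by fairseq-eval-lm on the BPE test set
new_token_cnt = 270847    # tokens after detokenization + BPE
orig_token_cnt = 245566   # tokens in the original wikitext-103 test set

# Rescale the per-token log-loss by the ratio of token counts.
normalized_ppl = 2 ** (math.log2(unnormalized_ppl) * (new_token_cnt / orig_token_cnt))
print(round(normalized_ppl, 2))  # ~10.54
```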