# BERT Pretraining

## Model description
BERT, or Bidirectional Encoder Representations from Transformers, improves upon standard Transformers by removing the unidirectionality constraint through a masked language model (MLM) pre-training objective. The masked language model randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of each masked token based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which makes it possible to pre-train a deep bidirectional Transformer. In addition to the masked language model, BERT uses a next sentence prediction task that jointly pre-trains text-pair representations.
## Step 1: Installing

```bash
bash init.sh
```
## Step 2: Preparing dataset

Prepare the dataset following the reference implementation in training_results_v1.0 (see the Reference section below). The expected directory structure is:

```
└── bert/dataset
    ├── 2048_shards_uncompressed
    ├── bert_config.json
    ├── eval_set_uncompressed
    └── model.ckpt-28252.apex.pt
```
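As a quick sanity check before training, you can verify that your dataset directory matches the layout above (the path below is a placeholder; substitute your own location):

```bash
# Point DATA at the prepared dataset (placeholder path; adjust to your setup)
DATA=/path/to/bert/dataset

# The listing should show the shards, config, eval set, and initial checkpoint
ls "${DATA}"
# expected: 2048_shards_uncompressed  bert_config.json  eval_set_uncompressed  model.ckpt-28252.apex.pt
```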
## Step 3: Training
Warning: the number of GPUs used is determined by `torch.cuda.device_count()`, so you can set `CUDA_VISIBLE_DEVICES` to control how many GPUs are visible to the training script.
### Multiple GPUs on one machine (AMP)

```bash
DATA=/path/to/bert/dataset bash train_bert_pretraining_amp_dist.sh
```
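For example, to train on only four GPUs of a machine, restrict the visible devices before launching; `torch.cuda.device_count()` will then report only the listed GPUs (the device ids below are illustrative):

```bash
# Expose only GPUs 0-3 to the training script; torch.cuda.device_count() returns 4
CUDA_VISIBLE_DEVICES=0,1,2,3 DATA=/path/to/bert/dataset bash train_bert_pretraining_amp_dist.sh
```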
Parameters (a usage sketch follows this list):

- `--gradient_accumulation_steps`: number of micro-batches accumulated before each optimizer step
- `--max_steps`: maximum number of training steps
- `--train_batch_size`: training batch size
- `--eval_batch_size`: evaluation batch size
- `--learning_rate`: learning rate
- `--target_mlm_accuracy`: MLM accuracy at which training is considered converged and stops
- `--dist_backend`: distributed communication backend (e.g. `nccl`)
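A minimal sketch of overriding some of these parameters at launch, assuming `train_bert_pretraining_amp_dist.sh` forwards extra arguments to the underlying pretraining script (check the script for the exact pass-through; the values are illustrative, with the batch size and target accuracy taken from the results table below):

```bash
# Illustrative values only; whether extra flags are forwarded depends on the launch script
DATA=/path/to/bert/dataset bash train_bert_pretraining_amp_dist.sh \
    --train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --target_mlm_accuracy 0.72 \
    --dist_backend nccl
```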
## Results on BI-V100

| GPUs | FP16 | FPS | E2E    | MLM Accuracy |
|------|------|-----|--------|--------------|
| 1x8  | True | 227 | 13568s | 0.72         |
| Convergence criteria | Configuration (x denotes number of GPUs) | Performance | Accuracy | Power (W) | Scalability | Memory utilization (G) | Stability |
|----------------------|------------------------------------------|-------------|----------|-----------|-------------|------------------------|-----------|
| 0.72                 | SDK V2.2, bs:32, 8x, AMP                 | 214         | 0.72     | 152*8     | 0.96        | 20.3*8                 | 1         |
## Reference

- https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA/benchmarks/bert/implementations/pytorch