Evaluating Pre-trained Models
=============================

First, download a pre-trained model along with its vocabularies:

.. code-block:: console

    > curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -

This model uses a `Byte Pair Encoding (BPE)
vocabulary <https://arxiv.org/abs/1508.07909>`__, so we'll have to apply
the encoding to the source text before it can be translated. This can be
done with the
`apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/apply_bpe.py>`__
script using the ``wmt14.en-fr.fconv-py/bpecodes`` file. ``@@`` is
used as a continuation marker and the original text can be easily
recovered with e.g. ``sed 's/@@ //g'`` or by passing the ``--remove-bpe``
flag to :ref:`fairseq-generate`. Prior to BPE, input text needs to be tokenized
using ``tokenizer.perl`` from
`mosesdecoder <https://github.com/moses-smt/mosesdecoder>`__.
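
As a rough sketch of this manual preprocessing (the repository paths below are
illustrative and assume ``mosesdecoder`` and ``subword-nmt`` have been cloned
into the current directory), a single sentence could be prepared as follows:

.. code-block:: console

    > echo "Why is it rare to discover new marine mammal species?" \
        | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
        | python subword-nmt/subword_nmt/apply_bpe.py --codes wmt14.en-fr.fconv-py/bpecodes
    Why is it rare to discover new marine mam@@ mal species ?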

Let's use :ref:`fairseq-interactive` to generate translations interactively.
Here, we use a beam size of 5 and preprocess the input with the Moses
tokenizer and the given Byte-Pair Encoding vocabulary. It will automatically
remove the BPE continuation markers and detokenize the output.

.. code-block:: console

    > MODEL_DIR=wmt14.en-fr.fconv-py
    > fairseq-interactive \
        --path $MODEL_DIR/model.pt $MODEL_DIR \
        --beam 5 --source-lang en --target-lang fr \
        --tokenizer moses \
        --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes
    | loading model(s) from wmt14.en-fr.fconv-py/model.pt
    | [en] dictionary: 44206 types
    | [fr] dictionary: 44463 types
    | Type the input sentence and press return:
    Why is it rare to discover new marine mammal species?
    S-0     Why is it rare to discover new marine mam@@ mal species ?
    H-0     -0.0643349438905716     Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
    P-0     -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015

This generation script produces three types of output: a line prefixed
with *S* shows the source sentence after preprocessing (tokenization and
BPE); *H* is the hypothesis along with an average log-likelihood; and *P*
is the positional score for each token position, including the
end-of-sentence marker, which is omitted from the text.

Other types of output lines you might see are *D*, the detokenized hypothesis;
*T*, the reference target; *A*, alignment info; and *E*, the history of generation steps.
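
Because the fields on these lines are tab-separated, the hypotheses alone can
be pulled out of a saved generation log with standard shell tools. A minimal
sketch (the ``gen.out`` and ``hypotheses.txt`` file names are just examples):

.. code-block:: console

    > fairseq-generate (...) | tee gen.out
    > grep ^H gen.out | cut -f3 > hypotheses.txt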

See the `README <https://github.com/pytorch/fairseq#pre-trained-models>`__ for a
full list of pre-trained models available.

Training a New Model
====================

The following tutorial is for machine translation. For an example of how
to use Fairseq for other tasks, such as :ref:`language modeling`, please see the
``examples/`` directory.

Data Pre-processing
-------------------

Fairseq contains example pre-processing scripts for several translation
datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT
2014 (English-German). To pre-process and binarize the IWSLT dataset:

.. code-block:: console

    > cd examples/translation/
    > bash prepare-iwslt14.sh
    > cd ../..
    > TEXT=examples/translation/iwslt14.tokenized.de-en
    > fairseq-preprocess --source-lang de --target-lang en \
        --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
        --destdir data-bin/iwslt14.tokenized.de-en

This will write binarized data that can be used for model training to
``data-bin/iwslt14.tokenized.de-en``.
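
The exact file names depend on the fairseq version, but the destination
directory will typically contain a source and target dictionary plus
binarized ``.bin``/``.idx`` pairs for each split, along these lines:

.. code-block:: console

    > ls data-bin/iwslt14.tokenized.de-en
    dict.de.txt  dict.en.txt  preprocess.log
    test.de-en.de.bin   test.de-en.de.idx   test.de-en.en.bin   test.de-en.en.idx
    train.de-en.de.bin  train.de-en.de.idx  train.de-en.en.bin  train.de-en.en.idx
    valid.de-en.de.bin  valid.de-en.de.idx  valid.de-en.en.bin  valid.de-en.en.idx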

Training
--------

Use :ref:`fairseq-train` to train a new model. Here are a few example settings that
work well for the IWSLT 2014 dataset:

.. code-block:: console

    > mkdir -p checkpoints/fconv
    > CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
        --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
        --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

By default, :ref:`fairseq-train` will use all available GPUs on your machine. Use the
``CUDA_VISIBLE_DEVICES`` environment variable to select specific GPUs and/or to
change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum
number of tokens per batch (``--max-tokens``). You may need to use a
smaller value depending on the available GPU memory on your system.
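
For example, to restrict training to the first two GPUs and halve the per-GPU
token budget, the command above could be adjusted as follows (the values are
illustrative, not tuned):

.. code-block:: console

    > CUDA_VISIBLE_DEVICES=0,1 fairseq-train data-bin/iwslt14.tokenized.de-en \
        --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 2000 \
        --arch fconv_iwslt_de_en --save-dir checkpoints/fconv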

Generation
----------

Once your model is trained, you can generate translations using
:ref:`fairseq-generate` **(for binarized data)** or
:ref:`fairseq-interactive` **(for raw text)**:

.. code-block:: console

    > fairseq-generate data-bin/iwslt14.tokenized.de-en \
        --path checkpoints/fconv/checkpoint_best.pt \
        --batch-size 128 --beam 5
    | [de] dictionary: 35475 types
    | [en] dictionary: 24739 types
    | data-bin/iwslt14.tokenized.de-en test 6750 examples
    | model fconv
    | loaded checkpoint checkpoints/fconv/checkpoint_best.pt
    S-721   danke .
    T-721   thank you .
    ...

To generate translations with only a CPU, use the ``--cpu`` flag. BPE
continuation markers can be removed with the ``--remove-bpe`` flag.
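
For example, the command above could be run on a CPU-only machine, with the
BPE markers stripped from the output:

.. code-block:: console

    > fairseq-generate data-bin/iwslt14.tokenized.de-en \
        --path checkpoints/fconv/checkpoint_best.pt \
        --batch-size 128 --beam 5 --cpu --remove-bpe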

Advanced Training Options
=========================

Large mini-batch training with delayed updates
----------------------------------------------

The ``--update-freq`` option can be used to accumulate gradients from
multiple mini-batches and delay updating, creating a larger effective
batch size. Delayed updates can also improve training speed by reducing
inter-GPU communication costs and by saving idle time caused by variance
in workload across GPUs. See `Ott et al.
(2018) <https://arxiv.org/abs/1806.00187>`__ for more details.

To train on a single GPU with an effective batch size that is equivalent
to training on 8 GPUs:

.. code-block:: console

    > CUDA_VISIBLE_DEVICES=0 fairseq-train --update-freq 8 (...)
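
As a rough worked example, with the IWSLT settings above (``--max-tokens 4000``),
``--update-freq 8`` on a single GPU accumulates roughly 8 × 4,000 = 32,000 tokens
per optimizer step, comparable to 8 GPUs each contributing a 4,000-token batch
(actual batch sizes vary with sentence lengths and padding).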

Training with half precision floating point (FP16)
--------------------------------------------------

.. note::

    FP16 training requires a Volta GPU and CUDA 9.1 or greater

Recent GPUs enable efficient half precision floating point computation,
e.g., using `Nvidia Tensor Cores
<https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__.
Fairseq supports FP16 training with the ``--fp16`` flag:

.. code-block:: console

    > fairseq-train --fp16 (...)

Distributed training
--------------------

Distributed training in fairseq is implemented on top of ``torch.distributed``.
The easiest way to launch jobs is with the `torch.distributed.launch
<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.

For example, to train a large English-German Transformer model on 2 nodes each
with 8 GPUs (in total 16 GPUs), run the following command on each node,
replacing ``node_rank=0`` with ``node_rank=1`` on the second node and making
sure to update ``--master_addr`` to the IP address of the first node:

.. code-block:: console

    > python -m torch.distributed.launch --nproc_per_node=8 \
        --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
        --master_port=12345 \
        $(which fairseq-train) data-bin/wmt16_en_de_bpe32k \
        --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 \
        --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 3584 \
        --max-epoch 70 \
        --fp16

On SLURM clusters, fairseq will automatically detect the number of nodes and
GPUs, but a port number must be provided:

.. code-block:: console

    > salloc --gpus=16 --nodes 2 (...)
    > srun fairseq-train --distributed-port 12345 (...)

Sharding very large datasets
----------------------------

It can be challenging to train over very large datasets, particularly if your
machine does not have much system RAM. Most tasks in fairseq support training
over "sharded" datasets, in which the original dataset has been preprocessed
into non-overlapping chunks (or "shards").

For example, instead of preprocessing all your data into a single "data-bin"
directory, you can split the data and create "data-bin1", "data-bin2", etc.
Then you can adapt your training command like so:

.. code-block:: console

    > fairseq-train data-bin1:data-bin2:data-bin3 (...)

Training will now iterate over each shard, one by one, with each shard
corresponding to an "epoch", thus reducing system memory usage.
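
How the shards are created is up to you. One possible sketch (the
``train.part1``/``train.part2`` file names are illustrative) is to split the raw
training data into chunks and run ``fairseq-preprocess`` once per chunk, reusing
the dictionaries from the first shard via ``--srcdict``/``--tgtdict`` so that all
shards share the same vocabulary:

.. code-block:: console

    > fairseq-preprocess --source-lang de --target-lang en \
        --trainpref $TEXT/train.part1 --validpref $TEXT/valid --testpref $TEXT/test \
        --destdir data-bin1
    > fairseq-preprocess --source-lang de --target-lang en \
        --trainpref $TEXT/train.part2 \
        --srcdict data-bin1/dict.de.txt --tgtdict data-bin1/dict.en.txt \
        --destdir data-bin2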