# Speech to speech translation (S2ST)
We provide the implementation for speech-to-unit translation (S2UT) proposed in *Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation* (Popuri et al. 2022), as well as the pretrained models used.
## Pretrained Models
We used the multilingual HuBERT model open sourced in Textless S2ST with Real Data.

### Wav2vec 2.0
| Language | Block type | Model size | Dataset | Model |
| --- | --- | --- | --- | --- |
| Es | Transformer | BASE | Voxpopuli | ckpt |
| Es | Transformer | LARGE | Voxpopuli | ckpt |
| Es | Conformer | LARGE | Voxpopuli | ckpt |
| En | Transformer | BASE | Librilight | ckpt |
| En | Conformer | LARGE | Librilight | ckpt |
### Unit mBART

## Data preparation
- To prepare data for S2UT finetuning, follow the steps from Direct S2ST with Discrete Units and format the data in the S2UT format. Note that we instead use 1000 units from the eleventh layer (`--layer 11`) of the multilingual HuBERT model linked above.
- Run the following to set the header of each manifest:

```bash
var="id\taudio\tn_frames\ttgt_text\ttgt_n_frames"
sed -i "1s/.*/$var/" ${SPLIT}.tsv
```
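The same header rewrite can be done portably in Python (a minimal sketch equivalent to the `sed` one-liner above; the manifest path is a placeholder):

```python
from pathlib import Path

def set_s2ut_header(tsv_path: str) -> None:
    """Replace the first line of a manifest TSV with the S2UT header,
    mirroring: sed -i "1s/.*/$var/" ${SPLIT}.tsv"""
    header = "id\taudio\tn_frames\ttgt_text\ttgt_n_frames"
    path = Path(tsv_path)
    lines = path.read_text().splitlines()
    lines[0] = header  # overwrite whatever header the manifest had
    path.write_text("\n".join(lines) + "\n")
```
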
## Training

### Speech-to-unit translation (S2UT)
Here's an example of finetuning S2UT models with 1000 discrete units as the target. You can download the sample config file and vocabulary for Es-En from here:
```bash
fairseq-train $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_text --arch xm_transformer \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --share-decoder-input-output-embed --adaptor-n-layers 1 --normalize \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --load-pretrained-decoder-from ${unit_mBART} --w2v-path ${wav2vec2.0} \
  --mask-prob 0.3 --mask-channel-length 32 --mask-channel-prob 0.25 \
  --save-dir ${MODEL_DIR} --checkpoint-activations --encoder-proj \
  --lr 0.0005 --lr-scheduler inverse_sqrt \
  --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 20000 --max-tokens 4000 --max-tokens-valid 4000 --max-source-positions 4000 \
  --max-target-positions 4000 --update-freq 120 \
  --seed 1 --fp16 --num-workers 1
```
- Adjust `--update-freq` accordingly for different numbers of GPUs. In the above we set `--update-freq 120` to simulate training with 120 GPUs.
- In the above setting we finetune the model end-to-end, corresponding to the full setup in the paper.
- To apply LNA-E partial finetuning, add `--finetune-w2v-params layer_norm,self_attn`.
- For LNA-D partial finetuning, add `--finetune-decoder-params encoder_attn,layer_norm,self_attn`. To optionally freeze the encoder for the first K updates, use `--freeze-finetune-updates ${K}`.
- For LNA-E,D partial finetuning, add both of the above options.
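The LNA-style options above keep a parameter trainable when its name contains one of the comma-separated substrings. A minimal sketch of that matching logic, independent of fairseq (the parameter names below are hypothetical, for illustration only):

```python
def select_finetune_params(param_names, finetune_params):
    """Return the parameter names that stay trainable under LNA-style
    partial finetuning: keep a parameter if its name contains any of
    the comma-separated substrings (e.g. "encoder_attn,layer_norm,self_attn")."""
    patterns = finetune_params.split(",")
    return [name for name in param_names if any(p in name for p in patterns)]

# Hypothetical decoder parameter names, for illustration only.
names = [
    "decoder.layers.0.self_attn.k_proj.weight",
    "decoder.layers.0.encoder_attn.k_proj.weight",
    "decoder.layers.0.fc1.weight",
    "decoder.layers.0.final_layer_norm.weight",
]
trainable = select_finetune_params(names, "encoder_attn,layer_norm,self_attn")
# fc1 is excluded; the attention and layer-norm parameters remain trainable
```
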
### Unit-based HiFi-GAN vocoder

We apply the unit-based HiFi-GAN vocoders open sourced in Textless S2ST with Real Data to convert the predicted unit sequences to waveform.
## Inference

### Speech-to-unit translation (S2UT)
- Follow the same inference process as in fairseq-S2T to generate unit sequences (`${RESULTS_PATH}/generate-${GEN_SUBSET}.txt`).
```bash
fairseq-generate $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_text \
  --path $MODEL_DIR/checkpoint_best.pt --gen-subset $GEN_SUBSET \
  --max-tokens 10000 --max-source-positions 10000 --max-target-positions 10000 \
  --beam 10 --max-len-a 1 --max-len-b 200 \
  --results-path ${RESULTS_PATH}
```
- Convert unit sequences to waveform.

```bash
grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
  sed 's/^D-//ig' | sort -nk1 | cut -f3 \
  > ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
  --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
  --results-path ${RESULTS_PATH} --dur-prediction
```
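For reference, the `grep | sed | sort | cut` pipeline above can be mirrored in Python (a sketch assuming the standard `fairseq-generate` hypothesis format of `D-<id>\t<score>\t<text>` lines):

```python
def extract_units(generate_output_lines):
    """Collect hypothesis ("D-") lines from fairseq-generate output,
    sort them by sample id, and return the unit sequences (third field),
    mirroring: grep "^D\-" | sed 's/^D-//' | sort -nk1 | cut -f3"""
    rows = []
    for line in generate_output_lines:
        if not line.startswith("D-"):
            continue  # skip S-/T-/H-/P- and log lines
        idx, _score, units = line[2:].rstrip("\n").split("\t", 2)
        rows.append((int(idx), units))
    return [units for _idx, units in sorted(rows)]
```
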
## Evaluation

To evaluate speech translation output, we first apply ASR to the speech output and then compute the BLEU score between the ASR-decoded text and the references using sacreBLEU.
## Finetuned Model Checkpoints

Note: some of the tasks use the `speech_to_text_sharded` task, which is not yet open sourced, so make sure to override the task to `speech_to_text` when using those models.