

M2M-100 Tokenization

We apply different tokenization strategies to different languages, following the existing literature. We provide tok.sh, a tokenizer script that can be used to reproduce our results.

To reproduce the results, follow these steps:

tgt_lang=...
reference_translation=...
cat generation_output | grep -P "^H" | sort -V | cut -f 3- | sh tok.sh $tgt_lang > hyp
cat $reference_translation | sh tok.sh $tgt_lang > ref
sacrebleu -tok 'none' ref < hyp
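
To make the extraction step of the pipeline above easier to follow, here is a minimal Python sketch of what `grep -P "^H" | sort -V | cut -f 3-` does to the generation output. It assumes each hypothesis line has the form `H-<id><TAB><score><TAB><text>`; that layout, the function name `extract_hypotheses`, and the sample lines are illustrative assumptions, not part of the original scripts. Unlike `grep`, the sketch skips lines that start with `H` but do not match this layout.

```python
import re


def extract_hypotheses(lines):
    """Approximate `grep -P "^H" | sort -V | cut -f 3-` on generation output.

    Assumed line layout (not confirmed by this README):
        H-<id> <TAB> <score> <TAB> <hypothesis text>
    """
    hyps = []
    for line in lines:
        line = line.rstrip("\n")
        # grep -P "^H": keep only hypothesis lines
        if not line.startswith("H"):
            continue
        fields = line.split("\t")
        match = re.match(r"H-(\d+)$", fields[0])
        if match is None or len(fields) < 3:
            continue  # skip lines that do not follow the assumed layout
        # cut -f 3-: keep everything from the third tab-separated field on
        hyps.append((int(match.group(1)), "\t".join(fields[2:])))
    # sort -V: order by numeric sentence id, so H-2 comes before H-10
    hyps.sort(key=lambda pair: pair[0])
    return [text for _, text in hyps]


# Made-up sample lines, for illustration only
sample = [
    "S-1\tsource one",
    "H-10\t-0.5\thyp ten",
    "H-2\t-0.3\thyp two",
    "T-2\tref two",
    "H-1\t-0.1\thyp one",
]

for hypothesis in extract_hypotheses(sample):
    print(hypothesis)
```

The numeric sort matters because hypotheses are then in sentence order, which is what lets `hyp` line up with `ref` line by line when the reference file is in its original order.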

Installation

The tools needed for all languages except Arabic can be installed by running install_dependencies.sh.
To evaluate Arabic models, follow the installation instructions provided at http://alt.qcri.org/tools/arabic-normalizer/.

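PengCheng-Tongyan (鹏程-通言) model: Tongyan is a multilingual machine translation model that improves on the M2M-100 architecture. Through parameter reuse and incremental training, it scales the model from 1.2B to 13.2B parameters and substantially improves translation quality for many low-resource languages of the Belt and Road region.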

