root dc8329a70e | 2 years ago | |
---|---|---|
.. | ||
.ipynb_checkpoints | 2 years ago | |
thirdparty | 2 years ago | |
README.md | 2 years ago | |
seg_ja.sh | 2 years ago | |
seg_ko.sh | 2 years ago | |
tokenize_indic.py | 2 years ago | |
tokenize_thai.py | 2 years ago | |
tokenize_zh.py | 2 years ago | |
tokenizer_ar.sh | 2 years ago |
We apply different tokenization strategies for different languages following the existing literature. Here we provide tok.sh a tokenizer that can be used to reproduce our results.
To reproduce the results, follow these steps:
tgt_lang=...
reference_translation=...
cat generation_output | grep -P "^H" | sort -V | cut -f 3- | sh tok.sh $tgt_lang > hyp
cat $reference_translation |sh tok.sh $tgt_lang > ref
sacrebleu -tok 'none' ref < hyp
Tools needed for all the languages except Arabic can be installed by running install_dependencies.sh
If you want to evaluate Arabic models, please follow the instructions provided here: http://alt.qcri.org/tools/arabic-normalizer/ to install
鹏程-通言模型 通言模型是在M2M-100模型结构上进行改进的多语种机器翻译模型,通过参数复用和增量式训练,将模型参数从1.2B提升至13.2B,在一带一路多个小语种的翻译上大幅提升。
Text Python C++ Cuda other
Dear OpenI User
Thank you for your continuous support to the Openl Qizhi Community AI Collaboration Platform. In order to protect your usage rights and ensure network security, we updated the Openl Qizhi Community AI Collaboration Platform Usage Agreement in January 2024. The updated agreement specifies that users are prohibited from using intranet penetration tools. After you click "Agree and continue", you can continue to use our services. Thank you for your cooperation and understanding.
For more agreement content, please refer to the《Openl Qizhi Community AI Collaboration Platform Usage Agreement》