This project is compatible with both NVIDIA GPUs and Enflame GCUs: the same codebase can be used to run accelerated PanGu model training on either hardware platform.
Visit this link to download the WuDao dataset.
| Dataset file | Size |
|---|---|
| wudao_corpus_20GB.tar | 9.8GB |
After extracting the dataset, preprocess the data:
```bash
python tools/preprocess_data_pangu.py \
       --input=/path/to/wudao_corpus_20GB/allZh_1Mfile/*.json \
       --output-prefix /path/to/save/path/ \
       --vocab-file ./megatron/tokenizer/bpe_4w_pcl/vocab \
       --dataset-impl mmap \
       --append-eod
```
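With `--dataset-impl mmap`, Megatron-style preprocessing normally writes an indexed `.bin`/`.idx` pair under the given output prefix; the exact file names produced by `preprocess_data_pangu.py` may differ, so check the output directory and point the training scripts' data path at the generated prefix.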
Then launch pretraining with the script that matches your hardware:

```bash
# Pretrain the 2.6B PanGu model on Enflame GCU
bash examples/enflame/pretrain_pangu_distributed_2.6B_enflame.sh
# Pretrain the 2.6B PanGu model on NVIDIA GPU
bash examples/gpu/pretrain_pangu_distributed_2.6B.sh
```
To support both Enflame GCU and NVIDIA GPU hardware, the project makes the following modifications.

`torch_gcu` is introduced in the `megatron` package:
```python
import importlib.util

def is_torch_gcu_available():
    # torch_gcu is Enflame's PyTorch extension for GCU devices
    if importlib.util.find_spec("torch_gcu") is None:
        return False
    if importlib.util.find_spec("torch_gcu.core") is None:
        return False
    return importlib.util.find_spec("torch_gcu.core.model") is not None

if is_torch_gcu_available():
    import torch_gcu
    torch_gcu.set_scalar_cached_enable(False)
else:
    # Fall back to plain torch so later torch_gcu.* references still resolve on GPU
    import torch as torch_gcu
```
Compute-device selection for the GCU:
```python
if is_torch_gcu_available():
    # Pick the GCU that belongs to this local rank (index scaled by LEO_CLUSTER_NUM if set)
    device = torch_gcu.gcu_device(args.local_rank * int(os.getenv("LEO_CLUSTER_NUM", '1')))
else:
    device = torch.device("cpu")
```
Optimizer adaptation interface:
```python
if not is_torch_gcu_available():
    # Standard PyTorch optimizer update on GPU
    optimizer.step()
else:
    # On GCU, the update goes through torch_gcu's JIT-aware optimizer step
    torch_gcu.optimizer_step(optimizer, [loss], mode=torch_gcu.JitRunMode.SAFEASYNC, model=model)
```
`torch_gcu.distributed` is used in the code paths that involve distributed training.
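The switch itself is not shown here; below is a minimal sketch of how the same availability check could select the distributed module, mirroring the `import torch as torch_gcu` fallback above. It assumes `torch_gcu.distributed` exposes a `torch.distributed`-compatible interface, which is an assumption rather than something verified from this repository.

```python
# Sketch only: choose the distributed module that matches the hardware.
# Assumption: torch_gcu.distributed mirrors the torch.distributed API.
if is_torch_gcu_available():
    import torch_gcu.distributed as dist
else:
    import torch.distributed as dist

# Downstream code then uses the usual collective calls through `dist`,
# e.g. dist.init_process_group(...), dist.all_reduce(...), dist.barrier(...).
```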
Configuration changes in the launch scripts:
```bash
# Raise the Enflame runtime log level so only fatal errors are printed
export ENFLAME_LOG_LEVEL=FATAL
# Empty the MLIR IR-dump options (no IR printed before/after compiler passes)
export COMPILE_OPTIONS_MLIR_DBG="-print-ir-before= -print-ir-after="
# Map the local hostname (and its FQDN) to 127.0.0.1 so it resolves during launch
echo "127.0.0.1 "`python -c "import socket;print(socket.gethostname())"` >>/etc/hosts
echo "127.0.0.1 "`python -c "import socket;print(socket.getfqdn(socket.gethostname()))"` >>/etc/hosts
```
On each node, launch the 32-card fine-tuning script with the master node IP and the node rank:

```bash
bash finetune_pangu_distributed_32card_2.6B_enflame master_ip node_rank
```
The relevant launcher arguments inside the script are:

```bash
--master_addr=${master_addr} \  # master node IP
--master_port=${master_port} \  # master node port
--node_rank=$2 \                # 0 on the master node; 1, 2, 3 on the others (order does not matter)
--nnodes=4 \                    # 4 nodes in total
```
So that every node sees the dataset at the same path (assuming the dataset directory on the master node is exported over NFS), add a mount entry on each worker node:

```bash
vim /etc/fstab
```

fstab entry to add:

```
master_ip:/path/to/dataset /path/to/dataset nfs defaults 0 0
```
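After saving the entry, `mount -a` applies it without a reboot, and `df -h /path/to/dataset` confirms the NFS mount is in place; the dataset path here is a placeholder and should match the path used in the training scripts.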
This project takes PengCheng PanGu + GCU + PyTorch + Megatron tensor parallelism with multi-card training as its example, and gives an end-to-end introduction to training PengCheng PanGu on GCU.