liuzx 863ec95136 | 1 year ago | |
---|---|---|
Example_Picture | 1 year ago | |
README.md | 1 year ago | |
config.py | 1 year ago | |
dataset.py | 1 year ago | |
dataset_distributed.py | 1 year ago | |
lenet.py | 1 year ago | |
rank_table_2pcs.json | 1 year ago | |
run.sh | 1 year ago | |
train.py | 1 year ago |
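#!/bin/bash
# Launch 2-card NPU data-parallel training: one background process per device.
set -e
EXEC_PATH=$(pwd)
export RANK_SIZE=2
export HCCL_CONNECT_TIMEOUT=6000

# Select the rank table for the 2-card setup.
test_dist_2pcs()
{
    # Alternative: use the rank table shipped in this repository.
    # export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_2pcs.json
    export RANK_TABLE_FILE=/user/config/nbstart_hccl.json
    export RANK_SIZE=2
}
test_dist_${RANK_SIZE}pcs

for((i=0;i<RANK_SIZE;i++))
do
    # Give each device a clean working directory with its own copy of the code.
    rm -rf device$i
    mkdir device$i
    cp ./train.py ./config.py ./lenet.py ./dataset_distributed.py ./dataset.py ./device$i
    cd ./device$i
    export DEVICE_ID=$i
    export RANK_ID=$i
    echo "start training for device $i"
    # Save the environment seen by this process to help with debugging.
    env > env$i.log
    # Run in the background; stdout and stderr go to a per-device log file.
    python ./train.py > train_dataparallel_debug_env.log$i 2>&1 &
    cd ../
done
echo "Training launched in the background; logs are in device<N>/train_dataparallel_debug_env.log<N> (for example device0/train_dataparallel_debug_env.log0)."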
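Because run.sh starts each training process in the background and returns immediately, a successful launch says nothing about whether training itself is healthy. The following is a minimal sketch for reading the tail of each per-device log from a notebook. It assumes the directory and log file names used in run.sh above; the helper name `tail_device_logs` and its parameters are hypothetical, not part of the repository.

```python
from pathlib import Path


def tail_device_logs(base_dir=".", rank_size=2, n_lines=20,
                     log_name="train_dataparallel_debug_env.log"):
    """Return {device index: last n_lines lines of that device's log}.

    Assumes the layout created by run.sh: device<i>/<log_name><i>.
    A device whose log file does not exist yet maps to None.
    """
    tails = {}
    for i in range(rank_size):
        log_path = Path(base_dir) / f"device{i}" / f"{log_name}{i}"
        if not log_path.is_file():
            tails[i] = None
            continue
        lines = log_path.read_text(errors="replace").splitlines()
        tails[i] = lines[-n_lines:]
    return tails


if __name__ == "__main__":
    for device, lines in tail_device_logs().items():
        body = "no log yet" if lines is None else "\n".join(lines)
        print(f"--- device{device} ---\n{body}")
```

Calling it again after a short wait gives a rough view of progress on both cards without leaving the notebook.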
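Run run.sh from a notebook cell:
import os
# run.sh uses bash-only syntax (for((...))), so invoke it with bash rather than sh.
# os.system returns the exit status; 0 means the script exited successfully.
print(os.system("bash run.sh"))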
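The `os.system` call above returns only a status code. A possible alternative, sketched here with the standard `subprocess` module, also captures what the launcher printed so it can be inspected in the notebook. The helper name `run_launcher` is hypothetical, and the default command assumes run.sh is in the current working directory.

```python
import subprocess


def run_launcher(cmd=("bash", "run.sh"), cwd=None):
    """Run the launcher command and return (exit code, combined stdout/stderr)."""
    result = subprocess.run(
        list(cmd),
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,                 # decode output as str (Python 3.7+)
    )
    return result.returncode, result.stdout
```

A typical call would be `code, output = run_launcher()` followed by `print(code, output)`. A non-zero code means the launcher itself failed, for example during the copy step, before any training process was started.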
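This example shows how to debug multi-card NPU training on the OpenI (Qizhi) platform. The multi-card NPU debugging image differs from the training image: the training image comes preconfigured for multi-card parallel training, whereas in the debugging environment you have to configure it yourself with a script.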
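The configuration that run.sh performs amounts to giving every process the same shared settings plus its own device and rank. A minimal sketch of that mapping follows, using only the variable names and values that appear in run.sh. The function name `build_rank_env` is hypothetical, and the one-to-one mapping of `DEVICE_ID` to `RANK_ID` simply mirrors the script.

```python
import os


def build_rank_env(rank, rank_size=2,
                   rank_table="/user/config/nbstart_hccl.json",
                   base_env=None):
    """Return a copy of the environment with the variables run.sh exports for one rank."""
    if not 0 <= rank < rank_size:
        raise ValueError(f"rank must be in [0, {rank_size}), got {rank}")
    # Start from the current environment unless a base is supplied.
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "RANK_SIZE": str(rank_size),
        "RANK_TABLE_FILE": rank_table,
        "HCCL_CONNECT_TIMEOUT": "6000",
        "DEVICE_ID": str(rank),  # run.sh uses the loop index for both values
        "RANK_ID": str(rank),
    })
    return env
```

A dictionary built this way could be passed as the `env` argument of `subprocess.Popen` to start one training process per rank directly from Python.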