#3 GPU distributed inference error

Closed
created 1 year ago by yang_bo · 0 comments
yang_bo commented 1 year ago
With the same settings, the single-GPU inference script runs without error. The ckpt is given as an absolute path, and both scripts are in the same directory.

### Background

Hardware: 4 × A100 (章鱼平台, the Octopus platform)

Framework: mindspore 1.8.1, cuda 11.1

### Scripts

Single-GPU script:

```
mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -x NCCL_DEBUG -x GLOG_v \
    -n 1 \
    --hostfile ./hostfile_1gpus \
    --output-filename ./log_output \
    python -s ./predict.py \
    --mode 2.6B \
    --run_type predict \
    --distribute false \
    --op_level_model_parallel_num 1 \
    --load_ckpt_path /code/translation/pangu/data/ckpt/ \
    --load_ckpt_name mPanGu_Alpha-53_fp16.ckpt \
    --param_init_type "fp16"
```

Multi-GPU script:

```
mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -x NCCL_DEBUG -x GLOG_v \
    -n 4 \
    --hostfile ./hostfile_4gpus \
    --output-filename ./log_output \
    python -s ./predict.py \
    --mode 2.6B \
    --run_type predict \
    --op_level_model_parallel_num 4 \
    --load_ckpt_path /code/translation/pangu/data/ckpt/ \
    --load_ckpt_name mPanGu_Alpha-53_fp16.ckpt \
    --param_init_type "fp16"
```

### Error

The multi-GPU script fails with:

RuntimeError: CheckPoint file is not found

The single-GPU script does not raise this error.
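One way to narrow this down is to check whether every launched process can actually see the checkpoint file. The sketch below is only a diagnostic and does not identify the cause. It assumes the processes are started by Open MPI's `mpirun` (which exports `OMPI_COMM_WORLD_RANK`). The helper name `describe_checkpoint` and the file name `check_ckpt.py` are hypothetical and not part of the PanGu code. It also does not cover the case where `predict.py` looks for a different file name in distributed mode.

```python
import os
import socket


def describe_checkpoint(ckpt_dir, ckpt_name):
    """Report whether the checkpoint file is visible from this process."""
    path = os.path.join(ckpt_dir, ckpt_name)
    return {
        # Open MPI exports OMPI_COMM_WORLD_RANK to each launched process;
        # fall back to 0 when the script is run without mpirun.
        "rank": int(os.environ.get("OMPI_COMM_WORLD_RANK", "0")),
        "host": socket.gethostname(),
        "path": path,
        "exists": os.path.isfile(path),
    }


if __name__ == "__main__":
    # Directory and file name copied from the scripts in this issue.
    info = describe_checkpoint(
        "/code/translation/pangu/data/ckpt/",
        "mPanGu_Alpha-53_fp16.ckpt",
    )
    print(info)
```

Saved as `check_ckpt.py` (hypothetical name) and launched with the same `mpirun -n 4 --hostfile ./hostfile_4gpus` prefix as the multi-GPU script, this prints one line per rank. If any rank reports `'exists': False`, the path is not visible from that process. If all ranks report `True`, the path itself is fine and the next thing to inspect is how `predict.py` builds the checkpoint file name in distributed mode.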
taoht closed this issue 1 year ago