#947 RuntimeError: E40021 and E61001

Open
created 2 months ago by davislee · 11 comments
I tested my code on an RTX 3090 (mindtorch 0.2.1 + mindspore 2.2.14) and it runs fine, but in an Ascend cloud-brain training task I hit errors such as `E40021: Failed to compile Op` and `E61001: input number is too much`. How should I go about solving this? I'm not sure where to start. Is there anything extra I need to pay attention to when migrating code that runs successfully on CUDA to the Ascend platform? I've looked through the related documentation but couldn't find a clear solution. Many thanks.

```
[ERROR] KERNEL(14,ffff85a420e0,python):2024-08-04-04:14:05.524.986 [mindspore/ccsrc/plugin/device/ascend/kernel/acl/acl_kernel_mod.cc:261] Launch] Kernel launch failed, msg: Acl compile and execute failed, op_type_:Unpack
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E40021: 2024-08-04-04:14:05.081.451 Failed to compile Op [Unpack52]. (oppath: [Compile /usr/local/Ascend/ascend-toolkit/8.0.RC1/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/unpack.py failed with errormsg/stack:
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/utils/errormgr/error_manager_vector.py", line 284, in raise_err_specific_reson
    raise RuntimeError(args_dict, msg)
RuntimeError: ({'errCode': 'E61001', 'op_name': 'te_unpack_a1d9b5b08fbab18702182967063349315bd54d7add3b449268c009aef5673c56_1', 'reason': 'input number is too much'}, 'In op [te_unpack_a1d9b5b08fbab18702182967063349315bd54d7add3b449268c009aef5673c56_1], [input number is too much]') ], optype: [Unpack])[THREAD:24081]
        Solution: See the host log for details, and then check the Python stack where the error log is reported.
        TraceBack (most recent call last):
        Compile op[Unpack52] failed, oppath[/usr/local/Ascend/ascend-toolkit/8.0.RC1/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/unpack.py], optype[Unpack], taskID[226]. Please check op's compilation error message.[FUNC:ReportBuildErrMessage][FILE:fusion_manager.cc][LINE:748][THREAD:24081]
        [SubGraphOpt][Compile][ProcFailedCompTask] Thread[281466957181152] recompile single op[Unpack52] failed[FUNC:ProcessAllFailedCompileTasks][FILE:tbe_op_store_adapter.cc][LINE:962][THREAD:24081]
        [SubGraphOpt][Compile][ParalCompOp] Thread[281466957181152] process fail task failed[FUNC:ParallelCompileOp][FILE:tbe_op_store_adapter.cc][LINE:1010][THREAD:24081]
        [SubGraphOpt][Compile][CompOpOnly] CompileOp failed.[FUNC:CompileOpOnly][FILE:op_compiler.cc][LINE:1119][THREAD:24081]
        [GraphOpt][FusedGraph][RunCompile] Failed to compile graph with compiler Normal mode Op Compiler[FUNC:SubGraphCompile][FILE:fe_graph_optimizer.cc][LINE:1385][THREAD:24081]
        Call OptimizeFusedGraph failed, ret:-1, engine_name:AIcoreEngine, graph_name:partition0_rank1_new_sub_graph1[FUNC:OptimizeSubGraph][FILE:graph_optimize.cc][LINE:126][THREAD:24081]
        subgraph 0 optimize failed[FUNC:OptimizeSubGraphWithMultiThreads][FILE:graph_manager.cc][LINE:1021][THREAD:861]
        build graph failed, graph id:51, ret:-1[FUNC:BuildModelWithGraphId][FILE:ge_generator.cc][LINE:1615][THREAD:861]
        [Build][SingleOpModel]call ge interface generator.BuildSingleOpModel failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161][THREAD:861]
        [Build][Op]Fail to build op model[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145][THREAD:861]
        build op model failed, result = 500002[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145][THREAD:861]

(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/transform/acl_ir/acl_utils.cc:379 Run

[ERROR] DEVICE(14,ffff85a420e0,python):2024-08-04-04:14:05.525.052 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_kernel_executor.cc:1007] LaunchKernel] Launch kernel failed, kernel full name: Bprop/gradStack/Unstack-op0
[ERROR] RUNTIME_FRAMEWORK(14,ffff85a420e0,python):2024-08-04-04:14:05.636.801 [mindspore/ccsrc/runtime/graph_scheduler/actor/kernel_async_launch_actor.cc:38] LaunchKernel] Failed to launch kernel: Bprop/gradStack/Unstack-op0 and catch exception:
----------------------------------------------------
- Kernel error:
----------------------------------------------------
Launch kernel failed: Bprop/gradStack/Unstack-op0
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/graph_scheduler/actor/kernel_actor.cc:753 ExecuteLaunchKernelTask

Traceback (most recent call last):
  File "/tmp/code/iplapmindspore/train_landmarks_generator_ms_cuda.py", line 349, in <module>
    loss,L1_loss,Velocity_loss = train_step(T_mels, T_pose, Nl_pose, Nl_content, T_content)
  File "/tmp/code/iplapmindspore/train_landmarks_generator_ms_cuda.py", line 330, in train_step
    (loss, _, L1_loss, Velocity_loss), grads = grad_fn(T_mels, T_pose, Nl_pose, Nl_content, T_content)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 626, in after_grad
    return grad_(fn_, weights)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 130, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 602, in after_grad
    out = _pynative_executor.grad(fn, grad_, weights, grad_position, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1297, in grad
    return self._executor.grad(grad, obj, weights, grad_position, *args, *(kwargs.values()))
RuntimeError:
----------------------------------------------------
- Kernel error:
----------------------------------------------------
Launch kernel failed: Bprop/gradStack/Unstack-op0
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
```
Erpim commented 2 months ago
Collaborator
Hi, could you provide a simplified piece of code? From the error message, this looks like an error raised during automatic differentiation, possibly caused by a framework bug. In general, when you want to compare GPU and NPU, it's best to keep the mindspore and mindtorch versions identical, so you can quickly determine whether the framework or a specific hardware operator is responsible.
davislee commented 2 months ago
Poster
The latest GPU version I could find on the official site is mindtorch 0.2.1 + mindspore 2.2.14 (I was debugging on AutoDL), while the Ascend debug tasks only have a mindtorch 0.3 image available, so the versions are inconsistent. I'll provide a simplified version of the code later.
davislee commented 2 months ago
Poster
In an Ascend debug task, and in graph mode, the model trains normally. But with the same environment, once it is run as a training task, training fails. (I hadn't set `ms.set_context` before, so it was presumably using the default configuration.) The error is slightly different:

```shell
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E40021: 2024-08-06-06:52:49.033.290 Failed to compile Op [SplitD25]. (oppath: [Compile /usr/local/Ascend/ascend-toolkit/8.0.RC1/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/split_d.py failed with errormsg/stack:
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/dsl/varshape/split_variable_shape.py", line 46, in check_params
    raise RuntimeError(err_msg__args, get_error_message(err_msg__args))
RuntimeError: ({'errCode': 'E90001', 'detailed_cause': 'split numbers error, split numbers must be greater 0 and less equal 63, now, it is 128'}, 'Compile operator failed, cause: Parameters check failed, detailed information: split numbers error, split numbers must be greater 0 and less equal 63, now, it is 128.') ], optype: [SplitD])[THREAD:2876]
        Solution: See the host log for details, and then check the Python stack where the error log is reported.
        TraceBack (most recent call last):
        Compile op[SplitD25] failed, oppath[/usr/local/Ascend/ascend-toolkit/8.0.RC1/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/split_d.py], optype[SplitD], taskID[143]. Please check op's compilation error message.[FUNC:ReportBuildErrMessage][FILE:fusion_manager.cc][LINE:748][THREAD:2876]
        [SubGraphOpt][Compile][ProcFailedCompTask] Thread[281464742588640] recompile single op[SplitD25] failed[FUNC:ProcessAllFailedCompileTasks][FILE:tbe_op_store_adapter.cc][LINE:962][THREAD:2876]
        [SubGraphOpt][Compile][ParalCompOp] Thread[281464742588640] process fail task failed[FUNC:ParallelCompileOp][FILE:tbe_op_store_adapter.cc][LINE:1010][THREAD:2876]
        [SubGraphOpt][Compile][CompOpOnly] CompileOp failed.[FUNC:CompileOpOnly][FILE:op_compiler.cc][LINE:1119][THREAD:2876]
        [GraphOpt][FusedGraph][RunCompile] Failed to compile graph with compiler Normal mode Op Compiler[FUNC:SubGraphCompile][FILE:fe_graph_optimizer.cc][LINE:1385][THREAD:2876]
        Call OptimizeFusedGraph failed, ret:-1, engine_name:AIcoreEngine, graph_name:partition0_rank1_new_sub_graph1[FUNC:OptimizeSubGraph][FILE:graph_optimize.cc][LINE:126][THREAD:2876]
        subgraph 0 optimize failed[FUNC:OptimizeSubGraphWithMultiThreads][FILE:graph_manager.cc][LINE:1021][THREAD:860]
        build graph failed, graph id:24, ret:-1[FUNC:BuildModelWithGraphId][FILE:ge_generator.cc][LINE:1615][THREAD:860]
        [Build][SingleOpModel]call ge interface generator.BuildSingleOpModel failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161][THREAD:860]
        [Build][Op]Fail to build op model[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145][THREAD:860]
        build op model failed, result = 500002[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145][THREAD:860]

(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/transform/acl_ir/acl_utils.cc:379 Run

[ERROR] DEVICE(14,ffff71ac00e0,python):2024-08-06-06:52:49.420.884 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_kernel_executor.cc:1007] LaunchKernel] Launch kernel failed, kernel full name: Default/Split-op5
[ERROR] RUNTIME_FRAMEWORK(14,ffff71ac00e0,python):2024-08-06-06:52:49.506.350 [mindspore/ccsrc/runtime/graph_scheduler/actor/kernel_async_launch_actor.cc:38] LaunchKernel] Failed to launch kernel: Default/Split-op5 and catch exception:
----------------------------------------------------
- Kernel error:
----------------------------------------------------
Launch kernel failed: Default/Split-op5
----------------------------------------------------
- The Function Call Stack: (For framework developers)
----------------------------------------------------
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/function/array_func.py:4576
    res = _get_cache_prim(P.Split)(axis, sections)(x)
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/function/array_func.py:4565
    def _split_int(x, split_size_or_sections, axis):
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/function/array_func.py:4658
    res = _split_int(tensor, split_size_or_sections, arr_axis)
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/function/array_func.py:4612
    def split(tensor, split_size_or_sections, axis=0):
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindtorch-0.3.0-py3.9.egg/mindtorch/torch/functional.py:1264
    output = ms.ops.split(tensor, split_size_or_sections, dim)
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindtorch-0.3.0-py3.9.egg/mindtorch/torch/functional.py:1262
    def split(tensor, split_size_or_sections, dim=0):
In file /tmp/code/iplapmindspore/models_ms/landmark_generator.py:224
    pose_embedding = torch.stack(torch.split(pose_embedding, T), dim=0)  # (B,T,512)
In file /tmp/code/iplapmindspore/models_ms/landmark_generator.py:203
    def forward(self, T_mels, T_pose, Nl_pose, Nl_content):
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindtorch-0.3.0-py3.9.egg/mindtorch/torch/nn/modules/module.py:430
    return self.forward(*inputs, **kwargs)
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindtorch-0.3.0-py3.9.egg/mindtorch/torch/nn/modules/module.py:429
    def construct(self, *inputs, **kwargs):
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/graph_scheduler/actor/kernel_actor.cc:753 ExecuteLaunchKernelTask

  0%|          | 0/25 [05:50<?, ?it/s]
  0%|          | 0/332 [06:03<?, ?it/s]

Traceback (most recent call last):
  File "/tmp/code/iplapmindspore/train_landmarks_generator_ms_cuda.py", line 344, in <module>
    evaluate(model, val_data_loader)
  File "/tmp/code/iplapmindspore/train_landmarks_generator_ms_cuda.py", line 269, in evaluate
    predict_content = model(T_mels, T_pose,Nl_pose,Nl_content)  # (B*T,2,57)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/nn/cell.py", line 685, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1006, in compile_and_run
    return _cell_graph_executor(self, *new_args, phase=self.phase)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1631, in __call__
    return self.run(obj, *args, phase=phase)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1670, in run
    return self._exec_pip(obj, *args, phase=phase_real)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 130, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1650, in _exec_pip
    return self._graph_executor(args, phase)
RuntimeError:
----------------------------------------------------
- Kernel error:
----------------------------------------------------
Launch kernel failed: Default/Split-op5
----------------------------------------------------
- The Function Call Stack: (For framework developers)
----------------------------------------------------
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/function/array_func.py:4576
    res = _get_cache_prim(P.Split)(axis, sections)(x)
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/function/array_func.py:4565
    def _split_int(x, split_size_or_sections, axis):
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/function/array_func.py:4658
    res = _split_int(tensor, split_size_or_sections, arr_axis)
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/function/array_func.py:4612
    def split(tensor, split_size_or_sections, axis=0):
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindtorch-0.3.0-py3.9.egg/mindtorch/torch/functional.py:1264
    output = ms.ops.split(tensor, split_size_or_sections, dim)
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindtorch-0.3.0-py3.9.egg/mindtorch/torch/functional.py:1262
    def split(tensor, split_size_or_sections, dim=0):
In file /tmp/code/iplapmindspore/models_ms/landmark_generator.py:224
    pose_embedding = torch.stack(torch.split(pose_embedding, T), dim=0)  # (B,T,512)
In file /tmp/code/iplapmindspore/models_ms/landmark_generator.py:203
    def forward(self, T_mels, T_pose, Nl_pose, Nl_content):
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindtorch-0.3.0-py3.9.egg/mindtorch/torch/nn/modules/module.py:430
    return self.forward(*inputs, **kwargs)
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindtorch-0.3.0-py3.9.egg/mindtorch/torch/nn/modules/module.py:429
    def construct(self, *inputs, **kwargs):
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/graph_scheduler/actor/kernel_actor.cc:753 ExecuteLaunchKernelTask
failed
```
On the automatic differentiation theory: I extracted my model training code, i.e. the code around the `train_step` function, and tested it on its own, feeding it random tensors generated with `torch.rand` with the same shapes as my real inputs. It ran fine. So it probably isn't a differentiation problem? Or could specific values (for example, my training set) trigger this kind of error?
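For reference, the `SplitD` message above says the number of split outputs must be at most 63 but is 128, and the function call stack points at `torch.stack(torch.split(pose_embedding, T), dim=0)`. Below is a rough sketch (hypothetical shapes and random data, not my actual model) that isolates just this split/stack pattern and its backward pass; if the failure really depends on the number of split outputs rather than on the data values, the `B=128` case should fail on Ascend while `B=32` passes:

```python
import mindspore as ms
import mindtorch.torch as torch

T = 15  # chunk length used by torch.split(pose_embedding, T)

def split_stack(x):
    # Same pattern as landmark_generator.py:224: split dim 0 into chunks of
    # length T, then stack the chunks along a new leading axis.
    out = torch.stack(torch.split(x, T, dim=0), dim=0)  # (B, T, 512)
    return out.sum()

grad_fn = ms.ops.value_and_grad(split_stack)

for B in (32, 128):  # 32 chunks stay under the 63-output limit, 128 exceed it
    x = torch.randn(B * T, 512)
    # The 128 case is expected to raise the Unpack/SplitD compile error on
    # Ascend if the output-count limit is really the cause.
    loss, grad = grad_fn(x)
    print(f"B={B}: ok, loss={loss}")
```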
Erpim commented 2 months ago
Collaborator
What is the difference between the debug task and the training task here? I don't quite understand this description; do you mean the mode in which the cloud-brain image is requested?
Erpim commented 2 months ago
Collaborator
> The latest GPU version I could find on the official site is mindtorch 0.2.1 + mindspore 2.2.14 (I was debugging on AutoDL), while the Ascend debug tasks only have a mindtorch 0.3 image available, so the versions are inconsistent. I'll provide a simplified version of the code later.

Could you try setting `ms.set_context(device_target="CPU")`? First check whether the workflow runs normally on other hardware.
davislee commented 2 months ago
Poster
> What is the difference between the debug task and the training task here? I don't quite understand this description; do you mean the mode in which the cloud-brain image is requested?

Yes, I mean the debug tasks, training tasks, inference tasks, etc. that the cloud-brain platform distinguishes, as shown below:

![image](/attachments/d38d03ff-9a1b-48e3-846c-b641e4f32d03)

With `ms.set_context(device_target="CPU", mode=0)` it fails with:

```
Traceback (most recent call last):
  File "/tmp/code/iplapmindspore/train_landmarks_generator_ms_cuda.py", line 351, in <module>
    loss,L1_loss,Velocity_loss = train_step(T_mels, T_pose, Nl_pose, Nl_content, T_content)
  File "/tmp/code/iplapmindspore/train_landmarks_generator_ms_cuda.py", line 333, in train_step
    (loss, _, L1_loss, Velocity_loss), grads = grad_fn(T_mels, T_pose, Nl_pose, Nl_content, T_content)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 741, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, input_signature, process_obj, jit_config)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 130, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 360, in __call__
    raise err
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 357, in __call__
    phase = self.compile(self.fn.__name__, *args_list, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 449, in compile
    is_compile = self._graph_executor.compile(self.fn, compile_args, kwargs, phase, True)
ValueError: For 'BiasAddGrad', input tensor's dimension is 3, when data_format is NCHW the last dimension size should greater than 1.
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/cpu/kernel/bias_add_grad_cpu_kernel.cc:45 Resize failed
```

With `ms.set_context(device_target="CPU")` it fails with:

```
Traceback (most recent call last):
  File "/tmp/code/iplapmindspore/train_landmarks_generator_ms_cuda.py", line 351, in <module>
    loss,L1_loss,Velocity_loss = train_step(T_mels, T_pose, Nl_pose, Nl_content, T_content)
  File "/tmp/code/iplapmindspore/train_landmarks_generator_ms_cuda.py", line 333, in train_step
    (loss, _, L1_loss, Velocity_loss), grads = grad_fn(T_mels, T_pose, Nl_pose, Nl_content, T_content)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 626, in after_grad
    return grad_(fn_, weights)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 130, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 602, in after_grad
    out = _pynative_executor.grad(fn, grad_, weights, grad_position, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1297, in grad
    return self._executor.grad(grad, obj, weights, grad_position, *args, *(kwargs.values()))
ValueError: For 'BiasAddGrad', input tensor's dimension is 3, when data_format is NCHW the last dimension size should greater than 1.
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/cpu/kernel/bias_add_grad_cpu_kernel.cc:45 Resize failed
```
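Side note on the CPU errors (only a guess, based on the message): the model uses `nn.Conv1d` with `kernel_size` equal to the input length, so the conv output has last dimension 1, and the CPU `BiasAddGrad` kernel apparently rejects a 3-D NCHW tensor whose last dimension is 1 when computing the bias gradient. A minimal sketch (hypothetical shapes, mirroring the optimizer/grad setup used later in this thread) that should hit the same check independently of the rest of the model:

```python
import mindspore as ms
import mindtorch.torch as torch
import mindtorch.torch.nn as nn

ms.set_context(device_target="CPU")

# kernel_size equals the input length, so the conv output has shape (B, 512, 1).
conv = nn.Conv1d(in_channels=2, out_channels=512, kernel_size=131)
x = torch.randn(8, 2, 131)

def forward_fn(x):
    return conv(x).sum()

# Differentiating w.r.t. the conv parameters (including the bias) is what should
# reach BiasAddGrad with a 3-D tensor whose last dimension is 1.
optimizer = torch.optim.Adam(conv.parameters(), lr=1e-3)
grad_fn = ms.ops.value_and_grad(forward_fn, None, optimizer.parameters)
loss, grads = grad_fn(x)
```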
hanjr commented 1 month ago
Collaborator
In the debug environment, did you run the training program directly without making any changes to the environment? From the error in training mode, it looks like an operator does not support that data format, but in the same environment the debug task should then be unsupported and report an error as well.
laich commented 1 month ago
Collaborator
@davislee Take the result from debug mode as the reference. The mindtorch 0.3 image has not been formally tested in training mode. In training mode the task is launched via docker run and does not enter a bash environment, so unexpected errors may occur.
davislee commented 1 month ago
Poster
> In the debug environment, did you run the training program directly without making any changes to the environment? From the error in training mode, it looks like an operator does not support that data format, but in the same environment the debug task should then be unsupported and report an error as well.

I did install some additional Python dependencies (by running a shell script from Python), but the program goes through exactly the same steps in the training task and the debug task. I don't understand why the debug environment shows no error while the training environment does.
davislee commented 1 month ago
Poster
> @davislee Take the result from debug mode as the reference. The mindtorch 0.3 image has not been formally tested in training mode. In training mode the task is launched via docker run and does not enter a bash environment, so unexpected errors may occur.

But debug mode has a 4-hour limit. If I want to actually train the model, it seems the only option is a training task, and the training task fails with this error.
davislee commented 2 weeks ago
Poster
Reproduced the bug again:

```python
import mindtorch.torch as torch
import mindspore as ms
import mindtorch.torch.nn as nn


class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.conv = nn.Conv1d(
            in_channels=2,     # input channels
            out_channels=512,  # output channels
            kernel_size=131,   # kernel size, chosen equal to the input sequence length
            stride=1,          # stride
            padding=0,         # padding
        )

    def forward(self, Nl_pose, Nl_content):  # (B,T,1,hv,wv) (B,T,2,74) (B,N_l,2,74) (B,N_l,2,57)
        N_l = Nl_content.size(1)
        Nl_ref = torch.cat([Nl_pose, Nl_content], dim=3)  # (B,Nl,2,131=74+57)
        Nl_ref = torch.cat([Nl_ref[i] for i in range(Nl_ref.size(0))], dim=0)  # (B*Nl,2,131)
        ref_embedding = self.conv(Nl_ref).squeeze(-1)  # (B*Nl,512)
        ref_embedding = torch.stack(torch.split(ref_embedding, N_l, dim=0), dim=0)  # (B,N_l,512)
        return ref_embedding  # (B*T,2,57)


ms.set_context(device_target="Ascend", device_id=4,
               ascend_config={"precision_mode": "allow_fp32_to_fp16"})

# Define the inputs
Nl_pose = torch.randn(128, 15, 2, 74)
Nl_content = torch.randn(128, 15, 2, 57)

# Pack the inputs
inputs = [Nl_pose, Nl_content]

# Define the model
model = SimpleModel()

# Define the optimizer and the loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion_L1 = torch.nn.L1Loss()

# Define the forward pass and the grad function
def forward_fn(inputs):
    predict_content = model(*inputs)
    L1_loss = criterion_L1(predict_content, torch.zeros(*predict_content.shape))
    print(f'L1_loss:{L1_loss}')
    return L1_loss, predict_content

grad_fn = ms.ops.value_and_grad(forward_fn, None, optimizer.parameters, has_aux=True)

grad_fn(inputs)
print("success")
```

Error:

```
[ERROR] KERNEL(270214,fffdd3fff160,python):2024-09-19-01:46:09.655.677 [mindspore/ccsrc/plugin/device/ascend/kernel/acl/acl_kernel_mod.cc:260] Launch] Kernel launch failed, msg: Acl compile and execute failed, op_type_:Unpack
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E61001: 2024-09-19-01:46:09.274.889 In op [te_unpack_076dce63228e6644175f2a39b4445a6c3013edac76e87619911779e7acbabd45_1], [input number is too much][THREAD:271372]
        TraceBack (most recent call last):
        Failed to compile Op [Unpack7]. (oppath: [Compile /home/shuziren/Ascend/ascend-toolkit/8.0.RC2/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/unpack.py failed with errormsg/stack:
          File "/home/shuziren/miniconda3/envs/iplap/lib/python3.10/site-packages/tbe/common/utils/errormgr/error_manager_vector.py", line 284, in raise_err_specific_reson
            raise RuntimeError(args_dict, msg)
        RuntimeError: ({'errCode': 'E61001', 'op_name': 'te_unpack_076dce63228e6644175f2a39b4445a6c3013edac76e87619911779e7acbabd45_1', 'reason': 'input number is too much'}, 'In op [te_unpack_076dce63228e6644175f2a39b4445a6c3013edac76e87619911779e7acbabd45_1], [input number is too much]') ], optype: [Unpack])[THREAD:271372]
        Compile op[Unpack7] failed, oppath[/home/shuziren/Ascend/ascend-toolkit/8.0.RC2/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/unpack.py], optype[Unpack], taskID[40]. Please check op's compilation error message.[FUNC:ReportBuildErrMessage][FILE:fusion_manager.cc][LINE:748][THREAD:271372]
        [SubGraphOpt][Compile][ProcFailedCompTask] Thread[281462595072352] recompile single op[Unpack7] failed[FUNC:ProcessAllFailedCompileTasks][FILE:tbe_op_store_adapter.cc][LINE:961][THREAD:271372]
        [SubGraphOpt][Compile][ParalCompOp] Thread[281462595072352] process fail task failed[FUNC:ParallelCompileOp][FILE:tbe_op_store_adapter.cc][LINE:1009][THREAD:271372]
        [SubGraphOpt][Compile][CompOpOnly] CompileOp failed.[FUNC:CompileOpOnly][FILE:op_compiler.cc][LINE:1112][THREAD:271372]
        [GraphOpt][FusedGraph][RunCompile] Failed to compile graph with compiler Normal mode Op Compiler[FUNC:SubGraphCompile][FILE:fe_graph_optimizer.cc][LINE:1420][THREAD:271372]
        Call OptimizeFusedGraph failed, ret:-1, engine_name:AIcoreEngine, graph_name:partition0_rank1_new_sub_graph1[FUNC:OptimizeSubGraph][FILE:graph_optimize.cc][LINE:119][THREAD:271372]
        subgraph 0 optimize failed[FUNC:OptimizeSubGraphWithMultiThreads][FILE:graph_manager.cc][LINE:1012][THREAD:270734]
        build graph failed, graph id:6, ret:-1[FUNC:BuildModelWithGraphId][FILE:ge_generator.cc][LINE:1608][THREAD:270734]
        [Build][SingleOpModel]call ge interface generator.BuildSingleOpModel failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161][THREAD:270734]
        [Build][Op]Fail to build op model[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145][THREAD:270734]
        build op model failed, result = 500002[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145][THREAD:270734]

(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/transform/acl_ir/acl_utils.cc:379 Run

[ERROR] DEVICE(270214,fffdd3fff160,python):2024-09-19-01:46:09.655.806 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_kernel_executor.cc:1156] LaunchKernel] Launch kernel failed, kernel full name: Bprop/gradStack/Unstack-op0
[ERROR] RUNTIME_FRAMEWORK(270214,ffff992bd010,python):2024-09-19-01:46:09.700.558 [mindspore/ccsrc/runtime/graph_scheduler/actor/actor_common.cc:327] WaitRuntimePipelineFinish] Wait runtime pipeline finish and an error occurred:
----------------------------------------------------
- Kernel error:
----------------------------------------------------
Launch kernel failed: Bprop/gradStack/Unstack-op0
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/graph_scheduler/actor/kernel_actor.cc:917 ExecuteLaunchKernelTask
```

Environment:

* mindtorch: latest master branch
* mindspore: 2.3.1
* CANN: Ascend-cann-kernels-910_8.0.RC2_linux, Ascend-cann-toolkit_8.0.RC2_linux-aarch64
* NPU: 910A
* Python: 3.10.14
* gcc: 7.3.0
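Not an official fix, but a possible workaround until the operator limit is handled: when `torch.split` produces equally sized chunks that are immediately reassembled with `torch.stack` along a new leading axis, the same tensor can be produced by a single `reshape`, which avoids generating a `SplitD`/`Unpack` whose output count grows with the batch size. A sketch against the shapes in the reproduction above (the helper name is made up):

```python
import mindtorch.torch as torch

def stack_uniform_splits(x, chunk):
    # Hypothetical helper: equivalent to
    #     torch.stack(torch.split(x, chunk, dim=0), dim=0)
    # whenever x.shape[0] is a multiple of `chunk`, but expressed as a reshape
    # so no Split/Unpack op with a large output count is emitted.
    return x.reshape(x.size(0) // chunk, chunk, *x.shape[1:])

N_l = 15
ref_embedding = torch.randn(128 * N_l, 512)      # (B*N_l, 512)
out = stack_uniform_splits(ref_embedding, N_l)   # (B, N_l, 512)
print(out.shape)
```

The `torch.cat([Nl_ref[i] for i in range(Nl_ref.size(0))], dim=0)` line in the reproduction can likewise be written as `Nl_ref.reshape(-1, 2, 131)`, which keeps the graph free of ops whose input/output counts scale with the batch size.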