#946 RuntimeError: Allocate memory failed

Closed
created 2 months ago by davislee · 6 comments
When running a debug task on the 云脑 (Cloud Brain) platform in the mindtorch 0.3 environment, I get `RuntimeError: Allocate memory failed`. The same code runs without problems on a 3090 (24 GB of GPU memory) plus an Intel CPU (72 GB of RAM), and according to the Ascend platform a single 910 has 32 GB of device memory and 96 GB of host memory. Yet every run fails with `RuntimeError: Allocate memory failed`, as shown below:

![image](/attachments/6238cb42-d71f-4b56-bd6a-88a6b486753f)

I reduced the dataset size and the `batch_size`, but the same error still occurs, even though `top` shows that memory usage is not high. I also tried `GRAPH_MODE`, but it reported that some operations in the code are not supported in `GRAPH_MODE`, so I gave that up.

What is causing this error, and what do I need to change to get the code running?
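For reference, the run configuration boils down to something like the sketch below (a minimal sketch, not the exact script; the `28GB` cap is only an illustrative value, and `top` only reports host RAM, not the NPU's device memory):

```python
import mindspore as ms

# Minimal sketch (placeholder values): the debug task runs in the default
# PyNative mode on a single Ascend 910.
ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend")

# Illustrative cap on the NPU's device memory pool. Device (HBM) usage is
# tracked by MindSpore and is separate from the host RAM shown by `top`.
ms.set_context(max_device_memory="28GB")
```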
Erpim commented 2 months ago
Collaborator
Hi, this error means the device memory (NPU memory) is exhausted, which cannot be observed with `top`. When you change the dataset size or the batch size, does the number of iterations completed before the error change? Also, does the error occur at the same iteration every time, or at a random one? Please also post the error you hit in graph mode.
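For example, a small wrapper like the one below (placeholder names, adapt to your own loop) makes it easy to record the failing step across runs:

```python
def run_with_step_logging(train_step, train_loader):
    """Sketch with placeholder `train_step`/`train_loader`: log the index of
    the step that raises, to see whether the failure point is fixed or random."""
    for step, (images, targets) in enumerate(train_loader):
        try:
            train_step(images, targets)
        except RuntimeError as err:
            print(f"RuntimeError at step {step}: {err}")
            raise
```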
davislee commented 2 months ago
Poster
After changing the `batch_size` or the dataset size, the number of iterations completed does change, and the failing iteration is random each time. I previously ran ResNet-34 on MNIST on the Ascend platform (mindtorch 0.3 image) and also ran out of device memory (under the default PyNative mode), which already struck me as odd. What's more, with exactly the same code (default PyNative mode) and exactly the same environment, a debug task runs out of device memory while a training task does not. As for graph mode, I found that the earlier error was caused by the `with torch.no_grad():` block, which graph mode rejects; after removing that line, graph mode runs normally. Now the out-of-memory error only appears under the default PyNative mode.
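For reference, the mode switch on my side is just the standard context call (a minimal sketch):

```python
import mindspore as ms

# Default PyNative mode: this is where "Allocate memory failed" still occurs.
ms.set_context(mode=ms.PYNATIVE_MODE)

# Graph mode now runs fine once the `with torch.no_grad():` block is removed:
# ms.set_context(mode=ms.GRAPH_MODE)
```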
Erpim commented 2 months ago
Collaborator
Another idea: the 云脑 (Cloud Brain) environment lets you replace the MindSpore version. Check the historical releases at https://www.mindspore.cn/versions and install 2.3.0rc2 or 2.3.0 to see whether the problem still reproduces.
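After reinstalling, you can confirm which version is actually active with something like:

```python
import mindspore

# Print the version the interpreter actually picked up after the reinstall,
# then run MindSpore's built-in installation/device sanity check.
print(mindspore.__version__)
mindspore.run_check()
```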
davislee commented 2 months ago
Poster
The 云脑 environment only offers MindSpore 2.2.0 and 2.3.0, and only the 2.3.0 environment ships mindtorch (0.3.0). You could take a look at my fairly simple repository, the one where I ran ResNet-34 on MNIST; it hits the same out-of-device-memory problem under PyNative mode, so perhaps the cause will be easier to find there. Repository: https://openi.pcl.ac.cn/davislee/davis202407221627027/src/branch/main

Start a Cloud Brain debug task and run `python main.py` in the repository root. The full MNIST dataset is also included in the repository, under the `data` folder. You will hit an error like `Malloc Mem From Mem Pool failed`:

```shell
[ERROR] KERNEL(2878,fffd76ffd0e0,python):2024-08-07-02:15:23.676.963 [mindspore/ccsrc/plugin/device/ascend/kernel/acl/acl_kernel_mod.cc:261] Launch] Kernel launch failed, msg: Malloc Mem From Mem Pool failed, size:2359328
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/transform/acl_ir/acl_allocator.cc:33 AllocFunc
[ERROR] DEVICE(2878,fffd76ffd0e0,python):2024-08-07-02:15:23.677.049 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_kernel_executor.cc:1007] LaunchKernel] Launch kernel failed, kernel full name: Default/Conv2DBackpropFilter-op3
Traceback (most recent call last):
  File "/tmp/code/davis202407221627027/main.py", line 93, in <module>
    loss, total_correct = train_step(images, targets, total_correct)
  File "/tmp/code/davis202407221627027/main.py", line 73, in train_step
    (loss, _, total_correct), grads = grad_fn(images, targets, total_correct)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 626, in after_grad
    return grad_(fn_, weights)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 130, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 602, in after_grad
    out = _pynative_executor.grad(fn, grad_, weights, grad_position, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1297, in grad
    return self._executor.grad(grad, obj, weights, grad_position, *args, *(kwargs.values()))
RuntimeError: Launch kernel failed, name:Default/Conv2DBackpropFilter-op3
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
```
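For orientation, the code around `main.py` line 73 in the traceback has roughly this shape (a simplified sketch, not the verbatim file; `forward_fn`, `params`, and the optimizer step are placeholders). The backward pass of the convolution layers is where `Conv2DBackpropFilter` is launched and fails to allocate:

```python
import mindspore as ms

def make_train_step(model, criterion, params):
    """Simplified sketch (not the repository's exact code) of a train step
    whose gradient call matches the unpacking pattern in the traceback."""

    def forward_fn(images, targets, total_correct):
        logits = model(images)
        loss = criterion(logits, targets)
        total_correct += (logits.argmax(-1) == targets).sum()
        return loss, logits, total_correct

    # has_aux=True: only `loss` is differentiated; the conv backward kernels
    # (Conv2DBackpropFilter) are launched inside this gradient call.
    grad_fn = ms.value_and_grad(forward_fn, None, params, has_aux=True)

    def train_step(images, targets, total_correct):
        (loss, _, total_correct), grads = grad_fn(images, targets, total_correct)
        # Placeholder: apply `grads` with whatever optimizer main.py uses.
        return loss, total_correct

    return train_step
```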
Erpim commented 1 month ago
Collaborator
MindTorch itself generally does not cause memory leaks. Some intermediate development versions of the MindSpore framework did have a few known points that could trigger device-memory leaks, and these have been fixed in newer releases. For now, the only way to verify is to replace the MindSpore version in your environment first.
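As a stopgap while staying on the current version (not a fix for the underlying leak), you could also try periodically releasing MindSpore's cached device memory in the training loop, roughly like this (placeholder loop and names):

```python
import mindspore as ms

def train_with_periodic_recycle(train_step, train_loader, every=100):
    """Placeholder loop: run the caller's `train_step` and periodically ask
    MindSpore to release cached device memory back to the pool."""
    for step, (images, targets) in enumerate(train_loader):
        train_step(images, targets)
        if step % every == 0:
            ms.ms_memory_recycle()
```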
davislee commented 1 month ago
Poster
I installed MindSpore 2.3.0rc2 via `pip install mindspore==2.3.0rc2`, and it still fails:

```
[ERROR] KERNEL(3578,fffd967fc0e0,python):2024-08-15-03:55:38.544.459 [mindspore/ccsrc/plugin/device/ascend/kernel/acl/acl_kernel_mod.cc:261] Launch] Kernel launch failed, msg: Malloc Mem From Mem Pool failed, size:9437216
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/transform/acl_ir/acl_allocator.cc:33 AllocFunc
[ERROR] DEVICE(3578,fffd967fc0e0,python):2024-08-15-03:55:38.544.566 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_kernel_executor.cc:951] LaunchKernel] Launch kernel failed, kernel full name: Default/Conv2DBackpropFilter-op0
[ERROR] KERNEL(3578,fffd967fc0e0,python):2024-08-15-03:55:38.728.811 [mindspore/ccsrc/plugin/device/ascend/kernel/acl/acl_kernel_mod.cc:261] Launch] Kernel launch failed, msg: Malloc Mem From Mem Pool failed, size:9437216
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/transform/acl_ir/acl_allocator.cc:33 AllocFunc
[ERROR] DEVICE(3578,fffd967fc0e0,python):2024-08-15-03:55:38.728.878 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_kernel_executor.cc:951] LaunchKernel] Launch kernel failed, kernel full name: Default/Conv2DBackpropFilter-op0
Training:  57%|███████████████████████████████████████████████████████▉ | 535/938 [02:24<01:48, 3.70it/s]
Traceback (most recent call last):
  File "/tmp/code/davis202407221627027/main.py", line 90, in <module>
    loss, total_correct = train_step(images, targets, total_correct)
  File "/tmp/code/davis202407221627027/main.py", line 73, in train_step
    (loss, _, total_correct), grads = grad_fn(images, targets, total_correct)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 626, in after_grad
    return grad_(fn_, weights)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 132, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 602, in after_grad
    out = _pynative_executor.grad(fn, grad_, weights, grad_position, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1337, in grad
    return self._executor.grad(grad, obj, weights, grad_position, *args, *(kwargs.values()))
RuntimeError: Launch kernel failed, name:Default/Conv2DBackpropFilter-op0
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/pynative/op_runner.cc:923 RunSingleOpGraph
[ERROR] PIPELINE(3578,ffffb3d7d440,python):2024-08-15-03:55:39.076.584 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2510] ClearResAtexit] Check exception before process exit: Launch kernel failed, name:Default/Conv2DBackpropFilter-op0
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/pynative/op_runner.cc:923 RunSingleOpGraph
```

With 2.3.0rc1 installed, it still fails as well:

```
Training:  57%|███████████████████████████████████████████████████████▉ | 535/938 [03:29<02:37, 2.55it/s]
Traceback (most recent call last):
  File "/tmp/code/davis202407221627027/main.py", line 90, in <module>
    loss, total_correct = train_step(images, targets, total_correct)
  File "/tmp/code/davis202407221627027/main.py", line 73, in train_step
    (loss, _, total_correct), grads = grad_fn(images, targets, total_correct)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 626, in after_grad
    return grad_(fn_, weights)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 131, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 602, in after_grad
    out = _pynative_executor.grad(fn, grad_, weights, grad_position, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1298, in grad
    return self._executor.grad(grad, obj, weights, grad_position, *args, *(kwargs.values()))
RuntimeError:
----------------------------------------------------
- Memory not enough:
----------------------------------------------------
Device(id:0) memory isn't enough and alloc failed, kernel name: kernel_graph_0_MemoryAllocActor, alloc size: 9437696B.
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:801 Run
```

**But after installing MindSpore 2.3.1, it runs normally.**
davislee closed this issue 3 weeks ago