#1096 The node was low on resource: ephemeral-storage. Container task0 was using 161438164Ki, which exceeds its request of 0.

Closed
created 9 months ago by trainer · 5 comments
trainer commented 9 months ago
<!-- To help us identify and resolve your problem more effectively, please fill in as much of the following information as possible -->

### Problem description

[Evicted] 2023/7/19 21:19:03 The node was low on resource: ephemeral-storage. Container task0 was using 161438164Ki, which exceeds its request of 0.
[ExceededGracePeriod] 2023/7/19 21:19:13 Container runtime did not kill the pod within specified grace period.

### Environment (GPU/NPU)

Spec: GPU: 1*V100, CPU: 8, GPU memory: 32GB, RAM: 50GB

### Cluster (启智 OpenI / 智算 intelligent computing)

Intelligent computing center (智算中心)

### Task type (debug/training/inference)

Training

### Task name

train202307192069377

### Log description or screenshots

[2023-07-19 13:18:56 ViT-B/16](main.py 199): INFO Train: [0/30][300/30054] eta 3:43:13 lr 0.000007959 time 0.4235 (0.4501) tot_loss 5.2645 (4.6817) mem 16190MB
failed
[Evicted] 2023/7/19 21:19:03 The node was low on resource: ephemeral-storage. Container task0 was using 161438164Ki, which exceeds its request of 0.
[ExceededGracePeriod] 2023/7/19 21:19:13 Container runtime did not kill the pod within specified grace period.

### Expected solution or suggestion

I hope the system can report the specific problem. A bare "failed" like this leaves the user with no idea what went wrong.
trainer started working 9 months ago
liuzx commented 9 months ago
Collaborator
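The storage at the intelligent computing center that this task ran on has been used up, and the backend needs to clean it up. Beyond that, please keep an eye on how much storage your task occupies: the log shows this task was using 161438164Ki, and if too much storage is used inside the image, the task will also be forcibly terminated.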
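For anyone who wants to check storage use from inside a task before it is evicted, here is a minimal Python sketch. It sums apparent file sizes under a directory and lists the largest subdirectories. The function names `dir_usage_bytes` and `largest_subdirs` are made up for illustration, and which paths count toward ephemeral storage on this platform is not stated in this thread, so the directory you pass in is your own assumption. The total may also differ from what the node itself accounts for.

```python
import os


def dir_usage_bytes(root):
    """Sum the apparent sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; ignore it.
                pass
    return total


def largest_subdirs(root, top=5):
    """Return up to `top` (size_in_bytes, path) pairs for the immediate
    subdirectories of root, largest first."""
    sizes = []
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((dir_usage_bytes(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]
```

Calling `largest_subdirs` on the directory where the training script writes checkpoints, logs, or caches should show which of them is growing.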
trainer commented 9 months ago
Poster
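> The storage at the intelligent computing center that this task ran on has been used up, and the backend needs to clean it up. Beyond that, please keep an eye on how much storage your task occupies: the log shows this task was using 161438164Ki, and if too much storage is used inside the image, the task will also be forcibly terminated.

What does this figure actually mean? Or is it random?
I usually save only 5 sets of trained weights, and I download all of them to my local machine. I don't keep tasks around for long.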
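On what the figure means: the `Ki` suffix in the eviction message is the Kubernetes quantity suffix for kibibytes (1 Ki = 1024 bytes), so the number is a measured size, not a random value. A short Python sketch of the conversion follows; the helper name `ki_to_gib` is made up for illustration.

```python
def ki_to_gib(ki):
    """Convert kibibytes (Ki, 1024 bytes each) to gibibytes (GiB, 2**30 bytes each)."""
    return ki * 1024 / 2**30


reported_ki = 161438164  # value taken from the eviction message
print(f"{ki_to_gib(reported_ki):.2f} GiB")  # prints "153.96 GiB"
```

So the container was reported as using roughly 154 GiB of ephemeral storage when it was evicted. The thread does not say which files accounted for that usage.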
liuzx commented 9 months ago
Collaborator
The node storage is reporting an error. Technical staff are locating the problem; please wait for the fix.
liuzx added the invalid label 9 months ago
liuzx commented 6 months ago
Collaborator
Resolved.
liuzx closed this issue 6 months ago