Problem description
[Evicted] 2023/7/19 21:19:03
The node was low on resource: ephemeral-storage. Container task0 was using 161438164Ki, which exceeds its request of 0.
[ExceededGracePeriod] 2023/7/19 21:19:13
Container runtime did not kill the pod within specified grace period.
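Environment (GPU/NPU)
Spec: GPU: 1*V100, CPU: 8, GPU memory: 32GB, RAM: 50GB
Cluster (OpenI Qizhi / Intelligent Computing)
Intelligent Computing Center
Task type (debug/training/inference)
Training
Task name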
train202307192069377
Log description or screenshot of the problem
[2023-07-19 13:18:56 ViT-B/16](main.py 199): INFO Train: [0/30][300/30054] eta 3:43:13 lr 0.000007959 time 0.4235 (0.4501) tot_loss 5.2645 (4.6817) mem 16190MB
failed
[Evicted] 2023/7/19 21:19:03
The node was low on resource: ephemeral-storage. Container task0 was using 161438164Ki, which exceeds its request of 0.
[ExceededGracePeriod] 2023/7/19 21:19:13
Container runtime did not kill the pod within specified grace period.
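Expected solution or suggestion
I would like the system to report the specific problem. A bare "failed" like this leaves the user with no idea what went wrong.
The storage of the Intelligent Computing Center that this task runs on has been used up, and it needs to be cleaned up on the backend. Beyond that, please keep an eye on how much storage the task occupies: the log for this task shows usage of 161438164Ki, and if storage inside the image takes up too much space, the task will also be forcibly terminated.
How much does this displayed figure amount to? Or is it random?
I usually keep only 5 trained weight files, and I download all of them to my local machine. I do not keep tasks around for long.
The node is reporting a storage error. Technical staff are locating the problem; please wait for the fix.
Resolved.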
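For scale, the `161438164Ki` figure from the eviction message can be converted to more familiar units. The following is a minimal sketch, assuming the `Ki` suffix is the usual Kubernetes binary suffix (1 Ki = 1024 bytes); the helper name `quantity_to_bytes` is hypothetical and only handles plain integer quantities with a binary suffix.

```python
import re

# Binary suffixes as used in Kubernetes resource quantities (powers of 1024).
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '161438164Ki' to a number of bytes.

    Hypothetical helper for illustration: it accepts only an integer
    followed by an optional Ki/Mi/Gi/Ti suffix.
    """
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)?", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.groups()
    # A missing suffix means the value is already in bytes.
    return int(number) * _BINARY_SUFFIXES.get(suffix, 1)


used_bytes = quantity_to_bytes("161438164Ki")
used_gib = used_bytes / 1024**3
print(f"{used_bytes} bytes = {used_gib:.2f} GiB")
# prints "165312679936 bytes = 153.96 GiB"
```

Under that assumption, the container was reported as using roughly 154 GiB of ephemeral storage at the time of eviction.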