#1101 Are the 启智 cluster CPU/GPU resources temporarily unavailable?

Closed
created 1 year ago by pwkQiZhi · 11 comments
pwkQiZhi commented 1 year ago
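<!-- To help us identify and resolve your issue more efficiently, please provide as much of the following information as possible -->
### Problem description
None of the training tasks I create will start. The task run summary shows: [FailedScheduling] all nodes are unavailable: 1 plugin NodeUnschedulable predicates failed node(s) were unschedulable, 10 node(s) selector fit queue failed, 26 node(s) resource fit failed.
### Environment (GPU/NPU)
### Cluster (启智)
### Task type (training)
### Task name
testsourceisavilable
### Log description or screenshot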
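For readers trying to interpret the FailedScheduling message quoted above, here is a minimal Python sketch that splits it into its individual rejection reasons. It assumes the text after the first colon is a comma-separated list of "count reason" clauses; that reading is inferred from this one message, not from scheduler documentation, and the name `summarize_failed_scheduling` is hypothetical.

```python
import re

# The message exactly as quoted in this issue.
MESSAGE = (
    "[FailedScheduling] all nodes are unavailable: "
    "1 plugin NodeUnschedulable predicates failed node(s) were unschedulable, "
    "10 node(s) selector fit queue failed, "
    "26 node(s) resource fit failed."
)


def summarize_failed_scheduling(message):
    """Map each rejection reason in the message to its leading count.

    Assumption: everything after the first colon is a comma-separated
    list of clauses, each starting with an integer count.
    """
    _, _, details = message.partition(":")
    reasons = {}
    for clause in details.split(","):
        clause = clause.strip().rstrip(".")
        match = re.match(r"(\d+)\s+(.*)", clause)
        if match:
            reasons[match.group(2)] = int(match.group(1))
    return reasons


if __name__ == "__main__":
    for reason, count in summarize_failed_scheduling(MESSAGE).items():
        print(count, reason)
```

Under this reading, the largest group is the 26 nodes rejected for "resource fit failed", which is consistent with the task waiting for resources to free up rather than the cluster being switched off.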
pwkQiZhi commented 1 year ago
Poster
The status also stays at WAITING. If the resources are unavailable, is there an announcement about it?
liuzx commented 1 year ago
Owner
They are available. Try a different image.
pwkQiZhi commented 1 year ago
Poster
OK, thanks. I'll try that first.
I'm hitting this problem too, and switching images didn't help.
liuzx commented 1 year ago
Owner
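Is the task stuck in the waiting state? What is the task name?
Task name: smoothl1_norm
This is really unbearable. When I create a task it shows first place, and then it just stays in waiting. It takes about four to five hours before it runs, and when I create a new task the same thing happens again.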
liuzx commented 1 year ago
Owner
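Showing first place when you create a task means the task is currently first in the queue. It still has to wait for a card to free up before training can start.
Friends, how did you solve this? I'm running into the same problem now, sob.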
liuzx commented 1 year ago
Owner
Please retry. If the problem persists, please post the task name.
liuzx closed this issue 1 year ago