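TaskSetController provides a flexible task execution model for handling the varied business scenarios that arise in ML/DL training.
We started out as users of frameworkcontroller. Our own business needs later forced us to make some customizations, which is how this project was born.
Many thanks to frameworkcontroller!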
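Users submit tasks to the API server. TaskSetController learns of submitted tasks from the API server and drives each task toward its desired state: waiting, running, or completed (succeeded or failed).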
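The state progression described above can be pictured as a small state machine. The following is a minimal sketch, not the controller's actual code: the `Phase` type, its constant names, and the `allowed` transition table are hypothetical, and the real controller may permit other transitions (for example, when a retry sends a task back to waiting).

```go
package main

import "fmt"

// Phase is a hypothetical name for the task states listed in this README.
type Phase string

const (
	Waiting   Phase = "Waiting"
	Running   Phase = "Running"
	Succeeded Phase = "Succeeded"
	Failed    Phase = "Failed"
)

// allowed holds the transitions this sketch ASSUMES. Succeeded and Failed
// are treated as terminal, so they have no outgoing entries.
var allowed = map[Phase][]Phase{
	Waiting: {Running},
	Running: {Succeeded, Failed},
}

// CanTransition reports whether the sketch permits moving from one phase to another.
func CanTransition(from, to Phase) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Waiting, Running))   // waiting task starts running
	fmt.Println(CanTransition(Succeeded, Running)) // terminal state, no way back
}
```

Keeping the transitions in a table rather than in scattered conditionals makes it easy to see at a glance which state changes a controller loop would accept.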
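TaskSetController does not assign Pods to specific compute nodes; users are free to choose the underlying scheduler. The default-scheduler and kube-batch shown in the diagram above are two of the options, and the community offers other schedulers as well.
Components such as fpga-device-plugin and gpu-device-plugin are responsible for allocating special compute resources to tasks.
Take the ps-worker scenario common in TensorFlow as an example: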
apiVersion: octopus.openi.pcl.cn/v1alpha1
kind: TaskSet
metadata:
  name: tensorflowdemo
spec:
  retryPolicy:
    retry: false
    maxRetryCount: 1
  roles:
    - name: ps
      replicas: 1
      completionPolicy:
        maxFailed: 1
        minSucceeded: 1
      retryPolicy:
        retry: false
        maxRetryCount: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox
              command: ["sh","-c","sleep 100;exit 0"]
    - name: worker
      replicas: 3
      eventPolicy:
        - event: RoleSucceeded
          action: TaskSetSucceeded
      completionPolicy:
        maxFailed: 2
        minSucceeded: 2
      retryPolicy:
        retry: true
        maxRetryCount: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox
              command: ["sh","-c","sleep 30;exit 0"]
              resources:
                limits:
                  nvidia.com/gpu: 1
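This README does not spell out how completionPolicy is evaluated. As a rough sketch under an explicit assumption, the Go snippet below treats a role as failed once its failed replica count reaches maxFailed, and as succeeded once its succeeded replica count reaches minSucceeded. The `CompletionPolicy` struct, the `Evaluate` function, and the `RoleFailed` and `RoleRunning` names are hypothetical; only `RoleSucceeded` appears in the manifest above.

```go
package main

import "fmt"

// CompletionPolicy mirrors the two completionPolicy fields in the manifest.
type CompletionPolicy struct {
	MaxFailed    int
	MinSucceeded int
}

// RoleResult is a hypothetical outcome type for one role.
type RoleResult string

const (
	RoleRunning   RoleResult = "RoleRunning"   // hypothetical name
	RoleSucceeded RoleResult = "RoleSucceeded" // event name used in the manifest
	RoleFailed    RoleResult = "RoleFailed"    // hypothetical name
)

// Evaluate applies the ASSUMED semantics: failure is checked first, then
// success, and otherwise the role is considered still in progress.
func Evaluate(p CompletionPolicy, succeeded, failed int) RoleResult {
	if failed >= p.MaxFailed {
		return RoleFailed
	}
	if succeeded >= p.MinSucceeded {
		return RoleSucceeded
	}
	return RoleRunning
}

func main() {
	// Values taken from the worker role: maxFailed 2, minSucceeded 2.
	worker := CompletionPolicy{MaxFailed: 2, MinSucceeded: 2}
	fmt.Println(Evaluate(worker, 2, 1))
	fmt.Println(Evaluate(worker, 1, 2))
	fmt.Println(Evaluate(worker, 1, 1))
}
```

Under this reading, the worker role with 3 replicas would tolerate one failed replica as long as the other two succeed, at which point the eventPolicy would turn RoleSucceeded into TaskSetSucceeded.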
Run kubectl create -f taskset.yaml to submit the task to Kubernetes.
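@Deprecated This repository is deprecated; please move to https://git.openi.org.cn/OpenI/octopus.
OPENI-OCTOPUS is a cluster management and resource scheduling system that supports running AI jobs (such as deep learning jobs) on GPU clusters. The platform provides a set of interfaces that support mainstream deep learning frameworks.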