fangzehua e361e2b943 | 1 year ago | |
---|---|---|
.gitee | 4 years ago | |
.github | 4 years ago | |
cmd/manager | 2 years ago | |
config | 2 years ago | |
deploy/v1 | 2 years ago | |
hack | 2 years ago | |
pkg | 1 year ago | |
.dockerignore | 2 years ago | |
.gitignore | 2 years ago | |
Dockerfile | 2 years ago | |
LICENSE | 2 years ago | |
Makefile | 2 years ago | |
OWNERS | 1 year ago | |
PROJECT | 2 years ago | |
README.md | 1 year ago | |
go.mod | 2 years ago | |
go.sum | 2 years ago |
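MindSpore Operator is a plugin for running MindSpore distributed training on Kubernetes. Its CRD (Custom Resource Definition) defines three roles: Scheduler, PS, and Worker. Users only need to configure a YAML file to run distributed training.
ms-operator can be installed in any of the following ways: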
kubectl apply -f deploy/v1/ms-operator.yaml
After installation:
Run `kubectl get pods --all-namespaces` to see the deployment in the ms-operator-system namespace.
Run `kubectl describe pod ms-operator-controller-manager-xxx-xxx -n ms-operator-system` to view the pod's details.
make deploy IMG=swr.cn-south-1.myhuaweicloud.com/mindspore/ms-operator:latest
make run
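For the in-cluster install paths, the manual check with kubectl can also be scripted. The following is a minimal sketch, not part of ms-operator: it assumes `kubectl` is on the PATH and that `kubectl get pods --no-headers` prints the usual NAME, READY, STATUS columns first. The helper name `parsePods` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// parsePods splits the output of `kubectl get pods --no-headers` into the
// names of pods whose STATUS column is "Running" and all other pods.
// Assumed column order: NAME READY STATUS RESTARTS AGE.
func parsePods(output string) (running, other []string) {
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue // blank or malformed line
		}
		if fields[2] == "Running" {
			running = append(running, fields[0])
		} else {
			other = append(other, fields[0])
		}
	}
	return running, other
}

func main() {
	// The namespace name comes from the README; everything else is illustrative.
	out, err := exec.Command("kubectl", "get", "pods",
		"-n", "ms-operator-system", "--no-headers").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "kubectl failed:", err)
		return
	}
	running, other := parsePods(string(out))
	switch {
	case len(running)+len(other) == 0:
		fmt.Println("no pods found in ms-operator-system")
	case len(other) > 0:
		fmt.Println("pods not running:", strings.Join(other, ", "))
	default:
		fmt.Println("running pods:", strings.Join(running, ", "))
	}
}
```

This only inspects the STATUS column; `kubectl describe pod` remains the place to look when a pod is stuck.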
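ms-operator currently supports ordinary single-Worker training, single-Worker training in PS mode, and launching the Scheduler and Workers for auto-parallel training (for example, data parallelism or model parallelism).
Sample configurations are in `config/samples/`.
The following example launches a Scheduler and Workers for data-parallel training; the dataset and network script must be prepared in advance: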
kubectl apply -f config/samples/ms_wide_deep_dataparallel.yaml
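Run `kubectl get all -o wide` to see the Scheduler and Workers started in the cluster, along with the Service that corresponds to the Scheduler.
`pkg/apis/v1/msjob_types.go` contains the CRD definition of MSJob.
`pkg/controllers/v1/msjob_controller.go` contains the core logic of the MSJob controller.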
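To give a feel for how the three roles (Scheduler, PS, Worker) might be modelled in Go, here is a small illustrative sketch. It is not the content of `pkg/apis/v1/msjob_types.go`: the type name `ReplicaType`, the constants, and `validateReplicas` are all assumptions made for this example, and the real CRD types may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// ReplicaType names a role in a training job.
// Hypothetical type for illustration only.
type ReplicaType string

const (
	ReplicaTypeScheduler ReplicaType = "Scheduler"
	ReplicaTypePS        ReplicaType = "PS"
	ReplicaTypeWorker    ReplicaType = "Worker"
)

// validateReplicas checks a role-to-replica-count map: it must be non-empty,
// use only the three known roles, and request at least one replica per role.
func validateReplicas(replicas map[ReplicaType]int) error {
	if len(replicas) == 0 {
		return fmt.Errorf("no roles configured")
	}
	// Sort the role names so that the first reported error is deterministic.
	names := make([]string, 0, len(replicas))
	for rt := range replicas {
		names = append(names, string(rt))
	}
	sort.Strings(names)
	for _, name := range names {
		rt := ReplicaType(name)
		switch rt {
		case ReplicaTypeScheduler, ReplicaTypePS, ReplicaTypeWorker:
		default:
			return fmt.Errorf("unknown role %q", name)
		}
		if n := replicas[rt]; n < 1 {
			return fmt.Errorf("role %q needs at least one replica, got %d", name, n)
		}
	}
	return nil
}

func main() {
	// A data-parallel style layout: one Scheduler plus several Workers.
	job := map[ReplicaType]int{ReplicaTypeScheduler: 1, ReplicaTypeWorker: 4}
	if err := validateReplicas(job); err != nil {
		fmt.Println("invalid job:", err)
		return
	}
	fmt.Println("job layout is valid")
}
```

Using a string-based role type keeps the map keys readable and mirrors how role names would appear in a YAML file; whether the operator itself validates jobs this way is not stated in the README.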
make docker-build IMG=swr.cn-south-1.myhuaweicloud.com/mindspore/ms-operator:latest
docker push swr.cn-south-1.myhuaweicloud.com/mindspore/ms-operator:latest