#2666 Unify the dataset, training script, and training output storage paths for GPU training tasks on the OpenI and C2Net clusters

Closed
created 1 year ago by tanglj · 11 comments
tanglj commented 1 year ago
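Currently, GPU training tasks on the OpenI (启智) cluster and the C2Net (智算网络) cluster use the following storage paths for datasets, training scripts, and training output:
1. OpenI cluster GPU training tasks: the training script is stored in /code, the dataset in /dataset, and training output should be written to /model so that it can be downloaded later.
2. C2Net cluster GPU training tasks: the training script is stored in /tmp/code, the dataset in /tmp/dataset, and training output should be written to /tmp/output so that it can be downloaded later.
Request: unify the two so that the training script is stored in /code, the dataset in /dataset, and training output is written to /model for later download.
tanglj changed title from "Unify the dataset, training script, and training output storage paths for GPU training tasks on the OpenI cluster" to "Unify the dataset, training script, and training output storage paths for GPU training tasks on the OpenI and C2Net clusters" 1 year ago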
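For reference, the two layouts described above can be written down as a small lookup table. This is only an illustrative sketch: the paths come from this issue, but the `TaskPaths` class, the `paths_for` helper, and the cluster keys `"openi"` and `"c2net"` are hypothetical names chosen for the example, not part of any platform API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskPaths:
    code: str     # where the training script is stored
    dataset: str  # where the dataset is stored
    output: str   # where training output must be written to be downloadable


# Path layouts as described in this issue; the dictionary keys are hypothetical.
CLUSTER_PATHS = {
    "openi": TaskPaths(code="/code", dataset="/dataset", output="/model"),
    "c2net": TaskPaths(code="/tmp/code", dataset="/tmp/dataset", output="/tmp/output"),
}


def paths_for(cluster: str) -> TaskPaths:
    """Return the path layout for a cluster key, or raise ValueError."""
    try:
        return CLUSTER_PATHS[cluster]
    except KeyError:
        raise ValueError(f"unknown cluster: {cluster!r}") from None
```

A training script that hard-codes `/model` works only on the first layout, which is the portability problem this issue asks to remove.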
tanglj added the enhancement label 1 year ago
tanglj added this to the V20220815 milestone 1 year ago
lewis was assigned by tanglj 1 year ago
lewis commented 1 year ago
Owner
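On the C2Net cluster, the platform has no permission to write to directories such as /code, /dataset, and /model, so code, datasets, and so on cannot be placed there. The OpenI cluster has been in production for a long time, so changing its code and dataset paths is not recommended. The paths therefore cannot be unified.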
lewis added the reviewed label 1 year ago
lewis added the need review label 1 year ago
lewis removed the reviewed label 1 year ago
tanglj commented 1 year ago
Poster
Keep the status quo for now; this task is on hold.
tanglj removed this from the V20220815 milestone 1 year ago
tanglj commented 11 months ago
Poster
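Approach: expose the paths through platform parameters, set as environment variables. The first version will deliver the design proposal.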
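A minimal sketch of what the environment-variable approach could look like from the training script's side. The variable names `CODE_PATH`, `DATASET_PATH`, and `OUTPUT_PATH` are assumptions made for this example; the issue does not state which names the platform would export. The defaults are the OpenI cluster paths from the original request.

```python
import os

# Hypothetical variable names; the defaults are the OpenI cluster paths
# listed in this issue.
DEFAULTS = {
    "CODE_PATH": "/code",
    "DATASET_PATH": "/dataset",
    "OUTPUT_PATH": "/model",
}


def resolve_paths(environ=None):
    """Read each path from the environment, falling back to its default."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}
```

With this pattern the script never hard-codes a directory: on a cluster that cannot write to /code, the platform would set the variables to /tmp/code and so on, and the same script runs unchanged.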
tanglj added this to the V20230531 milestone 11 months ago
tanglj removed the need review label 11 months ago
tanglj removed the enhancement label 11 months ago
tanglj added the feature label 11 months ago
lewis was unassigned by tanglj 11 months ago
liuzx was assigned by tanglj 11 months ago
liuzx referenced this issue from a commit 11 months ago
zhoupzh referenced this issue from a commit 11 months ago
wangj modified the milestone from V20230531 to V20230628 10 months ago
liuzx commented 10 months ago
Collaborator
Unify how parameters are used across all resource types.
tanglj modified the milestone from V20230628 to V20230718 10 months ago
liuzx added this to the path_unite_all branch 9 months ago
liuzx commented 9 months ago
Collaborator
Design: https://openi.pcl.ac.cn/OpenIOSSG/AiForge-Doc/src/branch/master/%e8%ae%be%e8%ae%a1%e6%96%87%e6%a1%a3/%e4%ba%91%e8%84%91%e8%b6%85%e5%8f%82%e6%95%b0%e7%bb%9f%e4%b8%80%e6%96%b9%e6%a1%88/%e4%ba%91%e8%84%91GPU%e8%b5%84%e6%ba%90%e5%8f%82%e6%95%b0%e6%95%b4%e5%90%88.md
SDK repository: https://openi.pcl.ac.cn/liuzx/openi-pypi-test/src/branch/liuzx
Test branch: path_unite_all
Test submission document: https://openi.pcl.ac.cn/OpenIOSSG/AiForge-Doc/src/branch/master/%e8%ae%be%e8%ae%a1%e6%96%87%e6%a1%a3/PythonSDK/%e4%ba%91%e8%84%91%e8%b7%af%e5%be%84%e7%bb%9f%e4%b8%80%e6%8f%90%e6%b5%8b%e6%96%87%e6%a1%a3.md
liuzx added the test label 9 months ago
wangj commented 9 months ago
Owner
Moved to the next milestone.
wangj modified the milestone from V20230718 to V20230808 9 months ago
wangj commented 9 months ago
Owner
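Some environments cannot install the openi package online, so parameters cannot be obtained through the API. What is the intended scope of the requirement? @tanglj
Installation fails for two reasons: some sub-centers have no external network access, and some images ship a Python version that is too old.
No external network access: offline installation could solve this.
Python version too old in the image: the sub-center needs to resolve this.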
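The two failure modes above (Python too old, package not installed) can be detected at the start of a task before any SDK call is attempted. The sketch below is illustrative only: the minimum version `(3, 7)` is a placeholder, since the issue does not state which Python version the openi package requires, and `sdk_usable` is a hypothetical helper name.

```python
import importlib.util
import sys

# Placeholder value; the real minimum Python version is not stated in this issue.
MIN_PYTHON = (3, 7)


def sdk_usable(package="openi", min_python=MIN_PYTHON, version_info=None):
    """Return (ok, reason) describing whether the SDK can be used here."""
    version = tuple((version_info or sys.version_info)[:2])
    if version < min_python:
        return False, "python too old"
    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec(package) is None:
        return False, "package not installed"
    return True, "ok"
```

A script could call `sdk_usable()` first and fall back to fixed paths or environment variables when it returns `False`, rather than failing on import.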
tanglj was assigned by wangj 9 months ago
wangj added the need review label 9 months ago
wangj removed the test label 8 months ago
wangj modified the milestone from V20230808 to V20230828 8 months ago
wangj modified the milestone from V20230828 to V20230912 8 months ago
wangj modified the milestone from V20230912 to V20231018 7 months ago
chenyifan01 was assigned by tanglj 6 months ago
wangj modified the milestone from V20231018 to V20231102 6 months ago
chenzh was assigned by tanglj 6 months ago
wangj modified the milestone from V20231102 to V20231120 5 months ago
liuzx commented 5 months ago
Collaborator
Test submission document: https://openi.pcl.ac.cn/OpenIOSSG/AiForge-Doc/src/branch/master/%e8%ae%be%e8%ae%a1%e6%96%87%e6%a1%a3/%e4%ba%91%e8%84%91%e8%b6%85%e5%8f%82%e6%95%b0%e7%bb%9f%e4%b8%80%e6%96%b9%e6%a1%88/sdk%e7%bb%9f%e4%b8%80%e8%b7%af%e5%be%84%e6%8f%90%e6%b5%8b%e6%96%87%e6%a1%a3.md
Test branch: path_unite_all
SDK branch: cloudbrain
liuzx added the test label 5 months ago
wangj was assigned by liuzx 5 months ago
wangj commented 5 months ago
Owner
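Currently, SDK pre-installation does not cover all types of Cloudbrain (云脑) tasks.
1. All debug tasks require the user to install the SDK manually; a GPU debug task needs the SDK reinstalled each time it is started again for debugging.
2. Training tasks install the SDK automatically (to be verified).
Issues found: #4913, #4918, #4919, #4920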
tanglj modified the milestone from V20231120 to V20231211 5 months ago
tanglj modified the milestone from V20231211 to V20240109 5 months ago
tanglj modified the milestone from V20240109 to V20240129 4 months ago
tanglj modified the milestone from V20240129 to V20240109 3 months ago
tanglj modified the milestone from V20240109 to V20240116 3 months ago
wangj commented 1 month ago
Owner
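After discussion, the following issues will be left as they are for now and have been marked wontfix.
#5165 [Example code] The continue-training example code has not been updated
#5155 Problem with the online inference example code
#4920 [c2net library] For multi-node C2Net NPU training tasks, results uploaded by the upload_openi() method cannot be attributed to a specific node
#4919 [Unified paths] The C2Net NPU training task example code uploads two copies of the training output
#5116 [Unified paths] After "Don't remind me again" is checked in the pop-up notification, the pop-up still appears once the browser cache is cleared
#5204 [c2net library] If no dataset file is selected when creating a debug task, calling the c2net library's prepare method inside the container creates a marker file under the dataset directory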
wangj commented 1 month ago
Owner
Released to production. Outstanding issues: #5203, #5124, #5169, #5210
wangj closed this issue 1 month ago
No Milestone
4 Participants