#59 add dp_main

Merged
Erpim merged 15 commits from frelam/MSAdapterModelZoo:master0219 into master 1 month ago
frelam changed title from [WIP] add dp_main to add dp_main 2 months ago
frelam reviewed 1 month ago
@@ -197,2 +213,4 @@
# device = torch.device("cpu")
# define loss function (criterion), optimizer, and learning rate scheduler

if not torch.cuda.is_available() and not torch.backends.mps.is_available():
frelam commented 1 month ago
This code largely duplicates the commented-out block above, but the comments are kept for now to make comparison easier.
zoulq commented 1 month ago
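Compared with the pre-change code, or with the torch code? In principle the torch source is available, so there is no need to provide it through comments.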
frelam commented 1 month ago
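Compared with the pre-change code, which is close to the torch version. I plan to commit two files, one with the torch source and one with the final code, so that comparing them shows what changed.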
frelam reviewed 1 month ago
@@ -126,2 +125,3 @@
# main_worker(args.gpu, ngpus_per_node, args)
main_worker(args.gpu, 1, args)

args.distributed = args.world_size > 1 or args.multiprocessing_distributed
frelam commented 1 month ago
This code largely duplicates the commented-out block above, but the comments are kept for now to make comparison easier.
frelam reviewed 1 month ago
@@ -475,3 +518,1 @@
# dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
# self.sum, self.count = total.tolist()
# self.avg = self.sum / self.count
def all_reduce(self):
frelam commented 1 month ago
allreduce is now supported, so this block has been uncommented.
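For context, the re-enabled method sums each rank's `sum` and `count` and then recomputes the average. Below is a minimal pure-Python sketch of that pattern, not the code from this PR. The injected `reduce_sum` callable and the `make_fake_reduce` helper are hypothetical stand-ins for the `dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)` call in the diff, used only so the example runs without a distributed backend.

```python
from typing import Callable, List

# Hypothetical stand-in for dist.all_reduce(..., dist.ReduceOp.SUM):
# takes this rank's [sum, count] and returns the element-wise sum over all ranks.
ReduceSum = Callable[[List[float]], List[float]]


class AverageMeter:
    """Tracks a running sum and count, plus the resulting average."""

    def __init__(self) -> None:
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val: float, n: int = 1) -> None:
        # Accumulate a value observed n times and refresh the local average.
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def all_reduce(self, reduce_sum: ReduceSum) -> None:
        # Combine (sum, count) across ranks, then recompute the global average.
        total = reduce_sum([self.sum, float(self.count)])
        self.sum, self.count = total[0], int(total[1])
        self.avg = self.sum / self.count


def make_fake_reduce(other_ranks: List[List[float]]) -> ReduceSum:
    """Simulate a SUM all-reduce by adding fixed contributions from other ranks."""

    def reduce_sum(local: List[float]) -> List[float]:
        result = list(local)
        for contrib in other_ranks:
            result = [a + b for a, b in zip(result, contrib)]
        return result

    return reduce_sum


meter = AverageMeter()
meter.update(2.0, n=4)                            # local rank: sum=8, count=4
meter.all_reduce(make_fake_reduce([[4.0, 4.0]]))  # one other rank: sum=4, count=4
print(meter.avg)                                  # global average over 8 samples
```

In the actual script the reduction would be the `dist.all_reduce` call shown in the diff, not an injected function.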
Erpim commented 1 month ago
Collaborator
Split this into two files, one for torch and one for the adapter, to make comparison easier.
zoulq reviewed 1 month ago
@@ -10,3 +10,3 @@
cd ./device
echo "start data parallel training"
mpirun -n 4 --output-filename log_output --merge-stderr-to-stdout python ./data_dp_main.py ./ImageNet2012 -a resnet50 -b 64 -dist > train.log 2>&1 &
# mpirun -n 4 --output-filename log_output --merge-stderr-to-stdout python ./data_dp_main.py ./ImageNet2012 -a resnet50 -b 64 -dist > train.log 2>&1 &
zoulq commented 1 month ago
What is the reason for keeping the mpirun line?
frelam commented 1 month ago
I originally meant to keep it as a record of the previous command. It is not required and can be removed; msrun is enough.
zoulq commented 1 month ago
Collaborator
After adaptation, has the amount of adaptation work increased or decreased relative to the pre-migration source?
frelam commented 1 month ago
Poster
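> After adaptation, has the amount of adaptation work increased or decreased relative to the pre-migration source?
It decreased. First, the commented-out allreduce code (a bit over 10 lines) has been re-enabled. Second, the MindSpore network-construction interface is no longer used: the code that had been commented out is re-enabled, and only parts of the original code are modified.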
frelam changed title from add dp_main to [WIP]add dp_main 1 month ago
frelam changed title from [WIP]add dp_main to add dp_main 1 month ago
frelam commented 1 month ago
Poster
> Split this into two files, one for torch and one for the adapter, to make comparison easier.
done
Erpim merged commit 07974b7af3 into master 1 month ago
The pull request has been merged as 07974b7af3.