#826 RoCE network: launching a distributed job fails with ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error No such device

Open
created 5 months ago by liuxingbo12138 · 0 comments
ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error No such device

Pod information:

```
root@10-101-26-99:~# kubectl get pods -A
NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE
59ae40a6106f4ab1a05239d62961a06e  i0b52fc4c8b24636b143424c0b6087f9-task0-0  1/1  Running  0  125m
59ae40a6106f4ab1a05239d62961a06e  i0b52fc4c8b24636b143424c0b6087f9-task0-1  1/1  Running  0  125m
default  gpu-deployment-666bf7b55c-j2sck  1/1  Running  25  28d
default  gpu-label-6cmdn  1/1  Running  1  2d1h
default  gpu-label-c88c9  1/1  Running  0  2d1h
default  gpu-label-p8c2t  1/1  Running  1  2d1h
default  gpu-label-xh8pz  1/1  Running  1  2d1h
default  nfs-client-provisioner-75d4697d86-zh5s6  1/1  Running  14  29d
default  octopus-adminportal-6679778699-5k8tk  1/1  Running  1  2d1h
default  octopus-adminserver-85db6c6bfc-7fj5w  1/1  Running  1  2d1h
default  octopus-ambassador-54499877f8-j848r  1/1  Running  1  2d1h
default  octopus-apidoc-8556bf9785-q8dhj  1/1  Running  1  2d1h
default  octopus-baseserver-58f97ccfd4-qxnhs  1/1  Running  6  2d
default  octopus-controller-5c754f6946-bm4wx  1/1  Running  1  2d1h
default  octopus-eventrouter-7c4cb68495-p9kmc  1/1  Running  6  2d
default  octopus-grafana-7664d5bd7c-dp8tm  1/1  Running  1  2d1h
default  octopus-influxdb-0  1/1  Running  1  2d1h
default  octopus-logger-filebeat-mrzm8  1/1  Running  1  2d1h
default  octopus-logger-filebeat-nv8hj  1/1  Running  1  2d1h
default  octopus-logger-filebeat-zrlrx  1/1  Running  1  2d1h
default  octopus-logger-httpd-7bd674b56d-pnt9m  1/1  Running  1  2d1h
default  octopus-logger-logstash-f46cccbcf-mxzx6  1/1  Running  1  2d1h
default  octopus-minio-bc94fb54-ckd2j  1/1  Running  1  2d1h
default  octopus-mysql-0  1/1  Running  1  2d1h
default  octopus-nginx-ingress-controller-885dbc9f5-btd4t  1/1  Running  1  2d1h
default  octopus-nginx-ingress-controller-default-backend-596979849zrldx  1/1  Running  1  2d1h
default  octopus-node-agent-xpqlt  1/1  Running  1  2d1h
default  octopus-node-agent-xztf8  1/1  Running  1  2d1h
default  octopus-node-agent-z8kjg  1/1  Running  1  2d1h
default  octopus-openaiportal-6d6cf7f4c8-clrmz  1/1  Running  1  2d1h
default  octopus-openaiserver-cf68ffc9f-78d6j  1/1  Running  1  2d1h
default  octopus-prometheus-597fd55cc9-2pt9d  1/1  Running  1  2d1h
default  octopus-prometheus-gpu-exporter-c4xnn  2/2  Running  2  2d1h
default  octopus-prometheus-gpu-exporter-hssjc  2/2  Running  2  2d1h
default  octopus-prometheus-gpu-exporter-kjdtk  2/2  Running  2  2d1h
default  octopus-prometheus-node-exporter-hxw6n  0/1  Pending  0  2d1h
default  octopus-prometheus-node-exporter-m8jml  0/1  Pending  0  2d1h
default  octopus-prometheus-node-exporter-tksf4  0/1  Pending  0  2d1h
default  octopus-prometheus-node-exporter-tswsc  0/1  Pending  0  2d1h
default  octopus-redis-master-0  1/1  Running  1  2d1h
default  octopus-scheduler-86cc9cf766-f2kq5  1/1  Running  1  2d1h
default  octopus-sftpgo-6664578f76-xmcv4  1/1  Running  2  2d1h
fluid-system  csi-nodeplugin-fluid-fpscn  2/2  Running  2  2d1h
fluid-system  csi-nodeplugin-fluid-hwqq8  2/2  Running  0  2d1h
fluid-system  csi-nodeplugin-fluid-qfmqd  2/2  Running  2  2d1h
fluid-system  csi-nodeplugin-fluid-rlnqq  2/2  Running  2  2d1h
fluid-system  dataset-controller-75856556d8-mh7bg  1/1  Running  1  2d1h
fluid-system  fluid-crds-upgrade-0.8.0-e730b87-hjq4f  0/1  ImagePullBackOff  0  2d1h
fluid-system  fluid-webhook-75759d7d6b-bz4x2  1/1  Running  1  2d1h
fluid-system  fluidapp-controller-75b9586d58-9cg64  1/1  Running  1  2d1h
kube-system  calico-kube-controllers-577f77cb5c-s56s4  1/1  Running  5  29d
kube-system  calico-node-bnr5s  1/1  Running  6  29d
kube-system  calico-node-rqpq8  1/1  Running  8  29d
kube-system  calico-node-tnhnr  1/1  Running  5  29d
kube-system  calico-node-zvv8j  1/1  Running  7  29d
kube-system  coredns-7f89b7bc75-88zrm  1/1  Running  5  30d
kube-system  coredns-7f89b7bc75-bmr9t  1/1  Running  5  30d
kube-system  etcd-10-101-26-99  1/1  Running  5  30d
kube-system  kube-apiserver-10-101-26-99  1/1  Running  5  29d
kube-system  kube-controller-manager-10-101-26-99  1/1  Running  6  30d
kube-system  kube-proxy-9z2mp  1/1  Running  8  30d
kube-system  kube-proxy-crjxm  1/1  Running  7  30d
kube-system  kube-proxy-nlhr7  1/1  Running  5  30d
kube-system  kube-proxy-z4s7h  1/1  Running  5  30d
kube-system  kube-scheduler-10-101-26-99  1/1  Running  6  30d
kube-system  nvidia-device-plugin-daemonset-hsn57  1/1  Running  1  2d1h
kube-system  nvidia-device-plugin-daemonset-kwj5b  1/1  Running  1  2d1h
kube-system  nvidia-device-plugin-daemonset-snqxq  1/1  Running  1  2d1h
kube-system  rdma-shared-dp-ds-brk8w  1/1  Running  0  3h1m
kube-system  rdma-shared-dp-ds-q8cmn  1/1  Running  0  3h3m
kube-system  rdma-shared-dp-ds-qjpj5  1/1  Running  0  3h3m
kube-system  seldon-spartakus-volunteer-5b57b95596-bn89s  0/1  ImagePullBackOff  0  2d1h
kube-system  snapshot-controller-0  1/1  Running  7  29d
kubesphere-controls-system  default-http-backend-76d9fb4bb7-5cv8p  1/1  Running  4  29d
kubesphere-controls-system  kubectl-admin-776b98f44f-z7vwk  1/1  Running  4  29d
kubesphere-monitoring-system  alertmanager-main-0  2/2  Running  8  29d
kubesphere-monitoring-system  alertmanager-main-1  2/2  Running  12  29d
kubesphere-monitoring-system  alertmanager-main-2  2/2  Running  14  29d
kubesphere-monitoring-system  kube-state-metrics-687d66b747-c2trb  3/3  Running  18  29d
kubesphere-monitoring-system  node-exporter-gk2pk  2/2  Running  36  29d
kubesphere-monitoring-system  node-exporter-rf5wx  2/2  Running  40  29d
kubesphere-monitoring-system  node-exporter-wkkn4  2/2  Running  34  29d
kubesphere-monitoring-system  node-exporter-wwh4k  2/2  Running  28  29d
kubesphere-monitoring-system  notification-manager-deployment-78664576cb-b256n  2/2  Running  8  29d
kubesphere-monitoring-system  notification-manager-deployment-78664576cb-m8p86  2/2  Running  8  29d
kubesphere-monitoring-system  notification-manager-operator-7d44854f54-jdgq7  2/2  Running  11  29d
kubesphere-monitoring-system  prometheus-k8s-0  2/2  Running  8  29d
kubesphere-monitoring-system  prometheus-k8s-1  2/2  Running  12  29d
kubesphere-monitoring-system  prometheus-operator-8955bbd98-6sf9v  2/2  Running  12  29d
kubesphere-system  ks-apiserver-7c97cccb79-l68w2  1/1  Running  3  29d
kubesphere-system  ks-console-548ff58c89-xx5hz  1/1  Running  3  29d
kubesphere-system  ks-controller-manager-76c8bbdc8d-q884k  1/1  Running  3  29d
kubesphere-system  ks-installer-846c78ddbf-xv6bp  1/1  Running  7  29d
seldon-system  seldon-controller-manager-56848fd587-pmkls  1/1  Running  1  2d1h
```

Error log:

```
[2024-08-01 14:54:03,355] torch.distributed.run: [WARNING] *****************************************
[2024-08-01 14:54:03,355] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-01 14:54:03,355] torch.distributed.run: [WARNING]
[W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known).
[2024-08-01 14:54:03,355] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known).
[W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known). [W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known). [W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known). [W socket.cpp:663] [c10d] The IPv6 network addresses of (i0b52fc4c8b24636b143424c0b6087f9-task0-0.i0b52fc4c8b24636b143424c0b6087f9, 12137) cannot be retrieved (gai error: -2 - Name or service not known). i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:91 [0] NCCL INFO Bootstrap : Using eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:91 [0] NCCL INFO cudaDriverVersion 12020 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:93 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:94 [3] NCCL INFO cudaDriverVersion 12020 i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:96 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:91 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:93 [2] NCCL INFO cudaDriverVersion 12020 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:97 [6] NCCL INFO cudaDriverVersion 12020 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:94 [3] NCCL INFO Bootstrap : Using eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:97 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:93 [2] NCCL INFO Bootstrap : Using eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:98 [7] NCCL INFO cudaDriverVersion 12020 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:95 [4] NCCL INFO cudaDriverVersion 12020 i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:98 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:97 [6] NCCL INFO Bootstrap : Using eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:98 [7] NCCL INFO Bootstrap : Using eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:96 [5] NCCL INFO cudaDriverVersion 12020 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:95 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:92 [1] NCCL INFO cudaDriverVersion 12020 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:94 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:92 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:92 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:96 [5] NCCL INFO Bootstrap : Using eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:95 [4] NCCL INFO Bootstrap : Using eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:92 [1] NCCL INFO Bootstrap : Using eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:91 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:95 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:91 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:91 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:91 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO P2P plugin IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:93 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:93 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:93 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:93 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:98 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO P2P plugin IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO P2P plugin IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:98 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:98 [7] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:98 [7] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:97 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO NET/IB : Using [0]mlx5_11:1/RoCE [1]mlx5_12:1/RoCE [2]mlx5_13:1/RoCE [3]mlx5_14:1/RoCE [RO]; OOB eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:97 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Using network IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:97 [6] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:97 [6] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Using non-device net plugin version 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO NET/IB : Using [0]mlx5_11:1/RoCE [1]mlx5_12:1/RoCE [2]mlx5_13:1/RoCE [3]mlx5_14:1/RoCE [RO]; OOB eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:94 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:94 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:94 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:92 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Using non-device net plugin version 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:94 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:92 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:95 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:92 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:96 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol. 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:95 [4] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:96 [5] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:96 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol. i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO comm 0x55e235d18630 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 4d000 commId 0x227fc4b9f36b77e - Init START i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:95 [4] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO comm 0x55cc76512b60 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId ca000 commId 0x227fc4b9f36b77e - Init START i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:96 [5] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6) i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO P2P plugin IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO P2P plugin IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO P2P plugin IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] 
NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO P2P plugin IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO P2P plugin IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Using non-device net plugin version 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO NET/IB : Using [0]mlx5_11:1/RoCE [1]mlx5_12:1/RoCE [2]mlx5_13:1/RoCE [3]mlx5_14:1/RoCE [RO]; OOB eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Using non-device net plugin version 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Using network IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO NET/IB : Using [0]mlx5_11:1/RoCE [1]mlx5_12:1/RoCE [2]mlx5_13:1/RoCE [3]mlx5_14:1/RoCE [RO]; OOB eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Using network IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Using non-device net plugin version 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Using network IBext 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO NET/IB : Using [0]mlx5_11:1/RoCE [1]mlx5_12:1/RoCE [2]mlx5_13:1/RoCE [3]mlx5_14:1/RoCE [RO]; OOB eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Using non-device net plugin version 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Using network IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO NET/IB : Using [0]mlx5_11:1/RoCE [1]mlx5_12:1/RoCE [2]mlx5_13:1/RoCE [3]mlx5_14:1/RoCE [RO]; OOB eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Using network IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO NET/IB : Using [0]mlx5_11:1/RoCE [1]mlx5_12:1/RoCE [2]mlx5_13:1/RoCE [3]mlx5_14:1/RoCE [RO]; OOB eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Using non-device net plugin version 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO NET/IB : Using [0]mlx5_11:1/RoCE [1]mlx5_12:1/RoCE [2]mlx5_13:1/RoCE [3]mlx5_14:1/RoCE [RO]; OOB eth0:10.244.114.108<0> i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Using network IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Using non-device net plugin version 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Using network IBext i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO comm 0x55ce634c2ab0 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 16000 commId 0x227fc4b9f36b77e - Init START i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO comm 0x556ff7b27310 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 10000 commId 0x227fc4b9f36b77e - Init START i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO comm 0x561193f5f0d0 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 49000 commId 0x227fc4b9f36b77e - Init START i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO comm 0x55dda8125fb0 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 8a000 
commId 0x227fc4b9f36b77e - Init START i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO comm 0x5563a0d18540 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId c6000 commId 0x227fc4b9f36b77e - Init START i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO comm 0x558d15461390 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId 8f000 commId 0x227fc4b9f36b77e - Init START i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO NVLS multicast support is not available on dev 4 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO NVLS multicast support is not available on dev 3 i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO NVLS multicast support is not available on dev 7 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO NVLS multicast support is not available on dev 1 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO NVLS multicast support is not available on dev 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO NVLS multicast support is not available on dev 6 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO NVLS multicast support is not 
available on dev 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000 i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO NVLS multicast support is not available on dev 5 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] NCCL INFO ib_plugin.c:297 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] NCCL INFO ib_plugin.c:481 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] NCCL INFO transport/net.cc:826 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->4 [3] 13/-1/-1->12->11 [4] 13/-1/-1->12->11 [5] 13/-1/-1->12->11 [6] 13/4/-1->12->-1 [7] 13/-1/-1->12->11 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error No such device i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO P2P Chunksize set to 131072 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:891 [6] NCCL INFO ib_plugin.c:481 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] NCCL INFO ib_plugin.c:297 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] NCCL INFO ib_plugin.c:481 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] NCCL INFO transport/net.cc:826 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO P2P Chunksize set to 131072 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO group.cc:64 -> 2 [Async thread] i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error No such device i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] NCCL INFO ib_plugin.c:297 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] NCCL INFO ib_plugin.c:481 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] NCCL INFO transport/net.cc:826 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] 
-1/-1/-1->11->10 [3] 12/-1/-1->11->10 [4] 12/-1/-1->11->10 [5] 12/-1/-1->11->10 [6] -1/-1/-1->11->10 [7] 12/-1/-1->11->10 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:891 [6] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->2 [2] 11/-1/-1->10->9 [3] 11/-1/-1->10->9 [4] 11/-1/-1->10->9 [5] 11/2/-1->10->-1 [6] 11/-1/-1->10->9 [7] 11/-1/-1->10->9 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:891 [6] NCCL INFO NCCL_IB_TC set by environment to 106. i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] -1/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] 14/-1/-1->13->12 [6] 14/-1/-1->13->12 [7] -1/-1/-1->13->12 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:891 [6] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error No such device i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/CUMEM/read i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO init.cc:1396 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->6 [4] 15/-1/-1->14->13 [5] 15/-1/-1->14->13 [6] 15/-1/-1->14->13 [7] 15/6/-1->14->-1 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:891 [6] NCCL INFO ib_plugin.c:297 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Channel 04/0 : 8[0] -> 15[7] via P2P/CUMEM/read i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:891 [6] proxy.cc:1557 NCCL WARN [Proxy Service 14] Failed to execute operation Connect from rank 14, retcode 3 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO P2P Chunksize set to 131072 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:891 [6] NCCL INFO transport/net.cc:826 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO P2P Chunksize set to 131072 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO transport.cc:166 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO P2P Chunksize set to 131072 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO transport/net.cc:399 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO P2P Chunksize set to 131072 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO init.cc:1117 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 05/0 : 12[4] -> 11[3] via P2P/CUMEM/read File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] 8/-1/-1->15->14 [3] 8/-1/-1->15->14 [4] -1/-1/-1->15->14 [5] 8/-1/-1->15->14 [6] 8/-1/-1->15->14 [7] 8/-1/-1->15->14 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO init.cc:1396 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Channel 01/0 : 15[7] -> 14[6] via P2P/CUMEM/read ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 05/0 : 10[2] -> 9[1] via P2P/CUMEM/read Call to ibv_modify_qp failed with error No such device i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 06/0 : 10[2] -> 9[1] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/pretrain_gpt.py", line 207, in <module> i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->15 [4] 9/0/-1->8->-1 [5] 9/-1/-1->8->15 [6] 9/-1/-1->8->15 [7] 9/-1/-1->8->15 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:93 [2] NCCL INFO group.cc:418 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/-1/-1->9->8 [5] -1/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:93 [2] NCCL INFO group.cc:95 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO P2P Chunksize set to 131072 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO transport/net.cc:399 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO P2P Chunksize set to 131072 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO transport.cc:166 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [send] via NET/IBext/0/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO init.cc:1117 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Channel 04/0 : 9[1] -> 0[0] [send] via NET/IBext/0/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO init.cc:1396 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/CUMEM/read pretrain(train_valid_test_datasets_provider, i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 02/0 : 5[5] -> 12[4] [receive] via NET/IBext/2/GDRDMA 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO group.cc:64 -> 2 [Async thread] i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 06/0 : 5[5] -> 12[4] [receive] via NET/IBext/2/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:95 [4] NCCL INFO group.cc:418 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 03/0 : 7[7] -> 14[6] [receive] via NET/IBext/3/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:95 [4] NCCL INFO group.cc:95 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Channel 02/0 : 13[5] -> 4[4] [send] via NET/IBext/2/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO transport/net.cc:399 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 07/0 : 7[7] -> 14[6] [receive] via NET/IBext/3/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO transport.cc:166 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Channel 06/0 : 13[5] -> 4[4] [send] via NET/IBext/2/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO init.cc:1117 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Channel 01/0 : 11[3] -> 2[2] [send] via NET/IBext/1/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO init.cc:1396 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Channel 05/0 : 11[3] -> 2[2] [send] via NET/IBext/1/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO group.cc:64 -> 2 [Async thread] i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] NCCL INFO transport/net.cc:826 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:897 [7] NCCL INFO [Service thread] Connection closed by localRank 6 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 01/0 : 3[3] -> 10[2] [receive] via NET/IBext/1/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:91 [0] NCCL INFO group.cc:418 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO 
Channel 00/0 : 1[1] -> 8[0] [receive] via NET/IBext/0/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:91 [0] NCCL INFO group.cc:95 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 05/0 : 3[3] -> 10[2] [receive] via NET/IBext/1/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO transport/net.cc:399 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Channel 04/0 : 1[1] -> 8[0] [receive] via NET/IBext/0/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO transport.cc:166 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Channel 00/0 : 8[0] -> 15[7] via P2P/CUMEM/read i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO init.cc:1117 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Channel 03/0 : 15[7] -> 6[6] [send] via NET/IBext/3/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO group.cc:64 -> 2 [Async thread] i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Channel 07/0 : 15[7] -> 6[6] [send] via NET/IBext/3/GDRDMA i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Channel 02/0 : 8[0] -> 15[7] via P2P/CUMEM/read i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:891 [6] proxy.cc:1523 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Channel 03/0 : 8[0] -> 15[7] via P2P/CUMEM/read i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 00/0 : 12[4] -> 11[3] via P2P/CUMEM/read i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:97 [6] NCCL INFO group.cc:418 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Channel 05/0 : 8[0] -> 15[7] via P2P/CUMEM/read i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:97 [6] NCCL INFO group.cc:95 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 01/0 : 12[4] -> 11[3] via P2P/CUMEM/read Traceback (most recent call last): 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 02/0 : 12[4] -> 11[3] via P2P/CUMEM/read pretrain(train_valid_test_datasets_provider, i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Channel 07/0 : 8[0] -> 15[7] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/megatron/training.py", line 177, in pretrain i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:851 [0] NCCL INFO Channel 06/0 : 8[0] -> 15[7] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/pretrain_gpt.py", line 207, in <module> i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via P2P/CUMEM/read initialize_megatron(extra_args_provider=extra_args_provider, i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 03/0 : 12[4] -> 11[3] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/megatron/initialize.py", line 89, in initialize_megatron i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 01/0 : 10[2] -> 9[1] via P2P/CUMEM/read _compile_dependencies() i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 04/0 : 12[4] -> 11[3] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/megatron/initialize.py", line 156, in _compile_dependencies i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 02/0 : 10[2] -> 9[1] via P2P/CUMEM/read torch.distributed.barrier() i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 03/0 : 10[2] -> 9[1] via P2P/CUMEM/read return func(*args, **kwargs) i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 06/0 : 12[4] -> 11[3] via P2P/CUMEM/read work = default_pg.barrier(opts=opts) i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Channel 00/0 : 15[7] -> 14[6] via P2P/CUMEM/read File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3685, in barrier i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:857 [4] NCCL INFO Channel 07/0 : 12[4] -> 11[3] via 
P2P/CUMEM/read Last error: i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 04/0 : 10[2] -> 9[1] via P2P/CUMEM/read torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Channel 02/0 : 15[7] -> 14[6] via P2P/CUMEM/read Traceback (most recent call last): i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Channel 04/0 : 15[7] -> 14[6] via P2P/CUMEM/read pretrain(train_valid_test_datasets_provider, i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:852 [2] NCCL INFO Channel 07/0 : 10[2] -> 9[1] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/megatron/training.py", line 177, in pretrain i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Channel 05/0 : 15[7] -> 14[6] via P2P/CUMEM/read initialize_megatron(extra_args_provider=extra_args_provider, i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:853 [7] NCCL INFO Channel 06/0 : 15[7] -> 14[6] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/megatron/initialize.py", line 89, in initialize_megatron i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Channel 00/0 : 11[3] -> 10[2] via P2P/CUMEM/read _compile_dependencies() i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Channel 03/0 : 11[3] -> 10[2] via P2P/CUMEM/read torch.distributed.barrier() i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Channel 00/0 : 13[5] -> 12[4] via P2P/CUMEM/read File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Channel 04/0 : 11[3] -> 10[2] via P2P/CUMEM/read return func(*args, **kwargs) i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Channel 02/0 : 11[3] -> 10[2] via P2P/CUMEM/read File 
"/code/Megatron-LM-core_v0.5.0/megatron/initialize.py", line 156, in _compile_dependencies i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Channel 01/0 : 13[5] -> 12[4] via P2P/CUMEM/read File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3685, in barrier i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Channel 06/0 : 11[3] -> 10[2] via P2P/CUMEM/read work = default_pg.barrier(opts=opts) i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Channel 03/0 : 13[5] -> 12[4] via P2P/CUMEM/read torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:855 [3] NCCL INFO Channel 07/0 : 11[3] -> 10[2] via P2P/CUMEM/read ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Channel 04/0 : 13[5] -> 12[4] via P2P/CUMEM/read Last error: i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Channel 07/0 : 13[5] -> 12[4] via P2P/CUMEM/read Traceback (most recent call last): i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:858 [5] NCCL INFO Channel 05/0 : 13[5] -> 12[4] via P2P/CUMEM/read [Proxy Service 14] Failed to execute operation Connect from rank 14, retcode 3 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:894 [3] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. 
File "/code/Megatron-LM-core_v0.5.0/pretrain_gpt.py", line 207, in <module> i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/megatron/training.py", line 177, in pretrain i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 03/0 : 14[6] -> 13[5] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/megatron/initialize.py", line 89, in initialize_megatron i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 02/0 : 14[6] -> 13[5] via P2P/CUMEM/read initialize_megatron(extra_args_provider=extra_args_provider, i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 04/0 : 14[6] -> 13[5] via P2P/CUMEM/read _compile_dependencies() i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 06/0 : 14[6] -> 13[5] via P2P/CUMEM/read Traceback (most recent call last): i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 05/0 : 14[6] -> 13[5] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/megatron/initialize.py", line 156, in _compile_dependencies i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. 
File "/code/Megatron-LM-core_v0.5.0/pretrain_gpt.py", line 207, in <module> i0b52fc4c8b24636b143424c0b6087f9-task0-1:97:854 [6] NCCL INFO Channel 07/0 : 14[6] -> 13[5] via P2P/CUMEM/read torch.distributed.barrier() i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Channel 01/0 : 9[1] -> 8[0] via P2P/CUMEM/read File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper pretrain(train_valid_test_datasets_provider, i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Channel 02/0 : 9[1] -> 8[0] via P2P/CUMEM/read return func(*args, **kwargs) i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error No such device File "/code/Megatron-LM-core_v0.5.0/megatron/training.py", line 177, in pretrain i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] NCCL INFO ib_plugin.c:297 -> 2 initialize_megatron(extra_args_provider=extra_args_provider, i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] NCCL INFO NCCL_IB_TC set by environment to 106. File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3685, in barrier i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] NCCL INFO ib_plugin.c:481 -> 2 File "/code/Megatron-LM-core_v0.5.0/megatron/initialize.py", line 89, in initialize_megatron i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] NCCL INFO transport/net.cc:826 -> 2 _compile_dependencies() i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Channel 03/0 : 9[1] -> 8[0] via P2P/CUMEM/read File "/code/Megatron-LM-core_v0.5.0/megatron/initialize.py", line 156, in _compile_dependencies i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Channel 05/0 : 9[1] -> 8[0] via P2P/CUMEM/read torch.distributed.barrier() i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:893 [5] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. 
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Channel 07/0 : 9[1] -> 8[0] via P2P/CUMEM/read File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3685, in barrier i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:856 [1] NCCL INFO Channel 06/0 : 9[1] -> 8[0] via P2P/CUMEM/read return func(*args, **kwargs) i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. work = default_pg.barrier(opts=opts) i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:896 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 Last error: i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] NCCL INFO NCCL_IB_TC set by environment to 106. ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] NCCL INFO ib_plugin.c:297 -> 2 work = default_pg.barrier(opts=opts) i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error No such device Call to ibv_modify_qp failed with error No such device i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] NCCL INFO ib_plugin.c:481 -> 2 torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 i0b52fc4c8b24636b143424c0b6087f9-task0-1:93:895 [2] NCCL INFO transport/net.cc:826 -> 2 ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:897 [7] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. Last error: Call to ibv_modify_qp failed with error No such device i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error No such device i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:893 [5] NCCL INFO [Service thread] Connection closed by localRank 4 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] NCCL INFO ib_plugin.c:297 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:894 [3] NCCL INFO [Service thread] Connection closed by localRank 4 i0b52fc4c8b24636b143424c0b6087f9-task0-1:95:892 [4] NCCL INFO ib_plugin.c:481 -> 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:96:893 [5] NCCL INFO [Service thread] Connection closed by localRank 6 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. i0b52fc4c8b24636b143424c0b6087f9-task0-1:94:894 [3] NCCL INFO [Service thread] Connection closed by localRank 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] NCCL INFO NCCL_IB_TC set by environment to 106. 
i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:896 [1] NCCL INFO [Service thread] Connection closed by localRank 2 i0b52fc4c8b24636b143424c0b6087f9-task0-1:92:896 [1] NCCL INFO [Service thread] Connection closed by localRank 0 i0b52fc4c8b24636b143424c0b6087f9-task0-1:91:898 [0] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error No such device i0b52fc4c8b24636b143424c0b6087f9-task0-1:98:897 [7] NCCL INFO [Service thread] Connection closed by localRank 0 [2024-08-01 14:54:45,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 92 closing signal SIGTERM [2024-08-01 14:54:45,887] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 94 closing signal SIGTERM [2024-08-01 14:54:45,887] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 96 closing signal SIGTERM run(args) [2024-08-01 14:54:45,888] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 98 closing signal SIGTERM File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__ [2024-08-01 14:54:45,952] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 91) of binary: /usr/bin/python raise ChildFailedError( Traceback (most recent call last): File "/usr/local/bin/torchrun", line 8, in <module> /code/Megatron-LM-core_v0.5.0/pretrain_gpt.py FAILED sys.exit(main()) ------------------------------------------------------------ File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper return f(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run elastic_launch( return launch_agent(self._config, self._entrypoint, list(args)) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in 
launch_agent torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ exitcode : 1 (pid: 95) error_file: <N/A> Failures: [1]: host : i0b52fc4c8b24636b143424c0b6087f9-task0-1.i0b52fc4c8b24636b143424c0b6087f9.59ae40a6106f4ab1a05239d62961a06e.svc.cluster.local time : 2024-08-01_14:54:45 exitcode : 1 (pid: 97) rank : 10 (local_rank: 2) exitcode : 1 (pid: 93) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: host : i0b52fc4c8b24636b143424c0b6087f9-task0-1.i0b52fc4c8b24636b143424c0b6087f9.59ae40a6106f4ab1a05239d62961a06e.svc.cluster.local time : 2024-08-01_14:54:45 host : i0b52fc4c8b24636b143424c0b6087f9-task0-1.i0b52fc4c8b24636b143424c0b6087f9.59ae40a6106f4ab1a05239d62961a06e.svc.cluster.local rank : 12 (local_rank: 4) traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-08-01_14:54:45 host : i0b52fc4c8b24636b143424c0b6087f9-task0-1.i0b52fc4c8b24636b143424c0b6087f9.59ae40a6106f4ab1a05239d62961a06e.svc.cluster.local rank : 14 (local_rank: 6) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-08-01_14:54:45 rank : 8 (local_rank: 0) exitcode : 1 (pid: 91) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html 11111111111 ============================================================ ```
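The log shows `NCCL_IB_GID_INDEX set by environment to 3` shortly before each `ibv_modify_qp failed with error No such device`. One possible cause, which this log does not confirm, is that GID index 3 is empty or has no network device behind it on some RDMA devices visible inside the pod. The sketch below is a minimal diagnostic. It assumes the standard Linux RDMA sysfs layout under `/sys/class/infiniband`; the helper name `check_gid_index` is hypothetical and does not belong to NCCL or PyTorch.

```python
from pathlib import Path

# An unpopulated GID table slot reads back as all zeros in sysfs.
EMPTY_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def _read(path):
    """Return stripped file contents, or None if the entry is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def check_gid_index(gid_index, sysfs_root="/sys/class/infiniband"):
    """Return one dict per (device, port) describing the GID entry at gid_index.

    Hypothetical helper: it only reads sysfs files and calls no NCCL or verbs API.
    """
    rows = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return rows  # no RDMA devices exposed here (or a different layout)
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            idx = str(gid_index)
            gid = _read(port / "gids" / idx)
            rows.append({
                "device": dev.name,
                "port": port.name,
                "gid": gid,
                # e.g. "RoCE v2"; None when the slot is unpopulated
                "type": _read(port / "gid_attrs" / "types" / idx),
                # network interface backing this GID, if any
                "netdev": _read(port / "gid_attrs" / "ndevs" / idx),
                "populated": gid is not None and gid != EMPTY_GID,
            })
    return rows


if __name__ == "__main__":
    # 3 is the value reported in the log for NCCL_IB_GID_INDEX.
    for row in check_gid_index(3):
        print(row)
```

Running this inside each task pod and comparing the rows across devices may show whether every device has a populated entry with a `netdev` at index 3. If some do not, that would be consistent with the failure, though it would not prove it.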