Troubleshooting a hang in PyTorch multi-GPU DDP training

Background

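The model is trained in parallel on a single machine with multiple GPUs, using DistributedDataParallel for speed. Whenever more than one GPU is used, training hangs: GPU0 sits at 100% utilization and nothing progresses.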

Investigation

nvtop showed that GPU0 was hosting as many processes as nproc_per_node, rather than the expected one process per GPU across N GPUs.
The code that sets up DDP looked like this:

model = MyNet(config).cuda()
model = torch.nn.parallel.DistributedDataParallel(model, 
                                                      device_ids=[config.LOCAL_RANK], 
                                                      output_device=config.LOCAL_RANK, 
                                                      broadcast_buffers=False,
                                                      find_unused_parameters=True)

The logs showed that the model was placed on cuda:0 in every process, which also explains why training only worked with nproc_per_node=1.
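One way to make this kind of misplacement visible is to have every process log which device it is actually using. The following is a minimal sketch, assuming the job is launched with torchrun, which exports RANK, LOCAL_RANK and WORLD_SIZE to each worker; the helper name describe_placement is hypothetical and not part of the original code.

```python
import os


def describe_placement(device_index, environ=None):
    """Return a one-line summary of which GPU this process is using.

    torchrun exports RANK, LOCAL_RANK and WORLD_SIZE to each worker; the
    defaults below only apply when the script runs without a launcher.
    """
    env = os.environ if environ is None else environ
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    # On a single node, each process is expected to own the GPU whose
    # index equals its local rank.
    flag = "" if device_index == local_rank else "  <-- mismatch"
    return (f"rank {rank}/{world_size} local_rank {local_rank} "
            f"-> cuda:{device_index}{flag}")
```

In a training script this could be called as print(describe_placement(next(model.parameters()).device.index)) right after the model is moved to the GPU. In the failing setup described above, every rank other than 0 would be flagged as a mismatch.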

Example

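According to the official DDP documentation, there are two main ways to set up the multiple processes:

Spawn the processes manually with torch.multiprocessing
Let torch.distributed.run/torchrun initialize them automatically

The official DDP example for the latter, elastic_ddp.py, is used here: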

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

if __name__ == "__main__":
    demo_basic()

Output

$ torchrun --nproc_per_node=8 elastic_ddp.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Start running basic DDP example on rank 7.
Start running basic DDP example on rank 4.
Start running basic DDP example on rank 2.
Start running basic DDP example on rank 6.
Start running basic DDP example on rank 1.
Start running basic DDP example on rank 3.
Start running basic DDP example on rank 0.
Start running basic DDP example on rank 5.

Fix

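After comparing with the official example, it was fairly clear that the CUDA device assignment was at fault.
The main function was changed as follows: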

dist.init_process_group("nccl")
rank = dist.get_rank()
print(f"Start running basic DDP example on rank {rank}.")

# create model and move it to GPU with id rank
device_id = rank % torch.cuda.device_count()
model = MyNet(config).to(device_id)
ddp_model = DDP(model, broadcast_buffers=False, find_unused_parameters=True)

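According to the official documentation, device_ids and output_device must be left at their default of None for multi-device modules and CPU modules. For a single-device module such as this one, they may either name the one GPU the process uses or be left as None, provided the module and its inputs are already on the correct device: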

device_ids (list of python:int or torch.device) :
CUDA devices. 1) For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None. 2) For multi-device modules and CPU modules, device_ids must be None.
When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)
output_device (int or torch.device) :
Device location of output for single-device CUDA modules. For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)
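Besides computing the index from the global rank, the device can also be taken from the LOCAL_RANK variable that torchrun exports and set as the process's current device. This is a minimal sketch under that assumption; the helper names local_rank_from_env and pin_local_device are hypothetical.

```python
import os


def local_rank_from_env(environ=None):
    """Read the per-node process index that torchrun exports as LOCAL_RANK.

    Falls back to 0 when the variable is absent, e.g. in a plain
    single-process run without a launcher.
    """
    env = os.environ if environ is None else environ
    return int(env.get("LOCAL_RANK", 0))


def pin_local_device(environ=None):
    """Make this process default to its own GPU and return that device."""
    # Imported here so local_rank_from_env stays usable without PyTorch.
    import torch

    local_rank = local_rank_from_env(environ)
    # After this call, a bare .cuda() and "cuda" without an index both
    # refer to this GPU instead of cuda:0.
    torch.cuda.set_device(local_rank)
    return torch.device("cuda", local_rank)
```

Calling pin_local_device() once, early in each worker and before the model is built, and then using the returned device in .to(device) should keep later allocations in that process on the same GPU.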

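With device_id computed from the rank as in the modified main function, the logs confirmed that the model was now placed on a different CUDA device in each process. Once distributed training started, the problem changed to: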

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1"

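So some parameters and inputs in the model were still ending up on different GPUs. The dataset is fed as pickled numpy arrays that have to be converted to tensors, so the approach taken was to pass a device_id argument through the forward function of every layer, which fixes the device on which the converted CUDA tensor is stored.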

def forward(self, input, device):
    input = torch.from_numpy(input).float().cuda(device, non_blocking=True)

The shorter input.cuda() call places the tensor on the current CUDA device, which defaults to cuda:0, and that causes the error.
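When this error appears, it can help to list which named tensors are not on the device the process is supposed to use. Below is a small sketch of such a check; find_misplaced is a hypothetical helper, and the names in the demo data are made up for illustration.

```python
def find_misplaced(named_devices, expected):
    """Return the names whose device differs from the expected one.

    named_devices: iterable of (name, device) pairs. Each device is compared
    through str(), so torch.device objects and plain strings both work.
    expected: device string such as "cuda:1".
    """
    return [name for name, device in named_devices if str(device) != expected]


# Illustrative data only: two tensors on the right GPU, one left on cuda:0.
placements = [
    ("net1.weight", "cuda:1"),
    ("net1.bias", "cuda:1"),
    ("scale", "cuda:0"),
]
print(find_misplaced(placements, "cuda:1"))  # prints ['scale']
```

With a real model the pairs could come from ((n, p.device) for n, p in model.named_parameters()) and likewise from model.named_buffers(). Tensors created inside forward, such as the converted numpy input here, are not covered by this and need their .device checked separately.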

References

Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.12.1+cu102 documentation
python - Stuck at this error "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu" - Stack Overflow

Original article: https://www.cnblogs.com/azureology/p/16632988.html