• Troubleshooting a DDP hang in PyTorch multi-GPU training


    Background

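    I was training a model in parallel on a single machine with multiple GPUs, using DistributedDataParallel (DDP) for speed. Whenever more than one GPU was used, training hung: GPU 0 sat at 100% utilization and the job made no further progress.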

    Investigation

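    nvtop showed that GPU 0 had been assigned as many processes as nproc_per_node, instead of the expected one process per GPU across N GPUs.
    The code that sets up DDP looked like this: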

    model = MyNet(config).cuda()
    model = torch.nn.parallel.DistributedDataParallel(model, 
                                                          device_ids=[config.LOCAL_RANK], 
                                                          output_device=config.LOCAL_RANK, 
                                                          broadcast_buffers=False,
                                                          find_unused_parameters=True)
    

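    Logging confirmed that every process placed its model on cuda:0. MyNet(config).cuda() is called without a device index, so each process uses its current CUDA device, which defaults to device 0. This also explains why training only worked with nproc_per_node=1.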
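One way to make the per-process device explicit is sketched below. It assumes the script is launched with torchrun, which exports a LOCAL_RANK environment variable to each worker. The helper names local_device_index and bind_process_to_gpu are hypothetical and do not appear in the original code.

```python
import os


def local_device_index(visible_gpus, env=None):
    """Map this process's LOCAL_RANK to a GPU index (hypothetical helper)."""
    if visible_gpus < 1:
        raise ValueError("no visible GPU to bind to")
    env = os.environ if env is None else env
    # torchrun sets LOCAL_RANK for each worker; fall back to 0 for a plain run.
    local_rank = int(env.get("LOCAL_RANK", 0))
    return local_rank % visible_gpus


def bind_process_to_gpu():
    """Make the chosen GPU the current CUDA device for this process."""
    import torch  # imported lazily so the helper above works without PyTorch

    index = local_device_index(torch.cuda.device_count())
    torch.cuda.set_device(index)
    return torch.device("cuda", index)
```

If bind_process_to_gpu() is called before the model is built, a bare .cuda() resolves to that process's own GPU rather than cuda:0.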

    Example

    According to the official DDP documentation, there are two main ways to set up the worker processes:

    • Spawn them manually with torch.multiprocessing
    • Let torch.distributed.run/torchrun initialize them automatically
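For comparison, the first option (manual spawn) might look roughly like the sketch below. It is not taken from the original post. The gloo backend, the address 127.0.0.1, the port 29500, and the names rendezvous_env, worker, and launch are all assumptions chosen so the sketch can run on a CPU-only machine. With torchrun, the launcher sets the equivalent environment variables itself.

```python
import os


def rendezvous_env(rank, world_size, addr="127.0.0.1", port=29500):
    """Environment a manually spawned worker needs (values are assumptions)."""
    return {
        "MASTER_ADDR": addr,
        "MASTER_PORT": str(port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "LOCAL_RANK": str(rank),  # single node: local rank equals global rank
    }


def worker(rank, world_size):
    """Entry point for each spawned process; mp.spawn passes rank first."""
    import torch.distributed as dist

    os.environ.update(rendezvous_env(rank, world_size))
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    print(f"worker {dist.get_rank()} of {dist.get_world_size()} is up")
    dist.destroy_process_group()


def launch(world_size=2):
    """Start world_size workers and wait for them to finish."""
    import torch.multiprocessing as mp

    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

Calling launch(2) from a script's main guard would start two workers. The manual route means the script itself owns the rank and rendezvous bookkeeping that torchrun otherwise handles.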

    I went with the latter and used the official DDP example, elastic_ddp.py:

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim
    
    from torch.nn.parallel import DistributedDataParallel as DDP
    
    class ToyModel(nn.Module):
        def __init__(self):
            super(ToyModel, self).__init__()
            self.net1 = nn.Linear(10, 10)
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(10, 5)
    
        def forward(self, x):
            return self.net2(self.relu(self.net1(x)))
    
    
    def demo_basic():
        dist.init_process_group("nccl")
        rank = dist.get_rank()
        print(f"Start running basic DDP example on rank {rank}.")
    
        # create model and move it to GPU with id rank
        device_id = rank % torch.cuda.device_count()
        model = ToyModel().to(device_id)
        ddp_model = DDP(model, device_ids=[device_id])
    
        loss_fn = nn.MSELoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    
        optimizer.zero_grad()
        outputs = ddp_model(torch.randn(20, 10))
        labels = torch.randn(20, 5).to(device_id)
        loss_fn(outputs, labels).backward()
        optimizer.step()
    
    if __name__ == "__main__":
        demo_basic()
    

    Output

    $ torchrun --nproc_per_node=8 elastic_ddp.py
    WARNING:torch.distributed.run:
    *****************************************
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    *****************************************
    Start running basic DDP example on rank 7.
    Start running basic DDP example on rank 4.
    Start running basic DDP example on rank 2.
    Start running basic DDP example on rank 6.
    Start running basic DDP example on rank 1.
    Start running basic DDP example on rank 3.
    Start running basic DDP example on rank 0.
    Start running basic DDP example on rank 5.
    

    Solution

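    After comparing with the official example, I was fairly sure the problem lay in CUDA device assignment.
    I changed the main function as follows: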

    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")
    
    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = MyNet(config).to(device_id)
    ddp_model = DDP(model, broadcast_buffers=False, find_unused_parameters=True)
    

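    The official documentation states that device_ids and output_device must be None for multi-device modules and CPU modules. For a single-device module they may also be left at the default None, provided the module and its inputs are already placed on the correct device: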

    device_ids (list of python:int or torch.device) :
    CUDA devices. 1) For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None. 2) For multi-device modules and CPU modules, device_ids must be None.
    When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)
    output_device (int or torch.device) :
    Device location of output for single-device CUDA modules. For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)
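Read this way, a single-device module can be wrapped in either of two ways. The sketch below illustrates both under stated assumptions: the process group is already initialized, one process drives one GPU, and the helper names ddp_device_kwargs and wrap_single_device_module are hypothetical.

```python
def ddp_device_kwargs(device_index, explicit=True):
    """Build the device-related DDP arguments for a single-device module."""
    if explicit:
        # Name the one device this process's module lives on.
        return {"device_ids": [device_index], "output_device": device_index}
    # Leave both at None; the caller must then place module and inputs itself.
    return {"device_ids": None, "output_device": None}


def wrap_single_device_module(model, device_index, explicit=True):
    """Move the model to its GPU, then wrap it in DDP.

    Assumes dist.init_process_group() has already been called.
    """
    from torch.nn.parallel import DistributedDataParallel as DDP

    model = model.to(f"cuda:{device_index}")
    return DDP(model, **ddp_device_kwargs(device_index, explicit))
```

In both variants the model is moved to its own GPU before wrapping. With explicit=False, the inputs to the forward pass must also be placed on that device by the caller, as the quoted documentation notes.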

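    With device_id obtained as above, the log showed that the models were indeed placed on different CUDA devices. Once distributed training started, however, the problem changed to: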

    "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1"
    

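    So somewhere in the model, parameters and inputs were still ending up on different GPUs. The dataset is stored as pickled numpy arrays, so the data feed includes a numpy-to-tensor conversion.
    The fix was therefore to pass device_id as an extra argument to the forward function of every layer involved, which pins where the converted CUDA tensor is stored.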

    def forward(self, input, device):
        input = torch.from_numpy(input).float().cuda(device, non_blocking=True)
    

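    The shorthand input.cuda() with no argument places the tensor on the current CUDA device, which is cuda:0 unless it has been changed, and that caused the error.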

    References

    Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.12.1+cu102 documentation
    python - Stuck at this error "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu" - Stack Overflow

  • Original article: https://www.cnblogs.com/azureology/p/16632988.html