• Code analysis of Mesos GPU support, and how capos implements GPU scheduling


    This article looks at how Mesos supports GPUs in both the native Mesos containerizer and the Docker containerizer, and at how GPU scheduling is implemented in capos, a framework built on top of Mesos (capos is Hulu's internal resource scheduling platform; see https://www.cnblogs.com/yanghuahui/p/9304302.html).

    When a Mesos slave starts up, it initializes the containerizer's resources, including CPU, memory, and GPU. This step is common to both the Mesos containerizer and the Docker containerizer:

    void Slave::initialize() {
       ...
       Try<Resources> resources = Containerizer::resources(flags);
       ...
    }

    Control then reaches src/slave/containerizer/containerizer.cpp, which invokes the allocator logic based on the mesos-slave/agent startup flags:

    Try<Resources> Containerizer::resources(const Flags& flags)
    {
      ...
      // GPU resource.
      Try<Resources> gpus = NvidiaGpuAllocator::resources(flags);
      if (gpus.isError()) {
        return Error("Failed to obtain GPU resources: " + gpus.error());
      }
    
      // When adding in the GPU resources, make sure that we filter out
      // the existing GPU resources (if any) so that we do not double
      // allocate GPUs.
      resources = gpus.get() + resources.filter(
          [](const Resource& resource) {
            return resource.name() != "gpus";
          });  
      ...
    }
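
    As a hedged illustration of the filter above (the flag values below are made up and not from the original post): if the agent was started with --resources="cpus:8;mem:4096;gpus:2", the declared "gpus:2" entry is dropped and replaced by whatever NvidiaGpuAllocator::resources returned, so GPUs are never counted twice.

    // Made-up flag values for illustration; `gpus` stands for the result of
    // NvidiaGpuAllocator::resources(flags) from the snippet above.
    Resources declared = Resources::parse("cpus:8;mem:4096;gpus:2").get();

    // Keep everything except the declared "gpus" entry...
    Resources nonGpu = declared.filter(
        [](const Resource& resource) {
          return resource.name() != "gpus";
        });

    // ...and let the allocator-provided GPU resources take its place.
    Resources total = gpus.get() + nonGpu;  // cpus:8; mem:4096; gpus:<discovered>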

    src/slave/containerizer/mesos/isolators/gpu/allocator.cpp uses NVIDIA's GPU management library (NVML), together with the startup flags, to report the GPU resources present on the machine so they can be scheduled later.

    // To determine the proper number of GPU resources to return, we
    // need to check both --resources and --nvidia_gpu_devices.
    // There are two cases to consider:
    //
    //   (1) --resources includes "gpus" and --nvidia_gpu_devices is set.
    //       The number of GPUs in --resources must equal the number of
    //       GPUs within --nvidia_gpu_devices.
    //
    //   (2) --resources does not include "gpus" and --nvidia_gpu_devices
    //       is not specified. Here we auto-discover GPUs using the
    //       NVIDIA management Library (NVML). We special case specifying
    //       `gpus:0` explicitly to not perform auto-discovery.
    //
    static Try<Resources> enumerateGpuResources(const Flags& flags)
    {
     ...
    }
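
    A simplified sketch of the two cases described in that comment (the real function in the Mesos tree does more validation; the flag names and the NVML wrapper call nvml::deviceGetCount() follow the Mesos source, but treat the details here as approximate):

    // Simplified sketch only; the cross-check that the "gpus" count in
    // --resources matches --nvidia_gpu_devices is omitted for brevity.
    static Try<Resources> enumerateGpuResourcesSketch(const Flags& flags)
    {
      // Case (1): --nvidia_gpu_devices explicitly lists GPU device ids,
      // so expose exactly that many "gpus".
      if (flags.nvidia_gpu_devices.isSome()) {
        return Resources::parse(
            "gpus:" + stringify(flags.nvidia_gpu_devices->size()));
      }

      // Case (2): nothing specified explicitly; auto-discover via NVML.
      Try<unsigned int> total = nvml::deviceGetCount();
      if (total.isError()) {
        return Error("Failed to query NVML: " + total.error());
      }

      return Resources::parse("gpus:" + stringify(total.get()));
    }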

    Because a GPU resource has to be bound to a specific GPU card number, GPU resources are represented in the scheduling data structures as a set<Gpu>. allocator.cpp provides the implementations of the allocate and deallocate interfaces:

      Future<Nothing> allocate(const set<Gpu>& gpus)
      {
        set<Gpu> allocation = available & gpus;
    
        if (allocation.size() < gpus.size()) {
          return Failure(stringify(gpus - allocation) + " are not available");
        }
    
        available = available - allocation;
        allocated = allocated | allocation;
    
        return Nothing();
      }
    
      Future<Nothing> deallocate(const set<Gpu>& gpus)
      {
        set<Gpu> deallocation = allocated & gpus;
    
        if (deallocation.size() < gpus.size()) {
          return Failure(stringify(gpus - deallocation) + " are not allocated");
        }
    
        allocated = allocated - deallocation;
        available = available | deallocation;
    
        return Nothing();
      }
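
    The &, - and | operators above are stout's std::set helpers (intersection, difference, union). A made-up walk-through of a partially unavailable request:

    // Made-up GPU ids; suppose only gpu0 and gpu2 are still free.
    set<Gpu> available = {gpu0, gpu2};
    set<Gpu> request   = {gpu0, gpu1};

    set<Gpu> allocation = available & request;  // intersection -> {gpu0}

    // allocation.size() (1) < request.size() (2), so allocate() returns a
    // Failure reporting that gpu1 is not available, and neither `available`
    // nor `allocated` is modified.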

    When this is wrapped for the containerizer to call, however, it is enough to specify just the number of GPUs to allocate:

    Future<set<Gpu>> NvidiaGpuAllocator::allocate(size_t count)
    {
      // Need to disambiguate for the compiler.
      Future<set<Gpu>> (NvidiaGpuAllocatorProcess::*allocate)(size_t) =
        &NvidiaGpuAllocatorProcess::allocate;
    
      return process::dispatch(data->process, allocate, count);
    }
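
    The process-side allocate(size_t) that this dispatches to simply picks `count` GPUs out of the `available` set; a hedged sketch (not verbatim from the Mesos source):

      // Hedged sketch of the count-based overload in NvidiaGpuAllocatorProcess;
      // the real implementation may differ in details.
      Future<set<Gpu>> allocate(size_t count)
      {
        if (available.size() < count) {
          return Failure("Requested " + stringify(count) + " GPUs, but only " +
                         stringify(available.size()) + " are available");
        }

        // Take the first `count` GPUs from the available pool...
        set<Gpu> allocation;
        foreach (const Gpu& gpu, available) {
          if (allocation.size() == count) {
            break;
          }
          allocation.insert(gpu);
        }

        // ...and move them to the allocated pool.
        available = available - allocation;
        allocated = allocated | allocation;

        return allocation;
      }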

    Deallocation, on the other hand, still requires explicitly specifying which GPUs to release:

    Future<Nothing> NvidiaGpuAllocator::deallocate(const set<Gpu>& gpus)
    {
      return process::dispatch(
          data->process,
          &NvidiaGpuAllocatorProcess::deallocate,
          gpus);
    }
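
    Putting the two wrappers together, a hypothetical caller (not taken from the original post) allocates by count and must later hand the concrete set back:

    // Hypothetical caller; `allocator` is an NvidiaGpuAllocator.
    allocator.allocate(2)
      .then([&allocator](const set<Gpu>& gpus) {
        // ... launch the container pinned to `gpus` ...

        // On teardown, the exact same set has to be returned explicitly.
        return allocator.deallocate(gpus);
      });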

    If a job uses the Docker containerizer, the GPU allocation logic can be seen in src/slave/containerizer/docker.cpp:

    Future<Nothing> DockerContainerizerProcess::allocateNvidiaGpus(
        const ContainerID& containerId,
        const size_t count)
    {
      if (!nvidia.isSome()) {
        return Failure("Attempted to allocate GPUs"
                       " without Nvidia libraries available");
      }
    
      if (!containers_.contains(containerId)) {
        return Failure("Container is already destroyed");
      }
    
      return nvidia->allocator.allocate(count)
        .then(defer(
            self(),
            &Self::_allocateNvidiaGpus,
            containerId,
            lambda::_1));
    }
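
    The _allocateNvidiaGpus continuation it dispatches to records the allocated set on the container so it can be released later; roughly (a sketch, details may differ from the Mesos source):

    // Sketch of the continuation; the Mesos source may differ slightly.
    Future<Nothing> DockerContainerizerProcess::_allocateNvidiaGpus(
        const ContainerID& containerId,
        const set<Gpu>& allocated)
    {
      if (!containers_.contains(containerId)) {
        // The container went away while we were allocating; hand the
        // GPUs straight back to the allocator.
        return nvidia->allocator.deallocate(allocated);
      }

      // Remember which GPUs this container holds so they can be
      // deallocated when it terminates.
      containers_.at(containerId)->gpus =
        containers_.at(containerId)->gpus | allocated;

      return Nothing();
    }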

    To sum up: GPU resources are discovered and validated when the slave starts, and when a containerizer is launched it draws on the GPU set pool maintained by the slave to obtain resources before starting the job.

    So how does capos implement this? capos is Hulu's internal resource scheduling platform (see https://www.cnblogs.com/yanghuahui/p/9304302.html), and it ships its own capos containerizer for Mesos. Our approach is: when the Mesos slave registers, GPU resources are discovered either explicitly through parameters or by an auto-detection mechanism, and the Mesos agent is started with the GPUs expressed as a range in --resources. The GPUs in an offer therefore appear to capos as a range and can be scheduled in much the same way as port resources. In the capos containerizer, the GPU range chosen by the scheduler is used to bind one or more GPUs to the Docker NVIDIA runtime, which completes the GPU scheduling feature. A hypothetical sketch of that last step is given below.
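
    A hypothetical sketch of the mapping done in the capos containerizer (none of these names appear in the original post; it only illustrates turning a scheduled gpus range, e.g. [2-3], into the device list handed to the Docker NVIDIA runtime via NVIDIA_VISIBLE_DEVICES):

    #include <string>

    // Hypothetical helper: expand a scheduled GPU range, e.g. begin=2, end=3,
    // into the environment variable understood by the NVIDIA container
    // runtime ("NVIDIA_VISIBLE_DEVICES=2,3").
    std::string gpuRangeToVisibleDevices(unsigned int begin, unsigned int end)
    {
      std::string devices;
      for (unsigned int id = begin; id <= end; ++id) {
        if (!devices.empty()) {
          devices += ",";
        }
        devices += std::to_string(id);
      }
      return "NVIDIA_VISIBLE_DEVICES=" + devices;
    }

    // The containerizer would then pass this environment variable (or the
    // equivalent `docker run --gpus` / `--runtime=nvidia` options) when it
    // launches the task's container.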

  • Original article: https://www.cnblogs.com/yanghuahui/p/9381857.html