Tencent vCUDA (GPUManager) Deployment


    References:
    https://cloud.tencent.com/developer/article/1685122
    https://blog.csdn.net/o0haidee0o/article/details/119407372
    https://www.jianshu.com/p/7d795bc226c7

    I. Introduction to GPU Virtualization

    A GPU is a PCIe device designed for matrix computation, typically used in parallel-computing scenarios such as video decoding, rendering, and scientific computing. Different scenarios use the GPU in different ways and rely on different acceleration libraries. The GPU virtualization discussed in this article targets scientific computing, with NVIDIA CUDA as the acceleration library.

    From the user's point of view, GPU virtualization falls into two broad categories: virtualization at the virtual-machine level and virtualization at the container level. VM-level virtualization presents the GPU hardware to multiple KVM virtual machines, each of which installs its own driver. This keeps GPU functionality complete inside every VM while isolating and sharing the GPU, at the cost of relatively high resource overhead. Container-level virtualization follows one of two approaches. The first is to bring the GPU under cgroup management, but there is no mature proposal for this yet and it is unlikely to land in the short term. The second is to wrap the GPU driver: key driver interfaces (such as memory allocation and CUDA thread creation) are wrapped and hijacked, and the hijacking code limits how much compute and memory the processes are allowed to consume. The drawback of this approach is that compatibility depends on the vendor driver, but the overall solution is lightweight and the performance overhead is minimal. GPUManager belongs to this second, container-level category; this article describes how GPUManager works and how to deploy it.

    II. GPUManager Architecture

    GPUManager is a GPU virtualization solution that runs on Kubernetes. Before looking at its architecture, it helps to review how Kubernetes supports heterogeneous resources. Starting with version 1.6, NVIDIA GPU support was added to the Kubernetes in-tree code, but without GPU scheduling it could not be used in real production environments. To satisfy the growing demand for heterogeneous resources (GPU, InfiniBand, FPGA, and so on), the community introduced the Extended Resource and Device Plugin mechanisms in version 1.8, supporting the scheduling and mapping of heterogeneous resources in an out-of-tree fashion.

    GPUManager is Tencent's in-house container-level GPU virtualization solution. Besides the GPU resource-management features of the official NVIDIA plugin, it adds fragment-aware scheduling, GPU topology-aware scheduling, and GPU resource quotas, splitting physical GPUs into fractional resources at the container level. Under the hood it relies only on wrapper libraries and Linux dynamic-library linking to enforce upper bounds on both GPU compute and GPU memory.

    From an engineering perspective, GPUManager consists of three parts: the CUDA wrapper library vcuda, the Kubernetes device plugin gpu-manager-daemonset, and the Kubernetes scheduler plugin gpu-quota-admission.

    The vcuda library wraps the nvidia-ml and libcuda libraries. By intercepting the CUDA calls made by user programs inside a container, it limits how much GPU compute and GPU memory the processes in that container can use.

    gpu-manager-daemonset is a standard Kubernetes device plugin that implements GPU topology awareness and device/driver mapping. GPUManager supports two modes, shared and exclusive. When a workload requests tencent.com/vcuda-core in the range 0-100, it is scheduled in shared mode and fragments are packed onto a single card first. When the tencent.com/vcuda-core request is a multiple of 100, exclusive mode is used: gpu-manager-daemonset builds a topology tree of the GPU cards and picks the optimal placement (the leaf nodes with the shortest distance). Note that GPUManager only supports requests in the range 0-100 plus integer multiples of 100; values such as 150 or 220 that are not multiples of 100 cannot be scheduled. In other words, each GPU card provides 100 units of vcuda-core, so a workload can request either a fraction of one card (0-1 card) or a whole number of cards. GPU memory (tencent.com/vcuda-memory) is allocated in units of 256 MiB, as in the example below.
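    For example, a container that should get half of one GPU and 8 GiB of GPU memory would declare the following resources (an illustrative fragment using the units described above; a complete workload spec appears in section IV):

    resources:
      limits:
        tencent.com/vcuda-core: "50"      # 50 of 100 units = half a card, scheduled in shared mode
        tencent.com/vcuda-memory: "32"    # 32 x 256 MiB = 8 GiB of GPU memory
      requests:
        tencent.com/vcuda-core: "50"
        tencent.com/vcuda-memory: "32"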

    gpu-quota-admission is a Kubernetes scheduler extender that implements the scheduler's predicates interface. When kube-scheduler schedules a Pod that requests tencent.com/vcuda-core, the predicates phase calls gpu-quota-admission's predicates endpoint to filter and bind nodes. gpu-quota-admission also provides GPU resource-pool scheduling, which handles per-namespace quotas for different GPU types.
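    Concretely, for every such Pod kube-scheduler sends an HTTP POST to urlPrefix plus the filterVerb configured in the scheduler policy file in section III, i.e. http://gpu-quota-admission.kube-system:3456/scheduler/predicates. The exchange is sketched below in abridged form, based on the standard scheduler-extender API; the exact field contents depend on the scheduler version:

    Request body (ExtenderArgs, abridged):
    {
      "pod":   { ... the Pod being scheduled ... },
      "nodes": { "items": [ ... candidate nodes ... ] }
    }

    Response body (ExtenderFilterResult, abridged):
    {
      "nodes":       { "items": [ ... nodes that passed the vcuda checks ... ] },
      "failedNodes": { "<node-name>": "<reason the node was filtered out>" },
      "error": ""
    }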

    Together, these three components make up the overall GPUManager solution.

    III. GPUManager Deployment

    GitHub repositories:
    gpu-admission: https://github.com/tkestack/gpu-admission
    gpu-manager:   https://github.com/tkestack/gpu-manager

    1. Driver Installation

    Reference: https://www.cnblogs.com/deny/p/16305945.html

    2. Deployment

    1) Deploy the gpu-quota-admission service

    kubectl apply -f gpu-admission.yaml

    The contents of gpu-admission.yaml are as follows:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: gpu-quota-admission
      namespace: kube-system
    data:
      gpu-quota-admission.config: |
        {
             "QuotaConfigMapName": "gpuquota",
             "QuotaConfigMapNamespace": "kube-system",
             "GPUModelLabel": "gaia.tencent.com/gpu-model",
             "GPUPoolLabel": "gaia.tencent.com/gpu-pool"
         }
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gpu-quota-admission
      namespace: kube-system
    spec:
      ports:
      - port: 3456
        protocol: TCP
        targetPort: 3456
      selector:
        k8s-app: gpu-quota-admission
      type: ClusterIP
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        k8s-app: gpu-quota-admission
      name: gpu-quota-admission
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          k8s-app: gpu-quota-admission
      template:
        metadata:
          labels:
            k8s-app: gpu-quota-admission
          namespace: kube-system
        spec:
          affinity:
            nodeAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - preference:
                  matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: Exists
                weight: 1
          containers:
          - env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS
              value: --incluster-mode=true
            image: ccr.ccs.tencentyun.com/tkeimages/gpu-quota-admission:latest
            imagePullPolicy: IfNotPresent
            name: gpu-quota-admission
            ports:
            - containerPort: 3456
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 2Gi
              requests:
                cpu: "1"
                memory: 1Gi
            volumeMounts:
            - mountPath: /root/gpu-quota-admission/
              name: config
          dnsPolicy: ClusterFirstWithHostNet
          initContainers:
          - command:
            - sh
            - -c
            - ' mkdir -p /etc/kubernetes/ && cp /root/gpu-quota-admission/gpu-quota-admission.config
              /etc/kubernetes/'
            image: busybox
            imagePullPolicy: Always
            name: init-kube-config
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /root/gpu-quota-admission/
              name: config
          priority: 2000000000
          priorityClassName: system-cluster-critical
          restartPolicy: Always
          serviceAccount: gpu-manager
          serviceAccountName: gpu-manager
          terminationGracePeriodSeconds: 30
          tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/master
          volumes:
          - configMap:
              defaultMode: 420
              name: gpu-quota-admission
            name: config
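    After applying the manifest, you can check that the service is up using the k8s-app=gpu-quota-admission label from the Deployment above. Note that the Deployment references the gpu-manager ServiceAccount, which is only created by gpu-manager.yaml in the next step, so the Pod may not start until that file has been applied as well:

    kubectl -n kube-system get pods -l k8s-app=gpu-quota-admission
    kubectl -n kube-system get svc gpu-quota-admission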

    2) Deploy gpu-manager-daemonset

    kubectl apply -f gpu-manager.yaml

    The contents of gpu-manager.yaml are as follows:

    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: gpu-manager
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-admin
    subjects:
    - kind: ServiceAccount
      name: gpu-manager
      namespace: kube-system
    
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: gpu-manager
      namespace: kube-system
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gpu-manager-metric
      namespace: kube-system
      annotations:
        prometheus.io/scrape: "true"
      labels:
        kubernetes.io/cluster-service: "true"
    spec:
      clusterIP: None
      ports:
        - name: metrics
          port: 5678
          protocol: TCP
          targetPort: 5678
      selector:
        name: gpu-manager-ds
    
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: gpu-manager-daemonset
      namespace: kube-system
    spec:
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          name: gpu-manager-ds
      template:
        metadata:
          # This annotation is deprecated. Kept here for backward compatibility
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ""
          labels:
            name: gpu-manager-ds
        spec:
          serviceAccount: gpu-manager
          tolerations:
            # This toleration is deprecated. Kept here for backward compatibility
            # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
            - key: CriticalAddonsOnly
              operator: Exists
            - key: tencent.com/vcuda-core
              operator: Exists
              effect: NoSchedule
          # Mark this pod as a critical add-on; when enabled, the critical add-on
          # scheduler reserves resources for critical add-on pods so that they can
          # be rescheduled after a failure.
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          priorityClassName: "system-node-critical"
          # only run node has gpu device
          nodeSelector:
            nvidia-device-enable: enable
          hostPID: true
          containers:
            - image: tkestack/gpu-manager:v1.1.5
              imagePullPolicy: IfNotPresent
              name: gpu-manager
              securityContext:
                privileged: true
              ports:
                - containerPort: 5678
              volumeMounts:
                - name: device-plugin
                  mountPath: /var/lib/kubelet/device-plugins
                - name: vdriver
                  mountPath: /etc/gpu-manager/vdriver
                - name: vmdata
                  mountPath: /etc/gpu-manager/vm
                - name: log
                  mountPath: /var/log/gpu-manager
                - name: checkpoint
                  mountPath: /etc/gpu-manager/checkpoint
                - name: run-dir
                  mountPath: /var/run
                - name: cgroup
                  mountPath: /sys/fs/cgroup
                  readOnly: true
                - name: usr-directory
                  mountPath: /usr/local/host
                  readOnly: true
              env:
                - name: LOG_LEVEL
                  value: "4"
                - name: EXTRA_FLAGS
                  value: "--logtostderr=false"
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: spec.nodeName
          volumes:
            - name: device-plugin
              hostPath:
                type: Directory
                path: /var/lib/kubelet/device-plugins
            - name: vmdata
              hostPath:
                type: DirectoryOrCreate
                path: /etc/gpu-manager/vm
            - name: vdriver
              hostPath:
                type: DirectoryOrCreate
                path: /etc/gpu-manager/vdriver
            - name: log
              hostPath:
                type: DirectoryOrCreate
                path: /etc/gpu-manager/log
            - name: checkpoint
              hostPath:
                type: DirectoryOrCreate
                path: /etc/gpu-manager/checkpoint
            # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
            # inode change after host docker is restarted
            - name: run-dir
              hostPath:
                type: Directory
                path: /var/run
            - name: cgroup
              hostPath:
                type: Directory
                path: /sys/fs/cgroup
            # We have to mount /usr directory instead of specified library path, because of non-existing
            # problem for different distro
            - name: usr-directory
              hostPath:
                type: Directory
                path: /usr

    3) Label the GPU nodes with nvidia-device-enable=enable

    kubectl label node *.*.*.* nvidia-device-enable=enable

    4) Verify that gpu-manager-daemonset has been scheduled onto the GPU nodes

    kubectl get pods -n kube-system
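    A more targeted check uses the labels from the manifests above: list the nodes that carry the GPU label, then confirm that a gpu-manager-ds Pod is running on each of them:

    kubectl get nodes -l nvidia-device-enable=enable
    kubectl -n kube-system get pods -l name=gpu-manager-ds -o wide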

    3. Custom Scheduler Configuration

    1) Create the custom scheduler policy file /etc/kubernetes/scheduler-policy-config.json with the following content:

    {
      "kind": "Policy",
      "apiVersion": "v1",
      "predicates": [
        {
          "name": "PodFitsHostPorts"
        },
        {
          "name": "PodFitsResources"
        },
        {
          "name": "NoDiskConflict"
        },
        {
          "name": "MatchNodeSelector"
        },
        {
          "name": "HostName"
        }
      ],
      "extenders": [
        {
          "urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler",
          "apiVersion": "v1beta1",
          "filterVerb": "predicates",
          "enableHttps": false,
          "nodeCacheCapable": false
        }
      ],
      "hardPodAffinitySymmetricWeight": 10,
      "alwaysCheckAllPredicates": false
    }

    其中"urlPrefix": "http://gpu-quota-admission.kube-system:3456/scheduler"中的IP地址和端口号,如果有特殊需求则按照需求更换,没有特殊需求这样写就可以了

    2) Modify the kube-scheduler manifest

    On a cluster deployed with kubeadm, the scheduler runs as a static Pod and kubelet continuously watches its manifest file; when the file changes, kubelet automatically restarts the Pod to load the new configuration. So all we need to do here is edit the scheduler's manifest. Back it up first:

    cp /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml.bak

    Then add the following two lines under the command section (these flags belong to the legacy scheduler Policy API; they are available on the v1.19 release used here but were removed in newer Kubernetes versions, where the extender must be configured through a KubeSchedulerConfiguration file instead):

    --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
    --use-legacy-policy-config=true

    The modified kube-scheduler.yaml looks like this:

    apiVersion: v1
    kind: Pod
    metadata:
      creationTimestamp: null
      labels:
        component: kube-scheduler
        tier: control-plane
      name: kube-scheduler
      namespace: kube-system
    spec:
      containers:
      - command:
        - kube-scheduler
        - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
        - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
        - --bind-address=127.0.0.1
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --leader-elect=true
        - --port=0
        - --policy-config-file=/etc/kubernetes/scheduler-policy-config.json              #### added
        - --use-legacy-policy-config=true                                                #### added
        image: 10.2.57.16:5000/kubernetes/kube-scheduler:v1.19.8
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 8
          httpGet:
            host: 127.0.0.1
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 15
        name: kube-scheduler
        resources:
          requests:
            cpu: 100m
        startupProbe:
          failureThreshold: 24
          httpGet:
            host: 127.0.0.1
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 15
        volumeMounts:
        - mountPath: /etc/kubernetes/scheduler.conf
          name: kubeconfig
          readOnly: true
        - mountPath: /etc/kubernetes/scheduler-policy-config.json              #### mount the policy file
          name: policyconfig
          readOnly: true
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet                                   #### set the DNS policy so the scheduler can resolve gpu-quota-admission.kube-system
      priorityClassName: system-node-critical
      volumes:
      - hostPath:
          path: /etc/kubernetes/scheduler.conf
          type: FileOrCreate
        name: kubeconfig
      - hostPath:
          path: /etc/kubernetes/scheduler-policy-config.json
          type: FileOrCreate
        name: policyconfig
    status: {}

    The change takes effect automatically after you save the file: kubelet notices the modified manifest and restarts the kube-scheduler Pod.

    You can confirm this with the following command:

    [root@cri3dp1 manifests]# kubectl -n kube-system get pod | grep sch
    kube-scheduler-cri3dp1                       1/1     Running   0          141m

    In the output, find the Pod named kube-scheduler-XXX and check its AGE column; if the Pod has just been restarted, the scheduler has picked up the new configuration.
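    To make sure the policy file was actually loaded without errors, you can also inspect the scheduler logs (the component=kube-scheduler label is set in the manifest above):

    kubectl -n kube-system logs -l component=kube-scheduler --tail=50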

    4. Check the GPU resources reported by the node

    [root@cri3dp1 manifests]# kubectl describe node k8s-node3
    .........
    Capacity:
      cpu:                       20
      ephemeral-storage:         958487280Ki
      hugepages-1Gi:             0
      hugepages-2Mi:             0
      memory:                    65492456Ki
      pods:                      110
      tencent.com/vcuda-core:    100
      tencent.com/vcuda-memory:  96
    Allocatable:
      cpu:                       20
      ephemeral-storage:         883341875786
      hugepages-1Gi:             0
      hugepages-2Mi:             0
      memory:                    65390056Ki
      pods:                      110
      tencent.com/vcuda-core:    100
      tencent.com/vcuda-memory:  96
    .........

    In this output, tencent.com/vcuda-core: 100 corresponds to one physical GPU (100 units per card), and tencent.com/vcuda-memory: 96 corresponds to 96 × 256 MiB ≈ 24 GiB of GPU memory.

    IV. Testing the Solution

    The tests use the TensorFlow framework; the test image ships with MNIST, CIFAR-10, and AlexNet benchmark workloads, so you can pick whichever test suits your needs.

    Test steps:

    1. Use the TensorFlow framework with the MNIST dataset for validation. TensorFlow image:

    ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2

    2. Create a test workload with the following YAML (the resources section requests half of one card, tencent.com/vcuda-core: "50", plus 32 × 256 MiB = 8 GiB of GPU memory):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
      name: vcuda-test
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          k8s-app: vcuda-test
      template:
        metadata:
          labels:
            k8s-app: vcuda-test
            qcloud-app: vcuda-test
        spec:
          containers:
          - command:
            - sleep
            - 360000s
            env:
            - name: PATH
              value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
            image: ccr.ccs.tencentyun.com/menghe/tensorflow-gputest:0.2
            imagePullPolicy: IfNotPresent
            name: tensorflow-test
            resources:
              limits:
                cpu: "4"
                memory: 8Gi
                tencent.com/vcuda-core: "50"
                tencent.com/vcuda-memory: "32"
              requests:
                cpu: "4"
                memory: 8Gi
                tencent.com/vcuda-core: "50"
                tencent.com/vcuda-memory: "32"

    3. Enter the test container (the workload runs in the default namespace; if you changed that in the YAML, specify the namespace accordingly):

    kubectl exec -it `kubectl get pods -o name | cut -d '/' -f2` -- bash
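    If other Pods exist in the namespace, the following variant picks the test Pod explicitly via the k8s-app=vcuda-test label from the Deployment above:

    kubectl exec -it $(kubectl get pods -l k8s-app=vcuda-test -o name | head -n 1) -- bash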

    4. Run the tests; choose whichever model/dataset you need:

    a. Mnist

    cd /data/tensorflow/mnist && time python convolutional.py

    b. AlexNet

    cd /data/tensorflow/alexnet && time python alexnet_benchmark.py

    c. Cifar10

    cd /data/tensorflow/cifar10 && time python cifar10_train.py

    5. On the host, watch per-process GPU usage with nvidia-smi pmon -s u -d 1 (one utilization sample per second).

    V. Pod Usage

    Example YAML snippets:

    1) A Pod that uses one full P4 card:

    apiVersion: v1
    kind: Pod
    ...
    spec:
      containers:
      - name: gpu
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "100"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "100"

    2) A Pod that uses 0.3 of a card and 5 GiB of GPU memory (20 × 256 MiB):

    apiVersion: v1
    kind: Pod
    ...
    spec:
      containers:
      - name: gpu
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "30"
            tencent.com/vcuda-memory: "20"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "30"
            tencent.com/vcuda-memory: "20"
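
    Following the multiples-of-100 rule from section II, an exclusive workload that needs two whole cards would request tencent.com/vcuda-core: "200". This is only a sketch: the cpu/memory values are placeholders, and tencent.com/vcuda-memory is omitted here, as in example 1, because the request is for whole cards:

    apiVersion: v1
    kind: Pod
    ...
    spec:
      containers:
      - name: gpu
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"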