• Implementing a Node Resource Overselling (Overcommitment) Scheme


    Background

    A Tencent Cloud article on optimizing Kubernetes cluster utilization describes a node resource overselling scheme, which can be summarized as: based on each node's real historical load data, dynamically adjust the node's total allocatable resources (Allocatable) in order to control how many pods are allowed to be scheduled onto that node. The article lays out the overall technical approach and also raises many details that are left for us to work out; this post is a simple attempt at implementing the scheme.

    Implementation

    The scheme described in the article is as follows:

    - The oversell ratio of each node is stored in the Node's annotations, e.g. the CPU oversell ratio in the annotation stke.platform/cpu-oversale-ratio.
    - The oversell ratio of each node is adjusted dynamically/periodically by a self-developed component, based on the node's historical monitoring data.
    - The node overselling feature must be possible to disable and revert: setting the Node annotation stke.platform/mutate: "false" turns overselling off, and the node's resources are restored on its next heartbeat.
    - A Mutating Admission Webhook registered with kube-apiserver intercepts Node Create and Status Update events, recomputes the Node's Allocatable & Capacity resources according to the oversell ratio, and patches the result to the apiserver (a minimal sketch of this mutation logic follows the list).
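
    A minimal sketch of that mutation step, assuming the webhook has already decoded the Node object; the function name is illustrative, only the CPU ratio is handled, and the scaling math is intentionally simplified:

    package main

    import (
        "fmt"
        "strconv"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // oversellCPU scales the node's CPU Capacity and Allocatable by the ratio stored
    // in the stke.platform/cpu-oversale-ratio annotation. It does nothing when the
    // stke.platform/mutate annotation is "false" or when the ratio is missing/invalid.
    func oversellCPU(node *corev1.Node) error {
        if node.Annotations["stke.platform/mutate"] == "false" {
            return nil // overselling disabled; the next kubelet heartbeat restores real values
        }
        ratioStr, ok := node.Annotations["stke.platform/cpu-oversale-ratio"]
        if !ok {
            return nil
        }
        ratio, err := strconv.ParseFloat(ratioStr, 64)
        if err != nil || ratio <= 0 {
            return fmt.Errorf("invalid cpu-oversale-ratio %q: %v", ratioStr, err)
        }

        // Work in milli-CPU to avoid floating-point surprises.
        scale := func(q resource.Quantity) resource.Quantity {
            return *resource.NewMilliQuantity(int64(float64(q.MilliValue())*ratio), resource.DecimalSI)
        }
        if cpu, ok := node.Status.Capacity[corev1.ResourceCPU]; ok {
            node.Status.Capacity[corev1.ResourceCPU] = scale(cpu)
        }
        if cpu, ok := node.Status.Allocatable[corev1.ResourceCPU]; ok {
            node.Status.Allocatable[corev1.ResourceCPU] = scale(cpu)
        }
        return nil
    }

    func main() {
        node := &corev1.Node{}
        node.Annotations = map[string]string{"stke.platform/cpu-oversale-ratio": "1.5"}
        node.Status.Capacity = corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("8")}
        node.Status.Allocatable = corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("7800m")}
        _ = oversellCPU(node)
        fmt.Println(node.Status.Capacity.Cpu(), node.Status.Allocatable.Cpu()) // 12 and 11700m
    }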

    The key is to think through the following details.

    1. How exactly does the kubelet register the Node with the apiserver, and is it feasible to patch the Node status directly through a webhook?

    1.1 Heartbeat mechanisms in brief: Kubernetes currently supports two heartbeat mechanisms. The newer one, NodeLease, is much lighter weight (each heartbeat is roughly 0.1 KB), which relieves the pressure that frequent heartbeats from a large number of nodes put on the apiserver. Since NodeLease is still a new feature that has to be explicitly enabled via a kubelet feature gate, only the original heartbeat mechanism is considered here (how the two heartbeat mechanisms cooperate is worth writing up separately).

    Take a quick look at the tryUpdateNodeStatus method in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go. It first gets the Node object (the comment notes that this Get is served from the apiserver cache first, which is cheaper), then calls setNodeStatus to fill in the node's status (the heartbeat payload is essentially the node.Status field), e.g. MachineInfo (CPU and memory), the image list, conditions and so on (see pkg/kubelet/kubelet_node_status.go#defaultNodeStatusFuncs for exactly what gets filled in). It then calls PatchNodeStatus to send the node object to the apiserver; the nodeStatusUpdateFrequency parameter controls how often this happens, 10s by default.

    func (kl *Kubelet) tryUpdateNodeStatus(tryNumber int) error {
        // In large clusters, GET and PUT operations on Node objects coming
        // from here are the majority of load on apiserver and etcd.
        // To reduce the load on etcd, we are serving GET operations from
        // apiserver cache (the data might be slightly delayed but it doesn't
        // seem to cause more conflict - the delays are pretty small).
        // If it result in a conflict, all retries are served directly from etcd.
        opts := metav1.GetOptions{}
        if tryNumber == 0 {
            util.FromApiserverCache(&opts)
        }
        node, err := kl.heartbeatClient.CoreV1().Nodes().Get(context.TODO(), string(kl.nodeName), opts)
        if err != nil {
            return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
        }
    
        originalNode := node.DeepCopy()
        if originalNode == nil {
            return fmt.Errorf("nil %q node object", kl.nodeName)
        }
    
        podCIDRChanged := false
        if len(node.Spec.PodCIDRs) != 0 {
            // Pod CIDR could have been updated before, so we cannot rely on
            // node.Spec.PodCIDR being non-empty. We also need to know if pod CIDR is
            // actually changed.
            podCIDRs := strings.Join(node.Spec.PodCIDRs, ",")
            if podCIDRChanged, err = kl.updatePodCIDR(podCIDRs); err != nil {
                klog.Errorf(err.Error())
            }
        }
    
        kl.setNodeStatus(node)
    
        now := kl.clock.Now()
        if now.Before(kl.lastStatusReportTime.Add(kl.nodeStatusReportFrequency)) {
            if !podCIDRChanged && !nodeStatusHasChanged(&originalNode.Status, &node.Status) {
                // We must mark the volumes as ReportedInUse in volume manager's dsw even
                // if no changes were made to the node status (no volumes were added or removed
                // from the VolumesInUse list).
                //
                // The reason is that on a kubelet restart, the volume manager's dsw is
                // repopulated and the volume ReportedInUse is initialized to false, while the
                // VolumesInUse list from the Node object still contains the state from the
                // previous kubelet instantiation.
                //
                // Once the volumes are added to the dsw, the ReportedInUse field needs to be
                // synced from the VolumesInUse list in the Node.Status.
                //
                // The MarkVolumesAsReportedInUse() call cannot be performed in dsw directly
                // because it does not have access to the Node object.
                // This also cannot be populated on node status manager init because the volume
                // may not have been added to dsw at that time.
                kl.volumeManager.MarkVolumesAsReportedInUse(node.Status.VolumesInUse)
                return nil
            }
        }
    
        // Patch the current status on the API server
        updatedNode, _, err := nodeutil.PatchNodeStatus(kl.heartbeatClient.CoreV1(), types.NodeName(kl.nodeName), originalNode, node)
        if err != nil {
            return err
        }
        kl.lastStatusReportTime = now
        kl.setLastObservedNodeAddresses(updatedNode.Status.Addresses)
        // If update finishes successfully, mark the volumeInUse as reportedInUse to indicate
        // those volumes are already updated in the node's status
        kl.volumeManager.MarkVolumesAsReportedInUse(updatedNode.Status.VolumesInUse)
        return nil
    }

    1.2 Is it feasible to patch node.status through a webhook? Verified: configuring a mutating webhook to intercept the heartbeat and patch node.status does work.
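
    For completeness, a hedged sketch of the /mutate handler plumbing, using the admission/v1beta1 API to match the v1beta1 registration shown below. The fixed patch value, certificate paths and handler name are illustrative placeholders; a real handler would decode the Node from review.Request.Object and reuse the ratio logic sketched earlier.

    package main

    import (
        "encoding/json"
        "net/http"

        admissionv1beta1 "k8s.io/api/admission/v1beta1"
    )

    // handleMutate answers the AdmissionReview that the apiserver sends for
    // nodes/status updates and returns a JSONPatch rewriting allocatable CPU.
    func handleMutate(w http.ResponseWriter, r *http.Request) {
        var review admissionv1beta1.AdmissionReview
        if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        if review.Request == nil {
            http.Error(w, "empty admission request", http.StatusBadRequest)
            return
        }

        // Illustrative patch only: a real implementation computes the value
        // from the node's oversell-ratio annotation.
        patch := []map[string]interface{}{
            {"op": "replace", "path": "/status/allocatable/cpu", "value": "12"},
        }
        patchBytes, _ := json.Marshal(patch)
        patchType := admissionv1beta1.PatchTypeJSONPatch

        review.Response = &admissionv1beta1.AdmissionResponse{
            UID:       review.Request.UID,
            Allowed:   true,
            Patch:     patchBytes,
            PatchType: &patchType,
        }
        _ = json.NewEncoder(w).Encode(review)
    }

    func main() {
        http.HandleFunc("/mutate", handleMutate)
        // Admission webhooks must serve TLS; the certificate paths are placeholders.
        _ = http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil)
    }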

    1.3 Webhook configuration details: a MutatingWebhookConfiguration registers the webhook with the apiserver, as shown below.

    apiVersion: admissionregistration.k8s.io/v1beta1
    kind: MutatingWebhookConfiguration
    metadata:
      name: demo-webhook
      labels:
        app: demo-webhook
        kind: mutator
    webhooks:
      - name: demo-webhook.app.svc
        clientConfig:
          service:
            name: demo-webhook
            namespace: app
            path: "/mutate"
          caBundle:  ${CA_BUNDLE}
        rules:
          - operations: [ "UPDATE" ]
            apiGroups: [""]
            apiVersions: ["v1"]
            resources: ["nodes/status"]

    Pay attention to the .webhooks[0].rules field in the YAML above. For a Kubernetes resource such as Pod, pod.status is called a subresource of the Pod. Initially we set rules.resources to ["nodes"] or ["*"], and in both cases the webhook never received the 10-second heartbeats. The answer is in the apiserver's matching logic that decides whether an incoming request should be forwarded to a webhook, shown below: it checks the resource and the subresource at the same time, which means only ["*/*"] matches every case.

    func (r *Matcher) resource() bool {
        opRes, opSub := r.Attr.GetResource().Resource, r.Attr.GetSubresource()
        for _, res := range r.Rule.Resources {
            res, sub := splitResource(res)
            resMatch := res == "*" || res == opRes
            subMatch := sub == "*" || sub == opSub
            if resMatch && subMatch {
                return true
            }
        }
        return false
    }
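
    To make that concrete, the following standalone sketch reproduces the check above (splitResource is reconstructed here for illustration; the real helper lives next to the Matcher). It shows that "nodes" and "*" do not match the nodes/status heartbeat, while "nodes/status" and "*/*" do:

    package main

    import (
        "fmt"
        "strings"
    )

    // splitResource mirrors the apiserver helper: "nodes/status" -> ("nodes", "status").
    func splitResource(r string) (resource, subresource string) {
        if parts := strings.SplitN(r, "/", 2); len(parts) == 2 {
            return parts[0], parts[1]
        }
        return r, ""
    }

    // matches applies the same resMatch && subMatch logic as the Matcher above.
    func matches(ruleResource, opRes, opSub string) bool {
        res, sub := splitResource(ruleResource)
        return (res == "*" || res == opRes) && (sub == "*" || sub == opSub)
    }

    func main() {
        // The kubelet heartbeat arrives as an UPDATE on resource "nodes", subresource "status".
        for _, rule := range []string{"nodes", "*", "nodes/status", "*/*"} {
            fmt.Printf("rule %-14q matches nodes/status: %v\n", rule, matches(rule, "nodes", "status"))
        }
    }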

    1.4 What kinds of patch does Kubernetes support? The apiserver distinguishes the patch type by the Content-Type header of the request.

    JSON Patch, Content-Type: application/json-patch+json (see https://tools.ietf.org/html/rfc6902). This patch type supports a rich set of operations, such as add, replace, remove and copy.

    Merge Patch, Content-Type: application/merge-patch+json (see https://tools.ietf.org/html/rfc7386). With this type, any field included in the patch simply replaces the existing value.

    Strategic Merge Patch, Content-Type: application/strategic-merge-patch+json. This type uses metadata declared on the object's Go type to decide which fields should be merged rather than replaced by default. For example, the containers field in the pod.Spec definition uses patchStrategy and patchMergeKey to declare that the list is merged by the name key (entries with different names are merged in rather than replacing the whole list):

    containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`
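
    To make the three content types concrete, here is a hedged client-go sketch that sends one patch of each type to a Node. The node name, kubeconfig path and patch payloads are placeholders; patching allocatable additionally goes through the status subresource and requires the corresponding RBAC permissions.

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Assumes a reachable cluster via the default kubeconfig; "worker-1" is a placeholder.
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client, err := kubernetes.NewForConfig(config)
        if err != nil {
            panic(err)
        }
        nodeName := "worker-1"

        // JSON Patch (application/json-patch+json): an explicit list of operations.
        jsonPatch := []byte(`[{"op":"replace","path":"/status/allocatable/cpu","value":"12"}]`)
        // Merge Patch (application/merge-patch+json): included fields replace the old values wholesale.
        mergePatch := []byte(`{"metadata":{"annotations":{"stke.platform/cpu-oversale-ratio":"1.5"}}}`)
        // Strategic Merge Patch: honors patchStrategy/patchMergeKey metadata on the Go types.
        strategicPatch := []byte(`{"metadata":{"labels":{"oversold":"true"}}}`)

        _, err = client.CoreV1().Nodes().Patch(context.TODO(), nodeName, types.JSONPatchType, jsonPatch, metav1.PatchOptions{}, "status")
        fmt.Println("json patch:", err)
        _, err = client.CoreV1().Nodes().Patch(context.TODO(), nodeName, types.MergePatchType, mergePatch, metav1.PatchOptions{})
        fmt.Println("merge patch:", err)
        _, err = client.CoreV1().Nodes().Patch(context.TODO(), nodeName, types.StrategicMergePatchType, strategicPatch, metav1.PatchOptions{})
        fmt.Println("strategic merge patch:", err)
    }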

    2. Once node resources are oversold, does Kubernetes' dynamic cgroup adjustment mechanism still work correctly?

    3. Node status updates are very frequent, and every status update triggers the webhook; in a large cluster this can easily become a performance problem for the apiserver. How do we deal with it?

    One option is to enable the NodeLease heartbeat mechanism (it still needs to be checked whether the NodeLease heartbeat carries the allocatable values at all); another is to increase the heartbeat interval so that status is reported less often. In any case the webhook logic should be kept as simple as possible.

    4. Does overselling also have an oversell effect on the kubelet's eviction configuration, or does eviction still operate on the node's real capacity and load? If eviction is affected, how should that be handled?

    5. When the oversell ratio is lowered, a node may end up with Sum(pods' request resource) > node's allocatable. Is this risky, and how should it be handled?

    The preliminary plan is that when the oversell ratio is lowered, allocatable is never allowed to drop below Sum(pods' request resource).
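
    A hedged sketch of that guard, assuming the adjuster component already has the list of pods bound to the node; the function name and the CPU-only handling are illustrative:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // clampAllocatable returns the oversold CPU allocatable, but never less than the
    // sum of CPU requests of the pods already scheduled onto the node.
    func clampAllocatable(realAllocatableMilli int64, ratio float64, pods []corev1.Pod) *resource.Quantity {
        oversold := int64(float64(realAllocatableMilli) * ratio)

        var requested int64
        for _, pod := range pods {
            for _, c := range pod.Spec.Containers {
                if req, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
                    requested += req.MilliValue()
                }
            }
        }
        if oversold < requested {
            oversold = requested // lowering the ratio must not drop below what is already requested
        }
        return resource.NewMilliQuantity(oversold, resource.DecimalSI)
    }

    func main() {
        pod := corev1.Pod{}
        pod.Spec.Containers = []corev1.Container{{
            Resources: corev1.ResourceRequirements{
                Requests: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("6")},
            },
        }}
        // Real allocatable 4 cores, ratio lowered to 1.2 -> 4800m, but 6000m is already requested,
        // so the clamped value stays at the requested sum.
        fmt.Println(clampAllocatable(4000, 1.2, []corev1.Pod{pod}))
    }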

    6. Node monitoring is tied to the node's Allocatable & Capacity resources, so after overselling the monitoring data for the node is no longer accurate and needs to be corrected to some degree. How can the monitoring system dynamically learn the oversell ratio and adjust its data and dashboards accordingly?

    7. How should Node Allocatable and Capacity each be oversold, and how does overselling interact with the node's reserved resources?

    Notes

    In an OCP (OpenShift Container Platform) environment, the MutatingAdmissionWebhook admission plugin has to be enabled by adding the following configuration on the master:

    admissionConfig:
      pluginConfig:
        MutatingAdmissionWebhook:
          configuration:
            apiVersion: v1
            kind: DefaultAdmissionConfig
            disable: false
     

    References:

    https://blog.csdn.net/shida_csdn/article/details/84286058
