• How to define pod eviction behavior when a k8s cluster node fails (not ready, unreachable)


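    1. Overview

    Question: when a node in a k8s cluster fails, what happens to the pods running on it?

    OK, this document describes how pod eviction behavior is defined when a node fails.

    2. An experiment

    In this experiment we take one node of the cluster out of service by stopping its kubelet, then observe how the node's information and the behavior of its pods change.

    2.1 Run a deployment

    Make sure pods are running on the node under test.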

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-taints
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      selector:
        matchLabels:
          app: nginx-taints
      replicas: 5
      template:
        metadata:
          labels:
            app: nginx-taints
        spec:
          containers:
          - image: 172.20.58.152/middleware/nginx:1.21.4
            imagePullPolicy: IfNotPresent
            name: nginx
          dnsPolicy: ClusterFirst
          restartPolicy: Always

    Create a deployment from the configuration above.

    [root@nccztsjb-node-23 ~]# kubectl apply -f nginx-taints.yaml 
    deployment.apps/nginx-taints created
    [root@nccztsjb-node-23 ~]# kubectl get pod -l app=nginx-taints -o wide
    NAME                            READY   STATUS    RESTARTS   AGE   IP               NODE               NOMINATED NODE   READINESS GATES
    nginx-taints-6698889db5-j546r   1/1     Running   0          12s   172.39.157.212   nccztsjb-node-24   <none>           <none>
    nginx-taints-6698889db5-tpmb2   1/1     Running   0          12s   172.39.209.124   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-w7rdm   1/1     Running   0          12s   172.39.209.123   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-w7zjm   1/1     Running   0          12s   172.39.157.211   nccztsjb-node-24   <none>           <none>
    nginx-taints-6698889db5-x9mdz   1/1     Running   0          12s   172.39.21.67     nccztsjb-node-25   <none>           <none>

    OK, the pods are running.

    This time we use node nccztsjb-node-24 for the verification.

    2.2 Stop the kubelet process on the node

    Stop the kubelet process on node nccztsjb-node-24:

    systemctl stop kubelet

    After stopping the service, wait a few minutes ...

    Check the status of the nodes in the cluster:

    [root@nccztsjb-node-23 ~]# kubectl get nodes
    NAME               STATUS     ROLES                       AGE   VERSION
    nccztsjb-node-23   Ready      control-plane,master        36d   v1.23.2
    nccztsjb-node-24   NotReady   <none>                      36d   v1.23.2
    nccztsjb-node-25   Ready      ingress,prometheus-server   36d   v1.23.2
    [root@nccztsjb-node-23 ~]# 

    Node nccztsjb-node-24 has changed to NotReady.

    Check how the node's information has changed:

    [root@nccztsjb-node-23 ~]# kubectl describe nodes nccztsjb-node-24 | more
    Name:               nccztsjb-node-24
    Roles:              <none>
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/arch=amd64
                        kubernetes.io/hostname=nccztsjb-node-24
                        kubernetes.io/os=linux
    Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                        node.alpha.kubernetes.io/ttl: 0
                        projectcalico.org/IPv4Address: 172.20.58.65/24
                        projectcalico.org/IPv4IPIPTunnelAddr: 172.39.157.192
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp:  Tue, 25 Jan 2022 12:07:13 +0800
    Taints:             node.kubernetes.io/unreachable:NoExecute
                        node.kubernetes.io/unreachable:NoSchedule

    The following taints have been added automatically:

    Taints:             node.kubernetes.io/unreachable:NoExecute
                        node.kubernetes.io/unreachable:NoSchedule
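
    To check the taints without paging through the full describe output, a small helper such as the one below may be handy. It is only a sketch: the function name list_node_taints is made up for illustration, and it assumes kubectl is already configured for the cluster.

```shell
# Print each taint on a node as key:effect, one per line.
# list_node_taints is a hypothetical helper name, not part of kubectl.
list_node_taints() {
  local node="$1"
  kubectl get node "$node" \
    -o jsonpath='{range .spec.taints[*]}{.key}{":"}{.effect}{"\n"}{end}'
}

# Usage (requires cluster access):
#   list_node_taints nccztsjb-node-24
```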

    Check the pods:

    [root@nccztsjb-node-23 ~]# kubectl get pod
    NAME                            READY   STATUS    RESTARTS   AGE
    nginx-taints-6698889db5-j546r   1/1     Running   0          2m5s
    nginx-taints-6698889db5-tpmb2   1/1     Running   0          2m5s
    nginx-taints-6698889db5-w7rdm   1/1     Running   0          2m5s
    nginx-taints-6698889db5-w7zjm   1/1     Running   0          2m5s
    nginx-taints-6698889db5-x9mdz   1/1     Running   0          2m5s
    [root@nccztsjb-node-23 ~]# 
    kubectl get pod nginx-taints-6698889db5-x9mdz -o yaml
    

    We find that the pod carries the following tolerations:

      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300

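    With these tolerations on the pod, when the node is tainted not-ready or unreachable, tolerationSeconds: 300 lets the pod stay bound to the node for another 5 minutes instead of being evicted immediately.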
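
    If five minutes is too long for a workload, the toleration time can be declared in the pod template itself. The sketch below assumes that tolerations declared this way are used instead of the injected defaults for the same keys. The file name tolerations-patch.yaml, the helper name apply_toleration_patch, and the value 30 are illustrative choices, not recommendations.

```shell
# Write a patch that sets a 30-second toleration for both taints.
# File name and value are illustrative assumptions.
cat > tolerations-patch.yaml <<'EOF'
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30
EOF

# Hypothetical helper: apply the patch to the deployment used in this experiment.
# Changing the pod template triggers a rolling update of the deployment.
apply_toleration_patch() {
  kubectl patch deployment nginx-taints -n default \
    --patch "$(cat tolerations-patch.yaml)"
}

# Usage (requires cluster access):
#   apply_toleration_patch
```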

    Check the state of the docker containers on node nccztsjb-node-24:

    [root@nccztsjb-node-24 ~]# docker ps | grep nginx-taints
    efc6733b1866   ea335eea17ab         "/docker-entrypoint.…"   6 minutes ago   Up 5 minutes             k8s_nginx_nginx-taints-6698889db5-j546r_default_c67a09b1-cb53-4f98-b2b6-c6e7ad45b818_0
    ed4dce36693c   ea335eea17ab         "/docker-entrypoint.…"   6 minutes ago   Up 5 minutes             k8s_nginx_nginx-taints-6698889db5-w7zjm_default_3eb2dbcf-ee55-420b-8758-0512016747b4_0
    c5a78f9b2459   gotok8s/pause:3.6    "/pause"                 6 minutes ago   Up 5 minutes             k8s_POD_nginx-taints-6698889db5-j546r_default_c67a09b1-cb53-4f98-b2b6-c6e7ad45b818_0
    e78370d4fcf6   gotok8s/pause:3.6    "/pause"                 6 minutes ago   Up 5 minutes             k8s_POD_nginx-taints-6698889db5-w7zjm_default_3eb2dbcf-ee55-420b-8758-0512016747b4_0
    [root@nccztsjb-node-24 ~]# 

    They are still running, because kubelet is stopped and nothing tells docker to stop the containers.

    Watch the pod status; after 5 minutes ...

    [root@nccztsjb-node-23 ~]# kubectl get pod -o wide -w
    NAME                            READY   STATUS    RESTARTS   AGE     IP               NODE               NOMINATED NODE   READINESS GATES
    nginx-taints-6698889db5-j546r   1/1     Running   0          4m40s   172.39.157.212   nccztsjb-node-24   <none>           <none>
    nginx-taints-6698889db5-tpmb2   1/1     Running   0          4m40s   172.39.209.124   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-w7rdm   1/1     Running   0          4m40s   172.39.209.123   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-w7zjm   1/1     Running   0          4m40s   172.39.157.211   nccztsjb-node-24   <none>           <none>
    nginx-taints-6698889db5-x9mdz   1/1     Running   0          4m40s   172.39.21.67     nccztsjb-node-25   <none>           <none>
    
    nginx-taints-6698889db5-w7zjm   1/1     Terminating   0          6m26s   172.39.157.211   nccztsjb-node-24   <none>           <none>
    nginx-taints-6698889db5-j546r   1/1     Terminating   0          6m26s   172.39.157.212   nccztsjb-node-24   <none>           <none>
    nginx-taints-6698889db5-dlqht   0/1     Pending       0          0s      <none>           <none>             <none>           <none>
    nginx-taints-6698889db5-msdnh   0/1     Pending       0          0s      <none>           <none>             <none>           <none>
    nginx-taints-6698889db5-dlqht   0/1     Pending       0          0s      <none>           nccztsjb-node-25   <none>           <none>
    nginx-taints-6698889db5-msdnh   0/1     Pending       0          0s      <none>           nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-dlqht   0/1     ContainerCreating   0          0s      <none>           nccztsjb-node-25   <none>           <none>
    nginx-taints-6698889db5-msdnh   0/1     ContainerCreating   0          0s      <none>           nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-msdnh   0/1     ContainerCreating   0          1s      <none>           nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-dlqht   0/1     ContainerCreating   0          1s      <none>           nccztsjb-node-25   <none>           <none>
    nginx-taints-6698889db5-dlqht   1/1     Running             0          2s      172.39.21.68     nccztsjb-node-25   <none>           <none>
    nginx-taints-6698889db5-msdnh   1/1     Running             0          2s      172.39.209.125   nccztsjb-node-23   <none>           <none>

    The pods on node nccztsjb-node-24 are in the Terminating state, and 2 replacement instances have been started on other nodes.

    [root@nccztsjb-node-23 ~]# kubectl get pods --sort-by=.spec.nodeName -o wide
    NAME                            READY   STATUS        RESTARTS   AGE     IP               NODE               NOMINATED NODE   READINESS GATES
    nginx-taints-6698889db5-msdnh   1/1     Running       0          5m57s   172.39.209.125   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-tpmb2   1/1     Running       0          12m     172.39.209.124   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-w7rdm   1/1     Running       0          12m     172.39.209.123   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-j546r   1/1     Terminating   0          12m     172.39.157.212   nccztsjb-node-24   <none>           <none>
    nginx-taints-6698889db5-w7zjm   1/1     Terminating   0          12m     172.39.157.211   nccztsjb-node-24   <none>           <none>
    nginx-taints-6698889db5-dlqht   1/1     Running       0          5m57s   172.39.21.68     nccztsjb-node-25   <none>           <none>
    nginx-taints-6698889db5-x9mdz   1/1     Running       0          12m     172.39.21.67     nccztsjb-node-25   <none>           <none>

    So what state are the docker containers on node nccztsjb-node-24 in at this point?

    [root@nccztsjb-node-24 ~]# docker ps | grep nginx-taints
    efc6733b1866   ea335eea17ab         "/docker-entrypoint.…"   13 minutes ago   Up 13 minutes             k8s_nginx_nginx-taints-6698889db5-j546r_default_c67a09b1-cb53-4f98-b2b6-c6e7ad45b818_0
    ed4dce36693c   ea335eea17ab         "/docker-entrypoint.…"   13 minutes ago   Up 13 minutes             k8s_nginx_nginx-taints-6698889db5-w7zjm_default_3eb2dbcf-ee55-420b-8758-0512016747b4_0
    c5a78f9b2459   gotok8s/pause:3.6    "/pause"                 13 minutes ago   Up 13 minutes             k8s_POD_nginx-taints-6698889db5-j546r_default_c67a09b1-cb53-4f98-b2b6-c6e7ad45b818_0
    e78370d4fcf6   gotok8s/pause:3.6    "/pause"                 13 minutes ago   Up 13 minutes             k8s_POD_nginx-taints-6698889db5-w7zjm_default_3eb2dbcf-ee55-420b-8758-0512016747b4_0
    [root@nccztsjb-node-24 ~]# 

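    They are still running.

    The reason is simple: kubelet has lost contact with the apiserver, so it cannot receive the instruction to stop the pods.

    2.3 Restart the kubelet service on the node

    systemctl start kubelet

    Now check the node status again: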

    [root@nccztsjb-node-23 ~]# kubectl get nodes
    NAME               STATUS   ROLES                       AGE   VERSION
    nccztsjb-node-23   Ready    control-plane,master        36d   v1.23.2
    nccztsjb-node-24   Ready    <none>                      36d   v1.23.2
    nccztsjb-node-25   Ready    ingress,prometheus-server   36d   v1.23.2
    [root@nccztsjb-node-23 ~]# 

    The node is back to normal, in the Ready state.

    Check the pod status:

    [root@nccztsjb-node-23 ~]# kubectl get pods --sort-by=.spec.nodeName -o wide
    NAME                            READY   STATUS    RESTARTS   AGE     IP               NODE               NOMINATED NODE   READINESS GATES
    nginx-taints-6698889db5-msdnh   1/1     Running   0          8m47s   172.39.209.125   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-tpmb2   1/1     Running   0          15m     172.39.209.124   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-w7rdm   1/1     Running   0          15m     172.39.209.123   nccztsjb-node-23   <none>           <none>
    nginx-taints-6698889db5-dlqht   1/1     Running   0          8m47s   172.39.21.68     nccztsjb-node-25   <none>           <none>
    nginx-taints-6698889db5-x9mdz   1/1     Running   0          15m     172.39.21.67     nccztsjb-node-25   <none>           <none>
    [root@nccztsjb-node-23 ~]# 

    The pods that were in the Terminating state have been deleted.

    Check the docker containers on node nccztsjb-node-24:

    [root@nccztsjb-node-24 ~]# docker ps | grep nginx-taints
    [root@nccztsjb-node-24 ~]# 

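    They have been stopped. The reason is simple: kubelet can talk to the api server again, picks up the pending deletion, and stops the pods on the node.

    Check the pod's description: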

    kubectl get pod nginx-taints-6698889db5-x9mdz -o yaml
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300

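    The tolerations are still present and have not been removed, because they have no effect on how the pod runs.

    OK, that is the whole experiment: a simulated node failure in a k8s cluster.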

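    3. Questions and explanations

    • 1. How are the taints added to the node?
    • 2. How are the tolerations added to the pod?
    • 3. When a node fails, how long do the pods keep running?

    OK, let's go through these questions one by one ...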

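    1. How are the taints added to the node?

    The node controller automatically adds taints to a node when certain conditions hold. A node whose Ready condition is False gets node.kubernetes.io/not-ready; a node whose Ready condition is Unknown, as here with kubelet stopped, gets node.kubernetes.io/unreachable.

    For details see: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/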

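    2. How are the tolerations added to the pod?

    The tolerations on the pod are added by an admission controller.

    The DefaultTolerationSeconds plugin, one of the admission controllers enabled by default, automatically adds tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable to any pod that does not already declare them, with a default of tolerationSeconds=300 (in seconds).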
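
    The cluster-wide default can be changed with the kube-apiserver flags --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds. The sketch below checks whether either flag is set. It assumes a kubeadm-style static pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml; the helper name show_default_toleration_flags is made up for illustration.

```shell
# Print any default-toleration flags found in a kube-apiserver manifest.
# If grep finds nothing (or cannot read the file), print a note instead.
# The default path assumes a kubeadm-style control plane node.
show_default_toleration_flags() {
  local manifest="${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
  grep -E -- '--default-(not-ready|unreachable)-toleration-seconds' "$manifest" \
    || echo "no override found in $manifest; default is 300s"
}

# Usage (on a control plane node):
#   show_default_toleration_flags
```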

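    3. When a node fails, how long do the pods keep running?

    As the experiment shows, tolerationSeconds=300 is the default. When a node fails, taints are added to the node automatically; the pods already carry matching tolerations with a default toleration time of 300 s, i.e. 5 minutes.

    That is, when a node fails, its pods stay bound to it for another 5 minutes before they are evicted.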
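
    Note that the 5 minutes start when the taint is added, not when the node actually stops responding. The node controller first waits for the kube-controller-manager setting --node-monitor-grace-period before it marks the node unreachable. A rough estimate of the total delay is sketched below; the 40-second grace period is assumed to be the default for the release used here, and the function name eviction_delay_seconds is hypothetical.

```shell
# Rough delay between a node going silent and its pods being evicted.
# Arguments: node-monitor-grace-period and tolerationSeconds, both in seconds.
eviction_delay_seconds() {
  local grace="${1:-40}"        # assumed kube-controller-manager default
  local toleration="${2:-300}"  # DefaultTolerationSeconds default
  echo $(( grace + toleration ))
}

eviction_delay_seconds          # prints 340
eviction_delay_seconds 40 30    # prints 70
```

    This would explain why the pods in the experiment entered Terminating somewhat later than 5 minutes after kubelet was stopped.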

  • Original article: https://www.cnblogs.com/chuanzhang053/p/15955226.html