• Common Kubernetes cluster errors and how to fix them


     Recorded below are error messages from /var/log/messages on a Kubernetes cluster worker node:

    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9036_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9036_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9036_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9036_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9042_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9042_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9042_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9042_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9107_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9107_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9107_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9107_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9114_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9114_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9114_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9114_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9128_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9128_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9128_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9128_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9142_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9142_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9142_systemd_test_default.slice.
    Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9142_systemd_test_default.slice.
    Jan  7 09:54:21 worker02 systemd: Created slice libcontainer_9148_systemd_test_default.slice.
    

     Cause:

    Roughly speaking, these messages are a side effect of setting the cgroup-driver to systemd: runc briefly creates and then removes a transient test slice to probe systemd's capabilities, producing the Created/Removed pairs above. The messages are harmless noise and do not affect container metrics.

    https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L123

    https://www.ibm.com/support/knowledgecenter/en/SSBS6K_3.2.0/troubleshoot/cgroup_driver.html
    https://www.ibm.com/support/pages/recurring-messages-complain-scope-libcontainer-nnnnn-has-no-pids-refusing

    Workaround: filter the messages out in rsyslog

    ## Method 1: drop the messages by program name and content
    cat <<'EOF' >/etc/rsyslog.d/ignore-systemd-session-slice.conf
    if ($programname == "systemd") and ($msg contains "_systemd_test_default.slice" or $msg contains "systemd-test-default-dependencies.scope") then { stop }
    EOF
    systemctl restart rsyslog.service
    
    ## Method 2: drop any raw message containing "libcontainer" (legacy rsyslog syntax)
    cat <<'EOF' >/etc/rsyslog.d/ignore-systemd-session-slice.conf
    :rawmsg, contains, "libcontainer" ~
    EOF
    systemctl restart rsyslog.service
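    Before touching a node, the filtering logic can be sanity-checked offline. The sketch below simulates what method 2's "contains libcontainer, discard" rule would drop, using grep on a small inlined sample log (the sample lines and /tmp path are only for illustration; on a real node, inspect /var/log/messages and validate the config file itself with `rsyslogd -N1`):

```shell
# Simulate the "drop anything containing libcontainer" rule from method 2.
# Sample log inlined so this runs anywhere; /tmp path is arbitrary.
cat <<'EOF' > /tmp/sample-messages.log
Jan  7 09:54:20 worker02 systemd: Created slice libcontainer_9036_systemd_test_default.slice.
Jan  7 09:54:20 worker02 systemd: Removed slice libcontainer_9036_systemd_test_default.slice.
Jan  7 09:54:21 worker02 kernel: eth0: link is up
EOF
# Only the unrelated kernel line should survive the filter:
grep -v 'libcontainer' /tmp/sample-messages.log
```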
    

     

     Recommended approach: 

    Fix the problem at its source by changing the kubelet startup parameters:

    KUBELET_EXTRA_ARGS=--kubelet-cgroups=/system.slice/kubelet.service --runtime-cgroups=/system.slice/docker.service
    

     Edit the kubelet service configuration file and set KUBELET_EXTRA_ARGS to the flags above. The commented example below also shows optional resource-reservation flags; in the real file the whole value must stay on one line (it is wrapped here for readability):

    cat /etc/sysconfig/kubelet
    KUBELET_EXTRA_ARGS=--kubelet-cgroups=/system.slice/kubelet.service --runtime-cgroups=/system.slice/docker.service
    
    #KUBELET_EXTRA_ARGS=--kubelet-cgroups=/system.slice/kubelet.service --runtime-cgroups=/system.slice/docker.service --feature-gates=LocalStorageCapacityIsolation=true 
    #                   --kube-reserved-cgroup=/kubepods.slice --kube-reserved=cpu=500m,memory=500Mi,ephemeral-storage=1Gi 
    #                   --system-reserved-cgroup=/system.slice --system-reserved=cpu=500m,memory=500Mi,ephemeral-storage=1Gi 
    #                   --eviction-hard=memory.available<500Mi,nodefs.available<10%
    #                   --max-pods=200
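     On nodes where kubelet is managed purely through systemd (no /etc/sysconfig/kubelet), the same flags can live in a drop-in unit file instead. The path and file name below are assumptions; adjust them to the distro's layout:

```ini
# /etc/systemd/system/kubelet.service.d/20-cgroups.conf  (hypothetical path)
[Service]
Environment="KUBELET_EXTRA_ARGS=--kubelet-cgroups=/system.slice/kubelet.service --runtime-cgroups=/system.slice/docker.service"
```

     After adding a drop-in, run `systemctl daemon-reload` before restarting kubelet.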
    

    Restart the kubelet service

    systemctl restart kubelet.service
    

      

    Recorded on 2020.04.28

    Error messages from /var/log/messages on a Kubernetes cluster worker node:

    Apr 28 11:25:45 worker02 kernel: nfs4_reclaim_open_state: 6 callbacks suppressed
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
    

      

    Cause: file handles are not being released, leading to an NFS lock deadlock.

    Solution: tune the worker node's kernel parameters and restart the nfs-server service.

    Where possible, consider GlusterFS, Ceph, or another distributed filesystem as the backing storage instead; they beat NFS on performance, stability, and high availability.
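    To confirm the fix actually stops the errors, it helps to count how often lock reclaim fails before and after. A minimal sketch (the sample log and /tmp path are only for illustration; on a worker node, point grep at /var/log/messages):

```shell
# Count NFS lock-reclaim failures; run before and after the fix.
cat <<'EOF' > /tmp/messages.sample
Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Apr 28 11:25:45 worker02 kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Apr 28 11:25:46 worker02 kernel: eth0: link is up
EOF
grep -c 'Lock reclaim failed' /tmp/messages.sample   # -> 2
```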

    The settings below are for reference only; tune them to your servers' actual hardware:

    1. Adjust /etc/sysctl.conf

    cat /etc/sysctl.conf
    
    net.ipv4.ip_forward=1
    net.bridge.bridge-nf-call-iptables=1
    net.bridge.bridge-nf-call-ip6tables=1
    fs.aio-max-nr = 1048576
    fs.file-max = 76724600
    net.core.netdev_max_backlog = 10000
    net.core.rmem_default = 262144
    # The default setting of the socket receive buffer in bytes.
    net.core.rmem_max = 4194304
    # The maximum receive socket buffer size in bytes.
    net.core.wmem_default = 262144
    # The default setting (in bytes) of the socket send buffer.
    net.core.wmem_max = 4194304
    # The maximum send socket buffer size in bytes.
    net.core.somaxconn = 16384
    net.ipv4.tcp_max_syn_backlog = 16384
    net.ipv4.tcp_keepalive_intvl = 20
    net.ipv4.tcp_keepalive_probes = 3
    net.ipv4.tcp_keepalive_time = 60
    net.ipv4.tcp_mem = 8388608 12582912 16777216
    net.ipv4.tcp_fin_timeout = 5
    net.ipv4.tcp_synack_retries = 2
    net.ipv4.tcp_syncookies = 1
    # Enable SYN cookies: when the SYN backlog overflows, fall back to cookies,
    # which mitigates small-scale SYN flood attacks.
    net.ipv4.tcp_timestamps = 1
    # TCP timestamps; required for tcp_tw_reuse, helps reduce TIME_WAIT buildup.
    net.ipv4.tcp_tw_recycle = 0
    # 1 would enable fast recycling of TIME-WAIT sockets, but that breaks clients
    # behind NAT; keep it off on servers.
    net.ipv4.tcp_tw_reuse = 1
    # Allow TIME-WAIT sockets to be reused for new outbound TCP connections.
    net.ipv4.tcp_max_tw_buckets = 262144
    net.ipv4.tcp_rmem = 8192 87380 16777216
    net.ipv4.tcp_wmem = 8192 65536 16777216
    net.nf_conntrack_max = 1200000
    net.netfilter.nf_conntrack_max = 1200000
    vm.dirty_background_bytes = 409600000
    # Once dirty pages reach this many bytes, the background flusher threads
    # start writing pages older than dirty_expire_centisecs back to disk.
    vm.dirty_expire_centisecs = 3000
    # Dirty pages older than this are flushed to disk; 3000 means 30 seconds.
    vm.dirty_ratio = 95
    # If background flushing falls behind and dirty pages exceed 95% of memory,
    # writing processes (fsync, fdatasync, ...) must flush dirty pages themselves.
    # A high value keeps user processes out of synchronous flushing, which helps
    # on multi-instance hosts where per-instance IOPS is capped via cgroups.
    vm.dirty_writeback_centisecs = 100
    # Wakeup interval of the background flusher threads; 100 means 1 second.
    vm.mmap_min_addr = 65536
    vm.overcommit_memory = 0
    # Heuristic overcommit: allow moderate over-allocation. A value of 1 means
    # "always assume enough memory", usable on low-memory test machines.
    vm.overcommit_ratio = 90
    # Only used when overcommit_memory = 2, to compute the commit limit.
    vm.swappiness = 0
    # Avoid swapping as much as possible (this does not disable swap outright).
    vm.zone_reclaim_mode = 0
    # Disable NUMA zone reclaim (prefer allocating from remote nodes over
    # aggressively reclaiming local memory).
    net.ipv4.ip_local_port_range = 40000 65535
    # Range of local ports auto-assigned to outgoing TCP and UDP connections.
    fs.nr_open = 20480000
    # Upper bound on file descriptors a single process may open.

    2. Run sysctl -p so the parameters take effect immediately

    sysctl -p
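    Individual values can be read back without root to confirm they applied. A quick sketch, reading /proc/sys directly (which is exactly what `sysctl -n net.core.somaxconn` does under the hood):

```shell
# Spot-check a few of the values set above.
cat /proc/sys/net/core/somaxconn
cat /proc/sys/fs/file-max
```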

    3. Raise the maximum number of open files: vi /etc/security/limits.conf

    * soft nofile 1024000
    * hard nofile 1024000
    * soft nproc unlimited
    * hard nproc unlimited
    * soft core unlimited
    * hard core unlimited
    * soft memlock unlimited
    * hard memlock unlimited
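    Note that limits.conf only applies to new login sessions, and systemd services such as nfs-server take their limits from the unit file (LimitNOFILE=), not from limits.conf. A quick check from a fresh shell:

```shell
# Verify the effective limits for the current session.
ulimit -n   # max open files (nofile)
ulimit -u   # max user processes (nproc)
```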

    4. Restart the nfs-server service on the NFS server

    systemctl restart nfs-server.service
    

      

  • Original post: https://www.cnblogs.com/caidingyu/p/12160105.html