• Notes on an ansible command hanging inside a container


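    1.Background

      While working with Kylin V10 recently, we found that running ansible localhost -m setup inside a container on this system hangs indefinitely. A colleague and I worked through the investigation below and eventually found a fix.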

    2.Environment

    2.1.System information

    # cat /etc/*-release
    Kylin Linux Advanced Server release V10 (Tercel)
    NAME="Kylin Linux Advanced Server"
    VERSION="V10 (Tercel)"
    ID="kylin"
    VERSION_ID="V10"
    PRETTY_NAME="Kylin Linux Advanced Server V10 (Tercel)"
    ANSI_COLOR="0;31"
    
    Kylin Linux Advanced Server release V10 (Tercel)  

    2.2.Kernel information

    # uname -a
    Linux reg.wps.lan 4.19.90-17.ky10.aarch64 #1 SMP Sun Jun 28 14:27:40 CST 2020 aarch64 aarch64 aarch64 GNU/Linux

    2.3.Docker information

    # docker info
    Containers: 1
     Running: 1
     Paused: 0
     Stopped: 0
    Images: 1
    Server Version: 18.09.9
    Storage Driver: overlay2
     Backing Filesystem: xfs
     Supports d_type: true
     Native Overlay Diff: true
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    

    2.4.Ansible information

    # ansible --version
    ansible 2.6.2
      config file = None
      configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
      ansible python module location = /usr/lib/python2.7/site-packages/ansible
      executable location = /usr/bin/ansible
      python version = 2.7.16 (default, Jul  9 2020, 06:35:45) [GCC 7.3.0]
    

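    3.Investigation

      During troubleshooting we found that ansible localhost -m setup hangs, but replacing localhost with an inventory file that specifies a custom IP plus a username and password makes the command run normally.

      So we set export ANSIBLE_DEBUG=True to get debug logging.

      The output showed the run getting stuck here: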

        82 1606185861.10586: transferring module to remote /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/AnsiballZ_setup.py
        82 1606185861.10840: done transferring module to remote
        82 1606185861.10894: _low_level_execute_command(): starting
        82 1606185861.10924: _low_level_execute_command(): executing: /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/ /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/AnsiballZ_setup.py && sleep 0'
        82 1606185861.10940: in local.exec_command()
        82 1606185861.10957: opening command with Popen()
        82 1606185861.11488: done running command with Popen()
        82 1606185861.11523: getting output with communicate()
        82 1606185861.11918: done communicating
        82 1606185861.11936: done with local.exec_command()
        82 1606185861.11961: _low_level_execute_command() done: rc=0, stdout=, stderr=
        82 1606185861.11977: _low_level_execute_command(): starting
        82 1606185861.12019: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/AnsiballZ_setup.py && sleep 0'
        82 1606185861.12038: in local.exec_command()
        82 1606185861.12055: opening command with Popen()
        82 1606185861.12599: done running command with Popen()
        82 1606185861.12631: getting output with communicate()
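
      When the debug output is long, the step that never finished can be picked out mechanically. The following is a minimal sketch, assuming the line layout shown above ("pid timestamp: message"); the function name find_pending_step and the regular expression are illustrative and not part of Ansible.

```python
import re

# Assumed layout of an ANSIBLE_DEBUG line: "<pid> <timestamp>: <message>"
LINE_RE = re.compile(r"^\s*(\d+)\s+(\d+\.\d+):\s+(.*)$")


def find_pending_step(log_text):
    """Return details of a communicate() call that never completed, or None."""
    pending = None   # (pid, timestamp) of a communicate() still in flight
    command = None   # most recent command handed to /bin/sh
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw)
        if not match:
            continue
        pid, stamp, message = match.groups()
        if message.startswith("_low_level_execute_command(): executing:"):
            command = message.split("executing:", 1)[1].strip()
        elif message.startswith("getting output with communicate()"):
            pending = (int(pid), float(stamp))
        elif message.startswith("done communicating"):
            pending = None
    if pending is None:
        return None
    return {"pid": pending[0], "since": pending[1], "command": command}
```

      Applied to the log above, this reports the second command (the one that runs AnsiballZ_setup.py) as the step still waiting in communicate().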
    

      So we logged in to the physical host to look at the ansible processes:

    # ps -ef |grep ansible
    root      672540  672016 99 10:44 pts/0    00:03:06 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606185860.41-269842916667107/AnsiballZ_setup.py
    root      673881  672428 51 10:47 pts/0    00:00:02 /usr/bin/python /usr/local/bin/ansible localhost -m setup
    root      673893  673881 33 10:47 pts/0    00:00:00 /usr/bin/python /usr/local/bin/ansible localhost -m setup
    root      673908  673893  0 10:47 pts/0    00:00:00 /bin/sh -c /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606186046.03-129145088760493/AnsiballZ_setup.py && sleep 0'
    root      673909  673908  0 10:47 pts/0    00:00:00 /bin/sh -c /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606186046.03-129145088760493/AnsiballZ_setup.py && sleep 0
    root      673910  673909 23 10:47 pts/0    00:00:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606186046.03-129145088760493/AnsiballZ_setup.py
    root      673914  673910 99 10:47 pts/0    00:00:01 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606186046.03-129145088760493/AnsiballZ_setup.py
    root      673971  443741  0 10:47 pts/1    00:00:00 grep ansible
    

      Next we attached strace to process 673914:

    # strace -p 673914
    close(216995106)                        = -1 EBADF (Bad file descriptor)
    close(216995107)                        = -1 EBADF (Bad file descriptor)
    close(216995108)                        = -1 EBADF (Bad file descriptor)
    close(216995109)                        = -1 EBADF (Bad file descriptor)
    close(216995110)                        = -1 EBADF (Bad file descriptor)
    close(216995111)                        = -1 EBADF (Bad file descriptor)
    close(216995112)                        = -1 EBADF (Bad file descriptor)
    close(216995113)                        = -1 EBADF (Bad file descriptor)
    close(216995114)                        = -1 EBADF (Bad file descriptor)
    close(216995115)                        = -1 EBADF (Bad file descriptor)
    close(216995116)                        = -1 EBADF (Bad file descriptor)
    close(216995117)                        = -1 EBADF (Bad file descriptor)
    close(216995118)                        = -1 EBADF (Bad file descriptor)
    close(216995119)                        = -1 EBADF (Bad file descriptor)
    close(216995120)                        = -1 EBADF (Bad file descriptor)
    close(216995121)                        = -1 EBADF (Bad file descriptor)
    close(216995122)                        = -1 EBADF (Bad file descriptor)
    close(216995123)                        = -1 EBADF (Bad file descriptor)
    close(216995124)                        = -1 EBADF (Bad file descriptor)
    close(216995125)                        = -1 EBADF (Bad file descriptor)
    close(216995126)                        = -1 EBADF (Bad file descriptor)
    close(216995127)                        = -1 EBADF (Bad file descriptor)
    close(216995128)                        = -1 EBADF (Bad file descriptor)
    close(216995129)                        = -1 EBADF (Bad file descriptor)
    close(216995130)                        = -1 EBADF (Bad file descriptor)
    close(216995131)                        = -1 EBADF (Bad file descriptor)
    close(216995132)                        = -1 EBADF (Bad file descriptor)
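
      To get a feel for why a sweep like this can look like a hang, the sketch below times close() on a small range of unopened descriptors and extrapolates to a much larger limit. It is a rough illustration written for Python 3, not a benchmark. The function names are made up for this example, and the starting descriptor number is simply taken from the trace above, on the assumption that nothing is open that high.

```python
import errno
import os
import time


def time_close_sweep(start, count):
    """Call close() on `count` consecutive descriptor numbers and time it.

    Returns (elapsed_seconds, number_of_EBADF_results). `start` should be far
    above any descriptor the current process really has open.
    """
    ebadf = 0
    begin = time.perf_counter()
    for fd in range(start, start + count):
        try:
            os.close(fd)
        except OSError as exc:
            if exc.errno != errno.EBADF:
                raise
            ebadf += 1
    return time.perf_counter() - begin, ebadf


def estimate_full_sweep(elapsed, count, limit):
    """Scale a measured sweep linearly to `limit` descriptors (a crude estimate)."""
    return elapsed / count * limit


if __name__ == "__main__":
    elapsed, ebadf = time_close_sweep(216995106, 100000)
    print("EBADF results:", ebadf)
    print("estimated seconds for 1073741816 descriptors:",
          estimate_full_sweep(elapsed, 100000, 1073741816))
```

      Running the same loop under strace makes each call far slower, so the estimate from an untraced run is only a lower bound on what the traced process would take.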
    

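      The terminal kept scrolling lines like the ones above. At first glance this looked like a file descriptor leak, although the steadily increasing descriptor numbers, none of them open, suggest the process was instead calling close() on every possible descriptor up to its limit. Searching for "docker Bad file descriptor" led to the issue "Spawning PTY processes is many times slower on Docker 18.09", where several people had traced the slowdown to the container's nofile limit being too high. Starting the container with a lower nofile value avoids the problem.

      Running ulimit -n inside the container confirmed that the default value is indeed very high: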

    > ulimit -n
    1073741816
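
      The same check can be made from Python, which is also where the limit matters here. The sketch below reads the limit through the standard resource module and through os.sysconf("SC_OPEN_MAX"). As far as I know, Python 2.7's subprocess module uses the latter as the upper bound when closing descriptors with close_fds set, but treat that as an assumption rather than something verified in this investigation. The threshold of 1048576 is an arbitrary illustrative value.

```python
import os
import resource


def nofile_report(threshold=1048576):
    """Collect the open-file limits visible to this process.

    `threshold` is a hypothetical cut-off above which a sweep over all
    descriptor numbers is likely to be painfully slow.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_max = os.sysconf("SC_OPEN_MAX")
    too_high = soft == resource.RLIM_INFINITY or soft > threshold
    return {
        "soft": soft,
        "hard": hard,
        "sc_open_max": open_max,
        "suspicious": too_high,
    }


if __name__ == "__main__":
    for key, value in sorted(nofile_report().items()):
        print(key, value)
```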

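      A further search for "docker nofile limit" turned up "Docker: How to increase number of open files limit", which explains that the nofile value inside the container can be set when starting it with docker run.

      So we added --ulimit nofile=65535 and started the container again. ulimit -n inside the container now reported the smaller value, and ansible localhost -m setup no longer hung.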
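
      If the container cannot be recreated with a different --ulimit, another option is to lower the soft limit only for the command being launched. This is not what we did; it is a sketch of an alternative, written for Python 3 and using only the standard resource and subprocess modules. The helper names and the 65535 cap (borrowed from the docker option above) are illustrative.

```python
import resource
import subprocess
import sys


def _cap_nofile(cap):
    """Lower the soft RLIMIT_NOFILE to `cap` if it is currently higher."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft > cap:
        # Lowering the soft limit needs no privileges; the hard limit is kept.
        resource.setrlimit(resource.RLIMIT_NOFILE, (cap, hard))


def run_with_capped_nofile(cmd, cap=65535):
    """Run `cmd` with a capped soft nofile limit, leaving the parent untouched."""
    return subprocess.run(
        cmd,
        preexec_fn=lambda: _cap_nofile(cap),  # runs in the child before exec
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


if __name__ == "__main__":
    # Print the soft limit as seen by a child process.
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"
    print(run_with_capped_nofile([sys.executable, "-c", probe]).stdout.strip())
```

      In this scenario the call would look like run_with_capped_nofile(["ansible", "localhost", "-m", "setup"]). The child and everything it spawns inherit the lower limit.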

    4.References

      https://github.com/pexpect/ptyprocess/issues/50
      https://github.com/docker/for-linux/issues/502
      https://github.com/moby/moby/issues/38814

  • Original article: https://www.cnblogs.com/yaohong/p/14029143.html