• Why did a KVM compute node reboot for no apparent reason?


    1. Problem 1: the machine hangs

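    A local cloudstack compute node suddenly became unreachable, and cloudstack itself went down too. It turned out that a system VM was running on that compute node, which is why cs went down. When I got to the machine, I found it had hung. After a reboot it stalled at starting udev, and waiting a long time did not help. Booting into single-user mode gave the same result, and many more reboots all stalled at the same point. Eventually I found an article that said: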

    Remove quiet from the kernel command line and you should get enough output to see the cause of the hang.

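    So I went into the boot menu and edited the kernel line. quiet is the silent mode, roughly equivalent to setting "loglevel=4" (WARNING), so only messages more severe than a warning reach the console. After deleting quiet, I waited for the system to come up. About an hour later it finished booting and ssh worked again. That was one problem solved; next I had to find out why the machine had rebooted.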
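
    To make the edit concrete, here is a minimal sketch of what "removing quiet" means for a kernel command line. The helper name strip_quiet and the sample command line are made up for illustration. It only transforms a string and does not touch any bootloader configuration.

```shell
#!/bin/bash
# Hypothetical helper: drop the "quiet" token from a kernel command line string.
# It works on text only; nothing is written to the bootloader configuration.
strip_quiet() {
  local cmdline="$1" out="" tok
  # Unquoted expansion splits the command line on whitespace into tokens.
  for tok in $cmdline; do
    [ "$tok" = "quiet" ] && continue
    out="${out:+$out }$tok"
  done
  printf '%s\n' "$out"
}

# Made-up command line, for illustration only.
strip_quiet "ro root=/dev/mapper/vg-root rhgb quiet crashkernel=auto"
```

    In the boot menu the same change is made by hand for one boot, which is enough to see where the boot process stalls.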

    2. Problem 2: why did the KVM compute node reboot for no apparent reason

    I started with the cloudstack-agent log.

    [root@kvm204 ~]# tail -1000f /var/log/cloudstack/agent/cloudstack-agent.out 

    The log showed an alert, followed by a reboot command being executed. Scrolling further up, I found another message.

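    There were many errors like this one. The last of them is probably what made the kvm check script run the reboot command, which then produced the "reboot the host" entry in the first log screenshot of this post.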
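
    Instead of paging through tail output, a search along these lines could narrow the log down faster. This is only a sketch: the helper name find_reboot_hints and the search terms are assumptions, so adjust the pattern to the messages your agent actually writes.

```shell
#!/bin/bash
# Hypothetical helper: print numbered log lines that mention a reboot or the heartbeat.
# The search terms are guesses for illustration, not known agent message formats.
find_reboot_hints() {
  local logfile="$1"
  grep -n -i -E 'reboot|heartbeat' "$logfile"
}

# Run against the agent log from this post, but only if it is readable.
log=/var/log/cloudstack/agent/cloudstack-agent.out
if [ -r "$log" ]; then
  find_reboot_hints "$log" | tail -n 20
fi
```

    The line numbers printed by grep -n make it easy to jump back to the surrounding context in the full log.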

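    So now I knew what had rebooted the machine, but why would kvm reboot this compute node? With that question in mind, I looked at the script involved, kvmheartbeat.sh. Its contents: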

    #!/bin/bash
    # Licensed to the Apache Software Foundation (ASF) under one
    # or more contributor license agreements.  See the NOTICE file
    # distributed with this work for additional information
    # regarding copyright ownership.  The ASF licenses this file
    # to you under the Apache License, Version 2.0 (the
    # "License"); you may not use this file except in compliance
    # with the License.  You may obtain a copy of the License at
    # 
    #   http://www.apache.org/licenses/LICENSE-2.0
    # 
    # Unless required by applicable law or agreed to in writing,
    # software distributed under the License is distributed on an
    # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    # KIND, either express or implied.  See the License for the
    # specific language governing permissions and limitations
    # under the License.
    
    help() {
      printf "Usage: $0 
                        -i nfs server ip 
                        -p nfs server path
                        -m mount point 
                        -h host  
                        -r write/read hb log 
                        -c cleanup
                        -t interval between read hb log
    "
      exit 1
    }
    #set -x
    NfsSvrIP=
    NfsSvrPath=
    MountPoint=
    HostIP=
    interval=
    rflag=0
    cflag=0
    
    while getopts 'i:p:m:h:t:rc' OPTION
    do
      case $OPTION in
      i)
         NfsSvrIP="$OPTARG"
         ;;
      p)
         NfsSvrPath="$OPTARG"
         ;;
      m)
         MountPoint="$OPTARG"
         ;;
      h)
         HostIP="$OPTARG"
         ;;
      r)
         rflag=1
         ;;
      t)
         interval="$OPTARG"
         ;;
      c)
        cflag=1
         ;;
      *)
         help
         ;;
      esac
    done
    
    if [ -z "$NfsSvrIP" ]
    then
       exit 1
    fi
    
    
    #delete VMs on this mountpoint
    deleteVMs() {
      local mountPoint=$1
      vmPids=$(ps aux| grep qemu | grep "$mountPoint" | awk '{print $2}' 2> /dev/null)
      if [ $? -gt 0 ]
      then
         return
      fi
    
      if [ -z "$vmPids" ]
      then
         return
      fi
    
      for pid in $vmPids
      do
         kill -9 $pid &> /dev/null
      done
    }
    
    #checking is there the same nfs server mounted under $MountPoint?
    mounts=$(cat /proc/mounts |grep nfs|grep $MountPoint)
    if [ $? -gt 0 ]
    then
       # remount it
       mount $NfsSvrIP:$NfsSvrPath $MountPoint -o sync,soft,proto=tcp,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,noac,timeo=133,retrans=10 &> /dev/null
       if [ $? -gt 0 ]
       then
          printf "Failed to remount $NfsSvrIP:$NfsSvrPath under $MountPoint"
          exit 1
       fi
       if [ "$rflag" == "0" ]
       then
         deleteVMs $MountPoint
       fi
    fi
    
    hbFolder=$MountPoint/KVMHA/
    hbFile=$hbFolder/hb-$HostIP
    
    write_hbLog() {
    #write the heart beat log
      stat $hbFile &> /dev/null
      if [ $? -gt 0 ]
      then
         # create a new one
         mkdir -p $hbFolder &> /dev/null
         touch $hbFile &> /dev/null
         if [ $? -gt 0 ]
         then
            printf "Failed to create $hbFile"
            return 2
         fi
      fi
    
      timestamp=$(date +%s)
      echo $timestamp > $hbFile
      return $?
    }
    
    check_hbLog() {
      now=$(date +%s)
      hb=$(cat $hbFile)
      diff=`expr $now - $hb`
      if [ $diff -gt $interval ]
      then
        return 1
      fi
      return 0
    }
    
    if [ "$rflag" == "1" ]
    then
      check_hbLog
      if [ $? == 0 ]
      then
        echo "=====> ALIVE <====="
      else
        echo "=====> DEAD <======"
      fi
      exit 0
    elif [ "$cflag" == "1" ]
    then
      reboot
      exit $?
    else
      write_hbLog
      exit $?
    fi

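    The script above is the version currently on our compute nodes. When it is called with -c, it runs the reboot command, which explains the reboot. But this is infuriating: a check script should check, not reboot the host...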
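
    In the script above, -r only reports ALIVE or DEAD, and the reboot happens only when the caller passes -c. The ALIVE/DEAD decision in check_hbLog boils down to the comparison sketched here. The function name hb_state is made up, and the clock and the stored timestamp are passed as arguments so the sketch runs without an NFS mount.

```shell
#!/bin/bash
# Sketch of the staleness test in check_hbLog.
# now, hb and interval are plain arguments here instead of date +%s and the hb file.
hb_state() {
  local now="$1" hb="$2" interval="$3"
  # The heartbeat counts as dead only when it is older than the interval.
  if [ $(( now - hb )) -gt "$interval" ]; then
    echo "DEAD"
  else
    echo "ALIVE"
  fi
}

hb_state 1000 950 60   # heartbeat is 50 s old, within the 60 s interval
hb_state 1000 900 60   # heartbeat is 100 s old, stale
```

    A heartbeat that is exactly interval seconds old still counts as alive, because the script compares with -gt.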

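    The script below is the one I found on github. It is a newer version, updated about 8 months ago, and is probably the latest. As it shows, that infuriating reboot command has been changed: the -c branch now writes a log entry first and then reboots through sysrq.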

    #!/bin/bash
    # Licensed to the Apache Software Foundation (ASF) under one
    # or more contributor license agreements.  See the NOTICE file
    # distributed with this work for additional information
    # regarding copyright ownership.  The ASF licenses this file
    # to you under the Apache License, Version 2.0 (the
    # "License"); you may not use this file except in compliance
    # with the License.  You may obtain a copy of the License at
    # 
    #   http://www.apache.org/licenses/LICENSE-2.0
    # 
    # Unless required by applicable law or agreed to in writing,
    # software distributed under the License is distributed on an
    # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    # KIND, either express or implied.  See the License for the
    # specific language governing permissions and limitations
    # under the License.
    
    help() {
      printf "Usage: $0 
                        -i nfs server ip 
                        -p nfs server path
                        -m mount point 
                        -h host  
                        -r write/read hb log 
                        -c cleanup
                        -t interval between read hb log
    "
      exit 1
    }
    #set -x
    NfsSvrIP=
    NfsSvrPath=
    MountPoint=
    HostIP=
    interval=
    rflag=0
    cflag=0
    
    while getopts 'i:p:m:h:t:rc' OPTION
    do
      case $OPTION in
      i)
         NfsSvrIP="$OPTARG"
         ;;
      p)
         NfsSvrPath="$OPTARG"
         ;;
      m)
         MountPoint="$OPTARG"
         ;;
      h)
         HostIP="$OPTARG"
         ;;
      r)
         rflag=1 
         ;;
      t)
         interval="$OPTARG"
         ;;
      c)
        cflag=1
         ;;
      *)
         help
         ;;
      esac
    done
    
    if [ -z "$NfsSvrIP" ]
    then
       exit 1
    fi
    
    
    #delete VMs on this mountpoint
    deleteVMs() {
      local mountPoint=$1
      vmPids=$(ps aux| grep qemu | grep "$mountPoint" | awk '{print $2}' 2> /dev/null) 
      if [ $? -gt 0 ]
      then
         return
      fi
    
      if [ -z "$vmPids" ]
      then
         return
      fi
    
      for pid in $vmPids
      do
         kill -9 $pid &> /dev/null
      done
    }
    
    #checking is there the same nfs server mounted under $MountPoint?
    mounts=$(cat /proc/mounts |grep nfs|grep $MountPoint)
    if [ $? -gt 0 ]
    then
       # remount it
       mount $NfsSvrIP:$NfsSvrPath $MountPoint -o sync,soft,proto=tcp,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,noac,timeo=133,retrans=10 &> /dev/null
       if [ $? -gt 0 ]
       then
          printf "Failed to remount $NfsSvrIP:$NfsSvrPath under $MountPoint" 
          exit 1
       fi
       if [ "$rflag" == "0" ]
       then
         deleteVMs $MountPoint
       fi
    fi
    
    hbFolder=$MountPoint/KVMHA/
    hbFile=$hbFolder/hb-$HostIP
    
    write_hbLog() {
    #write the heart beat log
      stat $hbFile &> /dev/null
      if [ $? -gt 0 ]
      then
         # create a new one
         mkdir -p $hbFolder &> /dev/null
         touch $hbFile &> /dev/null
         if [ $? -gt 0 ]
         then
         printf "Failed to create $hbFile"
            return 2
         fi
      fi
    
      timestamp=$(date +%s)
      echo $timestamp > $hbFile
      return $?
    }
    
    check_hbLog() {
      now=$(date +%s)
      hb=$(cat $hbFile)
      diff=`expr $now - $hb`
      if [ $diff -gt $interval ]
      then
        return 1
      fi
      return 0
    }
    
    if [ "$rflag" == "1" ]
    then
      check_hbLog
      if [ $? == 0 ]
      then
        echo "=====> ALIVE <====="
      else
        echo "=====> DEAD <======"
      fi
      exit 0
    elif [ "$cflag" == "1" ]
    then
      /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was unable to write the heartbeat to the storage."
      sync &
      sleep 5
      echo b > /proc/sysrq-trigger
      exit $?
    else
      write_hbLog 
      exit $?
    fi
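
    The only difference between the two versions is the -c branch. The newer one logs a message through logger, starts sync in the background, waits 5 seconds, and then writes b to /proc/sysrq-trigger, which reboots the host immediately without a clean shutdown. To see which variant a node carries, a check along these lines could help. The helper name reboot_style is made up, and the script path is left as a placeholder because it depends on the installation.

```shell
#!/bin/bash
# Hypothetical helper: report which reboot mechanism a heartbeat script contains.
reboot_style() {
  local script="$1"
  if grep -q 'sysrq-trigger' "$script"; then
    echo "sysrq"      # newer variant: echo b > /proc/sysrq-trigger
  elif grep -q -E '^[[:space:]]*reboot[[:space:]]*$' "$script"; then
    echo "reboot"     # older variant: a bare reboot command on its own line
  else
    echo "none"
  fi
}

# Usage (the path is a placeholder, not taken from the post):
#   reboot_style /path/to/kvmheartbeat.sh
```

    Either way the host still goes down when the script is called with -c. The newer variant leaves a syslog entry tagged heartbeat, which makes the cause easier to find afterwards.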
  • Original article: https://www.cnblogs.com/hanyifeng/p/5019158.html