为什么KVM计算机点无故重启？

一.故障1：机器hangs

本地一台cloudstack计算节点无故连不上了，cloudstack也坏了，后查看有一台系统虚拟机在这台计算节点上，导致cs挂了。去找到这台机器后，发现这台机器卡住了，重启后卡在starting udev，等好久也不行，即使进入单用户也是一样，重启很多次也是卡在这。最后查到一篇文章，原话：

Remove quiet from the kernel command line and you should get enough output to see the cause of the hang.

然后进入系统菜单，在编辑kernel行，quiet静默模式。相当于设置"loglevel=4"(WARNING)。将quiet删除后，等待进入系统。大概一个小时后，进去系统了，能ssh了，问题解决了一个，得查一下机器为什么重启。

二.故障2:kvm计算机点为什么无故重启

通过查看cloudstack-agent的日志。

[root@kvm204 ~]# tail -1000f /var/log/cloudstack/agent/cloudstack-agent.out

看到了这里有报警，并且执行了reboot命令。继续往上面翻，又看到一条信息，如下：

看到很多这样的报错，但最后一次报错是下面这个，可能是它导致了kvm检测脚本执行了reboot命令，然后就出现了本文第一张图片日志里reboot the host。

现在知道了是谁重启了电脑，但kvm为什么会重启这台计算节点呢？带着疑问，我去查看了一下那个脚本，也就是kvmheartbeat.sh脚本。内容如下：

#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
# 
#   http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

help() {
  printf "Usage: $0 
                    -i nfs server ip 
                    -p nfs server path
                    -m mount point 
                    -h host  
                    -r write/read hb log 
                    -c cleanup
                    -t interval between read hb log
"
  exit 1
}
#set -x
NfsSvrIP=
NfsSvrPath=
MountPoint=
HostIP=
interval=
rflag=0
cflag=0

while getopts 'i:p:m:h:t:rc' OPTION
do
  case $OPTION in
  i)
     NfsSvrIP="$OPTARG"
     ;;
  p)
     NfsSvrPath="$OPTARG"
     ;;
  m)
     MountPoint="$OPTARG"
     ;;
  h)
     HostIP="$OPTARG"
     ;;
  r)
     rflag=1
     ;;
  t)
     interval="$OPTARG"
     ;;
  c)
    cflag=1
     ;;
  *)
     help
     ;;
  esac
done

if [ -z "$NfsSvrIP" ]
then
   exit 1
fi


#delete VMs on this mountpoint
deleteVMs() {
  local mountPoint=$1
  vmPids=$(ps aux| grep qemu | grep "$mountPoint" | awk '{print $2}' 2> /dev/null)
  if [ $? -gt 0 ]
  then
     return
  fi

  if [ -z "$vmPids" ]
  then
     return
  fi

  for pid in $vmPids
  do
     kill -9 $pid &> /dev/null
  done
}

#checking is there the same nfs server mounted under $MountPoint?
mounts=$(cat /proc/mounts |grep nfs|grep $MountPoint)
if [ $? -gt 0 ]
then
   # remount it
   mount $NfsSvrIP:$NfsSvrPath $MountPoint -o sync,soft,proto=tcp,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,noac,timeo=133,retrans=10 &> /dev/null
   if [ $? -gt 0 ]
   then
      printf "Failed to remount $NfsSvrIP:$NfsSvrPath under $MountPoint"
      exit 1
   fi
   if [ "$rflag" == "0" ]
   then
     deleteVMs $MountPoint
   fi
fi

hbFolder=$MountPoint/KVMHA/
hbFile=$hbFolder/hb-$HostIP

write_hbLog() {
#write the heart beat log
  stat $hbFile &> /dev/null
  if [ $? -gt 0 ]
  then
     # create a new one
     mkdir -p $hbFolder &> /dev/null
     touch $hbFile &> /dev/null
     if [ $? -gt 0 ]
     then
        printf "Failed to create $hbFile"
        return 2
     fi
  fi

  timestamp=$(date +%s)
  echo $timestamp > $hbFile
  return $?
}

check_hbLog() {
  now=$(date +%s)
  hb=$(cat $hbFile)
  diff=`expr $now - $hb`
  if [ $diff -gt $interval ]
  then
    return 1
  fi
  return 0
}

if [ "$rflag" == "1" ]
then
  check_hbLog
  if [ $? == 0 ]
  then
    echo "=====> ALIVE <====="
  else
    echo "=====> DEAD <======"
  fi
  exit 0
elif [ "$cflag" == "1" ]
then
  reboot
  exit $?
else
  write_hbLog
  exit $?
fi

上面这个脚本是我们现在的计算节点上的版本。从上面可以看出，有判断reboot命令，也就是知道了为什么会重启。但这很坑爹呀，你检测检测，你还重启。。。。

下面这个脚本是我在github上看到的。而且版本已经更新。距离现在8月。应该是最新的，从下面看出，这个坑爹的reboot命令已经被修改了。

#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
# 
#   http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

help() {
  printf "Usage: $0 
                    -i nfs server ip 
                    -p nfs server path
                    -m mount point 
                    -h host  
                    -r write/read hb log 
                    -c cleanup
                    -t interval between read hb log
"
  exit 1
}
#set -x
NfsSvrIP=
NfsSvrPath=
MountPoint=
HostIP=
interval=
rflag=0
cflag=0

while getopts 'i:p:m:h:t:rc' OPTION
do
  case $OPTION in
  i)
     NfsSvrIP="$OPTARG"
     ;;
  p)
     NfsSvrPath="$OPTARG"
     ;;
  m)
     MountPoint="$OPTARG"
     ;;
  h)
     HostIP="$OPTARG"
     ;;
  r)
     rflag=1 
     ;;
  t)
     interval="$OPTARG"
     ;;
  c)
    cflag=1
     ;;
  *)
     help
     ;;
  esac
done

if [ -z "$NfsSvrIP" ]
then
   exit 1
fi


#delete VMs on this mountpoint
deleteVMs() {
  local mountPoint=$1
  vmPids=$(ps aux| grep qemu | grep "$mountPoint" | awk '{print $2}' 2> /dev/null) 
  if [ $? -gt 0 ]
  then
     return
  fi

  if [ -z "$vmPids" ]
  then
     return
  fi

  for pid in $vmPids
  do
     kill -9 $pid &> /dev/null
  done
}

#checking is there the same nfs server mounted under $MountPoint?
mounts=$(cat /proc/mounts |grep nfs|grep $MountPoint)
if [ $? -gt 0 ]
then
   # remount it
   mount $NfsSvrIP:$NfsSvrPath $MountPoint -o sync,soft,proto=tcp,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,noac,timeo=133,retrans=10 &> /dev/null
   if [ $? -gt 0 ]
   then
      printf "Failed to remount $NfsSvrIP:$NfsSvrPath under $MountPoint" 
      exit 1
   fi
   if [ "$rflag" == "0" ]
   then
     deleteVMs $MountPoint
   fi
fi

hbFolder=$MountPoint/KVMHA/
hbFile=$hbFolder/hb-$HostIP

write_hbLog() {
#write the heart beat log
  stat $hbFile &> /dev/null
  if [ $? -gt 0 ]
  then
     # create a new one
     mkdir -p $hbFolder &> /dev/null
     touch $hbFile &> /dev/null
     if [ $? -gt 0 ]
     then
     printf "Failed to create $hbFile"
        return 2
     fi
  fi

  timestamp=$(date +%s)
  echo $timestamp > $hbFile
  return $?
}

check_hbLog() {
  now=$(date +%s)
  hb=$(cat $hbFile)
  diff=`expr $now - $hb`
  if [ $diff -gt $interval ]
  then
    return 1
  fi
  return 0
}

if [ "$rflag" == "1" ]
then
  check_hbLog
  if [ $? == 0 ]
  then
    echo "=====> ALIVE <====="
  else
    echo "=====> DEAD <======"
  fi
  exit 0
elif [ "$cflag" == "1" ]
then
  /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was unable to write the heartbeat to the storage."
  sync &
  sleep 5
  echo b > /proc/sysrq-trigger
  exit $?
else
  write_hbLog 
  exit $?
fi

相关阅读:
一起学爬虫（Python） — 02
模块（第1章）实验——编译问题（没有规则可以创建目标“modules”）
linux 启动时，sendmail 长时间等待
 红帽发布首个RHEL 7测试版本
 AMD：引入ARM将是自64位以来的最大变革
 (OK) Windows XP 硬盘安装 RHEL7/CentOS7/Fedora19/Fedora20
Windows 7 硬盘安装 RHEL7/CentOS7/Fedora19/Fedora20
linux-0.11内核调试运行阅读环境的搭建及使用
 rhel 7—— /boot/grub2/grub.cfg
Linux环境下网络编程杂谈
原文地址：https://www.cnblogs.com/hanyifeng/p/5019158.html