一.故障1:机器hangs
本地一台cloudstack计算节点无故连不上了,cloudstack也坏了,后查看有一台系统虚拟机在这台计算节点上,导致cs挂了。去找到这台机器后,发现这台机器卡住了,重启后卡在starting udev,等好久也不行,即使进入单用户也是一样,重启很多次也是卡在这。最后查到一篇文章,原话:
Remove quiet
from the kernel command line and you should get enough output to see the cause of the hang.
然后进入系统菜单,在编辑kernel行,quiet静默模式。相当于设置"loglevel=4"(WARNING)。将quiet删除后,等待进入系统。大概一个小时后,进去系统了,能ssh了,问题解决了一个,得查一下机器为什么重启。
二.故障2:kvm计算机点为什么无故重启
通过查看cloudstack-agent的日志。
[root@kvm204 ~]# tail -1000f /var/log/cloudstack/agent/cloudstack-agent.out
看到了这里有报警,并且执行了reboot命令。继续往上面翻,又看到一条信息,如下:
看到很多这样的报错,但最后一次报错是下面这个,可能是它导致了kvm检测脚本执行了reboot命令,然后就出现了本文第一张图片日志里reboot the host。
现在知道了是谁重启了电脑,但kvm为什么会重启这台计算节点呢?带着疑问,我去查看了一下那个脚本,也就是kvmheartbeat.sh脚本。内容如下:
#!/bin/bash # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY # KIND, either express or implied. See the License for the # specific language governing permissions and limitations # under the License. help() { printf "Usage: $0 -i nfs server ip -p nfs server path -m mount point -h host -r write/read hb log -c cleanup -t interval between read hb log " exit 1 } #set -x NfsSvrIP= NfsSvrPath= MountPoint= HostIP= interval= rflag=0 cflag=0 while getopts 'i:p:m:h:t:rc' OPTION do case $OPTION in i) NfsSvrIP="$OPTARG" ;; p) NfsSvrPath="$OPTARG" ;; m) MountPoint="$OPTARG" ;; h) HostIP="$OPTARG" ;; r) rflag=1 ;; t) interval="$OPTARG" ;; c) cflag=1 ;; *) help ;; esac done if [ -z "$NfsSvrIP" ] then exit 1 fi #delete VMs on this mountpoint deleteVMs() { local mountPoint=$1 vmPids=$(ps aux| grep qemu | grep "$mountPoint" | awk '{print $2}' 2> /dev/null) if [ $? -gt 0 ] then return fi if [ -z "$vmPids" ] then return fi for pid in $vmPids do kill -9 $pid &> /dev/null done } #checking is there the same nfs server mounted under $MountPoint? mounts=$(cat /proc/mounts |grep nfs|grep $MountPoint) if [ $? -gt 0 ] then # remount it mount $NfsSvrIP:$NfsSvrPath $MountPoint -o sync,soft,proto=tcp,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,noac,timeo=133,retrans=10 &> /dev/null if [ $? -gt 0 ] then printf "Failed to remount $NfsSvrIP:$NfsSvrPath under $MountPoint" exit 1 fi if [ "$rflag" == "0" ] then deleteVMs $MountPoint fi fi hbFolder=$MountPoint/KVMHA/ hbFile=$hbFolder/hb-$HostIP write_hbLog() { #write the heart beat log stat $hbFile &> /dev/null if [ $? -gt 0 ] then # create a new one mkdir -p $hbFolder &> /dev/null touch $hbFile &> /dev/null if [ $? -gt 0 ] then printf "Failed to create $hbFile" return 2 fi fi timestamp=$(date +%s) echo $timestamp > $hbFile return $? } check_hbLog() { now=$(date +%s) hb=$(cat $hbFile) diff=`expr $now - $hb` if [ $diff -gt $interval ] then return 1 fi return 0 } if [ "$rflag" == "1" ] then check_hbLog if [ $? == 0 ] then echo "=====> ALIVE <=====" else echo "=====> DEAD <======" fi exit 0 elif [ "$cflag" == "1" ] then reboot exit $? else write_hbLog exit $? fi
上面这个脚本是我们现在的计算节点上的版本。从上面可以看出,有判断reboot命令,也就是知道了为什么会重启。但这很坑爹呀,你检测检测,你还重启。。。。
下面这个脚本是我在github上看到的。而且版本已经更新。距离现在8月。应该是最新的,从下面看出,这个坑爹的reboot命令已经被修改了。
#!/bin/bash # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY # KIND, either express or implied. See the License for the # specific language governing permissions and limitations # under the License. help() { printf "Usage: $0 -i nfs server ip -p nfs server path -m mount point -h host -r write/read hb log -c cleanup -t interval between read hb log " exit 1 } #set -x NfsSvrIP= NfsSvrPath= MountPoint= HostIP= interval= rflag=0 cflag=0 while getopts 'i:p:m:h:t:rc' OPTION do case $OPTION in i) NfsSvrIP="$OPTARG" ;; p) NfsSvrPath="$OPTARG" ;; m) MountPoint="$OPTARG" ;; h) HostIP="$OPTARG" ;; r) rflag=1 ;; t) interval="$OPTARG" ;; c) cflag=1 ;; *) help ;; esac done if [ -z "$NfsSvrIP" ] then exit 1 fi #delete VMs on this mountpoint deleteVMs() { local mountPoint=$1 vmPids=$(ps aux| grep qemu | grep "$mountPoint" | awk '{print $2}' 2> /dev/null) if [ $? -gt 0 ] then return fi if [ -z "$vmPids" ] then return fi for pid in $vmPids do kill -9 $pid &> /dev/null done } #checking is there the same nfs server mounted under $MountPoint? mounts=$(cat /proc/mounts |grep nfs|grep $MountPoint) if [ $? -gt 0 ] then # remount it mount $NfsSvrIP:$NfsSvrPath $MountPoint -o sync,soft,proto=tcp,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,noac,timeo=133,retrans=10 &> /dev/null if [ $? -gt 0 ] then printf "Failed to remount $NfsSvrIP:$NfsSvrPath under $MountPoint" exit 1 fi if [ "$rflag" == "0" ] then deleteVMs $MountPoint fi fi hbFolder=$MountPoint/KVMHA/ hbFile=$hbFolder/hb-$HostIP write_hbLog() { #write the heart beat log stat $hbFile &> /dev/null if [ $? -gt 0 ] then # create a new one mkdir -p $hbFolder &> /dev/null touch $hbFile &> /dev/null if [ $? -gt 0 ] then printf "Failed to create $hbFile" return 2 fi fi timestamp=$(date +%s) echo $timestamp > $hbFile return $? } check_hbLog() { now=$(date +%s) hb=$(cat $hbFile) diff=`expr $now - $hb` if [ $diff -gt $interval ] then return 1 fi return 0 } if [ "$rflag" == "1" ] then check_hbLog if [ $? == 0 ] then echo "=====> ALIVE <=====" else echo "=====> DEAD <======" fi exit 0 elif [ "$cflag" == "1" ] then /usr/bin/logger -t heartbeat "kvmheartbeat.sh rebooted system because it was unable to write the heartbeat to the storage." sync & sleep 5 echo b > /proc/sysrq-trigger exit $? else write_hbLog exit $? fi