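1. System Monitoring Overview
The script collects the following metrics: memory usage, CPU usage, currently logged-in users, disk mounts and disk space usage, and average inbound and outbound network traffic per second. For disk IO it records the average rate at which data is read from disk into memory and the average rate at which data is written from memory to disk.
2. How the Monitoring Works
2.1 CPU Usage
How it works:
CPU statistics are recorded in the file /proc/stat. The first line (cpu) aggregates all cores; each value is a cumulative time counter since boot, in the order user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice. CPU usage over an interval is the change in used time divided by the change in total time between two samples. For details, see this blog post: https://blog.csdn.net/ustclu/article/details/1721673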
    stephen@stephen-K55VD:~/shell$ cat /proc/stat
    cpu 348229 906 98356 7304276 81726 0 2821 0 0 0
    cpu0 95033 273 22980 1803962 33023 0 1721 0 0 0
    cpu1 79735 255 24756 1836717 17035 0 454 0 0 0
    cpu2 84045 211 25742 1831963 16753 0 582 0 0 0
    cpu3 89415 166 24876 1831633 14913 0 62 0 0 0
    intr 10306028 7 28486 0 0 0 0 0 0 1 825 0 0 50130 0 0 0 76 284421 0 213811 0 0 0 29 795993 19 0 81 766580 15 648 0 0 0 ... (the remaining per-interrupt counters are all 0)
    ctxt 51268973
    btime 1554444493
    processes 14526
    procs_running 1
    procs_blocked 0
    softirq 9059312 7 2712077 5 5478 204089 0 1245879 2780432 0 2111345
Implementation:
    # Sample total and used CPU time from the aggregate "cpu" line.
    # total = sum of all counters; used = user + nice + system + irq + softirq
    cpuTotalStart=$(awk '/^cpu /{for(i=2;i<=NF;i++) total+=$i} END{print total}' /proc/stat)
    cpuUsedStart=$(awk '/^cpu /{print $2+$3+$4+$7+$8}' /proc/stat)
    # Wait 30 s, sample again, and compute the differences
    sleep 30
    cpuTotalEnd=$(awk '/^cpu /{for(i=2;i<=NF;i++) total+=$i} END{print total}' /proc/stat)
    cpuUsedEnd=$(awk '/^cpu /{print $2+$3+$4+$7+$8}' /proc/stat)
    usedCPU=$((cpuUsedEnd - cpuUsedStart))
    totalCPU=$((cpuTotalEnd - cpuTotalStart))
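The snippet above stops at the two deltas. As a minimal sketch of the final step, the helper below takes two "cpu" lines sampled from /proc/stat and prints the usage percentage. The name cpuUsagePercent is hypothetical and not part of the original script, and treating everything except idle and iowait as "used" is an assumption that differs slightly from the field list used above.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): CPU usage in percent
# computed from two aggregate "cpu" lines of /proc/stat.
# $1 = first sample line, $2 = second sample line
cpuUsagePercent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= NF; i++) total += $i   # sum every counter on the line
            idle = $5 + $6                          # assumption: idle + iowait count as unused
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.00"; exit }     # guard against division by zero
            printf "%.2f\n", (dt - (i1 - i0)) * 100 / dt
        }'
}

# Example (commented out so that sourcing this file has no side effects):
# start=$(grep '^cpu ' /proc/stat); sleep 1; end=$(grep '^cpu ' /proc/stat)
# cpuUsagePercent "$start" "$end"
```

Passing the samples in as arguments keeps the arithmetic separate from the sampling, so the calculation can be checked with fixed input.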
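2.2 Memory Usage
How it works:
Memory statistics are recorded in the file /proc/meminfo. MemTotal is the total amount of memory in kB, and MemFree is the amount of free memory. Memory usage = (total memory - free memory) / total memory. Note that MemFree does not include reclaimable buffers and page cache, so this formula tends to overstate memory pressure; MemAvailable gives a closer estimate of what applications can still allocate.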
    stephen@stephen-K55VD:~/shell$ cat /proc/meminfo
    MemTotal:        3922884 kB
    MemFree:          139108 kB
    MemAvailable:     317700 kB
    Buffers:           31792 kB
    Cached:           538160 kB
    SwapCached:        10012 kB
    Active:          2615652 kB
Implementation:
    # Get memory usage
    function memUsage(){
        logInfo "Begin to get mem usage of Host [${ip}]"
        # Total memory (kB)
        totalMem=$(awk '/^MemTotal:/{print $2}' /proc/meminfo)
        # Free memory (kB)
        freeMem=$(awk '/^MemFree:/{print $2}' /proc/meminfo)
        usedMem=$((totalMem - freeMem))
        #echo $(usagePercent ${usedMem} ${totalMem})
        #echo $(kbToGb ${totalMem})
        logInfo "Host [${ip}] total mem is : $(kbToGb ${totalMem}) GB"
        # Compute the memory usage and write it to the log
        logInfo "Host [${ip}] mem usage is : $(usagePercent ${usedMem} ${totalMem}) %"
        logInfo "End to get mem usage of Host [${ip}]"
    }
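Because MemFree ignores reclaimable cache, a variant based on MemAvailable may be closer to what an operator expects. The following is a minimal sketch, not the original author's method: the function name memUsedPercent and its optional file argument are hypothetical, and it assumes a kernel that reports MemAvailable (Linux 3.14 or later).

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): memory usage in percent
# based on MemAvailable instead of MemFree.
# $1 = path to a meminfo-format file (defaults to /proc/meminfo)
memUsedPercent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) { print "0.00"; exit }   # missing or empty input
            printf "%.2f\n", (total - avail) * 100 / total
        }' "${1:-/proc/meminfo}"
}

# Example (commented out so that sourcing this file has no side effects):
# memUsedPercent
```

The optional file argument exists only so the calculation can be exercised against a saved copy of /proc/meminfo.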
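2.3 Network Traffic
How it works:
On Linux, network traffic counters are recorded in the file /proc/net/dev. The rate is obtained by taking the number of bytes received and sent at two points in time and dividing the difference by the elapsed time. The first column is the interface name, the second column is the number of bytes received, and the tenth column is the number of bytes sent.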
    stephen@stephen-K55VD:~/shell/sysMonitor$ cat /proc/net/dev
    Inter-|   Receive                                                 |  Transmit
     face |bytes     packets errs drop fifo frame compressed multicast|bytes     packets errs drop fifo colls carrier compressed
      wlp3s0: 19595253   41163    0    0    0     0          0         0 34741446   49185    0    0    0     0       0          0
    enp4s0f2:        0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
     docker0:        0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
          lo:   907275    5032    0    0    0     0          0         0   907275    5032    0    0    0     0       0          0
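Implementation:
    # ethName is the network interface name; match it exactly so that
    # similarly named interfaces are not picked up as well
    receiveByteStart=$(awk -v dev="${ethName}:" '$1==dev{print $2}' /proc/net/dev)
    sendByteStart=$(awk -v dev="${ethName}:" '$1==dev{print $10}' /proc/net/dev)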
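The snippet above takes only the first sample. As a rough sketch of the full calculation, the two helpers below read both counters for one interface and turn a pair of samples into an average rate. The names netBytes and rateKiB are hypothetical, and reporting in KiB/s is an assumption made for this example.

```shell
#!/bin/bash
# Hypothetical helpers (not in the original script).

# Print "<rx_bytes> <tx_bytes>" for interface $1.
# $2 = path to a /proc/net/dev-format file (defaults to /proc/net/dev)
netBytes() {
    # Turn the colon into a space so that "eth0:123" (no space after the
    # colon) still splits into separate fields.
    sed 's/:/ /' "${2:-/proc/net/dev}" |
        awk -v dev="$1" '$1 == dev { print $2, $10 }'
}

# Average rate in KiB/s. $1 = bytes at start, $2 = bytes at end, $3 = seconds
rateKiB() {
    awk -v s="$1" -v e="$2" -v t="$3" 'BEGIN { printf "%.2f\n", (e - s) / t / 1024 }'
}

# Example (commented out so that sourcing this file has no side effects):
# read -r rx0 tx0 < <(netBytes wlp3s0); sleep 10; read -r rx1 tx1 < <(netBytes wlp3s0)
# echo "in: $(rateKiB "$rx0" "$rx1" 10) KiB/s, out: $(rateKiB "$tx0" "$tx1" 10) KiB/s"
```

Reading both counters in one pass means the receive and transmit values come from the same snapshot of the file.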
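2.4 Disk IO
How it works:
Disk IO statistics are recorded in the file /proc/vmstat. pgpgin is the cumulative amount of data read in from disk, and pgpgout is the cumulative amount of data written out to disk, both in KiB. Sampling the counter twice and dividing the difference by the elapsed time gives the rate.
Implementation:
    # disk IO in
    function diskIOIn(){
        # Sample the inbound disk IO counter (KiB) twice, 30 s apart
        inIoStart=$(awk '/pgpgin/{print $2}' /proc/vmstat)
        sleep 30
        inIoEnd=$(awk '/pgpgin/{print $2}' /proc/vmstat)
        # KiB over 30 s -> MB/s (integer arithmetic)
        inIo=$(((inIoEnd-inIoStart)/(30*1024)))
        logInfo "Host [${ip}] in IO is : ${inIo} MB / s"
    }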
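One limitation of the integer arithmetic in diskIOIn is that any rate below 1 MB/s is reported as 0. The sketch below shows one possible way to keep two decimal places by doing the division in awk. The helper names vmstatCounter and ioRateMB are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helpers (not in the original script).

# Print the value of counter $1 (e.g. pgpgin).
# $2 = path to a vmstat-format file (defaults to /proc/vmstat)
vmstatCounter() {
    awk -v key="$1" '$1 == key { print $2 }' "${2:-/proc/vmstat}"
}

# Convert two counter readings in KiB, taken $3 seconds apart, into MB/s
# with two decimal places. $1 = start value, $2 = end value, $3 = seconds
ioRateMB() {
    awk -v s="$1" -v e="$2" -v t="$3" 'BEGIN { printf "%.2f\n", (e - s) / (t * 1024) }'
}

# Example (commented out so that sourcing this file has no side effects):
# a=$(vmstatCounter pgpgin); sleep 30; b=$(vmstatCounter pgpgin)
# echo "in IO: $(ioRateMB "$a" "$b" 30) MB/s"
```

With this version, 15360 KiB read over 30 seconds is reported as 0.50 MB/s, where the integer version prints 0.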
3. Script Code
- hostLists: the list of IP addresses of the hosts to monitor.
- sysMonitor.sh*: the script that collects each of the monitoring metrics.
    #!/bin/bash
    # Monitor system information of Linux hosts
    # Load the helper module
    source utils

    # Get CPU usage
    function cpuUsage()
    {
        # Number of physical CPUs
        phyCPUNums=$(grep "physical id" /proc/cpuinfo | sort | uniq | wc -l)
        # Number of logical CPUs
        lgCPUNums=$(grep -c "^processor" /proc/cpuinfo)
        # Cores per physical CPU
        cores=$(grep "cores" /proc/cpuinfo | uniq | awk '{print $4}')
        logInfo "Host [${ip}] physical CPU nums is : ${phyCPUNums}"
        logInfo "Host [${ip}] logic CPU nums is : ${lgCPUNums}"
        logInfo "Host [${ip}] core nums is : ${cores}"
        # CPU usage
        # Sample total and used CPU time from the aggregate "cpu" line
        cpuTotalStart=$(awk '/^cpu /{for(i=2;i<=NF;i++) total+=$i} END{print total}' /proc/stat)
        cpuUsedStart=$(awk '/^cpu /{print $2+$3+$4+$7+$8}' /proc/stat)
        # Wait 30 s, sample again, and compute the differences
        sleep 30
        cpuTotalEnd=$(awk '/^cpu /{for(i=2;i<=NF;i++) total+=$i} END{print total}' /proc/stat)
        cpuUsedEnd=$(awk '/^cpu /{print $2+$3+$4+$7+$8}' /proc/stat)
        usedCPU=$((cpuUsedEnd - cpuUsedStart))
        totalCPU=$((cpuTotalEnd - cpuTotalStart))
        logInfo "Host [${ip}] CPU usage is : $(usagePercent ${usedCPU} ${totalCPU}) %"
    }

    # Get memory usage
    function memUsage(){
        logInfo "Begin to get mem usage of Host [${ip}]"
        # Total memory (kB)
        totalMem=$(awk '/^MemTotal:/{print $2}' /proc/meminfo)
        # Free memory (kB)
        freeMem=$(awk '/^MemFree:/{print $2}' /proc/meminfo)
        usedMem=$((totalMem - freeMem))
        #echo $(usagePercent ${usedMem} ${totalMem})
        #echo $(kbToGb ${totalMem})
        logInfo "Host [${ip}] total mem is : $(kbToGb ${totalMem}) GB"
        # Compute the memory usage and write it to the log
        logInfo "Host [${ip}] mem usage is : $(usagePercent ${usedMem} ${totalMem}) %"
        logInfo "End to get mem usage of Host [${ip}]"
    }

    # Average network traffic per second for one interface
    function netData(){
        logInfo "Begin to get net data of Host [${ip}]"
        ethName=$1
        receiveByteStart=$(awk -v dev="${ethName}:" '$1==dev{print $2}' /proc/net/dev)
        sendByteStart=$(awk -v dev="${ethName}:" '$1==dev{print $10}' /proc/net/dev)
        sleep 10
        receiveByteEnd=$(awk -v dev="${ethName}:" '$1==dev{print $2}' /proc/net/dev)
        sendByteEnd=$(awk -v dev="${ethName}:" '$1==dev{print $10}' /proc/net/dev)
        # The counters are in bytes: divide by the 10 s interval and by 1024 to get KB/s
        inDataRate=$(echo "scale=2;(${receiveByteEnd}-${receiveByteStart})/10/1024" | bc)
        outDataRate=$(echo "scale=2;(${sendByteEnd}-${sendByteStart})/10/1024" | bc)
        logInfo "Host [${ip}] in data is : ${inDataRate} KB / s"
        logInfo "Host [${ip}] out data is : ${outDataRate} KB / s"
        logInfo "End to get net data of Host [${ip}]"
    }

    # Disk space usage
    function diskUsage(){
        logInfo "Begin to get disk usage of Host [${ip}]"
        noTimeLogInfo "$(df -h)"
        logInfo "End to get disk usage of Host [${ip}]"
    }

    # disk IO in
    function diskIOIn(){
        # Sample the inbound disk IO counter (KiB) twice, 30 s apart
        inIoStart=$(awk '/pgpgin/{print $2}' /proc/vmstat)
        sleep 30
        inIoEnd=$(awk '/pgpgin/{print $2}' /proc/vmstat)
        inIo=$(((inIoEnd-inIoStart)/(30*1024)))
        logInfo "Host [${ip}] in IO is : ${inIo} MB / s"
    }

    # disk IO out
    function diskIOout(){
        # Sample the outbound disk IO counter (KiB) twice, 60 s apart
        outIoStart=$(awk '/pgpgout/{print $2}' /proc/vmstat)
        sleep 60
        outIoEnd=$(awk '/pgpgout/{print $2}' /proc/vmstat)
        outIo=$(((outIoEnd-outIoStart)/(60*1024)))
        logInfo "Host [${ip}] out IO is : ${outIo} MB / s"
    }

    # Currently logged-in users
    function onlineUser(){
        # The first two lines of `w` are the uptime summary and the column header
        user=$(w | awk 'NR>2{print $1 "  " $4}')
        userCount=$(w | awk 'NR>2' | wc -l)
        logInfo "There are [${userCount}] users online now."
        noTimeLogInfo "UserName loginAt"
        noTimeLogInfo "${user}"
    }

    # Check host reachability, then collect all metrics
    function isAlive(){
        for ip in $(cat hostLists)
        do
            if ping -c 3 "${ip}" >/dev/null; then
                logInfo "${ip} is reachable"
                # Logged-in users
                onlineUser
                # CPU information
                cpuUsage
                # Memory information
                memUsage
                # Disk IO
                diskIOIn
                diskIOout
                # Disk usage
                diskUsage
                # Average inbound and outbound traffic per second
                netData wlp3s0
            else
                logInfo "ERROR ${ip} is unreachable,try login in see more details.."
            fi
        done
    }

    while true
    do
        isAlive
        sleep 60
    done
- utils: logging functions and other helpers.
    #!/bin/bash
    # Logging helpers
    curr_path=$(pwd)
    log_file=${curr_path}/system_status.log

    function logInfo()
    {
        local curr_time=$(date "+%Y-%m-%d %H:%M:%S")
        # Check whether the log file exists
        if [ -e "${log_file}" ]
        then
            # If the file is not writable, grant permission with chmod
            if [ ! -w "${log_file}" ]
            then
                chmod 770 "${log_file}"
            fi
        else
            # Create the log file if it does not exist
            touch "${log_file}"
        fi
        # Write the log entry
        local info=$1
        echo "${curr_time} $(whoami) [Info] ${info}" >>"${log_file}"
    }

    # Write a message without the timestamp prefix
    function noTimeLogInfo(){
        msg=$1
        echo "${msg}" >>"${log_file}"
    }

    # Convert kB to GB with 3 decimal places (bc is used because expr only supports integers)
    function kbToGb(){
        kbVal=$1
        gbVal=$(echo "scale=3;${kbVal}/1024/1024" | bc)
        echo "${gbVal}"
    }

    # Usage as a percentage
    # First argument: amount used; second argument: total
    function usagePercent(){
        used=$1
        total=$2
        usedPercent=$(echo "scale=2;${used}*100/${total}" | bc)
        echo "${usedPercent}"
    }
Script layout:
    -rw-r--r-- 1 stephen stephen   30 Apr  5 18:33 hostLists
    -rwxrwxr-x 1 stephen stephen 4164 Apr  5 18:50 sysMonitor.sh*
    -rw-r--r-- 1 stephen stephen  951 Apr  5 15:23 utils
4. Results
The monitoring data is written to the log file system_status.log. A sample run produced the following output:
    2019-04-05 19:44:42 stephen [Info] 192.168.1.109 is reachable
    2019-04-05 19:44:42 stephen [Info] There are [2] users online now.
    UserName loginAt
    USER  LOGIN@
    stephen  14:09
    2019-04-05 19:44:42 stephen [Info] Host [192.168.1.109] physical CPU nums is : 1
    2019-04-05 19:44:42 stephen [Info] Host [192.168.1.109] logic CPU nums is : 4
    2019-04-05 19:44:42 stephen [Info] Host [192.168.1.109] core nums is : 2
    2019-04-05 19:45:12 stephen [Info] Host [192.168.1.109] CPU usage is : 10.12 %
    2019-04-05 19:45:12 stephen [Info] Begin to get mem usage of Host [192.168.1.109]
    2019-04-05 19:45:12 stephen [Info] Host [192.168.1.109] total mem is : 3.741 GB
    2019-04-05 19:45:12 stephen [Info] Host [192.168.1.109] mem usage is : 95.83 %
    2019-04-05 19:45:12 stephen [Info] End to get mem usage of Host [192.168.1.109]
    2019-04-05 19:45:42 stephen [Info] Host [192.168.1.109] in IO is : 0 MB / s
    2019-04-05 19:46:42 stephen [Info] Host [192.168.1.109] out IO is : 0 MB / s
    2019-04-05 19:46:42 stephen [Info] Begin to get disk usage of Host [192.168.1.109]
    Filesystem      Size  Used Avail Use% Mounted on
    udev            1.9G     0  1.9G   0% /dev
    tmpfs           384M  2.0M  382M   1% /run
    /dev/sda10       42G   20G   20G  51% /
    tmpfs           1.9G   20M  1.9G   2% /dev/shm
    tmpfs           5.0M  4.0K  5.0M   1% /run/lock
    tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
    /dev/loop0      3.8M  3.8M     0 100% /snap/notepad-plus-plus/202
    /dev/loop2       54M   54M     0 100% /snap/core18/782
    /dev/loop4      441M  441M     0 100% /snap/wine-platform/111
    /dev/loop5      441M  441M     0 100% /snap/wine-platform/105
    /dev/loop7      3.8M  3.8M     0 100% /snap/notepad-plus-plus/199
    /dev/loop3       90M   90M     0 100% /snap/core/6673
    /dev/loop1      274M  274M     0 100% /snap/wps-office-multilang/1
    /dev/loop6       91M   91M     0 100% /snap/core/6405
    /dev/loop8       92M   92M     0 100% /snap/core/6531
    /dev/loop9       36M   36M     0 100% /snap/gtk-common-themes/1198
    /dev/loop10     3.8M  3.8M     0 100% /snap/notepad-plus-plus/195
    /dev/loop11     441M  441M     0 100% /snap/wine-platform/103
    tmpfs           384M   16K  384M   1% /run/user/125
    tmpfs           384M   52K  384M   1% /run/user/1000
    2019-04-05 19:46:42 stephen [Info] End to get disk usage of Host [192.168.1.109]
    2019-04-05 19:46:42 stephen [Info] Begin to get net data of Host [192.168.1.109]
    2019-04-05 19:46:52 stephen [Info] Host [192.168.1.109] in data is : 42.90 kb / s
    2019-04-05 19:46:52 stephen [Info] Host [192.168.1.109] out data is : 7.00 kb / s
    2019-04-05 19:46:52 stephen [Info] End to get net data of Host [192.168.1.109]
    2019-04-05 19:47:04 stephen [Info] ERROR 255.255.255.254 is unreachable,try login in see more details..
5. References
5.1 ifstat network traffic monitoring and the /proc/net/dev file
https://blog.csdn.net/kongshuai19900505/article/details/80676607
5.2 The awk command
http://man.linuxde.net/awk
5.3 Collecting system CPU, memory, disk, and network information with shell scripts
https://www.jb51.net/article/50436.htm