alertmanager
Alertmanager can run on a remote server, separate from Prometheus.
Alerting mechanism
You define monitoring rules in Prometheus, i.e. configure a trigger: when a value crosses the configured threshold, an alert fires. Prometheus pushes the firing alert to Alertmanager, which runs it through its processing pipeline (grouping, inhibition, silencing, routing) and finally delivers it to the receivers.
Installation
wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
tar zxf alertmanager-0.19.0.linux-amd64.tar.gz
mv alertmanager-0.19.0.linux-amd64 /usr/local/alertmanager && cd /usr/local/alertmanager && ls
Configuration file
cat alertmanager.yml
global:
  resolve_timeout: 5m        ##global setting: how long to wait before an alert with no updates is marked as resolved
route:
  group_by: ['alertname']    ##which label Alertmanager uses to group alerts
  group_wait: 10s            ##after the first alert of a group arrives, wait 10s for others so they can be sent together
  group_interval: 10s        ##how long to wait before sending a notification about new alerts added to an existing group
  repeat_interval: 1h        ##how often a still-firing alert is re-sent, 1 hour here
  receiver: 'web.hook'       ##default receiver
##configure the alert receivers
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
##configure alert inhibition (convergence)
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Email receiver configuration
cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'   #SMTP server address
  smtp_from: 'xxx@163.com'            #sender address
  smtp_auth_username: 'xxx@163.com'   #auth username
  smtp_auth_password: 'xxxx'          #auth password
  smtp_require_tls: false             #disable TLS
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'email'                   #name of the receiver group that gets the alerts
receivers:
- name: 'email'                       #receiver name
  email_configs:                      #email settings
  - to: 'xx@xxx.com'                  #recipient
Check the configuration file
./amtool check-config alertmanager.yml
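Once the file passes the check, a running Alertmanager can pick up the change without a restart; a minimal sketch, assuming it listens on the default 127.0.0.1:9093:
curl -X POST http://127.0.0.1:9093/-/reload   ##ask Alertmanager to re-read alertmanager.yml
kill -HUP $(pgrep alertmanager)               ##alternatively, send SIGHUP to the process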
Configure as a systemd service
cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager

[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target
EOF
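Then reload systemd and start the service; a quick check, assuming the default web port 9093:
systemctl daemon-reload
systemctl enable --now alertmanager
ss -tnlp | grep 9093    ##the Alertmanager web/API port should now be listening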
Integration with Prometheus
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093   ##Alertmanager address
rule_files:
  - "rules/*.yml"        ##files that contain the alerting rules
Configuring alerting rules
Rule files live in /usr/local/prometheus/rules
/usr/local/prometheus/rules]# cat example.yml
groups:
- name: exports.rules   ##name of this rule group; these rules all check whether exporters are up
  rules:
  - alert: ExporterDown   ## alert name
    expr: up == 0         ## alert expression: watch the up metric; 0 means the monitored target is down, which triggers the steps below
    for: 1m               ## the expression must hold for one minute before the alert fires
    labels:               ## alert severity label
      severity: ERROR
    annotations:          ## how the notification text is written; it can reference {{ $labels.instance }} and {{ $labels.job }}
      summary: "Instance {{ $labels.instance }} is down"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} is down"
Template variables used above:
{{ $labels.instance }}   #the instance label taken from the up metric
{{ $labels.job }}        #the job label taken from the up metric
Alerts with the same alert name (alertname, i.e. the alert: field of the rule) are merged into a single email and sent together.
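A rule file can be syntax-checked before Prometheus loads it; a small sketch, assuming the rules sit under /usr/local/prometheus/rules:
cd /usr/local/prometheus
./promtool check rules rules/example.yml   ##prints the number of rules found, or the parse error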
Alert routing
The routing policy is defined in the Alertmanager configuration file:
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'email'
Routing example
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'xxx@163.com'
  smtp_auth_username: 'xxx@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
route:
  receiver: 'default-receiver'        ##default receiver, used when no child route matches
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]      ##grouping labels
  routes:                             ##child routes
  - receiver: 'database-pager'        ##receiver for this route
    group_wait: 10s                   ##grouping settings for this route
    match_re:                         ##regex match
      service: mysql|cassandra        ##matches alerts whose service label is mysql or cassandra
  - receiver: 'frontend-pager'        ##receiver name
    group_by: [product, environment]  ##grouping settings
    match:                            ##exact match
      team: frontend                  ##matches alerts whose team label is frontend
receivers:                            ##receiver definitions
- name: 'default-receiver'            ##receiver name
  email_configs:                      ##email settings
  - to: 'xxx@xx.com'                  ##recipient, and likewise below
- name: 'database-pager'
  email_configs:
  - to: 'xxx@xx.com'
- name: 'frontend-pager'
  email_configs:
  - to: 'xxx@xx.com'
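To see which receiver a given set of labels would end up at, amtool can walk the routing tree; a sketch, assuming the amtool shipped with this release supports the routes subcommand:
cd /usr/local/alertmanager
./amtool config routes show --config.file=alertmanager.yml                  ##print the routing tree
./amtool config routes test --config.file=alertmanager.yml service=mysql    ##should print database-pager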
Alert convergence
Convergence means compressing the number of alert emails as much as possible so that key information is not drowned out. Alertmanager has several convergence mechanisms, the main ones being grouping, inhibition and silencing. After receiving alerts, Alertmanager first groups them, then puts them into the notification queue, where inhibition and silencing are applied, and finally routes them to the different receivers according to the route tree.
Mechanism            Description
Grouping (group)     Merge alerts of a similar nature into a single notification
Inhibition           When one alert fires, stop sending other alerts caused by it
Silences             A simple mechanism to mute notifications for a specific period of time
Grouping: alerts are grouped by alert name; if several alerts share the same alert name, they are merged into one email.
The alert name being matched is the one defined in the Prometheus alerting rules:
/usr/local/prometheus/rules/*.yml
- alert: NodeDown
Inhibition: eliminates redundant alerts; configured in Alertmanager:
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']
##When an alert with severity critical is received, it suppresses alerts with severity warning. The severity level is the label you set when writing the rule. The last line defines which labels the two alerts must share for the inhibition to apply; here only instance is used. The simplest example: Alertmanager receives one critical alert and one warning alert whose instance values are identical — how does it handle them?
##Say we monitor nginx: the alert for nginx being down has severity warning, while the alert for the host being down has severity critical. If the server running nginx dies, nginx obviously dies with it. Prometheus notices and sends two alerts to Alertmanager, one for the host and one for nginx. Alertmanager sees that one is critical and one is warning, and that the instance labels match, meaning both happened on the same machine, so it sends only the critical alert; the warning is inhibited, and all we receive is the "server is down" notification.
Silences:
A mechanism to mute notifications for a specific period of time, using label matchers to decide which alerts are not sent. For example, if a server is scheduled for maintenance that may involve reboots, plenty of alerts would fire during that window; configuring a silence for that period keeps those alerts from being sent.
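Silences are usually created in the web UI, but amtool can create them from the command line as well; a sketch, assuming Alertmanager is reachable on 127.0.0.1:9093 and a two-hour maintenance window (the instance value is a placeholder):
/usr/local/alertmanager/amtool silence add instance="192.168.1.145:9100" --alertmanager.url=http://127.0.0.1:9093 --duration=2h --author=ops --comment="server maintenance"
/usr/local/alertmanager/amtool silence query --alertmanager.url=http://127.0.0.1:9093   ##list active silences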
Alert example
Monitoring memory usage
PromQL:
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 80
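The expression can be tried against the Prometheus HTTP API before it is turned into a rule; a sketch, assuming Prometheus listens on 127.0.0.1:9090:
curl -G 'http://127.0.0.1:9090/api/v1/query' --data-urlencode 'query=(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100'
##the JSON response lists the current memory usage percentage per instance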
Write the rule:
cd /usr/local/prometheus/rules
cat memory.yml
groups:
- name: memory_rules
  rules:
  - alert: MemoryUsageHigh
    expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 80   #the alert fires while the expression returns data
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} memory usage is high"
      description: "{{ $labels.instance }} is running out of memory, current usage: {{ $value }}"
Configure alert routing
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 5m
receiver: 'default-receiver'
routes:
- group_by: ['mysql']
group_wait: 10s
group_interval: 10s
repeat_interval: 5m
receiver: 'mysql-pager'
match_re:
job: mysql
receivers:
- name: 'default-receiver'
email_configs:
- to: 'xxx@xx.com'
- name: 'mysql-pager'
email_configs:
- to: 'xxx@xx.cn'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['instance']
DingTalk alerts
Build the DingTalk webhook bridge
#install the Go environment
wget -c https://storage.googleapis.com/golang/go1.8.3.linux-amd64.tar.gz
tar -C /usr/local/ -zxvf go1.8.3.linux-amd64.tar.gz
mkdir -p /home/gocode
cat << 'EOF' >> /etc/profile
export GOROOT=/usr/local/go    #where Go was installed
export GOPATH=/home/gocode     #default path for Go packages
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
EOF
source /etc/profile
----------------------------------------
#install the DingTalk plugin
cd /home/gocode/
mkdir -p src/github.com/timonwong/
cd /home/gocode/src/github.com/timonwong/
git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git
cd prometheus-webhook-dingtalk
make
#the build succeeds; sample output:
[root@mini-install prometheus-webhook-dingtalk]# make
>> formatting code
>> building binaries
> prometheus-webhook-dingtalk
>> checking code style
>> running tests
? github.com/timonwong/prometheus-webhook-dingtalk/chilog [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/cmd/prometheus-webhook-dingtalk [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/models [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/notifier [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/template [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/template/internal/deftmpl [no test files]
? github.com/timonwong/prometheus-webhook-dingtalk/webrouter [no test files]
#create a symlink
ln -s /home/gocode/src/github.com/timonwong/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk /usr/local/bin/prometheus-webhook-dingtalk
##check the built binary
prometheus-webhook-dingtalk --help
usage: prometheus-webhook-dingtalk --ding.profile=DING.PROFILE [<flags>]
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--web.listen-address=":8060"
The address to listen on for web interface.
--ding.profile=DING.PROFILE ...
Custom DingTalk profile (can be given multiple times, <profile>=<dingtalk-url>).
--ding.timeout=5s Timeout for invoking DingTalk webhook.
--template.file="" Customized template file (see template/default.tmpl for example)
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn, error]
--version Show application version.
Start the DingTalk plugin
Start the plugin with the webhook URL of the DingTalk robot you created:
prometheus-webhook-dingtalk --ding.profile="webhook=https://oapi.dingtalk.com/robot/send?access_token=OOOOOOXXXXXXOXOXOX9b46d54e780d43b98a1951489e3a0a5b1c6b48e891e86bd"
#Note: several webhook profiles can be configured; the profile name is part of the alert URL used later
#About the --ding.profile flag: to support sending alerts to several DingTalk robots at the same time, --ding.profile can be given multiple times on the command line, for example:
prometheus-webhook-dingtalk \
  --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx" \
  --ding.profile="webhook2=https://oapi.dingtalk.com/robot/send?access_token=yyyyyyyyyyy"
This defines two webhooks, webhook1 and webhook2, used to send alerts to different DingTalk groups.
Then add the corresponding receivers in the Alertmanager configuration (note the url below):
receivers:
- name: send_to_dingding_webhook1
webhook_configs:
- send_resolved: false
url: http://localhost:8060/dingtalk/webhook1/send
- name: send_to_dingding_webhook2
webhook_configs:
- send_resolved: false
url: http://localhost:8060/dingtalk/webhook2/send
##Configure the DingTalk plugin as a systemd service
cat > /usr/lib/systemd/system/dingtalk.service <<EOF
[Unit]
Description=prometheus-webhook-dingtalk
[Service]
Restart=on-failure
ExecStart=/usr/local/bin/prometheus-webhook-dingtalk --ding.profile="webhook=https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXOOOOOOO0d43b98a1951489e3a0a5b1c6b48e891e86bd"
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl status dingtalk may report an error before the first start; ignore it and simply run systemctl start dingtalk
##check the listening port
[root@mini-install system]# ss -tanlp | grep 80
LISTEN 0 128 :::8060 :::* users:(("prometheus-webh",pid=18541,fd=3))
##quick test
curl -H "Content-Type: application/json" -d '{ "version": "4", "status": "firing", "description":"description_content"}' http://localhost:8060/dingtalk/webhook/send
##data format Alertmanager sends to webhook receivers
The webhook receiver allows configuring a generic receiver:
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The endpoint to send HTTP POST requests to.
url: <string>
# The HTTP client's configuration.
[ http_config: <http_config> | default = global.http_config ]
The Alertmanager will send HTTP POST requests in the following JSON format to the configured endpoint:
{
"version": "4",
"groupKey": <string>, // key identifying the group of alerts (e.g. to deduplicate)
"status": "<resolved|firing>",
"receiver": <string>,
"groupLabels": <object>,
"commonLabels": <object>,
"commonAnnotations": <object>,
"externalURL": <string>, // backlink to the Alertmanager.
"alerts": [
{
"status": "<resolved|firing>",
"labels": <object>,
"annotations": <object>,
"startsAt": "<rfc3339>",
"endsAt": "<rfc3339>",
"generatorURL": <string> // identifies the entity that caused the alert
},
...
]
}
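Knowing this format, the DingTalk bridge can be exercised end to end with a hand-built payload; a sketch with placeholder label values, assuming the plugin listens on localhost:8060 with a profile named webhook:
curl -H "Content-Type: application/json" http://localhost:8060/dingtalk/webhook/send -d '{
  "version": "4",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "TestAlert", "instance": "127.0.0.1:9100", "severity": "warning"},
      "annotations": {"summary": "test message from curl"},
      "startsAt": "2019-09-01T12:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}'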
alertmanager
Setup
wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
tar zxvf alertmanager-0.19.0.linux-amd64.tar.gz
ln -sv `pwd`/alertmanager-0.19.0.linux-amd64 /usr/local/alertmanager
#configure as a systemd service
cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager
[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload, then start the service
#edit the configuration file
cd /usr/local/alertmanager
vim alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://localhost:8060/dingtalk/webhook/send'
Integration with Prometheus
pwd
/usr/local/prometheus
mkdir rules && cd !$
cat example.yml
groups:
- name: exports.rules   ##name of this rule group; these rules all check whether exporters are up
  rules:
  - alert: ExporterDown   ## alert name
    expr: up == 0         ## alert expression: watch the up metric; if it equals 0, run the steps below
    for: 1m               ## the value must stay 0 for one minute before the alert fires
    labels:               ## alert severity label
      severity: ERROR
    annotations:          ## how the notification text is written; it can reference {{ $labels.instance }} and {{ $labels.job }}
      summary: "Instance {{ $labels.instance }} exporter is down!"
      description: "The exporter of instance {{ $labels.instance }} (job {{ $labels.job }}) has been down for one minute!"
cat prometheus.yml
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- 127.0.0.1:9093
# - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "rules/*.yml"
# - "first_rules.yml"
# - "second_rules.yml"
##Start all the services.
Once every target is up and being scraped normally, stop one node and observe the result.
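One way to verify the whole chain, assuming node_exporter runs as a systemd service on the test node and Prometheus/Alertmanager listen on their default ports:
systemctl stop node_exporter                    ##on the node being tested
curl -s http://127.0.0.1:9090/api/v1/alerts     ##after about a minute the ExporterDown alert turns pending, then firing
/usr/local/alertmanager/amtool alert query --alertmanager.url=http://127.0.0.1:9093   ##the alert should also appear in Alertmanager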
References:
https://blog.rj-bai.com/post/158.html
DingTalk plugin author:
https://theo.im/blog/2017/10/16/release-prometheus-alertmanager-webhook-for-dingtalk/
https://github.com/timonwong/prometheus-webhook-dingtalk
Building the DingTalk plugin: https://blog.51cto.com/9406836/2419876
http://ylzheng.com/2018/03/01/alertmanager-webhook-dingtalk/
DingTalk alerts, Python version
DingTalk is fairly strict about the message format it accepts (so I'm told). To use DingTalk's markdown message format, we write a small API of our own that reshapes the data Alertmanager sends and forwards it to the DingTalk robot.
DingTalk alert webhook, Python version:
import os
import json
import requests
import arrow
from flask import Flask
from flask import request

app = Flask(__name__)

@app.route('/', methods=['POST', 'GET'])
def send():
    if request.method == 'POST':
        post_data = request.get_data()
        send_alert(bytes2json(post_data))
        return 'success'
    else:
        return 'welcome to use prometheus alertmanager dingtalk webhook server!'

def bytes2json(data_bytes):
    data = data_bytes.decode('utf8').replace("'", '"')
    return json.loads(data)

def send_alert(data):
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    for output in data['alerts'][:]:
        send_data = {
            "msgtype": "markdown",
            "markdown": {
                "title": "prometheus_alert",
                "text": "## Alert source: prometheus_alertmanager\n" +
                        "**Severity**: %s\n" % output['labels']['status'] +
                        "**Alert name**: %s\n" % output['labels']['alertname'] +
                        "**Instance**: %s\n" % output['labels']['instance'] +
                        "**Details**: %s\n" % output['annotations']['summary'] +
                        "**Started at**: %s\n" % arrow.get(output['startsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ') +
                        "**Ended at**: %s\n" % arrow.get(output['endsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ')
            }
        }
        req = requests.post(url, json=send_data)
        result = req.json()
        if result['errcode'] != 0:
            print('notify dingtalk error: %s' % result['errcode'])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8060)
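To run the script outside a container, it only needs the robot token in the environment plus its three dependencies; a sketch (the token value is a placeholder, and data.json is the test payload shown further down):
pip3 install flask requests arrow
export ROBOT_TOKEN=xxxxxxxxxxxxxxxx    ##access_token of the DingTalk robot
python3 main.py                        ##serves on 0.0.0.0:8060
curl -H "Content-Type: application/json" -X POST -d @data.json http://127.0.0.1:8060/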
Package the program as a container
#working directory
# tree
.
├── Dockerfile
└── main.py
main.py is the Flask code above.
#cat Dockerfile
FROM tiangolo/uwsgi-nginx-flask:python3.7
#environment variable holding the DingTalk robot token
ENV ROBOT_TOKEN 47f07271e8a24b6a63486bBSJDFKj346556jhjk9892fk545jjf234jFJ89489JFKSDLF2KgfhsJK234
RUN pip install requests flask arrow -i https://pypi.tuna.tsinghua.edu.cn/simple --no-cache-dir
COPY main.py /app
EXPOSE 80
##build the image
docker build -t dingding .
##start the container
docker run -d --restart=always -p 8060:80 dingding
##test that it works
curl localhost:8060
welcome to use prometheus alertmanager dingtalk webhook server!
Test data:
[root@t1 ~]# cat data.json
{
  "version": "3",
  "status": "firing",
  "receiver": "jdhf",
  "alerts": [
    {
      "labels": {"instance": "192.168.1.145:9100", "alertname": "home partition free space", "status": "critical"},
      "annotations": {"summary": "Free disk space on the root mount point is below 4G! Currently available: 2G"}
    }
  ]
}
curl 127.0.0.1:8060 -X POST -d @data.json --header "Content-Type: application/json"
#The test fails
This is because the data Alertmanager sends to the DingTalk webhook contains extra fields that our hand-written test data lacks. To make this test succeed you have to edit main.py and remove the start-time and end-time lines.
##Errors
#DingTalk group notification returns {"errcode":310000,"errmsg":"keywords not in content"} — how to fix it:
The custom keyword in the robot's security settings is not configured, or the public IP has not been whitelisted.
#######################
##At this point the Alertmanager configuration file is:
/usr/local/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
webhook_configs:
- send_resolved: true
url: 'http://localhost:8060'
#url: 'http://localhost:8060/dingtalk/webhook/send'
The modified Python webhook version:
import os
import json
import requests
import arrow
from flask import Flask
from flask import request

app = Flask(__name__)

@app.route('/', methods=['POST', 'GET'])
def send():
    if request.method == 'POST':
        post_data = request.get_data()
        send_alert(bytes2json(post_data))
        return 'success'
    else:
        return 'welcome to use prometheus alertmanager dingtalk webhook server!'

def bytes2json(data_bytes):
    data = data_bytes.decode('utf8').replace("'", '"')
    return json.loads(data)

def send_alert(data):
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    for output in data['alerts'][:]:
        try:
            pod_name = output['labels']['pod']
        except KeyError:
            try:
                pod_name = output['labels']['pod_name']
            except KeyError:
                pod_name = 'null'
        try:
            namespace = output['labels']['namespace']
        except KeyError:
            namespace = 'null'
        try:
            message = output['annotations']['message']
        except KeyError:
            try:
                message = output['annotations']['description']
            except KeyError:
                message = 'null'
        send_data = {
            "msgtype": "markdown",
            "markdown": {
                "title": "prometheus_alert",
                "text": "## Alert source: prometheus_alert\n" +
                        "**Severity**: %s\n" % output['labels']['severity'] +
                        "**Alert name**: %s\n" % output['labels']['alertname'] +
                        "**Pod**: %s\n" % pod_name +
                        "**Namespace**: %s\n" % namespace +
                        "**Details**: %s\n" % message +
                        "**Status**: %s\n" % output['status'] +
                        "**Started at**: %s\n" % arrow.get(output['startsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ') +
                        "**Ended at**: %s\n" % arrow.get(output['endsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ')
            }
        }
        req = requests.post(url, json=send_data)
        result = req.json()
        if result['errcode'] != 0:
            print('notify dingtalk error: %s' % result['errcode'])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)