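The previous two posts covered monitoring microservices with Prometheus, Grafana, and Eureka, with Grafana providing the dashboards. One problem remains: when something goes wrong, nobody can watch a dashboard around the clock. People miss important alerts, manual notification is slow, and, more importantly, a fleet of thousands of servers would need far too many watchers. Automatic and accurate alerting is therefore essential.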
Spring Cloud: Monitoring Microservices with Prometheus + Grafana (Part 21)
Spring Cloud: Dynamic Microservice Monitoring with Prometheus + Grafana + Eureka (Part 22)
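In the official architecture diagram below, the blue box in the upper-right corner marks the Prometheus alerting module. In this post, when a machine goes down, Prometheus sends the alert to Alertmanager, and Alertmanager forwards it to our own alerting application. That application can then notify the relevant business and development staff by e-mail, SMS, or WeCom (企业微信).
Figure: Architecture of Prometheus alerting with Alertmanager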
1. Configuring Alertmanager in Prometheus
Set the Alertmanager address in prometheus.yml. 9093 is the port Alertmanager listens on by default.
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
2. Configuring alert rules in Prometheus
2.1 In the same directory as prometheus.yml, create a rule file named first_rules.yml.
The contents of first_rules.yml are as follows:
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Instance has been down for more than 5 minutes
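This rule raises an alert when an instance goes down: the expression `up == 0` must hold continuously for 1 minute (`for: 1m`) before the alert fires. Note that the summary text says 5 minutes, which does not match `for: 1m`; in a real setup, make the two agree.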
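To make the effect of the `for` clause concrete, here is a minimal sketch of the inactive, pending, and firing states an alert passes through. It is an illustrative model, not Prometheus source code: the class name `ForClauseSketch`, the start timestamp, and the 15-second evaluation interval are all assumptions made for this example.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative model (not Prometheus source code) of a rule with a `for` clause. */
public class ForClauseSketch {

    public enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;   // the rule's `for` value
    private Instant activeSince;      // null while the expression is false

    public ForClauseSketch(Duration holdFor) {
        this.holdFor = holdFor;
    }

    /** Feeds one evaluation result and returns the resulting alert state. */
    public State evaluate(Instant now, boolean exprTrue) {
        if (!exprTrue) {
            activeSince = null;       // condition cleared: the timer resets
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now;        // first evaluation at which the condition held
        }
        Duration active = Duration.between(activeSince, now);
        return active.compareTo(holdFor) >= 0 ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        ForClauseSketch rule = new ForClauseSketch(Duration.ofMinutes(1));
        // Hypothetical start time and a hypothetical 15-second evaluation interval.
        Instant t0 = Instant.parse("2021-03-17T00:26:55Z");
        for (int i = 0; i <= 4; i++) {
            Instant now = t0.plusSeconds(15L * i);
            System.out.println(now + " " + rule.evaluate(now, true));
        }
    }
}
```

With these assumed values the alert stays pending for the first four evaluations and turns firing on the fifth, one minute after the condition first held. Only firing alerts are sent on to Alertmanager, which is why a brief outage shorter than the `for` duration produces no notification.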
2.2 Reference first_rules.yml in prometheus.yml. The entry is commented out by default, so simply uncomment it.
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  # - "second_rules.yml"
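3. Installing and configuring Alertmanager
3.1 Downloading Alertmanager
Download: https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.windows-amd64.tar.gz
3.2 Editing alertmanager.yml
After the download finishes, extract the archive and change receivers.webhook_configs.url so that it points to our own alerting application (the spring-cloud-alertmanager address).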
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/alertMessage/receive'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
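The `group_by: ['alertname']` setting decides which alerts end up in the same notification. As a rough illustration of the idea (a sketch of the grouping concept, not Alertmanager's actual code), the example below buckets alerts by the labels named in `group_by`. The class name `GroupBySketch` and its methods are hypothetical; the label values are the ones used in this post.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupBySketch {

    /** Picks the group_by labels out of one alert's label set. */
    public static Map<String, String> groupLabels(Map<String, String> labels, List<String> groupBy) {
        Map<String, String> key = new LinkedHashMap<>();
        for (String name : groupBy) {
            if (labels.containsKey(name)) {
                key.put(name, labels.get(name));
            }
        }
        return key;
    }

    /** Buckets alerts so that each bucket corresponds to one notification. */
    public static Map<Map<String, String>, List<Map<String, String>>> group(
            List<Map<String, String>> alerts, List<String> groupBy) {
        Map<Map<String, String>, List<Map<String, String>>> groups = new LinkedHashMap<>();
        for (Map<String, String> alert : alerts) {
            groups.computeIfAbsent(groupLabels(alert, groupBy), k -> new ArrayList<>()).add(alert);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map<String, String>> alerts = List.of(
                Map.of("alertname", "InstanceDown", "instance", "127.0.0.1:5001"),
                Map.of("alertname", "InstanceDown",
                        "instance", "windows10.microdone.cn:apollo-adminservice:8090"),
                Map.of("alertname", "InstanceDown",
                        "instance", "windows10.microdone.cn:apollo-configservice:8080"));
        // With group_by: ['alertname'] all three alerts fall into a single group.
        System.out.println(group(alerts, List.of("alertname")).keySet());
    }
}
```

Grouping by `alertname` alone means that several instances going down at once arrive as one webhook call with several entries, rather than as one call per instance. Adding `instance` to `group_by` would split them into separate notifications.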
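4. Developing the alerting module spring-cloud-alertmanager
4.1 Create a controller that receives the alert messages. Note that Alertmanager submits them via POST and that the parameter is the raw request body, read here as a byte array.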
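import java.nio.charset.StandardCharsets;

import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

/**
 * Receives the webhook notifications pushed by Alertmanager.
 *
 * @author Leo
 */
@RestController
@RequestMapping("alertMessage")
@Slf4j
public class ReceiveAlertMessageController {

    @PostMapping("receive")
    public String receiveMsg(@RequestBody byte[] data) {
        // Decode the raw request body as UTF-8 text (the payload is JSON).
        String msg = new String(data, StandardCharsets.UTF_8);
        // The log prefix means "Received AlertManager alert message:".
        log.info("接收AlertManager预警消息:" + msg);
        return "success";
    }
}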
4.2 Create the application class
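import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.eureka.EnableEurekaClient;

/**
 * Entry point of the alerting application; registers itself with Eureka.
 *
 * @author Leo
 */
@SpringBootApplication
@EnableEurekaClient
public class AlertManagerApplication {

    public static void main(String[] args) {
        SpringApplication.run(AlertManagerApplication.class, args);
    }
}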
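Before wiring up the whole chain, it can be handy to check the endpoint on its own. The following is a minimal sketch that posts a hand-written payload to the controller using the JDK's built-in HTTP client (Java 11 or later). The class name `WebhookSmokeTest` and the shortened JSON body are assumptions made for this example; real notifications from Alertmanager carry more fields.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class WebhookSmokeTest {

    /** Builds a POST request that carries the JSON text in the request body. */
    public static HttpRequest buildRequest(String url, String json) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // Hand-written, heavily shortened payload used only for this check.
        String json = "{\"receiver\":\"web.hook\",\"status\":\"firing\",\"alerts\":[]}";
        HttpRequest request = buildRequest("http://127.0.0.1:5001/alertMessage/receive", json);
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // Print the HTTP status code and the body returned by the controller.
            System.out.println(response.statusCode() + " " + response.body());
        } catch (IOException e) {
            System.out.println("Endpoint not reachable: " + e.getMessage());
        }
    }
}
```

If the application is running on port 5001, the controller should log the JSON text it received. If it is not running, the sketch only reports that the endpoint could not be reached.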
5. Verifying the alerting flow
5.1 Startup
Start Eureka
Start Prometheus: D:\soft\springcloud\prometheus-2.25.1\prometheus.exe
Start Alertmanager: D:\soft\springcloud\alertmanager-0.21.0\alertmanager.exe
Start spring-cloud-alertmanager
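5.2 Viewing the alert rules
Open http://localhost:9090/classic/rules in a browser to see the rule we configured earlier in first_rules.yml.
Click the Alerts menu: three instances are currently reported as down. (They are not really down. We simply have not set up Prometheus metrics in those applications, so Prometheus can pull the application list from Eureka but cannot scrape metrics from the applications themselves.)
5.3 Viewing the Alertmanager console
Open http://localhost:9093/ in a browser and click the Alerts menu: three alerts are listed, which shows that Prometheus has pushed the alerts to Alertmanager.
5.4 Checking the spring-cloud-alertmanager log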
2021-03-17 10:28:05.621 INFO 49924 --- [nio-5001-exec-5] c.x.a.c.ReceiveAlertMessageController : 接收AlertManager预警消息:{"receiver":"web.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"InstanceDown","instance":"172.16.43.41:5001","job":"eureka","severity":"critical"},"annotations":{"summary":"Instance has been down for more than 5 minutes"},"startsAt":"2021-03-17T00:27:55.050285364Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","fingerprint":"69addef300b8a5b1"},{"status":"firing","labels":{"alertname":"InstanceDown","instance":"windows10.microdone.cn:apollo-adminservice:8090","job":"eureka","severity":"critical"},"annotations":{"summary":"Instance has been down for more than 5 minutes"},"startsAt":"2021-03-17T00:27:55.050285364Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","fingerprint":"bfde9dc4159405b2"},{"status":"firing","labels":{"alertname":"InstanceDown","instance":"windows10.microdone.cn:apollo-configservice:8080","job":"eureka","severity":"critical"},"annotations":{"summary":"Instance has been down for more than 5 minutes"},"startsAt":"2021-03-17T00:27:55.050285364Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","fingerprint":"80adda6540e0cfba"}],"groupLabels":{"alertname":"InstanceDown"},"commonLabels":{"alertname":"InstanceDown","job":"eureka","severity":"critical"},"commonAnnotations":{"summary":"Instance has been down for more than 5 minutes"},"externalURL":"http://DESKTOP-TK67BLR:9093","version":"4","groupKey":"{}:{alertname=\"InstanceDown\"}","truncatedAlerts":0}
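As the log shows, the endpoint http://127.0.0.1:5001/alertMessage/receive received the message pushed by Alertmanager (the log prefix "接收AlertManager预警消息" means "Received AlertManager alert message"). Formatted with a JSON tool, the payload looks like this: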
{
  "receiver": "web.hook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "instance": "127.0.0.1:5001",
        "job": "eureka",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Instance has been down for more than 5 minutes"
      },
      "startsAt": "2021-03-17T00:27:55.050285364Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
      "fingerprint": "69addef300b8a5b1"
    },
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "instance": "windows10.microdone.cn:apollo-adminservice:8090",
        "job": "eureka",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Instance has been down for more than 5 minutes"
      },
      "startsAt": "2021-03-17T00:27:55.050285364Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
      "fingerprint": "bfde9dc4159405b2"
    },
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "instance": "windows10.microdone.cn:apollo-configservice:8080",
        "job": "eureka",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Instance has been down for more than 5 minutes"
      },
      "startsAt": "2021-03-17T00:27:55.050285364Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
      "fingerprint": "80adda6540e0cfba"
    }
  ],
  "groupLabels": {
    "alertname": "InstanceDown"
  },
  "commonLabels": {
    "alertname": "InstanceDown",
    "job": "eureka",
    "severity": "critical"
  },
  "commonAnnotations": {
    "summary": "Instance has been down for more than 5 minutes"
  },
  "externalURL": "http://DESKTOP-TK67BLR:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"InstanceDown\"}",
  "truncatedAlerts": 0
}
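Once the payload structure is known, each entry of the `alerts` array can be turned into a short line for an e-mail or SMS. The sketch below shows one possible way to do that with plain JDK classes. The class `AlertTextSketch`, the nested `Alert` type, and the text layout are all hypothetical; in the Spring application, a class of this shape could presumably be filled by the framework's JSON binding instead of by hand.

```java
import java.util.Map;

public class AlertTextSketch {

    /** One entry of the "alerts" array; only the fields used below are modelled. */
    public static final class Alert {
        public final String status;
        public final Map<String, String> labels;
        public final Map<String, String> annotations;

        public Alert(String status, Map<String, String> labels, Map<String, String> annotations) {
            this.status = status;
            this.labels = labels;
            this.annotations = annotations;
        }
    }

    /** Renders one alert as a single line for an e-mail or SMS body. */
    public static String toLine(Alert alert) {
        return String.format("[%s] %s %s on %s: %s",
                alert.labels.getOrDefault("severity", "unknown"),
                alert.labels.getOrDefault("alertname", "unknown"),
                alert.status,
                alert.labels.getOrDefault("instance", "unknown"),
                alert.annotations.getOrDefault("summary", ""));
    }

    public static void main(String[] args) {
        // Values copied from the first alert in the payload above.
        Alert first = new Alert("firing",
                Map.of("alertname", "InstanceDown", "instance", "127.0.0.1:5001",
                        "job", "eureka", "severity", "critical"),
                Map.of("summary", "Instance has been down for more than 5 minutes"));
        // prints: [critical] InstanceDown firing on 127.0.0.1:5001: Instance has been down for more than 5 minutes
        System.out.println(toLine(first));
    }
}
```

Falling back to "unknown" for missing labels keeps the line readable even for rules that do not set a `severity` label.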
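That completes the integration of Prometheus with Alertmanager.
Note: we do not send e-mail alerts directly from Alertmanager because production generates a large volume of alerts. Instead, spring-cloud-alertmanager can store the received alerts in a message queue (MQ) or a database and then call the e-mail or SMS service. This also makes the choice of alerting channels more flexible.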
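The buffering idea from the note can be sketched as follows. An in-memory `BlockingQueue` stands in for the MQ or database here, purely as an assumption to keep the example self-contained; the class `AlertBufferSketch` and its method names are hypothetical, and a production setup would use a real broker or table instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AlertBufferSketch {

    // In-memory stand-in for the MQ or database mentioned in the note.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    /** Called from the webhook controller: store the raw message and return at once. */
    public void enqueue(String rawMessage) {
        queue.add(rawMessage);
    }

    /** Called from a worker: fetch up to max buffered messages for one e-mail or SMS batch. */
    public List<String> nextBatch(int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    public static void main(String[] args) {
        AlertBufferSketch buffer = new AlertBufferSketch();
        buffer.enqueue("alert-1");
        buffer.enqueue("alert-2");
        buffer.enqueue("alert-3");
        System.out.println(buffer.nextBatch(2)); // first two messages, in arrival order
        System.out.println(buffer.nextBatch(2)); // the remaining message
    }
}
```

Separating receipt from delivery in this way lets the webhook endpoint answer Alertmanager quickly, while a worker sends notifications at whatever pace the e-mail or SMS service allows.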