Preface
Option 1: JMX
https://help.aliyun.com/document_detail/141108.html?spm=a2c4g.11186623.6.621.12bb4dea7EyM9F
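Option 2: kafka_exporter
This article uses the second option. Compared with JMX, it does not consume JVM resources and cuts metric collection time from minutes to seconds, which makes it a better fit for monitoring large clusters.
Architecture
Image source: https://zhuanlan.zhihu.com/p/57704357
Installing kafka_exporter
Note: one Kafka cluster needs only one exporter. Deploy it on any one server in the cluster.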
-
Upload and extract
Download the kafka_exporter-1.2.0.linux-amd64.tar.gz package from https://github.com/danielqsj/kafka_exporter, upload it to the server, and extract it under /usr/local:
cd /usr/local
wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.2.0/kafka_exporter-1.2.0.linux-amd64.tar.gz
tar -xzvf kafka_exporter-1.2.0.linux-amd64.tar.gz
cd kafka_exporter-1.2.0.linux-amd64/
-
Configuration
Use the default configuration.
-
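Start
Change to the installation directory and run:
cd /usr/local/kafka_exporter-1.2.0.linux-amd64
nohup ./kafka_exporter --kafka.server=172.16.10.93:9092 &
Once the exporter is running, open http://172.16.10.93:9308/metrics (replace the IP and port with the values for your environment).
The scraped metrics look like this: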
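The same check can be done from the command line. The sketch below is only an illustration: `broker_count` is a hypothetical helper name, and the sample text is made up rather than captured from a real cluster. It reads Prometheus text-format output on standard input and prints the value of `kafka_brokers`, the broker-count metric that kafka_exporter exposes.

```shell
# Hypothetical helper: print the value of the kafka_brokers metric
# from Prometheus text-format input on stdin.
broker_count() {
  awk '$1 == "kafka_brokers" { print $2 }'
}

# Illustrative sample of what /metrics may contain (not real output).
sample='# HELP kafka_brokers Number of Brokers in the Kafka Cluster.
# TYPE kafka_brokers gauge
kafka_brokers 3'

printf '%s\n' "$sample" | broker_count
```

Against a live exporter, the equivalent would be `curl -s http://172.16.10.93:9308/metrics | broker_count`, assuming `curl` is installed on the host.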
Prometheus configuration
-
Configuration
Edit the Prometheus prometheus.yml file and add a scrape job for Kafka:
vi /usr/local/prometheus-2.15.1/prometheus.yml
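The scrape job itself is not reproduced in the text, so here is a minimal sketch of what it might look like. The helper name `kafka_scrape_job` and the job name `kafka_exporter` are assumptions chosen for illustration; the target address is the exporter address used earlier. The function prints a fragment meant to be pasted under the existing `scrape_configs:` key.

```shell
# Hypothetical helper: print a scrape job for one kafka_exporter target.
# The job name "kafka_exporter" is an assumption; any name works.
kafka_scrape_job() {
  target="$1"   # host:port of the exporter, e.g. 172.16.10.93:9308
  cat <<EOF
  - job_name: 'kafka_exporter'
    static_configs:
      - targets: ['${target}']
EOF
}

kafka_scrape_job "172.16.10.93:9308"
```

The indentation assumes the fragment sits directly under `scrape_configs:`. Adjust it if your file uses a different layout.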
-
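Start and verify
Kill the running Prometheus process, restart it with the commands below, then check the Targets page:
cd /usr/local/prometheus-2.15.1
nohup ./prometheus --config.file=prometheus.yml &
Note: State=UP means the target is being scraped successfully.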
Grafana configuration
-
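Import the dashboard template
Open Grafana in a browser: http://<Grafana-server-IP>:3000
Add a data source, choose Prometheus, enter the Prometheus server IP and port, and click Save.
Import the dashboard
Enter the dashboard ID 7589 and move the cursor down out of the input field, as shown below
The charts then fill with data.
The final dashboard, after importing the template above and adapting it to our own business needs: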
-
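Alert metrics
No. | Alert name | Alert rule | Description
1 | Broker count alert | Alert when the number of brokers falls below the threshold (< 3) |
2 | Consumer lag alert | Alert when the number of backlogged messages exceeds the threshold (> 1000) |
3 | Under-replicated partition alert | Alert when the number of under-replicated partitions exceeds the threshold (> 0) |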
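To make the three thresholds concrete, the sketch below evaluates them against exporter output. It is only an illustration: `check_kafka_alerts` is a hypothetical helper, the sample metrics are made up, and summing lag across all consumer groups is an assumption (the table does not say whether the 1000-message threshold is per group or in total). The metric names `kafka_brokers`, `kafka_consumergroup_lag` and `kafka_topic_partition_under_replicated_partition` are the ones kafka_exporter exposes.

```shell
# Hypothetical helper: evaluate the three thresholds from the table above
# against Prometheus text-format metrics read from stdin.
check_kafka_alerts() {
  awk '
    $1 == "kafka_brokers" { brokers = $NF + 0; seen = 1 }
    index($0, "kafka_consumergroup_lag{") == 1 { lag += $NF }
    index($0, "kafka_topic_partition_under_replicated_partition{") == 1 { under += $NF }
    END {
      if (seen && brokers < 3) print "ALERT brokers=" brokers
      if (lag > 1000)          print "ALERT lag=" lag
      if (under > 0)           print "ALERT under_replicated=" under
    }'
}

# Illustrative sample metrics (not captured from a real cluster).
sample='kafka_brokers 2
kafka_consumergroup_lag{consumergroup="g1",partition="0",topic="orders"} 700
kafka_consumergroup_lag{consumergroup="g1",partition="1",topic="orders"} 400
kafka_topic_partition_under_replicated_partition{partition="0",topic="orders"} 0'

printf '%s\n' "$sample" | check_kafka_alerts
```

In production these checks belong in Prometheus alert rules rather than in a script. The script form is just a quick way to see which thresholds a given snapshot would trip.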
-
Grafana dashboard references:
- https://grafana.com/grafana/dashboards/7589 (recommended)
- https://grafana.com/grafana/dashboards/9018 (for reference, newer)
- https://grafana.com/grafana/dashboards/9947 (for reference, newer)
- https://grafana.com/grafana/dashboards/10973 (JMX, Alibaba Cloud)
- https://www.menina.cn/article/88
- https://cloud.tencent.com/developer/news/377416
Miscellaneous
- Register the exporter as a system service that starts at boot
## Prepare the unit file
cat <<\EOF >/etc/systemd/system/kafka_exporter.service
[Unit]
Description=Kafka exporter for Prometheus
Documentation=https://github.com/danielqsj/kafka_exporter

[Service]
# Adjust the binary path and broker address to your environment
ExecStart=/usr/local/kafka_exporter/kafka_exporter --kafka.server=192.168.50.16:9092

[Install]
WantedBy=multi-user.target
EOF

## Start the service and enable it at boot
systemctl daemon-reload
systemctl enable kafka_exporter.service
systemctl stop kafka_exporter.service
systemctl start kafka_exporter.service
systemctl status kafka_exporter.service
Alert rules:
cat kafka_prometheusRule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: kafka-prometheus-rules
  namespace: monitoring
spec:
  groups:
  - name: kafka.rules
    rules:
    - alert: KafkaTopicsReplicas
      expr: sum(kafka_topic_partition_in_sync_replica) by (topic) < 1
      for: 1m
      labels:
        severity: critical
      annotations:
        title: 'Kafka topic has less than 1 in-sync replica'
        description: "Topic: {{ $labels.topic }} has less than 1 in-sync replica, Current Value: {{ $value }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    - alert: KafkaConsumersGroupLag
      expr: sum(kafka_consumergroup_lag) by (consumergroup) > 50
      for: 1m
      labels:
        severity: critical
      annotations:
        title: 'Kafka consumer group lag'
        description: "Kafka consumer group is lagging (Lag > 50), Lag value: {{ $value }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
    - alert: KafkaConsumersTopicLag
      expr: sum(kafka_consumergroup_lag) by (topic) > 50
      for: 1m
      labels:
        severity: critical
      annotations:
        title: 'Kafka topic consumer lag'
        description: "Kafka topic consumption is lagging (Lag > 50), Lag value: {{ $value }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
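The rules above are a PrometheusRule resource for the Prometheus Operator, and none of them covers the broker-count alert from the alert table. For a standalone Prometheus such as the 2.15.1 install used earlier, a plain rule file listed under `rule_files` in prometheus.yml would be needed instead. The sketch below shows one possible form. The helper name `kafka_broker_rule`, the alert name `KafkaBrokerCount`, and the `for` and `severity` values are assumptions; only the `< 3` threshold comes from the table.

```shell
# Hypothetical helper: print a standalone Prometheus rule file containing
# the broker-count alert (threshold < 3 taken from the alert table).
kafka_broker_rule() {
  # Quoted 'EOF' keeps {{ $value }} literal instead of expanding $value.
  cat <<'EOF'
groups:
  - name: kafka.rules
    rules:
      - alert: KafkaBrokerCount
        expr: kafka_brokers < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          title: 'Kafka broker count below 3'
          description: "Current broker count: {{ $value }}"
EOF
}

kafka_broker_rule
```

One way to use it is `kafka_broker_rule > kafka_rules.yml`, then reference that file under `rule_files:` and validate it with `promtool check rules kafka_rules.yml` before restarting Prometheus.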