Configuring Custom Alerts for a k8s Cluster

Prerequisite: Prometheus and Alertmanager are already deployed correctly in the k8s cluster.

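Many alert rules are already predefined in the deployed Prometheus, and we can also define our own. This article uses custom alerts on container memory and CPU usage as an example.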

I. Create a PrometheusRule

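PrometheusRule is a CRD created when Prometheus is deployed through the Prometheus Operator. PrometheusRule resources define recording rules (in resources whose names end with record) and alerting rules (in resources whose names end with alert). Running kubectl get prometheusrule lists the PrometheusRule resources created during deployment; add -n with the relevant namespace if they are not in the current one.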

To add custom alerts, we only need to create a new PrometheusRule and, following the naming convention above, give it a name that ends with alert.

The YAML file for the custom PrometheusRule is as follows:
```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: prometheus-rules
    cato: node
    release: prometheus-operator
  name: test.alert
  namespace: kube-system
spec:
  groups:
  - name: test.alert.rules
    rules:
    - alert: CPUUsageHigh
      annotations:
        description: 'node: {{ $labels.node }}, namespace: {{ $labels.namespace }},
          pod: {{ $labels.pod_name }}, container: {{ $labels.container_name }}, value:
          {{ $value }}'
        summary: Container CPU Usage > 0.8 for 2m
      expr: ke:container:cpu_util{namespace="test"} > 0.8
      for: 2m
      labels:
        severity: warning
    - alert: MemUsageHigh
      annotations:
        description: 'node: {{ $labels.node }}, namespace: {{ $labels.namespace }},
          pod: {{ $labels.pod_name }}, container: {{ $labels.container_name }}, value:
          {{ $value }}'
        summary: Container Mem Usage > 0.8 for 2m
      expr: ke:container:mem_util{namespace="test"} > 0.8
      for: 2m
      labels:
        severity: warning
```
A few points to note:

1. The labels of this rule must match the ruleSelector field of your Prometheus.

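When Prometheus is deployed, a CRD named Prometheus is created as well. Run kubectl get prometheus -n xxx to find the name of the Prometheus instance in the cluster, then run kubectl get prometheus -n xxx [name of the Prometheus in the cluster] -o yaml and look up its ruleSelector field. Copy the label key/value pairs listed there (typically under matchLabels) verbatim into the labels field of the YAML file.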

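2. The rules.expr field defines the expression that the alert evaluates. The braces after the metric name restrict the alert to the test namespace. To add further conditions, separate them with commas inside the braces.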
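
The label requirement in note 1 can be illustrated with a small Python sketch. It assumes the ruleSelector uses only matchLabels (selectors may also carry matchExpressions, which this sketch ignores). The function name rule_is_selected and the selector values are hypothetical; the rule labels are the ones from the YAML above.

```python
def rule_is_selected(match_labels, rule_labels):
    """Return True if every matchLabels key/value pair is present on the rule."""
    return all(rule_labels.get(key) == value for key, value in match_labels.items())


# Labels from metadata.labels of the PrometheusRule above.
rule_labels = {
    "app": "prometheus-rules",
    "cato": "node",
    "release": "prometheus-operator",
}

# Hypothetical ruleSelector.matchLabels; read the real values from your cluster.
selector = {"app": "prometheus-rules", "release": "prometheus-operator"}

print(rule_is_selected(selector, rule_labels))                # True
print(rule_is_selected({"release": "other"}, rule_labels))    # False
```

Extra labels on the rule (such as cato here) do not prevent a match; only a missing or different selector label does.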

Once the file is written, create the resource with kubectl create -f prometheusrule.yaml.

II. Test and Verify

To see results quickly during verification, change ke:container:mem_util{namespace="test"} > 0.8 to > 0.0001.
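
As a rough illustration of how these expressions are put together, the sketch below assembles the production and test variants from the same parts, including an extra comma-separated condition. The helper build_alert_expr is hypothetical, and the pod_name value demo-pod is made up; it also assumes the metric carries a pod_name label, as the annotations in the rule suggest.

```python
def build_alert_expr(metric, matchers, threshold):
    """Build `metric{k="v",...} > threshold` from a dict of equality matchers."""
    selector = ",".join(f'{key}="{value}"' for key, value in matchers.items())
    return f"{metric}{{{selector}}} > {threshold}"


# Expression used in the rule, and the lowered threshold used for quick testing.
# Thresholds are passed as strings so they appear exactly as written.
print(build_alert_expr("ke:container:mem_util", {"namespace": "test"}, "0.8"))
print(build_alert_expr("ke:container:mem_util", {"namespace": "test"}, "0.0001"))

# An additional condition is just another matcher, separated by a comma.
# "demo-pod" is a made-up pod name.
print(build_alert_expr(
    "ke:container:mem_util",
    {"namespace": "test", "pod_name": "demo-pod"},
    "0.8",
))
```

Remember to restore the original threshold once the alert has been seen to fire.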

After creating the rule, enter the Prometheus pod and look under /etc/prometheus/rules/prometheus-rulefiles-0. After a while, the test.alert.yaml file should appear there.

Go to the Prometheus web UI, click Alerts, and check whether the two custom alerts are listed and whether any of them are in an active state.
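
The same check can also be done against the Prometheus HTTP API, whose /api/v1/alerts endpoint returns the currently active alerts as JSON. The sketch below only shows how such a response might be filtered: the helper find_alerts is hypothetical, the sample payload is trimmed and made up, and fetching it from your own Prometheus address is left out.

```python
import json


def find_alerts(payload, alert_names):
    """Return (alertname, state) pairs for the alerts whose name is in alert_names."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        (alert["labels"]["alertname"], alert["state"])
        for alert in alerts
        if alert["labels"].get("alertname") in alert_names
    ]


# Made-up, trimmed example of a /api/v1/alerts response body.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "MemUsageHigh", "namespace": "test", "severity": "warning"},
       "state": "firing", "value": "0.91"},
      {"labels": {"alertname": "CPUUsageHigh", "namespace": "test", "severity": "warning"},
       "state": "pending", "value": "0.85"},
      {"labels": {"alertname": "Watchdog", "severity": "none"},
       "state": "firing", "value": "1"}
    ]
  }
}
""")

print(find_alerts(sample, {"CPUUsageHigh", "MemUsageHigh"}))
# [('MemUsageHigh', 'firing'), ('CPUUsageHigh', 'pending')]
```

An alert stays pending until its condition has held for the duration given in for (2m in the rule above), and then becomes firing.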

Original article: https://www.cnblogs.com/00986014w/p/12836585.html