• Alertmanager alerting rules explained


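    This article covers the alerting and notification rules of Prometheus and Alertmanager. Prometheus is configured in prometheus.yml, and Alertmanager in alertmanager.yml.

    Alerting: Prometheus sends the abnormal events it detects to Alertmanager. It does not mean sending email notifications.
    Notification: Alertmanager sends notifications about those events (email, webhook, etc.).

    Alerting rules

    Set the interval at which alerting rules are evaluated in prometheus.yml: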

    # How frequently to evaluate rules.
    [ evaluation_interval: <duration> | default = 1m ]
    

    Specify the rule files in prometheus.yml (wildcards are allowed, e.g. rules/*.rules):

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
     - "/etc/prometheus/alert.rules"
    

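    Rules follow this template (Prometheus 1.x syntax; Prometheus 2.0 and later use the YAML format shown in the ConfigMap further down):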

    ALERT <alert name>
      IF <expression>
      [ FOR <duration> ]
      [ LABELS <label set> ]
      [ ANNOTATIONS <label set> ]
    

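    Where:

    Alert name is the identifier of the alert. It does not need to be unique.

    Expression is the condition that is evaluated to decide whether the alert fires. It usually uses existing metrics, such as those returned by the /metrics endpoint.

    Duration is how long the condition must hold before the alert fires. For example, 5s means 5 seconds.

    Label set is a set of labels that will be used in the message templates.

    In prometheus-k8s-statefulset.yaml, create a ruleSelector that labels the alerting-rule role. The rule file prometheus-k8s-rules.yaml then carries the same labels: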

      ruleSelector:
        matchLabels:
          role: prometheus-rulefiles
          prometheus: k8s
    

    In prometheus-k8s-rules.yaml, define the rules as a ConfigMap labeled with the prometheus-rulefiles role:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-k8s-rules
      namespace: monitoring
      labels:
        role: prometheus-rulefiles
        prometheus: k8s
    data:
      pod.rules.yaml: |+
        groups:
        - name: noah_pod.rules
          rules:
          - alert: Pod_all_cpu_usage
            expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
            for: 5m
            labels:
              severity: critical
              service: pods
            annotations:
              description: Container {{ $labels.name }} CPU usage is above 10% (current value is {{ $value }})
              summary: Dev CPU load alert
          - alert: Pod_all_memory_usage
            expr: avg by(name)(container_memory_usage_bytes{name!=""}) > 2*1024^3
            for: 10m
            labels:
              severity: critical
            annotations:
              description: Container {{ $labels.name }} memory usage is above 2 GiB (current value is {{ $value }})
              summary: Dev memory load alert
          - alert: Pod_all_network_receive_usage
            expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
            for: 10m
            labels:
              severity: critical
            annotations:
              description: Container {{ $labels.name }} network receive rate is above 50 MiB/s (current value is {{ $value }})
              summary: Network receive load alert
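
    The ruleSelector and the ConfigMap are tied together by ordinary Kubernetes label matching: a matchLabels selector picks every object whose labels contain all of the listed key/value pairs. A minimal Python sketch of that rule follows; selector_matches is a hypothetical helper written for illustration, not part of any Kubernetes client library.

```python
def selector_matches(match_labels: dict, object_labels: dict) -> bool:
    """Return True when every key/value pair in match_labels is present in object_labels."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


# Labels copied from the ruleSelector and from the ConfigMap metadata above.
rule_selector = {"role": "prometheus-rulefiles", "prometheus": "k8s"}
configmap_labels = {"role": "prometheus-rulefiles", "prometheus": "k8s"}

print(selector_matches(rule_selector, configmap_labels))                  # True
print(selector_matches(rule_selector, {"role": "prometheus-rulefiles"}))  # False
```

    A ConfigMap that is missing either label is simply not picked up, which is a common reason for rules that never appear in Prometheus.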
    

    Once the configuration is in place, prometheus-operator reloads it automatically.
    If you modify the ConfigMap later, you only need to apply it again:

    kubectl apply -f prometheus-k8s-rules.yaml
    

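    Compare the email notification with the rules (alertmanager.yml must also be configured before any email arrives).

    Alert aggregation

    Notification rules

    Configure route and receivers in alertmanager.yml: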

    global:
      # ResolveTimeout is the time after which an alert is declared resolved
      # if it has not been updated.
      resolve_timeout: 5m
    
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'xxxxx'
      smtp_from: 'xxxxxxx'
      smtp_auth_username: 'xxxxx'
      smtp_auth_password: 'xxxxxx'
      # The API URL to use for Slack notifications.
      slack_api_url: 'https://hooks.slack.com/services/some/api/token'
    
    # The directory from which notification templates are read.
    templates:
    - '*.tmpl'
    
    # The root route on which each incoming alert enters.
    route:
    
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
    
      group_by: ['alertname', 'cluster', 'service']
    
      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
    
      group_wait: 30s
    
      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
    
      group_interval: 5m
    
      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
    
      #repeat_interval: 1m
      repeat_interval: 15m
    
      # A default receiver
    
      # If an alert isn't caught by a route, send it to default.
      receiver: default
    
      # All the above attributes are inherited by all child routes and can be
      # overwritten on each.
    
      # The child route trees.
      routes:
      - match:
          severity: critical
        receiver: email_alert
    
    receivers:
    - name: 'default'
      email_configs:
      - to: 'yi.hu@dianrong.com'
        send_resolved: true
    
    - name: 'email_alert'
      email_configs:
      - to: 'yi.hu@dianrong.com'
        send_resolved: true
    

    Terminology

    Route

    The route attribute defines how alerts are dispatched. It is a tree, and matching proceeds depth-first, from left to right.

    // Match does a depth-first left-to-right search through the route tree
    // and returns the matching routing nodes.
    func (r *Route) Match(lset model.LabelSet) []*Route {
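
    The depth-first, left-to-right search described above can be sketched in Python. This is a simplified model under stated assumptions, not Alertmanager's code: the Route class and match_routes function are hypothetical names, only exact-value matchers are modeled, and the continue_ field stands in for Alertmanager's continue option, which the configuration above does not use.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)    # label name -> required value
    routes: list = field(default_factory=list)   # child routes, in config order
    continue_: bool = False                      # models the 'continue' option

    def matches(self, labels: dict) -> bool:
        return all(labels.get(k) == v for k, v in self.match.items())


def match_routes(route: Route, labels: dict) -> list:
    """Depth-first, left-to-right search; returns the deepest matching routes."""
    if not route.matches(labels):
        return []
    found = []
    for child in route.routes:
        child_matches = match_routes(child, labels)
        found.extend(child_matches)
        if child_matches and not child.continue_:
            break  # the first matching child wins unless it asks to continue
    # If no child matched, the current node itself is the match.
    return found or [route]


# The route tree from the alertmanager.yml above.
root = Route(receiver="default", routes=[
    Route(receiver="email_alert", match={"severity": "critical"}),
])
print([r.receiver for r in match_routes(root, {"severity": "critical"})])  # ['email_alert']
print([r.receiver for r in match_routes(root, {"severity": "warning"})])   # ['default']
```

    The root route has no matchers, so every alert matches it and falls back to the default receiver when no child route applies.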
    

    Alert

    An Alert is an alert received by Alertmanager. Its type is defined as follows.

    // Alert is a generic representation of an alert in the Prometheus eco-system.
    type Alert struct {
        // Label value pairs for purpose of aggregation, matching, and disposition
        // dispatching. This must minimally include an "alertname" label.
        Labels LabelSet `json:"labels"`
    
        // Extra key/value information which does not define alert identity.
        Annotations LabelSet `json:"annotations"`
    
        // The known time range for this alert. Both ends are optional.
        StartsAt     time.Time `json:"startsAt,omitempty"`
        EndsAt       time.Time `json:"endsAt,omitempty"`
        GeneratorURL string    `json:"generatorURL"`
    }
    

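    Only alerts with identical Labels (the same keys and the same values) are considered the same alert. A single rule in a Prometheus rules file can therefore produce several distinct alerts.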
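
    As a rough illustration of this identity rule, the sketch below derives an order-independent key from the label set. Alertmanager itself computes a fingerprint of the labels; the tuple used here only stands in for it, and alert_identity and the container names web-1 and web-2 are made up for the example.

```python
def alert_identity(labels: dict) -> tuple:
    """Order-independent key: two alerts are the same only if all label pairs match."""
    return tuple(sorted(labels.items()))


a = {"alertname": "Pod_all_cpu_usage", "name": "web-1", "severity": "critical"}
b = {"severity": "critical", "name": "web-1", "alertname": "Pod_all_cpu_usage"}
c = {"alertname": "Pod_all_cpu_usage", "name": "web-2", "severity": "critical"}

print(alert_identity(a) == alert_identity(b))  # True: same pairs, different order
print(alert_identity(a) == alert_identity(c))  # False: the 'name' label differs
```

    Annotations play no part in the key, which matches the comment in the struct above: they do not define alert identity.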

    Group

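    Alertmanager groups alerts according to the group_by setting. With the rules below, three alerts are received when go_goroutines equals 4, and Alertmanager sends them to the receivers as two groups. (The two-group split assumes grouping on a single label such as label1: test1 has label1="l1", while test2 and test3 share label1="l2".)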

    ALERT test1
      IF go_goroutines > 1
      LABELS {label1="l1", label2="l2", status="test"}
    ALERT test2
      IF go_goroutines > 2
      LABELS {label1="l2", label2="l2", status="test"}
    ALERT test3
      IF go_goroutines > 3
      LABELS {label1="l2", label2="l1", status="test"}
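
    The rules above do not state which group_by is in effect, so the following Python sketch assumes group_by: ['label1'] to reproduce the two groups. group_alerts is a hypothetical helper; it only shows how the grouping key is built from label values.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: list) -> dict:
    """Bucket alerts by the values of the group_by labels (missing labels count as '')."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels["alertname"])
    return dict(groups)


# The three alerts produced by the rules above when go_goroutines equals 4.
alerts = [
    {"alertname": "test1", "label1": "l1", "label2": "l2", "status": "test"},
    {"alertname": "test2", "label1": "l2", "label2": "l2", "status": "test"},
    {"alertname": "test3", "label1": "l2", "label2": "l1", "status": "test"},
]

print(group_alerts(alerts, ["label1"]))
# {('l1',): ['test1'], ('l2',): ['test2', 'test3']}
print(len(group_alerts(alerts, ["alertname"])))  # 3
print(len(group_alerts(alerts, ["status"])))     # 1
```

    Grouping on alertname would yield three groups and grouping on status only one, so the number of notifications depends entirely on the chosen labels.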
    

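    Main processing flow

    1. When an Alert is received, its labels determine which Routes it belongs to (there can be several Routes; a Route has many Groups, and a Group has many Alerts).

    2. The Alert is assigned to a Group; if no matching Group exists, a new one is created.

    3. A new Group waits for the time set by group_wait (more Alerts for the same Group may arrive during the wait), uses resolve_timeout to decide whether each Alert is resolved, and then sends the notification.

    4. An existing Group waits for the time set by group_interval, checks whether its Alerts are resolved, and sends a notification when the time since the last notification exceeds repeat_interval or when the Group has been updated.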
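
    Steps 3 and 4 boil down to one decision taken each time a Group is flushed. The sketch below is a simplified model of that decision, not Alertmanager's implementation; should_notify is a hypothetical function and times are plain seconds.

```python
from typing import Optional


def should_notify(now: float, last_sent: Optional[float],
                  group_changed: bool, repeat_interval: float) -> bool:
    """Decide, at a flush of the group, whether a notification goes out.

    A flush happens group_wait after a new group is created,
    then once every group_interval.
    """
    if last_sent is None:   # new group: first flush, after group_wait
        return True
    if group_changed:       # alerts were added or resolved since the last send
        return True
    return now - last_sent > repeat_interval


REPEAT_INTERVAL = 15 * 60  # repeat_interval: 15m from the config above, in seconds

# With group_wait: 30s and group_interval: 5m, flushes occur at 30 s, 330 s, 630 s, ...
print(should_notify(30, None, False, REPEAT_INTERVAL))   # True: first notification
print(should_notify(330, 30, False, REPEAT_INTERVAL))    # False: nothing changed
print(should_notify(330, 30, True, REPEAT_INTERVAL))     # True: the group was updated
print(should_notify(1230, 30, False, REPEAT_INTERVAL))   # True: 1200 s > 900 s
```

    Because the check only runs at flush time, a repeat is effectively delayed until the first group_interval tick after repeat_interval has passed.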

    Alertmanager

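    Alertmanager is a buffer for alerts. It has the following characteristics:

    It can receive alerts through a dedicated endpoint (which is not specific to Prometheus).

    It can route alerts to receivers such as HipChat, email, or others.

    It is smart enough to recognize that a similar notification has already been sent, so when something goes wrong you are not flooded with thousands of emails.

    An Alertmanager client (Prometheus in this case) sends a POST request containing all the alerts to be handled to /api/v1/alerts. For example: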

    [
      {
        "labels": {
          "alertname": "low_connected_users",
          "severity": "warning"
        },
        "annotations": {
          "description": "Instance play-app:9000 under lower load",
          "summary": "play-app:9000 of job playframework-app is under lower load"
        }
      }
    ]
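
    For experimenting with this endpoint by hand, a request like the one above can be built with Python's standard library. This is only a sketch: build_alert_request is a hypothetical helper, http://localhost:9093 is an assumed address, and the path is the one quoted above (newer Alertmanager releases expose /api/v2/alerts instead, so check your version). The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_alert_request(base_url: str, alerts: list) -> urllib.request.Request:
    """Build (but do not send) the POST request carrying a list of alerts."""
    body = json.dumps(alerts).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


alerts = [{
    "labels": {"alertname": "low_connected_users", "severity": "warning"},
    "annotations": {
        "description": "Instance play-app:9000 under lower load",
        "summary": "play-app:9000 of job playframework-app is under lower load",
    },
}]

# Assumed address; adjust it to wherever your Alertmanager listens.
request = build_alert_request("http://localhost:9093", alerts)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```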
    

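    Alert workflow

    Once these alerts are stored in Alertmanager, they can be in any of the following states:

    (Figure: alert flow)

    • Inactive: nothing is happening.

    • Pending: the client has told us that this alert must fire. However, the alert may still be grouped, inhibited, or silenced. Once all checks have passed, it moves to Firing.

    • Firing: the alert is sent to the Notification Pipeline, which contacts all of the alert's receivers. When the client later reports that the alert has cleared, it returns to the Inactive state.

    Prometheus has a dedicated endpoint that lists all alerts and lets us follow their state transitions. The states shown by Prometheus, and the conditions that lead to each transition, are as follows:

    The rule is not met. The alert is not active.

    The rule is met. The alert is now active. Some checks are performed to avoid flooding the receivers with messages.

    The alert is sent to the receivers.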
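
    On the Prometheus side, the step from Pending to Firing is governed by the rule's for duration: the expression has to stay true for that long before the alert fires. A rough Python sketch of the three states follows; alert_state is a hypothetical function and it ignores the granularity introduced by evaluation_interval.

```python
def alert_state(condition_true_since, now, for_duration):
    """State of one alert as Prometheus would report it.

    condition_true_since: time the expression first became true,
    or None if the expression is currently false.
    """
    if condition_true_since is None:
        return "inactive"
    if now - condition_true_since < for_duration:
        return "pending"
    return "firing"


FOR = 5 * 60  # for: 5m, as in the Pod_all_cpu_usage rule, in seconds

print(alert_state(None, 1000, FOR))  # inactive
print(alert_state(900, 1000, FOR))   # pending: true for only 100 s
print(alert_state(600, 1000, FOR))   # firing: true for 400 s
```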

     

    Receiver

    As the name suggests, a receiver configures where alerts are delivered.
    General configuration format:

    # The unique name of the receiver.
    name: <string>

    # Configurations for several notification integrations.
    email_configs:
      [ - <email_config>, ... ]
    pagerduty_configs:
      [ - <pagerduty_config>, ... ]
    slack_configs:
      [ - <slack_config>, ... ]
    opsgenie_configs:
      [ - <opsgenie_config>, ... ]
    webhook_configs:
      [ - <webhook_config>, ... ]

    Email receiver: email_config

    # Whether or not to notify about resolved alerts.
    [ send_resolved: <boolean> | default = false ]

    # The email address to send notifications to.
    to: <tmpl_string>
    # The sender address.
    [ from: <tmpl_string> | default = global.smtp_from ]
    # The SMTP host through which emails are sent.
    [ smarthost: <string> | default = global.smtp_smarthost ]

    # The HTML body of the email notification.
    [ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]

    # Further headers email header key/value pairs. Overrides any headers
    # previously set by the notification implementation.
    [ headers: { <string>: <tmpl_string>, ... } ]


    Slack receiver: slack_config

    # Whether or not to notify about resolved alerts.
    [ send_resolved: <boolean> | default = false ]

    # The Slack webhook URL.
    [ api_url: <string> | default = global.slack_api_url ]

    # The channel or user to send notifications to.
    channel: <tmpl_string>

    # API request data as defined by the Slack webhook API.
    [ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
    [ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}' ]
    [ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
    [ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
    [ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
    [ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
    [ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]

    Webhook receiver: webhook_config

    # Whether or not to notify about resolved alerts.
    [ send_resolved: <boolean> | default = true ]

    # The endpoint to send HTTP POST requests to.
    url: <string>

    Alertmanager sends HTTP POST requests to the configured endpoint in the following format:

    {
      "version": "2",
      "status": "<resolved|firing>",
      "alerts": [
        {
          "labels": <object>,
          "annotations": <object>,
          "startsAt": "<rfc3339>",
          "endsAt": "<rfc3339>"
        },
        ...
      ]
    }
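
    A webhook endpoint mainly has to parse this body and act on each alert. The sketch below handles only the fields listed above; summarize_webhook is a hypothetical helper, and the sample values (alert name, summary, timestamps) are made up for the example.

```python
import json


def summarize_webhook(body: str) -> list:
    """Turn one webhook POST body into one human-readable line per alert."""
    payload = json.loads(body)
    status = payload["status"]
    lines = []
    for alert in payload["alerts"]:
        name = alert["labels"].get("alertname", "unnamed")
        summary = alert["annotations"].get("summary", "")
        lines.append(f"[{status}] {name}: {summary}")
    return lines


# Sample body in the format above; all values are illustrative.
sample = json.dumps({
    "version": "2",
    "status": "firing",
    "alerts": [{
        "labels": {"alertname": "Pod_all_cpu_usage", "severity": "critical"},
        "annotations": {"summary": "Dev CPU load alert"},
        "startsAt": "2018-01-01T00:00:00Z",
        "endsAt": "0001-01-01T00:00:00Z",
    }],
})
print(summarize_webhook(sample))
# ['[firing] Pod_all_cpu_usage: Dev CPU load alert']
```

    In a real receiver this function would sit behind an HTTP handler that reads the request body and returns a 2xx status.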

    Inhibition

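    Inhibition is a mechanism that, once an alert is firing, suppresses notifications for other alerts that were caused by it.

    For example, when an alert fires to report that an entire cluster is unreachable, Alertmanager can be configured to ignore all other alerts triggered as a consequence. This prevents notifications for hundreds or thousands of alerts that are unrelated to the actual problem.
    Inhibition is configured in the Alertmanager configuration file.

    Inhibition suppresses notifications for some alerts while other alerts are firing. For example, if the same alert (based on the alert name) is already firing at critical severity, we can configure an inhibition that mutes any warning-level notification. The relevant part of alertmanager.yml looks like this: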

    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname']
    

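    An inhibition rule mutes alerts that match one set of matchers (the targets) while an alert that matches another set of matchers (the source) exists. The two alerts must have equal values for the labels listed in equal.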

    # Matchers that have to be fulfilled in the alerts to be muted.
    target_match:
      [ <labelname>: <labelvalue>, ... ]
    target_match_re:
      [ <labelname>: <regex>, ... ]
    
    # Matchers for which one or more alerts have to exist for the
    # inhibition to take effect.
    source_match:
      [ <labelname>: <labelvalue>, ... ]
    source_match_re:
      [ <labelname>: <regex>, ... ]
    
    # Labels that must have an equal value in the source and target
    # alert for the inhibition to take effect.
    [ equal: '[' <labelname>, ... ']' ]
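
    The interaction of source_match, target_match and equal can be sketched as follows. This is a simplified model: is_inhibited is a hypothetical function, regex matchers are left out, equal is assumed to list the alertname label, and the disk_full alert is invented for the example.

```python
def matches(labels: dict, matcher: dict) -> bool:
    return all(labels.get(k) == v for k, v in matcher.items())


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """True if some firing source alert mutes the target under this rule."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Source and target must agree on every label named in 'equal'.
        if all(source.get(name) == target.get(name) for name in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname"],
}
critical = {"alertname": "low_connected_users", "severity": "critical"}
warning = {"alertname": "low_connected_users", "severity": "warning"}
other = {"alertname": "disk_full", "severity": "warning"}
firing = [critical, warning, other]

print(is_inhibited(warning, firing, rule))  # True: the same alertname is critical
print(is_inhibited(other, firing, rule))    # False: no critical disk_full alert
```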
    

    Silences

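    Silences are a quick way to mute alerts temporarily. They are configured directly on a dedicated page of the Alertmanager web console. This is useful for avoiding a flood of notifications while you are working on a serious production issue.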
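
    Conceptually, a silence is a set of label matchers plus a time window, and an alert is muted while it matches an active silence. A minimal Python sketch under that assumption is shown below; is_silenced and the two-hour silence on service=pods are hypothetical, and times are plain seconds.

```python
def is_silenced(labels: dict, now: float, silences: list) -> bool:
    """True if any active silence has matchers that all match the alert's labels."""
    for silence in silences:
        active = silence["starts_at"] <= now < silence["ends_at"]
        if active and all(labels.get(k) == v for k, v in silence["matchers"].items()):
            return True
    return False


# A hypothetical two-hour silence for everything labelled service=pods.
silences = [{"matchers": {"service": "pods"}, "starts_at": 0, "ends_at": 7200}]
alert = {"alertname": "Pod_all_cpu_usage", "service": "pods", "severity": "critical"}

print(is_silenced(alert, 3600, silences))  # True: inside the window
print(is_silenced(alert, 9000, silences))  # False: the silence has expired
```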




    https://www.kancloud.cn/huyipow/prometheus/527563

  • Original article: https://www.cnblogs.com/yx88/p/11555431.html