• Alertmanager


    Data structure of an alert as received by Alertmanager:
    type Alert struct {
       Status       string    `json:"status"`
       Labels       KV        `json:"labels"`
       Annotations  KV        `json:"annotations"`
       StartsAt     time.Time `json:"startsAt"`
       EndsAt       time.Time `json:"endsAt"`
       GeneratorURL string    `json:"generatorURL"`
       Fingerprint  string    `json:"fingerprint"`
    }
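    Only Alerts with identical labels are considered the same alert. A single rule configured in a Prometheus rules file may therefore produce several distinct alerts.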
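    The identity rule above can be sketched in Go as follows. This is a minimal illustration, not Alertmanager's own fingerprint algorithm: the helper name labelSetKey and the choice of FNV-1a hashing are assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// KV mirrors the label map type used in the Alert struct.
type KV map[string]string

// labelSetKey is a hypothetical helper: it hashes the sorted label pairs so
// that two alerts with identical labels always produce the same key.
func labelSetKey(labels KV) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random, so sort first

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // separator so "ab"+"c" differs from "a"+"bc"
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	a := KV{"alertname": "HighCPU", "instance": "node-1"}
	b := KV{"instance": "node-1", "alertname": "HighCPU"}
	c := KV{"alertname": "HighCPU", "instance": "node-2"}

	// a and b carry the same labels, so they are the same alert.
	fmt.Println(labelSetKey(a) == labelSetKey(b))
	// c differs in the instance label, so one rule yields a second alert.
	fmt.Println(labelSetKey(a) == labelSetKey(c))
}
```

    The third alert shows how one rule evaluated against several instances yields several alerts: only the instance label differs, yet the key changes.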
     
    At startup, Alertmanager loads the configuration file specified with the --config.file flag.
    • Global configuration
    global:
      # The default SMTP From header field.
      [ smtp_from: <tmpl_string> ]
      # The default SMTP smarthost used for sending emails, including port number.
      # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
      # Example: smtp.example.org:587
      [ smtp_smarthost: <string> ]
      # The default hostname to identify to the SMTP server.
      [ smtp_hello: <string> | default = "localhost" ]
      # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
      [ smtp_auth_username: <string> ]
      # SMTP Auth using LOGIN and PLAIN.
      [ smtp_auth_password: <secret> ]
      # SMTP Auth using PLAIN.
      [ smtp_auth_identity: <string> ]
      # SMTP Auth using CRAM-MD5.
      [ smtp_auth_secret: <secret> ]
      # The default SMTP TLS requirement.
      # Note that Go does not support unencrypted connections to remote SMTP endpoints.
      [ smtp_require_tls: <bool> | default = true ]
     
      # The API URL to use for Slack notifications.
      [ slack_api_url: <secret> ]
      [ slack_api_url_file: <filepath> ]
      [ victorops_api_key: <secret> ]
      [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
      [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
      [ opsgenie_api_key: <secret> ]
      [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
      [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
      [ wechat_api_secret: <secret> ]
      [ wechat_api_corp_id: <string> ]
     
      # The default HTTP client configuration
      [ http_config: <http_config> ]
     
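      # Default time after which an alert that does not include EndsAt is declared resolved, if it has not been updated in the meantime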
      [ resolve_timeout: <duration> | default = 5m ]
    • templates
    • route
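    Routes are organized into a routing tree.
    Every alert first enters at the root node, which must match all alerts.
    Once an alert matches a node, it is tested against each of that node's child routes in order.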
     
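    receiver: the name of the receiver to which matching alerts are sent
    group_by: a list of label keys; Alerts are grouped by the values of these labels, and Alerts in the same group are combined into one notification to the receiver
    continue: whether an alert that has matched this route should go on to be tested against subsequent sibling routes
    match and matchers: key-value matching rules that an alert's labels must satisfy
    group_wait: how long to wait before sending the first notification for a newly created group; alerts that arrive in the meantime are combined into it
    group_interval: how long to wait before sending a notification about new alerts added to a group that has already been notified
    repeat_interval: how long to wait before re-sending a notification that has already been sent
    mute_time_intervals: names of the intervals defined in the top-level mute_time_intervals section during which this route is muted
    routes: a list of child routes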
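    The traversal of the routing tree described above can be sketched in Go. The Route type here is a simplified, hypothetical stand-in: it supports only exact match pairs, and every route names its receiver explicitly.

```go
package main

import "fmt"

// Route is a simplified, hypothetical stand-in for one node of the routing tree.
type Route struct {
	Receiver string
	Match    map[string]string // every pair must equal the alert's label value
	Continue bool              // keep testing later siblings after a match
	Routes   []*Route          // child routes
}

// matches reports whether all match pairs are satisfied by the labels.
func (r *Route) matches(labels map[string]string) bool {
	for k, v := range r.Match {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// Resolve returns the receivers of the deepest routes that match the labels.
func (r *Route) Resolve(labels map[string]string) []string {
	if !r.matches(labels) {
		return nil
	}
	var receivers []string
	for _, child := range r.Routes {
		got := child.Resolve(labels)
		receivers = append(receivers, got...)
		// Stop at the first matching child unless it sets continue.
		if len(got) > 0 && !child.Continue {
			break
		}
	}
	if len(receivers) == 0 {
		// No child matched, so this node handles the alert itself.
		receivers = []string{r.Receiver}
	}
	return receivers
}

func main() {
	// The root has no match rules, so it matches every alert.
	root := &Route{Receiver: "default", Routes: []*Route{
		{Receiver: "db-team", Match: map[string]string{"service": "mysql"}, Continue: true},
		{Receiver: "pager", Match: map[string]string{"severity": "critical"}},
	}}

	fmt.Println(root.Resolve(map[string]string{"service": "mysql", "severity": "critical"}))
	fmt.Println(root.Resolve(map[string]string{"service": "web"}))
}
```

    Because the first child sets Continue, an alert matching both children reaches both receivers; an alert matching neither child falls back to the root's receiver.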
    • receivers
    In the configuration file, receivers define where the corresponding alert notifications are sent, for example a webhook, email, or DingTalk
    receivers:
    - name: <name>
      xxx_configs:
    xxx can be email, pagerduty, pushover, slack, opsgenie, webhook, victorops, or wechat
    • inhibit_rules
    Alert inhibition rules
    Example:
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    • mute_time_intervals
    Rules for muting alerts during named time intervals:
    mute_time_intervals:
    - name: <string>
      time_intervals:
        [ - <time_interval> ... ]
     
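    Processing flow:
    (1) When an Alert is received, its labels determine which Routes it belongs to (there may be several Routes; a Route has multiple Groups, and a Group has multiple Alerts)
    (2) The Alert is assigned to a Group; if no matching Group exists, a new one is created
    The data structure of the grouped alerts is: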
    type Data struct {
       Receiver string `json:"receiver"`
       Status   string `json:"status"`
       Alerts   Alerts `json:"alerts"`
       GroupLabels       KV `json:"groupLabels"`
       CommonLabels      KV `json:"commonLabels"`
       CommonAnnotations KV `json:"commonAnnotations"`
       ExternalURL string `json:"externalURL"`
    }
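    (3) A new Group waits for the time specified by group_wait (further Alerts for the same Group may arrive during the wait), uses resolve_timeout to determine whether each Alert is resolved, and then sends the notification
    (4) An existing Group waits for the time specified by group_interval and determines whether each Alert is resolved; a notification is sent when the time since the last notification exceeds repeat_interval or the Group has been updated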
  • Original article: https://www.cnblogs.com/yangyuliufeng/p/15979371.html