• Alerting with Prometheus + Alertmanager


    1. Deploy Alertmanager

    • Image: prom/alertmanager
    • Configuration file: /etc/alertmanager/alertmanager.yml
    alertmanager.yml: |
        global:
          resolve_timeout: 5m
        route:
          group_by: ['alertname']
          group_wait: 10s
          group_interval: 10s
          repeat_interval: 1h
          receiver: 'default'
        receivers:
          - name: "default"
            webhook_configs:
            - url: "http://192.168.0.103:5002/api/hooks"
              send_resolved: false
    
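    • I don't have WeCom (企业微信) and didn't want to bother configuring email, so a webhook receiver is used for testing for now.
    • group_by controls grouping: alerts that share the same values for the listed labels are batched into one notification. alertname is the name of the alert rule configured in Prometheus, such as memory_too_high, cpu_too_high, and disk_will_full in the rules below.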
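
For a quick test of the webhook receiver configured above, something has to listen on port 5002 and accept POST requests at /api/hooks. The following is a minimal sketch using only the Python standard library, not the setup used in the original post. The helper names (summarize_payload, HookHandler, run_server) are hypothetical, and the fields read from the request body (groupLabels, alerts[].status, alerts[].labels, alerts[].annotations) are taken from Alertmanager's webhook JSON format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_payload(payload):
    """Return one readable line per alert in an Alertmanager webhook payload."""
    group = payload.get("groupLabels", {})
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary} (group={group})".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "?"),
                instance=labels.get("instance", "?"),
                summary=annotations.get("summary", ""),
                group=group.get("alertname", "?"),
            )
        )
    return lines


class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only the path from webhook_configs.url is accepted.
        if self.path != "/api/hooks":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        for line in summarize_payload(payload):
            print(line)
        # Any 2xx response tells Alertmanager the notification was delivered.
        self.send_response(200)
        self.end_headers()


def run_server(host="0.0.0.0", port=5002):
    # Port 5002 matches the URL in alertmanager.yml; blocks until interrupted.
    HTTPServer((host, port), HookHandler).serve_forever()
```

Calling run_server() on the 192.168.0.103 host should print one line per alert as notifications arrive. Because route.group_by is ['alertname'], all alerts delivered in one POST share the same alertname.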

    2. Configure Prometheus

    prometheus.yml: |
        global:
          scrape_interval: 30s
          scrape_timeout: 30s
        alerting:
          alertmanagers:
          - static_configs:
            - targets: ["prometheus-alertmanager-svc:9093"]
        rule_files:
          - "alertmanager_rules.yml"
        scrape_configs:
        - job_name: 'prometheus'
          static_configs:
            - targets: ['localhost:9090']
        - job_name: 'k8s-node'
          static_configs:
            - targets: 
              - 192.168.0.200:9100
              - 192.168.0.201:9100
    
    • The alerting section configures the address of Alertmanager.
    • The rule_files section lists the alert rule files, for example rules that monitor CPU, memory, and disk:
    alertmanager_rules.yml: |
        groups:
        - name: node-alert
          rules:
          - alert: node-up
            # The job label must match a job_name in prometheus.yml (here: k8s-node).
            expr: up{job="k8s-node"} == 0
            for: 15s
            labels:
              severity: "1"
              team: node
            annotations:
              summary: "{{ $labels.instance }} has been down for more than 15s!"
          - alert: memory_too_high
            expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes)))* 100 > 90
            for: 10s  # how long the condition must hold before the alert fires and is sent to Alertmanager
            labels:
              severity: warning
            annotations:
              summary: "Instance {{ $labels.instance }}: memory usage too high"
              description: "{{ $labels.instance }} of job {{ $labels.job }}: memory usage is above 90%, current value [{{ $value }}]."
          - alert: cpu_too_high
            # job is kept in the aggregation so that {{ $labels.job }} below is not empty.
            expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance, job) * 100 > 80
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "Instance {{ $labels.instance }}: CPU usage too high"
              description: "{{ $labels.instance }} of job {{ $labels.job }}: CPU usage is above 80%, current value [{{ $value }}]."
          - alert: disk_will_full
            expr: 100 - node_filesystem_avail_bytes{fstype=~"ext4|xfs",mountpoint="/"}  * 100 / node_filesystem_size_bytes{fstype=~"ext4|xfs",mountpoint="/"} > 80
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "Instance {{ $labels.instance }}: disk usage too high"
              description: "{{ $labels.instance }} of job {{ $labels.job }}: disk usage is above 80%, current value [{{ $value }}]."
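
The three threshold expressions above are plain percentage arithmetic. As a sanity check, here is the same arithmetic in Python. The function names and the sample numbers are made up for illustration and do not come from a real node.

```python
def memory_used_percent(mem_available_bytes, mem_total_bytes):
    # (1 - MemAvailable / MemTotal) * 100
    return (1 - mem_available_bytes / mem_total_bytes) * 100


def disk_used_percent(avail_bytes, size_bytes):
    # 100 - avail * 100 / size
    return 100 - avail_bytes * 100 / size_bytes


def cpu_used_percent(idle_rates):
    # 100 - avg(idle fraction per core) * 100
    return 100 - sum(idle_rates) / len(idle_rates) * 100


GIB = 1024 ** 3
mem = memory_used_percent(1 * GIB, 16 * GIB)   # 15 of 16 GiB in use
disk = disk_used_percent(10 * GIB, 40 * GIB)   # 30 of 40 GiB in use
cpu = cpu_used_percent([0.05, 0.15])           # two cores, 5% and 15% idle

# Compare against the thresholds used in the rules (90, 80, 80).
print(mem > 90, disk > 80, cpu > 80)  # True False True
```

With these numbers memory_too_high and cpu_too_high would become active, while disk_will_full (75% used) would not.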
    
    • $labels gives access to the labels of the alerting series
    • $value gives the value computed by expr
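
To see what $labels and $value resolve to for a live alert, one option is to ask Prometheus for its active alerts through the HTTP API (GET /api/v1/alerts). The sketch below assumes Prometheus is reachable at http://localhost:9090, as in the scrape config above; format_alerts and fetch_alerts are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def format_alerts(response):
    """Turn a decoded /api/v1/alerts response into 'state name instance value' lines."""
    lines = []
    for alert in response.get("data", {}).get("alerts", []):
        labels = alert.get("labels", {})
        lines.append("{} {} {} value={}".format(
            alert.get("state", "?"),          # "pending" during the for: window, then "firing"
            labels.get("alertname", "?"),
            labels.get("instance", "?"),
            alert.get("value", "?"),          # the number that $value renders
        ))
    return lines


def fetch_alerts(base_url="http://localhost:9090"):
    # Assumed address; adjust to wherever Prometheus is exposed.
    with urlopen(base_url + "/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)
```

A possible use is print("\n".join(format_alerts(fetch_alerts()))), run from a machine that can reach Prometheus.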
  • Original article: https://www.cnblogs.com/wh-blog/p/12256351.html