• Prometheus: Alerting with Alertmanager




    Links to the original source articles:

    51CTO: wfwf1990: Monitoring MySQL with the Prometheus MySQL exporter
    简书 (Jianshu): fish_man: node_exporter configuration



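    1. Setting up Alertmanager

    What Alertmanager does: it receives the alerts that Prometheus sends and forwards them to the recipients through a configured channel such as email or WeChat.

    Create the directory: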

    test -d /etc/alertmanager || mkdir -pv /etc/alertmanager
    

    Configuration file:

    vi /etc/alertmanager/alertmanager.yml

    global:
      resolve_timeout: 5m

    templates:
    - '/etc/alertmanager/wechat.tmpl'

    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'wechat'
    receivers:
    # WeChat alert channel; replace corp_id, to_party, agent_id and api_secret with your own values
    - name: 'wechat'
      wechat_configs:
      - corp_id: 'wwc08fcb42fc6fe93c'
        to_party: '2'
        agent_id: '1000002'
        api_secret: 'cLG91Xgcd3o3zPJp6NbOJV9m7SBIlhtCScxov3Hp-XQ'
        send_resolved: true
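
    Email is mentioned above as the other delivery channel, but only the WeChat receiver is shown. A minimal sketch of an email receiver might look like the fragment below. The SMTP host, addresses and password are placeholders invented for illustration, not values from the original setup, and the receiver name 'email' is likewise an assumption.

```yaml
# Sketch of an email receiver for alertmanager.yml.
# Every host, address and credential here is a placeholder.
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'        # hypothetical SMTP relay
  smtp_from: 'alertmanager@example.com'         # hypothetical sender
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'change-me'

receivers:
- name: 'email'                                 # assumed receiver name
  email_configs:
  - to: 'oncall@example.com'                    # hypothetical recipient
    send_resolved: true
```

    To use it, the route's receiver would have to point at 'email' (or a child route would have to select it); otherwise alerts keep going to 'wechat'.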
    

    Template file:

    vi /etc/alertmanager/wechat.tmpl

    {{ define "wechat.default.message" }}
    {{ if gt (len .Alerts.Firing) 0 -}}
    Alerts Firing:
    {{ range .Alerts.Firing }}
    Severity: {{ .Labels.severity }}
    Alert: {{ .Labels.alertname }}
    Instance: {{ .Labels.instance }}
    Summary: {{ .Annotations.summary }}
    Details: {{ .Annotations.description }}
    Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
    {{- end }}
    {{- end }}
    {{ if gt (len .Alerts.Resolved) 0 -}}
    Alerts Resolved:
    {{ range .Alerts.Resolved }}
    Severity: {{ .Labels.severity }}
    Alert: {{ .Labels.alertname }}
    Instance: {{ .Labels.instance }}
    Summary: {{ .Annotations.summary }}
    Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
    Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
    {{- end }}
    {{- end }}
    Alert link:
    {{ template "__alertmanagerURL" . }}
    {{- end }}
    

    Start the container:

    docker run --restart=always -d -p 9093:9093 -v /etc/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml -v /etc/alertmanager/wechat.tmpl:/etc/alertmanager/wechat.tmpl --name alertmanager prom/alertmanager


    Follow the container logs and check for errors:

    docker logs -f alertmanager
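
    The same container can also be described declaratively. The fragment below is a sketch of a Compose file that mirrors the docker run flags used above (image, port, restart policy, the two bind mounts and the container name). The file name docker-compose.yml and the use of Compose at all are assumptions; the original only uses docker run.

```yaml
# docker-compose.yml (sketch): mirrors the docker run command above
services:
  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    restart: always
    ports:
      - "9093:9093"
    volumes:
      - /etc/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - /etc/alertmanager/wechat.tmpl:/etc/alertmanager/wechat.tmpl
```

    With this file in place, docker compose up -d should start the container, and docker logs -f alertmanager works as before.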
    

    2. MySQL alerts

    Prepare the MySQL alerting rules file. Note that the file must not contain tab characters, and every key must be followed by a colon and a space before its value.

    vi /etc/prometheus/prometheus.rules

    groups:
    - name: MySQLStatsAlert
      rules:
      - alert: MySQL is down
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} MySQL is down"
          description: "MySQL database is down. This requires immediate action!"

      - alert: Mysql_High_QPS
        expr: rate(mysql_global_status_questions[5m]) > 500
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: Mysql_High_QPS detected"
          description: "{{$labels.instance}}: MySQL is handling more than 500 queries per second (current value is: {{ $value }})"

      - alert: Mysql_Too_Many_Connections
        # threads_connected is a gauge, so compare it directly instead of taking rate()
        expr: mysql_global_status_threads_connected > 200
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: Mysql Too Many Connections detected"
          description: "{{$labels.instance}}: MySQL has more than 200 open connections (current value is: {{ $value }})"

      - alert: Mysql_Too_Many_slow_queries
        expr: rate(mysql_global_status_slow_queries[5m]) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: Mysql_Too_Many_slow_queries detected"
          description: "{{$labels.instance}}: MySQL is logging more than 3 slow queries per second (current value is: {{ $value }})"

      - alert: SQL thread stopped
        expr: mysql_slave_status_slave_sql_running != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} SQL thread stopped"
          description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."

      - alert: Slave lagging behind Master
        # seconds_behind_master is a gauge (lag in seconds), so no rate() here either
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
          description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
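
    For these rules to take effect, Prometheus has to load the rules file and know where Alertmanager is listening. The original does not show that part of prometheus.yml, so the fragment below is only a sketch. The rules path is the one used above; the target localhost:9093 is an assumption based on the port published by the docker run command, and it has to be replaced by an address that is reachable from wherever Prometheus runs (inside a container, localhost refers to the container itself).

```yaml
# Fragment of prometheus.yml (sketch; merge into the existing file)
rule_files:
  - /etc/prometheus/prometheus.rules

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093   # assumed Alertmanager address
```

    Prometheus needs a restart or a configuration reload to pick up the change.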
    

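    Verify the alert: stop the MySQL service on the replica instance.
    On the Prometheus Alerts page the alert first appears in the pending state. Once the condition has held for the duration given by for, the alert switches to firing and Prometheus sends it to Alertmanager.

    Open the Alertmanager UI to confirm that Alertmanager has received the alert sent by Prometheus.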
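
    The pending-to-firing behaviour can also be checked without stopping a real database by using promtool's rule unit tests. The file below is a sketch under several assumptions: the file name mysql_rules_test.yml is made up, it is expected to sit in /etc/prometheus next to prometheus.rules, and the series labels (instance db1:9104, job mysql) are invented sample data. It feeds mysql_up = 0 into the "MySQL is down" rule and expects no firing alert at 0m (the alert is still pending) and one firing alert at 2m (the 1m for period has elapsed).

```yaml
# mysql_rules_test.yml (hypothetical name), run with:
#   promtool test rules mysql_rules_test.yml
rule_files:
  - prometheus.rules

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Invented sample series: the exporter reports MySQL as down throughout
      - series: 'mysql_up{instance="db1:9104", job="mysql"}'
        values: '0 0 0 0'
    alert_rule_test:
      # At 0m the condition has only just become true, so nothing is firing yet
      - eval_time: 0m
        alertname: MySQL is down
      # At 2m the condition has held longer than "for: 1m", so the alert fires
      - eval_time: 2m
        alertname: MySQL is down
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: db1:9104
              job: mysql
            exp_annotations:
              summary: "Instance db1:9104 MySQL is down"
              description: "MySQL database is down. This requires immediate action!"
```

    Only firing alerts are compared against exp_alerts, which is why the 0m check lists none.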

    3. Prometheus alerting rules for nodes

    groups:
    - name: example
      rules:

      - alert: Instance down
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Server instance {{ $labels.instance }} is down"
          description: "Job {{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"

      - alert: Disk free space below 5%
        expr: 100 - ((node_filesystem_avail_bytes{job="node-exporter",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"} * 100) / node_filesystem_size_bytes{job="node-exporter",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"}) > 95
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }} low disk space alert"
          description: "{{ $labels.instance }}: less than 5% free on disk {{ $labels.device }}, current usage: {{ $value }}"

      - alert: "Available memory below 20%"
        expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes )) * 100 > 80
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Server instance {{ $labels.instance }} low memory alert"
          description: "{{ $labels.instance }}: less than 20% memory available, current usage: {{ $value }}"

      - alert: "CPU load average above 4"
        expr: node_load5 > 4
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }} CPU load alert"
          description: "{{ $labels.instance }}: 5-minute CPU load average is above 4, current value: {{ $value }}"

      - alert: "Disk read I/O above 30MB/s"
        expr: irate(node_disk_read_bytes_total{device="sda"}[1m]) > 30000000
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }} disk read I/O alert"
          description: "{{ $labels.instance }}: disk read throughput is above 30MB/s, current value: {{ $value }}"

      - alert: "Disk write I/O above 30MB/s"
        expr: irate(node_disk_written_bytes_total{device="sda"}[1m]) > 30000000
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }} disk write I/O alert"
          description: "{{ $labels.instance }}: disk write throughput is above 30MB/s, current value: {{ $value }}"

      - alert: "Network transmit rate above 10MB/s"
        # 10 MB/s expressed in bytes per second
        expr: irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) > 10000000
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }} network traffic alert"
          description: "{{ $labels.instance }}: transmit rate on interface {{ $labels.device }} is above 10MB/s, current value: {{ $value }}"

      - alert: "CPU usage above 90%"
        expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > 90
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }} CPU usage alert"
          description: "{{ $labels.instance }}: CPU usage is above 90%, current value: {{ $value }}"
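
    Several of the rules above select on job="node-exporter", so they only match if the scrape job in prometheus.yml carries exactly that name. The original does not show the scrape configuration; the fragment below is a sketch in which the target localhost:9100 is an assumption (9100 is node_exporter's usual default port) and the 15s interval is an illustrative choice, picked so that the 30s irate window in the CPU rule still contains at least two samples.

```yaml
# Fragment of prometheus.yml (sketch): scrape job matching the rules above
scrape_configs:
  - job_name: node-exporter        # must equal the job label used in the rules
    scrape_interval: 15s           # illustrative; keeps two samples inside a 30s window
    static_configs:
      - targets:
          - localhost:9100         # assumed node_exporter address
```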
    
  • Original article: https://www.cnblogs.com/aixing/p/13327212.html