• A complete Prometheus monitoring setup for small and medium-sized enterprises


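    1   Plenty of blog posts already explain in detail what Prometheus does and how it works, so that is not repeated here. This post only records, from the deployment and configuration side, how the Prometheus alerting platform was built for our company's product team, along with the pitfalls hit on the way. Without further ado, on to the deployment: prometheus + grafana + alertmanager + pushgateway are installed on one server, and node_exporter is deployed on every other monitored node. node_exporter can also be deployed on the Prometheus server itself.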

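      1.1 Deploy prometheus and manage it with systemctl

           Version installed: prometheus-2.6.1

                   Baidu Cloud download: https://pan.baidu.com/s/1w16lQZKw8PCHqlRuSK2i7A

                   Extraction code: lw1q

         Then unpack the archive into /usr/local/prometheus. Using an ansible script for the deployment is recommended.

         The systemd unit file used to manage the service is shown below. It lives at /usr/lib/systemd/system/prometheus.service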

    [Unit]
    Description=Prometheus
    Documentation=https://prometheus.io
    
    [Service]
    Restart=on-failure
    ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml
    
    [Install]
    WantedBy=multi-user.target

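       1.2  The cleaned-up prometheus configuration file. To monitor more machines, add a new job_name with the target nodes; each of those nodes needs the corresponding node_exporter installed.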

    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 172.16.5.3:9093
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
       - "rules/first_rules.yml"
       - "rules/second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        static_configs:
        - targets: ['localhost:9090','172.16.5.3:9100']
      - job_name: 'pushgateway'
        scrape_interval: 5s
        static_configs:
        - targets: ['172.16.5.3:9091']
          labels:
            instance: pushgateway
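
The `pushgateway` job above only scrapes whatever has been pushed to 172.16.5.3:9091, so something has to push first. Below is a minimal sketch of such a push using only the Python standard library. The metric name `demo_backup_last_success`, the job `batch` and the instance `node1` are made up for illustration; the URL layout `/metrics/job/<job>/instance/<instance>` and the text exposition format follow the Pushgateway documentation as far as I know.

```python
import urllib.request

# Address taken from the scrape config above; adjust for your environment.
PUSHGATEWAY = "http://172.16.5.3:9091"


def build_payload(name: str, value, help_text: str) -> bytes:
    """Render a single gauge sample in the Prometheus text exposition format."""
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        f"{name} {value}",
    ]
    # The body has to end with a newline, otherwise the push is rejected.
    return ("\n".join(lines) + "\n").encode("utf-8")


def push_url(base: str, job: str, instance: str) -> str:
    """Build the grouping-key URL for one job/instance pair."""
    return f"{base}/metrics/job/{job}/instance/{instance}"


def push(job: str, instance: str, payload: bytes, base: str = PUSHGATEWAY) -> int:
    """PUT the payload, replacing all metrics of this grouping key."""
    req = urllib.request.Request(push_url(base, job, instance), data=payload, method="PUT")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical metric, job and instance names, for illustration only.
    body = build_payload("demo_backup_last_success", 1, "1 if the last backup succeeded")
    print(push("batch", "node1", body))
```

Note that the scrape job above does not set `honor_labels`, so Prometheus should store the pushed `job` and `instance` labels as `exported_job` and `exported_instance`; query on those names when looking for pushed metrics.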

      1.3 The basic monitoring rules for the servers are as follows

    # cat second_rules.yml
    groups:
    - name: Instance liveness alert rules
      rules:
      - alert: "Instance liveness alert"
        expr: up{job="prometheus"} == 0 or up{job="Linux-host"} == 0
        for: 1m
        labels:
          user: prometheus
          severity: emergency
          team: HTY
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
          value: "{{ $value }}"
    - name: Memory alert rules
      rules:
      - alert: "Memory usage alert"
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 30
        for: 1m
        labels:
          team: C3
          user: prometheus
          severity: warning
        annotations:
          summary: "Server {{ $labels.instance }} memory alert"
          description: "{{ $labels.instance }} memory utilization is above 30% (current value: {{ $value }}%)"
          value: "{{ $value }}"
    - name: Memory alert rules 2
      rules:
      - alert: "Memory usage alert 2"
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 50
        for: 1m
        labels:
          team: C3
          user: prometheus
          severity: critical
        annotations:
          summary: "Server {{ $labels.instance }} memory alert"
          description: "{{ $labels.instance }} memory utilization is above 50% (current value: {{ $value }}%)"
          value: "{{ $value }}"
    - name: CPU alert rules
      rules:
      - alert: "CPU usage alert"
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 70
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Server {{ $labels.instance }} CPU alert"
          description: "Server CPU usage is above 70% (current value: {{ $value }}%)"
          value: "{{ $value }}"
    - name: Disk alert rules
      rules:
      - alert: "Disk usage alert"
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Server {{ $labels.instance }} disk alert"
          description: "Server {{ $labels.instance }}: disk usage is above 80% (mount point: {{ $labels.mountpoint }}, current value: {{ $value }}%)"
          value: "{{ $value }}"
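
To make the memory expression less abstract, here is a small worked example in Python that applies the same arithmetic as the two memory rules: used = total - (free + buffers + cached), expressed as a percentage of total. The function names and the sample numbers are made up for illustration; only the formula and the 30% / 50% thresholds come from the rules above.

```python
GIB = 2 ** 30


def memory_used_percent(total: float, free: float, buffers: float, cached: float) -> float:
    """Same arithmetic as the PromQL expression in the memory rules."""
    return (total - (free + buffers + cached)) / total * 100


def highest_severity(percent: float, warning: float = 30.0, critical: float = 50.0):
    """Return the most severe threshold crossed, or None if neither is crossed."""
    if percent > critical:
        return "critical"
    if percent > warning:
        return "warning"
    return None


if __name__ == "__main__":
    # Hypothetical host: 16 GiB total, 4 GiB free, 1 GiB buffers, 5 GiB cached.
    pct = memory_used_percent(16 * GIB, 4 * GIB, 1 * GIB, 5 * GIB)
    print(pct, highest_severity(pct))
```

The sketch reports only the highest severity. Prometheus itself evaluates the two rules independently, so above 50% both alerts fire; whether the warning is then suppressed is up to Alertmanager's inhibit rules, which only apply when every label listed under `equal` has the same value on both alerts.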

      2 Install and configure alertmanager

    global:
      # WeCom (Enterprise WeChat) alert settings
      resolve_timeout: 5m
      wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      wechat_api_corp_id: 'ww41a2b13ef47aac58'
      wechat_api_secret: 'xxxxx'
      # QQ Mail alert settings
      smtp_from: xxx@qq.com
      smtp_auth_username: xx@qq.com
      smtp_auth_password: xxxx # must be obtained from QQ Mail
      smtp_require_tls: false
      smtp_smarthost: 'smtp.qq.com:465'
    templates:
      - "/usr/local/alertmanager/template/*.tmpl"
    route:
      receiver: 'default-receiver'
      group_wait: 10s
      group_interval: 30s
      repeat_interval: 1m
      group_by: ['team']
      routes:
      - group_by: ['test']
        group_wait: 10s
        group_interval: 30s
        repeat_interval: 1m
        receiver: 'wechat'
        match:
          team: test1
    receivers:
    - name: 'wechat'
      wechat_configs:
      - send_resolved: true
        message: '{{ template "wechat.default.message" .}}'
        to_party: 'xxxx'
        agent_id: "xxx" # must be obtained from WeCom (Enterprise WeChat)
        api_secret: 'xxxxxxxx'
    - name: 'default-receiver'
      email_configs:
      - to: 'xxxxxx@qq.com'
        send_resolved: true
        # html: '{{ template "wechat.default.message" .}}'
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['env','team','instance','type','group','job','alertname']
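
A quick way to check the routing tree is to hand Alertmanager a synthetic alert and see which receiver gets it. The sketch below builds such an alert with the Python standard library and posts it to `/api/v2/alerts`. It assumes an Alertmanager version that serves the v2 API and that it listens on 172.16.5.3:9093 as in the Prometheus config earlier; the alert name `RoutingTest` and the instance `demo-host:9100` are made up.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Address taken from the alerting section of prometheus.yml; adjust as needed.
ALERTMANAGER = "http://172.16.5.3:9093"


def build_test_alert(team: str, severity: str = "warning", now=None) -> list:
    """Build a one-element alert list in the shape the v2 alerts endpoint accepts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "RoutingTest",    # hypothetical name, for testing only
            "team": team,                  # the label the route's `match` looks at
            "severity": severity,
            "instance": "demo-host:9100",  # hypothetical instance
        },
        "annotations": {
            "summary": "Synthetic alert for checking routing",
            "description": "Sent by hand, safe to ignore",
            "value": "0",
        },
        "startsAt": now.isoformat(),
    }]


def send(alerts: list, base: str = ALERTMANAGER) -> int:
    """POST the alerts as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    # team=test1 should match the child route and go to the 'wechat' receiver;
    # any other team value should fall through to 'default-receiver'.
    print(send(build_test_alert("test1")))
```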

      For how to obtain the WeCom (Enterprise WeChat) credentials, see: https://www.cnblogs.com/miaocbin/p/13706164.html

      For how to obtain the QQ Mail SMTP credentials, see: https://blog.csdn.net/knight_zhou/article/details/105137581

        3 The notification template

    {{ define "wechat.default.message" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts.Firing -}}
    {{- if eq $index 0 }}
    ========= Monitoring alert =========
    Alert status: {{   .Status }}
    Alert severity: {{ .Labels.severity }}
    Alert type: {{ $alert.Labels.alertname }}
    Affected host: {{ $alert.Labels.instance }}
    Alert summary: {{ $alert.Annotations.summary }}
    Alert details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
    Trigger value: {{ .Annotations.value }}
    Fault time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    ========= = end =  =========
    {{- end }}
    {{- end }}
    {{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts.Resolved -}}
    {{- if eq $index 0 }}
    ========= Recovered =========
    Alert type: {{ .Labels.alertname }}
    Alert status: {{   .Status }}
    Alert summary: {{ $alert.Annotations.summary }}
    Alert details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
    Fault time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    Recovery time: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- if gt (len $alert.Labels.instance) 0 }}
    Instance: {{ $alert.Labels.instance }}
    {{- end }}
    ========= = end =  =========
    {{- end }}
    {{- end }}
    {{- end }}
    {{- end }}
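
The `Add 28800e9` in the template can look cryptic. Go durations are counted in nanoseconds, so 28800e9 ns is 28800 seconds, that is 8 hours: the template shifts the UTC timestamps to UTC+8 before formatting them, and `"2006-01-02 15:04:05"` is Go's reference-time layout for year-month-day hour:minute:second. The Python sketch below reproduces the same arithmetic; the function name and the sample timestamp are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

OFFSET_NS = 28800e9  # the value passed to .Add in the template, in nanoseconds


def to_display_time(ts_utc: datetime) -> str:
    """Shift a UTC timestamp by the template's offset and format it the same way."""
    shifted = ts_utc + timedelta(seconds=OFFSET_NS / 1e9)
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Hypothetical alert start time in UTC.
    starts_at = datetime(2021, 3, 16, 2, 30, 0, tzinfo=timezone.utc)
    print(to_display_time(starts_at))
```

For servers outside UTC+8 the offset in the template would need to change accordingly.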

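      4. Install and deploy grafana. Installing the latest version of prometheus and then using the plugin is recommended. A fairly clean grafana dashboard is attached.

       Import the template directly; for the import steps, refer to this blog post: https://www.cnblogs.com/wukc/p/14231042.html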

  • Original article: https://www.cnblogs.com/wxm-pythoncoder/p/14543808.html