Prometheus

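Prometheus is an open-source toolkit for systems monitoring and alerting.
Its main features are:
A multidimensional data model (time series identified by a metric name and key/value pairs)
A flexible query language
No dependency on distributed storage
Time series collection via a pull model over HTTP
Pushing time series is supported through an intermediary gateway
Monitoring targets are found through service discovery or static configuration
Multiple modes of graphing and dashboarding support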
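To make the data model concrete, here is a minimal Python sketch (not Prometheus code) in which a time series is identified by its metric name plus its set of label key/value pairs. The names `series_key` and `SeriesStore` and the sample metric `http_requests_total` are assumptions made up for illustration.

```python
from collections import defaultdict

def series_key(name, labels):
    """Identity of a series: metric name plus its label pairs, order-independent."""
    return (name, tuple(sorted(labels.items())))

class SeriesStore:
    """Toy in-memory store: one list of (timestamp, value) samples per series."""
    def __init__(self):
        self._samples = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        self._samples[series_key(name, labels)].append((timestamp, value))

    def series_count(self):
        return len(self._samples)

    def samples(self, name, labels):
        return list(self._samples.get(series_key(name, labels), []))

store = SeriesStore()
store.append("http_requests_total", {"method": "GET", "code": "200"}, 1000, 5.0)
# Same name and same label pairs (in a different order): same series.
store.append("http_requests_total", {"code": "200", "method": "GET"}, 1015, 7.0)
# One label value differs: a different series.
store.append("http_requests_total", {"method": "POST", "code": "200"}, 1000, 1.0)
```

Changing any label value yields a new series, which is why label values with many distinct values multiply the number of series stored.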
Components
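Prometheus consists of several components, many of which are optional:
The main Prometheus server, which scrapes and stores time series data
Client libraries for instrumenting application code
A push gateway for short-lived jobs
A Rails/SQL-based GUI dashboard (PromDash, since deprecated in favor of Grafana)
Special-purpose exporters (for HAProxy, StatsD, Ganglia, and so on)
An alertmanager for handling alerts
A command-line query tool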
Architecture
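The overall architecture of Prometheus and its components:
Prometheus collects monitoring data either by scraping targets directly or through an intermediary gateway for short-lived jobs. It stores all collected data locally and runs the defined rules over it, either to record new time series or to send alerts. PromDash or other API clients can then visualize the collected data.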
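1. Prometheus Server:
Responsible for collecting and storing data, and for serving PromQL queries.
2. Client SDKs:
Official client libraries exist for go, java, scala, python, and ruby. Many third-party libraries cover other languages such as nodejs, php, and erlang.
3. Push Gateway:
An intermediary gateway to which short-lived jobs can push their metrics.
4. PromDash:
A dashboard built with Rails for visualizing metric data.
5. Exporter:
Exporter is the general name for a class of Prometheus data collection components. An exporter gathers data from its target and converts it into a format Prometheus understands. Unlike traditional collection agents, it does not send data to a central server; it waits for the central server to come and scrape it.
Exporters exist for many kinds of services, including databases, hardware, message brokers, storage systems, HTTP servers, and JMX.
6. alertmanager:
The alert manager, which handles alert notifications.
7. prometheus_cli:
A command-line tool.
8. Other auxiliary tools:
Various exporters and bridges that convert metrics between Prometheus and tools such as HAProxy, StatsD, and Graphite.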
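The pull model described for exporters can be sketched with the Python standard library alone. This is a minimal illustration, not an official exporter: the metric names, their values, and port 8000 are assumptions, and real exporters normally use a Prometheus client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(values):
    """Render {name: number} as text exposition lines, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(values.items()))

class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical readings; a real exporter would collect these from its target.
    values = {"demo_temperature_celsius": 21.5, "demo_jobs_queued": 3}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.values).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Block and wait for Prometheus to scrape http://<host>:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the listener. Nothing is sent anywhere until a scraper issues a GET request to `/metrics`, which is the difference from a push-based agent.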
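How Prometheus works
1. The Prometheus daemon periodically scrapes metrics from its targets; each target must expose an HTTP endpoint for it to scrape. Targets can be specified through the configuration file, text files, Zookeeper, Consul, DNS SRV lookup, and other mechanisms. Prometheus monitors by pulling: the server pulls data directly from a target, or indirectly from an intermediary gateway that clients push to.
2. Prometheus stores all scraped data locally, cleans up and aggregates it according to the configured rules, and stores the results as new time series.
3. Prometheus exposes the collected data through PromQL and other APIs for visualization. Many charting options are supported, such as Grafana, the bundled PromDash, and Prometheus's own template engine. Prometheus also offers an HTTP API for queries, so you can produce whatever custom output you need.
4. PushGateway lets clients actively push metrics to it, while Prometheus simply scrapes the gateway on its regular schedule.
5. Alertmanager is a component separate from Prometheus. Alert conditions are written as Prometheus query expressions, and Alertmanager provides flexible ways to deliver the resulting alerts.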
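As an illustration of the HTTP API mentioned in step 3, the sketch below builds an instant-query URL for the `/api/v1/query` endpoint and parses a vector response. The base URL `http://localhost:9090` matches the default port used in this article; the sample response is hand-written in the documented shape rather than captured from a server, and the helper names are assumptions.

```python
import json
from urllib.parse import urlencode

def instant_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string

def parse_vector(response_text):
    """Return [(labels, float_value)] from an instant-vector response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each entry carries its labels and a [timestamp, "value-as-string"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]

# Hand-written sample response (an assumption, not real server output).
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
         "value": [1435781451.781, "1"]},
    ]},
})
url = instant_query_url("http://localhost:9090", "up == 0")
rows = parse_vector(sample)
```

Fetching `url` with any HTTP client (for example `urllib.request.urlopen`) and passing the body to `parse_vector` would complete the round trip.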
--------------------------------
Basic environment setup:
1. Installing and configuring Prometheus:
tar -xf prometheus-2.0.0.linux-amd64.tar.gz
cd prometheus-2.0.0.linux-amd64
Configuration file: cat prometheus.yml
# my global config
global:
  scrape_interval: 15s     # Scrape targets every 15 seconds. The default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093   # Alertmanager is a separate download; its default port is 9093.
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"   # Alerting requires rules; list the rule files here.
  # - "second_rules.yml"
# Scrape configurations: which hosts to scrape.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
  #- job_name: 'node_exporter'
    # metrics_path defaults to '/metrics', the URL path scraped on each target.
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
      - targets:
        - 192.168.31.113:9100
        - 192.168.31.117:9100
        - 192.168.31.117:8000
  - job_name: mysql
    static_configs:
      - targets: ["localhost:9104"]
        labels:
          instance: db1
  - job_name: linux
    static_configs:
      - targets: ["localhost:9100"]
        labels:
          instance: db1
  - job_name: 'kubernetes-node'
    scheme: https   # The default scheme is http; this job uses https.
    tls_config:
      insecure_skip_verify: true   # Skip TLS certificate verification.
    kubernetes_sd_configs:
      - api_server: 'http://10.3.1.141:8080'   # Prometheus 2.x takes a single api_server instead of the older api_servers list.
        role: node
    relabel_configs:   # Rewrite meta labels.
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)   # Produces labels such as kubernetes_io_hostname="xxxx", useful for Grafana charts.
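The `labelmap` action in the last job can be hard to picture, so here is a rough Python model of what it does: every label whose name fully matches the regex is copied to a new name built from the capture group. This is a sketch of the behavior, not Prometheus's implementation. Python writes the group reference as `\1` where Prometheus uses `$1`, and the node address and hostname values are hypothetical.

```python
import re

def labelmap(labels, regex, replacement=r"\1"):
    """Copy each label whose name fully matches `regex` to a new, rewritten name."""
    pattern = re.compile(regex)
    result = dict(labels)
    for name, value in labels.items():
        match = pattern.fullmatch(name)  # relabel regexes are anchored at both ends
        if match:
            result[match.expand(replacement)] = value
    return result

discovered = {
    "__address__": "10.3.1.141:10250",  # hypothetical node address
    "__meta_kubernetes_node_label_kubernetes_io_hostname": "node-1",  # hypothetical value
}
relabeled = labelmap(discovered, r"__meta_kubernetes_node_label_(.+)")
```

After relabeling, Prometheus drops labels whose names start with `__`, so only the copied `kubernetes_io_hostname` label would remain on the scraped series.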
-----------------------------
Starting Prometheus (any one of the following):
./prometheus
./prometheus --config.file=prometheus.yml
nohup ./prometheus --config.file=prometheus.yml &
---------------------------
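The default configuration file has four sections: global, alerting, rule_files, and scrape_configs.
global controls the server-wide settings of Prometheus.
scrape_interval sets how often targets are scraped.
evaluation_interval sets how often rules are evaluated; Prometheus uses the files listed in rule_files to produce new time series values.
alerting lists the Alertmanager instances that alerts are sent to.
rule_files lists the paths of the rule files.
scrape_configs defines the resources Prometheus monitors. Prometheus exposes its own metrics over HTTP, so it can monitor its own health as well.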
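A quick back-of-the-envelope calculation shows what the scrape interval means for data volume. The sketch assumes every scrape succeeds; the fleet size of 3 targets with 500 series each is a made-up figure.

```python
def samples_per_day(scrape_interval_seconds):
    """Samples collected per series per day at a fixed scrape interval."""
    return 86400 // scrape_interval_seconds

per_series = samples_per_day(15)    # the 15s interval set in prometheus.yml
default_rate = samples_per_day(60)  # the 1-minute default
# Hypothetical fleet: 3 targets exposing 500 series each.
total = 3 * 500 * per_series
```

Moving from the 1-minute default to 15 seconds quadruples the number of samples stored per series.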
------------------------------------------
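2. Installing node_exporter
node_exporter collects monitoring data about the server it runs on.
node_exporter listens on port 9100 by default, and Prometheus scrapes its data from there.
tar -xf node_exporter-0.15.1.linux-amd64.tar.gz
cd node_exporter-0.15.1.linux-amd64
./node_exporter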
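Writing a custom client
Any endpoint works as long as the data it returns is in the right format. This example uses Django:
from django.http import HttpResponse

def metrics(req):
    # One sample per line in the form "<metric_name> <value>", each ending with a newline.
    ss = "feiji 32\n" + "caidian 31\n"
    return HttpResponse(ss, content_type="text/plain")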
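What "the right format" means can be checked with a small parser. The sketch below is deliberately simplified and is not the full exposition format: it accepts only unlabeled `name value` lines and ignores labels and timestamps. The function name `parse_simple_metrics` is an assumption.

```python
import re

# Metric names: letters, digits, underscores and colons, not starting with a digit.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def parse_simple_metrics(text):
    """Parse unlabeled 'name value' lines; raise ValueError on anything else."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        parts = line.split()
        if len(parts) != 2 or not METRIC_NAME.fullmatch(parts[0]):
            raise ValueError(f"malformed sample line: {line!r}")
        parsed[parts[0]] = float(parts[1])  # float() also raises ValueError on bad values
    return parsed

good = parse_simple_metrics("feiji 32\ncaidian 31\n")

try:
    parse_simple_metrics("feiji 32 caidian 31")  # both samples on one line
    one_line_ok = True
except ValueError:
    one_line_ok = False
```

Each sample has to sit on its own line; putting two samples on one line is rejected here, and a real scrape would fail to parse it too.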
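3. Write the rules/mengyuan.rules rule file; rules are a prerequisite for sending alerts
vi mengyuan.rules
groups:
- name: zus
  rules:

  # Alert for any instance that is unreachable for more than 5 seconds.
  - alert: InstanceDown   # Alert name; any name will do.
    expr: up == 0   # Alert expression: up is 0 when the target cannot be scraped. The alert triggers while this holds, and the current value is available as $value.
    for: 5s      # If the expression still holds after 5s, the alert fires and is sent to Alertmanager.
    labels:
      severity: page # Label attached to alerts from this rule.
    annotations:
      summary: "Instance {{ $labels.instance }} down current {{ $value }}"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 seconds."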
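The effect of the `for` clause is easier to see as a small state machine. The following is a simplified model of how an alert moves between inactive, pending, and firing across rule evaluations. It is a sketch of the semantics, not Prometheus code, and the evaluation timeline is invented.

```python
def alert_states(condition_by_time, for_seconds):
    """Return the alert state after each evaluation: inactive, pending or firing.

    condition_by_time: list of (timestamp_seconds, expr_is_true) in time order.
    """
    states = []
    active_since = None  # when the expression first became true in the current run
    for timestamp, is_true in condition_by_time:
        if not is_true:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states

# Evaluations every 15s; the expression holds at t=15 and t=30, clears, then holds again.
evaluations = [(0, False), (15, True), (30, True), (45, False), (60, True)]
states = alert_states(evaluations, for_seconds=5)
```

Because rules are only checked once per evaluation interval, a `for` shorter than that interval still requires the expression to hold on two consecutive evaluations before the alert fires.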
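4. Installing Alertmanager
Download and unpack alertmanager-0.15.0-rc.1.linux-amd64.
Create or edit the configuration:
cat alertmanager.yml
route:
  receiver: mengyuan2 # Default receiver; one is required and must match a name under receivers.
  group_wait: 1s # Wait 1s before sending the first notification for a new group.
  group_interval: 1s # Wait 1s before notifying about new alerts added to an existing group.
  repeat_interval: 1m # Resend a notification for a still-firing alert after 1 minute.
  group_by: ["zus"] # Label names used to group alerts.
  routes:      # Sub-routes: alerts labeled severity: page go to receiver mengyuan. Without routes, everything goes to the default receiver mengyuan2.
  - receiver: mengyuan
    match:
      severity: page

receivers:
- name: 'mengyuan'
  webhook_configs: # Webhook receiver: Alertmanager POSTs the alert notifications to the URL below.
  - url: http://127.0.0.1:8000
    send_resolved: true
- name: 'mengyuan2'
  webhook_configs:
  - url: http://127.0.0.1:8000/2
    send_resolved: true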
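On the receiving end, a webhook target only needs to accept a JSON POST. The sketch below uses the Python standard library and reads just a few fields of the Alertmanager webhook payload (`alerts`, `status`, `labels`, `annotations`). The sample payload is hand-written for illustration, not a captured request, and the helper names are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize(payload):
    """Return one line per alert in an Alertmanager webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{alert.get('status')}] {labels.get('alertname')}: {summary}")
    return lines

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()

def serve(port=8000):
    """Listen on 127.0.0.1:8000, where the webhook_configs URLs point."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()

# Hand-written payload containing only the fields this sketch reads.
sample = {"status": "firing", "alerts": [{
    "status": "firing",
    "labels": {"alertname": "InstanceDown", "instance": "localhost:9100", "severity": "page"},
    "annotations": {"summary": "Instance localhost:9100 down current 0"},
}]}
lines = summarize(sample)
```

With `send_resolved: true`, the same endpoint also receives a notification whose alerts carry the status `resolved` once the condition clears.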
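# Scrape job using file-based service discovery (belongs under scrape_configs in prometheus.yml):
- job_name: '***'
  scrape_interval: 120s
  scrape_timeout: 30s
  file_sd_configs:
  - files:
    - /prometheus/*.json
  relabel_configs: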
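The JSON files matched by `file_sd_configs` hold a list of target groups, each with a `targets` list and optional `labels`. A hedged sketch of generating such a file from Python follows; the `env` label and the file name `nodes.json` are assumptions, and the example writes to a temporary directory rather than to `/prometheus/`.

```python
import json
import os
import tempfile

def write_targets(path, groups):
    """Atomically write target groups in the file_sd JSON format."""
    directory = os.path.dirname(path) or "."
    # The .tmp suffix keeps the partial file out of the *.json glob.
    with tempfile.NamedTemporaryFile("w", dir=directory, suffix=".tmp", delete=False) as tmp:
        json.dump(groups, tmp, indent=2)
        tmp_name = tmp.name
    os.replace(tmp_name, path)  # readers never see a half-written file

groups = [
    {"targets": ["192.168.31.113:9100", "192.168.31.117:9100"],
     "labels": {"env": "demo"}},  # hypothetical label
]
# Written to a temp directory here; in practice the file would go under /prometheus/.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "nodes.json")
write_targets(out_path, groups)
```

Prometheus watches the matched files and picks up changes without a restart, so regenerating the file is enough to add or remove targets.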
Original article: https://www.cnblogs.com/skyzy/p/9226849.html