Prometheus PromQL Syntax and How to Write Alerting Rules

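I put off writing this for a long time because it felt like a chore. There are also plenty of experts out there who are far ahead of me; I am writing this mainly because explaining things helps me understand them better.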

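Introduction

Prometheus is a time series database that stores the monitoring data it scrapes from exporters. So how do you query that data? MySQL has SQL; what does Prometheus have? PromQL (Prometheus Query Language), the query DSL built for Prometheus. Everyday dashboards and alerting rules both rely on it, so it is well worth learning properly.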

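An example

Open the Prometheus web UI in a browser at http://localhost:9090/graph. There is an input box where you type PromQL expressions, along with an Execute button.

Let's use Nginx metrics as the example. For how to monitor Nginx, see my earlier post on monitoring Nginx with Prometheus.

How do we look at the number of active connections on one Nginx server, i.e. the active metric?

On the server, the full metrics output looks like this: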
```
[wonders@node1 ~]$ curl http://172.18.11.192:9145/metrics
# HELP nginx_http_connections Number of HTTP connections
# TYPE nginx_http_connections gauge
nginx_http_connections{state="active"} 1349
nginx_http_connections{state="reading"} 0
nginx_http_connections{state="waiting"} 1341
nginx_http_connections{state="writing"} 5
......
```
What if I only want to see active? Type this into the PromQL input box:
```
nginx_http_connections{state="active"}
```
This returns active for every Nginx machine:
```
nginx_http_connections{instance="172.18.11.192:9145",job="Nginx",state="active"}	1459
nginx_http_connections{instance="172.18.11.193:9145",job="Nginx",state="active"}	1456
```
And if I only want one of them, say the 192 machine?
```
nginx_http_connections{instance="172.18.11.192:9145",state="active"}
```
This returns the metric for 192 only:
```
nginx_http_connections{instance="172.18.11.192:9145",job="Nginx",state="active"}	1358
```
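Starting to make sense?

Of course, production rarely has just one Nginx. What if I want the total, that is, active summed across all Nginx servers? PromQL handles this basic need out of the box.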
```
sum(nginx_http_connections{state="active"})
```
The output looks like this:
```
{}	2900
```
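Besides sum, PromQL also offers min (minimum), max (maximum), avg (average), stddev (population standard deviation), stdvar (population variance), count (number of elements), count_values (number of elements with the same value), bottomk (the k smallest elements by sample value), topk (the k largest elements by sample value), quantile (the φ-quantile across series), and more. Prometheus calls these aggregation operators.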
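Aggregations can also keep some labels instead of collapsing everything into a single value. A few sketches, assuming the same nginx_http_connections metric and labels shown in the examples above (the by and without clauses are standard PromQL; the groupings chosen here are only illustrations):

```
# Active connections per instance (one result per Nginx host)
sum by (instance) (nginx_http_connections{state="active"})

# Connections per state across all hosts (drops instance and job, keeps state)
sum without (instance, job) (nginx_http_connections)

# The single instance with the most active connections
topk(1, nginx_http_connections{state="active"})
```

Unlike sum, topk and bottomk return the original series with their labels intact, which makes them handy for spotting the busiest or quietest host.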

What if I want to exclude one particular machine instead? Use !=:
```
nginx_http_connections{instance!="172.18.11.192:9145",state="active"}
```
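Other operators include binary arithmetic operators (+, -, *, /), comparison operators (==, !=, <, >, <=, >=), set operators (and, or, unless), vector matching, and so on.

After the examples above, things have probably clicked for some readers. What follows is theory.

Theory

This part is mostly a translation of the official documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/

Query result types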

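A PromQL expression evaluates to one of four types:
- Instant vector: a set of time series, each containing a single sample, all sharing the same timestamp. Example: http_requests_total
- Range vector: a set of time series, each containing a range of data points over time. Example: http_requests_total[5m]
- Scalar: a simple numeric floating point value with no labels and no series. Example: 10, or scalar(count(http_requests_total)). Note that count(http_requests_total) on its own returns an instant vector with one element, not a scalar.
- String: a simple string value; currently unused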
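To make the types concrete, here is a rough sketch using http_requests_total as a placeholder metric name. Which type you get depends only on the shape of the expression:

```
http_requests_total                   # instant vector: one sample per series
http_requests_total[5m]               # range vector: 5 minutes of samples per series
rate(http_requests_total[5m])         # instant vector again: rate() consumes the range
count(http_requests_total)            # instant vector with a single, label-less element
scalar(count(http_requests_total))    # scalar: a bare number
```

Only expressions that evaluate to an instant vector or a scalar can be graphed; a bare range vector can only be shown in the table view.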
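Query conditions

Prometheus stores time series data, and each series is identified by a metric name plus a set of labels. The name itself can also be written as a label: http_requests_total is equivalent to {__name__="http_requests_total"}.

A simple query is essentially a filter on labels, for example: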
```
http_requests_total{code="200"} # series named http_requests_total whose code is "200"
```
Conditions also support negation and regular-expression matching, for example:
```
http_requests_total{code!="200"}  # series whose code is not "200"
http_requests_total{code=~"2.."}  # series whose code is "2xx"
http_requests_total{code!~"2.."}  # series whose code is not "2xx"
```
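One detail worth knowing: Prometheus regex matchers are fully anchored, so the pattern has to match the whole label value rather than a substring. A few sketches with the same placeholder metric (the specific patterns are only illustrations):

```
http_requests_total{code=~"2.."}        # matches "200" and "204", but not "2000"
http_requests_total{code=~"4..|5.."}    # any 4xx or 5xx code
http_requests_total{code=~"2.*"}        # any value starting with "2"
http_requests_total{code!~"2..|3.."}    # everything except 2xx and 3xx
```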
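Operators

Prometheus query expressions support the usual kinds of operators:

Arithmetic operators:

The supported arithmetic operators are +, -, *, /, %, and ^. For example, http_requests_total * 2 multiplies every sample of http_requests_total by 2.

Comparison operators:

The supported comparison operators are ==, !=, >, <, >=, and <=. For example, http_requests_total > 100 keeps only the http_requests_total series whose value is greater than 100.

Logical operators:

The supported logical (set) operators are and, or, and unless. For example, http_requests_total == 5 or http_requests_total == 2 returns the http_requests_total series whose value is 5 or 2.

Aggregation operators:

The supported aggregation operators are sum, min, max, avg, stddev, stdvar, count, count_values, bottomk, topk, and quantile. For example, max(http_requests_total) returns the largest value across the http_requests_total series.

Note that, as in ordinary arithmetic, Prometheus operators have precedence. From highest to lowest: (^) > (*, /, %) > (+, -) > (==, !=, <=, <, >=, >) > (and, unless) > (or).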
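A few sketches of how these operators behave in practice, again using http_requests_total as a placeholder. The bool modifier and the unless operator are standard PromQL; the thresholds are arbitrary:

```
# Filter: keep only the series whose value exceeds 100
http_requests_total > 100

# bool modifier: return 1 or 0 for every series instead of filtering
http_requests_total > bool 100

# unless: series on the left with no identically labelled series on the right,
# which here leaves every series whose code is not "200"
http_requests_total unless http_requests_total{code="200"}

# Precedence: * binds tighter than >, so this compares (value * 2) with 100
http_requests_total * 2 > 100
```

The filtering behaviour of plain comparisons is what alerting rules build on: an expression that returns no series means nothing is wrong.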

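Built-in functions

Prometheus has many built-in functions for querying and shaping data, for example floor and ceil, which round a floating point result down or up to an integer: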
```
floor(avg(http_requests_total{code="200"}))
ceil(avg(http_requests_total{code="200"}))
```
To see the average per-second rate of increase of http_requests_total over the last 5 minutes:
```
rate(http_requests_total[5m])
```
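rate is meant for counters, and it is usually combined with an aggregation. Some hedged sketches, assuming http_requests_total is a counter with a code label as in the earlier examples:

```
# Per-second request rate, summed per status code
sum by (code) (rate(http_requests_total[5m]))

# Approximate number of requests added over the last hour
increase(http_requests_total[1h])

# Fraction of requests answered with a 5xx code
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```

Take the rate first and aggregate afterwards. Summing counters before applying rate hides counter resets on individual series and produces misleading results.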
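Alerting rules

With the material above you know how to fetch a metric. So how do you alert on it? Again, an example first.

How is alerting done in Zabbix? Say a single Nginx should alert once its active metric exceeds 10,000: in the trigger you pick the Nginx active item, choose greater than 10000, and the alert fires.

Prometheus works the same way. You get the current value with this expression: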
```
nginx_http_connections{instance="172.18.11.192:9145",state="active"}
```
As mentioned, PromQL supports comparison operators, so the alert condition is simply:
```
nginx_http_connections{instance="172.18.11.192:9145",state="active"} > 10000
```
Simple, right? Here is the complete alerting rule:
```
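groups:
- name: Nginx
  rules:
  - alert: HighErrorRate
    expr: nginx_http_connections{instance="172.18.11.192:9145",state="active"} > 10000
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Aaaah! Connections over 10,000 (instance {{ $labels.instance }})"
      description: "Nginx connections now VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
# groups: a list of rule groups; each group holds a set of related rules
# alert: name of the alerting rule
# expr: the PromQL expression that triggers the alert
# for: how long the expression must stay true before the alert fires
# labels: custom labels attached to the alert
# annotations: extra descriptive information passed along to Alertmanager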
```
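The rule above is pinned to a single instance. As a sketch of a variation, dropping the instance matcher lets one rule cover every Nginx target, because the expression is evaluated per series and each series over the threshold becomes its own alert. The rule name below is hypothetical, and job="Nginx" is taken from the query output earlier in the post:

```
groups:
- name: Nginx
  rules:
  - alert: NginxHighActiveConnections   # hypothetical rule name
    expr: nginx_http_connections{job="Nginx",state="active"} > 10000
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Nginx active connections over 10,000 on {{ $labels.instance }}"
      description: "Current value: {{ $value }}"
```

A rule file like this can be validated with promtool check rules before Prometheus loads it through the rule_files setting in prometheus.yml.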
The next post will cover how to set up alerting with Prometheus.
Original post: https://www.cnblogs.com/fsckzy/p/13335173.html
