接续上篇,本篇介绍elasticsearch聚合查询,使用python库elasticsearch-dsl进行聚合查询操作。
条形图
聚合有一个令人激动的特性就是能够十分容易地将数据转换成图表和图形。
-
- 创建直方图需要指定一个区间,如果我们要为售价创建一个直方图,可以将间隔设为 20,000。这样做将会在每个 $20,000 档创建一个新桶,然后文档会被分到对应的桶中。
1 GET cars/transactions/_search 2 { 3 "size": 0, 4 "aggs": { 5 "price": { 6 "histogram": { 7 "field": "price", 8 "interval": 20000 9 }, 10 "aggs": { 11 "revenue": { 12 "sum": { 13 "field": "price" 14 } 15 } 16 } 17 } 18 } 19 }
1 s = Search(index='cars') 2 s.aggs.bucket("price", "histogram", field="price", interval=20000).metric("revenue", "sum", field="price") 3 response = s.execute()
图形化表示
- 更强大的统计
1 GET /cars/transactions/_search 2 { 3 "size" : 0, 4 "aggs": { 5 "makes": { 6 "terms": { 7 "field": "make", 8 "size": 10 9 }, 10 "aggs": { 11 "stats": { 12 "extended_stats": { 13 "field": "price" 14 } 15 } 16 } 17 } 18 } 19 }
1 s = Search(index='cars') 2 s.aggs.bucket("makes", "terms", field="make", size=10).metric("stats", "extended_stats", field="price") 3 response = s.execute()
- 按时间统计(date_histogram),每月销售了多少台汽车?
1 GET cars/transactions/_search 2 { 3 "size": 0, 4 "aggs": { 5 "sales": { 6 "date_histogram": { 7 "field": "sold", 8 "interval": "month", 9 "format": "yyyy-MM-dd", 10 "extended_bounds": { 11 "min": "2014-01-01", 12 "max": "2014-12-31" 13 } 14 } 15 } 16 } 17 }
1 s = Search(index='cars') 2 s.aggs.bucket("sales", "date_histogram", field="sold", interval="month", 3 format="yyyy-MM-dd", extended_bounds={"min": "2014-01-01", "max": "2014-12-31"}) 4 response = s.execute()
- 计算每个季度所有汽车品牌的销售总额以及每种汽车品牌的销售总额
1 GET cars/transactions/_search 2 { 3 "size": 0, 4 "aggs": { 5 "sales": { 6 "date_histogram": { 7 "field": "sold", 8 "interval": "quarter", 9 "format": "yyyy-MM-dd", 10 "extended_bounds": { 11 "min": "2014-01-01", 12 "max": "2014-12-31" 13 } 14 }, 15 "aggs": { 16 "per_make_sum": { 17 "terms": { 18 "field": "make" 19 }, 20 "aggs": { 21 "sum_price": { 22 "sum": { 23 "field": "price" 24 } 25 } 26 } 27 }, 28 "total_sum": { 29 "sum": { 30 "field": "price" 31 } 32 } 33 } 34 } 35 } 36 }
1 s = Search(index='cars') 2 a1 = A("date_histogram", field="sold", interval="quarter", format="yyyy-MM-dd", 3 extended_bounds={"min": "2014-01-01", "max": "2014-12-31"}) 4 a2 = A("terms", field="make") 5 s.aggs.bucket("sales", a1).bucket("per_make_sum", a2).metric("sum_price", "sum", field="price") 6 s.aggs["sales"].metric("total_sum", "sum", field="price") 7 response = s.execute()
- 限定范围的聚合,福特在售车有多少种颜色?
1 GET cars/transactions/_search 2 { 3 "query": { 4 "match": { 5 "make": "ford" 6 } 7 }, 8 "aggs": { 9 "colors": { 10 "terms": { 11 "field": "make" 12 } 13 } 14 } 15 }
1 s = Search(index="cars").query("match", make="ford") 2 s.aggs.bucket("colors", "terms", field="make") 3 response = s.execute()
- 全局桶(全局桶包含所有的文档,它无视查询的范围),比方说我们想知道福特汽车与所有汽车平均售价的比较
1 GET cars/transactions/_search 2 { 3 "query": { 4 "match": { 5 "make": "ford" 6 } 7 }, 8 "aggs": { 9 "single_avg_price": { 10 "avg": { 11 "field": "price" 12 } 13 }, 14 "all": { 15 "global": {}, --global忽略过滤条件 16 "aggs": { 17 "avg_price": { 18 "avg": { 19 "field": "price" 20 } 21 } 22 } 23 } 24 } 25 }
1 s = Search(index="cars").query("match", make="ford") 2 s.aggs.metric("single_avg_price", "avg", field="price") 3 s.aggs.bucket("all", "global").metric("avg_price", "avg", field="price") 4 response = s.execute()
- 过滤,找到售价在 $10,000 美元之上的所有汽车同时也为这些车计算平均售价
1 GET cars/transactions/_search 2 { 3 "query": { 4 "constant_score": { 5 "filter": { 6 "range": { 7 "price": { 8 "gte": 10000 9 } 10 } 11 } 12 } 13 }, 14 "aggs": { 15 "single_avg_price": { 16 "avg": { 17 "field": "price" 18 } 19 } 20 } 21 }
1 s = Search(index="cars").query("range", price={"gte": 10000}) 2 s.aggs.metric("single_avg_price", "avg", field="price") 3 response = s.execute()
- 过滤桶(一种特殊桶),搜索福特汽车在2014年上半年销售汽车的均价
1 GET /cars/transactions/_search 2 { 3 "size" : 0, 4 "query":{ 5 "match": { 6 "make": "ford" 7 } 8 }, 9 "aggs":{ 10 "recent_sales": { 11 "filter": { 12 "range": { 13 "sold": { 14 "from": "2014-01-01", 15 "to": "2014-06-30" 16 } 17 } 18 }, 19 "aggs": { 20 "average_price":{ 21 "avg": { 22 "field": "price" 23 } 24 } 25 } 26 } 27 } 28 }
1 s = Search(index="cars").query("match", make="ford") 2 q = Q("range", sold={"from": "2014-01-01", "to": "2014-06-30"}) 3 s.aggs.bucket("recent_sales", "filter", q).metric("average_price", "avg", field="price") 4 response = s.execute()
- 后过滤器(post_filter),只过滤搜索结果,不过滤聚合结果,对聚合没有影响
1 GET cars/transactions/_search 2 { 3 4 "query": { 5 "match": { 6 "make": "ford" 7 } 8 }, 9 "post_filter": { 10 "term": { 11 "color": "green" 12 } 13 }, 14 "aggs": { 15 "all_colors": { 16 "terms": { 17 "field": "color" 18 } 19 } 20 } 21 }
1 s = Search(index="cars").query("match", make="ford").post_filter("term", color="green") 2 s.aggs.bucket("all_colors", "terms", field="color") 3 response = s.execute()
- 创建直方图需要指定一个区间,如果我们要为售价创建一个直方图,可以将间隔设为 20,000。这样做将会在每个 $20,000 档创建一个新桶,然后文档会被分到对应的桶中。
内置排序
- _count:按文档数排序。对 terms 、 histogram 、 date_histogram 有效
- _term:按词项的字符串值的字母顺序排序。只在 terms 内使用
- _key:按每个桶的键值数值排序(理论上与 _term 类似)。 只在 histogram 和 date_histogram 内使用
-
- 让我们做一个 terms 聚合但是按 doc_count 值的升序排序
1 GET cars/transactions/_search 2 { 3 "size": 0, 4 "aggs": { 5 "colors": { 6 "terms": { 7 "field": "color", 8 "order": { 9 "_count": "asc" 10 } 11 } 12 } 13 } 14 }
1 s = Search(index="cars") 2 s.aggs.bucket("colors", "terms", field="color", order={"_count": "asc"}) 3 response = s.execute()
- 按度量排序,按照汽车颜色分类,再按照汽车平均售价升序排列
1 GET cars/transactions/_search 2 { 3 "size": 0, 4 "aggs": { 5 "colors": { 6 "terms": { 7 "field": "color", 8 "order": { 9 "avg_price": "asc" 10 } 11 }, 12 "aggs": { 13 "avg_price": { 14 "avg": { 15 "field": "price" 16 } 17 } 18 } 19 } 20 } 21 }
1 s = Search(index="cars") 2 s.aggs.bucket("colors", "terms", field="color", order={"avg_price": "asc"}).metric("avg_price", "avg", field="price") 3 response = s.execute()
- 基于“深度”度量排序
- 让我们做一个 terms 聚合但是按 doc_count 值的升序排序
我们可以定义更深的路径,将度量用尖括号( > )嵌套起来,像这样: my_bucket>another_bucket>metric 。
需要提醒的是嵌套路径上的每个桶都必须是 单值 的。 filter 桶生成 一个单值桶:所有与过滤条件匹配的文档都在桶中。 多值桶(如:terms )动态生成许多桶,无法通过指定一个确定路径来识别。
目前,只有三个单值桶: filter 、 global 和 reverse_nested 。
-
- 让我们快速用示例说明,创建一个汽车售价的直方图,但是按照红色和绿色(不包括蓝色)车各自的方差来排序
1 GET /cars/transactions/_search 2 { 3 "size" : 0, 4 "aggs" : { 5 "colors" : { 6 "histogram" : { 7 "field" : "price", 8 "interval": 20000, 9 "order": { 10 "red_green_cars>stats.variance" : "asc" 11 } 12 }, 13 "aggs": { 14 "red_green_cars": { 15 "filter": { "terms": {"color": ["red", "green"]}}, 16 "aggs": { 17 "stats": {"extended_stats": {"field" : "price"}} 18 } 19 } 20 } 21 } 22 } 23 }
1 s = Search(index="cars") 2 a = A("histogram", field="price", interval=20000, order={"red_green_cars>stats.variance": "asc"}) 3 q = A("filter", filter={"terms": {"color": ["red", "green"]}}) 4 s.aggs.bucket("colors", a).bucket("red_green_cars", q).metric("stats", "extended_stats", field="price") 5 response = s.execute()
- 让我们快速用示例说明,创建一个汽车售价的直方图,但是按照红色和绿色(不包括蓝色)车各自的方差来排序