• Elastic Stack Notes (7): Aggregation Analysis in Elasticsearch 5.6


    Blog: http://www.moonxy.com

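    1. Introduction

    Elasticsearch is a distributed full-text search engine, and indexing and searching are its core functions. Its aggregations feature is also very powerful and supports complex statistical analysis over the data. Elasticsearch provides four main families of aggregations: metric, bucket, pipeline, and matrix. The first two, metric and bucket aggregations, are the ones most worth mastering.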

    Official documentation on aggregations: Aggregations

    2. Aggregation Analysis

    2.1 Metric aggregations

    Official documentation on metric aggregations: Metric

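    Metric aggregations mainly include min, max, sum, avg, stats, extended_stats, value_count, and so on; they are comparable to the aggregate functions in SQL. The official documentation describes them as follows:

    Aggregations that keep track and compute metrics over a set of documents.

    The sections below walk through several of them, starting with max.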

    Max Aggregation

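    Official documentation on the max aggregation: Max Aggregation

    The max aggregation returns the maximum value of a field, much like the max() aggregate function in SQL. In the request below, "max_price" is a user-defined aggregation name.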

    ##Max Aggregation
    GET books/_search
    {
      "size": 0, 
      "aggs": {
        "max_price": {
          "max":  {
            "field": "price"
          }
        }
      }
    }

    The response is as follows:

    {
      "took": 6,
      "timed_out": false,
      "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 5,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "max_price": {
          "value": 81.4
        }
      }
    }

    Cardinality Aggregation

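    Official documentation on the cardinality aggregation: Cardinality Aggregation

    The cardinality aggregation counts distinct values. It works like distinct in SQL: duplicates are removed from the set of values, and the size of the deduplicated set is returned.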

    ##Cardinality Aggregation
    GET books/_search
    {
      "size": 0, 
      "aggs": {
        "all_language": {
          "cardinality":  {
            "field": "language"
          }
        }
      }
    }

    The response is as follows:

    {
      "took": 41,
      "timed_out": false,
      "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 5,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "all_language": {
          "value": 3
        }
      }
    }
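
    As a rough local illustration of what the request above computes, the following Python sketch counts distinct values with a set. The languages list is stand-in data inferred from the terms aggregation response later in this post (two java, two python, one javascript), not the actual index contents. Note that Elasticsearch computes cardinality approximately (it uses the HyperLogLog++ algorithm), whereas a set gives an exact count; for this small data set the response above shows the same value.

```python
# Stand-in data (an assumption): the language of each of the five books,
# inferred from the terms aggregation response shown later in this post.
languages = ["java", "java", "python", "python", "javascript"]


def cardinality(values):
    """Exact distinct count: deduplicate the values, then count them."""
    return len(set(values))


print(cardinality(languages))  # 3
```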

    Stats Aggregation

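    Official documentation on the stats aggregation: Stats Aggregation

    The stats aggregation provides basic statistics and returns five metrics in a single response: count, max, min, avg, and sum. For example: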

    ##Stats Aggregation
    GET books/_search
    {
      "size": 0,
      "aggs": {
        "stats_price": {
          "stats":  {
            "field": "price"
          }
        }
      }
    }

    The response is as follows:

    {
      "took": 5,
      "timed_out": false,
      "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 5,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "stats_price": {
          "count": 5,
          "min": 46.5,
          "max": 81.4,
          "avg": 63.8,
          "sum": 319
        }
      }
    }
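
    The five metrics can be reproduced locally with a few lines of Python. The post does not show the indexed documents, so the prices list below is an assumption: it is the one set of five prices consistent with the min, max, sum, per-language averages, and sum of squares reported in the responses in this post. This is a sketch of the arithmetic, not of how Elasticsearch computes it across shards.

```python
import math

# Assumed prices, reconstructed from the responses shown in this post.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]


def stats(values):
    """Return the five metrics reported by a stats aggregation."""
    total = math.fsum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


result = stats(prices)
print(result)
```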

    Extended Stats Aggregation

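    Official documentation on the extended stats aggregation: Extended Stats Aggregation

    The extended stats aggregation is similar to stats but returns four additional results: the sum of squares, the variance, the standard deviation, and the interval spanning the mean plus and minus two standard deviations.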

    ##Extended Stats Aggregation
    GET books/_search
    {
      "size": 0,
      "aggs": {
        "extended_stats_price": {
          "extended_stats":  {
            "field": "price"
          }
        }
      }
    }

    The response is as follows:

    {
      "took": 14,
      "timed_out": false,
      "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 5,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "extended_stats_price": {
          "count": 5,
          "min": 46.5,
          "max": 81.4,
          "avg": 63.8,
          "sum": 319,
          "sum_of_squares": 21095.46,
          "variance": 148.65199999999967,
          "std_deviation": 12.19229264740638,
          "std_deviation_bounds": {
            "upper": 88.18458529481276,
            "lower": 39.41541470518724
          }
        }
      }
    }
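
    The extra fields follow from the basic ones. The figures in the response above are consistent with a population variance, that is, sum_of_squares / count - avg², with the bounds taken as avg ± 2 standard deviations. The Python sketch below checks that arithmetic; the prices list is an assumption reconstructed from the responses in this post, and the sigma parameter mirrors the option of the same name on extended_stats (default 2).

```python
import math

# Assumed prices, reconstructed from the responses shown in this post.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]


def extended_stats(values, sigma=2.0):
    """Compute the additional fields of an extended_stats response."""
    n = len(values)
    avg = math.fsum(values) / n
    sum_of_squares = math.fsum(v * v for v in values)
    # Population variance: mean of squares minus square of the mean.
    variance = sum_of_squares / n - avg * avg
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


ext = extended_stats(prices)
print(ext)
```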

    Value Count Aggregation

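    Official documentation on the value count aggregation: Value Count Aggregation

    The value count aggregation counts the number of values extracted from a field across the matching documents. For a single-valued field that is present in every document, this equals the number of documents.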

    ##Value Count Aggregation
    GET books/_search
    {
      "size": 0, 
      "aggs": {
        "doc_count": {
          "value_count":  {
            "field": "author"
          }
        }
      }
    }

    The response is as follows:

    {
      "took": 6,
      "timed_out": false,
      "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 5,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "doc_count": {
          "value": 5
        }
      }
    }
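
    Strictly speaking, value_count counts values rather than documents, so the result can differ from the number of hits when a field is missing or holds several values. A minimal Python sketch of that behaviour, using made-up documents (the author names are hypothetical and not taken from the books index):

```python
# Hypothetical documents, for illustration only.
docs = [
    {"author": "alice"},                   # one value
    {"author": ["bob", "carol", "dave"]},  # multi-valued field: three values
    {"title": "no author here"},           # field missing: contributes nothing
]


def value_count(documents, field):
    """Count every value of `field`, not the documents that contain it."""
    count = 0
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue
        count += len(value) if isinstance(value, list) else 1
    return count


print(value_count(docs, "author"))  # 4 values across 3 documents
```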

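    Note:

    By default, text fields cannot be used for sorting or aggregations, because fielddata is disabled on them; this applies to the terms aggregation as well unless fielddata is enabled. The following request aggregates on the title field, which is mapped as text: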

    GET books/_search
    {
      "size": 0, 
      "aggs": {
        "doc_count": {
          "value_count":  {
            "field": "title"
          }
        }
      }
    }

    The response is as follows:

    {
      "error": {
        "root_cause": [
          {
            "type": "illegal_argument_exception",
            "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
          }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
          {
            "shard": 0,
            "index": "books",
            "node": "6n3douACShiPmlA9j2soBw",
            "reason": {
              "type": "illegal_argument_exception",
              "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
            }
          }
        ]
      },
      "status": 400
    }
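
    As the error message suggests, one common way around this is to aggregate on a keyword field instead. The Python sketch below only builds the request body; it assumes, hypothetically, that title was mapped with a keyword sub-field named title.keyword, which this post does not show.

```python
import json


def value_count_body(field):
    """Build a value_count request body for the given field."""
    return {
        "size": 0,
        "aggs": {"doc_count": {"value_count": {"field": field}}},
    }


# "title.keyword" is a hypothetical keyword sub-field of title.
body = value_count_body("title.keyword")
print(json.dumps(body, indent=2))
```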

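    2.2 Bucket aggregations

    Official documentation on bucket aggregations: Bucket Aggregations

    A bucket can be thought of as a container: the aggregation goes through the documents and places each one that meets a given criterion into the corresponding bucket. Bucketing is comparable to group by in SQL.

    Bucket aggregations include the following:

    The terms aggregation groups documents by field value. The following request counts the number of books for each programming language: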

    GET books/_search
    {
      "size": 0, 
      "aggs": {
        "terms_count": {
          "terms":  {
            "field": "language"
          }
        }
      }
    }

    The response is as follows:

    {
      "took": 31,
      "timed_out": false,
      "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 5,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "terms_count": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "java",
              "doc_count": 2
            },
            {
              "key": "python",
              "doc_count": 2
            },
            {
              "key": "javascript",
              "doc_count": 1
            }
          ]
        }
      }
    }

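    On top of terms bucketing, a metric aggregation can be applied to each bucket. For example, to compute the average price of each category of books, first run a terms aggregation on the language field and then nest an avg aggregation inside it. The query is as follows: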

    GET books/_search
    {
      "size": 0, 
      "aggs": {
        "terms_count": {
          "terms":  {
            "field": "language"
          },
          "aggs": {
            "avg_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }

    The response is as follows:

    {
      "took": 8,
      "timed_out": false,
      "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 5,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "terms_count": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "java",
              "doc_count": 2,
              "avg_price": {
                "value": 58.35
              }
            },
            {
              "key": "python",
              "doc_count": 2,
              "avg_price": {
                "value": 67.95
              }
            },
            {
              "key": "javascript",
              "doc_count": 1,
              "avg_price": {
                "value": 66.4
              }
            }
          ]
        }
      }
    }
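
    The nested request behaves like a group by followed by an average per group. The Python sketch below mimics it locally. The books list is an assumption: the languages and prices are reconstructed from the responses in this post, and which java or python book carries which price does not affect the result. Buckets are sorted by descending doc_count, with ties broken by key here simply to keep the output deterministic.

```python
from collections import defaultdict

# Assumed documents, reconstructed from the responses shown in this post.
books = [
    {"language": "java", "price": 46.5},
    {"language": "java", "price": 70.2},
    {"language": "python", "price": 81.4},
    {"language": "python", "price": 54.5},
    {"language": "javascript", "price": 66.4},
]


def terms_with_avg(documents, group_field, metric_field):
    """Group documents by one field, then average another field per group."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[group_field]].append(doc[metric_field])
    buckets = [
        {"key": key, "doc_count": len(vals), "avg": sum(vals) / len(vals)}
        for key, vals in groups.items()
    ]
    # Largest buckets first; ties broken by key for a stable order.
    buckets.sort(key=lambda b: (-b["doc_count"], b["key"]))
    return buckets


for bucket in terms_with_avg(books, "language", "price"):
    print(bucket["key"], bucket["doc_count"], round(bucket["avg"], 2))
```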

    Range Aggregation

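    The range aggregation buckets documents by value range and shows how the data is distributed. For example, the following request groups the books in the books index into three price ranges, namely below 50, from 50 to 80, and 80 and above: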

    GET books/_search
    {
      "size": 0, 
      "aggs": {
        "price_range": {
          "range": {
            "field": "price",
            "ranges": [
              {"to": 50},
              {"from": 50, "to": 80},
              {"from": 80}
            ]
          }
        }
      }
    }

    The response is as follows:

    {
      "took": 16,
      "timed_out": false,
      "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 5,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "price_range": {
          "buckets": [
            {
              "key": "*-50.0",
              "to": 50,
              "doc_count": 1
            },
            {
              "key": "50.0-80.0",
              "from": 50,
              "to": 80,
              "doc_count": 3
            },
            {
              "key": "80.0-*",
              "from": 80,
              "doc_count": 1
            }
          ]
        }
      }
    }
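
    In a range aggregation each range includes its from value and excludes its to value, so a price of exactly 50 would land in the second bucket, not the first. The Python sketch below reproduces the bucketing locally; the prices list is an assumption reconstructed from the responses in this post.

```python
# Assumed prices, reconstructed from the responses shown in this post.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]

ranges = [{"to": 50}, {"from": 50, "to": 80}, {"from": 80}]


def bound(value):
    """Format an open bound as '*' and a numeric bound as a float string."""
    return "*" if value is None else str(float(value))


def range_buckets(values, range_specs):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for spec in range_specs:
        lo, hi = spec.get("from"), spec.get("to")
        count = sum(
            1
            for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        )
        buckets.append({"key": f"{bound(lo)}-{bound(hi)}", "doc_count": count})
    return buckets


for bucket in range_buckets(prices, ranges):
    print(bucket)
```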

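    Range aggregations are not limited to numeric fields; they can also be applied to dates. The date range aggregation is dedicated to date fields, and it differs from the range aggregation in that the from and to values can be written as date math expressions.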
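
    To make the date math point concrete, the Python sketch below builds a date_range request body. The field name publish_time and the aggregation name publish_range are hypothetical, since the post does not show a date field in the books index. The expression now-12M/M means "twelve months before now, rounded down to the start of the month".

```python
import json

# Hypothetical names: "publish_range" and "publish_time" are not from the post.
body = {
    "size": 0,
    "aggs": {
        "publish_range": {
            "date_range": {
                "field": "publish_time",
                "format": "yyyy-MM-dd",
                "ranges": [
                    {"to": "now-12M/M"},    # older than twelve months
                    {"from": "now-12M/M"},  # within the last twelve months
                ],
            }
        }
    },
}

print(json.dumps(body, indent=2))
```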

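    2.3 Pipeline aggregations

    Official documentation on pipeline aggregations: Pipeline Aggregations

    Pipeline aggregations operate on the output of other aggregations rather than on documents.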
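
    As one concrete case, the avg_bucket pipeline aggregation averages a metric across the buckets of a sibling aggregation. In a request it would point at the nested average from section 2.2 through a buckets_path of "terms_count>avg_price". The Python sketch below mimics that calculation locally, using the per-language averages from the response in section 2.2; it illustrates the idea and is not Elasticsearch's implementation.

```python
# Per-language average prices, copied from the terms + avg response in 2.2.
buckets = [
    {"key": "java", "avg_price": 58.35},
    {"key": "python", "avg_price": 67.95},
    {"key": "javascript", "avg_price": 66.4},
]


def avg_bucket(bucket_list, metric):
    """Average one metric across buckets, skipping buckets without a value."""
    values = [b[metric] for b in bucket_list if b.get(metric) is not None]
    return sum(values) / len(values)


# The input here is the aggregation output (three buckets), not the documents.
print(round(avg_bucket(buckets, "avg_price"), 2))
```

    Note that this is the mean of the three bucket averages, which is not the same as the overall average price of 63.8, because the buckets hold different numbers of documents.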

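    2.4 Matrix aggregations

    Official documentation on matrix aggregations: Matrix Aggregations

    The matrix stats aggregation is a numeric aggregation that computes the following statistics over a set of document fields:

    Count: the number of samples per field included in the calculation;

    Mean: the average value of each field;

    Variance: how far the samples of each field spread out from the mean;

    Skewness: a measure of how asymmetrically the samples of each field are distributed around the mean;

    Kurtosis: a measure of the shape of each field's sample distribution;

    Covariance: a matrix that quantifies how changes in one field are associated with changes in another;

    Correlation: the covariance matrix scaled to the range [-1, 1]; it describes the relationship between the distributions of two fields.

    It is mainly used to compute the relationship between two numeric fields, for example between log record size and HTTP status code: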

    GET /_search
    {
        "aggs": {
            "statistics": {
                "matrix_stats": {
                    "fields": ["log_size", "status_code"]
                }
            }
        }
    }
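
    To show what the covariance and correlation entries mean, the Python sketch below computes both for two small series. The sample values are hypothetical, and the covariance uses the sample (n - 1) denominator as an assumption about normalization; the correlation does not depend on that choice because the denominators cancel.

```python
import math

# Hypothetical samples; not taken from any real log data.
log_size = [512.0, 1024.0, 2048.0, 4096.0]
status_code = [200.0, 200.0, 404.0, 500.0]


def mean(xs):
    return math.fsum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance (n - 1 denominator, an assumption)."""
    mx, my = mean(xs), mean(ys)
    return math.fsum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled into the range [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


print(covariance(log_size, status_code))
print(correlation(log_size, status_code))
```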
  • Original article: https://www.cnblogs.com/cnjavahome/p/9164078.html