• Elasticsearch Learning Series, Part 4 (Aggregations and Search Suggestions)


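    Aggregation Analysis

    Aggregation analysis is an important database feature: it computes aggregate values over the set of documents matched by a query, such as the maximum, minimum, sum, or average. In ES, computations that reduce a data set to numbers (sum, max, min, and so on) are called metric aggregations, while grouping data the way GROUP BY does in a relational database is called bucketing (bucket aggregations).

    Syntax: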

    "aggregations" : {
      "<aggregation_name>" : { <!-- name of the aggregation -->
        "<aggregation_type>" : { <!-- type of the aggregation -->
           <aggregation_body> <!-- body: which fields to aggregate -->
        }
        [,"meta" : { [<meta_data_body>] } ]? <!-- metadata -->
        [,"aggregations" : { [<sub_aggregation>]+ } ]? <!-- sub-aggregations defined inside this aggregation -->
      }
      [,"<aggregation_name_2>" : { ... } ]* <!-- further aggregations -->
    }
    

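    aggregations can be abbreviated as aggs.

    Metric Aggregations

    Example 1: the highest price among all items

    Set size to 0, since only the aggregation result is needed and not the matching documents.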

    POST /item/_search
    {
      "size":0,
      "aggs": {
        "max_price": {
          "max": {
            "field": "price"
          }
        }
      }
    }
    

    Example 2: counting documents

    POST /item/_count
    {
      "query": {
        "range": {
          "price": {
            "gte": 10,
            "lte": 5000
          }
        }
      }
    }
    

    Example 3: counting the documents that have a value for a field

    POST /item/_search?size=0
    {
      "aggs": {
        "price_count": {
          "value_count": {
            "field": "price"
          }
        }
      }
    }
    
    

    Example 4: distinct count with cardinality

    If several documents share the same price, only the de-duplicated count is reported.

    POST /item/_search?size=0
    {
      "aggs":{
        "price_count":{
          "cardinality": {
            "field": "price"
          }
        }
      }
    }
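
    To make the difference between value_count and cardinality concrete, here is a minimal Python sketch. The five prices are not listed in this article; the values below are an assumption, inferred from the stats and range outputs shown in it. The sketch counts exactly, whereas Elasticsearch estimates cardinality (with the HyperLogLog++ algorithm), so on very large fields its answer can be approximate.

```python
# Assumption: the article never lists the prices; these are inferred from its outputs.
PRICES = [2333, 2688, 2688, 5699, 6888]

# value_count: how many values the field holds, duplicates included.
value_count = len(PRICES)

# cardinality: how many distinct values the field holds (exact in this sketch).
cardinality = len(set(PRICES))

print(value_count, cardinality)  # prints: 5 4
```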
    

    Example 5: stats returns five values: count, max, min, avg, and sum

    POST /item/_search?size=0
    {
      "aggs":{
        "price_stats":{
          "stats": {
            "field": "price"
          }
        }
      }
    }
    

    The result:

    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 5,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "price_stats" : {
          "count" : 5,
          "min" : 2333.0,
          "max" : 6888.0,
          "avg" : 4059.2,
          "sum" : 20296.0
        }
      }
    }
    

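    Example 6: extended_stats, an enhanced version of stats that adds the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.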

    POST /item/_search?size=0
    {
      "aggs":{
        "price_stats":{
          "extended_stats": {
            "field": "price"
          }
        }
      }
    }
    

    Query result:

    {
      "took" : 4,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 5,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "price_stats" : {
          "count" : 5,
          "min" : 2333.0,
          "max" : 6888.0,
          "avg" : 4059.2,
          "sum" : 20296.0,
          "sum_of_squares" : 9.9816722E7,
          "variance" : 3486239.7599999993,
          "std_deviation" : 1867.1474928349928,
          "std_deviation_bounds" : {
            "upper" : 7793.494985669986,
            "lower" : 324.9050143300142
          }
        }
      }
    }
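
    As a cross-check of the extended_stats output above, the following Python sketch recomputes the same fields locally. The price list is an assumption (inferred from the outputs, not given in the article), and extended_stats here is a hypothetical helper, not an Elasticsearch API. It uses the population variance and a band of two standard deviations around the mean.

```python
import math

# Assumption: the article never lists the prices; these are inferred from its outputs.
PRICES = [2333, 2688, 2688, 5699, 6888]


def extended_stats(values, sigma=2.0):
    """Recompute the fields of an extended_stats result for a list of numbers."""
    count = len(values)
    total = float(sum(values))
    avg = total / count
    sum_of_squares = float(sum(v * v for v in values))
    # Population variance: mean of the squares minus the square of the mean.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


result = extended_stats(PRICES)
for name, value in result.items():
    print(name, value)
```

    Up to floating-point rounding, the printed values agree with the response above, which also supports the inferred price list.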
    

    Example 7: percentiles returns the value found at each requested percentile

    
    POST /item/_search?size=0
    {
      "aggs":{
        "price_percents":{
          "percentiles": {
            "field": "price"
            
          }
        }
      }
    }
    
    # Specify the percentiles explicitly
    POST /item/_search?size=0
    {
      "aggs":{
        "price_percents":{
          "percentiles": {
            "field": "price",
            "percents": [
              1,
              5,
              25,
              50,
              75,
              95,
              99
            ]
          }
        }
      }
    }
    

    Query result:

    ......
      "aggregations" : {
        "price_percents" : {
          "values" : {
            "1.0" : 2333.0000000000005,
            "5.0" : 2333.0,
            "25.0" : 2599.25,
            "50.0" : 2688.0,
            "75.0" : 5996.25,
            "95.0" : 6888.0,
            "99.0" : 6888.0
          }
        }
      }
    }
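
    The percentile values above can be reproduced by hand on this small data set. The Python sketch below is only an illustration under two assumptions: the price list is inferred from the article's outputs, and the interpolation rule (each sorted value sits at the midpoint of its share of the data) is a simplification chosen because it matches these numbers. Elasticsearch computes percentiles with an approximate algorithm (TDigest by default), so on larger data sets its results are estimates and need not follow this rule exactly.

```python
# Assumption: the article never lists the prices; these are inferred from its outputs.
PRICES = [2333, 2688, 2688, 5699, 6888]


def percentile(values, p):
    """Value at percentile p (0-100), interpolating between sample midpoints."""
    ordered = sorted(values)
    n = len(ordered)
    # Treat the i-th sorted value as sitting at quantile (i + 0.5) / n.
    position = p / 100.0 * n - 0.5
    # Below the first midpoint or above the last one, clamp to min or max.
    position = min(max(position, 0.0), n - 1.0)
    lower = int(position)
    upper = min(lower + 1, n - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


for p in (1, 5, 25, 50, 75, 95, 99):
    print(p, percentile(PRICES, p))
```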
    
    

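    Percentile ranks: the percentage of documents whose value is less than or equal to a given value

    The percentage of documents with a price at or below 3000 and at or below 5000: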

    POST /item/_search?size=0
    {
      "aggs":{
        "price_percents":{
          "percentile_ranks": {
            "field": "price"
            , "values": [3000,5000]
          }
        }
      }
    }
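
    For intuition, the quantity that percentile_ranks estimates can be written as a one-line computation. This Python sketch applies the exact definition to an assumed price list (inferred from the article's outputs, not stated in it). Elasticsearch answers this request with the same approximate algorithm it uses for percentiles and may interpolate between values, so the figures it returns can differ from the exact shares computed here.

```python
# Assumption: the article never lists the prices; these are inferred from its outputs.
PRICES = [2333, 2688, 2688, 5699, 6888]


def percentile_rank(values, threshold):
    """Exact percentage of values that are less than or equal to threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


for threshold in (3000, 5000):
    print(threshold, percentile_rank(PRICES, threshold))
```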
    

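    Bucket Aggregations

    A bucket aggregation groups documents: documents that share a given characteristic are placed in the same bucket. The output is typically a list of buckets, each containing multiple documents.

    Example 1: average per group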

    POST /item/_search
    {
      "size": 0,
      "aggs": {
        "group_by_price": {
          "range": {
            "field": "price",
            "ranges": [
              {
                "from": 50,
                "to": 100
              },
              {
                "from": 2000,
                "to": 3000
              },
              {
                "from": 3000,
                "to": 7000
              }
            ]
          },
          "aggs": {
            "average_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
    
    

    Query result:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 5,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "group_by_price" : {
          "buckets" : [
            {
              "key" : "50.0-100.0",
              "from" : 50.0,
              "to" : 100.0,
              "doc_count" : 0,
              "average_price" : {
                "value" : null
              }
            },
            {
              "key" : "2000.0-3000.0",
              "from" : 2000.0,
              "to" : 3000.0,
              "doc_count" : 3,
              "average_price" : {
                "value" : 2569.6666666666665
              }
            },
            {
              "key" : "3000.0-7000.0",
              "from" : 3000.0,
              "to" : 7000.0,
              "doc_count" : 2,
              "average_price" : {
                "value" : 6293.5
              }
            }
          ]
        }
      }
    }
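
    The bucketing logic behind this response can be sketched in a few lines of Python. The price list is an assumption inferred from the article's outputs, and range_buckets is a hypothetical helper. The point it illustrates is that in a range aggregation the from value is included and the to value is excluded, and that an empty bucket has no average.

```python
# Assumption: the article never lists the prices; these are inferred from its outputs.
PRICES = [2333, 2688, 2688, 5699, 6888]
RANGES = [(50, 100), (2000, 3000), (3000, 7000)]


def range_buckets(values, ranges):
    """Group values into [from, to) ranges and average each group."""
    buckets = []
    for lower, upper in ranges:
        # from is inclusive, to is exclusive.
        members = [v for v in values if lower <= v < upper]
        average = sum(members) / len(members) if members else None
        buckets.append({
            "key": f"{float(lower)}-{float(upper)}",
            "doc_count": len(members),
            "average_price": average,
        })
    return buckets


buckets = range_buckets(PRICES, RANGES)
for bucket in buckets:
    print(bucket)
```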
    
    

    Example 2: document count per group

    POST /item/_search
    {
      "size": 0,
      "aggs": {
        "group_by_price": {
          "range": {
            "field": "price",
            "ranges": [
              {
                "from": 50,
                "to": 100
              },
              {
                "from": 2000,
                "to": 3000
              },
              {
                "from": 3000,
                "to": 7000
              }
            ]
          },
          "aggs": {
            "price_count": {
              "value_count": {
                "field": "price"
              }
            }
          }
        }
      }
    }
    

    Example 3: HAVING-style filtering of buckets with bucket_selector

    POST /item/_search
    {
      "size": 0,
      "aggs": {
        "group_by_price": {
          "range": {
            "field": "price",
            "ranges": [
              {
                "from": 50,
                "to": 100
              },
              {
                "from": 2000,
                "to": 3000
              },
              {
                "from": 3000,
                "to": 7000
              }
            ]
          },
          "aggs": {
            "average_price": {
              "avg": {
                "field": "price"
              }
            },
            "having":{
              "bucket_selector": {
                "buckets_path": {
                  "avg_price":"average_price"
                },
                "script": {
                  "source": "params.avg_price >=2600"
                }
              }
            }
          }
      
        }
      }
    }
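
    What the "having" pipeline aggregation does to the buckets can be illustrated with a small Python sketch. The bucket values are copied from the range aggregation response shown earlier in this article; bucket_selector below is a hypothetical helper, and dropping the empty bucket (whose average is null) is an assumption about how missing values are treated.

```python
# Buckets as returned by the range + avg aggregation earlier in the article.
BUCKETS = [
    {"key": "50.0-100.0", "doc_count": 0, "average_price": None},
    {"key": "2000.0-3000.0", "doc_count": 3, "average_price": 2569.6666666666665},
    {"key": "3000.0-7000.0", "doc_count": 2, "average_price": 6293.5},
]


def bucket_selector(buckets, buckets_path, script):
    """Keep only the buckets whose metric at buckets_path satisfies script."""
    kept = []
    for bucket in buckets:
        value = bucket[buckets_path]
        # Assumption: a bucket with no metric value is dropped.
        if value is not None and script(value):
            kept.append(bucket)
    return kept


# Mirrors the script "params.avg_price >= 2600".
selected = bucket_selector(BUCKETS, "average_price", lambda avg_price: avg_price >= 2600)
print([bucket["key"] for bucket in selected])  # prints: ['3000.0-7000.0']
```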
    

    Search Suggestions

    First, create some test data:

    PUT /blogs/
    {
      "mappings": {
        "properties": {
          "body":{
            "type": "text"
          }
        }
      }
    }
    
    POST _bulk/?refresh=true
    { "index" : { "_index" : "blogs" } }
    { "body": "Lucene is cool"}
    { "index" : { "_index" : "blogs" } }
    { "body": "Elasticsearch builds on top of lucene"}
    { "index" : { "_index" : "blogs" } }
    { "body": "Elasticsearch rocks"}
    { "index" : { "_index" : "blogs" } }
    { "body": "Elastic is the company behind ELK stack"}
    { "index" : { "_index" : "blogs" } }
    { "body": "elk rocks"}
    { "index" : { "_index" : "blogs"} }
    { "body": "elasticsearch is rock solid"}
    
    
    Term Suggester

    Search:

    POST /blogs/_search
    {
      "suggest": {
        "my_suggest": {
          "text": "rock",
          "term": {
            "field": "body",
            "suggest_mode":"missing"
          }
        }
      }
    }
    

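    suggest_mode accepts three values:

    • missing: suggest only for terms that are not in the index; since rock already exists, nothing is suggested for it
    • popular: even though rock is in the dictionary, suggest similar terms that have a higher frequency
    • always: suggest similar terms whether or not the term is in the dictionary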
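
    The three modes can be illustrated with a simplified Python sketch. It is not how Elasticsearch implements the Term Suggester: term_suggest and edit_distance are hypothetical helpers, candidates are found with plain Levenshtein distance (at most 2 edits, the same value as the suggester's default max_edits), and the document frequencies are counted by hand from a subset of the terms in the sample blogs documents.

```python
# Document frequencies for a few terms of the sample data (counted by hand).
DOC_FREQ = {"rock": 1, "rocks": 2, "lucene": 2, "elk": 2, "stack": 1, "solid": 1}


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def term_suggest(term, doc_freq, suggest_mode="missing", max_edits=2):
    """Return similar indexed terms, closest and most frequent first."""
    own_freq = doc_freq.get(term, 0)
    if suggest_mode == "missing" and own_freq > 0:
        return []  # the term is already in the index
    candidates = []
    for other, freq in doc_freq.items():
        if other == term:
            continue
        distance = edit_distance(term, other)
        if distance > max_edits:
            continue
        if suggest_mode == "popular" and freq <= own_freq:
            continue  # only terms found in more documents qualify
        candidates.append((distance, -freq, other))
    return [other for _, _, other in sorted(candidates)]


for mode in ("missing", "popular", "always"):
    print(mode, term_suggest("rock", DOC_FREQ, mode))
```

    Under these assumptions, "rock" gets no suggestion in missing mode and "rocks" in the other two, while the misspelled "lucne" is corrected to "lucene" in every mode.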

    Phrase Suggester

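    Building on the Term Suggester, the Phrase Suggester also takes the relationship between multiple terms into account, such as whether they occur together in the indexed text and how close they are to each other.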

    POST /blogs/_search
    {
      "suggest": {
        "my_suggest": {
          "text": "lucne and elasticsear rock",
          "phrase": {
            "field": "body",
            "highlight":{
              "pre_tag":"<em>",
              "post_tag":"</em>"
            }
          }
        }
      }
    }
    

    Search result:

    {
      "took" : 41,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 0,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "suggest" : {
        "my_suggest" : [
          {
            "text" : "lucne and elasticsear rock",
            "offset" : 0,
            "length" : 26,
            "options" : [
              {
                "text" : "lucene and elasticsearch rock",
                "highlighted" : "<em>lucene</em> and <em>elasticsearch</em> rock",
                "score" : 0.004993905
              },
              {
                "text" : "lucne and elasticsearch rock",
                "highlighted" : "lucne and <em>elasticsearch</em> rock",
                "score" : 0.0033391973
              },
              {
                "text" : "lucene and elasticsear rock",
                "highlighted" : "<em>lucene</em> and elasticsear rock",
                "score" : 0.0029183894
              }
            ]
          }
        ]
      }
    }
    
    

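    options is returned directly as a list of phrases. Because lucene and elasticsearch appear together in one of the indexed documents, the suggestion that corrects both terms is more credible and therefore receives the highest score.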

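    Completion Suggester

    It targets auto-completion, where a request is sent to the backend to look for matches every time the user types a character. Its data structure therefore differs from the two suggesters above: instead of an inverted index, the analyzed input is encoded into an FST (finite state transducer) that is stored alongside the index. For an open index, ES loads the whole FST into memory, which makes prefix lookups extremely fast. The trade-off is that an FST can only serve prefix queries. To use the Completion Suggester, the field type must be defined as completion.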

    PUT /blogs_completion
    {
      "mappings": {
        "properties": {
          "body":{
            "type": "completion"
          }
        }
      }
    }
    

    Insert some test data:

    POST _bulk/?refresh=true
    { "index" : { "_index" : "blogs_completion" } }
    { "body": "Lucene is cool"}
    { "index" : { "_index" : "blogs_completion" } }
    { "body": "Elasticsearch builds on top of lucene"}
    { "index" : { "_index" : "blogs_completion"} }
    { "body": "Elasticsearch rocks"}
    { "index" : { "_index" : "blogs_completion" } }
    { "body": "Elastic is the company behind ELK stack"}
    { "index" : { "_index" : "blogs_completion" } }
    { "body": "the elk stack rocks"}
    { "index" : { "_index" : "blogs_completion"} }
    { "body": "elasticsearch is rock solid"}
    

    Lookup example:

    POST /blogs_completion/_search?pretty
    {
      "size": 0,
      "suggest": {
        "blog-suggest": {
          "prefix": "elastic i",
          "completion": {
            "field": "body"
          }
        }
      }
    }
    

    The suggestions returned:

    # (partially omitted)
    "options" : [
              {
                "text" : "Elastic is the company behind ELK stack",
                "_index" : "blogs_completion",
                "_type" : "_doc",
                "_id" : "SG16oIEB1fsyWKAeKha5",
                "_score" : 1.0,
                "_source" : {
                  "body" : "Elastic is the company behind ELK stack"
                }
              }
            ]
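
    The prefix-only behaviour can be illustrated with a small Python sketch. It is a stand-in, not the real data structure: instead of an FST it keeps the lowercased inputs in a sorted list and uses binary search, and PrefixIndex is a hypothetical class. Lowercasing plays the role of the analyzer here.

```python
import bisect

DOCS = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "the elk stack rocks",
    "elasticsearch is rock solid",
]


class PrefixIndex:
    """Sorted-list stand-in for the in-memory structure used for completion."""

    def __init__(self, inputs):
        # Store (lowercased key, original text), ordered by key.
        self._entries = sorted((text.lower(), text) for text in inputs)
        self._keys = [key for key, _ in self._entries]

    def suggest(self, prefix, size=5):
        key = prefix.lower()
        # All keys starting with the prefix form one contiguous run.
        start = bisect.bisect_left(self._keys, key)
        matches = []
        for entry_key, original in self._entries[start:]:
            if not entry_key.startswith(key) or len(matches) == size:
                break
            matches.append(original)
        return matches


index = PrefixIndex(DOCS)
print(index.suggest("elastic i"))  # ['Elastic is the company behind ELK stack']
print(index.suggest("rock"))       # [] because no input starts with "rock"
```

    The second lookup shows the limitation described above: "rock" occurs inside several inputs, but none of them starts with it.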
    

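    Note that the analyzer affects the suggestions. With the english analyzer, for example, the word "is" is removed as a stop word, so the suggestion can no longer be matched. The preserve_separators and preserve_position_increments settings also affect the lookup.

    • preserve_separators: when set to false, separators such as spaces are ignored
    • preserve_position_increments: if the first word of a suggestion is a stop word and the analyzer in use filters stop words, set this to false

    If the Completion Suggester returns zero matches, the user may have mistyped, so try the Phrase Suggester next. If the Phrase Suggester finds no option either, fall back to the Term Suggester.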

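    Context Suggester

    An extension of the Completion Suggester that lets a search carry additional context information, so that the same input, for example "star", yields different suggestions depending on the context:

    • Coffee-related: Starbucks
    • Movie-related: Star Wars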
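
    The idea can be sketched in Python as a prefix lookup that is additionally filtered by a context label. Everything here is hypothetical and for illustration only: the ContextIndex class, the context names "coffee" and "movie", and the sample entries are not taken from Elasticsearch or from this article's data.

```python
class ContextIndex:
    """Prefix suggestions restricted to entries tagged with a given context."""

    def __init__(self):
        self._entries = []  # (lowercased text, original text, contexts)

    def add(self, text, contexts):
        self._entries.append((text.lower(), text, frozenset(contexts)))

    def suggest(self, prefix, context):
        key = prefix.lower()
        return sorted(
            original
            for lowered, original, contexts in self._entries
            if lowered.startswith(key) and context in contexts
        )


suggester = ContextIndex()
suggester.add("Starbucks", ["coffee"])
suggester.add("Star Wars", ["movie"])
suggester.add("Star Trek", ["movie"])

print(suggester.suggest("star", "coffee"))  # ['Starbucks']
print(suggester.suggest("star", "movie"))   # ['Star Trek', 'Star Wars']
```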
  • Original article: https://www.cnblogs.com/javammc/p/16411055.html