• Elasticsearch (Part 2)


    (1) Insert the test data:

    POST /forum/article/_bulk
    { "index": { "_id": 1 }}
    { "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
    { "index": { "_id": 2 }}
    { "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
    { "index": { "_id": 3 }}
    { "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
    { "index": { "_id": 4 }}
    { "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
    

      (2) Search posts by user ID

    GET /forum/article/_search
    {
        "query" : {
            "constant_score" : { 
                "filter" : {
                    "term" : { 
                        "userID" : 1
                    }
                }
            }
        }
    }
    

     term filter/query: the search text is not analyzed; it is looked up in the inverted index exactly as you typed it.

    (3) Search for posts that are not hidden

    GET /forum/article/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "term": {
              "hidden": "false"
            }
          }
        }
      }
    }
    

      (4) Search posts by post date

    GET /forum/article/_search
    {
        "query" : {
            "constant_score" : { 
                "filter" : {
                    "term" : { 
                        "postDate" : "2017-01-01"
                    }
                }
            }
        }
    }
    

    (5) Search posts by article ID

    GET /forum/article/_search
    {
        "query" : {
            "constant_score" : { 
                "filter" : {
                    "term" : { 
                        "articleID" : "XHDK-A-1293-#fJ3"
                    }
                }
            }
        }
    }
    

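      Result: no hits. articleID was dynamically mapped as an analyzed text field, so the inverted index holds only its tokens, and the full, unanalyzed string matches none of them: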

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
      }
    }
    

      Correct form, querying the unanalyzed articleID.keyword sub-field:

    GET /forum/article/_search
    {
        "query" : {
            "constant_score" : { 
                "filter" : {
                    "term" : { 
                        "articleID.keyword" : "XHDK-A-1293-#fJ3"
                    }
                }
            }
        }
    }
    

      (6) Inspect how the value is analyzed:

    GET /forum/_analyze
    {
      "field": "articleID",
      "text": "XHDK-A-1293-#fJ3"
    }
    

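      articleID.keyword is a sub-field that recent ES versions create automatically for dynamically mapped string fields, and it is not analyzed. Each incoming articleID is therefore indexed twice:

    once as the field itself, which is analyzed and whose tokens go into the inverted index; and once as articleID.keyword, which is not analyzed
    and goes into the inverted index as a single string. The default sub-field sets ignore_above to 256, so values longer than 256 characters are not indexed there at all.

    So when a term filter targets a text field, consider matching against the built-in field.keyword sub-field.
    The catch is that 256-character default, so it is usually better to define the mapping yourself and declare the field as not analyzed.
    In old versions that meant index: not_analyzed; in recent versions, setting type to keyword is enough.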
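
To make the difference between the analyzed field and its keyword counterpart concrete, here is a small Python sketch of the two inverted indexes. It is only an approximation: `approx_standard_analyze` is a hypothetical stand-in that splits on non-alphanumeric characters and lowercases, which is roughly what happens to these IDs, but it is not the real Lucene tokenizer.

```python
import re


def approx_standard_analyze(text):
    """Rough stand-in for the standard analyzer (an assumption, not Lucene):
    split on anything that is not a letter or digit, then lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(values_by_doc, analyze):
    """Map each term produced by `analyze` to the set of doc IDs containing it."""
    index = {}
    for doc_id, value in values_by_doc.items():
        for term in analyze(value):
            index.setdefault(term, set()).add(doc_id)
    return index


def term_lookup(index, term):
    """A term lookup does not analyze its input: the term is used as-is."""
    return sorted(index.get(term, set()))


article_ids = {1: "XHDK-A-1293-#fJ3", 2: "KDKE-B-9947-#kL5"}

# Analyzed (text-like) field: only the individual tokens are indexed.
text_index = build_inverted_index(article_ids, approx_standard_analyze)
# Keyword-like field: the whole string is indexed as a single term.
keyword_index = build_inverted_index(article_ids, lambda value: [value])

print(approx_standard_analyze("XHDK-A-1293-#fJ3"))      # ['xhdk', 'a', '1293', 'fj3']
print(term_lookup(text_index, "XHDK-A-1293-#fJ3"))      # []
print(term_lookup(keyword_index, "XHDK-A-1293-#fJ3"))   # [1]
```

The empty result for the analyzed index mirrors the zero-hit response in step (5): the full ID never made it into that index as a single term.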

    (7) Rebuild the index

    DELETE /forum
    
    PUT /forum
    {
      "mappings": {
        "article": {
          "properties": {
            "articleID": {
              "type": "keyword"
            }
          }
        }
      }
    }
    
    POST /forum/article/_bulk
    { "index": { "_id": 1 }}
    { "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
    { "index": { "_id": 2 }}
    { "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
    { "index": { "_id": 3 }}
    { "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
    { "index": { "_id": 4 }}
    { "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
    

      (8) Search by article ID again; the plain articleID field now matches because it is mapped as keyword

    GET /forum/article/_search
    {
        "query" : {
            "constant_score" : { 
                "filter" : {
                    "term" : { 
                        "articleID" : "XHDK-A-1293-#fJ3"
                    }
                }
            }
        }
    }
    

      

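    (1) term filter: searches by exact value; numbers, booleans and dates support it natively
    (2) a string field must be indexed unanalyzed (not_analyzed, or type keyword in recent versions) before a term query can match its full value
    (3) it is equivalent to a single WHERE condition in SQL

    The advantage of a filter over a query is that its result can be cached. What gets cached
    is not the complete doc list the filter returns, but a filter bitset: one bit per document, set when the document matches.

    In most cases filters run before queries, so that as much data as possible is filtered out first.

    query: computes each doc's relevance score for the search condition and sorts by that score
    filter: simply selects the wanted data; it computes no relevance score and does no sorting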
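
The bitset idea can be pictured with plain Python integers, one bit per document. This is a minimal sketch under stated assumptions: `term_bitset` and `bitset_cache` are illustrative names, and Elasticsearch's actual cache and data structures are more sophisticated than a dictionary of integers.

```python
# The four sample posts; list position 0 holds the document with _id 1, and so on.
docs = [
    {"userID": 1, "hidden": False, "postDate": "2017-01-01"},
    {"userID": 1, "hidden": False, "postDate": "2017-01-02"},
    {"userID": 2, "hidden": False, "postDate": "2017-01-01"},
    {"userID": 2, "hidden": True, "postDate": "2017-01-02"},
]

# Hypothetical cache: (field, value) -> bitset stored as an int.
bitset_cache = {}


def term_bitset(field, value):
    """Return the bitset for a term filter, computing it only on a cache miss."""
    key = (field, value)
    if key not in bitset_cache:
        bits = 0
        for position, doc in enumerate(docs):
            if doc.get(field) == value:
                bits |= 1 << position  # set the bit for a matching document
        bitset_cache[key] = bits
    return bitset_cache[key]


def matching_ids(bits):
    """Translate a bitset back into document IDs (position + 1)."""
    return [position + 1 for position in range(len(docs)) if (bits >> position) & 1]


by_user = term_bitset("userID", 1)               # bits set for docs 1 and 2
by_date = term_bitset("postDate", "2017-01-01")  # bits set for docs 1 and 3

# Combining two cached filters is a single bitwise AND.
print(matching_ids(by_user & by_date))  # [1]
```

Because only the bitset is kept, a repeated filter costs a dictionary lookup and a bitwise operation instead of another pass over the documents.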

     Combining multiple filter conditions with bool

    1. Search for posts whose post date is 2017-01-01 or whose article ID is XHDK-A-1293-#fJ3, and whose post date is definitely not 2017-01-02

    select *
    from forum.article
    where (post_date='2017-01-01' or article_id='XHDK-A-1293-#fJ3')
    and post_date!='2017-01-02'
    

      

    GET /forum/article/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "bool": {
              "should": [
                {
                  "term": {
                    "postDate": "2017-01-01"
                  }
                },
                {
                  "term": {
                    "articleID": "XHDK-A-1293-#fJ3"
                  }
                }
              ],
              "must_not": {
                "term": {
                  "postDate": "2017-01-02"
                }
              }
            }
          }
        }
      }
    }
    

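      must: every clause has to match; should: matching any one clause is enough; must_not: no clause may match; filter: has to match, like must, but does not affect the score

    2. Search for posts whose article ID is XHDK-A-1293-#fJ3, or whose article ID is JODL-X-1937-#pV7 and whose post date is 2017-01-01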

    select *
    from forum.article
    where article_id='XHDK-A-1293-#fJ3'
    or (article_id='JODL-X-1937-#pV7' and post_date='2017-01-01')
    

      

    GET /forum/article/_search 
    {
      "query": {
        "constant_score": {
          "filter": {
            "bool": {
              "should": [
                {
                  "term": {
                    "articleID": "XHDK-A-1293-#fJ3"
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "term":{
                          "articleID": "JODL-X-1937-#pV7"
                        }
                      },
                      {
                        "term": {
                          "postDate": "2017-01-01"
                        }
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      }
    }
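
The two bool filters above can be mimicked in a few lines of Python to check which posts they select. This is a simplified sketch, not Elasticsearch's implementation: `term`, `bool_filter` and `search` are hypothetical helpers, and the only rule modelled beyond plain AND/OR/NOT is that a bool without must needs at least one should clause to match.

```python
docs = {
    1: {"articleID": "XHDK-A-1293-#fJ3", "postDate": "2017-01-01"},
    2: {"articleID": "KDKE-B-9947-#kL5", "postDate": "2017-01-02"},
    3: {"articleID": "JODL-X-1937-#pV7", "postDate": "2017-01-01"},
    4: {"articleID": "QQPX-R-3956-#aD8", "postDate": "2017-01-02"},
}


def term(field, value):
    """Exact-value predicate, standing in for a term filter."""
    return lambda doc: doc.get(field) == value


def bool_filter(must=(), should=(), must_not=()):
    """Combine predicates the way the bool clauses are described above."""
    def matches(doc):
        if not all(clause(doc) for clause in must):
            return False
        if any(clause(doc) for clause in must_not):
            return False
        if should and not must:
            # Without must, at least one should clause has to match.
            return any(clause(doc) for clause in should)
        return True
    return matches


def search(predicate):
    return [doc_id for doc_id, doc in docs.items() if predicate(doc)]


# Query 1: (postDate = 2017-01-01 OR articleID = XHDK...) AND postDate != 2017-01-02
query_1 = bool_filter(
    should=[term("postDate", "2017-01-01"), term("articleID", "XHDK-A-1293-#fJ3")],
    must_not=[term("postDate", "2017-01-02")],
)

# Query 2: articleID = XHDK... OR (articleID = JODL... AND postDate = 2017-01-01)
query_2 = bool_filter(
    should=[
        term("articleID", "XHDK-A-1293-#fJ3"),
        bool_filter(must=[term("articleID", "JODL-X-1937-#pV7"),
                          term("postDate", "2017-01-01")]),
    ]
)

print(search(query_1))  # [1, 3]
print(search(query_2))  # [1, 3]
```

Nesting one `bool_filter` inside another corresponds to nesting a bool inside a should array, which is how the parenthesised SQL condition is expressed.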
    

      Searching for multiple values with terms, and searching multi-valued fields

     1. Add a tag field to the posts

    POST /forum/article/_bulk
    { "update": { "_id": "1"} }
    { "doc" : {"tag" : ["java", "hadoop"]} }
    { "update": { "_id": "2"} }
    { "doc" : {"tag" : ["java"]} }
    { "update": { "_id": "3"} }
    { "doc" : {"tag" : ["hadoop"]} }
    { "update": { "_id": "4"} }
    { "doc" : {"tag" : ["java", "elasticsearch"]} }
    

      2. Search for posts whose articleID is KDKE-B-9947-#kL5 or QQPX-R-3956-#aD8, then for posts whose tag contains java

    GET /forum/article/_search 
    {
      "query": {
        "constant_score": {
          "filter": {
            "terms": {
              "articleID": [
                "KDKE-B-9947-#kL5",
                "QQPX-R-3956-#aD8"
              ]
            }
          }
        }
      }
    }
    

      

    GET /forum/article/_search
    {
        "query" : {
            "constant_score" : {
                "filter" : {
                    "terms" : { 
                        "tag" : ["java"]
                    }
                }
            }
        }
    }
    

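      Refining terms results on multi-valued fields

    Add a tag_cnt field holding the number of tags, so that posts tagged with java and nothing else can be found: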

    POST /forum/article/_bulk
    { "update": { "_id": "1"} }
    { "doc" : {"tag_cnt" : 2} }
    { "update": { "_id": "2"} }
    { "doc" : {"tag_cnt" : 1} }
    { "update": { "_id": "3"} }
    { "doc" : {"tag_cnt" : 1} }
    { "update": { "_id": "4"} }
    { "doc" : {"tag_cnt" : 2} }
    

      

    GET /forum/article/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "bool": {
              "must": [
                {
                  "term": {
                    "tag_cnt": 1
                  }
                },
                {
                  "terms": {
                    "tag": ["java"]
                  }
                }
              ]
            }
          }
        }
      }
    }
    

      Range filtering with the range filter

    1. Add a view-count field to the posts

    POST /forum/article/_bulk
    { "update": { "_id": "1"} }
    { "doc" : {"view_cnt" : 30} }
    { "update": { "_id": "2"} }
    { "doc" : {"view_cnt" : 50} }
    { "update": { "_id": "3"} }
    { "doc" : {"view_cnt" : 100} }
    { "update": { "_id": "4"} }
    { "doc" : {"view_cnt" : 80} }
    

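      2. Search for posts with a view count between 30 and 60 (gt and lt exclude the bounds, so a post with exactly 30 views is not returned; use gte and lte to include them)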

    GET /forum/article/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "range": {
              "view_cnt": {
                "gt": 30,
                "lt": 60
              }
            }
          }
        }
      }
    }
    

      3. Search for posts published within the last month

    POST /forum/article/_bulk
    {"index":{"_id":5}}
    {"articleID":"DHJK-B-1395-#Ky5","userID":3,"hidden":false,"postDate":"2017-03-01","tag":["elasticsearch"],"tag_cnt":1,"view_cnt":10}
    
    
    GET /forum/article/_search 
    {
      "query": {
        "constant_score": {
          "filter": {
            "range": {
              "postDate": {
                "gt": "2017-03-10||-30d"
              }
            }
          }
        }
      }
    }
    
    GET /forum/article/_search 
    {
      "query": {
        "constant_score": {
          "filter": {
            "range": {
              "postDate": {
                "gt": "now-30d"
              }
            }
          }
        }
      }
    }
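
The two date-math expressions used here, an anchor date followed by `||-30d` and `now-30d`, can be resolved by hand to see which posts fall in range. The Python sketch below is a deliberately narrow illustration: `resolve_date_math` is a hypothetical helper that understands only day offsets and whole dates, whereas real date math supports more units, rounding, and time-of-day precision.

```python
from datetime import date, timedelta


def resolve_date_math(expr, today):
    """Resolve 'now-<N>d' or '<yyyy-MM-dd>||-<N>d' to a date (day unit only)."""
    if expr.startswith("now"):
        anchor, offset = today, expr[len("now"):]
    else:
        anchor_text, offset = expr.split("||")
        anchor = date.fromisoformat(anchor_text)
    if not offset:
        return anchor
    sign = -1 if offset[0] == "-" else 1
    days = int(offset[1:-1])  # strip the sign and the trailing 'd'
    return anchor + timedelta(days=sign * days)


post_dates = {
    1: "2017-01-01",
    2: "2017-01-02",
    3: "2017-01-01",
    4: "2017-01-02",
    5: "2017-03-01",
}

lower_bound = resolve_date_math("2017-03-10||-30d", today=date(2017, 3, 10))
print(lower_bound)  # 2017-02-08

# gt is exclusive: keep posts strictly after the lower bound.
recent = [doc_id for doc_id, d in post_dates.items()
          if date.fromisoformat(d) > lower_bound]
print(recent)  # [5]
```

With a fixed anchor the result is reproducible, which is why the first query keeps matching post 5; `now-30d` moves with the clock, so the same sample data stops matching once it is more than 30 days old.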

      Manually controlling the precision of full-text search results

    1. Add a title field to the posts

    POST /forum/article/_bulk
    { "update": { "_id": "1"} }
    { "doc" : {"title" : "this is java and elasticsearch blog"} }
    { "update": { "_id": "2"} }
    { "doc" : {"title" : "this is java blog"} }
    { "update": { "_id": "3"} }
    { "doc" : {"title" : "this is elasticsearch blog"} }
    { "update": { "_id": "4"} }
    { "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
    { "update": { "_id": "5"} }
    { "doc" : {"title" : "this is spark blog"} }
    

      2. Search for blogs whose title contains java or elasticsearch

    GET /forum/article/_search
    {
        "query": {
            "match": {
                "title": "java elasticsearch"
            }
        }
    }
    

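      3. Search for blogs whose title contains both java and elasticsearch

    First step in controlling result precision: use the and operator. When every search keyword has to match, set operator to and;
    the default match query, which ORs the keywords together, cannot express this.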

    GET /forum/article/_search
    {
      "query": {
        "match": {
          "title": {
            "query": "java elasticsearch",
            "operator": "and"
          }
        }
      }
    }
    

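      4. Search for blogs that contain at least 3 of the 4 keywords java, elasticsearch, spark, hadoop

    Second step in controlling result precision: specify how many of the given keywords must match at minimum before a document is returned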

    GET /forum/article/_search
    {
      "query": {
        "match": {
          "title": {
            "query": "java elasticsearch spark hadoop",
            "minimum_should_match": "75%"
          }
        }
      }
    }
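
The effect of operator and minimum_should_match on the five sample titles can be checked with a toy matcher. This is a sketch under assumptions: `tokens`, `match` and `percent_to_count` are hypothetical helpers, the tokenizer is a simple lowercase word split rather than the real analyzer, and the percentage is rounded down to a whole number of terms.

```python
import re

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}


def tokens(text):
    """Simplified analysis: lowercase words only."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match(query, minimum_should_match=1):
    """Return IDs of titles containing at least `minimum_should_match` query terms."""
    terms = query.split()
    return [doc_id for doc_id, title in titles.items()
            if len(tokens(title).intersection(terms)) >= minimum_should_match]


def percent_to_count(percent, term_count):
    """Turn a percentage into a required term count, rounding down."""
    return term_count * percent // 100


# Default behaviour: any one keyword is enough.
print(match("java elasticsearch"))                          # [1, 2, 3, 4]
# operator "and": every keyword is required.
print(match("java elasticsearch", minimum_should_match=2))  # [1, 4]
# "75%" of 4 keywords means at least 3 must match.
required = percent_to_count(75, 4)
print(match("java elasticsearch spark hadoop", required))   # [4]
```

Seen this way, the and operator is just the special case where the required count equals the number of keywords.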
    

      5. Combine multiple search conditions with bool to search title

    GET /forum/article/_search
    {
      "query": {
        "bool": {
          "must": {
            "match": {
              "title": "java"
            }
          },
          "must_not": {
            "match": {
              "title": "spark"
            }
          },
          "should": [
            {
              "match": {
                "title": "hadoop"
              }
            },
            {
              "match": {
                "title": "elasticsearch"
              }
            }
          ]
        }
      }
    }
    

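      6. How the relevance score is computed when bool combines multiple conditions

    The scores of the matching must and should clauses are added together; older versions using the classic TF/IDF similarity additionally multiplied that sum by the number of matching clauses and divided it by the total number of must and should clauses.

    Rank 1: contains java and every should keyword, hadoop and elasticsearch
    Rank 2: contains java and the should keyword elasticsearch
    Rank 3: contains java but no should keyword
    
    should clauses can affect the relevance score
    
    must ensures that a document contains the keyword, and the document's relevance score for that must condition is computed as well.
    Once must is satisfied, the should conditions need not match, but the more of them match, the higher the document's relevance score.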
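
The ranking described above can be reproduced with a toy scorer. The sketch below makes a strong simplifying assumption that is not how Elasticsearch scores documents: every matching clause is worth exactly 1.0 (optionally multiplied by a boost), whereas real scores depend on term statistics and field length. `bool_score` and `tokens` are hypothetical helpers.

```python
import re

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}


def tokens(text):
    """Simplified analysis: lowercase words only."""
    return set(re.findall(r"[a-z]+", text.lower()))


def bool_score(title, must=(), should=(), must_not=(), boosts=None):
    """Return None if the title is excluded, else the sum of matched clause weights.

    Assumption: each matching clause scores 1.0 times its boost (default 1.0).
    """
    words = tokens(title)
    boosts = boosts or {}
    if any(term in words for term in must_not):
        return None
    if not all(term in words for term in must):
        return None
    matched = [term for term in list(must) + list(should) if term in words]
    return sum(boosts.get(term, 1.0) for term in matched)


def rank(**clauses):
    """Score every title and sort the surviving ones by descending score."""
    scored = []
    for doc_id, title in titles.items():
        score = bool_score(title, **clauses)
        if score is not None:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: -pair[1])


print(rank(must=["java"], must_not=["spark"], should=["hadoop", "elasticsearch"]))
# [(4, 3.0), (1, 2.0), (2, 1.0)]
```

Passing a `boosts` dictionary such as `{"spark": 5}` shows, in the same toy model, how a boosted should clause can lift a document above others that match more clauses.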
    

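      7. Search for java, hadoop, spark, elasticsearch, requiring at least 3 of the keywords

    By default a should clause need not match at all; in the search above, "this is java blog" matches no should condition.
    There is one exception: when there is no must, at least one should clause has to match.
    In the search below, should has 4 conditions, and by default satisfying any one of them is enough to be returned.

    minimum_should_match gives precise control over how many of the 4 should conditions must match at minimum.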

    GET /forum/article/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "title": "java"
              }
            },
            {
              "match": {
                "title": "elasticsearch"
              }
            },
            {
              "match": {
                "title": "hadoop"
              }
            },
            {
              "match": {
                "title": "spark"
              }
            }
          ],
          "minimum_should_match": 3
        }
      }
    }
    

      

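    1. In full-text search there are two ways to search for multiple values: a match query, or should clauses
    2. To control result precision: the and operator, and minimum_should_match

    Fine-grained weighting of search conditions with boost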

    GET /forum/article/_search 
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "title": "blog"
              }
            }
          ],
          "should": [
            {
              "match": {
                "title": {
                  "query": "java"
                }
              }
            },
            {
              "match": {
                "title": {
                  "query": "hadoop"
                }
              }
            },
            {
              "match": {
                "title": {
                  "query": "elasticsearch"
                }
              }
            },
            {
              "match": {
                "title": {
                  "query": "spark",
                  "boost": 5
                }
              }
            }
          ]
        }
      }
    }
    

      Multi-field search with the best fields strategy, using dis_max

    1. Add a content field to the posts

    POST /forum/article/_bulk
    { "update": { "_id": "1"} }
    { "doc" : {"content" : "i like to write best elasticsearch article"} }
    { "update": { "_id": "2"} }
    { "doc" : {"content" : "i think java is the best programming language"} }
    { "update": { "_id": "3"} }
    { "doc" : {"content" : "i am only an elasticsearch beginner"} }
    { "update": { "_id": "4"} }
    { "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
    { "update": { "_id": "5"} }
    { "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }
    

      2. Search for posts whose title or content contains java or solution

    This is a multi-field search

    GET /forum/article/_search
    {
        "query": {
            "bool": {
                "should": [
                    { "match": { "title": "java solution" }},
                    { "match": { "content":  "java solution" }}
                ]
            }
        }
    }
    

      

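    The best fields strategy ranks first the documents in which a single field matches as many keywords as possible,
    rather than the documents in which as many fields as possible each match a few keywords.

    dis_max simply takes the score of the highest-scoring query among its queries.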

    GET /forum/article/_search
    {
        "query": {
            "dis_max": {
                "queries": [
                    { "match": { "title": "java solution" }},
                    { "match": { "content":  "java solution" }}
                ]
            }
        }
    }
    

      Improving dis_max results with the tie_breaker parameter

    1. Search for posts whose title or content contains java beginner

    GET /forum/article/_search
    {
        "query": {
            "dis_max": {
                "queries": [
                    { "match": { "title": "java beginner" }},
                    { "match": { "content":  "java beginner" }}
                ]
            }
        }
    }
    

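      2. dis_max only takes the score of the highest-scoring query and ignores the scores of the other queries entirely

    3. Use tie_breaker to take the other queries' scores into account as well: each of them is multiplied by tie_breaker and added to the best score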

    GET /forum/article/_search
    {
        "query": {
            "dis_max": {
                "queries": [
                    { "match": { "title": "java beginner" }},
                    { "match": { "content":  "java beginner" }}
                ],
                "tie_breaker": 0.3
            }
        }
    }
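
How tie_breaker changes the combined score is easy to see with numbers. The Python sketch below uses made-up per-field scores (they are assumptions for illustration, not values Elasticsearch would return for these posts); `dis_max_score` and `bool_should_score` are hypothetical helpers implementing "best score plus tie_breaker times the other matching scores" and a plain sum.

```python
def dis_max_score(field_scores, tie_breaker=0.0):
    """Best matching score, plus tie_breaker times the remaining matching scores."""
    matching = [score for score in field_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


def bool_should_score(field_scores):
    """For comparison: a bool/should combination simply adds the scores up."""
    return sum(score for score in field_scores if score > 0)


# Hypothetical (title, content) scores for two posts.
post_a = (2.0, 1.0)   # both fields match
post_b = (2.0, 0.0)   # only the title matches

# Plain dis_max cannot tell the two posts apart.
print(dis_max_score(post_a) == dis_max_score(post_b))        # True
# With tie_breaker, the post matching in both fields ranks higher.
print(dis_max_score(post_a, 0.3) > dis_max_score(post_b, 0.3))  # True
# bool/should gives the second field full weight instead of a fraction.
print(bool_should_score(post_a))  # 3.0
```

A tie_breaker of 0 reduces to plain dis_max and a tie_breaker of 1 behaves like the sum, so values in between trade off "best single field" against "more fields matched".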
    

      dis_max + tie_breaker via the multi_match syntax

    GET /forum/article/_search
    {
      "query": {
        "multi_match": {
            "query":                "java solution",
            "type":                 "best_fields", 
            "fields":               [ "title^2", "content" ],
            "tie_breaker":          0.3,
            "minimum_should_match": "50%" 
        }
      } 
    }
    

      

    GET /forum/article/_search
    {
      "query": {
        "dis_max": {
          "queries":  [
            {
              "match": {
                "title": {
                  "query": "java beginner",
                  "minimum_should_match": "50%",
                  "boost": 2
                }
              }
            },
            {
              "match": {
                "content": {
                  "query": "java beginner",
                  "minimum_should_match": "30%"
                }
              }
            }
          ],
          "tie_breaker": 0.3
        }
      } 
    }
    

      

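    What minimum_should_match does
    It cuts off the long tail.
    Long tail: say you search for 5 keywords but many results match only 1 of them; they are far from what you want, and those results are the long tail.
    minimum_should_match controls result precision: only documents that match a given number of keywords are returned.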

  • Original article: https://www.cnblogs.com/sunliyuan/p/14490918.html