• How Elasticsearch (ES) works


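    1. When a request reaches the ES cluster, one node is chosen as the coordinating node, which routes the request to the appropriate shard. Write requests go to the primary shard; read requests can be served by either the primary or a replica. With round-robin distribution and one replica, for example, 4 read requests send 2 to the primary shard and 2 to the replica shard.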
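    The round-robin case can be sketched in a few lines of Python. This is a toy model rather than Elasticsearch code: the helper `dispatch` and the copy names `primary` and `replica` are hypothetical, and the setup assumes one primary shard with a single replica.

```python
from collections import Counter
from itertools import cycle


def dispatch(num_requests, copies):
    """Assign each request to a shard copy in round-robin order."""
    picker = cycle(copies)  # primary, replica, primary, replica, ...
    return [next(picker) for _ in range(num_requests)]


# Hypothetical setup: one primary shard with a single replica.
assignments = dispatch(4, ["primary", "replica"])
counts = Counter(assignments)
print(assignments)
print(dict(counts))
```

    With more replicas the same loop simply spreads the requests over more copies.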

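    2. When an index is split across several shards and the data is badly skewed, search scores can become inaccurate, because IDF is computed locally on each shard. Suppose shard1 holds many documents containing a term: the term's IDF there is low, so scores on shard1 drop across the board. If the same term appears in only a few documents on shard2, its IDF there is high and scores on shard2 are inflated. Less relevant documents on shard2 can then outscore more relevant ones on shard1, so the ranking is wrong. The fix is to keep data as evenly distributed across shards as possible in production, so that shard-local scoring does not distort the overall result.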
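    A small numeric sketch may make the effect concrete. It assumes a BM25-style IDF formula and uses made-up document counts for two skewed shards; the function `idf` and the counts are illustrative assumptions, not values read from a real cluster.

```python
import math


def idf(doc_count, doc_freq):
    """BM25-style IDF: terms found in fewer documents get a higher weight."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical skewed shards: (documents on the shard, documents containing the term)
shard1 = (1000, 500)
shard2 = (100, 2)

idf_shard1 = idf(*shard1)
idf_shard2 = idf(*shard2)
# What the IDF would be if statistics were gathered across both shards.
idf_global = idf(shard1[0] + shard2[0], shard1[1] + shard2[1])

print(f"shard1 local IDF: {idf_shard1:.3f}")
print(f"shard2 local IDF: {idf_shard2:.3f}")
print(f"global IDF:       {idf_global:.3f}")
```

    The same term is weighted far more heavily on shard2 than on shard1, which is the distortion described above. Besides balancing the data, Elasticsearch also offers the `dfs_query_then_fetch` search type, which collects term statistics across shards before scoring, at some extra cost per query.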

    3. ES has two basic strategies for scoring a query across several fields:

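    1) most_fields -> the default behavior of a bool query with should clauses: the more fields of a document that match the full-text keywords, the higher its score. Rule: (sum of the scores of the matching match clauses) * (number of match clauses the document matched) / (number of match clauses in the query). For example: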

    GET /index3/type3/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "title": "spark" // matches in title
              }
            },
            {
              "match": {
                "content": "java" // also matches in content
              }
            }
          ]
        }
      }
    }
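    The rule stated above can be worked through with a short sketch. The function name `should_score` and the per-clause scores are hypothetical. The formula mirrors the coordination factor of older Lucene-based scoring, so treat this as an illustration of the stated rule rather than a guarantee of what a given ES version computes.

```python
def should_score(clause_scores):
    """Apply the rule from the text.

    clause_scores holds one entry per match clause in the query:
    the clause's score, or None if the document did not match it.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    return sum(matched) * len(matched) / len(clause_scores)


# Hypothetical per-clause scores for the title/content query.
both_fields = should_score([0.6, 0.4])   # title and content both match
title_only = should_score([0.6, None])   # only title matches
print(both_fields, title_only)
```

    A document that matches both clauses keeps the full sum, while one that matches a single clause is scaled down by the fraction of clauses it matched.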

    2) best_fields -> with dis_max, a document's score is determined by the single highest-scoring match clause; the scores of the other fields do not contribute.

    GET /index3/type3/_search
    {
      "query": {
        "dis_max": {
          "queries": [
            {
              "match": {
                "title": "spark"
              }
            },
            {
              "match": {
                "content": "java"
              }
            }
          ]
        }
      }
    }
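    As a rough sketch of how dis_max combines clause scores: the best clause wins, and the optional `tie_breaker` parameter of dis_max (0.0 by default) lets the other matching clauses contribute a fraction of their scores. The function name and the clause scores below are hypothetical.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching scores."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others


# Hypothetical clause scores: title matched with 0.6, content with 0.4.
scores = [0.6, 0.4]
plain = dis_max_score(scores)                       # only the best field counts
with_tie = dis_max_score(scores, tie_breaker=0.3)   # other fields contribute a little
print(plain, with_tie)
```

    With the default tie_breaker the result is exactly the best field's score, which matches the description above.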

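    3) Besides most_fields and best_fields, cross_fields is also used quite often. The built-in cross_fields strategy effectively treats "fields": ["name","content"] as if the contents of the two fields had been combined into one indexed field, so the full-text search runs against a single combined field and gives more accurate results.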

    Search request:
    GET /index2/type2/_search
    {
      "query": {
        "multi_match": {
          "query": "happening like",
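          // the query terms are matched against both content and name; if the two fields have different mappings, their scores differ and the ranking may vary
          "fields": ["name","content"],
          // best_fields: each document's score is that of its highest-scoring matched field, so once the best match is picked the scores of other documents are not necessarily accurate; most_fields: combines the scores of every field into an overall document score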
          "type":"cross_fields",
          "operator":"and"
        }
      }
    }
    Result:
    {
      "took": 36,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 0.84968257,
        "hits": [
          {
            "_index": "index2",
            "_type": "type2",
            "_id": "2",
            "_score": 0.84968257,
            "_source": {
              "num": 10,
              "title": "他的名字",
              "name": "yes happening like write",
              "content": "happening like"
            }
          },
          {
            "_index": "index2",
            "_type": "type2",
            "_id": "4",
            "_score": 0.8164005,
            "_source": {
              "num": 1000,
              "title": "我的名字",
              "name": "happening like write",
              "content": "happening hello like yeas and he happening like had read a lot about happening hello like"
            }
          },
          {
            "_index": "index2",
            "_type": "type2",
            "_id": "3",
            "_score": 0.5063205,
            "_source": {
              "num": 105,
              "title": "这是谁的名字",
              "name": "happening like write",
              "content": " national  treasure because  of its rare number and cute appearance. Many foreign people are so crazy about  pandas and they can’t watching these  lovely creatures all the time. Though some action"
            }
          }
        ]
      }
    }
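    The difference between cross_fields and the per-field strategies under "operator": "and" can be sketched as follows. This is a simplified model: whitespace splitting stands in for a real analyzer, both fields are assumed to share that analyzer, scoring is ignored, and the function names and sample document are hypothetical.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return set(text.lower().split())


def cross_fields_and(doc, query, fields):
    """Term-centric: every query term must occur in at least one of the fields."""
    combined = set().union(*(tokens(doc.get(f, "")) for f in fields))
    return tokens(query) <= combined


def per_field_and(doc, query, fields):
    """Field-centric: some single field must contain every query term."""
    return any(tokens(query) <= tokens(doc.get(f, "")) for f in fields)


# Hypothetical document whose query terms are split across the two fields.
doc = {"name": "happening", "content": "like"}
fields = ["name", "content"]
cross_match = cross_fields_and(doc, "happening like", fields)
field_match = per_field_and(doc, "happening like", fields)
print(cross_match, field_match)
```

    The document matches when the fields are treated as one combined field, but not when each field has to satisfy the whole query on its own.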

    4. Two ways to improve full-text search results

    1) Use boost to raise the score of a clause

    GET index3/type3/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "content": {
                  "query": "from",
                  "boost": 5 // boost multiplies the score of this match clause by 5
                }
              }
            },{
              "match": {
                "content": {
                  "query": "foot" // without the boost above, a search for foot would score higher
                }
              }
            }
          ]
        }
      }
    }
    Result:
    {
      "took": 3,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 1.3150566,
        "hits": [
          {
            "_index": "index3",
            "_type": "type3",
            "_id": "1",
            "_score": 1.3150566,
            "_source": {
              "date": "2019-01-02",
              "name": "the little",
              "content": "Half the hello book ideas in his talk were plagiarized from an article I wrote last month.",
              "no": "123"
            }
          },
          {
            "_index": "index3",
            "_type": "type3",
            "_id": "5",
            "_score": 1.3114156,
            "_source": {
              "date": "2019-05-01",
              "name": "http litty",
              "content": "There are hello moments in life when you miss book someone so much that you just want to pick them from your dreams",
              "no": "564",
              "description": "描述"
            }
          },
          {
            "_index": "index3",
            "_type": "type3",
            "_id": "3",
            "_score": 0.28582606,
            "_source": {
              "date": "2019-07-01",
              "name": "very tag",
              "content": "Some of our hello  comrades love book to write long articles with no substance, very much like the foot bindings of a slattern, long as well as smelly",
              "no": "123"
            }
          }
        ]
      }
    }
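    The effect of the boost can be checked against the scores reported in this article. The sketch assumes boost acts as a plain multiplier on the clause score; the helper `boosted` is hypothetical. The unboosted score for "from" on document 5 is taken from the boosting query result later in this article, where that document matches only the positive clause.

```python
def boosted(score, boost=1.0):
    """Scale a clause score by its boost (assumed to be a plain multiplier)."""
    return score * boost


# Scores taken from the results in this article.
from_doc5 = 0.26228312   # doc 5, plain match on "from"
foot_doc3 = 0.28582606   # doc 3, match on "foot", no boost

without_boost = boosted(from_doc5)
with_boost = boosted(from_doc5, boost=5)

print(without_boost < foot_doc3)   # is "foot" ranked first without the boost?
print(with_boost > foot_doc3)      # is "from" ranked first with boost 5?
print(round(with_boost, 7))
```

    Multiplying by 5 reproduces the score reported for document 5 in the result above, and it is enough to move the "from" matches ahead of the "foot" match.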

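    2) Use the positive and negative clauses of a boosting query to demote unwanted matches: documents that also match the negative query have their score multiplied by negative_boost (0.1 in the request below), which lowers it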

    GET index3/type3/_search
    {
      "query": {
        "boosting": {
          // matched normally
          "positive": {
            "match": {
              "content": "from"
            }
          },
          // matches here are demoted: the score of a document that also matches this query is multiplied by negative_boost
          "negative": {
            "match": {
                "content": {
                  "query": "Half"
                }
              }
          },
          "negative_boost": 0.1
        }
      }
    }
    Result:
    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 2,
        "max_score": 0.26228312,
        "hits": [
          {
            "_index": "index3",
            "_type": "type3",
            "_id": "5",
            "_score": 0.26228312,
            "_source": {
              "date": "2019-05-01",
              "name": "http litty",
              "content": "There are hello moments in life when you miss book someone so much that you just want to pick them from your dreams",
              "no": "564",
              "description": "描述"
            }
          },
          {
            "_index": "index3",
            "_type": "type3",
            "_id": "1",
            "_score": 0.026301134,
            "_source": {
              "date": "2019-01-02",
              "name": "the little",
              "content": "Half the hello book ideas in his talk were plagiarized from an article I wrote last month.",
              "no": "123"
            }
          }
        ]
      }
    }
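    The demotion can be sketched as a simple function. The helper `boosting_score` is hypothetical, and the positive score used for document 1 is inferred from the result above (0.026301134 / 0.1) rather than reported directly.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Keep the positive score, scaled down when the negative query also matches."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


# doc 1 contains "Half", so its positive score is multiplied by negative_boost.
doc1 = boosting_score(0.26301134, matches_negative=True, negative_boost=0.1)
# doc 5 does not contain "Half", so its score is left unchanged.
doc5 = boosting_score(0.26228312, matches_negative=False, negative_boost=0.1)
print(doc1, doc5)
```

    Document 1 still appears in the result, but it drops below document 5, which is the point of a boosting query: demote without excluding.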
  • Original article: https://www.cnblogs.com/zzq-include/p/11155410.html