• ES Series, Part 13: Elasticsearch Suggester API (Autocomplete)


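    1. Concepts

    1. The suggester API comes in four main types

    1. Term Suggester (spelling correction: suggests the correctly spelled term for a misspelled input)
    2. Phrase Suggester (phrase-level correction: builds on the Term Suggester and returns a corrected version of the whole phrase)
    3. Completion Suggester (autocomplete: type the beginning of a term or phrase and get the full text back)
    4. Context Suggester (context-aware completion)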

    The overall effect is similar to the suggestion list under the Baidu search box.

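    2. Term Suggester (spelling correction)

    2.1. API

    1. Create the index (the mapping type english is Elasticsearch 6.x syntax; 7.x no longer accepts mapping types by default)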

    PUT /book4
    {
      "mappings": {
        "english": {
          "properties": {
            "passage": {
              "type": "text"
            }
          }
        }
      }
    }

    2. Insert data

    curl -H "Content-Type: application/json" -XPOST 'http://localhost:9200/_bulk' -d'
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "Lucene is cool"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "Elasticsearch builds on top of lucene"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "Elasticsearch rocks"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "Elastic is the company behind ELK stack"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    { "passage": "elk rocks"}
    { "index" : { "_index" : "book4", "_type" : "english" } }
    {  "passage": "elasticsearch is rock solid"}
    '
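
    The bulk body is newline-delimited JSON: one action line followed by one source line per document, and the body has to end with a newline. A minimal Python sketch of how such a body could be assembled; the helper name build_bulk_body is hypothetical, and actually sending the body to the cluster is left out:

```python
import json

def build_bulk_body(index, doc_type, passages):
    """Build an NDJSON body for the _bulk endpoint (ES 6.x style, with _type)."""
    lines = []
    for passage in passages:
        # Action line: tells Elasticsearch where to index the next document.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps({"passage": passage}))
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

body = build_bulk_body("book4", "english", ["Lucene is cool", "elk rocks"])
print(body)
```

    Generating the body from a list keeps the action and source lines paired, which is easy to get wrong when editing the curl payload by hand.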

    3. Check which tokens are stored

    POST /_analyze
    {
      "text": [
        "Lucene is cool",
        "Elasticsearch builds on top of lucene",
        "Elasticsearch rocks",
        "Elastic is the company behind ELK stack",
        "elk rocks",
        "elasticsearch is rock solid"
      ]
    }

    Result:

    {
        "tokens": [
            {
                "token": "lucene",
                "start_offset": 0,
                "end_offset": 6,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "is",
                "start_offset": 7,
                "end_offset": 9,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "cool",
                "start_offset": 10,
                "end_offset": 14,
                "type": "<ALPHANUM>",
                "position": 2
            },
            {
                "token": "elasticsearch",
                "start_offset": 15,
                "end_offset": 28,
                "type": "<ALPHANUM>",
                "position": 103
            },
            {
                "token": "builds",
                "start_offset": 29,
                "end_offset": 35,
                "type": "<ALPHANUM>",
                "position": 104
            },
            {
                "token": "on",
                "start_offset": 36,
                "end_offset": 38,
                "type": "<ALPHANUM>",
                "position": 105
            },
            {
                "token": "top",
                "start_offset": 39,
                "end_offset": 42,
                "type": "<ALPHANUM>",
                "position": 106
            },
            {
                "token": "of",
                "start_offset": 43,
                "end_offset": 45,
                "type": "<ALPHANUM>",
                "position": 107
            },
            {
                "token": "lucene",
                "start_offset": 46,
                "end_offset": 52,
                "type": "<ALPHANUM>",
                "position": 108
            },
            {
                "token": "elasticsearch",
                "start_offset": 53,
                "end_offset": 66,
                "type": "<ALPHANUM>",
                "position": 209
            },
            {
                "token": "rocks",
                "start_offset": 67,
                "end_offset": 72,
                "type": "<ALPHANUM>",
                "position": 210
            },
            {
                "token": "elastic",
                "start_offset": 73,
                "end_offset": 80,
                "type": "<ALPHANUM>",
                "position": 311
            },
            {
                "token": "is",
                "start_offset": 81,
                "end_offset": 83,
                "type": "<ALPHANUM>",
                "position": 312
            },
            {
                "token": "the",
                "start_offset": 84,
                "end_offset": 87,
                "type": "<ALPHANUM>",
                "position": 313
            },
            {
                "token": "company",
                "start_offset": 88,
                "end_offset": 95,
                "type": "<ALPHANUM>",
                "position": 314
            },
            {
                "token": "behind",
                "start_offset": 96,
                "end_offset": 102,
                "type": "<ALPHANUM>",
                "position": 315
            },
            {
                "token": "elk",
                "start_offset": 103,
                "end_offset": 106,
                "type": "<ALPHANUM>",
                "position": 316
            },
            {
                "token": "stack",
                "start_offset": 107,
                "end_offset": 112,
                "type": "<ALPHANUM>",
                "position": 317
            },
            {
                "token": "elk",
                "start_offset": 113,
                "end_offset": 116,
                "type": "<ALPHANUM>",
                "position": 418
            },
            {
                "token": "rocks",
                "start_offset": 117,
                "end_offset": 122,
                "type": "<ALPHANUM>",
                "position": 419
            },
            {
                "token": "elasticsearch",
                "start_offset": 123,
                "end_offset": 136,
                "type": "<ALPHANUM>",
                "position": 520
            },
            {
                "token": "is",
                "start_offset": 137,
                "end_offset": 139,
                "type": "<ALPHANUM>",
                "position": 521
            },
            {
                "token": "rock",
                "start_offset": 140,
                "end_offset": 144,
                "type": "<ALPHANUM>",
                "position": 522
            },
            {
                "token": "solid",
                "start_offset": 145,
                "end_offset": 150,
                "type": "<ALPHANUM>",
                "position": 523
            }
        ]
    }
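
    The jumps in position (2 to 103, 108 to 209, and so on) come from the gap that is added between the elements of the text array, while the offsets simply continue after one separator character. The following Python sketch approximates this for plain ASCII text; it is not the real standard analyzer, and the function name analyze and the gap of 100 are assumptions read off the output above:

```python
import re

def analyze(texts, position_gap=100):
    """Approximate the analysis of plain ASCII text:
    lowercase and split on non-alphanumeric characters."""
    tokens = []
    offset_base = 0   # character offset where the current text starts
    position = -1     # position of the last emitted token
    for i, text in enumerate(texts):
        if i > 0:
            # Each additional array element is shifted by the gap.
            position += position_gap
        for match in re.finditer(r"[A-Za-z0-9]+", text):
            position += 1
            tokens.append((match.group().lower(),
                           offset_base + match.start(),
                           offset_base + match.end(),
                           position))
        # Offsets continue after the text plus one separator character.
        offset_base += len(text) + 1
    return tokens

texts = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]
tokens = analyze(texts)
print(tokens[3])   # first token of the second text
```

    Each tuple is (token, start_offset, end_offset, position), so the values can be compared directly with the response above.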

    4. Term suggest API (single field)

    Try a search with the misspelled word Elasticsearaach:

    POST /book4/_search
    {
      "suggest": {
        "my-suggestion": {
          "text": "Elasticsearaach",
          "term": {
            "field": "passage",
            "suggest_mode": "popular"
          }
        }
      }
    }

    Response:

    {
        "took": 26,
        "timed_out": false,
        "_shards": {
            "total": 5,
            "successful": 5,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 0,
            "max_score": 0,
            "hits": []
        },
        "suggest": {
            "my-suggestion": [
                {
                    "text": "elasticsearaach",
                    "offset": 0,
                    "length": 15,
                    "options": [
                        {
                            "text": "elasticsearch",
                            "score": 0.84615386,
                            "freq": 3
                        }
                    ]
                }
            ]
        }
    }
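
    The score in this response can be related to the edit distance between the input and the suggestion. The sketch below computes a Damerau-Levenshtein style distance (optimal string alignment variant) and normalizes it by the length of the shorter string. That normalization is an assumption about the default internal distance, not something the response itself states, and the function names are hypothetical:

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions and
    transpositions of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def similarity(a, b):
    # Assumed normalization: 1 - distance / length of the shorter string.
    return 1.0 - osa_distance(a, b) / min(len(a), len(b))

distance = osa_distance("elasticsearaach", "elasticsearch")
score = similarity("elasticsearaach", "elasticsearch")
print(distance, score)
```

    Under this assumption the distance is 2 (the two extra a characters) and the similarity is 11/13, which agrees with the 0.84615386 reported above.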

    5. Suggest on several fields in one request, each with its own suggestion:

    POST _search
    {
      "suggest": {
        "my-suggest-1" : {
          "text" : "tring out Elasticsearch",
          "term" : {
            "field" : "message"
          }
        },
        "my-suggest-2" : {
          "text" : "kmichy",
          "term" : {
            "field" : "user"
          }
        }
      }
    }

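    The term suggester suggests terms based on edit distance. The provided suggest text is analyzed before any terms are suggested, and suggestions are returned per analyzed token of the suggest text. The term suggester does not take the query part of the request into account.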

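    Common suggest options:

    text

    The suggest text. It is required and must be set either globally or per suggestion.

    field

    The field to fetch candidate suggestions from. It is required and must be set either globally or per suggestion.

    analyzer

    The analyzer used to analyze the suggest text. Defaults to the search analyzer of the suggest field.

    size

    The maximum number of corrections returned per suggest text token.

    sort

    Defines how suggestions are sorted per suggest text term. Two values are possible:

    • score: sort by score first, then by document frequency, then by the term itself.
    • frequency: sort by document frequency first, then by similarity score, then by the term itself.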
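
    The two sort orders can be modelled as sort keys. This is an illustrative Python sketch only: the function name sort_options and the sample options are made up, and the direction of the final tie-break on the term is an assumption:

```python
def sort_options(options, sort="score"):
    """Order suggestion options the way the two sort modes are described."""
    if sort == "score":
        key = lambda o: (-o["score"], -o["freq"], o["text"])
    elif sort == "frequency":
        key = lambda o: (-o["freq"], -o["score"], o["text"])
    else:
        raise ValueError("sort must be 'score' or 'frequency'")
    return sorted(options, key=key)

# Hypothetical options for one suggest text token.
options = [
    {"text": "rocks", "score": 0.75, "freq": 2},
    {"text": "rock", "score": 0.80, "freq": 1},
]
by_score = sort_options(options, "score")
by_frequency = sort_options(options, "frequency")
print([o["text"] for o in by_score], [o["text"] for o in by_frequency])
```

    With these sample values, score puts the closer match first, while frequency puts the more common term first.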

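    suggest_mode

    Controls which suggestions are included, or for which suggest text terms suggestions are generated. Three values are possible:

    • missing: only suggest for suggest text terms that are not in the index. This is the default.
    • popular: only suggest terms that occur in more documents than the original suggest text term.
    • always: suggest any matching term, whether or not the suggest text term is in the index.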
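
    A simplified Python model of the three modes may make the difference easier to see. The function name candidates_for and its arguments are hypothetical; the document frequencies in the example (rock in 1 document, rocks in 2) are those of the book4 data above:

```python
def candidates_for(mode, input_freq, candidates):
    """candidates: list of (term, doc_freq) pairs within edit distance.
    input_freq: document frequency of the input term (0 if not indexed)."""
    if mode == "missing":
        # Suggest only when the input term itself is not in the index.
        return candidates if input_freq == 0 else []
    if mode == "popular":
        # Keep only candidates found in more documents than the input term.
        return [(t, f) for t, f in candidates if f > input_freq]
    if mode == "always":
        return candidates
    raise ValueError("unknown suggest_mode: " + mode)

# Input term "rock" (1 document) with candidate "rocks" (2 documents).
print(candidates_for("missing", 1, [("rocks", 2)]))
print(candidates_for("popular", 1, [("rocks", 2)]))
```

    With missing, an input that already exists in the index gets no suggestions at all; popular still offers rocks because it occurs in more documents than rock.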

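    Other term suggest options:

    lowercase_terms

    Lowercases the suggest text terms after text analysis.

    max_edits

    The maximum edit distance a candidate may have and still be considered a suggestion. Only the values 1 and 2 are allowed; any other value results in a bad request error. Defaults to 2.

    prefix_length

    The minimum number of leading characters that must match for a term to be a candidate. Defaults to 1. Increasing this number improves spellcheck performance, and misspellings usually do not occur at the beginning of a term. (The old name "prefix_len" is deprecated.)

    min_word_length

    The minimum length a suggest text term must have in order to be included. Defaults to 4. (The old name "min_word_len" is deprecated.)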
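
    How these three options narrow down the candidates can be sketched in Python. This is a rough model rather than the Lucene implementation: it uses plain Levenshtein distance, and the function names are hypothetical:

```python
def levenshtein(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_candidate(term, candidate, max_edits=2, prefix_length=1, min_word_length=4):
    if max_edits not in (1, 2):
        raise ValueError("max_edits must be 1 or 2")
    if len(term) < min_word_length:
        return False  # input term too short to be spellchecked
    if term[:prefix_length] != candidate[:prefix_length]:
        return False  # required common prefix does not match
    # Distance 0 means the same term, which is not a correction.
    return 0 < levenshtein(term, candidate) <= max_edits

print(is_candidate("elasticsearaach", "elasticsearch"))
print(is_candidate("rlasticsearch", "elasticsearch"))
```

    The second call fails on the prefix check alone, which illustrates why a typo in the first character is not corrected with the default prefix_length of 1.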

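    shard_size

    The maximum number of suggestions retrieved from each individual shard. During the reduce phase only the top N suggestions are returned, based on the size option. Defaults to the value of size. Setting it higher than size can be useful to get more accurate document frequencies for spelling corrections, at the cost of performance. Because terms are partitioned across shards, the shard-level document frequencies of spelling corrections may be imprecise; increasing this value makes them more precise.

    max_inspections

    A factor that shard_size is multiplied by in order to inspect more candidate spelling corrections at the shard level. Can improve accuracy at the cost of performance. Defaults to 5.

    min_doc_freq

    The minimum number of documents a suggestion must appear in. Can be given as an absolute number or as a relative fraction of the number of documents. This can improve quality by suggesting only high-frequency terms. Defaults to 0f, which means it is not enabled. A value greater than 1 must not be fractional. Shard-level document frequencies are used for this option.

    max_term_freq

    The maximum number of documents a suggest text token may occur in and still be spellchecked. Can be a relative fraction (for example 0.4) or an absolute number of documents. A value greater than 1 must not be fractional. Defaults to 0.01f. This can be used to exclude high-frequency terms from spellchecking; such terms are usually spelled correctly, and skipping them also improves spellcheck performance. Shard-level document frequencies are used for this option.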
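
    Both thresholds accept either a fraction or an absolute count. The Python sketch below shows one way to interpret them; it is a simplified model, the exact rounding used internally is an assumption, and the function names are hypothetical:

```python
import math

def min_doc_freq_threshold(min_doc_freq, doc_count):
    """Values below 1 are a fraction of the documents, values >= 1 are absolute."""
    if min_doc_freq >= 1:
        return int(min_doc_freq)
    return int(min_doc_freq * doc_count)

def too_frequent(term_doc_freq, max_term_freq, doc_count):
    """True when the input term is common enough to be skipped."""
    if max_term_freq >= 1:
        return term_doc_freq > max_term_freq
    return term_doc_freq > math.ceil(max_term_freq * doc_count)

# With 6 documents, as in book4, and the default max_term_freq of 0.01:
print(too_frequent(3, 0.01, 6))   # a term found in 3 documents
print(too_frequent(1, 0.01, 6))   # a term found in 1 document
```

    On a very small index such as book4, a fractional default like 0.01 already excludes any term that appears in more than one document, which is worth keeping in mind when testing with toy data.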

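    string_distance

    The string distance implementation used to compare how similar suggested terms are. Five values are possible:

    • internal: the default, based on damerau_levenshtein but highly optimized for comparing string distance for terms inside the index.
    • damerau_levenshtein: string distance based on the Damerau-Levenshtein algorithm.
    • levenshtein: string distance based on the Levenshtein edit distance algorithm.
    • jaro_winkler: string distance based on the Jaro-Winkler algorithm.
    • ngram: string distance based on character n-grams.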

    (See the official Term Suggester API documentation.)

    2. Phrase Suggester: phrase correction

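    The phrase suggester builds on the term suggester: it also takes into account how the terms relate to each other, for example whether they occur together in the indexed text, how close they are to each other, and how frequent they are.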
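
    To illustrate the idea of weighing terms against each other, here is a toy Python sketch that scores candidate phrases with smoothed bigram counts over the six sample passages. It is not the language model the phrase suggester actually uses; the smoothing constant 0.1 and all function names are assumptions made for the example:

```python
from collections import Counter
from itertools import product

corpus = [
    "lucene is cool",
    "elasticsearch builds on top of lucene",
    "elasticsearch rocks",
    "elastic is the company behind elk stack",
    "elk rocks",
    "elasticsearch is rock solid",
]

unigrams = Counter(w for doc in corpus for w in doc.split())
bigrams = Counter(
    pair for doc in corpus for pair in zip(doc.split(), doc.split()[1:])
)

def phrase_score(words):
    """Toy score: product of smoothed bigram probabilities P(next | previous)."""
    score = 1.0
    for prev, nxt in zip(words, words[1:]):
        score *= (bigrams[(prev, nxt)] + 0.1) / (unigrams[prev] + 0.1 * len(unigrams))
    return score

def best_phrase(per_token_candidates):
    """Pick the highest scoring combination of per-token candidates."""
    return max(product(*per_token_candidates), key=phrase_score)

best = best_phrase([["elasticsearch"], ["rock", "rocks"]])
print(best)
```

    Because elasticsearch is directly followed by rocks in one passage and never by rock, the combination elasticsearch rocks scores higher, even though rock is itself an indexed term.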

    Example 1:

    POST book4/_search
    {
      "suggest" : {
        "myss":{
          "text": "Elasticsearch rock",
          "phrase": {
            "field": "passage"
          }
        }
      }
    }

    Response:

    {
        "took": 11,
        "timed_out": false,
        "_shards": {
            "total": 5,
            "successful": 5,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 0,
            "max_score": 0,
            "hits": []
        },
        "suggest": {
            "myss": [
                {
                    "text": "Elasticsearch rock",
                    "offset": 0,
                    "length": 18,
                    "options": [
                        {
                            "text": "elasticsearch rocks",
                            "score": 0.3467123
                        }
                    ]
                }
            ]
        }
    }

    3. Completion Suggester: autocomplete

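    A suggester designed for the autocomplete scenario. Here a request goes to the backend on every character the user types, so when the user types fast the backend has to respond very quickly. The implementation therefore uses a different data structure from the two suggesters above: instead of an inverted index, the analyzed input is encoded into an FST (finite state transducer) that is stored alongside the index. For an open index, ES loads the whole FST into memory, which makes prefix lookups extremely fast. The catch is that an FST only supports prefix lookups, and that is the main limitation of the Completion Suggester.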
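
    The prefix-only behaviour can be illustrated without an FST. The Python sketch below uses a sorted list with binary search as a stand-in; it shares the property that only prefixes can be looked up, but it is not how Lucene stores the data, and the class name PrefixIndex is hypothetical:

```python
import bisect

class PrefixIndex:
    """Sorted-list stand-in for the in-memory FST: supports prefix lookup only."""

    def __init__(self, inputs):
        self._inputs = sorted(inputs)

    def lookup(self, prefix, size=5):
        # All strings starting with the prefix form one contiguous run
        # in the sorted list, beginning at the insertion point.
        start = bisect.bisect_left(self._inputs, prefix)
        matches = []
        for text in self._inputs[start:]:
            if not text.startswith(prefix):
                break
            matches.append(text)
            if len(matches) == size:
                break
        return matches

index = PrefixIndex(["test my book", "test english", "book1 english", "good"])
print(index.lookup("te"))
print(index.lookup("english"))
```

    The second lookup returns nothing even though two inputs contain the word english, because the word is not at the start of either input.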

    1. Create the index

    PUT /book5
    {
        "mappings": {
            "music" : {
                "properties" : {
                    "suggest" : { 
                        "type" : "completion"
                    },
                    "title" : {
                        "type": "keyword"
                    }
                }
            }
        }
    }

    Insert data:

    POST /book5/music
    {
        "suggest":"test my book"
    }

    input specifies the text or texts to match; weight (optional) sets the ranking value.

    PUT book5/music/5nupmmUBYLvVFwGWH3cu?refresh
    {
        "suggest" : {
            "input": [ "test", "book" ],
            "weight" : 34
        }
    }

    Give each input its own weight:

    PUT book5/music/6Hu2mmUBYLvVFwGWxXef?refresh
    {
        "suggest" : [
            {
                "input": "test",
                "weight" : 10
            },
            {
                "input": "good",
                "weight" : 3
            }
        ]
    }
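
    The effect of weight on ranking can be sketched as follows. This is a simplified Python model using the inputs and weights from the two requests above; the function name suggest and the alphabetical tie-break are assumptions:

```python
def suggest(entries, prefix, size=5):
    """entries: list of (input_text, weight) pairs.
    Return the matching inputs, highest weight first."""
    matches = [(text, weight) for text, weight in entries if text.startswith(prefix)]
    # Higher weight first; ties are broken alphabetically (an assumption).
    matches.sort(key=lambda m: (-m[1], m[0]))
    return matches[:size]

entries = [
    ("test", 34), ("book", 34),   # first document: two inputs, one weight
    ("test", 10), ("good", 3),    # second document: a weight per input
]
ranked = suggest(entries, "te")
print(ranked)
```

    In the search responses later in this article, the _score of a completion option equals the weight of the matched input (for example 20), and inputs indexed without a weight show a _score of 1.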

    Example 1: suggest by prefix

    POST book5/_search?pretty
    {
        "suggest": {
            "song-suggest" : {
                "prefix" : "te", 
                "completion" : { 
                    "field" : "suggest" 
                }
            }
        }
    }

    Response:

    {
        "took": 8,
        "timed_out": false,
        "_shards": {
            "total": 5,
            "successful": 5,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 0,
            "max_score": 0,
            "hits": []
        },
        "suggest": {
            "song-suggest": [
                {
                    "text": "te",
                    "offset": 0,
                    "length": 2,
                    "options": [
                        {
                            "text": "test my book1",
                            "_index": "book5",
                            "_type": "music",
                            "_id": "6Xu6mmUBYLvVFwGWpXeL",
                            "_score": 1,
                            "_source": {
                                "suggest": "test my book1"
                            }
                        },
                        {
                            "text": "test my book1",
                            "_index": "book5",
                            "_type": "music",
                            "_id": "6nu8mmUBYLvVFwGWSndC",
                            "_score": 1,
                            "_source": {
                                "suggest": "test my book1"
                            }
                        },
                        {
                            "text": "test my book1 english",
                            "_index": "book5",
                            "_type": "music",
                            "_id": "63u8mmUBYLvVFwGWZHdC",
                            "_score": 1,
                            "_source": {
                                "suggest": "test my book1 english"
                            }
                        }
                    ]
                }
            ]
        }
    }

    Example 2: remove duplicate suggestions

    POST book5/_search?pretty
    {
        "suggest": {
            "song-suggest" : {
                "prefix" : "te",
                "completion" : {
                    "field" : "suggest",
                    "skip_duplicates": true
                }
            }
        }
    }
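
    What skip_duplicates does to the result list can be modelled in a few lines of Python. This is an illustration only, applied to the option texts and ids from the Example 1 response; the function name is hypothetical and the real filtering happens inside Elasticsearch:

```python
def skip_duplicates(options):
    """Keep the first (highest ranked) option for each distinct text."""
    seen = set()
    unique = []
    for option in options:
        if option["text"] not in seen:
            seen.add(option["text"])
            unique.append(option)
    return unique

options = [
    {"text": "test my book1", "_id": "6Xu6mmUBYLvVFwGWpXeL"},
    {"text": "test my book1", "_id": "6nu8mmUBYLvVFwGWSndC"},
    {"text": "test my book1 english", "_id": "63u8mmUBYLvVFwGWZHdC"},
]
unique = skip_duplicates(options)
print([o["text"] for o in unique])
```

    Two documents that share the same suggestion text collapse into one option, so the user no longer sees test my book1 twice.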

    Example 3: a document that stores several phrases as inputs

    POST /book5/music/63u8mmUBYLvVFwGWZHdC?refresh
    {
        "suggest" : {
            "input": [ "book1 english", "test english" ],
            "weight" : 20
        }
    }

    Query:

    POST book5/_search?pretty
    {
        "suggest": {
            "song-suggest" : {
                "prefix" : "test", 
                "completion" : { 
                    "field" : "suggest" ,
                    "skip_duplicates": true
                }
            }
        }
    }

    Result:

    {
        "took": 7,
        "timed_out": false,
        "_shards": {
            "total": 5,
            "successful": 5,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 0,
            "max_score": 0,
            "hits": []
        },
        "suggest": {
            "song-suggest": [
                {
                    "text": "test",
                    "offset": 0,
                    "length": 4,
                    "options": [
                        {
                            "text": "test english",
                            "_index": "book5",
                            "_type": "music",
                            "_id": "63u8mmUBYLvVFwGWZHdC",
                            "_score": 20,
                            "_source": {
                                "suggest": {
                                    "input": [
                                        "book1 english",
                                        "test english"
                                    ],
                                    "weight": 20
                                }
                            }
                        },
                        {
                            "text": "test my book1",
                            "_index": "book5",
                            "_type": "music",
                            "_id": "6Xu6mmUBYLvVFwGWpXeL",
                            "_score": 1,
                            "_source": {
                                "suggest": "test my book1"
                            }
                        }
                    ]
                }
            ]
        }
    }

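    4. Summary and recommendations

    Getting good results from the Completion Suggester is not easy. In real-world development you have to combine analyzer and mapping parameters to suit the data and the business requirements, and tune repeatedly before the completions look right.

    Back to the completion and correction features of the search box from the beginning of this article: how could they be implemented with ES? One approach I can think of:

    1. While the user is still typing, use the Completion Suggester for keyword prefix matching. There are many matches at first, and fewer as more characters are typed. If the input is accurate, the Completion Suggester results may already be good enough and the user sees suitable candidates.
    2. If the Completion Suggester returns zero matches, the user may have mistyped, so try the Phrase Suggester.
    3. If the Phrase Suggester finds no options either, fall back to the Term Suggester.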
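
    The three steps above amount to a fallback chain. A minimal Python sketch, assuming each suggester is wrapped in a function that takes the text and returns a list of options; the stub functions below are hypothetical stand-ins for real Elasticsearch calls:

```python
def suggest_with_fallback(text, completion, phrase, term):
    """Try each suggester in order and return the first non-empty result."""
    for name, suggester in (("completion", completion),
                            ("phrase", phrase),
                            ("term", term)):
        options = suggester(text)
        if options:
            return name, options
    return None, []

# Stub suggesters standing in for real Elasticsearch requests.
def completion_stub(text):
    return ["elasticsearch rocks"] if "elasticsearch rocks".startswith(text) else []

def phrase_stub(text):
    return ["elasticsearch rocks"] if text == "elasticsearhc rocks" else []

def term_stub(text):
    return ["elasticsearch"] if text == "elasticsearaach" else []

for text in ("elast", "elasticsearhc rocks", "elasticsearaach"):
    print(text, suggest_with_fallback(text, completion_stub, phrase_stub, term_stub))
```

    Passing the suggesters in as functions keeps the fallback order in one place and makes it easy to test without a running cluster.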

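    In terms of precision: Completion > Phrase > Term, while recall is the reverse. In terms of performance, the Completion Suggester is the fastest; if it meets the business requirements, using only the Completion Suggester for prefix matching is ideal. The Phrase and Term suggesters search the inverted index, so they should be considerably slower by comparison. Keep the amount of data in the indices used by the suggesters as small as possible; ideally, after a warm-up period, the whole index can be mapped into memory.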

  • Original article: https://www.cnblogs.com/wangzhuxing/p/9574630.html