• Elasticsearch 7.x Analyzers


    Configuring analyzers

    Anatomy of an analyzer

    • exactly one tokenizer
    • zero or more token filters
    • zero or more character filters
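    The three stages run in a fixed order: character filters first, then the tokenizer, then the token filters. As an illustration, an ad-hoc _analyze request can combine one built-in component of each kind to show the pipeline in action (html_strip, standard, and lowercase are all built-in components):

```json
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer":   "standard",
  "filter":      [ "lowercase" ],
  "text":        "<p>The QUICK Fox</p>"
}
```

    The html_strip char filter removes the markup, the standard tokenizer splits the remaining text on word boundaries, and the lowercase token filter normalizes each token, so the expected tokens are "the", "quick", "fox".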

    Configuring built-in analyzers

    Built-in analyzers can be used directly without any configuration; however, some of them support optional parameters that change their behavior. For example, the standard analyzer can be configured with a stop-word list.

    Example:

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "std_english": { 
              "type":      "standard",
              "stopwords": "_english_"
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "my_text": {
            "type":     "text",
            "analyzer": "standard", 
            "fields": {
              "english": {
                "type":     "text",
                "analyzer": "std_english" 
              }
            }
          }
        }
      }
    }
    POST my-index-000001/_analyze
    {
      "field": "my_text", 
      "text": "The old brown cow"
    }
    POST my-index-000001/_analyze
    {
      "field": "my_text.english", 
      "text": "The old brown cow"
    }

    Results: the first response is for the my_text field (standard analyzer); the second is for my_text.english, where the English stop word "the" has been removed.

    {
      "tokens" : [
        {
          "token" : "the",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "old",
          "start_offset" : 4,
          "end_offset" : 7,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "brown",
          "start_offset" : 8,
          "end_offset" : 13,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "cow",
          "start_offset" : 14,
          "end_offset" : 17,
          "type" : "<ALPHANUM>",
          "position" : 3
        }
      ]
    }
    {
      "tokens" : [
        {
          "token" : "old",
          "start_offset" : 4,
          "end_offset" : 7,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "brown",
          "start_offset" : 8,
          "end_offset" : 13,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "cow",
          "start_offset" : 14,
          "end_offset" : 17,
          "type" : "<ALPHANUM>",
          "position" : 3
        }
      ]
    }

    Configuring a custom analyzer

    When the built-in analyzers do not fully meet your needs, you can create a custom analyzer from the appropriate combination of:

    1. zero or more character filters

    2. exactly one tokenizer

    3. zero or more token filters

    Configuration parameters:

    Parameter               Description
    tokenizer               A built-in or custom tokenizer (required)
    char_filter             An optional array of built-in or custom character filters
    filter                  An optional array of built-in or custom token filters
    position_increment_gap  The position gap inserted between values of a multi-valued field; defaults to 100
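    Of these parameters, position_increment_gap is the least obvious: when a text field holds an array of strings, Elasticsearch inserts this many extra positions between the last token of one value and the first token of the next, so that phrase queries do not accidentally match across values. A minimal sketch (the index and field names here are invented for illustration):

```json
PUT my-index-000002
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}
```

    With the gap lowered to 0, indexing { "names": [ "John Abraham", "Lincoln Smith" ] } places "abraham" and "lincoln" at adjacent positions, so a match_phrase query for "Abraham Lincoln" matches; with the default gap of 100 it does not.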

    Example:

    Setting "type": "custom" tells Elasticsearch that we are defining a custom analyzer. When configuring a built-in analyzer instead, type is the name of that built-in analyzer.

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": {
              "type": "custom", 
              "tokenizer": "standard",
              "char_filter": [
                "html_strip"
              ],
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          }
        }
      }
    }
    POST my-index-000001/_analyze
    {
      "analyzer": "my_custom_analyzer",
      "text": "Is this <b>déjà vu</b>?"
    }

    Result:

    {
      "tokens" : [
        {
          "token" : "is",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "this",
          "start_offset" : 3,
          "end_offset" : 7,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "deja",
          "start_offset" : 11,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "vu",
          "start_offset" : 16,
          "end_offset" : 22,
          "type" : "<ALPHANUM>",
          "position" : 3
        }
      ]
    }

    Example 2:

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": { 
              "type": "custom",
              "char_filter": [
                "emoticons"
              ],
              "tokenizer": "punctuation",
              "filter": [
                "lowercase",
                "english_stop"
              ]
            }
          },
          "tokenizer": {
            "punctuation": { 
              "type": "pattern",
              "pattern": "[ .,!?]"
            }
          },
          "char_filter": {
            "emoticons": { 
              "type": "mapping",
              "mappings": [
                ":) => _happy_",
                ":( => _sad_"
              ]
            }
          },
          "filter": {
            "english_stop": { 
              "type": "stop",
              "stopwords": "_english_"
            }
          }
        }
      }
    }
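    This second example can be exercised with an _analyze request against the index. With the configuration above, the emoticons char filter rewrites ":)" before tokenization, the punctuation tokenizer splits on spaces and punctuation, and english_stop drops the stop words "a" and "and":

```json
POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
```

    The expected tokens are "i'm", "_happy_", "person", "you".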

    Testing analyzers

    Testing built-in analyzers

    POST _analyze
    {
      "analyzer": "whitespace",
      "text": "The quick brown fox."
    }

    Result:

    {
      "tokens" : [
        {
          "token" : "The",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "quick",
          "start_offset" : 4,
          "end_offset" : 9,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "brown",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "fox.",
          "start_offset" : 16,
          "end_offset" : 20,
          "type" : "word",
          "position" : 3
        }
      ]
    }

    Example 2:

    POST _analyze
    {
      "tokenizer": "standard",
      "filter":  [ "lowercase", "asciifolding" ],
      "text":      "Is this déja vu?"
    }

    Testing a custom analyzer

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "std_folded": { 
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "my_text": {
            "type": "text",
            "analyzer": "std_folded" 
          }
        }
      }
    }
    GET my-index-000001/_analyze 
    {
      "analyzer": "std_folded", 
      "text":     "Is this déjà vu?"
    }
    GET my-index-000001/_analyze 
    {
      "field": "my_text", 
      "text":  "Is this déjà vu?"
    }
  • Source: https://www.cnblogs.com/zxbdboke/p/14562637.html