• The Elasticsearch analysis process: built-in character filters, analyzers, tokenizers, and token filters (there really are a lot of them!)


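    The analysis process

    After data is sent to Elasticsearch and before it is added to the inverted index, Elasticsearch runs the document's text through these steps:

    • Character filtering: character filters transform the raw characters.
    • Tokenization: a tokenizer splits the text into one or more tokens.
    • Token filtering: token filters transform each token.
    • Token indexing: the resulting tokens are stored in the Lucene inverted index.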
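    The four steps above can be sketched in plain Python. This is only a toy model, not how Elasticsearch or Lucene is implemented: the function names, the "&" rewrite rule, and the two sample documents are all made up for illustration.

```python
import re
from collections import defaultdict


def char_filter(text):
    # Step 1: character filtering (hypothetical rule: spell out "&").
    return text.replace("&", " and ")


def tokenize(text):
    # Step 2: split the filtered text into tokens on non-alphanumerics.
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens):
    # Step 3: transform each token (here: lowercase it).
    return [token.lower() for token in tokens]


def analyze(text):
    return token_filter(tokenize(char_filter(text)))


def build_inverted_index(docs):
    # Step 4: map every token to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


docs = {1: "Tom & Jerry", 2: "Jerry Seinfeld"}  # made-up sample documents
index = build_inverted_index(docs)
print(analyze("Tom & Jerry"))   # ['tom', 'and', 'jerry']
print(sorted(index["jerry"]))   # [1, 2]
```

    A search for "jerry" then only needs a dictionary lookup, which is why the analysis work is done once, at index time.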

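    Overall flow: text → character filters → tokenizer → token filters → inverted index.

    The goal is to produce tokens that match the way people naturally search.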

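    Built-in character filters

    The HTML strip character filter (html_strip), the mapping character filter (mapping), and the pattern replace character filter (pattern_replace).

    HTML strip character filter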

    POST _analyze
    {
      "tokenizer":      "keyword", 
      "char_filter":  [ "html_strip" ],
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }

    Result

    {
      "tokens" : [
        {
          "token" : """
    
    I'm so happy!
    
    """,
          "start_offset" : 0,
          "end_offset" : 32,
          "type" : "word",
          "position" : 0
        }
      ]
    }

    Custom HTML strip character filter (escaped_tags lists the tags to leave in place, here <b>)

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": ["my_char_filter"]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "html_strip",
              "escaped_tags": ["b"]
            }
          }
        }
      }
    }
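    To make the effect of html_strip and escaped_tags easier to picture, here is a rough Python approximation. It is a sketch under assumptions, not the Lucene implementation: the BLOCK_TAGS set is a made-up subset of the tags that get replaced by a newline, and the tag regex is deliberately simple.

```python
import html
import re

# Assumption: a small, illustrative subset of block-level tags.
BLOCK_TAGS = {"p", "div", "br"}

TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def html_strip(text, escaped_tags=()):
    keep = {tag.lower() for tag in escaped_tags}

    def replace_tag(match):
        name = match.group(1).lower()
        if name in keep:
            return match.group(0)  # escaped tag: leave it untouched
        return "\n" if name in BLOCK_TAGS else ""

    # Remove or replace tags first, then decode entities such as &apos;.
    return html.unescape(TAG_RE.sub(replace_tag, text))


sample = "<p>I&apos;m so <b>happy</b>!</p>"
print(repr(html_strip(sample)))                      # "\nI'm so happy!\n"
print(repr(html_strip(sample, escaped_tags=["b"])))  # "\nI'm so <b>happy</b>!\n"
```

    The second call corresponds to the my_char_filter definition above: everything is stripped except the <b> element.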

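    Mapping character filter

    (PUT my_index fails if the index from the previous example still exists; run DELETE my_index first.)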

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer":{
              "tokenizer":"keyword",
              "char_filter":["my_char_filter"]
            }
          },
          "char_filter":{
              "my_char_filter":{
                "type":"mapping",
                "mappings":["苍井空 => 666","武藤兰 => 888"]
              }
            }
        }
      }
    }
    
    GET my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text":"苍井空热爱武藤兰,可惜苍井空后来结婚了"
    }

    Result

    {
      "tokens" : [
        {
          "token" : "666热爱888,可惜666后来结婚了",
          "start_offset" : 0,
          "end_offset" : 19,
          "type" : "word",
          "position" : 0
        }
      ]
    }
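    The replacement that the mapping character filter performs can be imitated with a few lines of Python. This is a sketch only: mapping_char_filter is a hypothetical helper, and it assumes that the longest key wins when several keys match at the same position.

```python
import re


def mapping_char_filter(text, mappings):
    # Parse rules written in the same "key => value" form as the settings.
    rules = {}
    for rule in mappings:
        key, value = rule.split("=>")
        rules[key.strip()] = value.strip()

    # Try longer keys first so the longest match wins at a given position.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: rules[match.group(0)], text)


mappings = ["苍井空 => 666", "武藤兰 => 888"]
text = "苍井空热爱武藤兰,可惜苍井空后来结婚了"
result = mapping_char_filter(text, mappings)
print(result)
```

    Because the keyword tokenizer emits the whole input as one token, the single token in the result above is simply the filtered string.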

    Pattern replace character filter

    PUT my_index1
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "my_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(\\d+)-(?=\\d)",
              "replacement": "$1_"
            }
          }
        }
      }
    }
    
    POST my_index1/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "My credit card is 123-456-789"
    }

    Result

    {
      "tokens" : [
        {
          "token" : "My",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "credit",
          "start_offset" : 3,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "card",
          "start_offset" : 10,
          "end_offset" : 14,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "is",
          "start_offset" : 15,
          "end_offset" : 17,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "123_456_789",
          "start_offset" : 18,
          "end_offset" : 29,
          "type" : "<NUM>",
          "position" : 4
        }
      ]
    }
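    The regular expression in my_char_filter can be tried out on its own. The Python sketch below uses the same pattern. The only difference is syntax: Python writes the back-reference as \1 where the Java-style replacement in the settings uses $1. The helper name pattern_replace is made up for this example.

```python
import re

# (\d+)  a run of digits, captured as group 1
# -      a literal hyphen
# (?=\d) lookahead: the hyphen must be followed by a digit (not consumed)
PATTERN = re.compile(r"(\d+)-(?=\d)")


def pattern_replace(text):
    # Put group 1 back and swap the hyphen for an underscore.
    return PATTERN.sub(r"\1_", text)


print(pattern_replace("My credit card is 123-456-789"))
# My credit card is 123_456_789
```

    The lookahead is what lets 456 take part in two matches: it is checked as "a digit follows" for the first hyphen, then captured as group 1 for the second.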

    Built-in analyzers

    Built-in tokenizers

    UAX URL email tokenizer

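    Sample text (author, source URL, email address, and copyright notice; left in Chinese because it is the request input):

    作者:一线码农
    来源:未知原文:https://www.cnblogs.com/Mc_HotHog/articles/1111111.html
    邮箱:22222@qq.com
    版权声明:本文为博主原创文章,转载请附上博文链接!
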
    POST _analyze
    {
      "tokenizer": "uax_url_email",
      "text":"作者:一线码农来源:未知原文:https://www.cnblogs.com/Mc_HotHog/articles/1111111.html邮箱:22222@qq.com版权声明:本文为博主原创文章,转载请附上博文链接!"
    }

    Result

    {
      "tokens" : [
        {
          "token" : "作",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<IDEOGRAPHIC>",
          "position" : 0
        },
        {
          "token" : "者",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "<IDEOGRAPHIC>",
          "position" : 1
        },
        {
          "token" : "一",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "<IDEOGRAPHIC>",
          "position" : 2
        },
        {
          "token" : "线",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "<IDEOGRAPHIC>",
          "position" : 3
        },
        {
          "token" : "码",
          "start_offset" : 5,
          "end_offset" : 6,
          "type" : "<IDEOGRAPHIC>",
          "position" : 4
        },
        {
          "token" : "农",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "<IDEOGRAPHIC>",
          "position" : 5
        },
        {
          "token" : "来",
          "start_offset" : 7,
          "end_offset" : 8,
          "type" : "<IDEOGRAPHIC>",
          "position" : 6
        },
        {
          "token" : "源",
          "start_offset" : 8,
          "end_offset" : 9,
          "type" : "<IDEOGRAPHIC>",
          "position" : 7
        },
        {
          "token" : "未",
          "start_offset" : 10,
          "end_offset" : 11,
          "type" : "<IDEOGRAPHIC>",
          "position" : 8
        },
        {
          "token" : "知",
          "start_offset" : 11,
          "end_offset" : 12,
          "type" : "<IDEOGRAPHIC>",
          "position" : 9
        },
        {
          "token" : "原",
          "start_offset" : 12,
          "end_offset" : 13,
          "type" : "<IDEOGRAPHIC>",
          "position" : 10
        },
        {
          "token" : "文",
          "start_offset" : 13,
          "end_offset" : 14,
          "type" : "<IDEOGRAPHIC>",
          "position" : 11
        },
        {
          "token" : "https://www.cnblogs.com/Mc_HotHog/articles/1111111.html",
          "start_offset" : 15,
          "end_offset" : 70,
          "type" : "<URL>",
          "position" : 12
        },
        {
          "token" : "邮",
          "start_offset" : 70,
          "end_offset" : 71,
          "type" : "<IDEOGRAPHIC>",
          "position" : 13
        },
        {
          "token" : "箱",
          "start_offset" : 71,
          "end_offset" : 72,
          "type" : "<IDEOGRAPHIC>",
          "position" : 14
        },
        {
          "token" : "22222@qq.com",
          "start_offset" : 73,
          "end_offset" : 85,
          "type" : "<EMAIL>",
          "position" : 15
        },
        {
          "token" : "版",
          "start_offset" : 85,
          "end_offset" : 86,
          "type" : "<IDEOGRAPHIC>",
          "position" : 16
        },
        {
          "token" : "权",
          "start_offset" : 86,
          "end_offset" : 87,
          "type" : "<IDEOGRAPHIC>",
          "position" : 17
        },
        {
          "token" : "声",
          "start_offset" : 87,
          "end_offset" : 88,
          "type" : "<IDEOGRAPHIC>",
          "position" : 18
        },
        {
          "token" : "明",
          "start_offset" : 88,
          "end_offset" : 89,
          "type" : "<IDEOGRAPHIC>",
          "position" : 19
        },
        {
          "token" : "本",
          "start_offset" : 90,
          "end_offset" : 91,
          "type" : "<IDEOGRAPHIC>",
          "position" : 20
        },
        {
          "token" : "文",
          "start_offset" : 91,
          "end_offset" : 92,
          "type" : "<IDEOGRAPHIC>",
          "position" : 21
        },
        {
          "token" : "为",
          "start_offset" : 92,
          "end_offset" : 93,
          "type" : "<IDEOGRAPHIC>",
          "position" : 22
        },
        {
          "token" : "博",
          "start_offset" : 93,
          "end_offset" : 94,
          "type" : "<IDEOGRAPHIC>",
          "position" : 23
        },
        {
          "token" : "主",
          "start_offset" : 94,
          "end_offset" : 95,
          "type" : "<IDEOGRAPHIC>",
          "position" : 24
        },
        {
          "token" : "原",
          "start_offset" : 95,
          "end_offset" : 96,
          "type" : "<IDEOGRAPHIC>",
          "position" : 25
        },
        {
          "token" : "创",
          "start_offset" : 96,
          "end_offset" : 97,
          "type" : "<IDEOGRAPHIC>",
          "position" : 26
        },
        {
          "token" : "文",
          "start_offset" : 97,
          "end_offset" : 98,
          "type" : "<IDEOGRAPHIC>",
          "position" : 27
        },
        {
          "token" : "章",
          "start_offset" : 98,
          "end_offset" : 99,
          "type" : "<IDEOGRAPHIC>",
          "position" : 28
        },
        {
          "token" : "转",
          "start_offset" : 100,
          "end_offset" : 101,
          "type" : "<IDEOGRAPHIC>",
          "position" : 29
        },
        {
          "token" : "载",
          "start_offset" : 101,
          "end_offset" : 102,
          "type" : "<IDEOGRAPHIC>",
          "position" : 30
        },
        {
          "token" : "请",
          "start_offset" : 102,
          "end_offset" : 103,
          "type" : "<IDEOGRAPHIC>",
          "position" : 31
        },
        {
          "token" : "附",
          "start_offset" : 103,
          "end_offset" : 104,
          "type" : "<IDEOGRAPHIC>",
          "position" : 32
        },
        {
          "token" : "上",
          "start_offset" : 104,
          "end_offset" : 105,
          "type" : "<IDEOGRAPHIC>",
          "position" : 33
        },
        {
          "token" : "博",
          "start_offset" : 105,
          "end_offset" : 106,
          "type" : "<IDEOGRAPHIC>",
          "position" : 34
        },
        {
          "token" : "文",
          "start_offset" : 106,
          "end_offset" : 107,
          "type" : "<IDEOGRAPHIC>",
          "position" : 35
        },
        {
          "token" : "链",
          "start_offset" : 107,
          "end_offset" : 108,
          "type" : "<IDEOGRAPHIC>",
          "position" : 36
        },
        {
          "token" : "接",
          "start_offset" : 108,
          "end_offset" : 109,
          "type" : "<IDEOGRAPHIC>",
          "position" : 37
        }
      ]
    }
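    The result shows the idea behind uax_url_email: URLs and email addresses stay whole, while each Chinese character becomes its own token. The Python sketch below imitates that behaviour with one regular expression. It is a loose approximation under stated assumptions (ASCII-only URLs and emails, ideographs limited to U+4E00..U+9FFF), not the Unicode text segmentation rules the real tokenizer follows, and toy_uax_url_email is a made-up name.

```python
import re

# Alternatives are tried in order, so URLs and emails win over single words.
TOKEN_RE = re.compile(
    r"(?P<URL>https?://[A-Za-z0-9._~/?#%&=+-]+)"
    r"|(?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
    r"|(?P<IDEOGRAPHIC>[\u4e00-\u9fff])"
    r"|(?P<ALPHANUM>[A-Za-z0-9]+)"
)


def toy_uax_url_email(text):
    # Punctuation matches no alternative and is skipped, as in the result above.
    return [
        {
            "token": match.group(0),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": f"<{match.lastgroup}>",
            "position": position,
        }
        for position, match in enumerate(TOKEN_RE.finditer(text))
    ]


text = (
    "作者:一线码农来源:未知原文:"
    "https://www.cnblogs.com/Mc_HotHog/articles/1111111.html"
    "邮箱:22222@qq.com版权声明:本文为博主原创文章,转载请附上博文链接!"
)
tokens = toy_uax_url_email(text)
for token in tokens:
    if token["type"] in ("<URL>", "<EMAIL>"):
        print(token)
```

    With the standard tokenizer the URL and the email address would instead be broken into several pieces, which is the reason to reach for uax_url_email when such fields must be searchable as a whole.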

    Built-in token filters

    Learn more: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/index.html

  • Original article: https://www.cnblogs.com/Alexephor/p/11396724.html