• ElasticSearch(三):通分词器(Analyzer)进行分词(Analysis)


    ElasticSearch(三):通过分词器(Analyzer)进行分词(Analysis)

    学习课程链接《Elasticsearch核心技术与实战》


    ## Analysis与Analyzer * Analysis文本分析就是把全文转换成一系列单词的过程,也叫做分词。 * Analysis是通过Analyzer来实现的,它是专门处理分词的组件。可以使用ElasticSearch内置的分词器,也可以按需定制化分词器。 * 除了在数据写入时用分词器转换词条,在匹配查询语句时,也需要用相同的分词器对查询语句进行分析。

    Analyzer的组成

    分词器是专门处理分词的组件,Analyzer由三个部分组成:
    file

    • Character Filters:主要作用是对原始文本进行处理,例如去除HTML标签。
    • Tokenizer:主要作用是按照规则来切分单词。
    • Token Filter:将切分好的单词进行加工,例如:小写转换、删除停用词、增加同义词。

    ElasticSearch的内置分词器

    • Standard Analyzer:默认分词器,按词切分,小写处理。
    #standard
    GET _analyze
    {
      "analyzer": "standard",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #分词结果:Quick小写处理, brown-foxes被切分为 brown,foxes
    {
      "tokens" : [
        {
          "token" : "2",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<NUM>",
          "position" : 0
        },
        {
          "token" : "running",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "quick",#小写处理
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "brown",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "foxes",
          "start_offset" : 22,
          "end_offset" : 27,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "leap",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "<ALPHANUM>",
          "position" : 5
        },
        {
          "token" : "over",
          "start_offset" : 33,
          "end_offset" : 37,
          "type" : "<ALPHANUM>",
          "position" : 6
        },
        {
          "token" : "lazy",
          "start_offset" : 38,
          "end_offset" : 42,
          "type" : "<ALPHANUM>",
          "position" : 7
        },
        {
          "token" : "dogs",
          "start_offset" : 43,
          "end_offset" : 47,
          "type" : "<ALPHANUM>",
          "position" : 8
        },
        {
          "token" : "in",
          "start_offset" : 48,
          "end_offset" : 50,
          "type" : "<ALPHANUM>",
          "position" : 9
        },
        {
          "token" : "the",
          "start_offset" : 51,
          "end_offset" : 54,
          "type" : "<ALPHANUM>",
          "position" : 10
        },
        {
          "token" : "summer",
          "start_offset" : 55,
          "end_offset" : 61,
          "type" : "<ALPHANUM>",
          "position" : 11
        },
        {
          "token" : "evening",
          "start_offset" : 62,
          "end_offset" : 69,
          "type" : "<ALPHANUM>",
          "position" : 12
        }
      ]
    }
    
    • Simple Analyzer:按照非字母切分(符号被过滤),小写处理。
    #simpe
    GET _analyze
    {
      "analyzer": "simple",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #分词结果:数字2被过滤,Quick小写处理, brown-foxes被切分为 brown,foxes
    {
      "tokens" : [
        {
          "token" : "running",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "quick",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "brown",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "foxes",
          "start_offset" : 22,
          "end_offset" : 27,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "leap",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "over",
          "start_offset" : 33,
          "end_offset" : 37,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "lazy",
          "start_offset" : 38,
          "end_offset" : 42,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "dogs",
          "start_offset" : 43,
          "end_offset" : 47,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "in",
          "start_offset" : 48,
          "end_offset" : 50,
          "type" : "word",
          "position" : 8
        },
        {
          "token" : "the",
          "start_offset" : 51,
          "end_offset" : 54,
          "type" : "word",
          "position" : 9
        },
        {
          "token" : "summer",
          "start_offset" : 55,
          "end_offset" : 61,
          "type" : "word",
          "position" : 10
        },
        {
          "token" : "evening",
          "start_offset" : 62,
          "end_offset" : 69,
          "type" : "word",
          "position" : 11
        }
      ]
    }
    
    • Stop Analyzer:停用词过滤(is/a/the),小写处理。
    #stop
    GET _analyze
    {
      "analyzer": "stop",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #分词结果:2,in,the被过滤,Quick小写处理, brown-foxes被切分为 brown,foxes
    {
      "tokens" : [
        {
          "token" : "running",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "quick",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "brown",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "foxes",
          "start_offset" : 22,
          "end_offset" : 27,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "leap",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "over",
          "start_offset" : 33,
          "end_offset" : 37,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "lazy",
          "start_offset" : 38,
          "end_offset" : 42,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "dogs",
          "start_offset" : 43,
          "end_offset" : 47,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "summer",
          "start_offset" : 55,
          "end_offset" : 61,
          "type" : "word",
          "position" : 10
        },
        {
          "token" : "evening",
          "start_offset" : 62,
          "end_offset" : 69,
          "type" : "word",
          "position" : 11
        }
      ]
    }
    
    • WhiteSpace Analyzer:按照空格切分,不转小写。
    #whitespace
    GET _analyze
    {
      "analyzer": "whitespace",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #分词结果:按空格切分
    {
      "tokens" : [
        {
          "token" : "2",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "running",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "Quick",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "brown-foxes",
          "start_offset" : 16,
          "end_offset" : 27,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "leap",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "over",
          "start_offset" : 33,
          "end_offset" : 37,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "lazy",
          "start_offset" : 38,
          "end_offset" : 42,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "dogs",
          "start_offset" : 43,
          "end_offset" : 47,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "in",
          "start_offset" : 48,
          "end_offset" : 50,
          "type" : "word",
          "position" : 8
        },
        {
          "token" : "the",
          "start_offset" : 51,
          "end_offset" : 54,
          "type" : "word",
          "position" : 9
        },
        {
          "token" : "summer",
          "start_offset" : 55,
          "end_offset" : 61,
          "type" : "word",
          "position" : 10
        },
        {
          "token" : "evening.",
          "start_offset" : 62,
          "end_offset" : 70,
          "type" : "word",
          "position" : 11
        }
      ]
    }
    
    • Keyword Analyzer:不分词,直接将输入当作输出。
    #keyword
    GET _analyze
    {
      "analyzer": "keyword",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #分词结果:
    {
      "tokens" : [
        {
          "token" : "2 running Quick brown-foxes leap over lazy dogs in the summer evening.",
          "start_offset" : 0,
          "end_offset" : 70,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    
    • Pattern Analyzer:正则表达式分词,默认W+(非字符分隔)。
    #pattern
    GET _analyze
    {
      "analyzer": "pattern",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #分词结果:
    {
      "tokens" : [
        {
          "token" : "2",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "running",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "quick",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "brown",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "foxes",
          "start_offset" : 22,
          "end_offset" : 27,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "leap",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "over",
          "start_offset" : 33,
          "end_offset" : 37,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "lazy",
          "start_offset" : 38,
          "end_offset" : 42,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "dogs",
          "start_offset" : 43,
          "end_offset" : 47,
          "type" : "word",
          "position" : 8
        },
        {
          "token" : "in",
          "start_offset" : 48,
          "end_offset" : 50,
          "type" : "word",
          "position" : 9
        },
        {
          "token" : "the",
          "start_offset" : 51,
          "end_offset" : 54,
          "type" : "word",
          "position" : 10
        },
        {
          "token" : "summer",
          "start_offset" : 55,
          "end_offset" : 61,
          "type" : "word",
          "position" : 11
        },
        {
          "token" : "evening",
          "start_offset" : 62,
          "end_offset" : 69,
          "type" : "word",
          "position" : 12
        }
      ]
    }
    
    • Language:提供了30多种常见语言的分词器。
    #english
    GET _analyze
    {
      "analyzer": "english",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #分词结果:running转为run,Quick转为quick,brown-foxes 转为brown、fox,in、the过滤等等
    {
      "tokens" : [
        {
          "token" : "2",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<NUM>",
          "position" : 0
        },
        {
          "token" : "run",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "quick",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "brown",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "fox",
          "start_offset" : 22,
          "end_offset" : 27,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "leap",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "<ALPHANUM>",
          "position" : 5
        },
        {
          "token" : "over",
          "start_offset" : 33,
          "end_offset" : 37,
          "type" : "<ALPHANUM>",
          "position" : 6
        },
        {
          "token" : "lazi",
          "start_offset" : 38,
          "end_offset" : 42,
          "type" : "<ALPHANUM>",
          "position" : 7
        },
        {
          "token" : "dog",
          "start_offset" : 43,
          "end_offset" : 47,
          "type" : "<ALPHANUM>",
          "position" : 8
        },
        {
          "token" : "summer",
          "start_offset" : 55,
          "end_offset" : 61,
          "type" : "<ALPHANUM>",
          "position" : 11
        },
        {
          "token" : "even",
          "start_offset" : 62,
          "end_offset" : 69,
          "type" : "<ALPHANUM>",
          "position" : 12
        }
      ]
    }
    
    • Custom Analyzer:自定义分词器。
    #需要安装analysis-icu插件
    POST _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "他说的确实在理”"
    }
    
    #返回结果
    {
      "tokens" : [
        {
          "token" : "他",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<IDEOGRAPHIC>",
          "position" : 0
        },
        {
          "token" : "说的",
          "start_offset" : 1,
          "end_offset" : 3,
          "type" : "<IDEOGRAPHIC>",
          "position" : 1
        },
        {
          "token" : "确实",
          "start_offset" : 3,
          "end_offset" : 5,
          "type" : "<IDEOGRAPHIC>",
          "position" : 2
        },
        {
          "token" : "在",
          "start_offset" : 5,
          "end_offset" : 6,
          "type" : "<IDEOGRAPHIC>",
          "position" : 3
        },
        {
          "token" : "理",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "<IDEOGRAPHIC>",
          "position" : 4
        }
      ]
    }
    

    中文分词比较:

    POST _analyze
    {
      "analyzer": "standard",
      "text": "他说的确实在理”"
    }
    
    #返回结果
    {
      "tokens" : [
        {
          "token" : "他",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<IDEOGRAPHIC>",
          "position" : 0
        },
        {
          "token" : "说",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "<IDEOGRAPHIC>",
          "position" : 1
        },
        {
          "token" : "的",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "<IDEOGRAPHIC>",
          "position" : 2
        },
        {
          "token" : "确",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "<IDEOGRAPHIC>",
          "position" : 3
        },
        {
          "token" : "实",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "<IDEOGRAPHIC>",
          "position" : 4
        },
        {
          "token" : "在",
          "start_offset" : 5,
          "end_offset" : 6,
          "type" : "<IDEOGRAPHIC>",
          "position" : 5
        },
        {
          "token" : "理",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "<IDEOGRAPHIC>",
          "position" : 6
        }
      ]
    }
    


    作者:牧汜
    出处:http://www.cnblogs.com/czbxdd/
    本文版权归作者和博客园共有,欢迎转载,但未经作者同意必须保留此段声明,且在文章页面明显位置给出原文连接,否则保留追究法律责任的权利。

  • 相关阅读:
    使用SilverLight构建插件式应用程序(九) —聊天插件客户端的实现
    .NET 访问JAVA的WebService使用SOAP头
    使用SilverLight构建插件式应用程序(七)
    管理类软件的界面模板。
    使用SilverLight构建插件式应用程序(六)
    使用SilverLight开发ARPG游戏(一)
    AS 学习笔记 元件和代码的绑定
    scaleform 学习笔记1
    cocos2dx 获取图片的某像素点的RGBA颜色
    接触的第二个引擎 scaleform
  • 原文地址:https://www.cnblogs.com/czbxdd/p/11652897.html
Copyright © 2020-2023  润新知