• Tokenization (analyzers) in Elasticsearch



    I. Overview

    1. Analyzer

      A tool that takes a piece of text entered by the user and, following certain rules, breaks it into individual terms.

    2. Built-in analyzers

      standard analyzer

      simple analyzer

      whitespace analyzer

      stop analyzer

      language analyzer

      pattern analyzer

    II. Testing the built-in analyzers

    1.standard analyzer

      The standard analyzer is the default analyzer; it is used whenever no analyzer is specified.

    POST /_analyze
    {
      "analyzer": "standard",
      "text": "The best 3-points shooter is Curry!"
    }
    

      Result:

      Look at the token, start_offset, end_offset, type and position fields. Note that the '-' is gone.

    {
      "tokens" : [
        {
          "token" : "the",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "best",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "3",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "<NUM>",
          "position" : 2
        },
        {
          "token" : "points",
          "start_offset" : 11,
          "end_offset" : 17,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "shooter",
          "start_offset" : 18,
          "end_offset" : 25,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "is",
          "start_offset" : 26,
          "end_offset" : 28,
          "type" : "<ALPHANUM>",
          "position" : 5
        },
        {
          "token" : "curry",
          "start_offset" : 29,
          "end_offset" : 34,
          "type" : "<ALPHANUM>",
          "position" : 6
        }
      ]
    }
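
      The standard analyzer can also be configured, for example with a maximum token length; tokens longer than the limit are split at that length. A minimal sketch (the index name standard_example and the length value are only illustrative):

    PUT /standard_example
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_standard": {
              "type": "standard",
              "max_token_length": 5
            }
          }
        }
      }
    }

    POST /standard_example/_analyze
    {
      "analyzer": "my_standard",
      "text": "The best 3-points shooter is Curry!"
    }

      With max_token_length set to 5, a longer word such as shooter should come back split into shoot and er.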
    

      

    2.simple analyzer

      The simple analyzer breaks the text into terms whenever it encounters a character that is not a letter, and every term is lowercased.

    POST /_analyze
    {
      "analyzer": "simple",
      "text": "The best 3-points shooter is Curry!"
    }
    

      Result:

      This time the '3' and the '-' are both gone.

    {
      "tokens" : [
        {
          "token" : "the",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "best",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "points",
          "start_offset" : 11,
          "end_offset" : 17,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "shooter",
          "start_offset" : 18,
          "end_offset" : 25,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "is",
          "start_offset" : 26,
          "end_offset" : 28,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "curry",
          "start_offset" : 29,
          "end_offset" : 34,
          "type" : "word",
          "position" : 5
        }
      ]
    }
    

      Note: any character that is not a letter is removed.

    3.whitespace analyzer

      The whitespace analyzer breaks the text into terms whenever it encounters a whitespace character.

    POST /_analyze
    {
      "analyzer": "whitespace",
      "text": "The best 3-points shooter is Curry!"
    }
    

      Result:

    {
      "tokens" : [
        {
          "token" : "The",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "best",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "3-points",
          "start_offset" : 9,
          "end_offset" : 17,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "shooter",
          "start_offset" : 18,
          "end_offset" : 25,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "is",
          "start_offset" : 26,
          "end_offset" : 28,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "Curry!",
          "start_offset" : 29,
          "end_offset" : 35,
          "type" : "word",
          "position" : 5
        }
      ]
    }
    

      Note: the text is only split on whitespace; the terms are not lowercased.

    4.stop analyzer

      The stop analyzer is very similar to the simple analyzer; the only difference is that it also removes stop words, using the english stop-word list by default.

      stopwords is the predefined list of stop words, for example (the, a, an, this, of, at) and so on.

    POST /_analyze
    {
      "analyzer": "stop",
      "text": "The best 3-points shooter is Curry!"
    }
    

      Result:

    {
      "tokens" : [
        {
          "token" : "best",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "points",
          "start_offset" : 11,
          "end_offset" : 17,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "shooter",
          "start_offset" : 18,
          "end_offset" : 25,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "curry",
          "start_offset" : 29,
          "end_offset" : 34,
          "type" : "word",
          "position" : 5
        }
      ]
    }
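
      The english stop-word list used by default can be replaced with a custom one. A minimal sketch of a stop analyzer with its own word list (the index name stop_example and the word list are only illustrative):

    PUT /stop_example
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_stop_analyzer": {
              "type": "stop",
              "stopwords": ["the", "is", "a", "an"]
            }
          }
        }
      }
    }

    POST /stop_example/_analyze
    {
      "analyzer": "my_stop_analyzer",
      "text": "The best 3-points shooter is Curry!"
    }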
    

      

    5.language analyzer

      Analyzers for specific languages, for example english, the English analyzer.

      Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

    POST /_analyze
    {
      "analyzer": "english",
      "text": "The best 3-points shooter is Curry!"
    }
    

      Result:

    {
      "tokens" : [
        {
          "token" : "best",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "3",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "<NUM>",
          "position" : 2
        },
        {
          "token" : "point",
          "start_offset" : 11,
          "end_offset" : 17,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "shooter",
          "start_offset" : 18,
          "end_offset" : 25,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "curri",
          "start_offset" : 29,
          "end_offset" : 34,
          "type" : "<ALPHANUM>",
          "position" : 6
        }
      ]
    }
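
      Note that the language analyzers also stem the terms: points became point and Curry became curri, and the English stop words were removed. If certain words must not be stemmed, the stem_exclusion parameter can be set. A minimal sketch (the index name english_example and the excluded word are only illustrative):

    PUT /english_example
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_english": {
              "type": "english",
              "stem_exclusion": ["curry"]
            }
          }
        }
      }
    }

      With this setting, curry should be kept as curry instead of the stemmed form curri.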
    

      

    6.pattern analyzer

      Splits the text into terms using a regular expression; the default pattern is \W+ (any non-word character).

    POST /_analyze
    {
      "analyzer": "pattern",
      "text": "The best 3-points shooter is Curry!"
    }

      Result:

    {
      "tokens" : [
        {
          "token" : "the",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "best",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "3",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "points",
          "start_offset" : 11,
          "end_offset" : 17,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "shooter",
          "start_offset" : 18,
          "end_offset" : 25,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "is",
          "start_offset" : 26,
          "end_offset" : 28,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "curry",
          "start_offset" : 29,
          "end_offset" : 34,
          "type" : "word",
          "position" : 6
        }
      ]
    }
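
      The regular expression can be customized. A minimal sketch of a pattern analyzer that splits on commas (the index name pattern_example, the analyzer name and the sample text are only illustrative):

    PUT /pattern_example
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "comma_analyzer": {
              "type": "pattern",
              "pattern": ",",
              "lowercase": true
            }
          }
        }
      }
    }

    POST /pattern_example/_analyze
    {
      "analyzer": "comma_analyzer",
      "text": "Curry,Thompson,Green"
    }

      This should produce the terms curry, thompson and green.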
    

      

    III. Using an analyzer in practice

    1. Create an index

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "type": "whitespace"
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "name": {
            "type": "text"
          },
          "team_name": {
            "type": "text"
          },
          "position": {
            "type": "text"
          },
          "play_year": {
            "type": "long"
          },
          "jerse_no": {
            "type": "keyword"
          },
          "title": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }
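
      The test search in the next step assumes that a document has already been indexed. A minimal sketch that indexes one sample document (the field values match the hit shown in the search result below):

    PUT /my_index/_doc/1
    {
      "name": "库里",
      "team_name": "勇士",
      "position": "控球后卫",
      "play_year": 10,
      "jerse_no": "30",
      "title": "The best 3-points shooter is Curry!"
    }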
    

      

    2. Run a test search

    GET /my_index/_search
    {
      "query": {
        "match": {
          "title": "Curry!"
        }
      }
    }
    

      Result:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 0.2876821,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 0.2876821,
            "_source" : {
              "name" : "库⾥",
              "team_name" : "勇⼠",
              "position" : "控球后卫",
              "play_year" : 10,
              "jerse_no" : "30",
              "title" : "The best 3-points shooter is Curry!"
            }
          }
        ]
      }
    }
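
      Because the title field uses the whitespace analyzer, the last token was indexed as Curry! including the exclamation mark, and the match query analyzes its query text with the same analyzer, so searching for "Curry!" finds the document. A query for plain "Curry" should therefore return no hits:

    GET /my_index/_search
    {
      "query": {
        "match": {
          "title": "Curry"
        }
      }
    }

      With the default standard analyzer on title the behaviour would be the opposite: the "!" would have been stripped at index time, so a query for plain "Curry" would match.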
    

      
