• Elasticsearch Chit-Chat (8): Analysis, Chinese Word Segmentation, and the ik Plugin


    Let's start with the standard tokenizer (standard), configured as follows:

    curl -XPUT localhost:9200/local -d '{
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "stem" : {
                        "tokenizer" : "standard",
                        "filter" : ["standard", "lowercase", "stop", "porter_stem"]
                    }
                }
            }
        },
        "mappings" : {
            "article" : {
                "dynamic" : true,
                "properties" : {
                    "title" : {
                        "type" : "string",
                        "analyzer" : "stem"
                    }
                }
            }
        }
    }'
    

    index: local

    type: article

    analyzer: stem (filters: lowercase, stop words, Porter stemming; applied to the title field, not as the index-wide default)

    field: title

    Test:

    # Sample analysis (the request body is analyzed as raw text, braces included)
    curl -XGET 'localhost:9200/local/_analyze?analyzer=stem' -d '{Fight for your life}'
    curl -XGET 'localhost:9200/local/_analyze?analyzer=stem' -d '{Bruno fights Tyson tomorrow}'
    
    # Index data
    curl -XPUT localhost:9200/local/article/1 -d '{"title": "Fight for your life"}'
    curl -XPUT localhost:9200/local/article/2 -d '{"title": "Fighting for your life"}'
    curl -XPUT localhost:9200/local/article/3 -d '{"title": "My dad fought a dog"}'
    curl -XPUT localhost:9200/local/article/4 -d '{"title": "Bruno fights Tyson tomorrow"}'
    
    # Search on the title field, which is stemmed at both index and search time
    curl -XGET 'localhost:9200/local/_search?q=title:fight'
    
    # Searching on _all does no stemming unless _all is also mapped with a stemming analyzer
    curl -XGET 'localhost:9200/local/_search?q=fight'
    

    For example, the text

    Fight for your life

    is tokenized as follows:

    {"tokens":[
    {"token":"fight","start_offset":1,"end_offset":6,"type":"<ALPHANUM>","position":1},
    {"token":"your","start_offset":11,"end_offset":15,"type":"<ALPHANUM>","position":3},
    {"token":"life","start_offset":16,"end_offset":20,"type":"<ALPHANUM>","position":4} ]}
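
    Two details of this output are easy to miss: "for" is gone because the stop filter removed it, which leaves a gap at position 2, and the first start_offset is 1 because the opening brace of the request body occupies offset 0. The Python sketch below mimics that behaviour with a regular expression; it is only an illustration, not how Lucene implements it. STOP_WORDS is Lucene's default English stop list as far as we recall it (an assumption), and stemming is left out because, per the output above, these three words come out unchanged.

```python
import re

# Assumed copy of Lucene's default English stop words.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def analyze(text):
    """Toy stand-in for tokenizer + lowercase + stop filters (no stemming)."""
    tokens = []
    position = 0
    for match in re.finditer(r"\w+", text):
        position += 1  # every word takes a position, even if dropped below
        token = match.group().lower()
        if token in STOP_WORDS:
            continue  # a removed stop word leaves a gap in the positions
        tokens.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for tok in analyze("{Fight for your life}"):
    print(tok)
```

    Running it on the same raw body prints fight, your, and life with positions 1, 3, and 4 and the same offsets as the response above.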

      


    Deploying the ik analyzer:

    1) Copy the ik analyzer plugin for Elasticsearch into ./plugins/analyzerIK/

    2) Add the following to elasticsearch.yml:

    index.analysis.analyzer.ik.type : "ik"

    3) Add the directory ./config/ik under config, containing these files:

    IKAnalyzer.cfg.xml

    main.dic

    quantifier.dic

    ext.dic

    stopword.dic
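
    Before restarting the node, it can help to confirm that all five files from step 3 are in place. The helper below is a minimal sketch; missing_ik_files is a hypothetical name, and the es_home argument assumes the standard layout where config/ik sits under the Elasticsearch home directory.

```python
from pathlib import Path

# File names taken from step 3 above.
REQUIRED_FILES = [
    "IKAnalyzer.cfg.xml",
    "main.dic",
    "quantifier.dic",
    "ext.dic",
    "stopword.dic",
]


def missing_ik_files(es_home):
    """Return the required ik files that are absent from <es_home>/config/ik."""
    ik_dir = Path(es_home) / "config" / "ik"
    return [name for name in REQUIRED_FILES if not (ik_dir / name).is_file()]


if __name__ == "__main__":
    # "." assumes the script is run from the Elasticsearch home directory.
    print(missing_ik_files("."))
```

    An empty list means every file was found; otherwise the list names what still needs to be copied.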

    Delete the index created earlier and recreate it with the following configuration:

    curl -XPUT localhost:9200/local -d '{
        "settings" : {
            "analysis" : {
                "analyzer" : {
                    "ik" : {
                        "tokenizer" : "ik"
                    }
                }
            }
        },
        "mappings" : {
            "article" : {
                "dynamic" : true,
                "properties" : {
                    "title" : {
                        "type" : "string",
                        "analyzer" : "ik"
                    }
                }
            }
        }
    }'
    

      

    Test:

    curl 'http://localhost:9200/local/_analyze?analyzer=ik&pretty=true' -d'  
    {  
        "text":"中华人民共和国国歌"  
    }  
    '  
    {
      "tokens" : [ {
        "token" : "text",
        "start_offset" : 12,
        "end_offset" : 16,
        "type" : "ENGLISH",
        "position" : 1
      }, {
        "token" : "中华人民共和国",
        "start_offset" : 19,
        "end_offset" : 26,
        "type" : "CN_WORD",
        "position" : 2
      }, {
        "token" : "国歌",
        "start_offset" : 26,
        "end_offset" : 28,
        "type" : "CN_WORD",
        "position" : 3
      } ]
    }
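
    The stray "text" token suggests that this Elasticsearch version analyzes the whole request body as raw text instead of parsing it as JSON, so the key name is tokenized too and every offset counts from the first character after -d'. The short Python check below reproduces the offsets, assuming the body is exactly as typed in the command above (two trailing spaces per line, four spaces before "text"); the offsets helper is a hypothetical name used only for this illustration.

```python
# Request body as it appears after -d' in the command above (assumed exact).
body = '  \n{  \n    "text":"中华人民共和国国歌"  \n}  \n'


def offsets(needle):
    """Return (start_offset, end_offset) of the first match in the raw body."""
    start = body.index(needle)
    return start, start + len(needle)


print(offsets("text"))            # (12, 16)
print(offsets("中华人民共和国"))  # (19, 26)
print(offsets("国歌"))            # (26, 28)
```

    To analyze only the sentence itself, send it as the plain body without the JSON wrapper.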
    

      

     ---------------------------------------

    If we want the finest-grained segmentation results, add the following to elasticsearch.yml:

    index:
      analysis:
        analyzer:
          ik:
              alias: [ik_analyzer]
              type: org.elasticsearch.index.analysis.IkAnalyzerProvider
          ik_smart:
              type: ik
              use_smart: true
          ik_max_word:
              type: ik
              use_smart: false
    

      

    Test:

    curl 'http://localhost:9200/local/_analyze?analyzer=ik_max_word&pretty=true' -d'  
    {  
        "text":"中华人民共和国国歌"  
    }  
    '  
    {
      "tokens" : [ {
        "token" : "text",
        "start_offset" : 12,
        "end_offset" : 16,
        "type" : "ENGLISH",
        "position" : 1
      }, {
        "token" : "中华人民共和国",
        "start_offset" : 19,
        "end_offset" : 26,
        "type" : "CN_WORD",
        "position" : 2
      }, {
        "token" : "中华人民",
        "start_offset" : 19,
        "end_offset" : 23,
        "type" : "CN_WORD",
        "position" : 3
      }, {
        "token" : "中华",
        "start_offset" : 19,
        "end_offset" : 21,
        "type" : "CN_WORD",
        "position" : 4
      }, {
        "token" : "华人",
        "start_offset" : 20,
        "end_offset" : 22,
        "type" : "CN_WORD",
        "position" : 5
      }, {
        "token" : "人民共和国",
        "start_offset" : 21,
        "end_offset" : 26,
        "type" : "CN_WORD",
        "position" : 6
      }, {
        "token" : "人民",
        "start_offset" : 21,
        "end_offset" : 23,
        "type" : "CN_WORD",
        "position" : 7
      }, {
        "token" : "共和国",
        "start_offset" : 23,
        "end_offset" : 26,
        "type" : "CN_WORD",
        "position" : 8
      }, {
        "token" : "共和",
        "start_offset" : 23,
        "end_offset" : 25,
        "type" : "CN_WORD",
        "position" : 9
      }, {
        "token" : "国",
        "start_offset" : 25,
        "end_offset" : 26,
        "type" : "CN_CHAR",
        "position" : 10
      }, {
        "token" : "国歌",
        "start_offset" : 26,
        "end_offset" : 28,
        "type" : "CN_WORD",
        "position" : 11
      } ]
    }
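
    The difference between the coarse result (two words) and the fine-grained one can be pictured with a toy dictionary matcher. This is a simplified sketch, not the plugin's algorithm: DICTIONARY holds only the multi-character words seen in the outputs above (the real plugin loads main.dic and the other dictionary files), smart is plain greedy forward maximum matching where ik uses a more elaborate disambiguation step, and the single-character token 国 that ik_max_word emits is not modelled. Offsets here are relative to the sentence rather than to the request body.

```python
# Toy dictionary (assumption): the multi-character words from the outputs above.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def max_word(text):
    """Emit every dictionary word in text, by start offset, longest first."""
    found = []
    for start in range(len(text)):
        longest_end = min(len(text), start + MAX_WORD_LEN)
        for end in range(longest_end, start, -1):
            if text[start:end] in DICTIONARY:
                found.append((text[start:end], start, end))
    return found


def smart(text):
    """Greedy forward maximum matching: take the longest word, then skip past it."""
    found = []
    start = 0
    while start < len(text):
        longest_end = min(len(text), start + MAX_WORD_LEN)
        for end in range(longest_end, start, -1):
            if text[start:end] in DICTIONARY:
                break
        else:
            end = start + 1  # unknown character: emit it on its own
        found.append((text[start:end], start, end))
        start = end
    return found


text = "中华人民共和国国歌"
print([word for word, _, _ in smart(text)])
print([word for word, _, _ in max_word(text)])
```

    With this dictionary, smart returns the two words 中华人民共和国 and 国歌, while max_word returns nine overlapping words in the same order as the ik_max_word response, minus the single 国.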
    

      

  • Original article: https://www.cnblogs.com/huangfox/p/3629286.html