• Chinese Word Segmentation in ElasticSearch (IK)


    IK is one of the most popular Chinese analyzers for ElasticSearch. This post briefly walks through installing and testing it.
     

    1. ElasticSearch's built-in analyzer


    The built-in analyzer handles Chinese poorly. You can try it yourself:
    [zsz@VS-zsz ~]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=standard' -d '岁月如梭'
    {
        "tokens": [
            {
                "token": "岁",
                "start_offset": 0,
                "end_offset": 1,
                "type": "<IDEOGRAPHIC>",
                "position": 0
            },
            {
                "token": "月",
                "start_offset": 1,
                "end_offset": 2,
                "type": "<IDEOGRAPHIC>",
                "position": 1
            },
            {
                "token": "如",
                "start_offset": 2,
                "end_offset": 3,
                "type": "<IDEOGRAPHIC>",
                "position": 2
            },
            {
                "token": "梭",
                "start_offset": 3,
                "end_offset": 4,
                "type": "<IDEOGRAPHIC>",
                "position": 3
            }
        ]
    }
    [zsz@VS-zsz ~]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=standard' -d 'i am an enginner'
    {
        "tokens": [
            {
                "token": "i",
                "start_offset": 0,
                "end_offset": 1,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "am",
                "start_offset": 2,
                "end_offset": 4,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "an",
                "start_offset": 5,
                "end_offset": 7,
                "type": "<ALPHANUM>",
                "position": 2
            },
            {
                "token": "enginner",
                "start_offset": 8,
                "end_offset": 16,
                "type": "<ALPHANUM>",
                "position": 3
            }
        ]
    }
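    As the output shows, the standard analyzer breaks Chinese text into single characters, whereas English is split into whole words. ES's built-in Chinese segmentation is therefore of little use for search.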
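    The pretty-printed responses are long, so it can help to reduce them to a plain token list when comparing analyzers. The sketch below is one way to do that with only grep and sed. The function name `extract_tokens` is made up for this example, and the sample JSON is trimmed from the response above rather than fetched live. A real JSON parser would be more robust if the token text itself can contain double quotes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one token per line from an _analyze response on stdin.
# It relies on the "token": "..." key/value shape seen in the responses above.
extract_tokens() {
    grep -o '"token": *"[^"]*"' \
        | sed -e 's/^"token": *"//' -e 's/"$//'
}

# Sample input trimmed from the standard-analyzer response (not a live request).
sample='{"tokens":[{"token": "岁","position": 0},{"token": "月","position": 1}]}'
printf '%s\n' "$sample" | extract_tokens
```

    Against a running node, the same function could be fed directly from the curl commands shown above by piping their output into it.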
     
    2. The IK Chinese analyzer
     
    2.1 If you downloaded IK as source code, you have to build the plugin yourself. First install Maven on Linux:
    tar zxvf apache-maven-3.0.5-bin.tar.gz
    mv apache-maven-3.0.5 /usr/local/apache-maven-3.0.5
    vi /etc/profile
    Append the following lines:
    export MAVEN_HOME=/usr/local/apache-maven-3.0.5
    export PATH=$PATH:$MAVEN_HOME/bin
     
    source /etc/profile 
    mvn -v
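    Editing /etc/profile by hand works, but if the setup is repeated or scripted, the same export lines can pile up. A minimal sketch of an idempotent alternative follows. The helper name `append_once` is an assumption of this example, and the demo writes to a scratch file so that the real /etc/profile is left untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a line to a file only if that exact line is not already there.
append_once() {
    local line="$1" file="$2"
    # -x: match the whole line, -F: treat the pattern as a fixed string, -q: no output
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demo on a scratch file instead of the real /etc/profile.
profile=$(mktemp)
append_once 'export MAVEN_HOME=/usr/local/apache-maven-3.0.5' "$profile"
append_once 'export PATH=$PATH:$MAVEN_HOME/bin' "$profile"
append_once 'export PATH=$PATH:$MAVEN_HOME/bin' "$profile"   # repeated call adds nothing
cat "$profile"
```

    Pointing the second argument at /etc/profile (as root) would apply the same two lines used in this post.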
    2.2 Build the source. The packaged plugin is written under target/ (the zip used below is in target/releases/):
     
    mvn clean package 
     
    Deploy the packaged IK plugin to ES:
    [zsz@VS-zsz ~]$ cd /home/zsz/elasticsearch-analysis-ik-1.10.0/target/releases/
    [zsz@VS-zsz releases]$ mkdir /usr/local/elasticsearch-2.4.0/plugins/ik/
    [zsz@VS-zsz releases]$ cp elasticsearch-analysis-ik-1.10.0.zip /usr/local/elasticsearch-2.4.0/plugins/ik/elasticsearch-analysis-ik-1.10.0.zip
    [zsz@VS-zsz releases]$ unzip /usr/local/elasticsearch-2.4.0/plugins/ik/elasticsearch-analysis-ik-1.10.0.zip -d /usr/local/elasticsearch-2.4.0/plugins/ik/
    [zsz@VS-zsz releases]$ cd /usr/local/elasticsearch-2.4.0/plugins/ik/
    [zsz@VS-zsz ik]$ rm elasticsearch-analysis-ik-1.10.0.zip
    [zsz@VS-zsz ik]$ mkdir /usr/local/elasticsearch-2.4.0/config/ik
     
    Copy IK's configuration files into the ElasticSearch config directory (`-r` is required because `config` is a directory):
    [zsz@VS-zsz ik]$ cp -r /home/zsz/elasticsearch-analysis-ik-1.10.0/config/* /usr/local/elasticsearch-2.4.0/config/ik/
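    Copying a directory with cp is easy to get subtly wrong: without `-r` it is skipped, and when the target already exists the source directory is nested inside it. The sketch below shows one way to copy only the contents. The function name `copy_config` and the placeholder file are assumptions for illustration and do not reflect IK's actual file layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the contents of a config directory into a target directory.
copy_config() {
    local src="$1" dest="$2"
    [ -d "$src" ] || { echo "missing source directory: $src" >&2; return 1; }
    mkdir -p "$dest"
    # "$src"/. copies everything inside src (dotfiles included) without nesting src itself.
    cp -r "$src"/. "$dest"/
}

# Demo with scratch directories; example.dic is a placeholder file name.
work=$(mktemp -d)
mkdir -p "$work/ik-src/config"
echo 'placeholder' > "$work/ik-src/config/example.dic"
copy_config "$work/ik-src/config" "$work/es/config/ik"
ls "$work/es/config/ik"
```

    With the paths from this post, the call would take the IK source `config` directory as the first argument and the ES `config/ik` directory as the second.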
     
    Edit the ElasticSearch configuration:
    [zsz@VS-zsz ik]$ vi /usr/local/elasticsearch-2.4.0/config/elasticsearch.yml
    Append the analyzer setting at the end of the file:
    index.analysis.analyzer.ik.type : "ik"
     
    Start ElasticSearch:
    [zsz@VS-zsz ik]$ cd  /usr/local/elasticsearch-2.4.0/
    [zsz@VS-zsz elasticsearch-2.4.0]$ ./bin/elasticsearch -d
     
    Test the IK analyzer:
    [zsz@VS-zsz elasticsearch-2.4.0]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=ik' -d '岁月如梭'
    {
        "tokens": [
            {
                "token": "岁月如梭",
                "start_offset": 0,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "岁月",
                "start_offset": 0,
                "end_offset": 2,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "如梭",
                "start_offset": 2,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "梭",
                "start_offset": 3,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 3
            }
        ]
    }
    [zsz@VS-zsz config]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=ik' -d 'elasticsearch很受欢迎的的一款拥有活跃社区开源的搜索解决方案'
    {
        "tokens": [
            {
                "token": "elasticsearch",
                "start_offset": 0,
                "end_offset": 13,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "elastic",
                "start_offset": 0,
                "end_offset": 7,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "很受",
                "start_offset": 13,
                "end_offset": 15,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "受欢迎",
                "start_offset": 14,
                "end_offset": 17,
                "type": "CN_WORD",
                "position": 3
            },
            {
                "token": "欢迎",
                "start_offset": 15,
                "end_offset": 17,
                "type": "CN_WORD",
                "position": 4
            },
            {
                "token": "一款",
                "start_offset": 19,
                "end_offset": 21,
                "type": "CN_WORD",
                "position": 5
            },
            {
                "token": "一",
                "start_offset": 19,
                "end_offset": 20,
                "type": "TYPE_CNUM",
                "position": 6
            },
            {
                "token": "款",
                "start_offset": 20,
                "end_offset": 21,
                "type": "COUNT",
                "position": 7
            },
            {
                "token": "拥有",
                "start_offset": 21,
                "end_offset": 23,
                "type": "CN_WORD",
                "position": 8
            },
            {
                "token": "拥",
                "start_offset": 21,
                "end_offset": 22,
                "type": "CN_WORD",
                "position": 9
            },
            {
                "token": "有",
                "start_offset": 22,
                "end_offset": 23,
                "type": "CN_CHAR",
                "position": 10
            },
            {
                "token": "活跃",
                "start_offset": 23,
                "end_offset": 25,
                "type": "CN_WORD",
                "position": 11
            },
            {
                "token": "跃",
                "start_offset": 24,
                "end_offset": 25,
                "type": "CN_WORD",
                "position": 12
            },
            {
                "token": "社区",
                "start_offset": 25,
                "end_offset": 27,
                "type": "CN_WORD",
                "position": 13
            },
            {
                "token": "开源",
                "start_offset": 27,
                "end_offset": 29,
                "type": "CN_WORD",
                "position": 14
            },
            {
                "token": "搜索",
                "start_offset": 30,
                "end_offset": 32,
                "type": "CN_WORD",
                "position": 15
            },
            {
                "token": "索解",
                "start_offset": 31,
                "end_offset": 33,
                "type": "CN_WORD",
                "position": 16
            },
            {
                "token": "索",
                "start_offset": 31,
                "end_offset": 32,
                "type": "CN_WORD",
                "position": 17
            },
            {
                "token": "解决方案",
                "start_offset": 32,
                "end_offset": 36,
                "type": "CN_WORD",
                "position": 18
            },
            {
                "token": "解决",
                "start_offset": 32,
                "end_offset": 34,
                "type": "CN_WORD",
                "position": 19
            },
            {
                "token": "方案",
                "start_offset": 34,
                "end_offset": 36,
                "type": "CN_WORD",
                "position": 20
            }
        ]
    }
     
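    As you can see, IK produces real words and phrases instead of isolated characters, so the Chinese segmentation is far more sensible.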
    Article URL: http://www.cnblogs.com/zhongshengzhen/p/elasticsearch_ik.html
     