Chinese Word Segmentation in Elasticsearch (IK)

IK is a popular Chinese analyzer for Elasticsearch. This post briefly walks through installing it and testing it.

1. Elasticsearch's built-in analyzer

The built-in analyzer handles Chinese poorly. You can try it yourself:

[zsz@VS-zsz ~]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=standard' -d '岁月如梭'
{
    "tokens": [
        {
            "token": "岁",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "月",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "如",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "梭",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        }
    ]
}

[zsz@VS-zsz ~]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=standard' -d 'i am an enginner'
{
    "tokens": [
        {
            "token": "i",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "am",
            "start_offset": 2,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "an",
            "start_offset": 5,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "enginner",
            "start_offset": 8,
            "end_offset": 16,
            "type": "<ALPHANUM>",
            "position": 3
        }
    ]
}

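As the output shows, the standard analyzer splits the English sentence into words but breaks the Chinese phrase into single characters, so Elasticsearch's built-in analysis is a poor fit for Chinese.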
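
To compare analyzer outputs without reading the full JSON, the token strings can be pulled out of an `_analyze` response. The following is a minimal Python sketch, assuming the response has the `tokens` / `token` shape shown above. The function name `token_texts` is hypothetical, and the sample is a compacted copy of the standard-analyzer output for '岁月如梭'.

```python
import json


def token_texts(response_body):
    """Return the token strings from an _analyze JSON response, in order.

    Assumes the body is a JSON object with a "tokens" list whose entries
    each carry a "token" field, as in the output shown in this post.
    """
    parsed = json.loads(response_body)
    return [entry["token"] for entry in parsed.get("tokens", [])]


# Compacted copy of the standard-analyzer response for '岁月如梭'.
sample = """
{"tokens": [
  {"token": "岁", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
  {"token": "月", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
  {"token": "如", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
  {"token": "梭", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3}
]}
"""

print(token_texts(sample))
```

Run against the saved output of the curl commands above, this makes the one-character-per-token behaviour visible at a glance.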

2. The IK Chinese analyzer

2.1. If you downloaded IK as source code, you have to build the plugin yourself, so first install Maven on Linux

wget http://mirrors.cnnic.cn/apache/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz

tar zxvf apache-maven-3.0.5-bin.tar.gz

mv apache-maven-3.0.5 /usr/local/apache-maven-3.0.5

vi /etc/profile

Add:

export MAVEN_HOME=/usr/local/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin

source /etc/profile

mvn -v

2.2. Build the source to produce the contents of the target/ directory

mvn clean package

Deploy the packaged IK plugin to ES:

[zsz@VS-zsz ~]$ cd /home/zsz/elasticsearch-analysis-ik-1.10.0/target/releases/

[zsz@VS-zsz releases]$ mkdir /usr/local/elasticsearch-2.4.0/plugins/ik/

[zsz@VS-zsz releases]$ cp elasticsearch-analysis-ik-1.10.0.zip /usr/local/elasticsearch-2.4.0/plugins/ik/elasticsearch-analysis-ik-1.10.0.zip

[zsz@VS-zsz releases]$ unzip /usr/local/elasticsearch-2.4.0/plugins/ik/elasticsearch-analysis-ik-1.10.0.zip -d /usr/local/elasticsearch-2.4.0/plugins/ik/

[zsz@VS-zsz releases]$ cd /usr/local/elasticsearch-2.4.0/plugins/ik/

[zsz@VS-zsz ik]$ rm elasticsearch-analysis-ik-1.10.0.zip

[zsz@VS-zsz ik]$ mkdir /usr/local/elasticsearch-2.4.0/config/ik

Copy IK's configuration into the Elasticsearch configuration directory:

[zsz@VS-zsz ik]$ cp -r /home/zsz/elasticsearch-analysis-ik-1.10.0/config /usr/local/elasticsearch-2.4.0/config/ik

Edit the Elasticsearch configuration:

[zsz@VS-zsz ik]$ vi /usr/local/elasticsearch-2.4.0/config/elasticsearch.yml

Append the analyzer configuration at the end of the file:

index.analysis.analyzer.ik.type : "ik"

Start Elasticsearch:

[zsz@VS-zsz ik]$ cd /usr/local/elasticsearch-2.4.0/

[zsz@VS-zsz elasticsearch-2.4.0]$ ./bin/elasticsearch -d

Test the IK analyzer:

[zsz@VS-zsz elasticsearch-2.4.0]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=ik' -d '岁月如梭'

{
    "tokens": [
        {
            "token": "岁月如梭",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "岁月",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "如梭",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "梭",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        }
    ]
}

[zsz@VS-zsz config]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=ik' -d 'elasticsearch很受欢迎的的一款拥有活跃社区开源的搜索解决方案'

{
    "tokens": [
        {
            "token": "elasticsearch",
            "start_offset": 0,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "elastic",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "很受",
            "start_offset": 13,
            "end_offset": 15,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "受欢迎",
            "start_offset": 14,
            "end_offset": 17,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "欢迎",
            "start_offset": 15,
            "end_offset": 17,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "一款",
            "start_offset": 19,
            "end_offset": 21,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "一",
            "start_offset": 19,
            "end_offset": 20,
            "type": "TYPE_CNUM",
            "position": 6
        },
        {
            "token": "款",
            "start_offset": 20,
            "end_offset": 21,
            "type": "COUNT",
            "position": 7
        },
        {
            "token": "拥有",
            "start_offset": 21,
            "end_offset": 23,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "拥",
            "start_offset": 21,
            "end_offset": 22,
            "type": "CN_WORD",
            "position": 9
        },
        {
            "token": "有",
            "start_offset": 22,
            "end_offset": 23,
            "type": "CN_CHAR",
            "position": 10
        },
        {
            "token": "活跃",
            "start_offset": 23,
            "end_offset": 25,
            "type": "CN_WORD",
            "position": 11
        },
        {
            "token": "跃",
            "start_offset": 24,
            "end_offset": 25,
            "type": "CN_WORD",
            "position": 12
        },
        {
            "token": "社区",
            "start_offset": 25,
            "end_offset": 27,
            "type": "CN_WORD",
            "position": 13
        },
        {
            "token": "开源",
            "start_offset": 27,
            "end_offset": 29,
            "type": "CN_WORD",
            "position": 14
        },
        {
            "token": "搜索",
            "start_offset": 30,
            "end_offset": 32,
            "type": "CN_WORD",
            "position": 15
        },
        {
            "token": "索解",
            "start_offset": 31,
            "end_offset": 33,
            "type": "CN_WORD",
            "position": 16
        },
        {
            "token": "索",
            "start_offset": 31,
            "end_offset": 32,
            "type": "CN_WORD",
            "position": 17
        },
        {
            "token": "解决方案",
            "start_offset": 32,
            "end_offset": 36,
            "type": "CN_WORD",
            "position": 18
        },
        {
            "token": "解决",
            "start_offset": 32,
            "end_offset": 34,
            "type": "CN_WORD",
            "position": 19
        },
        {
            "token": "方案",
            "start_offset": 34,
            "end_offset": 36,
            "type": "CN_WORD",
            "position": 20
        }
    ]
}

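As you can see, the Chinese text is now segmented into dictionary words and phrases rather than single characters, which is far more sensible.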
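
The IK output above contains overlapping tokens (for example '解决方案', '解决' and '方案' all cover the same characters). To get a quick coarse-grained view, one option is to keep only the longest token at each position. The sketch below is a client-side post-processing step for illustration only, not how IK itself selects tokens. The function name `longest_cover` is hypothetical, and the sample tokens are copied from the tail of the second IK response above.

```python
def longest_cover(tokens):
    """Greedily keep the longest token at each position, skipping overlaps.

    `tokens` is a list of dicts with "token", "start_offset" and
    "end_offset" keys, as returned in the _analyze output above.
    """
    # Sort by start offset; among tokens starting together, longest first.
    ordered = sorted(
        tokens,
        key=lambda t: (t["start_offset"], -(t["end_offset"] - t["start_offset"])),
    )
    kept = []
    next_free = 0  # first character offset not yet covered by a kept token
    for tok in ordered:
        if tok["start_offset"] >= next_free:
            kept.append(tok["token"])
            next_free = tok["end_offset"]
    return kept


# Tokens copied from the end of the second IK response above.
ik_tokens = [
    {"token": "搜索", "start_offset": 30, "end_offset": 32},
    {"token": "索解", "start_offset": 31, "end_offset": 33},
    {"token": "索", "start_offset": 31, "end_offset": 32},
    {"token": "解决方案", "start_offset": 32, "end_offset": 36},
    {"token": "解决", "start_offset": 32, "end_offset": 34},
    {"token": "方案", "start_offset": 34, "end_offset": 36},
]

print(longest_cover(ik_tokens))  # ['搜索', '解决方案']
```

The overlapping tokens are useful for recall at index time, so this reduction is meant only as a reading aid when inspecting analyzer output.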

Article URL: http://www.cnblogs.com/zhongshengzhen/p/elasticsearch_ik.html

