• elasticsearch (v2.4.6): adding the IK Chinese analyzer



    1. References

    The IK plugin's GitHub documentation (https://github.com/medcl/elasticsearch-analysis-ik)

    Switching the Maven repository to the Aliyun mirror to speed up dependency downloads from within China (see the settings.xml sketch below)
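    If downloads from Maven Central are slow, you can route them through Aliyun's public mirror. A minimal sketch of the <mirrors> entry for ~/.m2/settings.xml (or $MAVEN_HOME/conf/settings.xml); the URL is Aliyun's public aggregate repository:

    <!-- ~/.m2/settings.xml, inside <settings><mirrors> ... </mirrors></settings> -->
    <mirror>
      <id>aliyunmaven</id>
      <mirrorOf>central</mirrorOf>
      <name>Aliyun public mirror</name>
      <url>https://maven.aliyun.com/repository/public</url>
    </mirror>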

    2. Building and installing analysis-ik

    2.1 Download the source

    git clone --depth 1 --branch v1.10.6 https://github.com/medcl/elasticsearch-analysis-ik.git

    Because IK v1.10.6 is the release that matches ES 2.4.6, we clone only that tag (--depth 1 --branch v1.10.6) rather than the full history. A quick way to double-check the tag is shown below.
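    To confirm the tag-to-version mapping, or to see which tags exist for other ES versions, plain git is enough (a small sketch using standard git commands):

    # list the tags published on the remote; each IK tag tracks an ES release
    git ls-remote --tags https://github.com/medcl/elasticsearch-analysis-ik.git
    
    # confirm which tag the shallow clone checked out (should print v1.10.6)
    git -C elasticsearch-analysis-ik describe --tags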

    2.2 Build

    (1) Download and install Maven

    # download the Maven binary distribution
    wget https://mirror.olnevhost.net/pub/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
    
    # create the install directory and extract into it
    mkdir /usr/local/maven
    
    tar -zxvf apache-maven-3.6.3-bin.tar.gz --directory /usr/local/maven
    
    # environment variables: append these lines to /etc/profile
    export JAVA_HOME=/home/java/jdk1.8.0_131
    MAVEN_HOME=/usr/local/maven/apache-maven-3.6.3
    export MAVEN_HOME
    
    export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin
    
    # reload the profile so the current shell sees the new variables
    source /etc/profile
    
    # verify the installation
    mvn --version
    
    

    (2) Build IK

    # build; the release zip ends up under target/releases/
    cd elasticsearch-analysis-ik/
    
    mvn package
    
    # copy the built zip into the ES plugins directory
    
    cd target/releases/
    
    mkdir -p /home/elastic/elasticsearch-2.4.6/plugins/ik/
    
    cp elasticsearch-analysis-ik-1.10.6.zip /home/elastic/elasticsearch-2.4.6/plugins/ik/
    
    # unzip inside the plugin's own directory
    cd /home/elastic/elasticsearch-2.4.6/plugins/ik/
    
    unzip elasticsearch-analysis-ik-1.10.6.zip
    
    # (optional) remove the zip after extracting
    rm elasticsearch-analysis-ik-1.10.6.zip
    
    

    2.3 Restart the ES service so the new plugin is loaded (one way to do this is sketched below)
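    How to restart depends on how the node was started. A minimal sketch, assuming ES 2.4.6 was unpacked under /home/elastic and started directly from the tarball (adapt this if you run ES under a service manager):

    # stop the running node; the pattern matches the ES 2.x main class
    pkill -f org.elasticsearch.bootstrap.Elasticsearch
    
    # start it again in daemon mode
    /home/elastic/elasticsearch-2.4.6/bin/elasticsearch -d
    
    # confirm the ik plugin was loaded
    curl http://127.0.0.1:9200/_cat/plugins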

    3. Testing IK tokenization

    3.1 Built-in tokenization of Chinese

    # request
    
    GET http://127.0.0.1:9200/_analyze
    {
    	"text": "正是江南好风景"
    }
    
    # response
    {
      "tokens": [
        {
          "token": "正",
          "start_offset": 0,
          "end_offset": 1,
          "type": "<IDEOGRAPHIC>",
          "position": 0
        },
        {
          "token": "是",
          "start_offset": 1,
          "end_offset": 2,
          "type": "<IDEOGRAPHIC>",
          "position": 1
        },
        {
          "token": "江",
          "start_offset": 2,
          "end_offset": 3,
          "type": "<IDEOGRAPHIC>",
          "position": 2
        },
        {
          "token": "南",
          "start_offset": 3,
          "end_offset": 4,
          "type": "<IDEOGRAPHIC>",
          "position": 3
        },
        {
          "token": "好",
          "start_offset": 4,
          "end_offset": 5,
          "type": "<IDEOGRAPHIC>",
          "position": 4
        },
        {
          "token": "风",
          "start_offset": 5,
          "end_offset": 6,
          "type": "<IDEOGRAPHIC>",
          "position": 5
        },
        {
          "token": "景",
          "start_offset": 6,
          "end_offset": 7,
          "type": "<IDEOGRAPHIC>",
          "position": 6
        }
      ]
    }
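
    The GET-with-a-body notation above comes from a REST client; from a shell, the same call can be issued with curl (a sketch, assuming the node listens on the default 127.0.0.1:9200). The same pattern covers the two ik requests below, by adding the "analyzer" field to the body:

    curl -s 'http://127.0.0.1:9200/_analyze' -d '{"text": "正是江南好风景"}'
    
    # with an explicit analyzer, e.g. for section 3.2
    curl -s 'http://127.0.0.1:9200/_analyze' -d '{"analyzer": "ik_max_word", "text": "正是江南好风景"}'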
    
    

    3.2 IK's ik_max_word analyzer

    # request
    
    GET http://127.0.0.1:9200/_analyze
    {
    	"analyzer": "ik_max_word",
    	"text": "正是江南好风景"
    }
    
    
    # response
    {
      "tokens": [
        {
          "token": "正是",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "江南",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "江",
          "start_offset": 2,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "南",
          "start_offset": 3,
          "end_offset": 4,
          "type": "CN_CHAR",
          "position": 3
        },
        {
          "token": "好",
          "start_offset": 4,
          "end_offset": 5,
          "type": "CN_CHAR",
          "position": 4
        },
        {
          "token": "风景",
          "start_offset": 5,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 5
        },
        {
          "token": "景",
          "start_offset": 6,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 6
        }
      ]
    }
    
    

    3.3 IK's ik_smart analyzer

    # request
    
    GET http://127.0.0.1:9200/_analyze
    {
    	"analyzer": "ik_smart",
    	"text": "正是江南好风景"
    }
    
    # response
    {
      "tokens": [
        {
          "token": "正是",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "江南",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "好",
          "start_offset": 4,
          "end_offset": 5,
          "type": "CN_CHAR",
          "position": 2
        },
        {
          "token": "风景",
          "start_offset": 5,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 3
        }
      ]
    }
    
    

    3.4 Comparing the results

    (1) The built-in analyzer splits Chinese text into individual characters, which is unsuitable for most real search scenarios.

    (2) ik_max_word produces the finest-grained segmentation, emitting every plausible word it finds (note the overlapping tokens above); ik_smart does the opposite and produces the coarsest-grained segmentation. A common practice is to index with ik_max_word and search with ik_smart, as sketched below.
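    A sketch of that setup for ES 2.4 (the index name demo and type article are placeholders; note that 2.x string fields use type "string" rather than "text"):

    curl -XPUT 'http://127.0.0.1:9200/demo' -d '{
      "mappings": {
        "article": {
          "properties": {
            "content": {
              "type": "string",
              "analyzer": "ik_max_word",
              "search_analyzer": "ik_smart"
            }
          }
        }
      }
    }'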
