• Elasticsearch的分词


    什么是分词

    分词就是指将一个文本转化成一系列单词的过程,也叫文本分析,在Elasticsearch中称之为Analysis。
    举例:我是中国人 --> 我/是/中国人

     结果:

    {
        "tokens": [
            {
                "token": "hello",
                "start_offset": 0,
                "end_offset": 5,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "world",
                "start_offset": 6,
                "end_offset": 11,
                "type": "<ALPHANUM>",
                "position": 1
            }
        ]
    }

    在结果中不仅可以看出分词的结果,还返回了该词在文本中的位置。

    中文分词
    中文分词的难点在于,在汉语中没有明显的词汇分界点,如在英语中,空格可以作为分隔符,如果分隔不正确就会造成歧义。
    如:
    我/爱/炒肉丝
    我/爱/炒/肉丝
    常用中文分词器,IK、jieba、THULAC等,推荐使用IK分词器。

    K Analyzer是一个开源的,基于java语言开发的轻量级的中文分词工具包。从2006年12月推出1.0版开始,IKAnalyzer已经推出了3个大版本。最初,它是以开源项目Luence为应用主体的,结合词典分词和文法分析算法的中文分词组件。新版本的IK Analyzer 3.0则发展为面向Java的公用分词组件,独立于Lucene项目,同时提供了对Lucene的默认优化实现。

    采用了特有的“正向迭代最细粒度切分算法“,具有80万字/秒的高速处理能力 采用了多子处理器分析模式,支持:英文字母(IP地址、Email、URL)、数字(日期,常用中文数量词,罗马数字,科学计数法),中文词汇(姓名、地名处理)等分词处理。 优化的词典存储,更小的内存占用

    IK分词器 Elasticsearch插件地址:https://github.com/medcl/elasticsearch-analysis-ik

    [root@dalianpai ~]# docker run -p 9200:9200 -d -v /root/ik:/usr/share/elasticsearch/plugins/ik --name elasticsearch 3fd2f723b598
    8378a1865408d30a279f9e057115cf4e68cfc4360fa2fe3866072ea9b820a27f
    [root@dalianpai ~]# docker ps -l
    CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                              NAMES
    8378a1865408        3fd2f723b598        "/docker-entrypoin..."   3 seconds ago       Up 2 seconds        0.0.0.0:9200->9200/tcp, 9300/tcp   elasticsearch

     结果:

    {
        "tokens": [
            {
                "token": "",
                "start_offset": 0,
                "end_offset": 1,
                "type": "CN_CHAR",
                "position": 0
            },
            {
                "token": "",
                "start_offset": 1,
                "end_offset": 2,
                "type": "CN_CHAR",
                "position": 1
            },
            {
                "token": "中国人",
                "start_offset": 2,
                "end_offset": 5,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "中国",
                "start_offset": 2,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 3
            },
            {
                "token": "国人",
                "start_offset": 3,
                "end_offset": 5,
                "type": "CN_WORD",
                "position": 4
            }
        ]
    }

    可以看到,已经对中文进行了分词。

    全文搜索
    全文搜索两个最重要的方面是:

    1. 相关性(Relevance) 它是评价查询与其结果间的相关程度,并根据这种相关程度对结果排名的一种能力,这种计算方式可以是 TF/IDF 方法、地理位置邻近、模糊相似,或其他的某些算法。
    2. 分词(Analysis) 它是将文本块转换为有区别的、规范化的 token 的一个过程,目的是为了创建倒排索引以及查询倒排索引。

    {
        "acknowledged": true,
        "shards_acknowledged": true,
        "index": "topcheer"
    }

    批量插入数据

     

     结果:

    {
        "took": 213,
        "errors": false,
        "items": [
            {
                "index": {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzovfLg3Eko2bZmO_B",
                    "_version": 1,
                    "result": "created",
                    "_shards": {
                        "total": 2,
                        "successful": 1,
                        "failed": 0
                    },
                    "created": true,
                    "status": 201
                }
            },
            {
                "index": {
                    "_index": "itcast",
                    "_type": "person",
                    "_id": "AXFzovfLg3Eko2bZmO_C",
                    "_version": 1,
                    "result": "created",
                    "_shards": {
                        "total": 2,
                        "successful": 1,
                        "failed": 0
                    },
                    "created": true,
                    "status": 201
                }
            },
            {
                "index": {
                    "_index": "itcast",
                    "_type": "person",
                    "_id": "AXFzovfLg3Eko2bZmO_D",
                    "_version": 1,
                    "result": "created",
                    "_shards": {
                        "total": 2,
                        "successful": 1,
                        "failed": 0
                    },
                    "created": true,
                    "status": 201
                }
            },
            {
                "index": {
                    "_index": "itcast",
                    "_type": "person",
                    "_id": "AXFzovfLg3Eko2bZmO_E",
                    "_version": 1,
                    "result": "created",
                    "_shards": {
                        "total": 2,
                        "successful": 1,
                        "failed": 0
                    },
                    "created": true,
                    "status": 201
                }
            }
        ]
    }

    单词搜索

     结果:

    {
        "took": 38,
        "timed_out": false,
        "_shards": {
            "total": 5,
            "successful": 5,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 1,
            "max_score": 1.3123269,
            "hits": [
                {
                    "_index": "itcast",
                    "_type": "person",
                    "_id": "AXFzovfLg3Eko2bZmO_D",
                    "_score": 1.3123269,
                    "_source": {
                        "name": "王五",
                        "age": 22,
                        "mail": "333@qq.com",
                        "hobby": "羽毛球、篮球、游泳、听音乐"
                    },
                    "highlight": {
                        "hobby": [
                            "羽毛球、篮球、游泳、听<em>音</em><em>乐</em>"
                        ]
                    }
                }
            ]
        }
    }

    过程说明:
    1. 检查字段类型
    爱好 hobby 字段是一个 text 类型( 指定了IK分词器),这意味着查询字符串本身也应该被分词。
    2. 分析查询字符串 。
    将查询的字符串 “音乐” 传入IK分词器中,输出的结果是单个项 音乐。因为只有一个单词项,所以 match 查询执行的是单个底层 term 查询。
    3. 查找匹配文档 。
    用 term 查询在倒排索引中查找 “音乐” 然后获取一组包含该项的文档

    4. 为每个文档评分 。
    用 term 查询计算每个文档相关度评分 _score ,这是种将 词频(term frequency,即词 “音乐” 在相关文档的hobby 字段中出现的频率)和 反向文档频率(inverse document frequency,即词 “音乐” 在所有文档的hobby 字段中出现的频率),以及字段的长度(即字段越短相关度越高)相结合的计算方式。

    多词搜索

     结果:

    {
        "took": 5,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 3,
            "max_score": 1.2632889,
            "hits": [
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_K",
                    "_score": 1.2632889,
                    "_source": {
                        "name": "王五",
                        "age": 22,
                        "mail": "333@qq.com",
                        "hobby": "羽毛球、篮球、游泳、听音乐"
                    },
                    "highlight": {
                        "hobby": [
                            "羽毛球、<em>篮球</em>、游泳、听<em>音乐</em>"
                        ]
                    }
                },
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_L",
                    "_score": 0.42327404,
                    "_source": {
                        "name": "赵六",
                        "age": 23,
                        "mail": "444@qq.com",
                        "hobby": "跑步、游泳、篮球"
                    },
                    "highlight": {
                        "hobby": [
                            "跑步、游泳、<em>篮球</em>"
                        ]
                    }
                },
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_J",
                    "_score": 0.2887157,
                    "_source": {
                        "name": "李四",
                        "age": 21,
                        "mail": "222@qq.com",
                        "hobby": "羽毛球、乒乓球、足球、篮球"
                    },
                    "highlight": {
                        "hobby": [
                            "羽毛球、乒乓球、足球、<em>篮球</em>"
                        ]
                    }
                }
            ]
        }
    }

    可以看到,包含了“音乐”、“篮球”的数据都已经被搜索到了。
    可是,搜索的结果并不符合我们的预期,因为我们想搜索的是既包含“音乐”又包含“篮球”的用户,显然结果返回的“或”的关系。
    在Elasticsearch中,可以指定词之间的逻辑关系,如下:

     结果:

    {
        "took": 7,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 1,
            "max_score": 1.2632889,
            "hits": [
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_K",
                    "_score": 1.2632889,
                    "_source": {
                        "name": "王五",
                        "age": 22,
                        "mail": "333@qq.com",
                        "hobby": "羽毛球、篮球、游泳、听音乐"
                    },
                    "highlight": {
                        "hobby": [
                            "羽毛球、<em>篮球</em>、游泳、听<em>音乐</em>"
                        ]
                    }
                }
            ]
        }
    }

    可以看到结果符合预期。

    前面我们测试了“OR” 和 “AND”搜索,这是两个极端,其实在实际场景中,并不会选取这2个极端,更有可能是选取这种,或者说,只需要符合一定的相似度就可以查询到数据,在Elasticsearch中也支持这样的查询,通过minimum_should_match来指定匹配度,如:70%;
    示例:

     结果:

    {
        "took": 3,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 3,
            "max_score": 1.2632889,
            "hits": [
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_K",
                    "_score": 1.2632889,
                    "_source": {
                        "name": "王五",
                        "age": 22,
                        "mail": "333@qq.com",
                        "hobby": "羽毛球、篮球、游泳、听音乐"
                    },
                    "highlight": {
                        "hobby": [
                            "羽毛球、<em>篮球</em>、游泳、听<em>音乐</em>"
                        ]
                    }
                },
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_L",
                    "_score": 0.42327404,
                    "_source": {
                        "name": "赵六",
                        "age": 23,
                        "mail": "444@qq.com",
                        "hobby": "跑步、游泳、篮球"
                    },
                    "highlight": {
                        "hobby": [
                            "跑步、游泳、<em>篮球</em>"
                        ]
                    }
                },
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_J",
                    "_score": 0.2887157,
                    "_source": {
                        "name": "李四",
                        "age": 21,
                        "mail": "222@qq.com",
                        "hobby": "羽毛球、乒乓球、足球、篮球"
                    },
                    "highlight": {
                        "hobby": [
                            "羽毛球、乒乓球、足球、<em>篮球</em>"
                        ]
                    }
                }
            ]
        }
    }

    组合搜索

    在搜索时,也可以使用过滤器中讲过的bool组合查询,示例:

     上面搜索的意思是:

    搜索结果中必须包含篮球,不能包含音乐,如果包含了游泳,那么它的相似度更高。

    结果:

    {
        "took": 5,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 2,
            "max_score": 1.2458471,
            "hits": [
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_L",
                    "_score": 1.2458471,
                    "_source": {
                        "name": "赵六",
                        "age": 23,
                        "mail": "444@qq.com",
                        "hobby": "跑步、游泳、篮球"
                    },
                    "highlight": {
                        "hobby": [
                            "跑步、<em>游泳</em>、<em>篮球</em>"
                        ]
                    }
                },
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_J",
                    "_score": 0.2887157,
                    "_source": {
                        "name": "李四",
                        "age": 21,
                        "mail": "222@qq.com",
                        "hobby": "羽毛球、乒乓球、足球、篮球"
                    },
                    "highlight": {
                        "hobby": [
                            "羽毛球、乒乓球、足球、<em>篮球</em>"
                        ]
                    }
                }
            ]
        }
    }

    评分的计算规则

    bool 查询会为每个文档计算相关度评分 _score , 再将所有匹配的 must 和 should 语句的分数 _score 求和,最后除以 must 和 should 语句的总数。

    must_not 语句不会影响评分; 它的作用只是将不相关的文档排除。默认情况下,should中的内容不是必须匹配的,如果查询语句中没有must,那么就会至少匹配其中一个。当然了,也可以通过minimum_should_match参数进行控制,该值可以是数字也可以的百分比。

    示例:

     结果:

    {
        "took": 3,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 2,
            "max_score": 1.8243669,
            "hits": [
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_K",
                    "_score": 1.8243669,
                    "_source": {
                        "name": "王五",
                        "age": 22,
                        "mail": "333@qq.com",
                        "hobby": "羽毛球、篮球、游泳、听音乐"
                    },
                    "highlight": {
                        "hobby": [
                            "羽毛球、<em>篮球</em>、<em>游泳</em>、听<em>音乐</em>"
                        ]
                    }
                },
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_L",
                    "_score": 1.2458471,
                    "_source": {
                        "name": "赵六",
                        "age": 23,
                        "mail": "444@qq.com",
                        "hobby": "跑步、游泳、篮球"
                    },
                    "highlight": {
                        "hobby": [
                            "跑步、<em>游泳</em>、<em>篮球</em>"
                        ]
                    }
                }
            ]
        }
    }

    权重

    有些时候,我们可能需要对某些词增加权重来影响该条数据的得分。如下:

    搜索关键字为“游泳篮球”,如果结果中包含了“音乐”权重为10,包含了“跑步”权重为2。

     结果:

    {
        "took": 5,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": 2,
            "max_score": 10.595525,
            "hits": [
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_K",
                    "_score": 10.595525,
                    "_source": {
                        "name": "王五",
                        "age": 22,
                        "mail": "333@qq.com",
                        "hobby": "羽毛球、篮球、游泳、听音乐"
                    },
                    "highlight": {
                        "hobby": [
                            "羽毛球、<em>篮球</em>、<em>游泳</em>、听<em>音乐</em>"
                        ]
                    }
                },
                {
                    "_index": "topcheer",
                    "_type": "person",
                    "_id": "AXFzrtEJg3Eko2bZmO_L",
                    "_score": 4.1034093,
                    "_source": {
                        "name": "赵六",
                        "age": 23,
                        "mail": "444@qq.com",
                        "hobby": "跑步、游泳、篮球"
                    },
                    "highlight": {
                        "hobby": [
                            "<em>跑步</em>、<em>游泳</em>、<em>篮球</em>"
                        ]
                    }
                }
            ]
     
  • 相关阅读:
    公司的OA系统基础框架系统(光标办公平台)
    通用权限控制系统--系统设计
    聘.Net软件工程师(昆明)
    对AgileFramework的思考
    iTextSharp.text.Rectangle 使用方法说明
    Castle Aspect# 难倒只支持一个拦截器?
    聘云南昆明地区的.Net工程师
    招聘云南软件销售人员
    给vncviewer 添加调用函数 GIS
    分享一个c++ 加密算法 ,在百度贴吧找的,比较好玩 GIS
  • 原文地址:https://www.cnblogs.com/dalianpai/p/12694298.html
Copyright © 2020-2023  润新知