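Analysis process
After a document is sent to Elasticsearch and before it is added to the inverted index, Elasticsearch analyzes it in the following steps:
- Character filtering: character filters transform the raw characters.
- Tokenization: the text (document) is split into one or more tokens.
- Token filtering: token filters transform each token.
- Token indexing: the resulting tokens are stored in the Lucene inverted index.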
Overall flow:
The goal is tokenization that matches how people naturally read and search text.
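The four steps above can be sketched in a few lines of Python. This is only a toy model of the flow, not how Elasticsearch or Lucene implement it: the function names (`char_filter`, `tokenize`, `token_filter`, `build_inverted_index`) and the specific rules (strip tags, split on word characters, lowercase) are assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def char_filter(text):
    """Character filtering: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Tokenization: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Token filtering: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the three analysis stages in order."""
    return token_filter(tokenize(char_filter(text)))


def build_inverted_index(docs):
    """Token indexing: map each token to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


# Hypothetical sample documents, keyed by document id.
docs = {1: "<b>Happy</b> days", 2: "happy hour"}
index = build_inverted_index(docs)
```

Here `index["happy"]` holds both document ids, which is why a search for "happy" can find document 1 even though it was originally written as `<b>Happy</b>`.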
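Built-in character filters
HTML strip character filter, mapping character filter, and pattern replace character filter.
HTML strip character filter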
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
Result
{
  "tokens" : [
    {
      "token" : """
I'm so happy!
""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}
Custom HTML strip character filter
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}
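To see what `escaped_tags` changes, here is a rough Python imitation of the filter. It is a sketch under assumptions, not Lucene's implementation: `BLOCK_TAGS` is a small hand-picked subset of tags that get replaced by a newline, and the regex only handles simple, well-formed tags.

```python
import html
import re

# Assumed subset of block-level tags that are replaced by a newline.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3"}
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def html_strip(text, escaped_tags=()):
    """Strip HTML tags, keep the tags listed in escaped_tags, decode entities."""
    keep = {tag.lower() for tag in escaped_tags}

    def replace_tag(match):
        name = match.group(1).lower()
        if name in keep:
            return match.group(0)  # leave escaped tags untouched
        return "\n" if name in BLOCK_TAGS else ""

    return html.unescape(TAG_RE.sub(replace_tag, text))


sample = "<p>I&apos;m so <b>happy</b>!</p>"
default_output = html_strip(sample)                      # all tags removed
escaped_output = html_strip(sample, escaped_tags=["b"])  # <b> is preserved
```

With the default settings the `<b>` tags disappear; with `escaped_tags=["b"]` they stay in the text while `<p>` is still replaced by a newline.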
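Mapping character filter
If my_index still exists from the previous example, delete it first (DELETE my_index), because creating an index that already exists fails.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["苍井空 => 666", "武藤兰 => 888"]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "苍井空热爱武藤兰,可惜苍井空后来结婚了"
}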
Result
{
  "tokens" : [
    {
      "token" : "666热爱888,可惜666后来结婚了",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}
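The replacement above can be reproduced with a small Python function. This is a minimal sketch, assuming that rules are written as `key => value` strings and that the longest key matching at a given position wins; the function name `mapping_char_filter` is made up for this example.

```python
def mapping_char_filter(text, mappings):
    """Replace every occurrence of a mapped key, preferring the longest key."""
    rules = {}
    for rule in mappings:
        key, value = rule.split("=>", 1)
        rules[key.strip()] = value.strip()

    # Try longer keys first so that the longest match wins at each position.
    keys = sorted(rules, key=len, reverse=True)

    output = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                output.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matches here: copy the character unchanged.
            output.append(text[i])
            i += 1
    return "".join(output)


filtered = mapping_char_filter(
    "苍井空热爱武藤兰,可惜苍井空后来结婚了",
    ["苍井空 => 666", "武藤兰 => 888"],
)
```

Note that the offsets in the Elasticsearch response still refer to the original 19-character text, even though the replaced token is shorter.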
Pattern replace character filter
PUT my_index1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "my_char_filter" ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST my_index1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
Result
{
  "tokens" : [
    { "token" : "My", "start_offset" : 0, "end_offset" : 2, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "credit", "start_offset" : 3, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "card", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "is", "start_offset" : 15, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "123_456_789", "start_offset" : 18, "end_offset" : 29, "type" : "<NUM>", "position" : 4 }
  ]
}
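The same regular expression can be tried outside Elasticsearch. The sketch below uses Python's `re` module, so the replacement is written `\1_` instead of the Java-style `$1_`; the helper name `pattern_replace` is an assumption for this example. The lookahead `(?=\d)` means a hyphen is only replaced when a digit follows it.

```python
import re

# Digits followed by a hyphen, but only when another digit comes next.
PATTERN = re.compile(r"(\d+)-(?=\d)")


def pattern_replace(text):
    """Replace the hyphen between digit groups with an underscore."""
    return PATTERN.sub(r"\1_", text)


filtered = pattern_replace("My credit card is 123-456-789")
```

Because the hyphens are gone before tokenization, the number survives as the single token `123_456_789` instead of being split into three separate numbers.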
Built-in analyzers
Built-in tokenizers
UAX URL email tokenizer
Sample text (author, source URL, email, and copyright notice):
作者:一线码农
来源:未知原文:https://www.cnblogs.com/Mc_HotHog/articles/1111111.html
邮箱:22222@qq.com
版权声明:本文为博主原创文章,转载请附上博文链接!
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "作者:一线码农来源:未知原文:https://www.cnblogs.com/Mc_HotHog/articles/1111111.html邮箱:22222@qq.com版权声明:本文为博主原创文章,转载请附上博文链接!"
}
Result
{
  "tokens" : [
    { "token" : "作", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "者", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "一", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "线", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "码", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "农", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "来", "start_offset" : 7, "end_offset" : 8, "type" : "<IDEOGRAPHIC>", "position" : 6 },
    { "token" : "源", "start_offset" : 8, "end_offset" : 9, "type" : "<IDEOGRAPHIC>", "position" : 7 },
    { "token" : "未", "start_offset" : 10, "end_offset" : 11, "type" : "<IDEOGRAPHIC>", "position" : 8 },
    { "token" : "知", "start_offset" : 11, "end_offset" : 12, "type" : "<IDEOGRAPHIC>", "position" : 9 },
    { "token" : "原", "start_offset" : 12, "end_offset" : 13, "type" : "<IDEOGRAPHIC>", "position" : 10 },
    { "token" : "文", "start_offset" : 13, "end_offset" : 14, "type" : "<IDEOGRAPHIC>", "position" : 11 },
    { "token" : "https://www.cnblogs.com/Mc_HotHog/articles/1111111.html", "start_offset" : 15, "end_offset" : 70, "type" : "<URL>", "position" : 12 },
    { "token" : "邮", "start_offset" : 70, "end_offset" : 71, "type" : "<IDEOGRAPHIC>", "position" : 13 },
    { "token" : "箱", "start_offset" : 71, "end_offset" : 72, "type" : "<IDEOGRAPHIC>", "position" : 14 },
    { "token" : "22222@qq.com", "start_offset" : 73, "end_offset" : 85, "type" : "<EMAIL>", "position" : 15 },
    { "token" : "版", "start_offset" : 85, "end_offset" : 86, "type" : "<IDEOGRAPHIC>", "position" : 16 },
    { "token" : "权", "start_offset" : 86, "end_offset" : 87, "type" : "<IDEOGRAPHIC>", "position" : 17 },
    { "token" : "声", "start_offset" : 87, "end_offset" : 88, "type" : "<IDEOGRAPHIC>", "position" : 18 },
    { "token" : "明", "start_offset" : 88, "end_offset" : 89, "type" : "<IDEOGRAPHIC>", "position" : 19 },
    { "token" : "本", "start_offset" : 90, "end_offset" : 91, "type" : "<IDEOGRAPHIC>", "position" : 20 },
    { "token" : "文", "start_offset" : 91, "end_offset" : 92, "type" : "<IDEOGRAPHIC>", "position" : 21 },
    { "token" : "为", "start_offset" : 92, "end_offset" : 93, "type" : "<IDEOGRAPHIC>", "position" : 22 },
    { "token" : "博", "start_offset" : 93, "end_offset" : 94, "type" : "<IDEOGRAPHIC>", "position" : 23 },
    { "token" : "主", "start_offset" : 94, "end_offset" : 95, "type" : "<IDEOGRAPHIC>", "position" : 24 },
    { "token" : "原", "start_offset" : 95, "end_offset" : 96, "type" : "<IDEOGRAPHIC>", "position" : 25 },
    { "token" : "创", "start_offset" : 96, "end_offset" : 97, "type" : "<IDEOGRAPHIC>", "position" : 26 },
    { "token" : "文", "start_offset" : 97, "end_offset" : 98, "type" : "<IDEOGRAPHIC>", "position" : 27 },
    { "token" : "章", "start_offset" : 98, "end_offset" : 99, "type" : "<IDEOGRAPHIC>", "position" : 28 },
    { "token" : "转", "start_offset" : 100, "end_offset" : 101, "type" : "<IDEOGRAPHIC>", "position" : 29 },
    { "token" : "载", "start_offset" : 101, "end_offset" : 102, "type" : "<IDEOGRAPHIC>", "position" : 30 },
    { "token" : "请", "start_offset" : 102, "end_offset" : 103, "type" : "<IDEOGRAPHIC>", "position" : 31 },
    { "token" : "附", "start_offset" : 103, "end_offset" : 104, "type" : "<IDEOGRAPHIC>", "position" : 32 },
    { "token" : "上", "start_offset" : 104, "end_offset" : 105, "type" : "<IDEOGRAPHIC>", "position" : 33 },
    { "token" : "博", "start_offset" : 105, "end_offset" : 106, "type" : "<IDEOGRAPHIC>", "position" : 34 },
    { "token" : "文", "start_offset" : 106, "end_offset" : 107, "type" : "<IDEOGRAPHIC>", "position" : 35 },
    { "token" : "链", "start_offset" : 107, "end_offset" : 108, "type" : "<IDEOGRAPHIC>", "position" : 36 },
    { "token" : "接", "start_offset" : 108, "end_offset" : 109, "type" : "<IDEOGRAPHIC>", "position" : 37 }
  ]
}
Built-in token filters
Learn more: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/index.html