Configuring analyzers
What an analyzer is made of
- One tokenizer
- Zero or more token filters
- Zero or more character filters
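The three building blocks listed above run in a fixed order: character filters first, then the tokenizer, then token filters. The following is a minimal plain-Python sketch of that pipeline, not Elasticsearch's implementation. The names analyze, strip_tags, word_tokenizer and lowercase are hypothetical helpers invented for this illustration.

```python
import re
from typing import Callable, List, Sequence

# Hypothetical type aliases for the three kinds of components.
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(
    text: str,
    tokenizer: Tokenizer,
    char_filters: Sequence[CharFilter] = (),
    token_filters: Sequence[TokenFilter] = (),
) -> List[str]:
    """Run text through char filters, one tokenizer, then token filters."""
    for char_filter in char_filters:      # zero or more, applied to raw text
        text = char_filter(text)
    tokens = tokenizer(text)              # exactly one tokenizer
    for token_filter in token_filters:    # zero or more, applied to tokens
        tokens = token_filter(tokens)
    return tokens


def strip_tags(text: str) -> str:
    """Very rough stand-in for an HTML-stripping character filter."""
    return re.sub(r"<[^>]+>", "", text)


def word_tokenizer(text: str) -> List[str]:
    """Rough stand-in for a word tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:
    """Stand-in for a lowercase token filter."""
    return [t.lower() for t in tokens]


if __name__ == "__main__":
    print(analyze("Hello <b>World</b>", word_tokenizer, [strip_tags], [lowercase]))
    # prints ['hello', 'world']
```

The sketch only models the order of the stages; real analyzers also track offsets and positions for every token.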
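Configuring built-in analyzers
Built-in analyzers can be used directly without any configuration. Some of them, however, accept optional parameters that change their behavior. For example, the standard analyzer can be configured with a list of stop words.
Example: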
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "std_english"
          }
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "field": "my_text",
  "text": "The old brown cow"
}
POST my-index-000001/_analyze
{
  "field": "my_text.english",
  "text": "The old brown cow"
}
Result (the my_text field keeps every token; my_text.english drops the stop word "the", and the remaining tokens keep their original positions):
{
  "tokens" : [
    { "token" : "the", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "old", "start_offset" : 4, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "brown", "start_offset" : 8, "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "cow", "start_offset" : 14, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}
{
  "tokens" : [
    { "token" : "old", "start_offset" : 4, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "brown", "start_offset" : 8, "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "cow", "start_offset" : 14, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}
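The second response starts at position 1 rather than 0 because removing a stop word does not renumber the tokens that remain. The sketch below reproduces that behavior in plain Python under simplifying assumptions: a regular expression stands in for the standard tokenizer, and STOP_SUBSET is a small hypothetical subset of stop words, not the full _english_ list.

```python
import re
from typing import FrozenSet, List, NamedTuple


class Token(NamedTuple):
    token: str
    start_offset: int
    end_offset: int
    position: int


# Hypothetical subset for illustration; the real _english_ list is longer.
STOP_SUBSET: FrozenSet[str] = frozenset({"a", "an", "and", "the"})


def tokenize(text: str) -> List[Token]:
    """Lowercased word tokens with character offsets and positions."""
    return [
        Token(m.group().lower(), m.start(), m.end(), pos)
        for pos, m in enumerate(re.finditer(r"\w+", text))
    ]


def drop_stopwords(
    tokens: List[Token], stopwords: FrozenSet[str] = STOP_SUBSET
) -> List[Token]:
    """Remove stop words but leave offsets and positions untouched."""
    return [t for t in tokens if t.token not in stopwords]


if __name__ == "__main__":
    for tok in drop_stopwords(tokenize("The old brown cow")):
        print(tok)
    # The first surviving token is "old" at position 1, not 0.
```

Keeping the gap at position 0 is what lets phrase queries still account for the removed word.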
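Configuring custom analyzers
When the built-in analyzers do not fully meet your needs, you can create a custom analyzer from a suitable combination of the following:
- Zero or more character filters
- One tokenizer
- Zero or more token filters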
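Parameters:
Parameter | Value | Description |
tokenizer | A built-in or custom tokenizer | Tokenizer that splits the text into tokens |
char_filter | Array of built-in or custom character filters | Character filters applied to the text before it is tokenized |
filter | Array of built-in or custom token filters | Token filters applied to the tokens the tokenizer emits |
position_increment_gap | Defaults to 100 | Position gap inserted between the values of a multi-valued text field, so that phrase queries do not match across values |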
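Example:
Setting type to custom tells Elasticsearch that we are defining a custom analyzer. To configure a built-in analyzer instead, set type to the name of that built-in analyzer.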
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ],
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
Result:
{
  "tokens" : [
    { "token" : "is", "start_offset" : 0, "end_offset" : 2, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "this", "start_offset" : 3, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "deja", "start_offset" : 11, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "vu", "start_offset" : 16, "end_offset" : 22, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}
Example 2 (the analyzer refers to a custom tokenizer, character filter, and token filter that are defined in the same analysis block):
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "punctuation",
          "filter": [ "lowercase", "english_stop" ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ ":) => _happy_", ":( => _sad_" ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
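To see what the second example's analyzer would do to a piece of text, the stages can be approximated in plain Python. This is only a sketch under stated assumptions: the sample sentence is a hypothetical input, STOP_SUBSET is a small illustrative subset rather than the full _english_ list, and the function names are invented for this illustration.

```python
import re
from typing import Dict, FrozenSet, List

# Mirrors the "emoticons" mapping character filter from the settings.
EMOTICONS: Dict[str, str] = {":)": "_happy_", ":(": "_sad_"}

# Same regular expression as the "punctuation" pattern tokenizer.
PUNCTUATION = re.compile(r"[ .,!?]")

# Hypothetical subset for illustration; the real _english_ list is longer.
STOP_SUBSET: FrozenSet[str] = frozenset({"a", "an", "and", "the"})


def map_emoticons(text: str) -> str:
    """Character filter stage: replace each emoticon with its mapped word."""
    for emoticon, replacement in EMOTICONS.items():
        text = text.replace(emoticon, replacement)
    return text


def split_on_punctuation(text: str) -> List[str]:
    """Tokenizer stage: split on the pattern and drop empty pieces."""
    return [piece for piece in PUNCTUATION.split(text) if piece]


def my_custom_analyzer(text: str) -> List[str]:
    """Char filter, then tokenizer, then lowercase and stop-word filters."""
    tokens = split_on_punctuation(map_emoticons(text))
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_SUBSET]


if __name__ == "__main__":
    print(my_custom_analyzer("I'm a :) person, and you?"))
    # prints ["i'm", '_happy_', 'person', 'you']
```

The emoticon has to be replaced before tokenizing; otherwise the pattern tokenizer would still keep ":)" as a token, but nothing would turn it into a searchable word.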
Testing analyzers
Testing a built-in analyzer
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The quick brown fox."
}
Result (the whitespace analyzer splits only on whitespace, so case is preserved and the trailing period stays attached to "fox."):
{
  "tokens" : [
    { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },
    { "token" : "quick", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 1 },
    { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "word", "position" : 2 },
    { "token" : "fox.", "start_offset" : 16, "end_offset" : 20, "type" : "word", "position" : 3 }
  ]
}
Example 2 (an ad hoc combination of a tokenizer and token filters, without defining an analyzer first):
POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding" ],
  "text": "Is this déja vu?"
}
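The request above chains a tokenizer with the lowercase and asciifolding filters. As a rough illustration of what folding means, the sketch below approximates it with Unicode decomposition from Python's standard library. This is an assumption-laden stand-in: the real asciifolding filter handles many characters that decomposition does not (for example ß or ø), and the regular expression is only a crude substitute for the standard tokenizer.

```python
import re
import unicodedata
from typing import List


def ascii_fold(token: str) -> str:
    """Approximate folding: decompose, then drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str) -> List[str]:
    """Word tokens, lowercased, then folded to unaccented form."""
    tokens = re.findall(r"\w+", text)           # crude tokenizer stand-in
    tokens = [t.lower() for t in tokens]        # lowercase filter
    return [ascii_fold(t) for t in tokens]      # folding filter (approximate)


if __name__ == "__main__":
    # "\u00e9" is the precomposed letter e with an acute accent.
    print(analyze("Is this d\u00e9ja vu?"))
    # prints ['is', 'this', 'deja', 'vu']
```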
Testing a custom analyzer
A custom analyzer is tested against the index that defines it, either by naming the analyzer or by naming a field that uses it:
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded"
      }
    }
  }
}
GET my-index-000001/_analyze
{
  "analyzer": "std_folded",
  "text": "Is this déjà vu?"
}
GET my-index-000001/_analyze
{
  "field": "my_text",
  "text": "Is this déjà vu?"
}