• Elasticsearch Getting-Started Study Notes


    Preface

    These notes record what I did while learning Elasticsearch from scratch: following the official documentation and running some tests of my own.

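    Installation

    As a beginner, simply install by following the official documentation. My earlier post on getting started with ELK mainly covered the installation process, so I won't repeat it here.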

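    Learning material

    I searched for a long time, and most of the documentation I found is fairly old. Even the official guide is written against 2.x, while the latest release on the official site has already reached 6. It is still good enough for the basics, so what follows is based on the official guide.

    Once Elasticsearch and Kibana are installed, open localhost:5601 and select Dev Tools.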

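    Indexing (storing) employee documents

    The test data is a list of a company's employees. Each employee's record is called a document, and adding a record is called indexing a document.

    Enter the following in the console: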

    PUT /megacorp/employee/1
    {
        "first_name" : "John",
        "last_name" :  "Smith",
        "age" :        25,
        "about" :      "I love to go rock climbing",
        "interests": [ "sports", "music" ]
    }
    
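    • megacorp is the index name
    • employee is the type name (mapping types exist in the 6.x release used here; they were deprecated in 7.0)
    • 1 is the document id, which here is also the employee's id

    Place the cursor on the first line and click the green button to run it.

    The console is only a shorthand; underneath, the request goes through the REST API that ES exposes. The same request can be sent with Postman or curl against the ES address, localhost:9200. Note that Postman does not allow a body on GET requests; for requests that carry one, such as the searches later on, POST can be used instead.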
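
    To make the "it is just REST" point concrete, here is a minimal sketch using only Python's standard library. It assumes Elasticsearch is listening on localhost:9200 as above; build_index_request is a made-up helper name, and the request is only built, not sent.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local single-node install


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one document."""
    url = f"{ES_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


employee = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}
request = build_index_request("megacorp", "employee", 1, employee)
print(request.get_method(), request.full_url)

# To actually send it (requires a running Elasticsearch):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

    The method and path are exactly what the console line PUT /megacorp/employee/1 expresses; the console just fills in the host for you.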

    That stores one document in ES. Next, store a few more:

    PUT /megacorp/employee/2
    {
        "first_name" :  "Jane",
        "last_name" :   "Smith",
        "age" :         32,
        "about" :       "I like to collect rock albums",
        "interests":  [ "music" ]
    }
    
    PUT /megacorp/employee/3
    {
        "first_name" :  "Douglas",
        "last_name" :   "Fir",
        "age" :         35,
        "about":        "I like to build cabinets",
        "interests":  [ "forestry" ]
    }
    

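    To view them, click Management, Index Patterns, Configure an index pattern, enter megacorp, and confirm.

    Then click Discover to see the stored documents.

    Retrieving a document

    Once data is stored, we want to read it back. Fetch the employee with id 1: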

    GET /megacorp/employee/1
    
    Response:
    {
      "_index": "megacorp",
      "_type": "employee",
      "_id": "1",
      "_version": 5,
      "found": true,
      "_source": {
        "first_name": "John",
        "last_name": "Smith",
        "age": 25,
        "about": "I love to go rock climbing",
        "interests": [
          "sports",
          "music"
        ]
      }
    }
    

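    Compared with storing a document, only the HTTP method differs.

    • PUT: add
    • GET: retrieve
    • DELETE: delete
    • HEAD: check whether the document exists
    • To update, PUT the document again (this replaces the whole document and increments _version)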

    Lightweight search

    Besides find-by-id, the most common operation is a conditional query.

    First, list everything:

    GET /megacorp/employee/_search
    

    You can also get the number of documents with _count:

    GET /megacorp/employee/_count
    

    To find the employees whose last_name is Smith:

    GET /megacorp/employee/_search?q=last_name:Smith
    

    Add a q parameter of the form field:value.
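
    When building such a URL outside the console, the value of q should be URL-encoded. A small sketch with Python's standard library; build_search_url is a hypothetical helper and the base address is the local install assumed earlier:

```python
from urllib.parse import urlencode


def build_search_url(index, doc_type, field, value, base="http://localhost:9200"):
    """Build a query-string search URL of the form .../_search?q=field:value."""
    query_string = urlencode({"q": f"{field}:{value}"})
    return f"{base}/{index}/{doc_type}/_search?{query_string}"


url = build_search_url("megacorp", "employee", "last_name", "Smith")
print(url)
# http://localhost:9200/megacorp/employee/_search?q=last_name%3ASmith
```

    The colon is percent-encoded as %3A, which the server decodes back to last_name:Smith.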

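    Query DSL

    Query-string search is handy for ad hoc searches from the command line, but it has its limitations (see Lightweight search). Elasticsearch provides a rich, flexible query language called the query DSL, which supports building more complex and robust queries.

    The domain-specific language (DSL) is specified as a JSON request body. The earlier search for all Smiths can be rewritten like this: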

    GET /megacorp/employee/_search
    {
        "query" : {
            "match" : {
                "last_name" : "Smith"
            }
        }
    }
    

    More complex queries

    Continuing from the previous query:

    GET /megacorp/employee/_search
    {
        "query" : {
            "bool": {
                "must": {
                    "match" : {
                        "last_name" : "smith" 
                    }
                },
                "filter": {
                    "range" : {
                        "age" : { "gt" : 30 } 
                    }
                }
            }
        }
    }
    

    This adds a range filter that requires age to be greater than 30.

    Result

    {
      "took": 4,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.2876821,
        "hits": [
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "2",
            "_score": 0.2876821,
            "_source": {
              "first_name": "Jane",
              "last_name": "Smith",
              "age": 32,
              "about": "I like to collect rock albums",
              "interests": [
                "music"
              ]
            }
          }
        ]
      }
    }
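
    As a sketch of how this request body could be assembled in code, the Python below builds the same bool query and cross-checks it against the three sample employees. The function names are made up for illustration, and expected_ids is only a local sanity check on the sample data, not how Elasticsearch evaluates the query.

```python
import json

EMPLOYEES = [
    {"id": 1, "last_name": "Smith", "age": 25},
    {"id": 2, "last_name": "Smith", "age": 32},
    {"id": 3, "last_name": "Fir", "age": 35},
]


def name_and_min_age_query(last_name, min_age):
    """Build the bool query: 'must' is a scored match, 'filter' a yes/no range."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"last_name": last_name}},
                "filter": {"range": {"age": {"gt": min_age}}},
            }
        }
    }


def expected_ids(docs, last_name, min_age):
    """Local cross-check: same conditions applied to the sample data."""
    return [
        doc["id"]
        for doc in docs
        if doc["last_name"].lower() == last_name.lower() and doc["age"] > min_age
    ]


body = name_and_min_age_query("smith", 30)
print(json.dumps(body, indent=2))
print(expected_ids(EMPLOYEES, "smith", 30))  # [2]
```

    Only Jane Smith (id 2) satisfies both conditions, which agrees with the single hit in the result above.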
    

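    Full-text search

    The searches so far have been simple: a single name, filtered by age. Now try a somewhat more advanced full-text search, a task that traditional databases really struggle with.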

    GET /megacorp/employee/_search
    {
        "query" : {
            "match" : {
                "about" : "rock climbing"
            }
        }
    }
    

    Result

    {
      "took": 32,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 2,
        "max_score": 0.53484553,
        "hits": [
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "1",
            "_score": 0.53484553,
            "_source": {
              "first_name": "John",
              "last_name": "Smith",
              "age": 25,
              "about": "I love to go rock climbing",
              "interests": [
                "sports",
                "music"
              ]
            }
          },
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "2",
            "_score": 0.26742277,
            "_source": {
              "first_name": "Jane",
              "last_name": "Smith",
              "age": 32,
              "about": "I like to collect rock albums",
              "interests": [
                "music"
              ]
            }
          }
        ]
      }
    }
    

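    The hits are sorted by a relevance score, _score. Note that a document matching only one of the words (rock) is returned as well, with a lower score. To match the whole phrase exactly, use a different query type.

    match_phrase matches the exact phrase: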

    GET /megacorp/employee/_search
    {
        "query" : {
            "match_phrase" : {
                "about" : "rock climbing"
            }
        }
    }
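
    The difference between the two query types can be illustrated with a toy model in Python. This is a deliberately simplified stand-in (lowercase and split on whitespace, no scoring), not Elasticsearch's actual analysis or matching code; all names here are made up for the sketch.

```python
DOCS = {
    1: "I love to go rock climbing",
    2: "I like to collect rock albums",
    3: "I like to build cabinets",
}


def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def match(query, docs):
    """Ids of documents containing at least one of the query terms."""
    terms = set(tokenize(query))
    return [doc_id for doc_id, text in docs.items() if terms & set(tokenize(text))]


def match_phrase(query, docs):
    """Ids of documents containing the query terms consecutively and in order."""
    phrase = tokenize(query)
    hits = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        windows = (
            tokens[i:i + len(phrase)]
            for i in range(len(tokens) - len(phrase) + 1)
        )
        if any(window == phrase for window in windows):
            hits.append(doc_id)
    return hits


print(match("rock climbing", DOCS))         # [1, 2]
print(match_phrase("rock climbing", DOCS))  # [1]
```

    match keeps document 2 because it shares the term rock, while match_phrase keeps only document 1, where rock is immediately followed by climbing.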
    

    When you search on Baidu, the matched keywords are highlighted. ES can likewise return the matched fragments with highlighting markup.

    GET /megacorp/employee/_search
    {
        "query" : {
            "match_phrase" : {
                "about" : "rock climbing"
            }
        },
        "highlight": {
            "fields" : {
                "about" : {}
            }
        }
    }
    

    Response

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.5753642,
        "hits": [
          {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "1",
            "_score": 0.5753642,
            "_source": {
              "first_name": "John",
              "last_name": "Smith",
              "age": 25,
              "about": "I love to go rock climbing",
              "interests": [
                "sports",
                "music"
              ]
            },
            "highlight": {
              "about": [
                "I love to go <em>rock</em> <em>climbing</em>"
              ]
            }
          }
        ]
      }
    }
    

    Aggregations (group by)

    SQL often involves aggregate calculations such as sum, count, and avg. In ES it looks like this:

    GET /megacorp/employee/_search
    {
      "aggs": {
        "all_interests": {
          "terms": { "field": "interests" }
        }
      }
    }
    

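    aggs declares aggregations, all_interests is the name under which the result is returned, and terms groups documents by each distinct value of the field and counts them. In other words, this request counts documents per value of interests. ES may then respond with: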

    {
      "error": {
        "root_cause": [
          {
            "type": "illegal_argument_exception",
            "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
          }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
          {
            "shard": 0,
            "index": "megacorp",
            "node": "iqHCjOUkSsWM2Hv6jT-xUQ",
            "reason": {
              "type": "illegal_argument_exception",
              "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
            }
          }
        ],
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
          }
        }
      },
      "status": 400
    }
    

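    The message says that aggregating on a text field requires enabling fielddata=true on it (or using a keyword field instead).

    Enabling it means changing the index mapping, which is comparable to altering a table schema in a relational database.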
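
    The error message also names an alternative: aggregate on a keyword field. The dynamic mapping of this index defines a keyword sub-field for each text field, so one option is to point the aggregation at interests.keyword and leave the mapping untouched. The sketch below only builds that request body in Python; terms_aggregation is a hypothetical helper, and this variant is not the one these notes go on to use.

```python
import json


def terms_aggregation(name, field, size=0):
    """Build a search body with a single terms aggregation and no hits."""
    return {
        "size": size,
        "aggs": {name: {"terms": {"field": field}}},
    }


# Assumption: the index has the default "keyword" sub-field on interests.
body = terms_aggregation("all_interests", "interests.keyword")
print(json.dumps(body, indent=2))
```

    Keyword values are not analyzed, so the bucket keys would be the original strings rather than lowercased tokens. For the single lowercase words in this data set the two look the same.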

    Modifying the index mapping

    First, look at the current mapping:

    GET /megacorp/employee/_mapping
    
    

    Result:

    {
      "megacorp": {
        "mappings": {
          "employee": {
            "properties": {
              "about": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "age": {
                "type": "long"
              },
              "first_name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "interests": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "last_name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }
    

    It simply defines the type of each field. To fix the problem above, one setting has to be added to the interests field:

    "fielddata": true
    

    Update the mapping as follows:

    
    PUT /megacorp/employee/_mapping
    {
            "properties": {
              "about": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "age": {
                "type": "long"
              },
              "first_name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "interests": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                },
                "fielddata": true
              },
              "last_name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
    

    Response:

    {
      "acknowledged": true
    }
    

    This means the update succeeded. Now we can get back to the aggregation.

    Aggregation: group by with count

    In SQL this would be something like:

    select interests, count(*) from index_xxx
    where last_name = 'smith'
    group by interests;
    

    In ES the query looks like this:

    GET /megacorp/employee/_search
    {
      "_source": false,
      "query": {
        "match": {
          "last_name": "smith"
        }
      },
      "size": 0,
      "aggs": {
        "all_interests": {
          "terms": {
            "field": "interests"
          }
        }
      }
    }
    
    

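    "_source": false suppresses the _source of each hit; it defaults to true.

    "size": 0 means no hits are returned. By default the top 10 hits come back, which we don't need here since only the aggregation result matters.

    aggs declares an aggregation.

    all_interests is a user-defined name; any name will do.

    terms counts documents per distinct value, here on the interests field.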
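
    To see what the terms aggregation computes, here is a rough local equivalent in Python over the sample employees. It mimics only the grouping and counting, using made-up names, and says nothing about how Elasticsearch implements it.

```python
from collections import Counter

EMPLOYEES = [
    {"last_name": "Smith", "interests": ["sports", "music"]},
    {"last_name": "Smith", "interests": ["music"]},
    {"last_name": "Fir", "interests": ["forestry"]},
]


def terms_count(docs, field):
    """Count documents per distinct value of a (possibly multi-valued) field."""
    counts = Counter()
    for doc in docs:
        # A document counts at most once per distinct value.
        counts.update(set(doc[field]))
    return counts.most_common()


# The "query" part: keep only the Smiths, like where last_name = 'smith'.
smiths = [doc for doc in EMPLOYEES if doc["last_name"].lower() == "smith"]
buckets = terms_count(smiths, "interests")
print(buckets)  # [('music', 2), ('sports', 1)]
```

    The query narrows the document set first and the aggregation then runs over what is left, which is why forestry does not appear.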

    Response:

    {
      "took": 0,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 2,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "all_interests": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "music",
              "doc_count": 2
            },
            {
              "key": "sports",
              "doc_count": 1
            }
          ]
        }
      }
    }
    

    This gives the count for each value of the field. sum and avg can be computed the same way:

    
    
    GET /megacorp/employee/_search
    {
        "_source": false, 
        "size": 0, 
        "aggs" : {
            "avg_age" : {
                "avg" : { "field" : "age" }
            },
            "sum_age" : {
                "sum" : { "field" : "age" }
            }
        }
    }
    

    Response

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "avg_age": {
          "value": 30.666666666666668
        },
        "sum_age": {
          "value": 92
        }
      }
    }
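
    The two metric values are easy to verify by hand from the three ages stored earlier. A quick check in Python:

```python
AGES = [25, 32, 35]  # John, Jane, Douglas

sum_age = sum(AGES)
avg_age = sum_age / len(AGES)

print(sum_age)            # 92
print(round(avg_age, 2))  # 30.67
```

    Both agree with sum_age and avg_age in the response; no query is present, so all three documents are included.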
    

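    Summary

    The above covers the first chapter of the official guide, the getting-started part. I mostly copied the examples and ran them myself, with no big leaps beyond that, but added my own understanding along the way. It is enough to get a basic feel for how to use ES; more detailed syntax will come later.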

  • Original post: https://www.cnblogs.com/woshimrf/p/es-start.html