• ELK: an Elasticsearch 6.5 cluster


    The previous post introduced and tried out single-node operation of the es 6.5 series; now let's build an es 6.5 cluster.

    Environment: three nodes: master-172.16.23.128, node1-172.16.23.129, node2-172.16.23.130. First, check the es service status on each node:

    [root@master ~]# ansible all_nodes -m shell -a "systemctl status elasticsearch"|grep -i running
       Active: active (running) since 六 2018-12-29 12:06:55 CST; 3h 33min ago
       Active: active (running) since 六 2018-12-29 12:07:43 CST; 3h 32min ago
       Active: active (running) since 六 2018-12-29 15:38:47 CST; 1min 42s ago
    

     Check the es configuration file on each node:

    [root@master ~]# ansible all_nodes -m shell -a 'cat /etc/elasticsearch/elasticsearch.yml|egrep -v "^$|^#"'
    172.16.23.128 | CHANGED | rc=0 >>
    cluster.name: estest
    node.name: esnode2
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 0.0.0.0
    http.port: 9200
    discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.131"]
    
    172.16.23.130 | CHANGED | rc=0 >>
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    
    172.16.23.129 | CHANGED | rc=0 >>
    cluster.name: es
    node.name: node1
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 0.0.0.0
    http.port: 9200
    

     Now configure the cluster based on discovery.zen (reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/modules-discovery-zen.html). The resulting configuration is as follows:

    [root@master ~]# ansible all_nodes -m shell -a 'cat /etc/elasticsearch/elasticsearch.yml|egrep -v "^$|^#"'
    172.16.23.128 | CHANGED | rc=0 >>
    cluster.name: estest
    node.name: master
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 0.0.0.0
    http.port: 9200
    discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.129", "172.16.23.130"]
    
    172.16.23.130 | CHANGED | rc=0 >>
    cluster.name: estest
    node.name: node2
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 0.0.0.0
    http.port: 9200
    discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.129", "172.16.23.130"]
    
    172.16.23.129 | CHANGED | rc=0 >>
    cluster.name: estest
    node.name: node1
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 0.0.0.0
    http.port: 9200
    discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.129", "172.16.23.130"]
    

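     One setting that is not shown in the configs above, so take this as a suggested sketch rather than part of the original setup: for a three-node cluster where all nodes are master-eligible, discovery.zen.minimum_master_nodes should normally be set to a quorum (3/2 + 1 = 2) to guard against split-brain. The extra line on every node would look like:

    # /etc/elasticsearch/elasticsearch.yml (assumption: all three nodes are master-eligible)
    discovery.zen.minimum_master_nodes: 2
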
     Restart the elasticsearch service:

    [root@master ~]# ansible all_nodes -m shell -a "systemctl restart elasticsearch"
    172.16.23.130 | CHANGED | rc=0 >>
    
    
    172.16.23.128 | CHANGED | rc=0 >>
    
    
    172.16.23.129 | CHANGED | rc=0 >>
    

     Then check the cluster health:

    [root@master ~]# curl -X GET "localhost:9200/_cluster/health" -s|python -m json.tool
    {
        "active_primary_shards": 0,
        "active_shards": 0,
        "active_shards_percent_as_number": 100.0,
        "cluster_name": "estest",
        "delayed_unassigned_shards": 0,
        "initializing_shards": 0,
        "number_of_data_nodes": 3,
        "number_of_in_flight_fetch": 0,
        "number_of_nodes": 3,
        "number_of_pending_tasks": 0,
        "relocating_shards": 0,
        "status": "green",
        "task_max_waiting_in_queue_millis": 0,
        "timed_out": false,
        "unassigned_shards": 0
    }
    

     Check the nodes in the cluster:

    [root@master ~]# curl -X GET "localhost:9200/_cat/nodes?v"
    ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
    172.16.23.128           28          71   3    0.04    0.11     0.08 mdi       *      master
    172.16.23.130           29          67   4    0.04    0.11     0.10 mdi       -      node2
    172.16.23.129           28          58   4    0.12    0.20     0.13 mdi       -      node1
    

     Look only at the master node:

    [root@master ~]# curl -X GET "localhost:9200/_cat/master?v"
    id                     host          ip            node
    hVY-U_ocQueMtcryoGGbTg 172.16.23.128 172.16.23.128 master
    

     Check the cluster health via the cat API:

    [root@master ~]# curl -X GET "localhost:9200/_cat/health?v"
    epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
    1546070536 08:02:16  estest  green           3         3      0   0    0    0        0             0                  -                100.0%
    

     Check the node attributes (nodeattrs):

    [root@master ~]# curl -X GET "localhost:9200/_cat/nodeattrs?v"
    node   host          ip            attr              value
    master 172.16.23.128 172.16.23.128 ml.machine_memory 3956293632
    master 172.16.23.128 172.16.23.128 xpack.installed   true
    master 172.16.23.128 172.16.23.128 ml.max_open_jobs  20
    master 172.16.23.128 172.16.23.128 ml.enabled        true
    node2  172.16.23.130 172.16.23.130 ml.machine_memory 3956293632
    node2  172.16.23.130 172.16.23.130 ml.max_open_jobs  20
    node2  172.16.23.130 172.16.23.130 xpack.installed   true
    node2  172.16.23.130 172.16.23.130 ml.enabled        true
    node1  172.16.23.129 172.16.23.129 ml.machine_memory 3956293632
    node1  172.16.23.129 172.16.23.129 ml.max_open_jobs  20
    node1  172.16.23.129 172.16.23.129 xpack.installed   true
    node1  172.16.23.129 172.16.23.129 ml.enabled        true
    

     Now manually create an index named test:

    # curl -X PUT "localhost:9200/test"
    

     Then check the indices from each node:

    [root@master ~]# ansible all_nodes -m shell -a 'curl -X GET "localhost:9200/_cat/indices?v" -s'
     [WARNING]: Consider using the get_url or uri module rather than running curl.  If you need to use command because get_url or uri is insufficient you can add
    warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.
    
    172.16.23.128 | CHANGED | rc=0 >>
    health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    yellow open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      1.7kb          1.1kb
    
    172.16.23.130 | CHANGED | rc=0 >>
    health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    yellow open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      1.7kb          1.1kb
    
    172.16.23.129 | CHANGED | rc=0 >>
    health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    yellow open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      1.7kb          1.1kb
    

     Check the index's shards:

    [root@master ~]# curl -X GET "localhost:9200/_cat/shards?v"
    index shard prirep state      docs store ip            node
    test  3     p      STARTED       0  230b 172.16.23.128 master
    test  3     r      STARTED       0  230b 172.16.23.130 node2
    test  2     r      STARTED       0  230b 172.16.23.129 node1
    test  2     p      STARTED       0  230b 172.16.23.130 node2
    test  1     p      STARTED       0  230b 172.16.23.129 node1
    test  1     r      UNASSIGNED                          
    test  4     p      STARTED       0  230b 172.16.23.129 node1
    test  4     r      UNASSIGNED                          
    test  0     p      STARTED       0  230b 172.16.23.128 master
    test  0     r      STARTED       0  230b 172.16.23.130 node2
    

     As the output above shows, two shards are in the UNASSIGNED state. Check the cluster health:

    [root@master ~]# curl -X GET "localhost:9200/_cat/health?v"
    epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
    1546071645 08:20:45  estest  yellow          3         3      8   5    0    0        2             0                  -                 80.0%
    

     Use the following command to locate the problem shards and the reason they are unassigned:

    [root@master ~]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason -s| grep UNASSIGNED
    test 1 r UNASSIGNED INDEX_CREATED
    test 4 r UNASSIGNED INDEX_CREATED
    

     Get more detail on the shard allocation:

    [root@master ~]# curl -XGET localhost:9200/_cluster/allocation/explain?pretty
    {
      "index" : "test",
      "shard" : 1,
      "primary" : false,
      "current_state" : "unassigned",
      "unassigned_info" : {
        "reason" : "INDEX_CREATED",
        "at" : "2018-12-29T08:14:47.378Z",
        "last_allocation_status" : "no_attempt"
      },
      "can_allocate" : "no",
      "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
      "node_allocation_decisions" : [
        {
          "node_id" : "hVY-U_ocQueMtcryoGGbTg",
          "node_name" : "master",
          "transport_address" : "172.16.23.128:9300",
          "node_attributes" : {
            "ml.machine_memory" : "3956293632",
            "xpack.installed" : "true",
            "ml.max_open_jobs" : "20",
            "ml.enabled" : "true"
          },
          "node_decision" : "no",
          "weight_ranking" : 1,
          "deciders" : [
            {
              "decider" : "node_version",
              "decision" : "NO",
              "explanation" : "cannot allocate replica shard to a node with version [6.5.2] since this is older than the primary version [6.5.4]"
            }
          ]
        },
        {
          "node_id" : "q95yZ4W4Tj6PaXyzLZZYDQ",
          "node_name" : "node1",
          "transport_address" : "172.16.23.129:9300",
          "node_attributes" : {
            "ml.machine_memory" : "3956293632",
            "ml.max_open_jobs" : "20",
            "xpack.installed" : "true",
            "ml.enabled" : "true"
          },
          "node_decision" : "no",
          "weight_ranking" : 2,
          "deciders" : [
            {
              "decider" : "same_shard",
              "decision" : "NO",
              "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[test][1], node[q95yZ4W4Tj6PaXyzLZZYDQ], [P], s[STARTED], a[id=j7V8PBUvQnOZzISPAxK9Uw]]"
            }
          ]
        },
        {
          "node_id" : "_ADSWG04TEqNfX_88ejtzQ",
          "node_name" : "node2",
          "transport_address" : "172.16.23.130:9300",
          "node_attributes" : {
            "ml.machine_memory" : "3956293632",
            "ml.max_open_jobs" : "20",
            "xpack.installed" : "true",
            "ml.enabled" : "true"
          },
          "node_decision" : "no",
          "weight_ranking" : 3,
          "deciders" : [
            {
              "decider" : "node_version",
              "decision" : "NO",
              "explanation" : "cannot allocate replica shard to a node with version [6.5.2] since this is older than the primary version [6.5.4]"
            }
          ]
        }
      ]
    }
    

     The output above shows that the replica cannot be allocated because the node holding the primary runs a newer Elasticsearch version (6.5.4) than the other nodes (6.5.2). Check the installed packages:

    [root@master ~]# ansible all_nodes -m shell -a 'rpm -qa|grep elasticsearch'
     [WARNING]: Consider using the yum, dnf or zypper module rather than running rpm.  If you need to use command because yum, dnf or zypper is insufficient you can
    add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.
    
    172.16.23.128 | CHANGED | rc=0 >>
    elasticsearch-6.5.2-1.noarch
    
    172.16.23.130 | CHANGED | rc=0 >>
    elasticsearch-6.5.2-1.noarch
    
    172.16.23.129 | CHANGED | rc=0 >>
    elasticsearch-6.5.4-1.noarch
    

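     The original post does not say which direction the versions were aligned in, so here is only a hedged sketch: one way would be to upgrade the two 6.5.2 nodes (master and node2) to 6.5.4 with the ansible yum module, assuming a repository or local RPM providing 6.5.4 is available and both IPs are in the inventory:

    [root@master ~]# ansible 172.16.23.128,172.16.23.130 -m yum -a "name=elasticsearch-6.5.4 state=present"
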
     After replacing the mismatched version so that all nodes run the same release, start the es service again and observe the cluster and the shards:

    [root@master ~]# curl -X GET "localhost:9200/_cat/health?v"
    epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
    1546073143 08:45:43  estest  red             1         1      2   2    0    0        8             0                  -                 20.0%
    [root@master ~]# curl -X GET "localhost:9200/_cat/health?v"
    epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
    1546073274 08:47:54  estest  green           3         3     10   5    0    0        0             0                  -                100.0%
    [root@master ~]# curl -X GET "localhost:9200/_cat/shards?v"
    index shard prirep state   docs store ip            node
    test  3     p      STARTED    0  261b 172.16.23.128 master
    test  3     r      STARTED    0  261b 172.16.23.130 node2
    test  4     r      STARTED    0  261b 172.16.23.128 master
    test  4     p      STARTED    0  261b 172.16.23.129 node1
    test  2     r      STARTED    0  261b 172.16.23.129 node1
    test  2     p      STARTED    0  261b 172.16.23.130 node2
    test  1     p      STARTED    0  261b 172.16.23.129 node1
    test  1     r      STARTED    0  261b 172.16.23.130 node2
    test  0     p      STARTED    0  261b 172.16.23.128 master
    test  0     r      STARTED    0  261b 172.16.23.130 node2
    

     The index test consists of 10 shards: 5 primary shards and 5 replica shards. A replica shard is a copy of a primary shard; it provides fault tolerance and serves read requests. The number of primary shards is fixed when the index is created, while the number of replica shards can be changed at any time. The defaults are 5 primary shards and 1 replica:

    [root@master ~]# curl -XGET localhost:9200/test?pretty
    {
      "test" : {
        "aliases" : { },
        "mappings" : { },
        "settings" : {
          "index" : {
            "creation_date" : "1546071287243",
            "number_of_shards" : "5",
            "number_of_replicas" : "1",
            "uuid" : "l0Js1PJLTPSFEdXhanVSHA",
            "version" : {
              "created" : "6050299"
            },
            "provided_name" : "test"
          }
        }
      }
    }
    [root@master ~]# curl -XGET localhost:9200/_cat/indices?v
    health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    green  open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      2.5kb          1.2kb
    

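     Because the primary shard count is fixed at creation time, non-default counts have to be given when the index is created. A minimal sketch (the index name test2 and the counts 3/2 are just illustrative, not part of the original setup):

    [root@master ~]# curl -X PUT "localhost:9200/test2" -H 'Content-Type: application/json' -d'
    {
      "settings": {
        "index": {
          "number_of_shards": 3,
          "number_of_replicas": 2
        }
      }
    }'
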
     A primary shard and its own replica shard cannot be placed on the same node (otherwise, if that node went down, both the primary and its copy would be lost and there would be no fault tolerance), but a primary shard can share a node with the replica shards of other primary shards.

    For a reference on how nodes and shard counts are allocated, see: https://blog.csdn.net/qq_38486203/article/details/80077844

    Now let's go over some basic Elasticsearch concepts:

    1.cluster:

    A cluster consists of one or more nodes (Node), and every cluster has a cluster name as its identifier.

    2.node:

    A node is a single ES instance; one machine can run several instances, and a cluster is made up of multiple nodes. In most cases each node runs in its own environment or virtual machine.

    3.index:

    An index is a collection of documents.

    4.shard:

    ES is a distributed search engine: each index has one or more shards, and the index's data is distributed across those shards, like a bucket of water poured into N cups.

    Shards enable horizontal scaling. The N shards are spread as evenly as possible across the nodes (rebalance). For example, with 2 nodes and 4 primary shards (ignoring replicas), each node gets 2 shards; if you later add 2 more nodes, each of the 4 nodes ends up holding 1 shard. This process is called relocation, and ES performs it automatically once it detects the change.

    Shards are independent: for a single search request, every shard executes that request. Each shard is also a Lucene index, so a single shard can hold at most Integer.MAX_VALUE - 128 = 2,147,483,519 docs.

    5.replica:

    A replica can be thought of as a backup shard; correspondingly there is the primary shard.

    A primary shard and its replica are never placed on the same node (to avoid a single point of failure). By default an index is created with 5 primary shards and 1 replica (i.e. 5 primary + 5 replica = 10 shards).

    If you have only one node, none of the 5 replicas can be allocated (unassigned), and the cluster status becomes Yellow.

    The three states of an ES cluster:

    Green: all primary and replica shards are ready and successfully allocated. Even if one machine fails (assuming one instance per machine), no data is lost, but the status drops to yellow.

    Yellow: all primary shards are ready, but at least one primary (say A) has a replica that is not ready. The cluster is in a warning state, meaning reduced high availability and disaster tolerance. If the machine holding A then fails, and you configured only a single replica (which is still unassigned), A's data is lost (queries become incomplete) and the cluster turns Red.

    Red: at least one primary shard is not ready (essentially because no replica was available to be promoted to a new primary), so query results will be missing data (incomplete).

    Disaster tolerance: when a primary shard is lost, a replica is promoted to become the new primary, and a new replica is then created from it, so the cluster's data stays intact.

    Better query performance: a replica holds the same data as its primary, so a query can be served by either the primary or a replica; within reason, more replicas improve query performance (at the cost of more CPU, disk, and heap usage). Index requests, however, are handled only by primary shards; replicas cannot serve index requests.

    For an index, the number of primary shards (number_of_shards) cannot be changed without reindexing, but the number of replicas (number_of_replicas) can be adjusted at any time.
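
    As a minimal sketch of such an adjustment (assuming the test index built above), raising the replica count to 2 without touching the primaries would look like:

    [root@master ~]# curl -X PUT "localhost:9200/test/_settings" -H 'Content-Type: application/json' -d'
    {
      "index": {
        "number_of_replicas": 2
      }
    }'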

  • Original article: https://www.cnblogs.com/jsonhc/p/10197163.html