• Rapidly Deploying an HBase Cluster with Ansible


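    Background

    For data-security reasons, we built a low-cost time-series storage system in-house to hold historical market data.

    The system borrows InfluxDB's columnar storage and compression strategies and relies on HBase for mass storage capacity.

    Our operations colleagues had no experience running the Hadoop stack, so as the developer I had to take on the ops role temporarily and plan and carry out the deployment myself.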

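    Choosing a Hadoop Distribution

    There are not many options to choose from. The main ones are:

    • CDH: currently the first choice for small and medium-sized companies
    • Ambari: the most flexible and customizable distribution
    • Apache: the original, unmodified distribution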

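    Drawbacks of CDH:

    • Its Hadoop components are old and do not support newer APIs
    • The JDK version is restricted, so it cannot benefit from the performance gains of newer JDKs
    • It carries many known but unfixed bugs, which is a liability for later operations
    • Newer CDH releases are no longer free, so there is no free upgrade path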

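    Drawbacks of Ambari:

    • Documentation is sparse and the build is difficult (the front-end dependencies are outdated, and the build fails outright)
    • The project has been retired and will no longer be maintained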

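    Drawbacks of Apache:

    • The deployment process is complex, and version compatibility can be a pitfall
    • Monitoring is incomplete; building it yourself takes some hands-on skill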

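    Final Decision

    Our constraints:

    • Strict compliance requirements: licensing disputes must be avoided
    • A small cluster, with fewer than 50 nodes
    • No in-house Hadoop development capability, so we cannot fix bugs ourselves
    • Query performance must be guaranteed, ideally with ZGC or ShenandoahGC

    We ultimately decided to build the HBase cluster on the original Apache distribution.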

    Version Selection

    HBase Components

    The chosen versions are:

    • Adoptium JDK
    • HBase 2.4.11 (JDK 17)
    • Hadoop 3.2.3 (JDK 8)
    • Zookeeper 3.6.3 (JDK 17)

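    Hadoop Version

    Starting with Hadoop 3.3.x, the native snappy and lz4 libraries are no longer used (related link), and the latest stable HBase line, 2.4.x, has not yet adapted to this change, so we chose 3.2.x.

    Hadoop 3.2.x, however, depends on the Zookeeper 3.4.14 client, which cannot run on JDK 14 or later (reference case), so Hadoop is deployed on JDK 8.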

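    Zookeeper Version

    Zookeeper 3.6.x is the lowest release line that ships with built-in Prometheus metrics, and newer Zookeeper servers remain compatible with older clients, so we chose this version. It also already supports deployment on JDK 11, so we can confidently move the runtime up to JDK 17.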

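    JDK Distribution

    JDK 17 is the first LTS release in which ZGC is production-ready (it was still experimental in JDK 11). Because Oracle JDK 17 does not currently support ShenandoahGC, we ultimately chose Adoptium JDK. Someone online has shared their experience deploying the CDH build of HBase on JDK 15, which requires applying a patch; see the appendix for the detailed steps.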

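    Operations Tools

    To offset the operational difficulty of the Apache distribution, we rely on two efficient open-source operations tools:

    Ansible

    A simple, easy-to-use automated deployment tool

    • Supports idempotent deployment, reducing the chance of errors during deployment
    • Communicates over ssh, so it is minimally invasive and needs no agent installed
    • Playbooks document operational procedures, making handover easier

    The dividing line for Ansible versions is 2.9.x, the last release that supports Python 2.x. To fit the existing operations environment, we chose that version.

    If circumstances allow, though, it is still better to upgrade to Python 3.x and use a newer Ansible release, since some bugs are fixed only in newer versions and are not backported.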
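
    The idempotency mentioned above can be illustrated with a small Python sketch. It mimics, under simplifying assumptions, how a marker-delimited block (the approach the `blockinfile` tasks later in this article rely on) can be applied repeatedly without changing the result. The function name `ensure_block` and the sample file content are hypothetical, and this is not Ansible's actual implementation.

```python
def ensure_block(text, block, name):
    """Insert or replace a marker-delimited block; safe to call repeatedly."""
    begin = f"# BEGIN {name}"
    end = f"# END {name}"
    lines = text.splitlines()
    managed = [begin, *block.splitlines(), end]
    if begin in lines and end in lines:
        start, stop = lines.index(begin), lines.index(end)
        lines[start:stop + 1] = managed  # replace the existing block in place
    else:
        lines.extend(managed)  # first run: append the block at the end
    return "\n".join(lines) + "\n"


# Hypothetical starting content of /etc/hosts.
original = "127.0.0.1 localhost\n"
block = "172.20.72.1 my.hadoop1\n172.20.72.2 my.hadoop2"

once = ensure_block(original, block, "ANSIBLE MANAGED HOSTNAME")
twice = ensure_block(once, block, "ANSIBLE MANAGED HOSTNAME")
print(once == twice)  # running the same step twice leaves the file unchanged
```

    Because the second run finds the markers and rewrites the same span, re-running a playbook after a partial failure does not duplicate entries.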

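    Prometheus

    A new-generation monitoring and alerting platform

    • Its distinctive PromQL provides flexible and efficient querying
    • It comes with its own TSDB and AlertManager, so the deployment architecture is simple
    • A rich ecosystem of components
      • JMX Exporter exposes the monitoring metrics
      • Grafana visualizes the monitoring metrics

    With no legacy baggage here, we can simply choose the latest version.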

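    Configuration in Detail

    To keep configuration changes traceable, I created a Git project to maintain the deployment scripts. The project's directory layout is as follows: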

    .
    ├── hosts
    ├── ansible.cfg
    ├── book
    │   ├── config-hadoop.yml
    │   ├── config-hbase.yml
    │   ├── config-metrics.yml
    │   ├── config-zk.yml
    │   ├── install-hadoop.yml
    │   ├── sync-host.yml
    │   └── vars.yml
    ├── conf
    │   ├── hadoop
    │   │   ├── core-site.xml
    │   │   ├── hdfs-site.xml
    │   │   ├── mapred-site.xml
    │   │   ├── workers
    │   │   └── yarn-site.xml
    │   ├── hbase
    │   │   ├── backup-masters
    │   │   ├── hbase-site.xml
    │   │   └── regionservers
    │   ├── metrics
    │   │   ├── exports
    │   │   │   ├── hmaster.yml
    │   │   │   ├── jmx_exporter.yml
    │   │   │   └── regionserver.yml
    │   │   └── targets
    │   │       ├── hadoop-cluster.yml
    │   │       ├── hbase-cluster.yml
    │   │       └── zk-cluster.yml
    │   └── zk
    │       ├── myid
    │       └── zoo.cfg
    └── repo
        ├── hadoop
        │   ├── apache-zookeeper-3.6.3-bin.tar.gz
        │   ├── hadoop-3.2.3.tar.gz
        │   ├── hbase-2.4.11-bin.tar.gz
        │   ├── hbase-2.4.11-src.tar.gz
        │   ├── hbase-server-2.4.11.jar
        │   ├── OpenJDK17U-jdk_x64_linux_hotspot_17.0.2_8.tar.gz
        │   ├── OpenJDK8U-jdk_x64_linux_hotspot_8u322b06.tar.gz
        │   └── repo.md5
        └── metrics
            └── jmx_prometheus_javaagent-0.16.1.jar
    

    Purpose of each directory

    • repo: binary files used for deployment
    • book: ansible-playbook automation scripts
    • conf: configuration templates for the HBase components

    hosts file

    Hosts are grouped to make cluster planning easier:

    [newborn]
    
    [nodes]
    172.20.72.1 hostname='my.hadoop1 my.hbase1 my.zk1'
    172.20.72.2 hostname='my.hadoop2 my.hbase2 my.zk2'
    172.20.72.3 hostname='my.hadoop3 my.hbase3 my.zk3'
    172.20.72.4 hostname='my.hadoop4 my.hbase4'
    
    [zk_nodes]
    my.zk1 ansible_host=172.30.73.209 myid=1
    my.zk2 ansible_host=172.30.73.210 myid=2
    my.zk3 ansible_host=172.30.73.211 myid=3
    
    [hadoop_nodes]
    my.hadoop[1:4]
    
    [namenodes]
    my.hadoop1 id=nn1 rpc_port=8020 http_port=9870
    my.hadoop2 id=nn2 rpc_port=8020 http_port=9870
    
    [datanodes]
    my.hadoop[1:4]
    
    [journalnodes]
    my.hadoop1 journal_port=8485
    my.hadoop2 journal_port=8485
    my.hadoop3 journal_port=8485
    
    [resourcemanagers]
    my.hadoop3 id=rm1 peer_port=8032 tracker_port=8031 scheduler_port=8030 web_port=8088
    my.hadoop4 id=rm2 peer_port=8032 tracker_port=8031 scheduler_port=8030 web_port=8088
    
    [hbase_nodes]
    my.hbase[1:4]
    
    [hmasters]
    my.hbase[1:2]
    
    [regionservers]
    my.hbase[1:4]
    
    [all:vars]
    ansible_user=admin
    deploy_dir=/opt
    data_dir=/data
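
    The inventory above uses range patterns such as `my.hadoop[1:4]`. Unlike Python slices, both ends of an Ansible range are inclusive. The sketch below is a simplified illustration of that expansion; the helper `expand_range` is hypothetical and does not handle zero-padded or alphabetic ranges.

```python
import re


def expand_range(pattern):
    """Expand an inventory pattern such as 'my.hadoop[1:4]' (both ends inclusive)."""
    match = re.fullmatch(r"(.*)\[(\d+):(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # plain hostname, nothing to expand
    head, start, stop, tail = match.groups()
    return [f"{head}{i}{tail}" for i in range(int(start), int(stop) + 1)]


print(expand_range("my.hadoop[1:4]"))
print(expand_range("my.hbase[1:2]"))
```

    So `[hadoop_nodes]` contains four hosts and `[hmasters]` contains two.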
    

    ansible.cfg file

    Basic Ansible configuration file:

    [defaults]
    inventory      = ./hosts
    host_key_checking = False
    

    conf directory

    conf/zk directory

    zoo.cfg

    # Basic time unit (tick) in milliseconds; also the heartbeat interval between ZK and its clients
    tickTime=2000
    # Timeout for a Follower to connect and sync to the Leader, in ticks
    initLimit=30
    # Timeout for communication between Leader and Follower, in ticks
    syncLimit=5
    # Snapshot directory
    dataDir={{ zk_data_dir }}
    # WAL directory; ideally on a dedicated, otherwise idle device (SSD recommended)
    dataLogDir={{ zk_data_log_dir }}
    # Use the default client port
    clientPort=2181
    # Raise the maximum number of connections
    maxClientCnxns=2000
    # Enable Prometheus metrics
    metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
    metricsProvider.httpHost={{ ansible_host | default(inventory_hostname) }}
    metricsProvider.httpPort=7000
    metricsProvider.exportJvmInfo=true
    # Cluster membership
    # server.{myid}={server-address}:{rpc-port}:{election-port}
    {% for host in groups['zk_nodes'] %}
    server.{{ hostvars[host]['myid'] }}={{ hostvars[host]['ansible_host'] }}:2888:3888
    {% endfor %}
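
    To make the loop at the end of the template concrete, the following Python sketch reproduces what it should render for the `[zk_nodes]` group of the hosts file above. It is a plain-Python imitation of the Jinja2 loop, not a call into Ansible; the function name `render_servers` is hypothetical.

```python
# Inventory data copied from the [zk_nodes] group of the hosts file.
zk_nodes = {
    "my.zk1": {"ansible_host": "172.30.73.209", "myid": 1},
    "my.zk2": {"ansible_host": "172.30.73.210", "myid": 2},
    "my.zk3": {"ansible_host": "172.30.73.211", "myid": 3},
}


def render_servers(nodes):
    """Mirror the {% for host in groups['zk_nodes'] %} loop of zoo.cfg."""
    return "\n".join(
        f"server.{v['myid']}={v['ansible_host']}:2888:3888" for v in nodes.values()
    )


print(render_servers(zk_nodes))
```

    Each node's `myid` file must hold the same number that appears after `server.` on its own line, which is why both values come from the same inventory variable.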
    

    myid

    {{ myid }}
    

    conf/hadoop directory

    core-site.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <!-- NameNode address (the nameservice name is used in place of a single host) -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://{{ hdfs_name }}</value>
      </property>
      <!-- Data storage directory -->
      <property>
        <name>hadoop.tmp.dir</name>
        <value>{{ hadoop_data_dir }}</value>
      </property>
      <!-- Static user for the Web UI (the default user dr.who cannot upload files) -->
      <property>
         <name>hadoop.http.staticuser.user</name>
         <value>{{ ansible_user }}</value>
      </property>
      <!-- ZK quorum required by DFSZKFailoverController -->
      <property>
        <name>ha.zookeeper.quorum</name>
        <value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
      </property>
    </configuration>
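
    The `ha.zookeeper.quorum` value is built with a filter chain that appends the client port to every host and joins the results with commas. A rough Python equivalent, assuming the three hosts of the `[zk_nodes]` group, is sketched below; `zk_quorum` is a hypothetical helper name.

```python
import re

# Hostnames taken from the [zk_nodes] group of the hosts file.
zk_nodes = ["my.zk1", "my.zk2", "my.zk3"]


def zk_quorum(hosts, port=2181):
    """Same idea as map('regex_replace', '^(.+)$', '\\1:2181') | join(',')."""
    return ",".join(re.sub(r"^(.+)$", rf"\1:{port}", host) for host in hosts)


print(zk_quorum(zk_nodes))
```

    The same expression is reused for `yarn.resourcemanager.zk-address` and `hbase.zookeeper.quorum`, so all three settings stay in sync with the inventory.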
    

    hdfs-site.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
     <!-- NameNode data storage directory -->
     <property>
       <name>dfs.namenode.name.dir</name>
       <value>file://${hadoop.tmp.dir}/name</value>
     </property>
     <!-- DataNode data storage directory -->
     <property>
       <name>dfs.datanode.data.dir</name>
       <value>file://${hadoop.tmp.dir}/data</value>
     </property>
     <!-- JournalNode data storage directory (absolute path; must not carry the file:// prefix) -->
     <property>
       <name>dfs.journalnode.edits.dir</name>
       <value>${hadoop.tmp.dir}/journal</value>
     </property>
     <!-- HDFS nameservice name -->
     <property>
       <name>dfs.nameservices</name>
       <value>{{ hdfs_name }}</value>
     </property>
     <!-- List of NameNode IDs in the nameservice -->
     <property>
       <name>dfs.ha.namenodes.{{hdfs_name}}</name>
       <value>{{ groups['namenodes'] | map('extract', hostvars) | map(attribute='id') | join(',') }}</value>
     </property>
     <!-- NameNode RPC addresses -->
     {% for host in groups['namenodes'] %}
     <property>
       <name>dfs.namenode.rpc-address.{{hdfs_name}}.{{hostvars[host]['id']}}</name>
       <value>{{host}}:{{hostvars[host]['rpc_port']}}</value>
     </property>
     {% endfor %}
     <!-- NameNode HTTP addresses -->
     {% for host in groups['namenodes'] %}
     <property>
       <name>dfs.namenode.http-address.{{hdfs_name}}.{{hostvars[host]['id']}}</name>
        <value>{{host}}:{{hostvars[host]['http_port']}}</value>
     </property>
     {% endfor %}
     <!-- Location of the NameNode edit log on the JournalNodes -->
     <property>
       <name>dfs.namenode.shared.edits.dir</name>
       <value>qjournal://{{groups['journalnodes'] | zip( groups['journalnodes']|map('extract', hostvars)|map(attribute='journal_port') )| map('join', ':') | join(';') }}/{{hdfs_name}}</value>
     </property>
     <!-- Failover proxy class (clients use it to locate the Active NameNode) -->
     <property>
       <name>dfs.client.failover.proxy.provider.{{hdfs_name}}</name>
       <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
     </property>
     <!-- Fencing method (ensures that only one Active NameNode exists) -->
     <property>
       <name>dfs.ha.fencing.methods</name>
       <value>sshfence</value>
     </property>
     <!-- Private key used by SSH fencing -->
     <property>
       <name>dfs.ha.fencing.ssh.private-key-files</name>
       <value>/home/{{ ansible_user }}/.ssh/id_rsa</value>
     </property>
     <!-- Enable automatic failover -->
     <property>
        <name>dfs.ha.automatic-failover.enabled</name>
       <value>true</value>
     </property>
     <!-- Number of NameNode handler threads -->
     <property>
       <name>dfs.namenode.handler.count</name>
       <value>21</value>
     </property>
    </configuration>
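
    The `dfs.namenode.shared.edits.dir` expression is the densest one in this template: it zips each JournalNode host with its `journal_port`, joins every pair with `:`, and joins the pairs with `;`. The Python sketch below imitates that chain for the `[journalnodes]` group and `hdfs_name: my-hdfs`; the helper name `qjournal_uri` is hypothetical.

```python
# Host -> journal_port, copied from the [journalnodes] group of the hosts file.
journalnodes = {"my.hadoop1": 8485, "my.hadoop2": 8485, "my.hadoop3": 8485}
hdfs_name = "my-hdfs"  # from book/vars.yml


def qjournal_uri(nodes, nameservice):
    """Imitate: zip(hosts, ports) | map('join', ':') | join(';')."""
    hosts = list(nodes)
    ports = [nodes[h] for h in hosts]
    pairs = [":".join(map(str, pair)) for pair in zip(hosts, ports)]
    return "qjournal://" + ";".join(pairs) + "/" + nameservice


print(qjournal_uri(journalnodes, hdfs_name))
```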
    

    yarn-site.xml

    <?xml version="1.0"?>
    <configuration>
     <!-- Enable ResourceManager HA -->
     <property>
       <name>yarn.resourcemanager.ha.enabled</name>
       <value>true</value>
     </property>  
     <!-- YARN cluster name -->
     <property>
       <name>yarn.resourcemanager.cluster-id</name>
       <value>{{yarn_name}}</value>
     </property>  
     <!-- List of ResourceManager IDs -->
     <property>
       <name>yarn.resourcemanager.ha.rm-ids</name>
       <value>{{ groups['resourcemanagers'] | map('extract', hostvars) | map(attribute='id') | join(',') }}</value>
     </property>  
     <!-- ResourceManager hostnames -->
     {% for host in groups['resourcemanagers'] %}
     <property>
       <name>yarn.resourcemanager.hostname.{{hostvars[host]['id']}}</name>
       <value>{{host}}</value>
     </property>
     {% endfor %}
     <!-- ResourceManager applications-manager address (used by clients to submit applications) -->
     {% for host in groups['resourcemanagers'] %}
     <property>
         <name>yarn.resourcemanager.address.{{hostvars[host]['id']}}</name>
         <value>{{host}}:{{hostvars[host]['peer_port']}}</value>
     </property>
     {% endfor %}
     <!-- Address NodeManagers use to reach the ResourceManager -->
     {% for host in groups['resourcemanagers'] %}
     <property>
         <name>yarn.resourcemanager.resource-tracker.address.{{hostvars[host]['id']}}</name>
         <value>{{host}}:{{hostvars[host]['tracker_port']}}</value>
     </property>
     {% endfor %}
     <!-- Address ApplicationMasters use to request resources from the ResourceManager -->
     {% for host in groups['resourcemanagers'] %}
     <property>
         <name>yarn.resourcemanager.scheduler.address.{{hostvars[host]['id']}}</name>
         <value>{{host}}:{{hostvars[host]['scheduler_port']}}</value>
     </property>
     {% endfor %}
     <!-- ResourceManager Web UI address -->
     {% for host in groups['resourcemanagers'] %}
     <property>
         <name>yarn.resourcemanager.webapp.address.{{hostvars[host]['id']}}</name>
         <value>{{host}}:{{hostvars[host]['web_port']}}</value>
     </property>
     {% endfor %}
     <!-- Enable ResourceManager state recovery after restart or failover -->
     <property>
       <name>yarn.resourcemanager.recovery.enabled</name>
       <value>true</value>
     </property>
     <!-- Zookeeper quorum -->
     <property>
       <name>yarn.resourcemanager.zk-address</name>
       <value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
     </property>
     <!-- Store state in the Zookeeper cluster -->
     <property>
       <name>yarn.resourcemanager.store.class</name>
       <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
     </property>
     <!-- Reduce the number of threads the ResourceManager uses to handle client requests -->
     <property>
       <name>yarn.resourcemanager.scheduler.client.thread-count</name>
       <value>10</value>
     </property>  
     <!-- Disable NodeManager hardware auto-detection (the nodes are not dedicated) -->
     <property>
       <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
       <value>false</value>
     </property>
     <!-- Number of CPU vcores the NodeManager can allocate to containers -->
     <property>
       <name>yarn.nodemanager.resource.cpu-vcores</name>
       <value>4</value>
     </property>
     <!-- Count physical cores rather than logical processors (optional) -->
     <property>
       <name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
       <value>false</value>
     </property>  
     <!-- Reduce the memory available to the NodeManager -->
     <property>
       <name>yarn.nodemanager.resource.memory-mb</name>
       <value>4096</value>
     </property>  
     <!-- Container memory lower bound -->
     <property>
       <name>yarn.scheduler.minimum-allocation-mb</name>
       <value>1024</value>
     </property>  
     <!-- Container memory upper bound -->
     <property>
       <name>yarn.scheduler.maximum-allocation-mb</name>
       <value>2048</value>
     </property>  
     <!-- Container CPU lower bound -->
     <property>
       <name>yarn.scheduler.minimum-allocation-vcores</name>
       <value>1</value>
     </property>  
     <!-- Container CPU upper bound -->
     <property>
       <name>yarn.scheduler.maximum-allocation-vcores</name>
       <value>2</value>
     </property>  
     <!-- Disable the virtual memory check -->
     <property>
       <name>yarn.nodemanager.vmem-check-enabled</name>
       <value>false</value>
     </property>
     <!-- Ratio of virtual memory to physical memory -->
     <property>
       <name>yarn.nodemanager.vmem-pmem-ratio</name>
       <value>2.1</value>
     </property>
     <!-- Shuffle service used by the NodeManager for MapReduce (optional) -->
     <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
     </property>  
    </configuration>
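
    The resource limits above imply a small, fixed number of containers per NodeManager. The arithmetic can be sketched in Python as follows. It assumes the scheduler accounts for both memory and vcores and ignores any other overhead, so treat the numbers as an upper bound; `containers_per_node` is a hypothetical helper.

```python
# Values copied from yarn-site.xml above.
node_memory_mb = 4096      # yarn.nodemanager.resource.memory-mb
node_vcores = 4            # yarn.nodemanager.resource.cpu-vcores
min_alloc_mb, max_alloc_mb = 1024, 2048        # scheduler min/max allocation-mb
min_alloc_vcores, max_alloc_vcores = 1, 2      # scheduler min/max allocation-vcores


def containers_per_node(container_mb, container_vcores):
    """Containers that fit on one node when both memory and CPU are enforced."""
    return min(node_memory_mb // container_mb, node_vcores // container_vcores)


print(containers_per_node(min_alloc_mb, min_alloc_vcores))  # smallest containers
print(containers_per_node(max_alloc_mb, max_alloc_vcores))  # largest containers
```

    With these settings a node runs at most four minimum-size containers or two maximum-size ones, which keeps YARN from competing with HBase on shared nodes.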
    

    mapred-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
     <!-- Run MapReduce on YARN -->
     <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
     </property>
     <!-- MapReduce classpath -->
     <property>
       <name>yarn.app.mapreduce.am.env</name>
       <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
     </property>
     <property>
       <name>mapreduce.map.env</name>
       <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
     </property>
     <property>
       <name>mapreduce.reduce.env</name>
       <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
     </property>
     <!-- MapReduce JVM options (line breaks are not allowed) -->
     <property>
       <name>yarn.app.mapreduce.am.command-opts</name>
       <value>-Xmx1024m --add-opens java.base/java.lang=ALL-UNNAMED</value>
     </property>
     <property>
       <name>mapred.child.java.opts</name>
       <value>--add-opens java.base/java.lang=ALL-UNNAMED -verbose:gc -Xloggc:/tmp/@taskid@.gc</value>
     </property>
    </configuration>
    

    workers

    {% for host in groups['datanodes'] %}
    {{ host }}
    {% endfor %}
    

    conf/hbase directory

    hbase-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>hbase.tmp.dir</name>
        <value>./tmp</value>
      </property>
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://{{ hdfs_name }}/hbase</value>
      </property>
      <property>
        <name>hbase.master.maxclockskew</name>
        <value>180000</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.zookeeper.quorum</name>
          <value>{{ groups['zk_nodes'] | map('regex_replace','^(.+)$','\\1:2181') | join(',') }}</value>
      </property>
    </configuration>
    

    regionservers

    {% for host in groups['regionservers'] %}
    {{ host }}
    {% endfor %}
    

    backup-masters

    {% for host in groups['hmasters'][1:] %}
    {{ host }}
    {% endfor %}
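
    The `[1:]` slice in the backup-masters template drops the first host of the `[hmasters]` group, presumably because that node is where the primary HMaster is started. A minimal Python illustration with the two hosts from the inventory:

```python
# Hosts of the [hmasters] group, i.e. my.hbase[1:2] expanded.
hmasters = ["my.hbase1", "my.hbase2"]

# Same slice as groups['hmasters'][1:] in the template.
backup_masters = hmasters[1:]

print("\n".join(backup_masters))
```

    With only two HMasters the rendered file therefore contains a single line, `my.hbase2`.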
    

    conf/metrics/exports directory

    jmx_exporter.yml

    ---
    # https://github.com/prometheus/jmx_exporter
    startDelaySeconds: 5
    ssl: false
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules: 
    # ignore service
    - pattern: Hadoop<service=(\w+), name=([\w-.]+), sub=(\w+)><>([\w._]+)
      name: $4
      labels:
        name: "$2"
        group: "$3"
      attrNameSnakeCase: true
    # ignore service
    - pattern: Hadoop<service=(\w+), name=(\w+)-([^<]+)><>([\w._]+)
      name: $4
      labels:
        name: "$2"
        entity: "$3"
      attrNameSnakeCase: true
    # ignore service
    - pattern: Hadoop<service=(\w+), name=([^<]+)><>([\w._]+)
      name: $3
      labels:
        name: "$2"
      attrNameSnakeCase: true
    - pattern: .+
    

    hmaster.yml

    ---
    startDelaySeconds: 5
    ssl: false
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    blacklistObjectNames:
    - "Hadoop:service=HBase,name=JvmMetrics*"
    - "Hadoop:service=HBase,name=RegionServer,*"
    rules:
    - pattern: Hadoop<service=HBase, name=Master, sub=(\w+)><>([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
      name: $2
      labels:
        group: "$1"
        stat: "$3"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=Master, sub=(\w+)><>([\w._]+)
      name: $2
      labels:
        group: "$1"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=Master><>([\w._]+)
      name: $1
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=(\w+), sub=(\w+)><>([\w._]+)
      name: $3
      labels:
        name: "$1"
        group: "$2"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=(\w+)><>([\w._]+)
      name: $2
      labels:
        name: "$1"
      attrNameSnakeCase: true
    - pattern: .+
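
    To see how the first hmaster.yml rule splits a metric name from its statistic suffix, the Python sketch below applies the same regular expression to one input line. The sample line is a hypothetical, already snake-cased attribute; the exporter's own name sanitizing, lowercasing, and `attrNameSnakeCase` handling are not modeled here.

```python
import re

STATS = (
    "num_ops|min|max|mean|median|25th_percentile|75th_percentile|"
    "90th_percentile|95th_percentile|98th_percentile|99th_percentile|"
    "99.9th_percentile"
)
# Same pattern as the first rule in hmaster.yml.
RULE = re.compile(
    r"Hadoop<service=HBase, name=Master, sub=(\w+)><>([\w._]+)_(" + STATS + ")"
)


def parse(line):
    """Return the metric name and labels the rule would derive, or None."""
    match = RULE.search(line)
    if match is None:
        return None
    group, name, stat = match.groups()
    return {"name": name, "labels": {"group": group, "stat": stat}}


# Hypothetical input line, for illustration only.
sample = "Hadoop<service=HBase, name=Master, sub=IPC><>queue_call_time_99th_percentile"
print(parse(sample))
```

    The greedy `([\w._]+)` backtracks until the remainder matches one of the listed suffixes, so the statistic ends up in the `stat` label instead of being baked into the metric name.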
    

    regionserver.yml

    ---
    startDelaySeconds: 5
    ssl: false
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    blacklistObjectNames:
    - "Hadoop:service=HBase,name=JvmMetrics*"
    - "Hadoop:service=HBase,name=Master,*"
    rules:
    - pattern: Hadoop<service=HBase, name=RegionServer, sub=Regions><>namespace_([\w._]+)_table_([\w._]+)_region_(\w+)_metric_([\w._]+)
      name: $4
      labels:
        group: Regions
        namespace: "$1"
        table: "$2"
        region: "$3"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=RegionServer, sub=Tables><>namespace_([\w._]+)_table_([\w._]+)_columnfamily_([\w._]+)_metric_([\w._]+)
      name: $4
      labels:
        group: Tables
        namespace: "$1"
        table: "$2"
        column_family: "$3"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>namespace_([\w._]+)_table_([\w._]+)_metric_([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
      name: $4
      labels:
        group: "$1"
        namespace: "$2"
        table: "$3"
        stat: "$5"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>namespace_([\w._]+)_table_([\w._]+)_metric_([\w._]+)
      name: $4
      labels:
        group: "$1"
        namespace: "$2"
        table: "$3"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>([\w._]+)_(num_ops|min|max|mean|median|25th_percentile|75th_percentile|90th_percentile|95th_percentile|98th_percentile|99th_percentile|99.9th_percentile)
      name: $2
      labels:
        group: "$1"
        stat: "$3"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=RegionServer, sub=(\w+)><>([\w._]+)
      name: $2
      labels:
        group: "$1"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=(\w+), sub=(\w+)><>([\w._]+)
      name: $3
      labels:
        name: "$1"
        group: "$2"
      attrNameSnakeCase: true
    - pattern: Hadoop<service=HBase, name=(\w+)><>([\w._]+)
      name: $2
      labels:
        name: "$1"
      attrNameSnakeCase: true
    - pattern: .+
    

    conf/metrics/targets directory

    zk-cluster.yml

    - targets:
    {% for host in groups['zk_nodes'] %}
      - {{ host }}:7000
    {% endfor %}
      labels:
        service: zookeeper
    

    hadoop-cluster.yml

    - targets:
    {% for host in groups['namenodes'] %}
      - {{ host }}:{{ namenode_metrics_port }}
    {% endfor %}
      labels:
        role: namenode
        service: hdfs
    
    - targets:
    {% for host in groups['datanodes'] %}
      - {{ host }}:{{ datanode_metrics_port }}
    {% endfor %}
      labels:
        role: datanode
        service: hdfs
    
    - targets:
    {% for host in groups['journalnodes'] %}
      - {{ host }}:{{ journalnode_metrics_port }}
    {% endfor %}
      labels:
        role: journalnode
        service: hdfs
    
    - targets:
    {% for host in groups['resourcemanagers'] %}
      - {{ host }}:{{ resourcemanager_metrics_port }}
    {% endfor %}
      labels:
        role: resourcemanager
        service: yarn
    
    - targets:
    {% for host in groups['datanodes'] %}
      - {{ host }}:{{ nodemanager_metrics_port }}
    {% endfor %}
      labels:
        role: nodemanager
        service: yarn
    

    hbase-cluster.yml

    - targets:
    {% for host in groups['hmasters'] %}
      - {{ host }}:{{ hmaster_metrics_port }}
    {% endfor %}
      labels:
        role: hmaster
        service: hbase
    
    - targets:
    {% for host in groups['regionservers'] %}
      - {{ host }}:{{ regionserver_metrics_port }}
    {% endfor %}
      labels:
        role: regionserver
        service: hbase
    

    book directory

    vars.yml

    hdfs_name: my-hdfs
    yarn_name: my-yarn
    

    sync-host.yml

    ---
    - name: Config Hostname & SSH Keys
      hosts: nodes  
      connection: local
      gather_facts: no
      any_errors_fatal: true
    
      vars:
        hostnames: |
          {% for h in groups['nodes'] if hostvars[h].hostname is defined %}{{h}} {{ hostvars[h].hostname }}
          {% endfor %}
    
      tasks:
    
        - name: test connectivity
          ping:
          connection: ssh
    
        - name: change local hostname 
          become: true
          blockinfile:  
            dest: '/etc/hosts'
            marker: "# {mark} ANSIBLE MANAGED HOSTNAME"
            block: '{{ hostnames }}'
          run_once: true
    
        - name: sync remote hostname 
          become: true
          blockinfile:  
            dest: '/etc/hosts'
            marker: "# {mark} ANSIBLE MANAGED HOSTNAME"
            block: '{{ hostnames }}'
          connection: ssh
    
        - name: fetch exist status
          stat:
            path: '~/.ssh/id_rsa'
          register: ssh_key_path
          connection: ssh
    
        - name: generate ssh key
          openssh_keypair:
            path: '~/.ssh/id_rsa'
            comment: '{{ ansible_user }}@{{ inventory_hostname }}'
            type: rsa
            size: 2048
            state: present
            force: no
          connection: ssh
          when: not ssh_key_path.stat.exists
    
        - name: collect ssh key
          command: ssh {{ansible_user}}@{{ansible_host|default(inventory_hostname)}} 'cat ~/.ssh/id_rsa.pub'
          register: host_keys  # cache data in hostvars[hostname].host_keys
          changed_when: false
    
        - name: create temp file
          tempfile:
            state: file
            suffix: _keys
          register: temp_ssh_keys
          changed_when: false
          run_once: true
    
        - name: save ssh key ({{temp_ssh_keys.path}})
          blockinfile:  
            dest: "{{temp_ssh_keys.path}}"  
            block: |  
              {% for h in groups['nodes'] if hostvars[h].host_keys is defined %}  
              {{ hostvars[h].host_keys.stdout }}  
              {% endfor %}  
          changed_when: false
          run_once: true
    
        - name: deploy ssh key
          vars:
            ssh_keys: "{{ lookup('file', temp_ssh_keys.path).split('\n') | select('match', '^ssh') | join('\n') }}"
          authorized_key:
            user: "{{ ansible_user }}"
            key: "{{ ssh_keys }}"
            state: present
          connection: ssh
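
    The `hostnames` variable defined at the top of sync-host.yml produces the lines that are written into `/etc/hosts`: one line per entry in `[nodes]`, consisting of the IP address followed by that node's aliases. A Python imitation of the loop, using the inventory values from the hosts file, might look like this (the function name `hosts_block` is hypothetical):

```python
# IP -> hostname aliases, copied from the [nodes] group of the hosts file.
nodes = {
    "172.20.72.1": "my.hadoop1 my.hbase1 my.zk1",
    "172.20.72.2": "my.hadoop2 my.hbase2 my.zk2",
    "172.20.72.3": "my.hadoop3 my.hbase3 my.zk3",
    "172.20.72.4": "my.hadoop4 my.hbase4",
}


def hosts_block(entries):
    """Mirror the Jinja2 loop: '<ip> <aliases>' for every node with a hostname."""
    return "\n".join(f"{ip} {names}" for ip, names in entries.items() if names)


print(hosts_block(nodes))
```

    Since every role alias (`my.hadoopN`, `my.hbaseN`, `my.zkN`) resolves to the same machine, the other inventory groups can refer to hosts by role without extra DNS setup.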
    
    

    install-hadoop.yml

    ---
    - name: Install Hadoop Package
      hosts: newborn
      gather_facts: no
      any_errors_fatal: true
    
      vars:
        local_repo: '../repo/hadoop'
        remote_repo: '~/repo/hadoop'
        package_info:
          - {src: 'OpenJDK17U-jdk_x64_linux_hotspot_17.0.2_8.tar.gz', dst: 'java/jdk-17.0.2+8', home: 'jdk17'}
          - {src: 'OpenJDK8U-jdk_x64_linux_hotspot_8u322b06.tar.gz', dst: 'java/jdk8u322-b06', home: 'jdk8'}
          - {src: 'apache-zookeeper-3.6.3-bin.tar.gz', dst: 'apache/zookeeper-3.6.3', home: 'zookeeper'}
          - {src: 'hbase-2.4.11-bin.tar.gz', dst: 'apache/hbase-2.4.11',home: 'hbase'}
          - {src: 'hadoop-3.2.3.tar.gz', dst: 'apache/hadoop-3.2.3', home: 'hadoop'}
    
      tasks:
    
        - name: test connectivity
          ping:
    
        - name: copy hadoop package
          copy:
              src: '{{ local_repo }}'
              dest: '~/repo'
    
        - name: prepare directory
          become: true # become root
          file:
            state: directory
            path: '{{ deploy_dir }}/{{ item.dst }}'
            owner: '{{ ansible_user }}'
            group: '{{ ansible_user }}'
            mode: 0775
            recurse: yes
          with_items: '{{ package_info }}'
    
        - name: create link
          become: true # become root
          file:
            state: link
            src: '{{ deploy_dir }}/{{ item.dst }}'
            dest: '{{ deploy_dir }}/{{ item.home }}'
            owner: '{{ ansible_user }}'
            group: '{{ ansible_user }}'
          with_items: '{{ package_info }}'
    
        - name: install package
          unarchive:
            src: '{{ remote_repo }}/{{ item.src }}'
            dest: '{{ deploy_dir }}/{{ item.dst }}'
            remote_src: yes
            extra_opts:
              - --strip-components=1
          with_items: '{{ package_info }}'
    
        - name: config /etc/profile
          become: true
          blockinfile:  
            dest: '/etc/profile'
            marker: "# {mark} ANSIBLE MANAGED PROFILE"
            block: |
              export JAVA_HOME={{ deploy_dir }}/jdk8
              export HADOOP_HOME={{ deploy_dir }}/hadoop
              export HBASE_HOME={{ deploy_dir }}/hbase
              export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PATH
    
        - name: config zkEnv.sh
          lineinfile:
            path: '{{ deploy_dir }}/zookeeper/bin/zkEnv.sh'
            line: 'JAVA_HOME={{ deploy_dir }}/jdk17'
            insertafter: '^#\!\/usr\/bin'
            firstmatch: yes
    
        - name: config hadoop-env.sh
          blockinfile:
            dest: '{{ deploy_dir }}/hadoop/etc/hadoop/hadoop-env.sh'
            marker: "# {mark} ANSIBLE MANAGED DEFAULT HADOOP ENV"
            block: |
              export JAVA_HOME={{ deploy_dir }}/jdk8
    
        - name: config hbase-env.sh
          blockinfile:
            dest: '{{ deploy_dir }}/hbase/conf/hbase-env.sh'
            marker: "# {mark} ANSIBLE MANAGED DEFAULT HBASE ENV"
            block: |
              export JAVA_HOME={{ deploy_dir }}/jdk17
              export HBASE_MANAGES_ZK=false
              export HBASE_LIBRARY_PATH={{ deploy_dir }}/hadoop/lib/native
              export HBASE_OPTS="$HBASE_OPTS --add-exports=java.base/jdk.internal.access=ALL-UNNAMED --add-exports=java.base/jdk.internal=ALL-UNNAMED --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED --add-exports=java.base/sun.security.pkcs=ALL-UNNAMED --add-exports=java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.reflect=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/jdk.internal=ALL-UNNAMED --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --add-opens java.base/jdk.internal.access=ALL-UNNAMED"
    
        - name: patch hbase
          copy:
            src: '{{ local_repo }}/hbase-server-2.4.11.jar'
            dest: '{{ deploy_dir }}/hbase/lib'
            backup: no
            force: yes
    
        - name: link hadoop config
          file:
            state: link
            src: '{{ deploy_dir }}/hadoop/etc/hadoop/{{ item }}'
            dest: '{{ deploy_dir }}/hbase/conf/{{ item }}'
          with_items: 
            - core-site.xml
            - hdfs-site.xml
    
        - name: add epel-release repo
          shell: 'sudo yum -y install epel-release && sudo yum makecache'
    
        - name: install native library
          shell: 'sudo yum -y install snappy snappy-devel lz4 lz4-devel libzstd libzstd-devel'
    
        - name: check hadoop native
          shell: '{{ deploy_dir }}/hadoop/bin/hadoop checknative -a'
          register: hadoop_checknative
          failed_when: false
          changed_when: false
          ignore_errors: yes
          environment:
            JAVA_HOME: '{{ deploy_dir }}/jdk8'
    
        - name: hadoop native status
          debug:
            msg: "{{ hadoop_checknative.stdout_lines }}"
    
        - name: check hbase native
          shell: '{{ deploy_dir }}/hbase/bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker'
          register: hbase_checknative
          failed_when: false
          changed_when: false
          ignore_errors: yes
          environment:
            JAVA_HOME: '{{ deploy_dir }}/jdk17'
            HBASE_LIBRARY_PATH: '{{ deploy_dir }}/hadoop/lib/native'
    
        - name: hbase native status
          debug:
            msg: "{{ hbase_checknative.stdout_lines|select('match', '^[^0-9]') | list }}"
    
        - name: test native compression
          shell: '{{ deploy_dir }}/hbase/bin/hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/test {{ item }}'
          register: 'compression'
          failed_when: false
          changed_when: false
          ignore_errors: yes
          environment:
            JAVA_HOME: '{{ deploy_dir }}/jdk17'
            HBASE_LIBRARY_PATH: '{{ deploy_dir }}/hadoop/lib/native'
          with_items:
            - snappy
            - lz4
    
        - name: native compression status
          vars:
            results: "{{ compression | json_query('results[*].{type:item, result:stdout}') }}"
          debug:
            msg: |
              {% for r in results %} {{ r.type }} => {{ r.result == 'SUCCESS' }} {% endfor %}
    

    config-zk.yml

    ---
    - name: Change Zk Config
      hosts: zk_nodes
      gather_facts: no
      any_errors_fatal: true
    
      vars:
        template_dir: ../conf/zk
        zk_home: '{{ deploy_dir }}/zookeeper'
        zk_data_dir: '{{ zk_home }}/status/data'
        zk_data_log_dir: '{{ zk_home }}/status/logs'
    
      tasks:
    
        - name: Create data directory
          file:
            state: directory
            path: '{{ item }}'
            recurse: yes
          with_items: 
            - '{{ zk_data_dir }}'
            - '{{ zk_data_log_dir }}'
    
        - name: Init zookeeper myid
          template:
            src: '{{ template_dir }}/myid'
            dest: '{{ zk_data_dir }}'
    
        - name: Update zookeeper env
          become: true
          blockinfile:
            dest: '{{ zk_home }}/bin/zkEnv.sh'
            marker: "# {mark} ANSIBLE MANAGED ZK ENV"
            block: |
              export SERVER_JVMFLAGS="-Xmx1G -XX:+UseShenandoahGC -XX:+AlwaysPreTouch -Djute.maxbuffer=8388608"
          notify:
            - Restart zookeeper service
    
        - name: Update zookeeper config
          template:
            src: '{{ template_dir }}/zoo.cfg'
            dest: '{{ zk_home }}/conf'
          notify:
            - Restart zookeeper service
    
      handlers:
        - name: Restart zookeeper service
          shell:
            cmd: '{{ zk_home }}/bin/zkServer.sh restart'
    

    config-hadoop.yml

    ---
    - name: Change Hadoop Config
      hosts: hadoop_nodes
      gather_facts: no
      any_errors_fatal: true
    
      vars:
        template_dir: ../conf/hadoop
        hadoop_home: '{{ deploy_dir }}/hadoop'
        hadoop_conf_dir: '{{ hadoop_home }}/etc/hadoop'
        hadoop_data_dir: '{{ data_dir }}/hadoop'
    
      tasks:
    
        - name: Include common vars
          include_vars: file=vars.yml
    
        - name: Create data directory
          become: true
          file:
            state: directory
            path: '{{ hadoop_data_dir }}'
            owner: '{{ ansible_user }}'
            group: '{{ ansible_user }}'
            mode: 0775
            recurse: yes
    
        - name: Sync hadoop config
          template:
            src: '{{ template_dir }}/{{ item }}'
            dest: '{{ hadoop_conf_dir }}/{{ item }}'
          with_items: 
            - core-site.xml
            - hdfs-site.xml
            - mapred-site.xml
            - yarn-site.xml
            - workers
    
        - name: Config hadoop env
          blockinfile:
            dest: '{{ hadoop_conf_dir }}/hadoop-env.sh'
            marker: "# {mark} ANSIBLE MANAGED HADOOP ENV"
            block: |
              export HADOOP_PID_DIR={{ hadoop_home }}/pid
              export HADOOP_LOG_DIR={{ hadoop_data_dir }}/logs
    
              JVM_OPTS="-XX:+AlwaysPreTouch"
              export HDFS_JOURNALNODE_OPTS="-Xmx1G $JVM_OPTS $HDFS_JOURNALNODE_OPTS"
              export HDFS_NAMENODE_OPTS="-Xmx4G $JVM_OPTS $HDFS_NAMENODE_OPTS"
              export HDFS_DATANODE_OPTS="-Xmx8G $JVM_OPTS $HDFS_DATANODE_OPTS"
    
        - name: Config yarn env
          blockinfile:
            dest: '{{ hadoop_conf_dir }}/yarn-env.sh'
            marker: "# {mark} ANSIBLE MANAGED YARN ENV"
            block: |
              JVM_OPTS=""
              export YARN_RESOURCEMANAGER_OPTS="$JVM_OPTS $YARN_RESOURCEMANAGER_OPTS"
              export YARN_NODEMANAGER_OPTS="$JVM_OPTS $YARN_NODEMANAGER_OPTS"
    

    config-hbase.yml

    ---
    - name: Change HBase Config
      hosts: hbase_nodes
      gather_facts: no
      any_errors_fatal: true
    
      vars:
        template_dir: ../conf/hbase
        hbase_home: '{{ deploy_dir }}/hbase'
        hbase_conf_dir: '{{ hbase_home }}/conf'
        hbase_data_dir: '{{ data_dir }}/hbase'
        hbase_log_dir: '{{ hbase_data_dir }}/logs'
        hbase_gc_log_dir: '{{ hbase_log_dir }}/gc'
    
      tasks:
    
        - name: Include common vars
          include_vars: file=vars.yml
    
        - name: Create data directory
          become: true
          file:
            state: directory
            path: '{{ item }}'
            owner: '{{ ansible_user }}'
            group: '{{ ansible_user }}'
            mode: 0775
            recurse: yes
          with_items:
            - '{{ hbase_data_dir }}'
            - '{{ hbase_log_dir }}'
            - '{{ hbase_gc_log_dir }}'
    
        - name: Sync hbase config
          template:
            src: '{{ template_dir }}/{{ item }}'
            dest: '{{ hbase_conf_dir }}/{{ item }}'
          with_items: 
            - hbase-site.xml
            - backup-masters
            - regionservers
    
        - name: Config hbase env
          blockinfile:
            dest: '{{ hbase_conf_dir }}/hbase-env.sh'
            marker: "# {mark} ANSIBLE MANAGED HBASE ENV"
            block: |
              export HBASE_LOG_DIR={{ hbase_log_dir }}
    
              export HBASE_OPTS="-Xss256k -XX:+UseShenandoahGC -XX:+AlwaysPreTouch $HBASE_OPTS"
              export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xlog:gc:{{hbase_gc_log_dir}}/gc-hmaster-%p-%t.log"
              export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xlog:gc:{{hbase_gc_log_dir}}/gc-hregion-%p-%t.log"
    
    

    config-metrics.yml

    ---
    - name: Install Metrics Package
      hosts: "{{ groups['hadoop_nodes'] + groups['hbase_nodes'] }}"
      gather_facts: no
      any_errors_fatal: true
    
      vars:
        local_repo: '../repo/metrics'
        remote_repo: '~/repo/metrics'
        template_dir: ../conf/metrics
        default_conf: jmx_exporter.yml
    
        export_tmpl: '{{template_dir}}/exports'
        target_tmpl: '{{template_dir}}/targets'
    
        metrics_dir: '{{ deploy_dir }}/prometheus'
        hadoop_home: '{{ deploy_dir }}/hadoop'
        hbase_home: '{{ deploy_dir }}/hbase'
    
        jmx_exporter: 'jmx_prometheus_javaagent-0.16.1.jar'
        agent_path: '{{ metrics_dir }}/{{ jmx_exporter }}'
    
        namenode_metrics_port: 7021
        datanode_metrics_port: 7022
        journalnode_metrics_port: 7023
        resourcemanager_metrics_port: 7024
        nodemanager_metrics_port: 7025
        historyserver_metrics_port: 7026
    
        hmaster_metrics_port: 7027
        regionserver_metrics_port: 7028
    
        host_to_ip: |
          { {% for h in groups['nodes'] %} {% for n in hostvars[h]['hostname'].split() %}
           "{{ n }}" : "{{ h }}" ,
          {% endfor %} {% endfor %} }
    
        hadoop_metrics:
          - { env: 'HDFS_NAMENODE_OPTS', conf: 'namenode.yml', port: '{{namenode_metrics_port}}' }
          - { env: 'HDFS_DATANODE_OPTS', conf: 'datanode.yml', port: '{{datanode_metrics_port}}' }
          - { env: 'HDFS_JOURNALNODE_OPTS', conf: 'journalnode.yml', port: '{{journalnode_metrics_port}}' }
          - { env: 'YARN_RESOURCEMANAGER_OPTS', conf: 'resourcemanager.yml', port: '{{resourcemanager_metrics_port}}' }
          - { env: 'YARN_NODEMANAGER_OPTS', conf: 'nodemanager.yml', port: '{{nodemanager_metrics_port}}' }
          - { env: 'MAPRED_HISTORYSERVER_OPTS', conf: 'historyserver.yml', port: '{{historyserver_metrics_port}}' }
    
        hbase_metrics:
          - { env: 'HBASE_MASTER_OPTS', conf: 'hmaster.yml', port: '{{hmaster_metrics_port}}' }
          - { env: 'HBASE_REGIONSERVER_OPTS', conf: 'regionserver.yml', port: '{{regionserver_metrics_port}}'}
    
      tasks:
    
        - name: test connectivity
          ping:
    
        - name: copy metrics package
          copy:
              src: '{{ local_repo }}'
              dest: '~/repo'
    
        - name: ensure metrics dir
          become: true
          file: 
            path: '{{ metrics_dir }}'
            owner: '{{ ansible_user }}'
            group: '{{ ansible_user }}'
            state: directory
    
        - name: install jmx exporter
          copy:
            src: '{{ remote_repo }}/{{ jmx_exporter }}'
            dest: '{{ metrics_dir }}/{{ jmx_exporter }}'
            remote_src: yes
    
        - name: fetch exist exporter config
          stat:
            path: '{{ export_tmpl }}/{{ item }}'
          with_items: "{{ (hadoop_metrics + hbase_metrics) | map(attribute='conf') | list }}"
          register: metric_tmpl
          run_once: yes
          connection: local
    
        - name: update hadoop exporter config
          vars:
            metrics_ip: '{{host_to_ip[inventory_hostname]}}'
            metrics_port: '{{ item.port }}'
            custom_tmpl: "{{ item.conf in (metric_tmpl | json_query('results[?stat.exists].item')) }}"
          template:
            src: '{{ export_tmpl }}/{{ item.conf if custom_tmpl else default_conf }}'
            dest: '{{ metrics_dir }}/{{ item.conf }}'
          with_items: '{{ hadoop_metrics }}'
          when: inventory_hostname in groups['hadoop_nodes']
    
        - name: update hbase exporter config
          vars:
            metrics_ip: '{{host_to_ip[inventory_hostname]}}'
            metrics_port: '{{ item.port }}'
            custom_tmpl: "{{ item.conf in (metric_tmpl | json_query('results[?stat.exists].item')) }}"
          template:
            src: '{{ export_tmpl }}/{{ item.conf if custom_tmpl else default_conf }}'
            dest: '{{ metrics_dir }}/{{ item.conf }}'
          with_items: '{{ hbase_metrics }}'
          when: inventory_hostname in groups['hbase_nodes']
    
    
        - name: config hadoop-env.sh
          blockinfile:
            dest: '{{ deploy_dir }}/hadoop/etc/hadoop/hadoop-env.sh'
            marker: "# {mark} ANSIBLE MANAGED DEFAULT HADOOP METRIC ENV"
            block: |
              {% for m in hadoop_metrics %}
              export {{m.env}}="-javaagent:{{agent_path}}={{m.port}}:{{metrics_dir}}/{{m.conf}} ${{m.env}}"
              {% endfor %}
          when: inventory_hostname in groups['hadoop_nodes']
    
    
        - name: config hbase-env.sh
          blockinfile:
            dest: '{{ deploy_dir }}/hbase/conf/hbase-env.sh'
            marker: "# {mark} ANSIBLE MANAGED DEFAULT HBASE METRIC ENV"
            block: |
              {% for m in hbase_metrics %}
              export {{m.env}}="-javaagent:{{agent_path}}={{m.port}}:{{metrics_dir}}/{{m.conf}} ${{m.env}}"
              {% endfor %}
          when: inventory_hostname in groups['hbase_nodes']
    
        - name: ensure generated target dir
          file: 
            path: '/tmp/gen-prometheus-targets'
            state: directory
          run_once: yes
          connection: local
    
        - name: generate target config to /tmp/gen-prometheus-targets
          template:
            src: '{{ target_tmpl }}/{{ item }}'
            dest: '/tmp/gen-prometheus-targets/{{ item }}'
          with_items: 
            - hadoop-cluster.yml
            - hbase-cluster.yml
            - zk-cluster.yml
          run_once: yes
          connection: local
    

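    Procedure

    Configure the control machine

    The SSH host key prompt must be disabled; otherwise later installation steps may hang waiting for input.

    Initialize the machines

    1. Edit the hosts inventory (entries must be IP addresses)
    • [nodes] lists every node in the cluster
    • [newborn] lists the nodes that do not have the packages installed yet
    2. Run ansible-playbook book/sync-host.yml
    3. Run ansible-playbook book/install-hadoop.yml
    4. Edit the hosts inventory again
    • [newborn] remove all nodes from this group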
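
    Because every inventory entry has to be an IP address, a quick check before running the playbooks can catch a stray hostname early. The sketch below is not part of the original project: the helper name non_ip_hosts is made up, and it assumes a plain INI inventory without [group:vars] sections.

```shell
# Print inventory host entries whose first field is not a dotted IPv4 address.
# Group headers such as [nodes], comments and blank lines are skipped.
# Assumption: the inventory has no [group:vars] sections (their key=value
# lines would be reported as well).
non_ip_hosts() {
  grep -Ev '^[[:space:]]*($|#|;|\[)' "$1" \
    | awk '{ print $1 }' \
    | grep -Ev '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' \
    || true   # an empty result (all entries are IPs) is not an error
}
```

    Running non_ip_hosts hosts prints nothing when the inventory is clean.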

    Configure and start ZooKeeper

    1. Edit the hosts inventory (ansible_user and myid must be set)
    • [zk_nodes] lists every ZooKeeper node in the cluster
    2. Edit book/config-zk.yml to tune the JVM options
    3. Run ansible-playbook book/config-zk.yml
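
    After the playbook has run, the role of each node can be read from ZooKeeper's srvr four-letter command. This is only a sketch: the helper name zk_mode is made up, and the usage line assumes that nc is installed, that the client port is 2181, and that srvr has not been removed from the four-letter-word whitelist.

```shell
# Extract the server role ("leader", "follower" or "standalone") from the
# output of the `srvr` four-letter command, read on stdin.
zk_mode() {
  sed -n 's/^Mode: //p'
}

# Usage against a live node (address and port are assumptions):
#   echo srvr | nc 10.0.0.1 2181 | zk_mode
```

    A healthy ensemble should report exactly one leader, with the remaining nodes as followers.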

    Configure Hadoop

    1. Edit the hosts inventory
    • [hadoop_nodes] lists every Hadoop node in the cluster
    • [namenodes] all NameNodes in the cluster (id, rpc_port and http_port must be set)
    • [datanodes] all DataNodes in the cluster
    • [journalnodes] all JournalNodes in the cluster (journal_port must be set)
    • [resourcemanagers] all ResourceManagers in the cluster (id, peer_port, tracker_port, scheduler_port and web_port must be set)
    2. Edit book/config-hadoop.yml to tune the JVM options
    3. Run ansible-playbook book/config-hadoop.yml
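
    Before applying a configuration playbook for real, it may be worth previewing it. The sketch below wraps two standard ansible-playbook options, --syntax-check and --check --diff. The helper name preview_playbook and the ANSIBLE_PLAYBOOK override variable are made up for illustration, and check mode can still report errors for paths that do not exist yet.

```shell
# Validate a playbook before applying it: a syntax check first, then a
# check-mode run that prints template diffs without changing the hosts.
# ANSIBLE_PLAYBOOK is a hypothetical override for the binary name.
preview_playbook() {
  "${ANSIBLE_PLAYBOOK:-ansible-playbook}" --syntax-check "$1" || return 1
  "${ANSIBLE_PLAYBOOK:-ansible-playbook}" --check --diff "$1"
}

# Usage:
#   preview_playbook book/config-hadoop.yml
```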

    Start HDFS

    1. Start the journalnode service on every JournalNode
    ansible journalnodes -m shell -a 'source /etc/profile && nohup hdfs --daemon start journalnode'
    
    # Check that a JournalNode process exists
    ansible journalnodes -m shell -a 'source /etc/profile && jps | grep JournalNode'
    
    2. On nn1, format the NameNode and start the namenode service
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs namenode -format'
    
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && nohup hdfs --daemon start namenode'
    
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && jps | grep NameNode'
    
    3. On the remaining NameNodes, sync the metadata from nn1 and start the namenode service
    ansible 'namenodes[1:]' -m shell -a 'source /etc/profile && hdfs namenode -bootstrapStandby'
    
    ansible 'namenodes[1:]' -m shell -a 'source /etc/profile && nohup hdfs --daemon start namenode'
    
    ansible 'namenodes[1:]' -m shell -a 'source /etc/profile && jps | grep NameNode'
    
    4. Start the datanode service on every DataNode (verify the DataNode configuration beforehand)
    ansible datanodes -m shell -a 'source /etc/profile && nohup hdfs --daemon start datanode'
    
    ansible datanodes -m shell -a 'source /etc/profile && jps | grep DataNode'
    
    5. Check that the NameNodes are in Standby state
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs haadmin -getServiceState nn1'
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs haadmin -getServiceState nn2'
    
    6. Initialize the DFSZKFailoverController state in ZooKeeper
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs zkfc -formatZK'
    
    7. Restart the HDFS cluster
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && stop-dfs.sh'
    
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && start-dfs.sh'
    
    # Check that a DFSZKFailoverController process exists
    ansible 'namenodes' -m shell -a 'source /etc/profile && jps | grep FailoverController'
    
    8. Check that one NameNode has become Active
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs haadmin -getServiceState nn1'
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && hdfs haadmin -getServiceState nn2'
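
    Checking the state of nn1 and nn2 one by one gets tedious, so the last step can be wrapped in a small loop. This is a sketch rather than part of the original scripts: the function name find_active_nn is made up, and it has to run on a host where hdfs is on the PATH (for example after source /etc/profile).

```shell
# Print the id of the first NameNode that reports "active".
# Arguments are the NameNode ids configured in hdfs-site.xml (nn1, nn2, ...).
# Returns 1 when no NameNode is active, e.g. before the ZKFC has been started.
find_active_nn() {
  local id state
  for id in "$@"; do
    state=$(hdfs haadmin -getServiceState "$id" 2>/dev/null)
    if [ "$state" = "active" ]; then
      echo "$id"
      return 0
    fi
  done
  return 1
}

# Usage:
#   find_active_nn nn1 nn2 || echo "no active NameNode yet"
```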
    

    Start YARN

    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && start-yarn.sh'
    
    # Check that ResourceManager and NodeManager processes exist
    ansible 'hadoop_nodes' -m shell -a 'source /etc/profile && jps | grep Manager'
    

    Query the state of each ResourceManager to find the Active RM

    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && yarn rmadmin -getServiceState rm1'
    
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && yarn rmadmin -getServiceState rm2'
    

    Configure HBase

    1. Edit the hosts inventory
    • [hbase_nodes] lists every HBase node in the cluster
    • [hmasters] all HMasters in the cluster
    • [regionservers] all RegionServers in the cluster
    2. Edit book/config-hbase.yml to tune the JVM options
    3. Run ansible-playbook book/config-hbase.yml

    Start HBase

    ansible 'hmasters[0]' -m shell -a 'source /etc/profile && nohup start-hbase.sh'
    
    # Check that HMaster and HRegionServer processes exist
    ansible 'hbase_nodes' -m shell -a 'source /etc/profile && jps | grep H'
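
    The grep H filter also matches any other process whose name contains an H, so a stricter check can be useful. A minimal sketch, with the helper name missing_procs made up for illustration; it compares the second column of the jps output against the expected process names.

```shell
# Read `jps` output on stdin and print every expected process name
# (given as arguments) that is absent. Prints nothing when all are present.
missing_procs() {
  local out name
  out=$(cat)
  for name in "$@"; do
    printf '%s\n' "$out" \
      | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
      || echo "$name"
  done
}

# Usage on a node that runs both roles:
#   jps | missing_procs HMaster HRegionServer
```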
    

    Configure monitoring

    1. Edit book/install-metrics.yml to tune the JVM options
    2. Customize the per-node configuration in book/install-metrics.yml
    3. Run ansible-playbook book/install-metrics.yml
    4. Restart the services
    # Stop HBase
    ansible 'hmasters[0]' -m shell -a 'source /etc/profile && stop-hbase.sh'
    
    ansible 'hbase_nodes' -m shell -a 'source /etc/profile && jps | grep H'
    
    # Stop Hadoop
    
    ansible 'resourcemanagers[0]' -m shell -a 'source /etc/profile && stop-yarn.sh'
    
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && stop-dfs.sh'
    
    ansible 'hadoop_nodes' -m shell -a 'source /etc/profile && jps | grep -v "Jps\|QuorumPeerMain"'
    
    # Start HDFS
    
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && start-dfs.sh'
    
    # Check HDFS
    # curl my.hadoop1:7021/metrics
    # curl my.hadoop1:7022/metrics
    # curl my.hadoop1:7023/metrics
    
    # Start YARN
    ansible 'namenodes[0]' -m shell -a 'source /etc/profile && start-yarn.sh'
    
    # Check YARN
    # curl my.hadoop3:7024/metrics
    # curl my.hadoop3:7025/metrics
    
    # Start HBase
    ansible 'hmasters[0]' -m shell -a 'source /etc/profile && nohup start-hbase.sh'
    
    # Check HBase
    # curl my.hbase1:7027/metrics
    # curl my.hbase1:7028/metrics
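
    The commented curl checks above can be folded into one helper that probes several exporter ports on a host. This is a sketch; the function name check_metrics is made up, and the ports are the ones defined in the metrics playbook.

```shell
# Probe the /metrics endpoint of each given port on one host.
# Prints one status line per port and returns non-zero if any probe fails.
check_metrics() {
  local host="$1" port rc=0
  shift
  for port in "$@"; do
    if curl -sf --max-time 5 "http://${host}:${port}/metrics" > /dev/null; then
      echo "${host}:${port} ok"
    else
      echo "${host}:${port} FAILED"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_metrics my.hadoop1 7021 7022 7023
#   check_metrics my.hbase1 7027 7028
```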
    

    Install Prometheus and Grafana

    • Install Prometheus (see the appendix)
    • Install Grafana (see the appendix)

    Appendix

    Install Ansible

    Install dependencies

    • Install pip (for Python 2.7)
    curl https://bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py
    
    python get-pip.py --user
    
    pip -V
    
    • Install the dependency libraries
    sudo yum install -y gcc glibc-devel zlib-devel rpm-build openssl-devel
    sudo yum install -y python-devel python-yaml python-jinja2 python2-jmespath
    

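    Build and install from source

    Python 2 is supported only up to the Ansible 2.9 series, so installing through yum is not an option here.

    Download the Ansible 2.9.27 source and build it locally: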

    wget https://releases.ansible.com/ansible/ansible-2.9.27.tar.gz
    
    tar -xf ansible-2.9.27.tar.gz
    
    pushd ansible-2.9.27/
    
    python setup.py build
    
    sudo python setup.py install
    
    popd
    
    ansible --version
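
    Since the whole setup depends on staying on the 2.9 series, the version check can be made explicit. A minimal sketch, assuming the first line of the output has the form "ansible 2.9.x" as printed by 2.9 releases; the helper name is_ansible_29 is made up.

```shell
# Read `ansible --version` output on stdin and succeed only for a 2.9.x release.
is_ansible_29() {
  head -n 1 | grep -Eq '^ansible 2\.9\.[0-9]+'
}

# Usage:
#   ansible --version | is_ansible_29 || echo "unexpected Ansible version"
```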
    

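    Configure passwordless login

    • Generate a key pair on the control machine
    ssh-keygen -t rsa -b 3072
    cat ~/.ssh/id_rsa.pub
    
    • Authorize the key on each managed host
    cat <<EOF >> ~/.ssh/authorized_keys
    ssh-rsa XXX
    EOF
    
    • Disable the SSH host key prompt on the control machine
    vim /etc/ssh/ssh_config
    # Add the following under Host *
    Host *
            StrictHostKeyChecking no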
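
    When password login is still enabled on the managed hosts, pasting the key by hand can be replaced with ssh-copy-id, which ships with OpenSSH. The loop below is a sketch: push_key and the SSH_COPY_ID override variable are made-up names, and the user and addresses in the usage line are placeholders.

```shell
# Install the control machine's public key on each managed host.
# ssh-copy-id asks for the remote password once per host.
# SSH_COPY_ID is a hypothetical override for the binary name.
push_key() {
  local user="$1" host
  shift
  for host in "$@"; do
    "${SSH_COPY_ID:-ssh-copy-id}" -i "$HOME/.ssh/id_rsa.pub" "${user}@${host}" || return 1
  done
}

# Usage (placeholder user and addresses):
#   push_key deploy 10.0.0.1 10.0.0.2 10.0.0.3
```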
    
    

    Install Prometheus

    Create the prometheus user

    sudo useradd --no-create-home --shell /bin/false prometheus
    
    # Grant sudo privileges: run visudo and add the line below
    sudo visudo
    prometheus ALL=(ALL) NOPASSWD:ALL
    

    Find the download link on the official website

    wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
    
    tar -xvf prometheus-2.35.0.linux-amd64.tar.gz && sudo mv prometheus-2.35.0.linux-amd64 /usr/local/prometheus-2.35.0 
    
    sudo mkdir -p /data/prometheus/tsdb
    sudo mkdir -p /etc/prometheus
    
    sudo ln -s /usr/local/prometheus-2.35.0 /usr/local/prometheus
    
    sudo mv /usr/local/prometheus/prometheus.yml /etc/prometheus
    
    sudo chown -R prometheus:prometheus /usr/local/prometheus/
    sudo chown -R prometheus:prometheus /data/prometheus
    sudo chown -R prometheus:prometheus /etc/prometheus
    

    Register as a systemd service (unit file format)

    sudo vim /etc/systemd/system/prometheus.service
    
    # Add the following content
    [Unit]
    Description=Prometheus Server
    Documentation=https://prometheus.io/docs/introduction/overview/
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/prometheus/prometheus \
        --config.file=/etc/prometheus/prometheus.yml \
        --storage.tsdb.path=/data/prometheus/tsdb \
        --web.listen-address=:9090
    
    [Install]
    WantedBy=multi-user.target
    

    Start the service

    sudo systemctl start prometheus.service
    
    # Check the service status
    systemctl status prometheus.service
    
    # View the logs
    sudo journalctl -u prometheus
    
    # Test: curl 127.0.0.1:9090
    

    Edit prometheus.yml

    scrape_configs:
    
      - job_name: "prometheus"
        file_sd_configs:
          - files:
            - targets/prometheus-*.yml
            refresh_interval: 1m
    
      - job_name: "zookeeper"
        file_sd_configs:
          - files:
            - targets/zk-cluster.yml
            refresh_interval: 1m
        metric_relabel_configs:
        - action: replace
          source_labels: ["instance"]
          target_label: "instance"
          regex: "([^:]+):.*"
          replacement: "$1"
    
      - job_name: "hadoop"
        file_sd_configs:
          - files:
            - targets/hadoop-cluster.yml
            refresh_interval: 1m
        metric_relabel_configs:
        - action: replace
          source_labels: ["__name__"]
          target_label: "__name__"
          regex: "Hadoop_[^_]*_(.*)"
          replacement: "$1"
        - action: replace
          source_labels: ["instance"]
          target_label: "instance"
          regex: "([^:]+):.*"
          replacement: "$1"
    
      - job_name: "hbase"
        file_sd_configs:
          - files:
            - targets/hbase-cluster.yml
            refresh_interval: 1m
        metric_relabel_configs:
        - action: replace
          source_labels: ["instance"]
          target_label: "instance"
          regex: "([^:]+):.*"
          replacement: "$1"
        - action: replace
          source_labels: ["stat"]
          target_label: "stat"
          regex: "(.*)th_percentile"
          replacement: "p$1"
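
    The metric_relabel_configs rules above are easier to trust once they have been tried on sample values. Prometheus anchors the regex at both ends and leaves non-matching values unchanged, which sed -E with explicit ^ and $ can imitate for these simple patterns. This is only an approximation of the relabeling engine; the helper name relabel and the sample metric name are made up, and sed writes the capture group as \1 where Prometheus uses $1.

```shell
# Emulate a Prometheus "replace" relabel rule on a single value.
# Arguments: <regex> <sed-style replacement> <value>
relabel() {
  printf '%s\n' "$3" | sed -E "s/^$1\$/$2/"
}

relabel '([^:]+):.*'        '\1'  'my.zk1:7000'                     # prints my.zk1
relabel 'Hadoop_[^_]*_(.*)' '\1'  'Hadoop_NameNode_CapacityTotal'   # prints CapacityTotal
relabel '(.*)th_percentile' 'p\1' '99th_percentile'                 # prints p99
```

    The first rule strips the port from instance, the second removes the Hadoop_<component>_ prefix from metric names, and the third shortens percentile labels.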
    
    

    Add targets

    sudo mkdir -p /etc/prometheus/targets
    pushd /etc/prometheus/targets
    
    sudo tee -a prometheus-servers.yml > /dev/null <<EOF
    - targets:
      - localhost:9090
      labels:
        service: prometheus
    EOF
    
    sudo tee -a zk-cluster.yml > /dev/null <<EOF
    - targets:
      - my.zk1:7000
      - my.zk2:7000
      - my.zk3:7000
      labels:
        service: zookeeper
    EOF
    
    sudo tee -a hadoop-cluster.yml > /dev/null <<EOF
    - targets:
      - my.hadoop1:7021
      - my.hadoop2:7021
      labels:
        role: namenode
        service: hdfs
    - targets:
      - my.hadoop1:7022
      - my.hadoop2:7022
      - my.hadoop3:7022
      - my.hadoop4:7022
      labels:
        role: datanode
        service: hdfs
    - targets:
      - my.hadoop1:7023
      - my.hadoop2:7023
      - my.hadoop3:7023
      labels:
        role: journalnode
        service: hdfs
    - targets:
      - my.hadoop3:7024
      - my.hadoop4:7024
      labels:
        role: resourcemanager
        service: yarn
    - targets:
      - my.hadoop1:7025
      - my.hadoop2:7025
      - my.hadoop3:7025
      - my.hadoop4:7025
      labels:
        role: nodemanager
        service: yarn
    EOF
    
    sudo tee -a hbase-cluster.yml > /dev/null <<EOF
    - targets:
      - my.hbase1:7027
      - my.hbase2:7027
      labels:
        app: hmaster
        service: hbase
    - targets:
      - my.hbase1:7028
      - my.hbase2:7028
      - my.hbase3:7028
      - my.hbase4:7028
      labels:
        app: regionserver
        service: hbase
    EOF
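
    Writing these target groups by hand gets repetitive as the cluster grows. The helper below is a sketch that prints one group in the same file_sd layout; the name gen_targets is made up, and it always emits the role/service label pair used in the Hadoop file.

```shell
# Print one file_sd target group.
# Arguments: <role> <service> <port> <host>...
gen_targets() {
  local role="$1" service="$2" port="$3" host
  shift 3
  echo "- targets:"
  for host in "$@"; do
    echo "  - ${host}:${port}"
  done
  echo "  labels:"
  echo "    role: ${role}"
  echo "    service: ${service}"
}

# Usage:
#   gen_targets namenode hdfs 7021 my.hadoop1 my.hadoop2 | sudo tee -a hadoop-cluster.yml
```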
    

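    Install Grafana

    Install the service

    Find the download link on the official website (choose the OSS edition):

    wget https://dl.grafana.com/oss/release/grafana-8.5.0-1.x86_64.rpm
    sudo yum install grafana-8.5.0-1.x86_64.rpm
    
    # List the files created by the installation, including the configuration files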
    rpm -ql grafana
    

    Edit grafana.ini

    sudo vim /etc/grafana/grafana.ini
    
    # Storage paths
    [paths]
    data = /data/grafana/data
    logs = /data/grafana/logs
    
    # Administrator account
    [security]
    admin_user = admin
    admin_password = admin
    

    Start the grafana service

    sudo mkdir -p /data/grafana/{data,logs} && sudo chown -R grafana:grafana /data/grafana
    
    sudo systemctl start grafana-server
    
    systemctl status grafana-server
    
    # Test: curl 127.0.0.1:3000
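
    Besides loading the login page, Grafana exposes an /api/health endpoint whose JSON body includes a database field. The sketch below checks that field; the helper name grafana_healthy is made up, and the address in the usage line is the default from the step above.

```shell
# Read the JSON body of Grafana's /api/health endpoint on stdin and
# succeed only when the database is reported as "ok".
grafana_healthy() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage:
#   curl -s http://127.0.0.1:3000/api/health | grafana_healthy && echo "grafana is up"
```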
    

    Configure LDAP

    Edit the configuration file grafana.ini

    sudo vim /etc/grafana/grafana.ini
    
    # Enable LDAP
    [auth.ldap]
    enabled = true
    
    # Optionally set the log level to debug while troubleshooting
    [log]
    level = debug
    

    Add the LDAP configuration (see the linked reference)

    sudo vim /etc/grafana/ldap.toml
    
    [[servers]]
    # LDAP server
    host = "ldap.service.com"
    port = 389
    
    # Bind credentials
    bind_dn = "cn=ldap_sync,cn=Users,dc=staff,dc=my,dc=com"
    bind_password = """???"""
    
    # Search scope
    search_filter = "(sAMAccountName=%s)"
    search_base_dns = ["ou=Employees,dc=staff,dc=my,dc=com"]
    
    # User attribute mapping
    [servers.attributes]
    name = "givenname"
    surname = "cn"
    username = "cn"
    email =  "mail"
    
    # Group-to-role mapping is omitted here...
    

    Restart the grafana service

    sudo systemctl restart grafana-server
    
    # Log in through the web UI and watch the log (press Shift + G in the pager to jump to the end)
    sudo journalctl -u grafana-server
    

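    Configure dashboards

    Add a data source

    Log in with the admin account and add Prometheus as a data source:

    Configuration (sidebar)
      -> Data sources (opens a sub-page)
        -> Add data source (blue button)
          -> Prometheus (list entry)
            -> Enter the HTTP URL and click Save & Test (blue button)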
    

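    Add a dashboard

    ZooKeeper provides an official configuration guide and a default dashboard.

    Create (sidebar)
      -> Import (opens a sub-page)
        -> Enter the HTTP URL and click Load (blue button)
          -> Select the Prometheus data source and click Import (blue button)
            -> Open the dashboard and set Cluster (drop-down)
                -> Click Save Dashboard in the top-right corner (file icon)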
    

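    There is no ready-made HBase dashboard template; it has to be built by following the official documentation.

    There is no ready-made Hadoop dashboard template either; it has to be built by following the linked article on the official website.

    For convenience, here are a few simple dashboard templates: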

    Patch HBase

    HBase has a compatibility problem with JDK 12 and later. The fix exists only on the 3.0 branch and has not been merged into 2.x, so the patch has to be applied by hand. Following the linked solution, it is enough to rebuild and replace hbase-server.jar; to stay consistent with the release package, compile with JDK 8.

    In practice, however, the rebuilt jar turned out to be noticeably smaller than the hbase-server.jar in the release package.
    To be safe, only the class file is replaced. The steps are:

    
    # Rename the rebuilt hbase-server-2.4.11.jar to patch.jar
    # and copy it into the same directory as the official hbase-server-2.4.11.jar
    unzip patch.jar -d patch
    unzip hbase-server-2.4.11.jar -d hbase-server-2.4.11
    
    # Locate the target class
    jar -tf hbase-server-2.4.11.jar | grep HFileSystem.class
    
    # Check that both class files have the same class-file version
    file patch/org/apache/hadoop/hbase/fs/HFileSystem.class
    file hbase-server-2.4.11/org/apache/hadoop/hbase/fs/HFileSystem.class
    
    # Download cfr to decompile the classes
    wget https://www.benf.org/other/cfr/cfr-0.152.jar
    
    # Decompile both classes
    java -jar cfr-0.152.jar hbase-server-2.4.11/org/apache/hadoop/hbase/fs/HFileSystem.class > A.java
    java -jar cfr-0.152.jar patch/org/apache/hadoop/hbase/fs/HFileSystem.class > B.java
    
    # Confirm that the patch was applied
    diff A.java B.java
    
    # Once verified, write the patched class file into hbase-server-2.4.11.jar
    cd patch
    jar -uf ../hbase-server-2.4.11.jar org/apache/hadoop/hbase/fs/HFileSystem.class
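
    The class-file version comparison in the steps above relies on the file utility. Where it is not installed, the major version can be read directly from the class header: bytes 6 and 7 hold it as a big-endian number (52 for Java 8, 61 for Java 17). A minimal sketch, with class_major as a made-up helper name:

```shell
# Print the major class-file version of a .class file.
# Layout: 4 bytes magic (CAFEBABE), 2 bytes minor version, 2 bytes major version.
class_major() {
  od -An -tu1 -j6 -N2 "$1" | {
    read -r hi lo
    echo $(( hi * 256 + lo ))
  }
}

# Usage: both commands should print the same number (52 when built with JDK 8)
#   class_major patch/org/apache/hadoop/hbase/fs/HFileSystem.class
#   class_major hbase-server-2.4.11/org/apache/hadoop/hbase/fs/HFileSystem.class
```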
    
  • Original article: https://www.cnblogs.com/buttercup/p/15858660.html