• HBase预分区实战篇


                  HBase预分区实战篇

                                     作者:尹正杰

    版权声明:原创作品,谢绝转载!否则将追究法律责任。

    一.HBase预分区概述

      每一个region维护着startRow与endRowKey,如果加入的数据符合某个region维护的rowKey范围,则该数据交给这个region维护。那么依照这个原则,我们可以将数据所要投放的分区提前大致的规划好,以提高HBase性能。
    
      预分区是否可以基于时间戳作为ROWKEY?
        不推荐基于时间戳来作为ROWKEY,其原因很简单,很容易出现数据倾斜。随着数据量的增大,Region也会被切分,而基于时间戳切分的Region有一个明显特点,前面的Region的数据固定,后面的Region数量不断有数据写入导致数据倾斜现象,从而频繁发生Region的切分。
    
      如何规划预分区?
        这需要您对数据的总量有一个判断,举个例子,目前HBase有50亿条数据,每条数据平均大小是10k,那么数据总量大小为5000000000*10k/1024/1024=47683G(约47T).我们假设每个Region可以存储100G的数据,那么要存储到现有数据只需要477个Region足以。
        但是这个Region数量仅能存储现有数据,实际情况我们需要预估公司未来5-10年的数量量的增长,比如每年增长10亿条数据,按5年来做预估,那么5年后的数据会增加50亿条数据,因此预估5年后总数居大小为10000000000*10k/1024/1024=95367(约95T)。
        综上所述,按照每个Region存储100G的数据,最少需要954个Region,接下来我们就得考虑如何将数据尽量平均的存储在这954个Region中,而不是将数据放在某一个Region中导致数据倾斜的情况发生。

    二.手动设置预分区

    1>.手动设置预分区

    hbase(main):001:0> create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']
    Created table staff1
    Took 3.0235 seconds                                                                                                                                                                                                                                                           
    => Hbase::Table - staff1
    hbase(main):002:0> 
    hbase(main):001:0> create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']        #指定的分区键为'1000','2000','3000','4000'
    hbase(main):006:0> describe 'staff1'
    Table staff1 is ENABLED                                                                                                                                                                                                                                                       
    staff1                                                                                                                                                                                                                                                                        
    COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
    {NAME => 'info', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTE
    R => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                                   
    
    {NAME => 'partition1', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOO
    MFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                             
    
    2 row(s)
    
    QUOTAS                                                                                                                                                                                                                                                                        
    0 row(s)
    Took 0.0648 seconds                                                                                                                                                                                                                                                           
    hbase(main):007:0> 
    hbase(main):006:0> describe 'staff1'

    2>.访问Hmaster的WebUI

      浏览器访问HMaster的地址:
        http://hadoop101.yinzhengjie.org.cn:16010/

    3>.查看staff1表的Region信息

    三.生成16进制序列预分区

    1>.生成16进制序列预分区

    hbase(main):007:0> create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
    Created table staff2
    Took 2.4445 seconds                                                                                                                                                                                                                                                           
    => Hbase::Table - staff2
    hbase(main):008:0> 
    hbase(main):007:0> create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
    hbase(main):008:0> describe 'staff2'
    Table staff2 is ENABLED                                                                                                                                                                                                                                                       
    staff2                                                                                                                                                                                                                                                                        
    COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
    {NAME => 'info', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTE
    R => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                                   
    
    {NAME => 'partition2', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOO
    MFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                             
    
    2 row(s)
    
    QUOTAS                                                                                                                                                                                                                                                                        
    0 row(s)
    Took 0.0880 seconds                                                                                                                                                                                                                                                           
    hbase(main):009:0>
    hbase(main):008:0> describe 'staff2'

    2>.访问Hmaster的WebUI

      浏览器访问HMaster的地址:
        http://hadoop101.yinzhengjie.org.cn:16010/

    3>.查看staff2表的Region信息

    四.按照文件中设置的规则预分区

    1>.按照文件中设置的规则预分区

    [root@hadoop101.yinzhengjie.org.cn ~]# vim splits.txt
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# cat splits.txt
    AAAAA
    BBBBB
    CCCCC
    DDDDD
    EEEEE
    [root@hadoop101.yinzhengjie.org.cn ~]# 
    [root@hadoop101.yinzhengjie.org.cn ~]# vim splits.txt
    hbase(main):009:0> create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
    Created table staff3
    Took 2.3140 seconds                                                                                                                                                                                                                                                           
    => Hbase::Table - staff3
    hbase(main):010:0> 
    hbase(main):009:0> create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
    hbase(main):010:0> describe 'staff3'
    Table staff3 is ENABLED                                                                                                                                                                                                                                                       
    staff3                                                                                                                                                                                                                                                                        
    COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
    {NAME => 'partition3', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOO
    MFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                             
    
    1 row(s)
    
    QUOTAS                                                                                                                                                                                                                                                                        
    0 row(s)
    Took 0.0955 seconds                                                                                                                                                                                                                                                           
    hbase(main):011:0> 
    hbase(main):010:0> describe 'staff3'

    2>.访问Hmaster的WebUI

      浏览器访问HMaster的地址:
        http://hadoop101.yinzhengjie.org.cn:16010/

    3>.查看staff3表的Region信息

    五.基于JavaAPI创建预分区

    1>.自定义生成3个Region的预分区表

    package cn.org.yinzhengjie.bigdata.hbase.util;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    
    import java.io.IOException;
    
    /**
     * HBase操作工具类
     */
    public class HBaseUtil {
    
        //创建ThreadLocal对象(会在各个线程里单独开辟一块共享内存空间),目的是为了同一个线程内实现数据共享,它的作用并不是解决线程安全问题哟~
        private static ThreadLocal<Connection> connHolder = new ThreadLocal<Connection>();
    
        /**
         *  将构造方法私有化,禁止该工具类被实例化
         */
        private HBaseUtil(){}
    
    
        /**
         *  获取HBase连接对象
         */
        public static void makeHbaseConnection() throws IOException {
            //获取连接
            Connection conn = connHolder.get();
    
            //第一次获取连接为空,因此需要手动创建Connection对象
            if (conn == null){
                //使用HBaseConfiguration的单例方法实例化,该方法会自动帮咱们加载"hbase-default.xml"和"hbase-site.xml"文件.
                Configuration conf = HBaseConfiguration.create();
                conn = ConnectionFactory.createConnection(conf);
                connHolder.set(conn);
            }
        }
    
        /**
         *  增加数据
         */
        public static void insertData (String tableName, String rowKey, String columnFamily, String column, String value)
                throws IOException{
            //获取连接
            Connection conn = connHolder.get();
            //获取表
            Table table = conn.getTable(TableName.valueOf(tableName));
            //创建Put对象,需要指定往哪个"RowKey"插入数据
            Put put = new Put(Bytes.toBytes(rowKey));
            //记得添加列族信息
            put.addColumn(Bytes.toBytes(columnFamily),Bytes.toBytes(column),Bytes.toBytes(value));
            //执行完这行代码数据才会被真正写入到HBase哟~
            table.put(put);
            //记得关闭表
            table.close();
        }
    
        /**
         *  关闭连接
         */
        public static void close() throws IOException {
            //获取连接
            Connection conn = connHolder.get();
            if (conn != null){
                conn.close();
                //关闭连接后记得将其从ThreadLocal的内存中移除以释放空间。
                connHolder.remove();
            }
        }
    
        /**
         *  生成分区键
         */
        public static byte[][] genRegionKeys(int regionCount){
            //设置3个Region就会生成2分区键
            byte[][] bs = new byte[regionCount -1][];
            for (int i = 0;i<regionCount -1;i++){
                bs[i] = Bytes.toBytes(i + "|");
            }
            return bs;
        }
    
    
        /**
         *  生成分区号
         */
        public static String genRegionNumber(String rowkey,int regionCount){
    
            //计算分区号
            int regionNumber;
            //获取一个散列值
            int hash = rowkey.hashCode();
    
            /**
             *  (regionCount & (regionCount - 1)) == 0 :
             *      用于判断regionCount这个数字是否是2的N次方:
             *          如果是则将该数字减去1和hash值做与运算(&)目的是可以随机匹配到不同的分区号;
             *          如果不是则将该数字和hash值做取余运算(%)目的也是可以随机匹配到不同的分区号;
             */
            if (regionCount > 0 && (regionCount & (regionCount - 1)) == 0){
                // (regionCount -1)算的是分区键,举个例子:分区数为3,则分区键为2.
                regionNumber = hash & (regionCount -1);
            }else {
                regionNumber = hash % regionCount;
            }
    
            return regionNumber + "_" + rowkey;
        }
    }
    HBaseUtil.java
    package cn.org.yinzhengjie.bigdata.hbase;
    
    import cn.org.yinzhengjie.bigdata.hbase.util.HBaseUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;
    
    import java.io.IOException;
    
    public class PrePartition {
        public static void main(String[] args) throws IOException {
            //创建配置对象
            Configuration conf = HBaseConfiguration.create();
    
            //获取HBase的连接对象
            Connection conn = ConnectionFactory.createConnection(conf);
    
            //获取管理Hbase的对象
            Admin admin = conn.getAdmin();
    
            //获取表对象
            HTableDescriptor td = new HTableDescriptor(TableName.valueOf("yinzhengjie2020:course1"));
    
            //为表设置列族
            HColumnDescriptor cd = new HColumnDescriptor("info");
            td.addFamily(cd);
    
            //设置3个Region就会生成2分区键
            byte[][] spiltKeys = HBaseUtil.genRegionKeys(3);
            for (byte[] spiltKey : spiltKeys) {
                System.out.println(Bytes.toString(spiltKey));
            }
    
            //创建表并指定分区键(二维数组)可用增加预分区
            admin.createTable(td,spiltKeys);
            System.out.println("预分区表创建成功....");
    
            //释放资源
            admin.close();
            conn.close();
        }
    }
    案例代码

    2>.查看WebUI

      浏览器访问:
        http://hadoop105.yinzhengjie.org.cn:16010/table.jsp?name=yinzhengjie2020:course

    3>.往自定义的分区插入数据

    package cn.org.yinzhengjie.bigdata.hbase.util;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    
    import java.io.IOException;
    
    /**
     * HBase操作工具类
     */
    public class HBaseUtil {
    
        //创建ThreadLocal对象(会在各个线程里单独开辟一块共享内存空间),目的是为了同一个线程内实现数据共享,它的作用并不是解决线程安全问题哟~
        private static ThreadLocal<Connection> connHolder = new ThreadLocal<Connection>();
    
        /**
         *  将构造方法私有化,禁止该工具类被实例化
         */
        private HBaseUtil(){}
    
    
        /**
         *  获取HBase连接对象
         */
        public static void makeHbaseConnection() throws IOException {
            //获取连接
            Connection conn = connHolder.get();
    
            //第一次获取连接为空,因此需要手动创建Connection对象
            if (conn == null){
                //使用HBaseConfiguration的单例方法实例化,该方法会自动帮咱们加载"hbase-default.xml"和"hbase-site.xml"文件.
                Configuration conf = HBaseConfiguration.create();
                conn = ConnectionFactory.createConnection(conf);
                connHolder.set(conn);
            }
        }
    
        /**
         *  增加数据
         */
        public static void insertData (String tableName, String rowKey, String columnFamily, String column, String value)
                throws IOException{
            //获取连接
            Connection conn = connHolder.get();
            //获取表
            Table table = conn.getTable(TableName.valueOf(tableName));
            //创建Put对象,需要指定往哪个"RowKey"插入数据
            Put put = new Put(Bytes.toBytes(rowKey));
            //记得添加列族信息
            put.addColumn(Bytes.toBytes(columnFamily),Bytes.toBytes(column),Bytes.toBytes(value));
            //执行完这行代码数据才会被真正写入到HBase哟~
            table.put(put);
            //记得关闭表
            table.close();
        }
    
        /**
         *  关闭连接
         */
        public static void close() throws IOException {
            //获取连接
            Connection conn = connHolder.get();
            if (conn != null){
                conn.close();
                //关闭连接后记得将其从ThreadLocal的内存中移除以释放空间。
                connHolder.remove();
            }
        }
    
        /**
         *  生成分区键
         */
        public static byte[][] genRegionKeys(int regionCount){
            //设置3个Region就会生成2分区键
            byte[][] bs = new byte[regionCount -1][];
            for (int i = 0;i<regionCount -1;i++){
                bs[i] = Bytes.toBytes(i + "|");
            }
            return bs;
        }
    
    
        /**
         *  生成分区号
         */
        public static String genRegionNumber(String rowkey,int regionCount){
    
            //计算分区号
            int regionNumber;
            //获取一个散列值
            int hash = rowkey.hashCode();
    
            /**
             *  (regionCount & (regionCount - 1)) == 0 :
             *      用于判断regionCount这个数字是否是2的N次方:
             *          如果是则将该数字减去1和hash值做与运算(&)目的是可以随机匹配到不同的分区号;
             *          如果不是则将该数字和hash值做取余运算(%)目的也是可以随机匹配到不同的分区号;
             */
            if (regionCount > 0 && (regionCount & (regionCount - 1)) == 0){
                // (regionCount -1)算的是分区键,举个例子:分区数为3,则分区键为2.
                regionNumber = hash & (regionCount -1);
            }else {
                regionNumber = hash % regionCount;
            }
    
            return regionNumber + "_" + rowkey;
        }
    }
    HBaseUtil.java
    package cn.org.yinzhengjie.bigdata.hbase;
    
    import cn.org.yinzhengjie.bigdata.hbase.util.HBaseUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;
    
    import java.io.IOException;
    
    public class InsertPrePartition {
        public static void main(String[] args) throws IOException {
            //创建配置对象
            Configuration conf = HBaseConfiguration.create();
            //获取HBase的连接对象
            Connection conn = ConnectionFactory.createConnection(conf);
            //获取管理Hbase的对象
            Admin admin = conn.getAdmin();
            //获取要操作的表对象
            Table courseTable = conn.getTable(TableName.valueOf("yinzhengjie2020:course"));
            //编辑待插入数据,将rowkey均匀分配到不同的分区中,效果可以模拟HashMap数据存储的规则。
            String rowkey = "yinzhengjie8";
            rowkey = HBaseUtil.genRegionNumber(rowkey, 3);
            Put put = new Put(Bytes.toBytes(rowkey));
            put.addColumn(Bytes.toBytes("info"),Bytes.toBytes("age"),Bytes.toBytes(20));
            //往表中插入数据
            courseTable.put(put);
            System.out.println("数据插入成功......");
        }
    }
    案例代码
  • 相关阅读:
    2015 Multi-University Training Contest 2 1004 Delicious Apples(DP)
    开门人和关门人
    数据降维 实例
    Leetcode题解(5):L58/Length of Last Word
    JavaWeb开发环境搭建
    Linux配置hugepage
    lua的函数初识
    有人离职时经理的反应是?
    svn如何回滚到之前版本
    python用httplib模块发送get和post请求
  • 原文地址:https://www.cnblogs.com/yinzhengjie2020/p/12914075.html
Copyright © 2020-2023  润新知