• BulkLoad: a performance optimization for HBase MapReduce jobs


    As we know, when bulk-loading a massive dataset into HBase for the first time, BulkLoad is the method of choice.

    A quick overview of how BulkLoad works: (1) a MapReduce job formats its output, on the map or reduce side, into HFiles, HBase's underlying storage file format; (2) BulkLoad is then invoked to import the HFiles produced by that first job into the target HBase table.

    Note: (1) The HFile approach is the fastest of all loading methods, but only when the data is the very first import, i.e. the table is empty. If the table already contains data, importing HFiles again can trigger region splits. (2) For the final output, whether it comes from the map or the reduce side, it is recommended to use only <ImmutableBytesWritable, KeyValue>.


    Now to the main topic. BulkLoad is indeed the fastest way to write into HBase, but when we run business analysis over data that is already stored in HBase, we often fall back to the ordinary HBase write path, as in the demo below:

    import com.yeepay.bigdata.bulkload.TableCreator;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.log4j.Logger;
    
    import java.io.IOException;
    
    public class HBaseMapReduceDemo {
    
        static Logger LOG = Logger.getLogger(HBaseMapReduceDemo.class);
    
        // Mapper1 receives one (rowkey, Result) pair per row of the source table and
        // emits intermediate <rowkey, value> pairs; the body is left as a placeholder.
        static class Mapper1 extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {

            @Override
            public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException, InterruptedException {

                try {
                //  context.write(key, value);
                } catch (Exception e) {
                    LOG.error(e);
                }
            }
        }
    
        // Reducer1 builds one Put per output rowkey and writes it to the destination
        // table through the normal write path (TableOutputFormat).
        public static class Reducer1 extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {

            public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context) throws IOException, InterruptedException {
                try {

                    Put put = new Put(key.get());
                    // put.add(family, qualifier, value);
                    context.write(key, put);

                } catch (Exception e) {
                    LOG.error(e);
                    return ;
                }  // catch
            }  // reduce function
        }  // reduce class
    
        public static void main(String[] args) throws Exception {

            // ZooKeeper quorum and client port identify the HBase cluster.
            HBaseConfiguration conf = new HBaseConfiguration();
            conf.set("hbase.zookeeper.quorum", "yp-name02,yp-name01,yp-data01");
            conf.set("hbase.zookeeper.property.clientPort", "2181");
            // conf.set(TableInputFormat.INPUT_TABLE,"access_logs");
            Job job = new Job(conf, "HBaseMapReduceDemo");
            job.setJarByClass(HBaseMapReduceDemo.class);
    //        job.setNumReduceTasks(2);

            // Full scan of the source table; a large scanner cache and a disabled
            // block cache are the usual settings for MapReduce scans.
            Scan scan = new Scan();
            scan.setCaching(2500);
            scan.setCacheBlocks(false);

            // Read srcHBaseTableName with Mapper1 and write Puts to destHBasetableName with Reducer1.
            TableMapReduceUtil.initTableMapperJob("srcHBaseTableName", scan, Mapper1.class, ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
    //        TableCreator.createTable(20, true, "OP_SUM");
            TableMapReduceUtil.initTableReducerJob("destHBasetableName", Reducer1.class, job);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    
    
    }
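
    The map and reduce bodies above are left as placeholders. Purely as an illustrative sketch (the cf family and count qualifier below are made-up names, not taken from the original job), a reduce that counts the intermediate values per rowkey and writes the count through the normal Put path might look like this:

        // Hypothetical filled-in body for Reducer1.reduce above; "cf" and "count"
        // are placeholder family/qualifier names.
        public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context) throws IOException, InterruptedException {
            long count = 0;
            for (ImmutableBytesWritable ignored : values) {
                count++;                                  // count the intermediate values for this rowkey
            }
            Put put = new Put(key.get());                 // one Put per output rowkey
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count));  // old 0.9x Put.add API
            context.write(key, put);                      // TableReducer sends the Put to the destination table
        }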
    
    With this approach, inserting massive amounts of data keeps triggering region splits and the write speed becomes extremely slow. It is, however, suitable when you are modifying a table that already holds data.

    For the scenario HBase -> MapReduce analysis -> new table, we therefore use the pipeline HBase -> MapReduce analysis -> bulkload -> new table instead.

    The demo is as follows:

    The Mapper is as follows:

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.log4j.Logger;
    
    import java.io.IOException;
    
    public class MyMapper extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
    
        static Logger LOG = Logger.getLogger(MyMapper.class);
    
        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException, InterruptedException {
            try {
                // Analyse the source row (values) here and emit intermediate <rowkey, value>
                // pairs; MyReducer turns them into KeyValues for the HFiles.
                // context.write(row, value);
            } catch (Exception e) {
                LOG.error(e);
            }  // catch
        }  // map function
    }
    

    The Reducer is as follows:

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.log4j.Logger;
    
    import java.io.IOException;
    
    public class MyReducer extends Reducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable, KeyValue> {
    
        static Logger LOG = Logger.getLogger(MyReducer.class);
    
        public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context) throws IOException, InterruptedException {
            try {
                // Aggregate the intermediate values for this rowkey and emit a single KeyValue.
                // "cf"/"val" are placeholder family/qualifier names; the real family is the one
                // the destination table was created with.
                StringBuilder sb = new StringBuilder();
                for (ImmutableBytesWritable value : values) {
                    sb.append(Bytes.toString(value.get()));
                }
                KeyValue kv = new KeyValue(key.get(), Bytes.toBytes("cf"), Bytes.toBytes("val"), Bytes.toBytes(sb.toString()));
                context.write(key, kv);
            } catch (Exception e) {
                LOG.error(e);
            }  // catch
        }  // reduce function
    }
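
    As a side note, a hand-written KeyValue reducer is not strictly required: HFileOutputFormat.configureIncrementalLoad looks at the job's map output value class and installs KeyValueSortReducer or PutSortReducer automatically when that class is KeyValue or Put. A minimal sketch of the Put-emitting variant (the cf family and val qualifier are placeholder names):

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    
    import java.io.IOException;
    
    // Emits <rowkey, Put>; with Put as the map output value class,
    // configureIncrementalLoad sets PutSortReducer as the reducer.
    public class PutEmittingMapper extends TableMapper<ImmutableBytesWritable, Put> {
    
        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException, InterruptedException {
            byte[] firstValue = values.value();   // value of the first cell in the source row
            if (firstValue == null) {
                return;                           // nothing to copy for this row
            }
            Put put = new Put(row.get());
            // Copy the first cell of the source row into the placeholder cf:val column.
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("val"), firstValue);
            context.write(row, put);
        }
    }

    In that case the mapper would be registered with Put.class as the map output value class, e.g. TableMapReduceUtil.initTableMapperJob(srcTableName, scan, PutEmittingMapper.class, ImmutableBytesWritable.class, Put.class, job), and job.setReducerClass would be omitted.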
    

    Job and BulkLoad:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.util.Date;
    
    public abstract class JobBulkLoad {
    
        public void run(String[] args) throws Exception {
            try {
                if (args.length < 2) {
                    System.err.println("usage: JobBulkLoad <srcTableName> <destTableName>");
                    System.exit(-1);
                    return;
                }
    
                String srcTableName = args[0];
                String destTableName = args[1];
                // Pre-create the destination table (in-house helper; pre-splits the table into 20 regions).
                TableCreator.createTable(20, true, destTableName);
    
                // Set HBase connection parameters.
                HBaseConfiguration conf = new HBaseConfiguration();
                conf.set("hbase.zookeeper.quorum", "yp-name02,yp-name01,yp-data01");
    //          conf.set("hbase.zookeeper.quorum", "nn01, nn02, dn01");
                conf.set("hbase.zookeeper.property.clientPort", "2181");
    
                // Set Job parameters.
                Job job = new Job(conf, "hbase2hbase-bulkload");
                job.setJarByClass(JobBulkLoad.class);
                // The destination table's region count determines the number of reducers
                // and the rowkey range each reducer covers.
                HTable htable = new HTable(conf, destTableName);
    
                // ----------------------------------------------------------------------------------------
                Scan scan = new Scan();
                scan.setCaching(2500);
                scan.setCacheBlocks(false);
                TableMapReduceUtil.initTableMapperJob(srcTableName, scan, MyMapper.class, ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
    //          TableMapReduceUtil.initTableReducerJob(destTableName, Common_Reducer.class, job);
    
                job.setReducerClass(MyReducer.class);
                Date now = new Date();
                Path output = new Path("/output/" + destTableName + "/" + now.getTime());
                System.out.println("/output/" + destTableName + "/" + now.getTime());
    
                // Write HFiles under the output path, partitioned to match the destination table's regions.
                FileOutputFormat.setOutputPath(job, output);
                HFileOutputFormat.configureIncrementalLoad(job, htable);
                job.waitForCompletion(true);
    
    //-----  Run BulkLoad  -------------------------------------------------------------------------------
                // Relax permissions on the generated HFiles (in-house helper) so HBase can move them,
                // then load them into the destination table.
                HdfsUtil.chmod(conf, output.toString());
                HdfsUtil.chmod(conf, output + "/" + YeepayConstant.COMMON_FAMILY);
                htable = new HTable(conf, destTableName);
                new LoadIncrementalHFiles(conf).doBulkLoad(output, htable);
                System.out.println("HFile data load success!");
            } catch (Throwable t) {
                throw new RuntimeException(t);
            }
        }
    
    }
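
    TableCreator, HdfsUtil and YeepayConstant are in-house helpers whose source is not shown in the post. Purely as a hypothetical reconstruction of what they might do, based on how they are called above (the cf family name and the meaning of the boolean flag are guesses):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    
    import java.io.IOException;
    
    class TableCreator {
        // Create a pre-split destination table so that neither the bulk load nor later
        // writes trigger region splits; the boolean is assumed to mean "drop and recreate".
        static void createTable(int regionCount, boolean dropIfExists, String tableName) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            if (dropIfExists && admin.tableExists(tableName)) {
                admin.disableTable(tableName);
                admin.deleteTable(tableName);
            }
            HTableDescriptor desc = new HTableDescriptor(tableName);
            desc.addFamily(new HColumnDescriptor("cf"));   // placeholder column family
            byte[][] splitKeys = new byte[regionCount - 1][];
            for (int i = 1; i < regionCount; i++) {
                // evenly spaced single-byte split points; real split keys should follow
                // the table's actual rowkey design
                splitKeys[i - 1] = new byte[] { (byte) (i * 256 / regionCount) };
            }
            admin.createTable(desc, splitKeys);
            admin.close();
        }
    }
    
    class HdfsUtil {
        // Open up permissions on the generated HFile directories so the HBase user
        // can read and move them during doBulkLoad.
        static void chmod(Configuration conf, String dir) throws IOException {
            FileSystem fs = FileSystem.get(conf);
            fs.setPermission(new Path(dir), new FsPermission((short) 0777));
        }
    }

    Note that the generated HFiles can also be loaded from the command line with HBase's completebulkload tool instead of calling LoadIncrementalHFiles programmatically.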
    


  • Original article: https://www.cnblogs.com/gcczhongduan/p/4056640.html