• BulkLoad: a performance optimization for HBase MapReduce jobs


    As we know, when bulk-loading a massive dataset into HBase for the first time, BulkLoad is the method of choice.

    A quick overview of how BulkLoad works: (1) a MapReduce job formats its output, on the map or reduce side, into HFiles, HBase's underlying storage file format; (2) BulkLoad is then invoked to import the HFiles produced by that first job into the target HBase table.

    Note: (1) The HFile approach is the fastest of all loading methods, but only when the data is the very first import, i.e. the table is empty. If the table already contains data, importing HFiles again can trigger region splits. (2) For the final output, whether it comes from the map or the reduce side, it is recommended to use only <ImmutableBytesWritable, KeyValue>.


    Now to the main topic. BulkLoad is indeed the fastest way to write into HBase, but when we run business analysis over data that is already stored in HBase, we often fall back to the ordinary HBase write path, as in the demo below:

    import com.yeepay.bigdata.bulkload.TableCreator;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.log4j.Logger;
    
    import java.io.IOException;
    
    public class HBaseMapReduceDemo {
    
        static Logger LOG = Logger.getLogger(HBaseMapReduceDemo.class);
    
        // Mapper1 receives one (rowkey, Result) pair per row of the source table and
        // emits intermediate <rowkey, value> pairs; the body is left as a placeholder.
        static class Mapper1 extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {

            @Override
            public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException, InterruptedException {

                try {
                //  context.write(key, value);
                } catch (Exception e) {
                    LOG.error(e);
                }
            }
        }
    
        // Reducer1 builds one Put per output rowkey and writes it to the destination
        // table through the normal write path (TableOutputFormat).
        public static class Reducer1 extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {

            public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context) throws IOException, InterruptedException {
                try {

                    Put put = new Put(key.get());
                    // put.add(family, qualifier, value);
                    context.write(key, put);

                } catch (Exception e) {
                    LOG.error(e);
                    return ;
                }  // catch
            }  // reduce function
        }  // reduce class
    
        public static void main(String[] args) throws Exception {

            // ZooKeeper quorum and client port identify the HBase cluster.
            HBaseConfiguration conf = new HBaseConfiguration();
            conf.set("hbase.zookeeper.quorum", "yp-name02,yp-name01,yp-data01");
            conf.set("hbase.zookeeper.property.clientPort", "2181");
            // conf.set(TableInputFormat.INPUT_TABLE,"access_logs");
            Job job = new Job(conf, "HBaseMapReduceDemo");
            job.setJarByClass(HBaseMapReduceDemo.class);
    //        job.setNumReduceTasks(2);

            // Full scan of the source table; a large scanner cache and a disabled
            // block cache are the usual settings for MapReduce scans.
            Scan scan = new Scan();
            scan.setCaching(2500);
            scan.setCacheBlocks(false);

            // Read srcHBaseTableName with Mapper1 and write Puts to destHBasetableName with Reducer1.
            TableMapReduceUtil.initTableMapperJob("srcHBaseTableName", scan, Mapper1.class, ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
    //        TableCreator.createTable(20, true, "OP_SUM");
            TableMapReduceUtil.initTableReducerJob("destHBasetableName", Reducer1.class, job);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    
    
    }
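
    The map and reduce bodies above are left as placeholders. Purely as an illustrative sketch (the cf family and count qualifier below are made-up names, not taken from the original job), a reduce that counts the intermediate values per rowkey and writes the count through the normal Put path might look like this:

        // Hypothetical filled-in body for Reducer1.reduce above; "cf" and "count"
        // are placeholder family/qualifier names.
        public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context) throws IOException, InterruptedException {
            long count = 0;
            for (ImmutableBytesWritable ignored : values) {
                count++;                                  // count the intermediate values for this rowkey
            }
            Put put = new Put(key.get());                 // one Put per output rowkey
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count));  // old 0.9x Put.add API
            context.write(key, put);                      // TableReducer sends the Put to the destination table
        }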
    
    With this approach, inserting massive amounts of data keeps triggering region splits and the write speed becomes extremely slow. It is, however, suitable when you are modifying a table that already holds data.

    For the scenario HBase -> MapReduce analysis -> new table, we therefore use the pipeline HBase -> MapReduce analysis -> bulkload -> new table instead.

    The demo is as follows:

    The Mapper is as follows:

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.log4j.Logger;
    
    import java.io.IOException;
    
    public class MyMapper extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
    
        static Logger LOG = Logger.getLogger(MyMapper.class);
    
        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException, InterruptedException {
            try {
                // Analyse the source row (values) here and emit intermediate <rowkey, value>
                // pairs; MyReducer turns them into KeyValues for the HFiles.
                // context.write(row, value);
            } catch (Exception e) {
                LOG.error(e);
            }  // catch
        }  // map function
    }
    

    The Reducer is as follows:

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.log4j.Logger;
    
    import java.io.IOException;
    
    public class MyReducer extends Reducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable, KeyValue> {
    
        static Logger LOG = Logger.getLogger(MyReducer.class);
    
        public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context) throws IOException, InterruptedException {
            try {
                // Aggregate the intermediate values for this rowkey and emit a single KeyValue.
                // "cf"/"val" are placeholder family/qualifier names; the real family is the one
                // the destination table was created with.
                StringBuilder sb = new StringBuilder();
                for (ImmutableBytesWritable value : values) {
                    sb.append(Bytes.toString(value.get()));
                }
                KeyValue kv = new KeyValue(key.get(), Bytes.toBytes("cf"), Bytes.toBytes("val"), Bytes.toBytes(sb.toString()));
                context.write(key, kv);
            } catch (Exception e) {
                LOG.error(e);
            }  // catch
        }  // reduce function
    }
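
    As a side note, a hand-written KeyValue reducer is not strictly required: HFileOutputFormat.configureIncrementalLoad looks at the job's map output value class and installs KeyValueSortReducer or PutSortReducer automatically when that class is KeyValue or Put. A minimal sketch of the Put-emitting variant (the cf family and val qualifier are placeholder names):

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    
    import java.io.IOException;
    
    // Emits <rowkey, Put>; with Put as the map output value class,
    // configureIncrementalLoad sets PutSortReducer as the reducer.
    public class PutEmittingMapper extends TableMapper<ImmutableBytesWritable, Put> {
    
        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException, InterruptedException {
            byte[] firstValue = values.value();   // value of the first cell in the source row
            if (firstValue == null) {
                return;                           // nothing to copy for this row
            }
            Put put = new Put(row.get());
            // Copy the first cell of the source row into the placeholder cf:val column.
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("val"), firstValue);
            context.write(row, put);
        }
    }

    In that case the mapper would be registered with Put.class as the map output value class, e.g. TableMapReduceUtil.initTableMapperJob(srcTableName, scan, PutEmittingMapper.class, ImmutableBytesWritable.class, Put.class, job), and job.setReducerClass would be omitted.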
    

    Job and BulkLoad:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.util.Date;
    
    public abstract class JobBulkLoad {
    
        public void run(String[] args) throws Exception {
            try {
                if (args.length < 2) {
                    System.err.println("usage: JobBulkLoad <srcTableName> <destTableName>");
                    System.exit(-1);
                    return;
                }
    
                String srcTableName = args[0];
                String destTableName = args[1];
                // Pre-create the destination table (in-house helper; pre-splits the table into 20 regions).
                TableCreator.createTable(20, true, destTableName);
    
                // Set HBase connection parameters.
                HBaseConfiguration conf = new HBaseConfiguration();
                conf.set("hbase.zookeeper.quorum", "yp-name02,yp-name01,yp-data01");
    //          conf.set("hbase.zookeeper.quorum", "nn01, nn02, dn01");
                conf.set("hbase.zookeeper.property.clientPort", "2181");
    
                // Set Job parameters.
                Job job = new Job(conf, "hbase2hbase-bulkload");
                job.setJarByClass(JobBulkLoad.class);
                // The destination table's region count determines the number of reducers
                // and the rowkey range each reducer covers.
                HTable htable = new HTable(conf, destTableName);
    
                // ----------------------------------------------------------------------------------------
                Scan scan = new Scan();
                scan.setCaching(2500);
                scan.setCacheBlocks(false);
                TableMapReduceUtil.initTableMapperJob(srcTableName, scan, MyMapper.class, ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
    //          TableMapReduceUtil.initTableReducerJob(destTableName, Common_Reducer.class, job);
    
                job.setReducerClass(MyReducer.class);
                Date now = new Date();
                Path output = new Path("/output/" + destTableName + "/" + now.getTime());
                System.out.println("/output/" + destTableName + "/" + now.getTime());
    
                // Write HFiles under the output path, partitioned to match the destination table's regions.
                FileOutputFormat.setOutputPath(job, output);
                HFileOutputFormat.configureIncrementalLoad(job, htable);
                job.waitForCompletion(true);
    
    //-----  Run BulkLoad  -------------------------------------------------------------------------------
                // Relax permissions on the generated HFiles (in-house helper) so HBase can move them,
                // then load them into the destination table.
                HdfsUtil.chmod(conf, output.toString());
                HdfsUtil.chmod(conf, output + "/" + YeepayConstant.COMMON_FAMILY);
                htable = new HTable(conf, destTableName);
                new LoadIncrementalHFiles(conf).doBulkLoad(output, htable);
                System.out.println("HFile data load success!");
            } catch (Throwable t) {
                throw new RuntimeException(t);
            }
        }
    
    }
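
    TableCreator, HdfsUtil and YeepayConstant are in-house helpers whose source is not shown in the post. Purely as a hypothetical reconstruction of what they might do, based on how they are called above (the cf family name and the meaning of the boolean flag are guesses):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    
    import java.io.IOException;
    
    class TableCreator {
        // Create a pre-split destination table so that neither the bulk load nor later
        // writes trigger region splits; the boolean is assumed to mean "drop and recreate".
        static void createTable(int regionCount, boolean dropIfExists, String tableName) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            if (dropIfExists && admin.tableExists(tableName)) {
                admin.disableTable(tableName);
                admin.deleteTable(tableName);
            }
            HTableDescriptor desc = new HTableDescriptor(tableName);
            desc.addFamily(new HColumnDescriptor("cf"));   // placeholder column family
            byte[][] splitKeys = new byte[regionCount - 1][];
            for (int i = 1; i < regionCount; i++) {
                // evenly spaced single-byte split points; real split keys should follow
                // the table's actual rowkey design
                splitKeys[i - 1] = new byte[] { (byte) (i * 256 / regionCount) };
            }
            admin.createTable(desc, splitKeys);
            admin.close();
        }
    }
    
    class HdfsUtil {
        // Open up permissions on the generated HFile directories so the HBase user
        // can read and move them during doBulkLoad.
        static void chmod(Configuration conf, String dir) throws IOException {
            FileSystem fs = FileSystem.get(conf);
            fs.setPermission(new Path(dir), new FsPermission((short) 0777));
        }
    }

    Note that the generated HFiles can also be loaded from the command line with HBase's completebulkload tool instead of calling LoadIncrementalHFiles programmatically.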
    


  • Original article: https://www.cnblogs.com/gcczhongduan/p/4056640.html