• HFileOutputFormat and TotalOrderPartitioner


    I recently needed to add random-read access for some data, so I chose to generate HFiles and bulk load them into HBase.

    When the job ran, the map phase finished quickly, but the reduce phase spent a long time in its sort stage. The reducer was KeyValueSortReducer, and there was only one of it, so a single reducer doing a total sort became the bottleneck. My plan was to use TotalOrderPartitioner so the MR job could run multiple reducers in parallel and eliminate that bottleneck.
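    To see why multiple reducers can still produce a totally ordered result, here is a minimal plain-Java sketch of what a total-order partitioner does (a conceptual stand-in for illustration, not Hadoop's actual TotalOrderPartitioner class; the keys are made up): k sorted split points route every key to one of k+1 reducers, and reducer i only receives keys that sort before everything sent to reducer i+1.

    ```java
    import java.util.Arrays;

    // Conceptual stand-in for a total-order partitioner: k sorted split
    // points divide the key space into k+1 contiguous ranges, one per reducer.
    public class TotalOrderDemo {

        // Returns the reducer index for a key: the number of split points
        // that sort at or before the key. (The real TotalOrderPartitioner
        // uses a binary search or trie over the partition file rather than
        // this linear scan.)
        static int partition(byte[] key, byte[][] sortedSplits) {
            int idx = 0;
            for (byte[] split : sortedSplits) {
                if (Arrays.compareUnsigned(key, split) >= 0) {
                    idx++;
                } else {
                    break;
                }
            }
            return idx; // in [0, sortedSplits.length]
        }

        public static void main(String[] args) {
            // 2 split points -> 3 reducers; concatenating the outputs of
            // reducers 0, 1, 2 yields a globally sorted result.
            byte[][] splits = { "g".getBytes(), "p".getBytes() };
            System.out.println(partition("apple".getBytes(), splits)); // 0
            System.out.println(partition("grape".getBytes(), splits)); // 1
            System.out.println(partition("zebra".getBytes(), splits)); // 2
        }
    }
    ```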

    So I wrote the code, using not only TotalOrderPartitioner but also InputSampler.RandomSampler to generate the partition file. I hit problems at execution time, and while searching for answers I happened to discover that HFileOutputFormat already uses TotalOrderPartitioner internally for its total sort:

    public static void configureIncrementalLoad(Job job, HTable table)
        throws IOException {
      Configuration conf = job.getConfiguration();
      Class<? extends Partitioner> topClass;
      try {
        topClass = getTotalOrderPartitionerClass();
      } catch (ClassNotFoundException e) {
        throw new IOException("Failed getting TotalOrderPartitioner", e);
      }
      job.setPartitionerClass(topClass);
      ......
    

    The partition file's contents are simply the regions' start keys (with the smallest one removed):

    private static void writePartitions(Configuration conf, Path partitionsPath,
        List<ImmutableBytesWritable> startKeys) throws IOException {
      if (startKeys.isEmpty()) {
        throw new IllegalArgumentException("No regions passed");
      }

      // We're generating a list of split points, and we don't ever
      // have keys < the first region (which has an empty start key)
      // so we need to remove it. Otherwise we would end up with an
      // empty reducer with index 0
      // (No rowkey sorts before the smallest startKey, hence the removal.)
      TreeSet<ImmutableBytesWritable> sorted =
          new TreeSet<ImmutableBytesWritable>(startKeys);

      ImmutableBytesWritable first = sorted.first();
      // If the smallest region startKey is not the "legal" minimum rowkey,
      // throw an exception
      if (!first.equals(HConstants.EMPTY_BYTE_ARRAY)) {
        throw new IllegalArgumentException(
            "First region of table should have empty start key. Instead has: "
            + Bytes.toStringBinary(first.get()));
      }
      sorted.remove(first);

      // Write the actual file
      FileSystem fs = partitionsPath.getFileSystem(conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(fs,
          conf, partitionsPath, ImmutableBytesWritable.class, NullWritable.class);

      try {
        // Append each remaining startKey to the partition file
        for (ImmutableBytesWritable startKey : sorted) {
          writer.append(startKey, NullWritable.get());
        }
      } finally {
        writer.close();
      }
    }
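    Stripped of the Hadoop types, the core of the logic above can be reproduced in a few lines of plain Java (a sketch for illustration, with raw byte[] standing in for ImmutableBytesWritable): given N region start keys, dropping the mandatory empty first key leaves N-1 split points, which is exactly what makes the job run N reducers.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.TreeSet;

    // Plain-Java sketch of writePartitions' core logic, using byte[] in
    // place of ImmutableBytesWritable.
    public class PartitionSketch {

        static List<byte[]> splitPointsFromStartKeys(List<byte[]> startKeys) {
            TreeSet<byte[]> sorted = new TreeSet<>(Arrays::compareUnsigned);
            sorted.addAll(startKeys);
            byte[] first = sorted.first();
            // The first region's start key must be the empty byte array.
            if (first.length != 0) {
                throw new IllegalArgumentException(
                    "First region of table should have empty start key");
            }
            // No rowkey sorts before the empty start key, so it is dropped;
            // otherwise reducer 0 would always be empty.
            sorted.remove(first);
            return new ArrayList<>(sorted);
        }

        public static void main(String[] args) {
            List<byte[]> startKeys = Arrays.asList(
                new byte[0], "p".getBytes(), "g".getBytes()); // 3 regions
            List<byte[]> splits = splitPointsFromStartKeys(startKeys);
            System.out.println(splits.size()); // 2 split points -> 3 reducers
        }
    }
    ```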
    

    Since my tables were all new, each with a single region, there was bound to be only one reducer.

    So when using HFileOutputFormat, the number of reducers equals the number of regions in the HTable. If you are importing a huge amount of data by bulk loading HFiles, the best approach is to define the regions up front when creating the table. This technique is called Pre-Creating Regions, and PCR brings other optimizations as well, such as fewer region splits: some of Taobao's optimizations apply PCR together with disabling automatic splits, then split manually when the system is idle, so that splits never pile extra load onto an already busy system.
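    One common way to realize the "disable automatic splits" part (an assumption about the configuration involved, not something stated in the original post) is to raise the region split threshold so high that it is never reached:

    ```xml
    <!-- hbase-site.xml: raise the split threshold so regions effectively
         never split automatically; splits are then triggered manually -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>107374182400</value> <!-- 100 GB -->
    </property>
    ```

    A manual split can later be issued from the HBase shell with `split 'table_name'` when the cluster is idle.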

    On Pre-Creating Regions, see: http://hbase.apache.org/book.html#precreate.regions

     11.7.2. Table Creation: Pre-Creating Regions

     Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too-many regions can actually degrade performance.

     There are two different approaches to pre-creating splits. The first approach is to rely on the default HBaseAdmin strategy (which is implemented in Bytes.split)...

    byte[] startKey = ...;      // your lowest key
    byte[] endKey = ...;        // your highest key
    int numberOfRegions = ...;  // # of regions to create
    admin.createTable(table, startKey, endKey, numberOfRegions);

    And the other approach is to define the splits yourself...

    byte[][] splits = ...;   // create your own splits
    admin.createTable(table, splits);
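    The default HBaseAdmin strategy mentioned above interpolates evenly spaced split keys between startKey and endKey. Here is a simplified, self-contained sketch of that idea (an illustration, not the real Bytes.split: it assumes fixed-width keys treated as unsigned integers, whereas the real implementation handles arbitrary keys):

    ```java
    import java.math.BigInteger;

    // Simplified sketch of even split-key generation: treat fixed-width
    // start/end keys as unsigned integers and interpolate
    // numRegions - 1 evenly spaced split keys between them.
    public class EvenSplits {

        static byte[][] split(byte[] start, byte[] end, int numRegions) {
            BigInteger lo = new BigInteger(1, start);
            BigInteger range = new BigInteger(1, end).subtract(lo);
            byte[][] splits = new byte[numRegions - 1][];
            for (int i = 1; i < numRegions; i++) {
                BigInteger v = lo.add(range.multiply(BigInteger.valueOf(i))
                                           .divide(BigInteger.valueOf(numRegions)));
                splits[i - 1] = toFixedWidth(v, start.length);
            }
            return splits;
        }

        // Right-align the magnitude bytes into a fixed-width key,
        // dropping any BigInteger sign byte and zero-padding on the left.
        static byte[] toFixedWidth(BigInteger v, int width) {
            byte[] raw = v.toByteArray();
            byte[] out = new byte[width];
            int copy = Math.min(raw.length, width);
            System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
            return out;
        }

        public static void main(String[] args) {
            // 4 regions over the one-byte key space -> 3 split keys
            byte[][] s = split(new byte[]{0x00}, new byte[]{(byte) 0xFF}, 4);
            for (byte[] k : s) {
                System.out.printf("%02x%n", k[0] & 0xFF); // 3f, 7f, bf
            }
        }
    }
    ```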
    
  • Original post: https://www.cnblogs.com/aprilrain/p/2985064.html