How to Read a Salted Table in HBase: The Coprocessor Approach


    In an earlier article we introduced three common ways to avoid data hotspotting:

    • Salting
    • Hashing
    • Reversing

    The salting approach was described like this: assign a random prefix to the Rowkey so that the data sorts differently than before. But once a random prefix has been added in front of the Rowkey, how do we read that data back out? I will cover how to read a salted table in three articles, each presenting one approach:

    • Reading a salted table with a coprocessor
    • Reading a salted table with Spark
    • Reading a salted table with MapReduce

    For an introduction to HBase coprocessors and hands-on examples, see 《HBase协处理器入门及实战》. The component versions used in this article are: Hadoop 2.7.7, HBase 2.0.4, and JDK 1.8.0_201.

    Generating Test Data

    Before looking at how to query the data, let's first create an HBase table named iteblog for testing. To keep the data evenly distributed and to simplify the walkthrough, the table is pre-split here into 27 Regions, as follows:

    hbase(main):002:0> create 'iteblog', 'f', SPLITS => ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
    0 row(s) in 2.4880 seconds

    We then use the following program to generate 1,000,000 rows of test data. The RowKey takes the form UID + data-generation timestamp; since the UID is only 4 digits long, the 1,000,000 rows contain many rows with identical UIDs, so we use salting to spread the data evenly across the 27 Regions above (note that the first Region, the one before split point 'A', actually holds no data, because every salt prefix is a capital letter). The code is as follows:

    package com.iteblog.data;
     
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;
     
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.UUID;
     
    public class HBaseDataGenerator {
        private static byte[] FAMILY = "f".getBytes();
        private static byte[] QUALIFIER_UUID = "uuid".getBytes();
        private static byte[] QUALIFIER_AGE = "age".getBytes();
     
        // Randomly pick a salt prefix letter between 'A' and 'Z'.
        private static char generateLetter() {
            return (char) (Math.random() * 26 + 'A');
        }
     
        // Generate a random n-digit UID (here n = 4, so values range from 1000 to 9999).
        private static long generateUid(int n) {
            return (long) (Math.random() * 9 * Math.pow(10, n - 1)) + (long) Math.pow(10, n - 1);
        }
     
        public static void main(String[] args) throws IOException {
            BufferedMutatorParams bmp = new BufferedMutatorParams(TableName.valueOf("iteblog"));
            bmp.writeBufferSize(1024 * 1024 * 24);
     
            Configuration conf = HBaseConfiguration.create();
            conf.set(HConstants.ZOOKEEPER_QUORUM, "www.iteblog.com:2181");
            Connection connection = ConnectionFactory.createConnection(conf);
     
            BufferedMutator bufferedMutator = connection.getBufferedMutator(bmp);
     
            int BATCH_SIZE = 1000;
            int COUNTS = 1000000;
            int count = 0;
            List<Put> putList = new ArrayList<>();
     
            for (int i = 0; i < COUNTS; i++) {
                // RowKey format: salt letter + "-" + 4-digit UID + "-" + current timestamp.
                String rowKey = generateLetter() + "-"
                        + generateUid(4) + "-"
                        + System.currentTimeMillis();
     
                Put put = new Put(Bytes.toBytes(rowKey));
                byte[] uuidBytes = UUID.randomUUID().toString().substring(0, 23).getBytes();
                put.addColumn(FAMILY, QUALIFIER_UUID, uuidBytes);
                put.addColumn(FAMILY, QUALIFIER_AGE, Bytes.toBytes("" + new Random().nextInt(100)));
                putList.add(put);
                count++;
     
                if (count % BATCH_SIZE == 0) {
                    bufferedMutator.mutate(putList);
                    bufferedMutator.flush();
                    putList.clear();
                    System.out.println(count);
                }
            }
     
            if (putList.size() > 0) {
                bufferedMutator.mutate(putList);
                bufferedMutator.flush();
                putList.clear();
            }
     
            // Close the mutator and the connection to release client resources.
            bufferedMutator.close();
            connection.close();
        }
    }

    After running the code above, roughly 1,000,000 rows are generated (strictly speaking this is not exact: because of the RowKey design, duplicate RowKeys can be produced, so the actual row count may be slightly below 1,000,000). Let's scan the first 10 rows to see what the data looks like:

    hbase(main):001:0> scan 'iteblog', {'LIMIT'=>10}
    ROW                        COLUMN+CELL
     A-1000-1550572395399      column=f:age, timestamp=1549091990253, value=54
     A-1000-1550572395399      column=f:uuid, timestamp=1549091990253, value=e9b10a9f-1218-43fd-bd01
     A-1000-1550572413799      column=f:age, timestamp=1549092008575, value=4
     A-1000-1550572413799      column=f:uuid, timestamp=1549092008575, value=181aa91e-5f1d-454c-959c
     A-1000-1550572414761      column=f:age, timestamp=1549092009531, value=33
     A-1000-1550572414761      column=f:uuid, timestamp=1549092009531, value=19aad8d3-621a-473c-8f9f
     A-1001-1550572394570      column=f:age, timestamp=1549091989341, value=64
     A-1001-1550572394570      column=f:uuid, timestamp=1549091989341, value=c6712a0d-3793-46d5-865b
     A-1001-1550572405337      column=f:age, timestamp=1549092000108, value=96
     A-1001-1550572405337      column=f:uuid, timestamp=1549092000108, value=4bf05d10-bb4d-43e3-9957
     A-1001-1550572419688      column=f:age, timestamp=1549092014458, value=8
     A-1001-1550572419688      column=f:uuid, timestamp=1549092014458, value=f04ba835-d8ac-49a3-8f96
     A-1002-1550572424041      column=f:age, timestamp=1549092018816, value=84
     A-1002-1550572424041      column=f:uuid, timestamp=1549092018816, value=99d6c989-afb5-4101-9d95
     A-1003-1550572431830      column=f:age, timestamp=1549092026605, value=21
     A-1003-1550572431830      column=f:uuid, timestamp=1549092026605, value=8c1ff1b6-b97c-4059-9b68
     A-1004-1550572395399      column=f:age, timestamp=1549091990253, value=2
     A-1004-1550572395399      column=f:uuid, timestamp=1549091990253, value=e240aa0f-c044-452f-89c0
     A-1004-1550572403783      column=f:age, timestamp=1549091998555, value=6
     A-1004-1550572403783      column=f:uuid, timestamp=1549091998555, value=e8df15c9-02fa-458e-bd0c
    10 row(s)
    Took 0.1104 seconds

    Querying the Salted Table with a Coprocessor

    Now that we have data, suppose we need to query all of the historical data for the user whose UID = 1000. How do we do that? A plain scan such as STARTROW => '1000', ENDROW => '1001' finds nothing, because every stored Rowkey now begins with a salt letter. We know that the UID = 1000 rows are spread evenly across the 27 Regions above and, because of the salting, they begin with A-, B-, C-, and so on. We also know that every Region has a Start Key and an End Key, and that these are exactly the split points we specified when creating the iteblog table. If you have read 《HBase协处理器入门及实战》, you know that coprocessor code runs inside each Region, and while running there it can obtain the current Region's information, including its start key and end key. So we can splice the Region's start key together with the UID being queried to form the complete salted Rowkey range, and then scan exactly the data we want. The coprocessor in this article queries the salted data based on exactly this idea.
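
    As a minimal, self-contained sketch of that splicing (the class and method names below are illustrative and are not part of this article's code), the following builds the per-Region Scan from a Region's start key, i.e. the salt prefix, and the un-salted query range; the full coprocessor implementation follows in the next sections.

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;
     
    public class SaltedRangeSketch {
        // Prepend the Region's start key (the salt) to the un-salted start/stop rows.
        // The start key is empty for the first Region, which holds no salted data anyway.
        static Scan buildRegionScan(String regionStartKey, String startRow, String stopRow) {
            if (regionStartKey != null && !regionStartKey.isEmpty()) {
                startRow = regionStartKey + "-" + startRow;  // e.g. "B" + "-" + "1000" -> "B-1000"
                stopRow = regionStartKey + "-" + stopRow;    // e.g. "B-1001"
            }
            return new Scan()
                    .withStartRow(Bytes.toBytes(startRow))
                    .withStopRow(Bytes.toBytes(stopRow), false);
        }
     
        public static void main(String[] args) {
            // Inside the Region whose start key is "B", a query for UID 1000 scans ["B-1000", "B-1001").
            System.out.println(buildRegionScan("B", "1000", "1001"));
        }
    }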

    Defining the proto File

    For why this file is needed, see 《HBase协处理器入门及实战》. The query needs to carry parameters such as the table name, the StartKey, the EndKey, and a flag indicating whether the table is salted; and once results are found, we also need to send the data back. So the proto file is defined as follows:

    option java_package = "com.iteblog.data.coprocessor.generated";
    option java_outer_classname = "DataQueryProtos";
    option java_generic_services = true;
    option java_generate_equals_and_hash = true;
    option optimize_for = SPEED;
     
    message DataQueryRequest {
      optional string tableName = 1;
      optional string startRow = 2;
      optional string endRow = 3;
      optional bool  incluedEnd = 4;
      optional bool  isSalting = 5;
    }
     
    message DataQueryResponse {
      message Cell{
        required bytes value = 1;
        required bytes family = 2;
        required bytes qualifier = 3;
        required bytes row = 4;
        required int64 timestamp = 5;
      }
     
      message Row{
        optional bytes rowKey = 1;
        repeated Cell cellList = 2;
      }
     
      repeated Row rowList = 1;
    }
     
    service QueryDataService{
      rpc queryByStartRowAndEndRow(DataQueryRequest)
        returns (DataQueryResponse);
    }

    We then use the protobuf-maven-plugin to generate the Java classes from the proto file above; for the details, see 《在IDEA中使用Maven编译proto文件》. Copy the generated DataQueryProtos.java class into the com.iteblog.data.coprocessor.generated package.

    Writing the Coprocessor Code

    With the request and response classes in place, we now need to write the coprocessor's handling code. Following the analysis above, the coprocessor is implemented as follows:

    package com.iteblog.data.coprocessor;
     
    import com.google.protobuf.ByteString;
    import com.google.protobuf.RpcCallback;
    import com.google.protobuf.RpcController;
    import com.google.protobuf.Service;
    import com.iteblog.data.coprocessor.generated.DataQueryProtos.QueryDataService;
    import com.iteblog.data.coprocessor.generated.DataQueryProtos.DataQueryRequest;
    import com.iteblog.data.coprocessor.generated.DataQueryProtos.DataQueryResponse;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CoprocessorEnvironment;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.coprocessor.CoprocessorException;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessor;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.InternalScanner;
    import org.apache.hadoop.hbase.shaded.protobuf.ResponseConverter;
    import org.apache.hadoop.hbase.util.Bytes;
     
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
     
    public class SlatTableDataSearch extends QueryDataService implements RegionCoprocessor {
        private RegionCoprocessorEnvironment env;
     
        public Iterable<Service> getServices() {
            return Collections.singleton(this);
        }
     
        @Override
        public void queryByStartRowAndEndRow(RpcController controller,
                                             DataQueryRequest request,
                                             RpcCallback<DataQueryResponse> done) {
            DataQueryResponse response = null;
     
            String startRow = request.getStartRow();
            String endRow = request.getEndRow();
            String regionStartKey = Bytes.toString(this.env.getRegion().getRegionInfo().getStartKey());
     
            // When the table is salted, prepend the current Region's start key (the salt prefix)
            // to both the start row and the end row passed in by the client.
            if (request.getIsSalting()) {
                String startSalt = null;
                if (null != regionStartKey && regionStartKey.length() != 0) {
                    startSalt = regionStartKey;
                }
                if (null != startSalt && null != startRow) {
                    startRow = startSalt + "-" + startRow;
                    endRow = startSalt + "-" + endRow;
                }
            }
     
            Scan scan = new Scan();
            if (null != startRow) {
                scan.withStartRow(Bytes.toBytes(startRow));
            }
     
            if (null != endRow) {
                scan.withStopRow(Bytes.toBytes(endRow), request.getIncluedEnd());
            }
     
            try (InternalScanner scanner = this.env.getRegion().getScanner(scan)) {
                List<Cell> results = new ArrayList<>();
     
                boolean hasMore;
                DataQueryResponse.Builder responseBuilder = DataQueryResponse.newBuilder();
                do {
                    hasMore = scanner.next(results);
                    if (results.size() > 0) {
                        DataQueryResponse.Row.Builder rowBuilder = DataQueryResponse.Row.newBuilder();
                        Cell cell = results.get(0);
                        rowBuilder.setRowKey(ByteString.copyFrom(cell.getRowArray(), cell.getRowOffset(), cell.getRowLength()));
                        for (Cell kv : results) {
                            buildCell(rowBuilder, kv);
                        }
                        // Only add rows that actually contain cells to the response.
                        responseBuilder.addRowList(rowBuilder);
                    }
                    results.clear();
                } while (hasMore);
     
                response = responseBuilder.build();
     
            } catch (IOException e) {
                ResponseConverter.setControllerException(controller, e);
            }
            done.run(response);
        }
     
        private void buildCell(DataQueryResponse.Row.Builder rowBuilder, Cell kv) {
            DataQueryResponse.Cell.Builder cellBuilder = DataQueryResponse.Cell.newBuilder();
            cellBuilder.setFamily(ByteString.copyFrom(kv.getFamilyArray(), kv.getFamilyOffset(), kv.getFamilyLength()));
            cellBuilder.setQualifier(ByteString.copyFrom(kv.getQualifierArray(), kv.getQualifierOffset(), kv.getQualifierLength()));
            cellBuilder.setRow(ByteString.copyFrom(kv.getRowArray(), kv.getRowOffset(), kv.getRowLength()));
            cellBuilder.setValue(ByteString.copyFrom(kv.getValueArray(), kv.getValueOffset(), kv.getValueLength()));
            cellBuilder.setTimestamp(kv.getTimestamp());
            rowBuilder.addCellList(cellBuilder);
        }
     
        /**
         * Stores a reference to the coprocessor environment provided by the
         * {@link org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost} from the region where this
         * coprocessor is loaded.  Since this is a coprocessor endpoint, it always expects to be loaded
         * on a table region, so always expects this to be an instance of
         * {@link RegionCoprocessorEnvironment}.
         *
         * @param env the environment provided by the coprocessor host
         * @throws IOException if the provided environment is not an instance of
         *                     {@code RegionCoprocessorEnvironment}
         */
        @Override
        public void start(CoprocessorEnvironment env) throws IOException {
            if (env instanceof RegionCoprocessorEnvironment) {
                this.env = (RegionCoprocessorEnvironment) env;
            } else {
                throw new CoprocessorException("Must be loaded on a table region!");
            }
        }
     
        @Override
        public void stop(CoprocessorEnvironment env) {
            // nothing to do
        }
    }

    As you can see, the structure of this code is very similar to HBase's own RowCountEndpoint example introduced in 《HBase协处理器入门及实战》. The main logic lives in the queryByStartRowAndEndRow method. From the DataQueryRequest we obtain the table, the StartKey, the EndKey, and the other parameters sent by the client. this.env.getRegion().getRegionInfo().getStartKey() gives us the current Region's StartKey, and splicing it together with the StartKey and EndKey passed in by the client yields the complete salted Rowkey range. The rest is ordinary HBase scan code.

    Now we compile and package the SlatTableDataSearch class and deploy it to the HBase table; for the deployment steps, see 《HBase协处理器入门及实战》.
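
    For reference, one common way to attach an endpoint coprocessor to a single table is the HBase shell's alter command with a table_att attribute. The command below is only a sketch: the HDFS jar path and the priority value 1001 are placeholders to be replaced with your own values.

    hbase(main):001:0> alter 'iteblog', METHOD => 'table_att', 'coprocessor' => 'hdfs:///path/to/iteblog-coprocessor.jar|com.iteblog.data.coprocessor.SlatTableDataSearch|1001|'

    Once the alter completes, describe 'iteblog' should list the coprocessor among the table's attributes; if the jar path or class name is wrong, the table's Regions may fail to open, so it is worth double-checking both before running the client.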

    Writing the Coprocessor Client Code

    At this point the server-side coprocessor code has been written and deployed, so we now need to write the coprocessor client code. It is actually quite simple:

    package com.iteblog.data;
     
    import com.iteblog.data.coprocessor.generated.DataQueryProtos.QueryDataService;
    import com.iteblog.data.coprocessor.generated.DataQueryProtos.DataQueryRequest;
    import com.iteblog.data.coprocessor.generated.DataQueryProtos.DataQueryResponse;
    import com.iteblog.data.coprocessor.generated.DataQueryProtos.DataQueryResponse.*;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.ipc.CoprocessorRpcUtils.BlockingRpcCallback;
    import org.apache.hadoop.hbase.ipc.ServerRpcController;
     
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;
     
    public class DataQuery {
        private static Configuration conf = null;
     
        static {
            conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "www.iteblog.com:2181");
        }
     
        static List<Row> queryByStartRowAndStopRow(String tableName,
                                                   String startRow, String stopRow,
                                                   boolean isIncludeEnd, boolean isSalting) {
     
            final DataQueryRequest.Builder requestBuilder = DataQueryRequest.newBuilder();
            requestBuilder.setTableName(tableName);
            requestBuilder.setStartRow(startRow);
            requestBuilder.setEndRow(stopRow);
            requestBuilder.setIncluedEnd(isIncludeEnd);
            requestBuilder.setIsSalting(isSalting);
     
            // Use try-with-resources so the connection and table are closed automatically.
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 HTable table = (HTable) connection.getTable(TableName.valueOf(tableName))) {
                Map<byte[], List<Row>> result = table.coprocessorService(QueryDataService.class,
                        null, null, counter -> {
                            ServerRpcController controller = new ServerRpcController();
                            BlockingRpcCallback<DataQueryResponse> call = new BlockingRpcCallback<>();
                            counter.queryByStartRowAndEndRow(controller, requestBuilder.build(), call);
                            DataQueryResponse response = call.get();
     
                            if (controller.failedOnException()) {
                                throw controller.getFailedOn();
                            }
     
                            return response.getRowListList();
                        });
     
                List<Row> list = new LinkedList<>();
                for (Map.Entry<byte[], List<Row>> entry : result.entrySet()) {
                    if (null != entry.getKey()) {
                        list.addAll(entry.getValue());
                    }
                }
                return list;
            } catch (Throwable e) {
                e.printStackTrace();
            }
            return null;
     
        }
     
        public static void main(String[] args) {
            List<Row> rows = queryByStartRowAndStopRow("iteblog", "1000", "1001", false, true);
            if (null != rows) {
                System.out.println(rows.size());
                for (DataQueryResponse.Row row : rows) {
                    List<DataQueryResponse.Cell> cellListList = row.getCellListList();
                    for (DataQueryResponse.Cell cell : cellListList) {
                        System.out.println(row.getRowKey().toStringUtf8() + " " +
                                "column=" + cell.getFamily().toStringUtf8() +
                                ":" + cell.getQualifier().toStringUtf8() + ", " +
                                "timestamp=" + cell.getTimestamp() + ", " +
                                "value=" + cell.getValue().toStringUtf8());
                    }
                }
            }
        }
    }

    Running the code above produces the following output:

    A-1000-1550572395399     column=f:age, timestamp=1549091990253, value=54
    A-1000-1550572395399     column=f:uuid, timestamp=1549091990253, value=e9b10a9f-1218-43fd-bd01
    A-1000-1550572413799     column=f:age, timestamp=1549092008575, value=4
    A-1000-1550572413799     column=f:uuid, timestamp=1549092008575, value=181aa91e-5f1d-454c-959c
    A-1000-1550572414761     column=f:age, timestamp=1549092009531, value=33
    A-1000-1550572414761     column=f:uuid, timestamp=1549092009531, value=19aad8d3-621a-473c-8f9f
    B-1000-1550572388491     column=f:age, timestamp=1549091983276, value=1
    B-1000-1550572388491     column=f:uuid, timestamp=1549091983276, value=cf720efe-2ad2-48d6-81b8
    B-1000-1550572392922     column=f:age, timestamp=1549091987701, value=7
    B-1000-1550572392922     column=f:uuid, timestamp=1549091987701, value=8a047118-e130-48cb-adfe
     
    hbase(main):020:0> scan 'iteblog', {STARTROW => 'A-1000', ENDROW => 'A-1001'}
    ROW                         COLUMN+CELL
     A-1000-1550572395399       column=f:age, timestamp=1549091990253, value=54
     A-1000-1550572395399       column=f:uuid, timestamp=1549091990253, value=e9b10a9f-1218-43fd-bd01
     A-1000-1550572413799       column=f:age, timestamp=1549092008575, value=4
     A-1000-1550572413799       column=f:uuid, timestamp=1549092008575, value=181aa91e-5f1d-454c-959c
     A-1000-1550572414761       column=f:age, timestamp=1549092009531, value=33
     A-1000-1550572414761       column=f:uuid, timestamp=1549092009531, value=19aad8d3-621a-473c-8f9f
    3 row(s)
    Took 0.0569 seconds

    As you can see, the result matches the output of the HBase shell, and we have also retrieved the UID = 1000 data from every salt prefix. With that, querying a salted HBase table with a coprocessor is complete; in the next article I will show how to query a salted table with Spark.
