Hbase数据备份&&容灾方案
标签(空格分隔): Hbase
一、Distcp
在使用distcp命令copy hdfs文件的方式实现备份时,需要禁用备份表确保copy时该表没有数据写入,对于在线服务的hbase集群,该方式不可用,将静态此目录distcp 到其他HDFS文件系统时候,可以通过在其他集群直接启动新Hbase 集群将所有数据恢复。
二、CopyTable
执行命令前,需在对端集群先创建表
支持时间区间、row区间,改变表名称,改变列簇名称,指定是否copy删除数据等功能,例如:
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr= dstClusterZK:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
1、同一个集群不同表名称
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy srcTable
2、跨集群copy表
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase srcTable
跨集群copytable 必须注意是用推的方式,即从原集群运行此命令。
copytable eg
$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
/bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
Options:
rs.class hbase.regionserver.class of the peer cluster,
specify if different from current cluster
rs.impl hbase.regionserver.impl of the peer cluster,
startrow the start row
stoprow the stop row
starttime beginning of the time range (unixtime in millis)
without endtime means from starttime to forever
endtime end of the time range. Ignored if no starttime specified.
versions number of cell versions to copy
new.name new table's name
peer.adr Address of the peer cluster given in the format
hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
families comma-separated list of families to copy
To copy from cf1 to cf2, give sourceCfName:destCfName.
To keep the same name, just give "cfName"
all.cells also copy delete markers and deleted cells
Args:
tablename Name of the table to copy
Examples:
To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
For performance consider the following general options:
It is recommended that you set the following to >=100. A higher value uses more memory but
decreases the round trip time to the server and may increase performance.
-Dhbase.client.scanner.caching=100
The following should always be set to false, to prevent writing data twice, which may produce
inaccurate results.
-Dmapred.map.tasks.speculative.execution=false
一些示例
hbase org.apache.hadoop.hbase.mapreduce.CopyTable –starttime=1478448000000 –endtime=1478591994506 –peer.adr=VECS00001,VECS00002,VECS00003:2181:/hbase –families=txjl –new.name=hy_membercontacts_bk hy_membercontacts
#根据时间范围备份
hbase org.apache.hadoop.hbase.mapreduce.CopyTable –starttime=1478448000000 –endtime=1478591994506 –new.name=hy_membercontacts_bk hy_membercontacts
hbase org.apache.hadoop.hbase.mapreduce.CopyTable –starttime=1477929600000 –endtime=1478591994506 –new.name=hy_linkman_tmp hy_linkman
#备份全表
hbase org.apache.hadoop.hbase.mapreduce.CopyTable –new.name=hy_mobileblacklist_bk_before_del hy_mobileblacklist
#拓展根据时间范围查询
scan ‘hy_linkman’, {COLUMNS => ‘lxr:sguid’, TIMERANGE => [1478966400000, 1479052799000]}
scan ‘hy_mobileblacklist’, {COLUMNS => ‘mobhmd:sguid’, TIMERANGE => [1468719824000, 1468809824000]}
hbase org.apache.hadoop.hbase.mapreduce.CopyTable –new.name=hy_mobileblacklist_bk_before_del_20161228 hy_mobileblacklist
三、Export/Import(使用mapreduce)
Export 执行导出命令
可使用-D命令自定义参数,此处限定表名、列族、开始结束RowKey、以及导出到HDFS的目录
hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.mapreduce.scan.column.family=cf -D hbase.mapreduce.scan.row.start=0000001 -D hbase.mapreduce.scan.row.stop=1000000 table_name /tmp/hbase_export
可选的-D参数配置项
Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]
Note: -D properties will be applied to the conf used.
For example:
-D mapred.output.compress=true
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
-D mapred.output.compression.type=BLOCK
Additionally, the following SCAN properties can be specified
to control/limit what is exported..
-D hbase.mapreduce.scan.column.family=<familyName>
-D hbase.mapreduce.include.deleted.rows=true
For performance consider the following properties:
-Dhbase.client.scanner.caching=100
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
For tables with very wide rows consider setting the batch size as below:
-Dhbase.export.scanner.batch=10
Import 执行导入命令
必须在导入前存在表
create 'table_name','cf'
运行导入命令
hbase org.apache.hadoop.hbase.mapreduce.Import table_name hdfs://flashhadoop/tmp/hbase_export/
可选的-D参数配置项
Usage: Import [options] <tablename> <inputdir>
By default Import will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimport.bulk.output=/path/for/output
To apply a generic org.apache.hadoop.hbase.filter.Filter to the input, use
-Dimport.filter.class=<name of filter class>
-Dimport.filter.args=<comma separated list of args for filter
NOTE: The filter will be applied BEFORE doing key renames via the HBASE_IMPORTER_RENAME_CFS property. Futher, filters will only use the Filter#filterRowKey(byte[] buffer, int offset, int length) method to identify whether the current row needs to be ignored completely for processing and Filter#filterKeyValue(KeyValue) method to determine if the KeyValue should be added; Filter.ReturnCode#INCLUDE and #INCLUDE_AND_NEXT_COL will be considered as including the KeyValue.
For performance consider the following options:
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
-Dimport.wal.durability=<Used while writing data to hbase. Allowed values are the supported durability values like SKIP_WAL/ASYNC_WAL/SYNC_WAL/...>
四、Snapshot
即为Hbase 表的镜像。
需要提前开启Hbase 集群的snapshot 功能。
<property>
<name>hbase.snapshot.enabled</name>
<value>true</value>
</property>
在hbase shell中使用clone_snapshot, delete_snapshot, list_snapshots, restore_snapshot, snapshot命令可是是想创建快照,查看快照,通过快照恢复表,通过快照创建一个新的表等功能,
在创建snapshot后,可以通过ExportSnapshot工具把快照导出到另外一个集群,实现数据备份或者数据迁移,ExportSnapshot工具的用法如下:(必须为推送的方式,即从现集群到目的集群)
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot table_name_snapshot -copy-to hdfs://flashhadoop_2/hbase -mappers 2
执行该命令后,在flashhadoop_2的hdfs中会把table_name_snapshot文件夹copy到/hbase/.hbase-snapshot文件下,进入flashhadoop_2这个hbase集群,执行list_snapshots会看到有一个快照:table_name_snapshot,通过命令clone_snapshot可以把该快照copy成一个新的表,不用提前创建表,新表的region个数等信息完全与快照保持一致。也可以先创建一张与原表相同的表,然后通过restore snapshot的方式恢复表,但会多出一个region.这个region 将会失效。
在使用snapshot把一个集群的数据copy到新集群后,应用程序开启双写,然后可以使用Export工具把快照与双写之间的数据导入到新集群,从而实现数据迁移,为保障数据不丢失,Export导出时指定的时间范围可以适当放宽。
五、Replication
可以通过replication机制实现hbase集群的主从模式,或者可以说主主模式,也就是两边都做双向同步,具体步骤如下:
1、 如果主从hbase集群共用一个zk集群,则zookeeper.znode.parent不能都是默认的hbase,可以配置为hbase-master和hbase-slave,总之在zk 中的znode节点命名不能冲突。
2,在主,从hbase集群的hbase-site.xml中添加配置项:(其实做主从模式的话,只需要将从集群hbase.replication设置为true 即可,其他可以忽略。)
<property>
<name>hbase.replication</name>
<value>true</value>
</property>
<property>
<name>replication.source.nb.capacity</name>
<value>25000</value>
<description>主集群每次向从集群发送的entry最大的个数,默认值25000,可根据集群规模做出适当调整</description>
</property>
<property>
<name>replication.source.size.capacity</name>
<value>67108864</value>
<description>主集群每次向从集群发送的entry的包的最大值大小,默认为64M</description>
</property>
<property>
<name>replication.source.ratio</name>
<value>1</value>
<description>主集群使用的从集群的RS的数据百分比,默认为0.1,1.X.X版本默认0.15,需调整为1,充分利用从集群的RS</description>
</property>
<property>
<name>replication.sleep.before.failover</name>
<value>2000</value>
<description>主集群在RS宕机多长时间后进行failover,默认为2秒,具体的sleep时间是: sleepBeforeFailover + (long) (new Random().nextFloat() * sleepBeforeFailover) </description>
</property>
<property>
<name>replication.executor.workers</name>
<value>1</value>
<description>从事replication的线程数,默认为1,如果写入量大,可以适当调大</description>
</property>
3,重启主从集群,新集群搭建请忽略重启,直接启动即可。
4,分别在主从集群hbase shell中
add_peer 'ID' 'CLUSTER_KEY'
The ID must be a short integer. To compose the CLUSTER_KEY, use the following template:
hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent
This will show you the help to setup the replication stream between both clusters. If both clusters use the same Zookeeper cluster, you have to use a different zookeeper.znode.parent since they can't write in the same folder.
1,
增加主Hbase 到容灾 Hbase 数据表 同步
add_peer '1', "VECS00840,VECS00841,VECS00842,VECS00843,VECS00844:2181:/hbase"
2,
增加容灾Hbase 到主 Hbase 数据表 同步
add_peer '2', "VECS00994,VECS00995,VECS00996,VECS00997,VECS00998:2181:/hbase"
3,然后在主,备集群建表结构,属性完全相同的表。(注意,是完全相同)
主从集群都建立。
hbase shell>
create 't_warehouse_track', {NAME => 'cf', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
4,在主集群hbase shell
enable_table_replication 't_warehouse_track'
5,在容灾集群hbase shell
disable 'your_table'
alter 'your_table', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
enable 'your_table
此处的REPLICATION_SCOPE => '1'中的1,与第3步中设置到“ID”无关系,这个值只有0或者1,标示开启复制或者关闭。