版本:0.94-cdh4.2.1
1. Split Region
[regionserver60020-splits-1397585864985] INFO org.apache.hadoop.hbase.regionserver.SplitRequest - Region split, META updated, and report to master. Parent=test_table,|u:21419463|1389789100286|9749216794103914,1391785640492.893d6a798767b8e819c1b205e291e89b., new regions: test_table,|u:21419463|1389789100286|9749216794103914,1397585864985.ad6da8441218fd62b2d4a2e01e261d00., test_table,|u:21515002|1391907324553|6451290551646972,1397585864985.49a3be1557ee19fdd9d251b5987b88ce.. Split took 1sec
分割Region的过程是非常快的,具体过程参见《HBase split策略》,日志(在父Region所在的regionserver上)中显示出父Region信息和两个子Region信息以及花费时间。
2. 手动major_compact
2014-04-15 14:02:44,858 414746642 [PRI IPC Server handler 4 on 60020] DEBUG - Small Compaction requested: regionName=push_debug_pro,,1386503868556.dd3d0e1d0c792d88694ef3f9b02000dc., storeName=body, fileCount=6, fileSize=1.3g (1.1g, 132.1m, 18.4m, 1.9m, 1.8m, 1.8m), priority=1, time=26005188608728440; Because: User-triggered major compaction; compaction_queue=(0:0), split_queue=0 2014-04-15 14:02:44,859 414746643 [PRI IPC Server handler 4 on 60020] DEBUG - Small Compaction requested: regionName=push_debug_pro,,1386503868556.dd3d0e1d0c792d88694ef3f9b02000dc., storeName=meta, fileCount=9, fileSize=1.7g (1.5g, 174.5m, 46.6m, 2.5m, 2.4m, 2.4m, 2.4m, 2.4m, 2.4m), priority=1, time=26005188609180116; Because: User-triggered major compaction; compaction_queue=(0:0), split_queue=0 2014-04-15 14:02:44,870 414746654 [PRI IPC Server handler 1 on 60020] DEBUG - Small Compaction requested: regionName=push_debug_pro,|1bf1830e-f40e-4179-9b97-52e10a3a01f2|1387531107521|4,1389537372468.c50b3b070f626c2edf83531df2a67e96., storeName=body, fileCount=8, fileSize=648.6m (572.6m, 64.9m, 1.9m, 1.8m, 1.8m, 1.8m, 1.9m, 1.8m), priority=1, time=26005188621011974; Because: User-triggered major compaction; compaction_queue=(0:0), split_queue=0 2014-04-15 14:02:44,871 414746655 [PRI IPC Server handler 1 on 60020] DEBUG - Small Compaction requested: regionName=push_debug_pro,|1bf1830e-f40e-4179-9b97-52e10a3a01f2|1387531107521|4,1389537372468.c50b3b070f626c2edf83531df2a67e96., storeName=meta, fileCount=8, fileSize=849.8m (748.8m, 64.4m, 24.5m, 2.4m, 2.4m, 2.4m, 2.4m, 2.4m), priority=1, time=26005188621309047; Because: User-triggered major compaction; compaction_queue=(0:0), split_queue=0 2014-04-15 14:02:44,870 [regionserver60020-smallCompactions-1397127058124] INFO org.apache.hadoop.hbase.regionserver.HRegion - Starting compaction o n body in region push_debug_pro,|1bf1830e-f40e-4179-9b97-52e10a3a01f2|1387531107521|4,1389537372468.c50b3b070f626c2edf83531df2a67e96. ……. 2014-04-15 14:04:32,737 [regionserver60020-smallCompactions-1397127044870] INFO org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest - completed compaction: regionName=push_debug_pro,|1bf1830e-f40e-4179-9b97-52e10a3a01f2|1387531107521|4,1389537372468.c50b3b070f626c2edf83531df2a67e96 ., storeName=meta, fileCount=8, fileSize=849.8m, priority=1, time=26005188621309047; duration=1mins, 47sec
手动触发后会发出Compaction Request,每处理一个CR,均会有start日志和finish日志,可以看到做compaction所花得时间。
3. load balance
经检测没有达到balance的条件
2014-04-16 10:09:54,747 [inspur249.xxx.xxxx.org,60000,1390302283923-BalancerChore] INFO org.apache.hadoop.hbase.master.LoadBalancer - Skipping load balancing because balanced cluster; servers=10 regions=667 average=66.7 mostloaded=80 leastloaded=59 2014-04-16 10:10:05,221 [IPC Server handler 30 on 60000] INFO org.apache.hadoop.hbase.master.LoadBalancer - Skipping load balancing because balanced cluster; servers=10 regions=667 average=66.7 mostloaded=80 leastloaded=59
检测之后发现需要做balance
2014-04-16 10:23:36,549 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: Balance parameter: numRegions=114, numServers=2, max=57, min=57 2014-04-16 10:23:36,551 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 2ms. Moving 57 regions off of 1 overloaded servers onto 1 less loaded servers 2014-04-16 10:23:36,551 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=hxh_urgent_json,,1397604802595.b1ccc2a8299c11bd2a1abe50750444d3., src=db-41.xxx.xxxx.org,60020,1397528681867, dest=db-40.xxx.xxxx.org,60020,1397614880721 2014-04-16 10:23:36,551 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region hxh_urgent_json,,1397604802595.b1ccc2a8299c11bd2a1abe50750444d3. (offlining) 2014-04-16 10:23:36,551 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x143b2d18db21a4e Creating unassigned node for b1ccc2a8299c11bd2a1abe50750444d3 in a CLOSING state 2014-04-16 10:23:36,563 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection to db-41.xxx.xxxx.org,60020,1397528681867 2014-04-16 10:23:36,565 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to db-41.xxx.xxxx.org,60020,1397528681867 for region hxh_urgent_json,,1397604802595.b1ccc2a8299c11bd2a1abe50750444d3. …….
不管是手动触发还是自动检测,会触发上面两种日志(在Master)
4. Memstore满了,进行Flush
每一个CF下面都会有一个Memstore缓存,当满了之后就会flush成一个StoreFile到HDFS上,具体日志如下:
2014-04-16 14:40:06,987 [regionserver60020.cacheFlusher] INFO org.apache.hadoop.hbase.regionserver.HRegion - Finished memstore flush of ~64.1m/67181208, currentsize=44.4k/45416 for region test_table,|u:16237342|s:91|c:16|1389706594286|16357828508728627,1393714562268.ed3801126d30507d12210201aaee747a. in 537ms, sequenceid=8067477096, compaction requested=false
5、responseTooSlow
2014-03-30 15:49:15,234 [IPC Server handler 49 on 60020] WARN org.apache.hadoop.ipc.HBaseServer - (responseTooSlow): {"processingtimems":132079,"cal l":"multi(org.apache.hadoop.hbase.client.MultiAction@ed15e44), rpc version=1, client version=29, methodsFingerPrint=1891768260","client":"172.17.5.56 :59669","starttimems":1396165623154,"queuetimems":5633,"class":"HRegionServer","responsesize":0,"method":"multi”}
当某一次RPC请求的processingtimems>10s则会打印responseTooSlow日志,出现这种情况原因有很多:磁盘 io 繁忙集群之间节点通信时间过长、GC 导致的系统系统资源竞争或者交换机网络也会有影响。
6、client端写入暴增
当客户端写入暴增超过hbase.hregion.memstore.block.multiplier*hbase.hregion.memstore.flush.size则会block,打印如下日志
2014-03-30 15:49:15,227 [IPC Server handler 28 on 60020] INFO org.apache.hadoop.hbase.regionserver.HRegion - Blocking updates for 'IPC Server handle r 28 on 60020' on region clp_test_3,,1385643984667.99d41aba22a47532adee1fe7bff41465.: memstore size 680.7m is >= than blocking 512.0m size
处理办法是控制客户端的读写,slipt region均衡到其他节点上