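Root cause analysis:
On the production HBase cluster, at around 1 a.m., one regionserver was found to have restarted (the regionserver is watched by a daemon that brings it back up).
1. Check the master log: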
2020-02-27 01:04:57,001 ERROR [RpcServer.FifoRWQ.default.read.handler=26,queue=10,port=16000] master.MasterRpcServices: Region server a3ster,16020,1582342923163 reported a fatal error:
ABORTING region server a3ser,16020,1582342923163: Replay of WAL required. Forcing server shutdown
Cause:
org.apache.hadoop.hbase.DroppedSnapshotException: region: T_BL,x0Ax00x00x00x00x00x00x00x00x00x00x00x00,1572576275632.069e4d877a4ff46f9964ac8bcddb09ef.
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2509)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2186)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2148)
    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2039)
    at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1965)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:505)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:475)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:263)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for ringBufferSequence=101793126, WAL system stuck?
    at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:174)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1406)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1400)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1512)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:126)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:75)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2486)
    ... 9 more
2020-02-27 01:04:57,032 ERROR [RpcServer.FifoRWQ.default.read.handler=29,queue=8,port=16000] master.MasterRpcServices: Region server a3ser,16020,1582342923163 reported a fatal error:
ABORTING region server a3serz,16020,1582342923163: Replay of WAL required. Forcing server shutdown
Cause:
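The master log shows that the flush failed because a WAL sync did not return within 300000 ms. To know where to look in the regionserver log, it helps to estimate when that sync started to block. The snippet below is a minimal sketch of that arithmetic, not part of HBase: the function name `estimate_stall_start` is hypothetical, the log line is shortened to the relevant parts, and it assumes the master logged the report almost immediately after the timeout fired, so the result is only approximate.

```python
import re
from datetime import datetime, timedelta

# Shortened copy of the master log entry quoted above (only the parts we parse).
LOG = (
    "2020-02-27 01:04:57,001 ERROR ... reported a fatal error: ... "
    "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: "
    "Failed to get sync result after 300000 ms for "
    "ringBufferSequence=101793126, WAL system stuck?"
)

def estimate_stall_start(line):
    """Return (report_time, timeout_ms, estimated_start) for a WAL sync timeout entry."""
    # Leading log4j-style timestamp: "yyyy-MM-dd HH:mm:ss,SSS"
    ts = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})", line)
    # Timeout value reported by the TimeoutIOException message
    timeout = re.search(r"Failed to get sync result after (\d+) ms", line)
    if ts is None or timeout is None:
        raise ValueError("line does not look like a WAL sync timeout report")
    reported = datetime.strptime(ts.group(1), "%Y-%m-%d %H:%M:%S")
    reported += timedelta(milliseconds=int(ts.group(2)))
    timeout_ms = int(timeout.group(1))
    # Assumption: the report was logged right after the timeout expired.
    return reported, timeout_ms, reported - timedelta(milliseconds=timeout_ms)

reported, timeout_ms, started = estimate_stall_start(LOG)
print("abort reported at:", reported)
print("sync timeout (s):", timeout_ms / 1000)
print("sync blocked since (approx.):", started)
```

For the entry above this points to roughly 00:59:57, so regionserver and DataNode log lines from shortly before 01:00 onward are the ones worth inspecting.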
2. Check the regionserver log:
2020-02-27 01:04:56,813 WARN [ResponseProcessor for block BP-1884348122-10.62.2.1-1545175191847:blk_1489206371_467735337] hdfs.DFSClient: Slow ReadProcessor read fields took 327586ms (threshold=30000ms); ack: seqno: 1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 965211 4: "