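DB2 (WZQZ) Incident Analysis Report (2019/7/26)
1. Affected Device/System
DB2 database: WZQZ
AIX OS Level: 5.3
2. Symptoms
At around 22:30:00 on 2019-07-26, the customer reported that the DB2 database had crashed unexpectedly and then recovered automatically.
3. Failure Analysis
At 22:50 on 2019-07-26, the collected DB2 support data was analyzed remotely. It showed that a db2 reorg operation failed at 2019-07-26 21:08:32 because of a file system error. The log entries are as follows: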
2019-07-26-21.08.31.798330+480 E33340748A531 LEVEL: Warning (OS)
PID : 6394024 TID : 1 PROC : db2agent (WZQZ) 0
INSTANCE: wzqzinst NODE : 000 DB : WZQZ
APPHDL : 0-977 APPID: *LOCAL.wzqzinst.190726130831
FUNCTION: DB2 UDB, oper system services, sqloopenp, probe:80
CALLED : OS, -, unspecified_system_function
OSERR : ECORRUPT (89) "Invalid file system control data detected."
DATA #1 : File name, 51 bytes
/wzqz/dbdir/wzqzinst/NODE0000/SQL00001/0007003a.ROR
2019-07-26-21.08.32.679580+480 I33341280A397 LEVEL: Warning
PID : 6394024 TID : 1 PROC : db2agent (WZQZ) 0
INSTANCE: wzqzinst NODE : 000 DB : WZQZ
APPHDL : 0-977 APPID: *LOCAL.wzqzinst.190726130831
MESSAGE : Reorg table failed.
DATA #1 : String, 70 bytes
Table(7:58)=WZQZ .T_TELE_BAT_DETAIL, Flags=x01114091, IID=0, Temp=0
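The reorg failure caused the database to be marked bad (DB Marked Bad), which forcibly terminated all connections. The database then had to perform crash recovery. The log entries are as follows: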
2019-07-26-21.08.32.823670+480 E33386126A373 LEVEL: Severe
PID : 6394024 TID : 1 PROC : db2agent (WZQZ) 0
INSTANCE: wzqzinst NODE : 000 DB : WZQZ
APPHDL : 0-977 APPID: *LOCAL.wzqzinst.190726130831
FUNCTION: DB2 UDB, base sys utilities, sqleMarkDBad, probe:10
MESSAGE : ADM7518C "WZQZ " marked bad.
2019-07-26-21.08.32.823931+480 I33386500A386 LEVEL: Severe
PID : 6394024 TID : 1 PROC : db2agent (WZQZ) 0
INSTANCE: wzqzinst NODE : 000 DB : WZQZ
APPHDL : 0-977 APPID: *LOCAL.wzqzinst.190726130831
FUNCTION: DB2 UDB, base sys utilities, sqleMarkDBad, probe:210
MESSAGE : Database logging stopped due to mark db bad.
2019-07-26-21.10.36.544189+480 I34320860A364 LEVEL: Warning
PID : 1388710 TID : 1 PROC : db2agent (WZQZ) 0
INSTANCE: wzqzinst NODE : 000 DB : WZQZ
APPHDL : 0-773 APPID: *LOCAL.wzqzinst.190726131047
FUNCTION: DB2 UDB, base sys utilities, sqledint, probe:30
MESSAGE : Crash Recovery is needed.
2019-07-26-21.10.37.851203+480 I34321225A427 LEVEL: Warning
PID : 1388710 TID : 1 PROC : db2agent (WZQZ) 0
INSTANCE: wzqzinst NODE : 000 DB : WZQZ
APPHDL : 0-773 APPID: *LOCAL.wzqzinst.190726131047
FUNCTION: DB2 UDB, recovery manager, sqlpresr, probe:410
MESSAGE : Crash recovery started. LowtranLSN 000004D1478DBA0B MinbuffLSN
000004D145171C11
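The time between the "marked bad" entry and the "Crash recovery started" entry can be computed directly from the db2diag.log timestamps (roughly two minutes here). The following is a minimal Python sketch, assuming the timestamp layout seen in the excerpts above: `YYYY-MM-DD-HH.MM.SS.ffffff` followed by a signed UTC offset in minutes. The function name `parse_db2diag_ts` is hypothetical and is not part of any DB2 tool.

```python
from datetime import datetime, timedelta, timezone


def parse_db2diag_ts(stamp):
    """Parse a db2diag.log timestamp such as '2019-07-26-21.08.32.823670+480'.

    Assumption: the trailing signed number is the UTC offset in minutes.
    """
    # The local time part is always 26 characters; the rest is the offset.
    local_part, offset_part = stamp[:26], stamp[26:]
    naive = datetime.strptime(local_part, "%Y-%m-%d-%H.%M.%S.%f")
    tz = timezone(timedelta(minutes=int(offset_part)))
    return naive.replace(tzinfo=tz)


# Timestamps copied from the sqleMarkDBad and sqlpresr entries above.
marked_bad = parse_db2diag_ts("2019-07-26-21.08.32.823670+480")
recovery_started = parse_db2diag_ts("2019-07-26-21.10.37.851203+480")

gap = recovery_started - marked_bad
print("Seconds from 'marked bad' to crash recovery start:", gap.total_seconds())
```

Keeping the offset in the parsed value makes it possible to compare these entries with timestamps from other sources that are recorded in UTC.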
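On 2019-07-27, the customer reported that the database had crashed again. The DB2 db2diag.log shows that a file system error again caused the page cleaner process (db2pclnr) to fail, which ultimately brought the database down: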
2019-07-27-09.59.00.075130+480 E34356335A447 LEVEL: Warning (OS)
PID : 5738586 TID : 1 PROC : db2pclnr 0
INSTANCE: wzqzinst NODE : 000
FUNCTION: DB2 UDB, oper system services, sqloopenp, probe:80
CALLED : OS, -, unspecified_system_function
OSERR : ECORRUPT (89) "Invalid file system control data detected"
DATA #1 : File name, 62 bytes
/wzqz/dbdir/wzqzinst/NODE0000/SQL00001/SQLT0001.0/SQL00002.TDA
A review of earlier database log entries shows that file system errors had been reported since July 15:
2019-07-15-15.32.14.187073+480 E30848714A538 LEVEL: Warning (OS)
PID : 6697026 TID : 1 PROC : db2agent (WZQZ) 0
INSTANCE: wzqzinst NODE : 000 DB : WZQZ
APPHDL : 0-902 APPID: *LOCAL.wzqzinst.190715073214
FUNCTION: DB2 UDB, oper system services, sqlomkdirp, probe:100
CALLED : OS, -, unspecified_system_function
OSERR : ECORRUPT (89) "Invalid file system control data detected."
DATA #1 : File name, 56 bytes
/wzqz/dbdir/wzqzinst/NODE0000/SQL00001/load/DB200003.PID
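Looking back through a long db2diag.log for the first ECORRUPT report is easier with a small filter. The following is a minimal Python sketch, assuming records begin with a timestamp line and that the affected path is printed on the line after the `File name` header, as in the excerpts in this report. The names `split_records` and `find_ecorrupt` are hypothetical, and the sample text is abbreviated from the entries quoted here.

```python
import re

# A record starts with a timestamp such as 2019-07-26-21.08.31.798330+480.
RECORD_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d+)\s"
)
PROC_FIELD = re.compile(r"PROC\s*:\s*(.+)$")


def split_records(log_text):
    """Group the non-blank lines of a db2diag.log into one list per record."""
    records, current = [], []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) and current:
            records.append(current)
            current = []
        current.append(line)
    if current:
        records.append(current)
    return records


def find_ecorrupt(log_text):
    """Return timestamp, process and file for each record reporting ECORRUPT."""
    hits = []
    for lines in split_records(log_text):
        if not any(l.startswith("OSERR") and "ECORRUPT" in l for l in lines):
            continue
        start = RECORD_START.match(lines[0])
        proc = file_name = None
        for i, l in enumerate(lines):
            m = PROC_FIELD.search(l)
            if m and proc is None:
                proc = m.group(1)
            # The path is printed on the line after the "File name" header.
            if l.startswith("DATA") and "File name" in l and i + 1 < len(lines):
                file_name = lines[i + 1]
        hits.append({
            "timestamp": start.group(1) if start else None,
            "proc": proc,
            "file": file_name,
        })
    return hits


SAMPLE = """\
2019-07-15-15.32.14.187073+480 E30848714A538 LEVEL: Warning (OS)
PID : 6697026 TID : 1 PROC : db2agent (WZQZ) 0
OSERR : ECORRUPT (89) "Invalid file system control data detected."
DATA #1 : File name, 56 bytes
/wzqz/dbdir/wzqzinst/NODE0000/SQL00001/load/DB200003.PID

2019-07-26-21.08.31.798330+480 E33340748A531 LEVEL: Warning (OS)
PID : 6394024 TID : 1 PROC : db2agent (WZQZ) 0
OSERR : ECORRUPT (89) "Invalid file system control data detected."
DATA #1 : File name, 51 bytes
/wzqz/dbdir/wzqzinst/NODE0000/SQL00001/0007003a.ROR

2019-07-26-21.08.32.679580+480 I33341280A397 LEVEL: Warning
PID : 6394024 TID : 1 PROC : db2agent (WZQZ) 0
MESSAGE : Reorg table failed.
"""

for hit in find_ecorrupt(SAMPLE):
    print(hit["timestamp"], hit["proc"], hit["file"])
```

Because db2diag.log is written in chronological order, the first hit returned is the earliest occurrence of the file system error.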
The OS errpt log confirms that the file system was already reporting errors on July 15:
LABEL: J2_IMAP_CORRUPT
IDENTIFIER: 61277850
Date/Time: Mon Jul 15 15:32:14 BEIST 2019
Sequence Number: 2943
Machine Id: 00F8C8014C00
Node Id: wzqzb
Class: U
Type: UNKN
Resource Name: SYSJ2
Resource Class: NONE
Resource Type: NONE
Location:
Description
FILE SYSTEM CORRUPTION
Probable Causes
INVALID FILE SYSTEM CONTROL DATA
Recommended Actions
PERFORM FULL FILE SYSTEM RECOVERY USING FSCK UTILITY
OBTAIN DUMP
CHECK ERROR LOG FOR ADDITIONAL RELATED ENTRIES
IF PROBLEM PERSISTS, CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
FILE NAME
j2_imap.c
LINE NUMBER
2007
JFS2 MAJOR/MINOR DEVICE NUMBER
0024 0005
JFS2 ERROR LOG FLAG
0008 0010
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/lv_wzqz_dbdir, /wzqz/dbdir
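The errpt entry above identifies both the time of the corruption and the affected mount point. The following is a minimal Python sketch for pulling those fields out of saved detailed errpt output (for example, text captured from `errpt -a`). It assumes each entry begins with a `LABEL:` line and that the device and mount point appear on the line after the `FILE SYSTEM DEVICE AND MOUNT POINT` header, as in the entry shown here. The function name `jfs2_events` is hypothetical.

```python
def jfs2_events(errpt_text, label="J2_IMAP_CORRUPT"):
    """Return time, device and mount point for errpt entries with the given label."""
    lines = [l.strip() for l in errpt_text.splitlines()]
    entries = []
    current = None
    for i, line in enumerate(lines):
        if line.startswith("LABEL:"):
            # A LABEL line opens a new entry.
            current = {"label": line.split(":", 1)[1].strip()}
            entries.append(current)
        elif current is None:
            continue
        elif line.startswith("Date/Time:"):
            current["when"] = line.split(":", 1)[1].strip()
        elif line == "FILE SYSTEM DEVICE AND MOUNT POINT" and i + 1 < len(lines):
            # The next line has the form "<device>, <mount point>".
            device, _, mount = lines[i + 1].partition(",")
            current["device"] = device.strip()
            current["mount_point"] = mount.strip()
    return [e for e in entries if e["label"] == label]


# Abbreviated from the errpt entry quoted in this report.
SAMPLE_ERRPT = """\
LABEL: J2_IMAP_CORRUPT
IDENTIFIER: 61277850
Date/Time: Mon Jul 15 15:32:14 BEIST 2019
Resource Name: SYSJ2
Description
FILE SYSTEM CORRUPTION
Detail Data
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/lv_wzqz_dbdir, /wzqz/dbdir
"""

for event in jfs2_events(SAMPLE_ERRPT):
    print(event["when"], event["device"], event["mount_point"])
```

Matching the extracted mount point against the paths in the db2diag.log ECORRUPT records ties the OS-level report to the database errors.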
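Based on the information above, the root cause is corruption in the operating system's file system. Database operations that used the affected file system failed, and when the failure hit a critical process or task, the database crashed.
4. Resolution
At around 21:30 on 2019/7/27, the WZQZ database was stopped. After a database backup completed, the file system was repaired: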
#fsck -y /wzqz/dbdir
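All file system inconsistencies were repaired successfully. After the database was restarted, the original errors no longer appeared and the database returned to normal operation.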