同事在给客户做数据库巡检的过程中,发现其中一个数据库的alert日志中报了一个坏块的错误信息,具体如下:
Reading datafile '+DATA_DW/xtdw/datafile/sysaux.295.819217697' for corruption at rdba: 0x0081140e (file 2, block 70670) Read datafile mirror 'DATA_DW_CD_05_DWCEL02' (file 2, block 70670) found same corrupt data (no logical check) Read datafile mirror 'DATA_DW_CD_09_DWCEL01' (file 2, block 70670) found same corrupt data (no logical check) Errors in file /u01/app/oracle/diag/rdbms/xtdw/xtdw1/trace/xtdw1_smon_12209.trc (incident=128756): ORA-01578: ORACLE 数据块损坏 (文件号 2, 块号 70670) ORA-01110: 数据文件 2: '+DATA_DW/xtdw/datafile/sysaux.295.819217697' Errors in file /u01/app/oracle/diag/rdbms/xtdw/xtdw1/trace/xtdw1_smon_12209.trc: ORA-00604: 递归 SQL 级别 1 出现错误 ORA-01578: ORACLE 数据块损坏 (文件号 2, 块号 70670) ORA-01110: 数据文件 2: '+DATA_DW/xtdw/datafile/sysaux.295.819217697' Fri Jan 26 17:10:41 2018 Sweep [inc][128756]: completed Corrupt Block Found TSN = 1, TSNAME = SYSAUX RFN = 2, BLK = 70670, RDBA = 8459278 OBJN = 270, OBJD = 268, OBJECT = SMON_SCN_TO_TIME_AUX, SUBOBJECT = SEGMENT OWNER = SYS, SEGMENT TYPE = Cluster Segment |
从错误日志可以看出是SMON_SCN_TO_TIME_AUX表出现了坏块。为了谨慎起见,还是在测试环境中模拟了整个故障,并做了恢复测试。
1、模拟故障
[oracle@ggdb02 ~]$ bbed parfile=bbed.par Password:
BBED: Release 2.0.0.0.0 - Limited Production on Sat Feb 24 13:22:38 2018
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
************* !!! For Oracle Internal Use only !!! ***************
BBED> set dba 2,411 DBA 0x0080019b (8389019 2,411)
BBED> corrupt Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y Block marked media corrupt.
BBED> sum Check value for File 2, Block 411: current = 0xda6e, required = 0xda6e
BBED> exit |
使用BBED工具故意损坏(2,411)数据块。
SQL> startup force ORACLE instance started.
Total System Global Area 835104768 bytes Fixed Size 2257840 bytes Variable Size 322964560 bytes Database Buffers 507510784 bytes Redo Buffers 2371584 bytes Database mounted. Database opened.
SQL> select * from SMON_SCN_TIME; select * from SMON_SCN_TIME * ERROR at line 1: ORA-01578: ORACLE data block corrupted (file # 2, block # 411) ORA-01110: data file 2: '/u01/app/oracle/oradata/ggdb/sysaux01.dbf'
SQL>select tablespace_name, segment_type, owner, segment_name from dba_extents where file_id = 2 and 411 between block_id and block_id + blocks - 1;
TABLESPACE_NAME SEGMENT_TYPE OWNER SEGMENT_NAME --------------- --------------- ---------- ------------------------------ SYSAUX CLUSTER SYS SMON_SCN_TO_TIME_AUX
SQL> |
可见,故障已经重现,SMON_SCN_TO_TIME_AUX表已经无法访问。
2、故障处理方案
(1).重启数据库,并设置12500 event SHUTDOWN IMMEDIATE STARTUP RESTRICT ALTER SYSTEM SET EVENTS '12500 TRACE NAME CONTEXT FOREVER, LEVEL 10';
(2).获取SMON_SCN_TIME 和SMON_SCN_TO_TIME_AUX的表结构和相关索引结构 SET LONG 1000 SET LINESIZE 500 SELECT DBMS_METADATA.GET_DDL('CLUSTER', 'SMON_SCN_TO_TIME_AUX') FROM DUAL; SELECT DBMS_METADATA.GET_DDL('TABLE', 'SMON_SCN_TIME') FROM DUAL; SELECT DBMS_METADATA.GET_DDL('INDEX', A.INDEX_NAME) FROM DBA_INDEXES A WHERE TABLE_NAME IN('SMON_SCN_TO_TIME_AUX', 'SMON_SCN_TIME');
(3).删除相关的表 DROP TABLE SMON_SCN_TIME; DROP CLUSTER SMON_SCN_TO_TIME_AUX;
(4).重新执行步骤2中获取的表结构语句,重新创建SMON_SCN_TO_TIME_AUX和SMON_SCN_TIME表(注意:要先创建SMON_SCN_TO_TIME_AUX表)
(5).禁用12500 event并重启数据库 ALTER SYSTEM SET EVENTS '12500 TRACE NAME CONTEXT OFF'; SHUTDOWN IMMEDIATE STARTUP |
3、处理过程中遇到的问题
SQL> DROP TABLE SMON_SCN_TIME; DROP TABLE SMON_SCN_TIME * ERROR at line 1: ORA-01578: ORACLE data block corrupted (file # 2, block # 411) ORA-01110: data file 2: '/u01/app/oracle/oradata/ggdb/sysaux01.dbf'
SQL> DROP CLUSTER SMON_SCN_TO_TIME_AUX; DROP CLUSTER SMON_SCN_TO_TIME_AUX * ERROR at line 1: ORA-00951: cluster not empty |
在执行方案的第3步时,无法成功删除SMON_SCN_TIME表,此时,尝试使用truncate cluster smon_scn_to_time_aux的方式来修复该故障,但仍然报错:
SQL> truncate TABLE SMON_SCN_TIME; truncate TABLE SMON_SCN_TIME * ERROR at line 1: ORA-03292: Table to be truncated is part of a cluster
SQL> truncate CLUSTER SMON_SCN_TO_TIME_AUX; truncate CLUSTER SMON_SCN_TO_TIME_AUX * ERROR at line 1: ORA-01578: ORACLE data block corrupted (file # 2, block # 411) ORA-01110: data file 2: '/u01/app/oracle/oradata/ggdb/sysaux01.dbf' |
最终,只能再次借用了删除数据的非常规方法,直接修改了基表信息,成功删除了SMON_SCN_TIME表,删除了SMON_SCN_TIME表之外,可正常删除SMON_SCN_TO_TIME_AUX。
SQL> select object_id, data_object_id , object_name from dba_objects where object_name='SMON_SCN_TIME';
OBJECT_ID DATA_OBJECT_ID OBJECT_NAME ---------- -------------- ----------------------------------- 87932 87895 SMON_SCN_TIME
SQL> delete from tab$ where obj#=87932; 1 row deleted.
SQL> commit; Commit complete.
SQL> SQL> update obj$ set type#=10 where obj#=87932; 1 row updated.
SQL> commit; Commit complete.
SQL> alter system flush buffer_cache; System altered.
SQL> alter system flush shared_pool; System altered.
SQL> DROP CLUSTER SMON_SCN_TO_TIME_AUX; Cluster dropped.
SQL> select object_id, data_object_id , object_name from dba_objects where object_name like '%SMON%'; no rows selected
SQL> |
重建完相关表和索引后,坏块故障成功解决。