• 案例:归档自动清理脚本失效及连带影响


    前些天给同事准备一套模拟环境用于测试一个OGG问题:
    环境架构:Oracle 11.2.0.4 RAC + 单实例11.2.0.4 ADG(同时作为OGG源端,OGG版本19.1.0.0.4) + 单实例19.3多租户(其中1个PDB作为OGG目标端,OGG版本19.1.0.0.4)
    现象概述:发现OGG进程abended,原因是主库归档满,但是实际已配置归档自动清理脚本(归档空间使用大于90%时清理),进一步查看发现根源是归档清理失效,报错RMAN-08137,导致的影响有很多,首先主库无法进行测试数据写入,其次ADG备库产生延迟,然后OGG源端抽取进程因超时报错OGG-02149导致abended..

    1.故障现象:归档清理报错RMAN-08137

    自动归档清理失效,报错RMAN-08137,手工清理现象一样:

    RMAN> DELETE NOPROMPT ARCHIVELOG ALL COMPLETED BEFORE 'SYSDATE-1/24';
    
    RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
    archived log file name=+FRA/crmdb/archivelog/2020_07_08/thread_1_seq_422.426.1045198149 thread=1 sequence=422
    RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
    archived log file name=+FRA/crmdb/archivelog/2020_07_08/thread_1_seq_423.424.1045198157 thread=1 sequence=423
    ...
    

    使用oerr查看RMAN-08137的描述:

    [oracle@jystdrac1 logs]$ oerr rman 8137
    8137, 3, "WARNING: archived log not deleted, needed for standby or upstream capture process"
    // *Cause: An archived log that should have been deleted was not as it was
    //         required by upstream capture process or Data Guard.
    //         The next message identifies the archived log.
    // *Action: This is an informational message. The archived log can be
    //          deleted after it is no longer needed.  See the
    //          documentation for Data Guard to alter the set of active
    //          Data Guard destinations.  See the documentation for
    //          Streams to alter the set of active streams.
    

    查看rman的设置,未发现特殊设置:

    RMAN> show all;
    
    using target database control file instead of recovery catalog
    RMAN configuration parameters for database with db_unique_name CRMDB are:
    CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default
    CONFIGURE BACKUP OPTIMIZATION OFF; # default
    CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default
    CONFIGURE CONTROLFILE AUTOBACKUP OFF; # default
    CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F'; # default
    CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO BACKUPSET; # default
    CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
    CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
    CONFIGURE MAXSETSIZE TO UNLIMITED; # default
    CONFIGURE ENCRYPTION FOR DATABASE OFF; # default
    CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default
    CONFIGURE COMPRESSION ALGORITHM 'BASIC' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE ; # default
    CONFIGURE ARCHIVELOG DELETION POLICY TO NONE; # default
    CONFIGURE SNAPSHOT CONTROLFILE NAME TO '+DATA/crmdb/snapcf_crmdb.f';
    

    2.解决方案:设置"_deferred_log_dest_is_valid"参数

    进一步搜索查询,匹配到MOS:
    • RMAN-08137 on Primary Database although Archive Destination to Standby is deferred (Doc ID 1380368.1)

    给出的原因和解决方案引用如下:

    CAUSE
    If we defer an Archive Destination to a Standby Database, the Primary Database will still consider the Standby Database as existing but temporary unavailable eg. for Maintenance. This can happen if you stop Log Transport Services from the Data Guard Broker or manually defer the State for the Archive Destination.

    SOLUTION
    As long as the Archive Destination (log_archive_dest_n) is still set, we consider the Standby Database as still existing and preserve the ArchiveLogs on the Primary Database to perform Gap Resolution when the Archive Destination is valid again.
    There are Situations when this is not wanted, eg. the Standby Database was activated or removed but you still keep the Archive Destination because you want to rebuild the Standby Database later again. In this Case you can set the hidden Parameter "_deferred_log_dest_is_valid" to FALSE (default TRUE) which will consider deferred Archive Destinations as completely unavailable and will not preserve ArchiveLogs for those Destinations any more. It is a dynamic Parameter and can be set this Way:

    SQL> alter system set "_deferred_log_dest_is_valid" = FALSE scope=both;

    NOTE: This Parameter has been introduced with Oracle Database 11.2.0.x. In earlier Versions you have to unset the log_archive_dest_n-Parameter pointing to the remote Standby Database to make the Primary Database considering it as completely unavailable. There also exists a Patch on Top of 11.1.0.7 for some Platforms to include this Parameter in 11.1.0.7, too. This is Patch Number 8468117.

    上面的描述很清楚了,实际结合当前环境,发现确实是有其他log_archive_dest_n设置为defer,而这些我们实际不再用了,要么彻底清除,要么按照MOS设置隐藏参数,我们先查下这个隐藏参数的当前默认设置,发现是Ture:

    NAME                                DESCRIPTION                                                        VALUE
    ----------------------------------- ------------------------------------------------------------------ ------------------------------
    _deferred_log_dest_is_valid         consider deferred log dest as valid for log deletion (TRUE/FALSE)  TRUE
    

    这是一个动态的隐藏参数,可以直接修改为FALSE:

    alter system set "_deferred_log_dest_is_valid" = FALSE;
    
    --再次查询,已经修改成功:
    NAME                                DESCRIPTION                                                        VALUE
    ----------------------------------- ------------------------------------------------------------------ ------------------------------
    _deferred_log_dest_is_valid         consider deferred log dest as valid for log deletion (TRUE/FALSE)  FALSE
    

    3.恢复故障:归档清理恢复正常,ADG同步正常,OGG进程启动正常

    3.1 归档清理恢复正常
    上面设置隐藏参数之后,就可以正常删除归档了:

    RMAN> DELETE NOPROMPT ARCHIVELOG ALL COMPLETED BEFORE 'SYSDATE-1/24';
    ...
    archived log file name=+FRA/crmdb/archivelog/2020_07_09/thread_2_seq_399.462.1045343023 RECID=1655 STAMP=1045343047
    deleted archived log
    archived log file name=+FRA/crmdb/archivelog/2020_07_09/thread_2_seq_400.396.1045343069 RECID=1657 STAMP=1045343099
    deleted archived log
    archived log file name=+FRA/crmdb/archivelog/2020_07_09/thread_2_seq_401.404.1045343111 RECID=1661 STAMP=1045343151
    Deleted 87 objects
    

    3.2 ADG同步恢复正常
    主库切下日志,再看ADG同步状态,最终恢复正常:

    SQL> @dg
    
    NAME                           VALUE                          UNIT                           TIME_COMPUTED                  DATUM_TIME
    ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
    transport lag                  +06 03:04:01                   day(2) to second(0) interval   07/16/2020 00:10:08            07/16/2020 00:10:08
    apply lag                      +06 03:04:01                   day(2) to second(0) interval   07/16/2020 00:10:08            07/16/2020 00:10:08
    apply finish time              +00 00:00:04.358               day(2) to second(3) interval   07/16/2020 00:10:08
    estimated startup time         65                             second                         07/16/2020 00:10:08
    
    SQL> /
    
    NAME                           VALUE                          UNIT                           TIME_COMPUTED                  DATUM_TIME
    ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
    transport lag                  +00 00:00:00                   day(2) to second(0) interval   07/16/2020 00:10:16            07/16/2020 00:10:15
    apply lag                      +00 00:00:00                   day(2) to second(0) interval   07/16/2020 00:10:16            07/16/2020 00:10:15
    apply finish time              +00 00:00:00.000               day(2) to second(3) interval   07/16/2020 00:10:16
    estimated startup time         65                             second                         07/16/2020 00:10:16
    

    3.3 OGG进程启动正常
    再将OGG的进程手工启动,恢复正常:

    GGSCI (test03) 1> info all
    
    Program     Status      Group       Lag at Chkpt  Time Since Chkpt
    
    MANAGER     RUNNING                                           
    EXTRACT     RUNNING     DPAUDDG     00:00:00      00:00:10    
    EXTRACT     ABENDED     EXTAUDDG    00:00:00      146:59:54 
    GGSCI (test03) 2> view report EXTAUDDG
    ...
    2020-07-09 21:11:37  ERROR   OGG-02149  Standby database has made no progress for more than 30,000 seconds.
    
    GGSCI (test03) 3> info all
    
    Program     Status      Group       Lag at Chkpt  Time Since Chkpt
    
    MANAGER     RUNNING                                           
    EXTRACT     RUNNING     DPAUDDG     00:00:00      00:00:02    
    EXTRACT     ABENDED     EXTAUDDG    00:00:00      147:01:07   
    
    
    GGSCI (test03) 4> start EXTAUDDG
    
    Sending START request to MANAGER ...
    EXTRACT EXTAUDDG starting
    
    
    GGSCI (test03) 5> info all
    
    Program     Status      Group       Lag at Chkpt  Time Since Chkpt
    
    MANAGER     RUNNING                                           
    EXTRACT     RUNNING     DPAUDDG     00:00:00      00:00:00    
    EXTRACT     RUNNING     EXTAUDDG    00:00:00      00:00:00    
    
    --target ogg ok!
    GGSCI (db19) 1> info all
    
    Program     Status      Group       Lag at Chkpt  Time Since Chkpt
    
    MANAGER     RUNNING                                           
    REPLICAT    RUNNING     REPAUD1A    00:00:00      00:00:05 
    

    至此,整个架构涉及到的所有环境均恢复正常。

  • 相关阅读:
    十个男人看了,九个成了富人
    win7下编译安装osgearth
    gdal源码编译安装
    win7下编译boost库总结
    everything && executor
    cursor:hand与cursor:pointer的区别介绍
    web程序记录当前在线人数
    汉字转拼音
    40多个非常有用的Oracle 查询语句
    asp.net 使用IHttpModule 做权限检查 登录超时检查(转)
  • 原文地址:https://www.cnblogs.com/jyzhao/p/13320760.html
Copyright © 2020-2023  润新知