• NDMCDB数据库hang住故障分析


    问题描写叙述:

    上午刚刚到办公室,就有监控人员邮件反馈,昨晚NDMCDB407数据库被重新启动过,让我分析一下数据库重新启动的原因。因为昨晚业务有版本号上线,所以短信警告关闭了,所以没有短信下发到我手机上,并且故障时相关人员也没有通知到我。

    1     检查alert日志

    从alert日志中,能够看到,先是在03:29时有一个job执行失败了:
    Fri Aug 22 03:29:29 2014
    Errors in file/opt/oracle/diag/rdbms/ndmcdb/NDMCDB/trace/NDMCDB_j000_28856.trc:
    ORA-12012: error on auto execute of job 31
    ORA-04023: ObjectNDMC.DELETE_ANONY_RSHARE_INFO could not be validated or authorized
    ORA-06512: at "NDMC.PROC_NDMC_CANCEL_OPEN",line 5
    ORA-06512: at line 1
    然后在03:49时,出现了连接超时失败,并且一直持续到05:00:08:
    Fri Aug 22 03:49:43 2014
    ***********************************************************************
     
    Fatal NI connect error 12170.
     
     VERSION INFORMATION:
           TNS for Linux: Version 11.1.0.7.0 - Production
           Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0 -Production
           TCP/IP NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
     Time: 22-AUG-2014 03:49:43
     Tracing not turned on.
      Tnserror struct:
       ns main err code: 12535
       
    TNS-12535: TNS:operation timed out
       ns secondary err code: 12606
       nt main err code: 0
       nt secondary err code: 0
       nt OS err code: 0
     Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.130.87)(PORT=36628))
    WARNING: inbound connection timed out(ORA-3136)
    Fri Aug 22 03:49:44 2014
    ……
    并且出现了连接数耗尽了:
    Fri Aug 22 03:49:50 2014
    ORA-00020: maximum number of processes 0exceeded
       ns secondary err code: 12560
       ns secondary err code: 12560
       ns main err code: 12537
    Fri Aug 22 03:49:50 2014
    ……
    Fri Aug 22 03:51:48 2014
     
    ***********************************************************************
     
    Fatal NI connect error 12537, connectingto:
     (LOCAL=NO)
     
     VERSION INFORMATION:
           TNS for Linux: Version 11.1.0.7.0 - Production
           Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0 -Production
           TCP/IP NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
     Time: 22-AUG-2014 03:51:48
     Tracing not turned on.
      Tnserror struct:
       ns main err code: 12537
       
    TNS-12537: TNS:connection closed
    ns secondaryerr code: 12560
       nt main err code: 0
       nt secondary err code: 0
       nt OS err code: 0
    ORA-609 : opiodr aborting process unknownospid (30476_47044991385184)
    Fri Aug 22 04:14:15 2014
    ORA-28 : opiodr aborting process unknownospid (24925_46986315964000)
    Fri Aug 22 04:16:27 2014
    ORA-28 : opiodr aborting process unknownospid (22475_47013891882592)
    Fri Aug 22 04:16:28 2014
    ORA-28 : opiodr aborting process unknownospid (21356_47116835528288)
    Fri Aug 22 04:16:29 2014
    ORA-28 : opiodr aborting process unknownospid (24947_47774766210656)
    ORA-28 : opiodr aborting process unknownospid (14958_47053435166304)
    ……
    Fri Aug 22 05:00:05 2014
    ORA-28 : opiodr aborting process unknownospid (25765_46941307182688)
    Fri Aug 22 05:00:08 2014
    ORA-28 : opiodr aborting process unknownospid (4949_47396524895840)
    于是在05:04数据库被关闭,从日志来看,这是正常关闭的,初步怀疑是人为关闭或是VCS双机自己主动将数据库关闭了:
    Fri Aug 22 05:04:10 2014
    Stopping background process SMCO
    Stopping background process FBDA
    Shutting down instance: further logonsdisabled
    Fri Aug 22 05:04:12 2014
    Stopping background process CJQ0
    Stopping background process QMNC
    Stopping background process MMNL
    Stopping background process MMON
    Shutting down instance (immediate)
    License high water mark = 1220
    Stopping Job queue slave processes, flags =7
    Fri Aug 22 05:04:20 2014
    Waiting for Job queue slaves to complete
    Job queue slave processes stopped
    Fri Aug 22 05:09:11 2014
    License high water mark = 1220
    USER (ospid: 25110): terminating theinstance
    Termination issued to instance processes.Waiting for the processes to exit
    Fri Aug 22 05:09:21 2014
    Instance termination failed to kill one ormore processes
    Instance terminated by USER, pid = 25110

    2     检查messages日志

    大概在05:03:51时,人为的想将双机切换到备机中:

    Aug 22 05:03:51 NDMCDB11 user_cmd:2014-08-22 05:03:51 hagrp -switch RCS_DB_SG -to system by root from [oraclepts/9 Aug 22 04:29 (192.168.128.142)]
    Aug 22 05:04:01 NDMCDB11/usr/sbin/cron[15348]: (root) CMD (su - root -c'/opt/watchdog/watchdog_schedule -n OS,oracle' >/dev/null 2>&1)
    Aug 22 05:04:01 NDMCDB11 su: (to root) rooton none
    Aug 22 05:04:03 NDMCDB11 su: (to oracle)root on none
    Aug 22 05:04:09 NDMCDB11 user_cmd:2014-08-22 05:04:09 hagrp -switch RCS_DB_SG -to NDMCDB12 by root from [oraclepts/9 Aug 22 04:29 (192.168.128.142)]
    Aug 22 05:04:09 NDMCDB11 su: (to oracle)root on none

    但双机切换失败,最后是直接将双机停止,重新启动VCS:

    Aug 22 05:06:18 NDMCDB11 user_cmd:2014-08-22 05:06:18 hastop -all by root from [oracle pts/9 Aug 22 04:29(192.168.128.142)]
    ……
    Aug 22 05:07:02 NDMCDB11 user_cmd:2014-08-22 05:07:02 hastat by root from [oracle pts/9 Aug 22 04:29(192.168.128.142)]

    所以,到这里就已经确定,数据库这所以重新启动了,全然是因为人为将VCS集群重新启动引起的。那么为什么要VCS群集重新启动呢?数据库究竟有没有问题呢?再来看看。

    最后,经向升级人员操作确认,在升级时,有一个存储过程须要跑,但执行后,数据库基本响应就很慢了,一直执行到3:29左右,人为cancel掉了,所以这也就是为什么会出现这种报错了:

    Fri Aug 22 03:29:29 2014
    Errors in file/opt/oracle/diag/rdbms/ndmcdb/NDMCDB/trace/NDMCDB_j000_28856.trc:
    ORA-12012: error on auto execute of job 31
    ORA-04023: ObjectNDMC.DELETE_ANONY_RSHARE_INFO could not be validated or authorized
    ORA-06512: at"NDMC.PROC_NDMC_CANCEL_OPEN", line 5
    ORA-06512: at line 1

    3     查看系统负载

    CPU负载:


    内存负载:


    可见,系统在3:49左右,出现了CPU及内存均被耗尽的情况,这个时间段,刚好数据库出现了大量连接超时失败,甚至是出现了连接数超过阀值:

    Fri Aug 22 03:49:50 2014
    ORA-00020: maximum number of processes 0exceeded
       ns secondary err code: 12560
       ns secondary err code: 12560
       ns main err code: 12537
    Fri Aug 22 03:49:50 2014

    4     分析AWR


    从这里看,数据库在2点到3点时,已经非常的繁忙,但从之前有系统负载来看,2点到3点时,CPU及内存使用率都不算非常高的。接着看:


    指标都没有什么特别高的。


    从top 5 event中,看到了有大量的cursor: pin S wait on X等待,可见出现mutex争用,但通常这仅仅是表象而已,并不是根因。


    绝大部分时间都在做SQL的解析,并且解析还失败了,这就是数据库hang住的根因。正常来说,一个数据库的绝大部分时间应该是用于SQL的运行,所以这个是占用最多时间的:sql execute elapsedtime等。


    不存在较高的versioncount。

    那么数据库什么时候出现的不停解析SQL,而且解析失败了呢?

    查了DBA_HIST_ACTIVE_SESS_HISTORY,分析了下历史会话信息,发如今02:57:00至03:00:00出现的问题:


    经过确认,恰巧就是运行存储过程的时间点左右。

    至此,数据库从3:00開始,已经是不正常的,数据库不停的在解析SQL,SQL都还没有到运行这一步,数据库已经处于无响应的状态,连接会话都被堵塞住了,直到连接数达到了最大连接数,最后被升级操作人员重新启动了VCS集群。

    5     分析结论

    (1)数据库down机主要还是人为进行了VCS切换失败后,进行了VCS重新启动操作引起。

    (2)这套数据库故障的根因,还是为什么数据库在2:58左右时出现解析SQL失败上。从眼下的日志分析来看,看不出是什么原因。


    -- Bosco
    ---- END ----

    -------------------------------------------------------------------------------------------------------

    版权全部,文章同意转载,但必须以链接方式注明源地址,否则追究法律责任!


  • 相关阅读:
    Python之操作MySQL数据库
    Python之进程与线程
    Python之socket网络编程
    网络基础之网络协议
    Python之log的处理方式
    11 python 操作mysql数据库
    10 线程 协程 socketserver 基于udp的socketserver
    9 异常处理 操作系统 进程线程 队列+生产消费者模型 进程同步 回调函数
    6 常用模块 (time,random,os,sys,shutil, json&pickle, shelve,xml,configparser,hashlib,suprocess,logging)
    ubuntu16上安装安装Docker图形化管理界面-Shipyard中文版
  • 原文地址:https://www.cnblogs.com/mfrbuaa/p/4488822.html
Copyright © 2020-2023  润新知