• Massive cache buffers chains latch waits on a customer database


    http://yangtingkun.itpub.net/post/468/526943

    The customer's production database ran into a severe performance problem this morning; here is a brief record of how the problem was diagnosed and resolved.

    As the AWR snapshot below shows, DB Time had climbed to an extremely high level, which means the system was under very serious performance pressure:

     

                 Snap Id  Snap Time         Sessions  Cursors/Session
    Begin Snap:  7170     31-3-12 08:00:50  395       10.5
    End Snap:    7172     31-3-12 09:00:27  689       11.7
    Elapsed:              59.61 (mins)
    DB Time:              17,270.93 (mins)
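
    A quick back-of-the-envelope check on these numbers: the average number of active sessions over the window is

        DB Time / Elapsed = 17,270.93 min / 59.61 min ≈ 290 average active sessions,

    which by itself confirms that the instance was massively overloaded during this hour.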

       

    In the Top 5 wait events, the most prominent event is latch: cache buffers chains, and the same can be seen from the current state of the system:

    SQL> SELECT EVENT, COUNT(*) FROM GV$SESSION GROUP BY EVENT HAVING COUNT(*) > 5 ORDER BY 2 DESC;
    
    EVENT COUNT(*)
    ---------------------------------------------------------------- ----------
    latch: cache buffers chains 1043
    SQL*Net message from client 500
    SQL*Net message to client 35
    rdbms ipc message 29
    gc cr request 25
    latch free 13
    gcs remote message 10
    gc buffer busy 7
    
    8 rows selected
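
    For reference, one classic way to see which segments sit behind this kind of latch contention (a sketch only, not something that was run during this incident; it needs SYS access because of X$BH) is to map the hottest cache buffers chains child latches to the buffers that hash onto them:

    -- Sketch only: segments behind the hottest "cache buffers chains"
    -- child latches, using the buffer touch count (TCH) as a heat measure.
    -- Run as SYS; the DBA_EXTENTS join can be slow on a large database.
    SELECT e.owner, e.segment_name, e.segment_type, b.tch
      FROM x$bh b,
           dba_extents e,
           (SELECT addr
              FROM (SELECT addr
                      FROM v$latch_children
                     WHERE name = 'cache buffers chains'
                     ORDER BY sleeps DESC)
             WHERE ROWNUM <= 10) hot
     WHERE b.hladdr = hot.addr
       AND e.file_id = b.file#
       AND b.dbablk BETWEEN e.block_id AND e.block_id + e.blocks - 1
     ORDER BY b.tch DESC;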

    Some SQL statements in the system had already been running for several hours:

     

    SQL ordered by Elapsed Time

    Elapsed Time (s)  CPU Time (s)  Executions  Elap per Exec (s)  % Total DB Time  SQL Id         SQL Module        SQL Text
    341,821           11,917        112         3051.97            32.99            8v9mhuxam279p  JDBC Thin Client  SELECT count(*) total FROM ( s...
    244,752           3,380         147         1664.98            23.62            44zutxspd664c  JDBC Thin Client  SELECT count(*) total FROM ( s...
    52,419            11,331        3           17472.91           5.06             a6aqkm30u7p90  JDBC Thin Client  select cf.cardid, (CASE WHEN c...
    38,767            532           11          3524.24            3.74             8b8kbrqmg7qf9  JDBC Thin Client  SELECT count(*) total FROM ( s...
    37,146            2,524         2           18573.15           3.58             8fpf0vtjzb583  JDBC Thin Client  SELECT count(*) total FROM ( s...
    30,796            2,331         5           6159.20            2.97             9wxzf70vub4wg  JDBC Thin Client  SELECT count(*) total FROM ( s...
    29,991            2,506         1           29991.41           2.89             5vdncfn06sxz8  JDBC Thin Client  SELECT count(*) total FROM ( s...
    24,762            875           3           8254.06            2.39             8vyda1jxu2nsc  JDBC Thin Client  SELECT count(*) total FROM (SE...
    17,845            699           2           8922.50            1.72             99g0x7u3jv28v  JDBC Thin Client  SELECT count(*) total FROM (SE...
    17,470            1,295         0                              1.69             08qmyqnjkpgru  JDBC Thin Client  select * from (select aa.*, ro...

    Based on these symptoms the problem could already be pinned down: a change in SQL execution plans was certainly behind the current performance issue. Problems such as bind variable peeking only affect individual statements; an execution-plan change on this scale almost certainly points to the optimizer statistics.
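
    One way to verify this directly (a sketch, assuming AWR history is available; it was not part of the original diagnosis) is to check whether the plan hash value of one of the top statements, for example a6aqkm30u7p90, changed between snapshots:

    -- Sketch: plan hash value and per-execution elapsed time by snapshot
    -- (ELAPSED_TIME_DELTA is in microseconds).
    SELECT snap_id,
           plan_hash_value,
           executions_delta,
           ROUND(elapsed_time_delta / NULLIF(executions_delta, 0) / 1e6, 1) AS sec_per_exec
      FROM dba_hist_sqlstat
     WHERE sql_id = 'a6aqkm30u7p90'
     ORDER BY snap_id;

    A plan hash value that flips right after the data load, together with a jump in seconds per execution, would confirm the hypothesis.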

     

    Asking the customer about recent operations revealed that a batch of data had been imported over a database link the previous night, and the problem showed up first thing this morning.

     

    At this point the cause was obvious. After the data was loaded over the database link, statistics were not re-gathered, and because the load ran in the middle of the night, the new data also missed the automatic statistics-gathering job that runs at 22:00 every day. As a result, Oracle kept generating execution plans from the statistics collected before the load, which produced wrong execution plans.
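
    The stale-statistics theory can also be cross-checked against the DML volume Oracle has tracked since the last analyze (again only a sketch, not something that was run here; DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO may be needed first so the view is current):

    -- Sketch: tables in the loaded schema with heavy tracked DML since
    -- their statistics were last gathered.
    SELECT m.table_name, t.last_analyzed,
           m.inserts, m.updates, m.deletes
      FROM dba_tab_modifications m, dba_tables t
     WHERE m.table_owner = t.owner
       AND m.table_name  = t.table_name
       AND m.table_owner = 'USER1'
     ORDER BY m.inserts + m.updates + m.deletes DESC;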

    First, fix the root cause by re-gathering statistics on the tables that received the loaded data:
    
    SQL> SELECT 'EXEC DBMS_STATS.GATHER_TABLE_STATS(''USER1'', ''' || TABLE_NAME || ''', CASCADE => TRUE)' FROM DBA_TABLES WHERE OWNER = 'USER1' AND LAST_ANALYZED < TRUNC(SYSDATE);
    
    'EXECDBMS_STATS.GATHER_TABLE_STATS(''USER1'','''||TABLE_NAME||''',CASCADE=>TRUE)'
    -----------------------------------------------------------------------------------------
    EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_1', CASCADE => TRUE)
    EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_2', CASCADE => TRUE)
    .
    .
    .
    EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_3', CASCADE => TRUE)
    EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_4', CASCADE => TRUE)
    
    12 rows selected.
    
    SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_1', CASCADE => TRUE)
    
    PL/SQL procedure successfully completed.
    
    SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_2', CASCADE => TRUE)
    
    PL/SQL procedure successfully completed.
    
    .
    .
    .
    
    SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_3', CASCADE => TRUE)
    
    PL/SQL procedure successfully completed.
    
    SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_4', CASCADE => TRUE)
    
    PL/SQL procedure successfully completed.
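
    An alternative to running the generated statements one by one (just a sketch of another route, not what was done here) is to let DBMS_STATS find and re-gather the stale objects itself:

    -- Sketch: gather statistics only for the stale tables in USER1.
    -- 'GATHER STALE' relies on table monitoring being enabled.
    BEGIN
      DBMS_STATS.GATHER_SCHEMA_STATS(
        ownname => 'USER1',
        options => 'GATHER STALE',
        cascade => TRUE);
    END;
    /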
    Although the root cause had been addressed, SQL statements that were already running would not change their execution plans, so the problem sessions still had to be located and terminated from the OS side:
    
    SQL> SELECT 'kill -9 ' || spid from v$session s, v$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains';
    
    'KILL-9'||SPID
    --------------------
    kill -9 28321
    kill -9 25384
    kill -9 23697
    kill -9 7239
    .
    .
    .
    kill -9 9331
    kill -9 13759
    
    740 rows selected.
    
    SQL> SELECT 'kill -9 ' || spid from gv$session s, gv$process p where s.username = 'USER2' and s.paddr = p.addr and event = 'latch: cache buffers chains' and s.inst_id = p.inst_id and s.inst_id = 2;
    
    'KILL-9'||SPID
    --------------------
    kill -9 23992
    kill -9 5289
    kill -9 21067
    kill -9 16816
    kill -9 16820
    kill -9 26767
    .
    .
    .
    kill -9 14981
    kill -9 26678
    kill -9 26682
    
    258 rows selected.
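
    As an aside, the same sessions could also be terminated from inside the database instead of with kill -9 at the OS level; a sketch of that variant, to be run on each instance in turn:

    -- Sketch: generate ALTER SYSTEM KILL SESSION commands for the
    -- local instance's sessions stuck on the latch.
    SELECT 'ALTER SYSTEM KILL SESSION ''' || sid || ',' || serial# || ''' IMMEDIATE;' AS kill_cmd
      FROM v$session
     WHERE username = 'USER2'
       AND event = 'latch: cache buffers chains';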
    
    These sessions with wrong execution plans were killed on both nodes, releasing the system resources they had been monopolizing.
    Because freshly gathered statistics do not take effect on already-cached execution plans immediately, a few SQL statements were still running with the wrong plans:
    
    SQL> select distinct inst_id, sql_id from gv$session where event = 'latch: cache buffers chains';
    
    INST_ID SQL_ID
    ---------- -------------
    1 39gvg7vbcm8jx
    1 a6aqkm30u7p90
    
    SQL> select address, hash_value from v$sqlarea where sql_id = 'a6aqkm30u7p90';
    
    ADDRESS HASH_VALUE
    ---------------- ----------
    C000000EB7ED3420 3248739616
    
    SQL> exec dbms_shared_pool.purge('C000000EB7ED3420,3248739616','C')
    
    PL/SQL procedure successfully completed.
    
    SQL> select address, hash_value from v$sqlarea where sql_id ='39gvg7vbcm8jx';
    
    ADDRESS HASH_VALUE
    ---------------- ----------
    C000001037B8E308 3603538493
    
    SQL> exec dbms_shared_pool.purge('C000001037B8E308,3603538493', 'C')
    
    PL/SQL procedure successfully completed.
    
    Because the database version is 10.2.0.5, dbms_shared_pool can conveniently be used to purge the SQL statements with wrong execution plans from the shared pool, forcing them to be re-parsed with new plans.
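
    When several cursors are affected, the address/hash_value lookup and the purge call can also be generated in one pass, in the same spirit as the generated commands above (again only a sketch):

    -- Sketch: build a purge call for every distinct SQL still waiting on
    -- the latch; RAWTOHEX renders the address in the hexadecimal form
    -- shown in V$SQLAREA.
    SELECT 'exec dbms_shared_pool.purge(''' || RAWTOHEX(a.address) || ',' ||
           a.hash_value || ''', ''C'')' AS purge_cmd
      FROM v$sqlarea a
     WHERE a.sql_id IN (SELECT sql_id
                          FROM gv$session
                         WHERE event = 'latch: cache buffers chains'
                           AND sql_id IS NOT NULL);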
    
    SQL> select event, count(*) from gv$session where username != user GROUP BY EVENT order by 2 desc;
    
    EVENT COUNT(*)
    ---------------------------------------------------------------- ----------
    SQL*Net message from client 370
    SQL*Net message to client 15
    gc cr request 10
    latch free 4
    Streams AQ: waiting for messages in the queue 1
    
    5 rows selected.

    Besides the large number of latch: cache buffers chains waits, the wrong execution plans also produced wait events such as gc cr request and latch free; analysis showed these were caused by the same wrong plans. After clearing those sessions in the same way, the system load returned to the normal range:

     

     

                 Snap Id  Snap Time         Sessions  Cursors/Session
    Begin Snap:  7188     31-3-12 14:55:17  257       20.3
    End Snap:    7189     31-3-12 15:03:07  256       19.9
    Elapsed:              7.84 (mins)
    DB Time:              70.46 (mins)

       
  • Original post: https://www.cnblogs.com/taowang2016/p/3030829.html