• reliable message


    reliable message

    1 现象

    同事反馈一个Sqlloader 进程加载数据特别慢,平常几分钟运行完的事情,这次跑了3个半小时还没跑完。 查询数据库会话,信息如下:

    SID   USER_NAME  EVENT
    ---   ---------- -------
    1962  STG        reliable message
    

    趁此机会,研究一下这个等待事件吧。

    2 reliable message问题分析

     

    2.1 事件说明

    MOS 上对于reliable message的解释如下:

    When a process sends a message using the 'KSR' intra-instance broadcast service,
     the message publisher waits on this wait-event until all subscribers have
    consumed the 'reliable message' just sent. The publisher waits on this wait-event
     for up to one second and then re-tests if all subscribers have consumed the
    message, or until posted. If the message is not fully consumed the wait recurs,
    repeating until either the message is consumed or until the waiter is interrupted.
    

    说明此等待事件是发布消息方出现的等待。当消息队列中的消息没有被全部读取的时候,就会等待此事件。 经查阅文档得知,此等待事件,是针对各种channel的。不同的channel 针对不同的情况。也就有不同 的解决方法。而大部分是BUG,需要打补丁,或者升级至更高的版本.workaround,基本上是重启实例, 或者关闭相关的功能。

    2.2 查看渠道

    从gv$channel_waits 视图里查询问题最严重的 channel. 方法1,可以马上确定有问题的一个或者多个channel. 而方法2虽然也可以,但是略显麻烦。

    • 方法1

      SELECT CHANNEL,
        SUM(wait_count) sum_wait_count
      FROM GV$CHANNEL_WAITS
      GROUP BY CHANNEL
      ORDER BY SUM(wait_count) DESC;
      

      查询示例:

      CHANNEL                                                          SUM_WAIT_COUNT
      ---------------------------------------------------------------- --------------
      Result Cache: Channel                                                  15436686
      RBR channel                                                                9393
      kxfp control signal channel                                                7357
      MMON remote action broadcast channel                                       3070
      obj broadcast channel                                                      1731
      service operations - broadcast channel                                        2
      kill job broadcast - broadcast channel                                        2
      parameters to cluster db instances - broadcast channel                        2
      quiesce channel                                                               2
      

      从上面查询结果,可以看到 "Result Cache: Channel", 是最有问题的channel.

    • 方法2

      select to_char(p1, 'XXXXXXXXXXXXXXXX') event_param,
       count(*), sum(time_waited/1000000) time_waited
      from gv$active_session_history
      where event = 'reliable message'
      group by to_char(p1, 'XXXXXXXXXXXXXXXX')
      order by time_waited*count(*) desc;
      -- 取出影响最大的内存地址
      select name_ksrcdes
       from x$ksrcdes
       where indx in (select name_ksrcctx from x$ksrcctx where addr in (&1));
      Enter value for 1: '7ACD8AA60','7ACD8FA88'
      old   3:  where indx in (select name_ksrcctx from x$ksrcctx where addr in (&1))
      new   3:  where indx in (select name_ksrcctx from x$ksrcctx where addr in ('7ACD8AA60','7ACD8FA88'))
      
      NAME_KSRCDES
      ----------------------------------------------------------------
      Result Cache: Channel
      RBR channel
      

      从上面查询结果来看,已明确定位到有问题的 "Result Cache: Channel". 上面只是一个查询多个 channel的示例。这个例子中只需要查询第一个 addr='7ACD8AA60' 即可。

    3 解决办法

     

    3.1 Result Cache: Channel

    以下内容三选一:

    • 数据库更新到 12.2 或者12.1.0.2.0 Patchset
    • 应用补丁 18416368
    • workaround

      SQL> alter system set result_cache_max_size=0 scope=both sid='*';
      

      修改参数后,实例需要重启。

    3.2 RBR channel

    影响版本:11.2.0.3

    Bug 15826962 High "reliable message" wait due to "RBR channel"。

    最保险的办法是得出进程trace,或者system trace,然后与MOS 文档对照,或者开SR,由Oracle 服务人员帮忙确定。

    在以下版本、补丁中得到修复 :

    • 11.2.0.4 (Server Patch Set)
    • 11.2.0.3.12 (Oct 2014) Database Patch Set Update (DB PSU)
    • 11.2.0.3 Bundle Patch 19 for Exadata Database
    • 11.2.0.3 Patch 34 on Windows Platforms
    • 11.2.0.3 Patch 23 on Windows Platforms

    所以解决办法是升级或者打补丁。

    3.3 kxfp control signal channel

    影响版本
    12.1.0.2
    (no term)

    现象分析 其实这里并不只是这一个channel等待严重。示例如下:

    SQL> select CHANNEL,sum(wait_count) sum_wait_count
    
    from GV$CHANNEL_WAITS group by
    CHANNEL order by sum(wait_count)   2    3
     4  /
    
    CHANNEL                                                          SUM_WAIT_COUNT
    ---------------------------------------------------------------- --------------
    Flashback RVWR init channel                                                   2
    quiesce channel                                                               3
    PMON actions channel                                                          6
    Broker IQ Result Channel                                                     24
    kill job broadcast - broadcast channel                                       54
    parameters to cluster db instances - broadcast channel                      137
    GEN0 ksbxic channel                                                        1035
    Flashback Marker channel                                                   1546
    LCK0 ksbxic channel                                                        2669
    service operations - broadcast channel                                     7033
    MMON remote action broadcast channel                                      78046
    kxfp remote slave spawn channel                                          157850
    Result Cache: Channel                                                    242303
    RBR channel                                                             1595647
    obj broadcast channel                                                   4105387
    kxfp control signal channel                                             5582125
    

    可以看到除了,kxfp control signal channel 外还有 obj broadcast channel . 这两个是 其他的数倍甚至是数十倍。

    同时,建议做一个hang analyze 。 查看trace 文件中是否包含了以下内容:

    ervice name: SYS$BACKGROUND
         Current Wait Stack:
          1: waiting for 'CSS group membership query'
    

    如果有,说明CSS 组成员关系查询出现了阻塞,正常情况下应该是非常快的。

    以上两个现象,基本可以确定是Oracle BUG: 20470877.

    (no term)

    解决办法只有更新补丁

    Patch 20470877: LONG WAITS FOR "RELIABLE MESSAGE" AFTER A FEW DAYS OF UPTIME

    (no term)

    workaround

    重启实例

    Author: halberd.lee

    Created: 2019-12-26 Thu 13:39

    Validate

  • 相关阅读:
    Vim step by step
    Ubuntu解压命令全览
    这样才能使本地Mysql服务允许被外部主机连接(两步)
    [Python] logging.logger
    Python Selenium
    MySQL中char、varchar和text的区别
    Way to MongoDB
    Python误区之strip,lstrip,rstrip
    Android Studio Tips
    Way to tmux
  • 原文地址:https://www.cnblogs.com/halberd-lee/p/12101572.html
Copyright © 2020-2023  润新知