• What is Split Brain in Oracle Clusterware and Real Application Cluster (Doc ID 1425586.1)


    In this Document

      Purpose
      Scope
      Details
      1. Clusterware layer
      2. Real Application Cluster (database) layer
      Known Issues
      References

    APPLIES TO:

    Oracle Database - Enterprise Edition - Version 10.1.0.2 and later
    Information in this document applies to any platform.

    PURPOSE

    This note explains what split brain is in an Oracle Real Application Cluster and which errors and consequences are associated with it.

    SCOPE

    For DBAs and Support engineers.

    DETAILS

    In generic terms, split brain refers to data inconsistencies originating from the maintenance of two separate data sets with overlap in scope, either because of servers in a network design, or because of a failure condition in which servers do not communicate and synchronize their data with each other.

    Two components of an Oracle Real Application Cluster implementation can experience split brain.

    1. Clusterware layer

    Cluster nodes maintain their heartbeat via the private network and the voting disk. When a private network disruption prevents cluster nodes from communicating with each other over the private network for the period set by misscount, split brain occurs. In that case, the voting disk is used to determine which node(s) survive and which node(s) are evicted. The common voting results are:

    a. The group with more cluster nodes survives.
    b. If each group has the same number of nodes, the group containing the lowest-numbered node survives.
    c. Some improvements have been made to ensure that the node(s) with lower load survive when the eviction is caused by high system load.
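
    Rules (a) and (b) above can be sketched as a small selection function. This is only an illustration of the stated rules, not Oracle's actual implementation: the function name pick_surviving_group and the representation of sub-clusters as lists of node numbers are assumptions made for this example, and the load-based behaviour in (c) is not modelled.

```python
from typing import List, Sequence


def pick_surviving_group(groups: Sequence[Sequence[int]]) -> List[int]:
    """Return the sub-cluster expected to survive under rules (a) and (b).

    Hypothetical helper for illustration only; each group is a list of
    node numbers that can still talk to each other.
    """
    if not groups or any(len(g) == 0 for g in groups):
        raise ValueError("need at least one non-empty group of node numbers")
    # Rule (a): the larger group wins (hence the negated length).
    # Rule (b): on a tie, the group holding the lowest node number wins.
    best = min(groups, key=lambda g: (-len(g), min(g)))
    return sorted(best)


# Three-node cluster split into {1, 2} and {3}: the larger group survives.
print(pick_surviving_group([[1, 2], [3]]))
# Two-node cluster split into {2} and {1}: sizes tie, lowest node number wins.
print(pick_surviving_group([[2], [1]]))
```

    In the two-node case, which is the most common split, the tie-break in rule (b) decides the outcome.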

    Commonly, messages similar to the following appear in ocssd.log when split brain happens:

    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: 
    ###################################
    [ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
    ###################################

    The above messages indicate that communication from node 2 to node 1 is not working, so node 2 sees only one node, while node 1 is working fine and can see two nodes in the cluster. To avoid split brain, node 2 aborted itself.

    Solution: Please engage the network administrator to check the private network layer and eliminate any network fault.
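
    When many ocssd.log files have to be reviewed, the "my node ... VS Node ..." line can be extracted with a regular expression. The sketch below assumes the line has exactly the layout shown in the excerpt above; the names COHORT_RE and parse_cohort_line are hypothetical, and other Clusterware versions may format the message differently.

```python
import re
from typing import Dict, Optional

# Matches the cohort comparison line as it appears in the excerpt above.
COHORT_RE = re.compile(
    r"my node\((?P<my_node>\d+)\), Leader\((?P<my_leader>\d+)\), "
    r"Size\((?P<my_size>\d+)\)\s+VS\s+"
    r"Node\((?P<other_node>\d+)\), Leader\((?P<other_leader>\d+)\), "
    r"Size\((?P<other_size>\d+)\)"
)


def parse_cohort_line(line: str) -> Optional[Dict[str, int]]:
    """Return the six numbers from a cohort line, or None if no match."""
    match = COHORT_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : "
    "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)"
)
info = parse_cohort_line(sample)
if info is not None and info["my_size"] < info["other_size"]:
    # The local group is the smaller one, which is why the local node aborts.
    print(f"node {info['my_node']} is in the smaller group")
```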

    2. Real Application Cluster (database) layer

    To ensure data consistency, each instance of a RAC database needs to maintain a heartbeat with the other instances. The heartbeat is maintained by background processes such as LMON, LMD, LMS and LCK. If any of these processes experiences an IPC Send timeout, it triggers communication reconfiguration and instance eviction to avoid split brain. The controlfile is used, similarly to the voting disk in the clusterware layer, to determine which instance(s) survive and which instance(s) are evicted. The voting result is similar to the clusterware voting result. As a result, one or more instances will be evicted.

    Common messages in the instance alert logs are similar to:

    alert log of instance 1:
    ---------
    Mon Dec 07 19:43:05 2011
    IPC Send timeout detected.Sender: ospid 26318
    Receiver: inst 2 binc 554466600 ospid 29940
    IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
    Mon Dec 07 19:43:07 2011
    Communications reconfiguration: instance_number 2
    Mon Dec 07 19:43:07 2011
    Trace dumping is performing id=[cdmp_20091207194307]
    Waiting for clusterware split-brain resolution
    Mon Dec 07 19:53:07 2011
    Evicting instance 2 from cluster
    Waiting for instances to leave: 

    ...

    alert log of instance 2:
    ---------
    Mon Dec 07 19:42:18 2011
    IPC Send timeout detected. Receiver ospid 29940
    Mon Dec 07 19:42:18 2011
    Errors in file 
    /u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
    Trace dumping is performing id=[cdmp_20091207194307]
    Mon Dec 07 19:42:20 2011
    Waiting for clusterware split-brain resolution
    Mon Dec 07 19:44:45 2011
    ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
    Mon Dec 07 19:44:51 2011
    ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
    Mon Dec 07 19:45:38 2011
    ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
    Mon Dec 07 19:52:27 2011
    Errors in file 
    /u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc  
    (incident=90153):
    ORA-29740: evicted by member 0, group incarnation 10
    Incident details in: 
    /u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc

    In the above example, LMD0 of instance 2 (ospid 29940) is the receiver in the IPC Send timeout. Various causes can lead to an IPC Send timeout, for example:

    a. Network problem
    b. Process hang
    c. Bug, etc.

    Please see "Top 5 issues for Instance Eviction" (Document 1374110.1) for more information.

    In case of an instance eviction, the alert log and all background traces need to be checked to determine the root cause.
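
    As a first pass over an alert log, the eviction-related messages quoted in the examples above can be located with a simple scan. This is a minimal sketch: the marker list contains only strings taken from those excerpts and is not exhaustive, and the names MARKERS and find_eviction_markers are hypothetical.

```python
from typing import Iterable, List, Tuple

# Message fragments taken from the alert log excerpts in this note.
MARKERS = (
    "IPC Send timeout",
    "Communications reconfiguration",
    "Waiting for clusterware split-brain resolution",
    "Evicting instance",
    "ORA-29740",
)


def find_eviction_markers(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for every line containing a marker."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((lineno, line.strip()))
    return hits


sample = """\
Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
"""
hits = find_eviction_markers(sample.splitlines())
for lineno, text in hits:
    print(f"{lineno}: {text}")
```

    For a real log, the same function can be given an open file object instead of the sample string. The hits only show where to start reading; the surrounding lines and the trace files they reference still need to be reviewed.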

    Known Issues

    1. Bug 7653579 - IPC send timeout in RAC after only short period Document 7653579.8
        Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC Document 761717.1
        Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch 22 on Windows

    2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load
        Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 Document 1373749.1
        Fixed in: 11.2.0.1

    3. Bug 8365141 - DRM quiesce step hang causes instance eviction Document 8365141.8
        Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch 25 for Windows and 11.2.0.1

    4. Bug 7587008 - Hung RAC instance not evicted from cluster Document 7587008.8
        Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1; one-off patches are available for various 11.1.0.7 releases

    5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits Document 11890804.8
        Fixed in: 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 Patch 10 on Windows

    6. BUG:13732226 - NODE GETS EVICTED WITH REASON CODE 0X2
        BUG:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
        BUG:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
        Refer: 11gR2: LMON received an instance eviction notification from instance n Document 1440892.1
        Fixed in: 11.2.0.4; some merge patches are available for 11.2.0.2 and 11.2.0.3

  • Original article: https://www.cnblogs.com/future2012lg/p/4317970.html