• Know about RAC Clusterware Process OPROCD


    OPROCD introduced in 10.2.0.4 Linux and other Unix platform.
    • Fencing
      • Cluster handling of nodes that should not have access to shared resources
      • STONITH – Power cycle the node
      • PCW – nodes fence themselves through the reboot(8) command
      • Fabric Fencing from Polyserve
        • Healthy nodes send SNMP msgs to Fabric switch to disable SAN access from unhealthy nodes [ fence them out ]
        • Server is left in up state to view logs etc.
    • Oracle's Cluster I/O Fencing solution
    • Only started on Unix platforms when vendor Clusterware is not running
    • Does not run on Windows and Linux!
    • Takes 2 parameters
      • Timeout value [ length of time between executions ]
      • Margin [ leeway for dispatches ]
      • Oproc.debug –t 1000 –m 500
    • In fatal mode node will get reboot’ed
    • In non-fatal mode error messages will be logged
    OPROCD - This process is spawned in any non-vendor clusterware environment, except on Windows where Oracle uses a kernel driver to perform the same actions and Linux prior to version 10.2.0.4. If oprocd detects problems, it will kill a node via C code. It is spawned in init.cssd and runs as root. This daemon is used to detect hardware and driver freezes on the machine. If a machine were frozen for long enough that the other nodes evicted it from the cluster, it needs to kill itself to prevent any IO from getting reissued to the disk after the rest of the cluster has remastered locks." *** Oprocd log locations: In /etc/oracle/oprocd or /var/opt/oracle/oprocd depending on version/platform.
    Note that oprocd only runs when no vendor clusterware is running or on Linux > 10.2.0.4

    COMMON CAUSES OF OPROCD REBOOTS

    - A problem detected by the OPROCD process. This can be caused by 4 things:1) An OS scheduler problem. 2) The OS is getting locked up in a driver or hardware. 3) Excessive amounts of load on the machine, thus preventing the scheduler from behaving reasonably. 4) An Oracle bug.OPROCD Bugs Known to Cause Reboots: Bug 5015469 - OPROCD may reboot the node whenever the system date is moved backwards. Fixed in 10.2.0.3+ Bug 4206159 - Oprocd is prone to time regression due to current API used (AIX only) Fixed in 10.1.0.3 + One off patch for Bug 4206159. Diagnostic Fixes (VERY NECESSARY IN MOST CASES): Bug 5137401 - Oprocd logfile is cleared after a reboot Fixed in 10.2.0.4+ Bug 5037858 - Increase the warning levels if a reboot is approaching Fixed in 10.2.0.3+

    FILES TO REVIEW AND GATHER FOR OPROCD REBOOTS

    If logging a service request, please provide ALL of the following files to Oracle Support if possible: - Oprocd logs in /etc/oracle/oprocd or /var/opt/oracle/oprocd depending on version/platform. - All the files in the following directories from all nodes. For 10.2 and above, all files under: <CRS_HOME>/log Recommended method for gathering these for each node would be to run the diagcollection.pl script. For 10.1: <CRS_HOME>/crs/log <CRS_HOME>/crs/init <CRS_HOME>/css/log <CRS_HOME>/css/init <CRS_HOME>/evm/log <CRS_HOME>/evm/init <CRS_HOME>/srvm/log Recommended method for gathering these for each node: cd <CRS_HOME> tar cf crs.tar crs/init crs/log css/init css/log evm/init evm/log srvm/log - Messages or Syslog from all nodes from the time of the problem: Sun: /var/adm/messages HP-UX: /var/adm/syslog/syslog.log Tru64: /var/adm/messages Linux: /var/log/messages IBM: /bin/errpt -a > messages.out - 'opatch lsinventory -detail' output for the CRS home - It would also be useful to get the following from each node leading up to the time of the reboot: - netstat -is (or equivelant) - iostat -x (or equivelant) - vmstat (or equivelant) There is a tool called "OS Watcher" that helps gather this information. This tool will dump netstat, vmstat, iostat, and other output at an inverval and save x number of hours of archived data. For more information about this tool see Note 301137.1.  
    The OPROCD executable sets a signal handler for the SIGALRM handler and sets the interval timer based on the to-millisec parameter provided. The alarm handler gets the current time and checks it against the time that the alarm handler was last entered. If the difference exceeds (to-millisec + margin-millisec), it will fail; the production version will cause a node reboot. In fatal mode, OPROCD will reboot the node if it detects excessive wait. In Non Fatal mode, it will write an error message out to the file .oprocd.log in one of the following directories. Oracle clusterware has the following three daemons which may be responsible for panicing the node. It is possible that some other external entity may have rebooted the node. In the context of this discussion, we will assume that the reboot/panic was done by an Oracle clusterware daemon. * Oprocd – Cluster fencing module * Cssd – Cluster sychronization module which manages node membership * Oclsomon – Cssd monitor which will monitor for cssd hangs OPROCD This is a daemon that only gets activated when there is no vendor clusterware present on the OS.This daemon is also not activated to run on Windows/Linux. This daemon runs a tight loop and if it is not scheduled for 1.5 seconds, will reboot the node. CSSD This daemon pings the other members of the cluster over the private network and Voting disk. If this does not get a response for Misscount seconds and Disktimeout seconds respectively, it will reboot the node. Oclsomon This daemon monitors the CSSD to ensure that CSSD is scheduled by the OS, if it detects any problems it will reboot the node.
    A sample log looks like
    May 11 18:13:15.528 | INF | monitoring started with timeout(1000), margin(500)
    May 11 18:13:15.548 | INF | normal startup, setting process to fatal mode
    May 12 11:43:00.899 | INF | shutting down from client request
    May 12 11:43:00.899 | INF | exiting current process in NORMAL mode
    May 12 12:10:43.984 | INF | monitoring started with timeout(1000), margin(500)
    May 13 11:29:37.528 | INF | shutting down from client request
    May 13 11:29:37.528 | INF | exiting current process in NORMAL mode
    
    When fatal mode is disabled, OPROCD will write the following to the log file and exit:
    May 10 18:01:40.668 | INF | monitoring started with timeout(1000), margin(500)
    May 10 18:23:02.490 | ERR | AlarmHandler:? timeout(1739751316) exceeds interval(1000000000)+margin(500000000)
    
    [root@rh2 ~]# ps -ef|grep oprocd|grep -v grep
    root     19763     1  0 Jun27 ?        00:00:00 oprocd start
    
    [root@rh2 oprocd]# cd /etc/oracle/oprocd
    
    [root@rh2 oprocd]# ls -l
    total 20
    drwxrwx--- 2 root oinstall 4096 Jun 27 23:52 check
    drwxrwx--- 2 root oinstall 4096 Mar 29 22:37 fatal
    -rwxr--r-- 1 root root      512 Jun 27 23:52 rh2.oprocd.lgl
    -rw-r--r-- 1 root root      171 Jun 27 23:52 rh2.oprocd.log
    drwxrwx--- 2 root oinstall 4096 Jun 27 23:52 stop
    
    [root@rh2 oprocd]# cat rh2.oprocd.log
    Jun 27 23:52:47.861 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
    Jun 27 23:52:47.864 | INF | normal startup, setting process to fatal mode
    
    [root@rh2 oprocd]# oprocd
    
    usage:  oprocd [start | startInstall | stop | check | enableFatal| help | -?]
    
            run [ -t | -m | -g | -f  | -e]   foreground startup
                  -t           timeout in ms
                  -m            timout margin in ms
                  -e           clock skew epsilon in ms
                  -g         group name to enable fatal
                  -f                    fatal startup
    
            start  [-t | -m  | -e]           starts the daemon
                    -t         timeout in ms
                    -m          timout margin in ms
                     -e         clock skew epsilon in ms
    
            startInstall [ -t | -m | -g  | - e] start process in install mode
                           -t    timeout in ms
                           -m     timout margin in ms
                           -e    clock skew epsilon in ms
                           -g  group name to enable fatal
    
            enableFatal  [ -t ]             force install mode process to fatal
                           -t    timeout for response in ms
            stop         [ -t ]             stops running daemon
                           -t    timeout for response in ms
            check        [ -t ]           checks status of daemon
                           -t    timeout for response in ms
            help                          this help information
            -?                            same as help above
    
    [root@rh2 oprocd]# oprocd stop
    Jun 28 00:17:36.604 | INF | daemon shutting down
    Oracle Clusterware Process Monitor (OPROCD) From Julian Dyke
    Process Monitor Daemon
    Provides Cluster I/O Fencing
    
    Implemented on Unix systems
    Not required with third-party clusterware
    Implemented in Linux in 10.2.0.4 and above
    In 10.2.0.3 and below hangcheck timer module is used
    
    Provides hangcheck timer functionality to maintain cluster integrity
    Behaviour similar to hangcheck timer
    Runs as root
    Locked in memory
    Failure causes reboot of system
    See /etc/init.d/init.cssd for operating system reboot commands
    
    OPROCD takes two parameters
    -t  - Timeout value
    Length of time between executions (milliseconds)
    Normally defaults to 1000
    -m - Margin
    Acceptable margin before rebooting (milliseconds)
    Normally defaults to 500
    
    Parameters are specified in /etc/init.d/init.cssd
    OPROCD_DEFAULT_TIMEOUT=1000
    OPROCD_DEFAULT_MARGIN=500
    
    Contact Oracle Support before changing these values
    
    /etc/init.d/init.cssd can increase OPROCD_DEFAULT_MARGIN based on two CSS variables
    reboottime (mandatory)
    diagwait (optional)
    Values can for these be obtained using
    
    [root@server3]# crsctl get css reboottime
    [root@server3]# crsctl get css diagwait
    
    Both values are reported in seconds
    The algorithm is
    
    If diagwait > reboottime then
    OPROCD_DEFAULT_MARGIN := (diagwait - reboottime) * 1000
    
    Therefore increasing diagwait will reduce frequency of reboots e.g
    
    [root@server3]# crsctl set css diagwait 13
  • 相关阅读:
    PHP新的垃圾回收机制:Zend GC详解
    SSH隧道技术简介
    mysql主从延迟
    非root配置linux下vim
    PHP 中的 9 个魔术方法
    PHP内核介绍及扩展开发指南—Extensions 的编写(下)
    PHP内核介绍及扩展开发指南—Extensions 的编写
    php 扩展开发
    php opcode
    rsa 数学推论
  • 原文地址:https://www.cnblogs.com/macleanoracle/p/2967285.html
Copyright © 2020-2023  润新知