• A Somewhat Embarrassing Cluster Startup Troubleshooting Session


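    My role at work has changed, so it has been a long time since I troubleshot a failure hands-on. Today's case was not in production either: to verify some tests, I temporarily set up an 11g RAC environment and, to save time, simply copied a VirtualBox environment I had backed up earlier. After booting the machine, I found that the cluster would not start: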

    [root@jystdrac1 ~]# su - grid
    [grid@jystdrac1 ~]$ crsctl stat res -t
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4000: Command Status failed, or completed with errors.
    [grid@jystdrac1 ~]$ crsctl stat res -t -init
    CRS-4639: Could not contact Oracle High Availability Services
    CRS-4000: Command Status failed, or completed with errors.
    

    The cluster alert log shows the following errors:

    [grid@jystdrac1 jystdrac1]$ pwd
    /opt/app/11.2.0/grid/log/jystdrac1
    [grid@jystdrac1 jystdrac1]$ tail -20f alertjystdrac1.log
    2021-07-01 00:26:27.379:
    [/opt/app/11.2.0/grid/bin/oraagent.bin(4526)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
    2021-07-01 00:26:31.384:
    [ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
    2021-07-01 00:28:32.889:
    [/opt/app/11.2.0/grid/bin/oraagent.bin(4568)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
    2021-07-01 00:28:36.895:
    [ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
    2021-07-01 00:28:38.424:
    [mdnsd(4644)]CRS-5602:mDNS service stopping by request.
    2021-07-01 00:30:38.407:
    [/opt/app/11.2.0/grid/bin/oraagent.bin(4633)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
    2021-07-01 00:30:42.412:
    [ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
    2021-07-01 00:32:43.923:
    [/opt/app/11.2.0/grid/bin/oraagent.bin(4676)]CRS-5818:Aborted command 'start' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/agent/ohasd/oraagent_grid/oraagent_grid.log.
    2021-07-01 00:32:47.928:
    [ohasd(4160)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.gpnpd'. Details at (:CRSPE00111:) {0:0:2} in /opt/app/11.2.0/grid/log/jystdrac1/ohasd/ohasd.log.
    2021-07-01 00:32:49.455:
    [mdnsd(4822)]CRS-5602:mDNS service stopping by request.
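
    With this many repeated entries, it can help to condense the alert log into a count of start timeouts per resource. The following is only a minimal sketch: the function name summarize_start_timeouts is made up for illustration, and it assumes the CRS-2757 message format is exactly the one shown in the log above.

```shell
#!/bin/sh
# Count CRS-2757 "Start timed out" messages per resource in a Grid
# Infrastructure alert log. Hypothetical helper, not part of any Oracle tool.
summarize_start_timeouts() {
    # $1: path to the alert log, e.g. alertjystdrac1.log
    grep 'CRS-2757' "$1" \
        | sed -n "s/.*from the resource '\([^']*\)'.*/\1/p" \
        | sort | uniq -c | sort -rn
}

# Run only when a log path is passed on the command line.
if [ "$#" -gt 0 ]; then
    summarize_start_timeouts "$1"
fi
```

    Run against the alert log above, this would list ora.mdnsd and ora.gpnpd, which are the two resources to look at next.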
    

    Next, look at the most recent errors in mdnsd.log (gpnpd.log is similar and is omitted to save space):

    [grid@jystdrac1 mdnsd]$ pwd
    /opt/app/11.2.0/grid/log/jystdrac1/mdnsd
    [grid@jystdrac1 mdnsd]$ tail -20 mdnsd.log
    2021-06-30 22:50:59.275: [    MDNS][1534236416] mdnsd exit
    2021-06-30 22:53:03.989: [ default][1342412544]
    
    ================================================================================
    2021-06-30 22:53:03.989: [ default][1342412544]mdnsd START pid=2201
    [  clsdmt][1335961344]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
    2021-06-30 22:53:03.991: [  clsdmt][1335961344]PID for the Process [2201], connkey 9
    2021-06-30 22:53:03.991: [  clsdmt][1335961344]Creating PID [2201] file for home /opt/app/11.2.0/grid host jystdrac1 bin mdns to /opt/app/11.2.0/grid/mdns/init/
    2021-06-30 22:53:03.992: [  clsdmt][1335961344]Writing PID [2201] to the file [/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid]
    2021-06-30 22:53:03.992: [  clsdmt][1335961344]Failed to record pid for MDNSD
    2021-06-30 22:53:03.992: [  clsdmt][1335961344]Terminating process
    2021-06-30 22:53:03.992: [    MDNS][1335961344] clsdm requested mdnsd exit
    2021-06-30 22:53:03.992: [    MDNS][1335961344] mdnsd exit
    2021-06-30 22:57:14.236: [ default][747345664]
    
    ================================================================================
    2021-06-30 22:57:14.236: [ default][747345664]mdnsd START pid=2375
    [  clsdmt][740894464]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jystdrac1DBG_MDNSD))
    2021-06-30 22:57:14.239: [  clsdmt][740894464]PID for the Process [2375], connkey 9
    2021-06-30 22:57:14.239: [  clsdmt][740894464]Cr[grid@jystdrac1 mdnsd]$
    

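    MOS also has a note covering the top five reasons a RAC cluster fails to start:

    • Top 5 Grid Infrastructure Startup Issues (Doc ID 1526147.1)

    Issue 4 in that note, "Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running", closely matches the symptoms here.

    The note lists the possible causes and the corresponding solutions: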

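    Possible causes:
    
    1. orarootagent is missing execute permission
    2. The process's <node>.pid file is missing, or its ownership/permissions are wrong
    3. GRID_HOME ownership/permissions are wrong
    
    Solutions:
    
    1. Compare ownership/permissions against a known-good GRID_HOME and correct them accordingly, or run the following as root: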
       # cd <GRID_HOME>/crs/install
       # ./rootcrs.pl -unlock
       # ./rootcrs.pl -patch
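    This stops the clusterware, resets ownership/permissions on the files that need to be owned by root, and restarts the clusterware.
    2. If the corresponding <node>.pid file does not exist, create it with touch and give it the appropriate ownership/permissions; otherwise correct the ownership/permissions of the existing <node>.pid file as required. Then restart the clusterware.
    The following <node>.pid files under <GRID_HOME> are owned by root:root with permission 644: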
      ./ologgerd/init/<node>.pid
      ./osysmond/init/<node>.pid
      ./ctss/init/<node>.pid
      ./ohasd/init/<node>.pid
      ./crs/init/<node>.pid
    The following are owned by <grid>:oinstall with permission 644:
      ./mdns/init/<node>.pid  
      ./evm/init/<node>.pid
      ./gipc/init/<node>.pid
      ./gpnp/init/<node>.pid
    
    3. For cause 3, see solution 1
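
    The ownership and permission checks from solution 2 can be scripted instead of being done by eye. This is a rough sketch under several assumptions: check_pid_file and check_all_pid_files are hypothetical helper names, stat -c is the GNU coreutils form found on Linux, and the group is taken to be oinstall as in the list above.

```shell
#!/bin/sh
# check_pid_file FILE OWNER:GROUP MODE
# Prints OK / MISSING / MISMATCH for one <node>.pid file (hypothetical helper).
check_pid_file() {
    pid_file="$1"; want_owner="$2"; want_mode="$3"
    if [ ! -e "$pid_file" ]; then
        echo "MISSING  $pid_file"
        return 1
    fi
    # GNU stat: %U:%G is owner:group, %a is the octal permission mode
    actual="$(stat -c '%U:%G %a' "$pid_file")"
    if [ "$actual" = "$want_owner $want_mode" ]; then
        echo "OK       $pid_file ($actual)"
    else
        echo "MISMATCH $pid_file (have $actual, want $want_owner $want_mode)"
        return 1
    fi
}

# check_all_pid_files GRID_HOME NODE GRID_OWNER
# Walks the two groups of pid files listed in the MOS note.
check_all_pid_files() {
    grid_home="$1"; node="$2"; grid_owner="$3"
    rc=0
    for d in ologgerd osysmond ctss ohasd crs; do
        check_pid_file "$grid_home/$d/init/$node.pid" "root:root" 644 || rc=1
    done
    for d in mdns evm gipc gpnp; do
        check_pid_file "$grid_home/$d/init/$node.pid" "$grid_owner:oinstall" 644 || rc=1
    done
    return $rc
}

# Run only when all three arguments are supplied on the command line.
if [ "$#" -eq 3 ]; then
    check_all_pid_files "$1" "$2" "$3"
fi
```

    In this environment it would be invoked with /opt/app/11.2.0/grid, jystdrac1 and grid as the three arguments. Note that a check like this only covers ownership and mode; it says nothing about whether the file can actually be written.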
    

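    Yet after checking each item in turn, everything looked fine. Strange: if the permissions are all correct, why can't the PID file be written?

    What happens if I try writing it by hand with vi?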

    [grid@jystdrac1 jystdrac1]$ vi /opt/app/11.2.0/grid/mdns/init/jystdrac1.pid
    2201
    

    Saving the file fails with an error:

    "/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid"
    "/opt/app/11.2.0/grid/mdns/init/jystdrac1.pid" E514: write error (file system full?)
    Press ENTER or type command to continue
    

    What? The filesystem is full???

    [grid@jystdrac1 jystdrac1]$ df -h
    Filesystem                        Size  Used Avail Use% Mounted on
    /dev/mapper/vg_linuxbase-lv_root   28G   27G     0 100% /
    tmpfs                             1.5G     0  1.5G   0% /dev/shm
    /dev/sda1                         485M   39M  421M   9% /boot
    

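    Er, sure enough... How embarrassing: it turned out to be the most basic problem of all, a full disk.
    I quickly freed up some space and restarted the cluster to see whether it would come up normally.
    It's OK!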
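
    Since the root cause was a full root filesystem, a free-space check is a cheap first step before digging into permissions. The sketch below is illustrative only: avail_kb and has_free_space are hypothetical helper names, and the 100 MB default threshold is an arbitrary assumption rather than an Oracle recommendation.

```shell
#!/bin/sh
# avail_kb PATH: print the free space, in 1K blocks, on the filesystem
# that holds PATH (hypothetical helper).
avail_kb() {
    # -P forces the one-line-per-filesystem POSIX format, -k reports 1K blocks;
    # column 4 of the data line is the "Available" value.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_free_space PATH MIN_KB: succeed only if at least MIN_KB is available.
has_free_space() {
    [ "$(avail_kb "$1")" -ge "$2" ]
}

# Run only when a path is passed; the threshold defaults to 102400 KB (100 MB).
if [ "$#" -ge 1 ]; then
    if has_free_space "$1" "${2:-102400}"; then
        echo "$1: enough free space ($(avail_kb "$1") KB available)"
    else
        echo "$1: LOW on space ($(avail_kb "$1") KB available)"
    fi
fi
```

    Pointed at /opt/app/11.2.0/grid here, it would have reported 0 KB available straight away. To find what is filling the disk, something like du -xk on the suspect directory piped through sort -rn is one option.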

    AlfredZhao © All rights reserved. "Setting sail from Oracle to explore the wonders of IT."
  • Original article: https://www.cnblogs.com/jyzhao/p/14957091.html