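Symptom: while starting CRS, once startup reached the OSYSMOND process, that process drove the server's CPU to 100%. The errors below appeared in /var/log/messages.
Initial diagnosis: ASMFD failed to load, so ASM did not start, and as a result crsd.bin did not start either. The fix was to stop ACFS, remove ASMFD, bind the disks with udev instead, and restart the cluster.
Check /var/log/messages: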
Mar 12 09:04:29 lxtrac04 journal: Oracle Clusterware: 2018-03-12 09:04:29.909#012[(27517)]CRS-8500:Oracle Clusterware OSYSMOND process is starting with operating system process ID 27517
Mar 12 09:04:31 lxtrac04 kernel: blk_update_request: I/O error, dev fd0, sector 0
Mar 12 09:04:31 lxtrac04 kernel: floppy: error -5 while reading block 0
Mar 12 09:04:31 lxtrac04 kernel: loop: module loaded
Mar 12 09:04:31 lxtrac04 kernel: Unknown ioctl -2146954638
Mar 12 09:04:31 lxtrac04 kernel: Unknown ioctl 4731
Mar 12 09:04:31 lxtrac04 kernel: Unknown ioctl 4712
Mar 12 09:04:31 lxtrac04 kernel: Unknown ioctl 4712
Mar 12 09:04:31 lxtrac04 kernel: PPP generic driver version 2.4.2
Mar 12 09:04:31 lxtrac04 kernel: Bluetooth: Core ver 2.20
Mar 12 09:04:31 lxtrac04 kernel: NET: Registered protocol family 31
Mar 12 09:04:31 lxtrac04 kernel: Bluetooth: HCI device and connection manager initialized
Mar 12 09:04:31 lxtrac04 kernel: Bluetooth: HCI socket layer initialized
Mar 12 09:04:31 lxtrac04 kernel: Bluetooth: L2CAP socket layer initialized
Mar 12 09:04:31 lxtrac04 kernel: Bluetooth: SCO socket layer initialized
Mar 12 09:04:31 lxtrac04 kernel: Bluetooth: Virtual HCI driver ver 1.5
Mar 12 09:04:31 lxtrac04 kernel: lp0: using parport0 (interrupt-driven).
Mar 12 09:04:31 lxtrac04 kernel: lp0: console ready
Mar 12 09:04:31 lxtrac04 systemd: Reached target Printer.
Mar 12 09:04:31 lxtrac04 systemd: Starting Printer.
Mar 12 09:04:35 lxtrac04 journal: Oracle Clusterware: 2018-03-12 09:04:35.070#012[(27757)]CRS-8500:Oracle Clusterware OLOGGERD process is starting with operating system process ID 27757
Mar 12 09:04:45 lxtrac04 journal: Oracle Clusterware: 2018-03-12 09:04:45.059#012[(27794)]CRS-8500:Oracle Clusterware OLOGGERD process is starting with operating system process ID 27794
Mar 12 09:04:58 lxtrac04 kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [osysmond.bin:27570]
Mar 12 09:04:58 lxtrac04 kernel: Modules linked in: cuse vhost_net vhost macvtap macvlan lp uinput hci_vhci bluetooth rfkill uhid ppp_generic slhc loop rds oracleacfs(PO) oracleadvm(PO) oracleoks(PO) tcp_lp fuse oracleafd(PO) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmw_vsock_vmci_transport vsock coretemp crct10dif_pclmul crc32_pclmul aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon pcspkr sg shpchp vmw_vmci i2c_piix4 parport_pc parport acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx
Mar 12 09:04:58 lxtrac04 kernel: drm_kms_helper crc32c_intel ttm serio_raw ata_piix mptspi drm scsi_transport_spi mptscsih libata e1000 mptbase i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod
Mar 12 09:04:58 lxtrac04 kernel: CPU: 3 PID: 27570 Comm: osysmond.bin Tainted: P O 4.1.12-61.1.18.el7uek.x86_64 #2
Mar 12 09:04:58 lxtrac04 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
Mar 12 09:04:58 lxtrac04 kernel: task: ffff880eff728000 ti: ffff880eff770000 task.ti: ffff880eff770000
Mar 12 09:04:58 lxtrac04 kernel: RIP: 0010:[<ffffffff81720fd8>] [<ffffffff81720fd8>] _raw_spin_lock+0x38/0x60
Mar 12 09:04:58 lxtrac04 kernel: RSP: 0018:ffff880eff773ca0 EFLAGS: 00000202
Mar 12 09:04:58 lxtrac04 kernel: RAX: 000000000000711b RBX: ffff880ff7946cc0 RCX: 0000000000000002
Mar 12 09:04:58 lxtrac04 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880f657d4810
Mar 12 09:04:58 lxtrac04 kernel: RBP: ffff880eff773d38 R08: 0000000000000002 R09: 0000000000000000
Mar 12 09:04:58 lxtrac04 kernel: R10: ffff880fefdede58 R11: ffff880f0541d010 R12: ffffffff81213120
Mar 12 09:04:58 lxtrac04 kernel: R13: ffff880eff773c18 R14: ffffffff8120bcdb R15: ffff880eff773c48
Mar 12 09:04:58 lxtrac04 kernel: FS: 00007f5f899ab700(0000) GS:ffff88103fcc0000(0000) knlGS:0000000000000000
Mar 12 09:04:58 lxtrac04 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 12 09:04:58 lxtrac04 kernel: CR2: 00007fece1138240 CR3: 0000000eff5be000 CR4: 00000000000006e0
Mar 12 09:04:58 lxtrac04 kernel: Stack:
Mar 12 09:04:58 lxtrac04 kernel: ffffffffa059fef1 ffff88103e427000 0000000000000000 ffff880f657d4810
Mar 12 09:04:58 lxtrac04 kernel: 0000000000000024 ffff880ff2ba8000 ffff880f0541d000 ffff880eff773d8c
Mar 12 09:04:58 lxtrac04 kernel: ffff881000000000 ffff880ff2ba8000 0000000000000041 ffff880fefdede58
Mar 12 09:04:58 lxtrac04 kernel: Call Trace:
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffffa059fef1>] ? fuse_abort_conn+0x31/0x270 [fuse]
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffffa0bf23c0>] ? cuse_read_iter+0x70/0x70 [cuse]
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffffa0bf2414>] cuse_process_init_reply+0x54/0x490 [cuse]
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffffa0bf23c0>] ? cuse_read_iter+0x70/0x70 [cuse]
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffffa059dbbf>] request_end+0xbf/0x170 [fuse]
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffffa059fd16>] end_queued_requests.isra.19+0x86/0x160 [fuse]
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffffa059fe8f>] fuse_dev_release+0x9f/0xd0 [fuse]
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffffa0bf211a>] cuse_channel_release+0x8a/0xa0 [cuse]
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffff81210224>] __fput+0xe4/0x220
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffff812103ae>] ____fput+0xe/0x10
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffff810a3ba7>] task_work_run+0xb7/0xf0
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffff81017c6d>] do_notify_resume+0x8d/0xa0
Mar 12 09:04:58 lxtrac04 kernel: [<ffffffff8172147c>] int_signal+0x12/0x17
Mar 12 09:04:58 lxtrac04 kernel: Code: 07 89 c2 c1 ea 10 66 39 c2 75 01 c3 89 d1 0f b7 f2 b8 00 80 00 00 eb 0a 0f 1f 00 f3 90 83 e8 01 74 20 0f b7 17 41 89 d0 41 31 c8 <41> 81 e0 fe ff 00 00 75 e7 55 0f b7 f2 48 89 e5 e8 8b 39 ff ff
Mar 12 09:04:59 lxtrac04 sh: abrt-dump-oops: Found oopses: 1
Mar 12 09:04:59 lxtrac04 sh: abrt-dump-oops: Creating problem directories
Mar 12 09:04:59 lxtrac04 sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
Mar 12 09:04:59 lxtrac04 abrt-server: Duplicate: core backtrace
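The soft lockup line is the key signal in the log above. As a rough sketch only (the function name find_soft_lockups is made up for illustration, and the pattern assumes the kernel message format shown above), the lockup events can be pulled out of a messages file like this:

```shell
#!/usr/bin/env bash
# find_soft_lockups [LOGFILE]
# Hypothetical helper: prints one summary line per "soft lockup" kernel message.
# Assumes the syslog layout seen in this log:
#   <timestamp> <host> kernel: ... soft lockup - CPU#<n> stuck for <s>s! [<task>:<pid>]
find_soft_lockups() {
    local logfile="${1:-/var/log/messages}"
    sed -n -E \
        's/^(.*) kernel: .*soft lockup - CPU#([0-9]+) stuck for ([0-9]+)s! \[([^:]+):([0-9]+)\].*$/\1 cpu=\2 secs=\3 task=\4 pid=\5/p' \
        "$logfile"
}

# Usage (needs read access to the log):
#   find_soft_lockups /var/log/messages
```

Run against this log, it should name osysmond.bin as the stuck task, which matches the process that was using 100% CPU.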
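Searching MOS (My Oracle Support) for "blk_update_request: I/O error, dev fd0, sector 0" did turn up ASMFD-related documents.
1. Start the cluster in exclusive mode without CRSD (-excl -nocrs)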
[root@lxtrac04 bin]# ./crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'lxtrac04'
CRS-2672: Attempting to start 'ora.evmd' on 'lxtrac04'
CRS-2672: Attempting to start 'ora.mdnsd' on 'lxtrac04'
CRS-2676: Start of 'ora.cssdmonitor' on 'lxtrac04' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'lxtrac04' succeeded
CRS-2676: Start of 'ora.evmd' on 'lxtrac04' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'lxtrac04'
CRS-2676: Start of 'ora.gpnpd' on 'lxtrac04' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'lxtrac04'
CRS-2676: Start of 'ora.gipcd' on 'lxtrac04' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'lxtrac04'
CRS-2672: Attempting to start 'ora.diskmon' on 'lxtrac04'
CRS-2676: Start of 'ora.diskmon' on 'lxtrac04' succeeded
CRS-2676: Start of 'ora.cssd' on 'lxtrac04' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'lxtrac04'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'lxtrac04'
CRS-2672: Attempting to start 'ora.ctssd' on 'lxtrac04'
CRS-2676: Start of 'ora.drivers.acfs' on 'lxtrac04' succeeded
CRS-2676: Start of 'ora.ctssd' on 'lxtrac04' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'lxtrac04' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'lxtrac04'
CRS-2676: Start of 'ora.asm' on 'lxtrac04' succeeded
2. Check the cluster's asm_diskstring
[root@lxtrac04 bin]# ./asmcmd dsget
parameter: AFD:*
profile:AFD:*
3. Change asm_diskstring
[root@lxtrac04 bin]# ./asmcmd dsset "/dev/sd*"
[root@lxtrac04 bin]# ./asmcmd dsget
parameter:/dev/sd*
profile:/dev/sd*
[root@lxtrac04 bin]#
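A scripted check of whether the disk string still points at AFD might look like the following. This is only a sketch: uses_afd_diskstring is a hypothetical helper, and it assumes that asmcmd dsget prints parameter: and profile: lines in the form shown above.

```shell
#!/usr/bin/env bash
# uses_afd_diskstring
# Hypothetical helper: reads "asmcmd dsget" output on stdin and succeeds (exit 0)
# if either the parameter or the profile value still starts with "AFD:".
uses_afd_diskstring() {
    grep -Eq '^(parameter|profile):[[:space:]]*AFD:'
}

# Usage (from the Grid Infrastructure bin directory):
#   if ./asmcmd dsget | uses_afd_diskstring; then
#       echo "asm_diskstring still references AFD"
#   fi
```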
4. Bind the disks with udev and restart udev ([root@lxtrac04 ~]# systemctl restart systemd-udevd.service)
(Server version:
[grid@lxtrac04 ~]$ uname -a
Linux lxtrac04 4.1.12-61.1.18.el7uek.x86_64 #2 SMP Fri Nov 4 15:48:30 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
[grid@lxtrac04 ~]$
)
[grid@lxtrac04 ~]$ cat /etc/udev/rules.d/99-oracle-asm.rules
KERNEL=="sdd[1-9]",ACTION=="add",OWNER="grid", GROUP="asmadmin", MODE="0660"
KERNEL=="sde[1-9]",ACTION=="add",OWNER="grid", GROUP="asmadmin", MODE="0660"
[grid@lxtrac04 ~]$
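Because these rules match only ACTION=="add", restarting the udev daemon alone may not re-apply them to partitions that already exist. The sketch below notes one way to replay the events and then checks the resulting ownership. check_dev_perms is a hypothetical helper, and it assumes GNU stat (stat -c).

```shell
#!/usr/bin/env bash
# One way to replay "add" events so the rules are applied to existing disks (run as root):
#   udevadm control --reload-rules
#   udevadm trigger --action=add --subsystem-match=block

# check_dev_perms OWNER GROUP MODE PATH...
# Hypothetical helper: prints a line for every path whose owner, group or octal
# mode differs from the expected values; returns 1 if anything is off.
check_dev_perms() {
    local owner="$1" group="$2" mode="$3" rc=0 actual path
    shift 3
    for path in "$@"; do
        # stat -c is GNU coreutils; %a prints the mode without a leading zero (660)
        actual=$(stat -c '%U %G %a' "$path") || { rc=1; continue; }
        if [ "$actual" != "$owner $group $mode" ]; then
            echo "MISMATCH $path: $actual (expected $owner $group $mode)"
            rc=1
        fi
    done
    return "$rc"
}

# Usage, with the values from the rules file above:
#   check_dev_perms grid asmadmin 660 /dev/sdd[1-9] /dev/sde[1-9]
```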
5. Stop the cluster
[root@lxtrac04 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'lxtrac04'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'lxtrac04'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'lxtrac04'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'lxtrac04'
CRS-2673: Attempting to stop 'ora.ctssd' on 'lxtrac04'
CRS-2673: Attempting to stop 'ora.evmd' on 'lxtrac04'
CRS-2673: Attempting to stop 'ora.asm' on 'lxtrac04'
CRS-2677: Stop of 'ora.drivers.acfs' on 'lxtrac04' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'lxtrac04' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'lxtrac04' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'lxtrac04' succeeded
CRS-2677: Stop of 'ora.evmd' on 'lxtrac04' succeeded
CRS-2677: Stop of 'ora.asm' on 'lxtrac04' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'lxtrac04'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'lxtrac04' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'lxtrac04'
CRS-2677: Stop of 'ora.cssd' on 'lxtrac04' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'lxtrac04'
CRS-2677: Stop of 'ora.gipcd' on 'lxtrac04' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'lxtrac04' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@lxtrac04 bin]#
6. Stop ACFS and AFD
# acfsload stop # stop the ACFS driver stack
# afdload stop # stop the ASMFD driver
7. Clear the AFD labels
[root@lxtrac04 bin]# ./asmcmd afd_unlabel /dev/sdd1 -f
[root@lxtrac04 bin]# ./asmcmd afd_unlabel /dev/sdd2 -f
[root@lxtrac04 bin]# ./asmcmd afd_unlabel /dev/sdd3 -f
…………
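Since afd_unlabel is repeated once per partition, the commands can be generated rather than typed. The sketch below only prints them for review and runs nothing. print_unlabel_cmds and the ASMCMD variable are illustrative names, and the device list is whatever the caller passes in.

```shell
#!/usr/bin/env bash
# print_unlabel_cmds DEVICE...
# Hypothetical helper: prints one "asmcmd afd_unlabel <device> -f" line per device.
# ASMCMD may point at the asmcmd binary; it defaults to ./asmcmd as in the steps above.
print_unlabel_cmds() {
    local asmcmd="${ASMCMD:-./asmcmd}" dev
    for dev in "$@"; do
        printf '%s afd_unlabel %s -f\n' "$asmcmd" "$dev"
    done
}

# Usage: review the output first, then run the lines by hand as root
#   print_unlabel_cmds /dev/sdd1 /dev/sdd2 /dev/sdd3
```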
8. Deconfigure ASMFD
# ./asmcmd afd_deconfigure
AFD-632:Existing AFD installation detected.
AFD-634:Removing previous AFD installation.
AFD-635:Previous AFD components successfully removed.
Modifying resource dependencies - this may take some time.
# ls -ltr /dev/oracleafd/disks/
ls: cannot access /dev/oracleafd/disks/: No such file or directory
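Besides the missing /dev/oracleafd/disks/ directory, the kernel module list can be checked as well. The sketch below looks for the Oracle modules that appeared in the "Modules linked in" line of the lockup trace (oracleafd, oracleacfs, oracleadvm, oracleoks). loaded_oracle_modules is a hypothetical helper that reads /proc/modules. After this step only oracleafd is expected to be gone. Whether the ACFS-related modules remain depends on whether the ACFS stack was started again.

```shell
#!/usr/bin/env bash
# loaded_oracle_modules [MODFILE]
# Hypothetical helper: prints, sorted, whichever of the Oracle kernel modules
# named in the lockup trace are currently loaded. MODFILE defaults to /proc/modules,
# where the first field of every line is the module name.
loaded_oracle_modules() {
    local modfile="${1:-/proc/modules}"
    awk '$1 ~ /^oracle(afd|acfs|advm|oks)$/ { print $1 }' "$modfile" | sort
}

# Usage:
#   loaded_oracle_modules | grep -qx oracleafd && echo "AFD driver is still loaded"
```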
9. Restart CRS; it started successfully
[root@lxtrac04 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@lxtrac04 bin]#
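CRS-4123 only reports that Oracle High Availability Services has been started. The rest of the stack comes up afterwards, so it is worth confirming with crsctl check crs a little later. A minimal sketch of such a check follows. crs_all_online is a hypothetical helper, and it assumes that every healthy line printed by crsctl check crs ends in "is online".

```shell
#!/usr/bin/env bash
# crs_all_online
# Hypothetical helper: reads "crsctl check crs" output on stdin and succeeds only
# if at least one non-empty line was read and every non-empty line ends in "is online".
crs_all_online() {
    awk 'NF { n++; if ($0 !~ /is online[[:space:]]*$/) bad++ }
         END { exit (n > 0 && bad == 0) ? 0 : 1 }'
}

# Usage (from the Grid Infrastructure bin directory):
#   if ./crsctl check crs | crs_all_online; then
#       echo "cluster stack is online"
#   else
#       echo "stack not fully up yet; check again or look at the alert log"
#   fi
```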