• Fixing OCFS2 failing to auto-mount with the o2net_connect_expired error


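    OCFS2 must be started before RAC can start. After modifying the configuration in /etc/sysconfig/o2cb, I found that only one of the two machines could mount the ocfs2 partition automatically at boot, while the other could not. Once boot had completed, however, mounting it manually worked fine.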


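    I. Details
    The two machines are dbsrv-1 and dbsrv-2. They use a crossover cable for the network heartbeat, and cluster.conf uses the private heartbeat IP addresses rather than the public ones.
    1. Check the o2cb status
    After boot, the o2cb service has started normally and the ocfs2 module is loaded, but the heartbeat is Not Active: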

    Quote
    Checking heartbeat: Not Active


    2. Check the /etc/fstab file

    Quote
    #cat /etc/fstab|grep ocfs2
    /dev/sdc1    /oradata   ocfs2   _netdev,datavolume,nointr 0 0


    The configuration is correct.
    3. Check the contents of /etc/ocfs2/cluster.conf on both machines

    Quote
    # more /etc/ocfs2/cluster.conf
    node:
           ip_port = 7777
           ip_address = 172.20.3.2
           number = 0
           name = dbsrv-2
           cluster = ocfs2

    node:
           ip_port = 7777
           ip_address = 172.20.3.1
           number = 1
           name = dbsrv-1
           cluster = ocfs2

    cluster:
           node_count = 2
           name = ocfs2


    I confirmed that this file is exactly the same on both machines.
    4. Check the system log
    The error messages are as follows:

    Quote
    Jul 20 19:33:18 dbsrv-2 kernel: OCFS2 1.2.3
    Jul 20 19:33:24 dbsrv-2 kernel: (4452,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_request_join:786 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_try_to_join_domain:934 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_join_domain:1186 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_register_domain:1379 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_dlm_init:2009 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_mount_volume:1062 ERROR: status = -107
    Jul 20 19:33:24 dbsrv-2 kernel: ocfs2: Unmounting device (8,33) on (node 0)
    Jul 20 19:33:26 dbsrv-2 mount: mount.ocfs2: Transport endpoint is not connected
    Jul 20 19:33:26 dbsrv-2 mount:
    Jul 20 19:33:26 dbsrv-2 netfs: Mounting other filesystems:  failed



    II. Analyzing the problem
    1. Startup order of the nodes
    A Google search turned up the following information:

    Quote
    Mount triggers the heartbeat thread which triggers the o2net
    to make a connection to all heartbeating nodes. If this connection
    fails,the mount fails. (The larger node number initiates the connection
    to the lower node number.)


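    This indicates that connections are set up according to node number when o2cb starts.
    In cluster.conf, node 0 is dbsrv-2 and node 1 is dbsrv-1. So dbsrv-1 can reach the IP it needs as soon as it boots and then mounts the ocfs2 partition, whereas dbsrv-2 cannot reach the peer's IP address in time during boot, so its mount fails.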
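To see at a glance which machine holds which node number (and so, per the quoted rule, which side is expected to initiate the o2net connection), the node stanzas of cluster.conf can be listed in numeric order. The following is only a minimal sketch: the helper name `list_nodes` is made up for illustration, and it assumes the simple `key = value` stanza layout shown in section I.

```shell
#!/bin/sh
# list_nodes FILE
# Hypothetical helper: prints "number name ip_address" for every "node:"
# stanza in an o2cb cluster.conf, sorted by node number.
list_nodes() {
    awk '
        # Print the stanza collected so far, then reset the state.
        function emit() {
            if (in_node && ("number" in kv))
                print kv["number"], kv["name"], kv["ip_address"]
            split("", kv)
            in_node = 0
        }
        /^[ \t]*node:[ \t]*$/    { emit(); in_node = 1; next }
        /^[ \t]*[a-z_]+:[ \t]*$/ { emit(); next }
        in_node && $2 == "="     { kv[$1] = $3 }
        END                      { emit() }
    ' "$1" | sort -n
}
```

Running `list_nodes /etc/ocfs2/cluster.conf` against the file shown earlier should list dbsrv-2 as node 0 and dbsrv-1 as node 1, which matches the log line on dbsrv-2 complaining about node 1.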
    2. Trying to change the HEARTBEAT_THRESHOLD parameter
    Another Google search turned up this message:

    Quote
    After confirming with Stephan, this problem appears to relate to the HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. After encountering this myself and having confirmed with a couple of other people in the list that it has caused problems, it seems that the default threshold of 7 is possibly too short, even in reasonably fast server-storage solutions such as an HP DL380 Packaged Cluster.

    Does the OCFS2 development team also consider this to be too short, or is altering the paramater just a workaround that shouldn't be used? If this is the case then how should we approach the problem of self-fencing nodes?

    Also, can we expect this behaviour with some platforms but not others, or is it too short for all platforms? If it is a blanket problem, then should the default threshold be raised?

    Finally, if the altering the threshold is a valid solution, could it please be added to the FAQs and the user guide so that people know to adjust it as a first step on encountering the problem, rather than having to post to the list and wait for replies.


    Following this and other material found online, I changed the HEARTBEAT_THRESHOLD parameter in /etc/sysconfig/o2cb to 301. After a reboot the log showed:

    Quote
    Jul 23 13:59:50 dbsrv-2 kernel: (4477,0):o2hb_check_slot:883 ERROR: Node 1 on device sdc1 has a dead count of 14000 ms, but our count is 602000 ms.
    Jul 23 13:59:50 dbsrv-2 kernel: Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
    Jul 23 13:59:54 dbsrv-2 kernel: OCFS2 1.2.3
    Jul 23 14:00:00 dbsrv-2 kernel: (4449,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
    Jul 23 14:00:00 dbsrv-2 kernel: (4475,2):dlm_request_join:786 ERROR: status = -107


    The problem remained.
    ※ Note

    Quote
    [fence time (seconds)] = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
    (301 - 1) * 2 = 600 seconds
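The arithmetic in the note can be checked with a small shell sketch. `fence_seconds` implements the quoted formula as written. `dead_count_ms` is only an inference from the log excerpt above, where 14000 ms and 602000 ms happen to equal 7 × 2000 and 301 × 2000; both function names are invented for illustration. If that inference holds, the o2hb_check_slot message suggests node 1 was still running with a threshold of 7 while dbsrv-2 used 301, so the value would need to be the same on both nodes.

```shell
#!/bin/sh
# fence_seconds THRESHOLD
# Fence time in seconds, using the formula quoted in the note:
# (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
fence_seconds() {
    echo $(( ($1 - 1) * 2 ))
}

# dead_count_ms THRESHOLD
# ASSUMPTION (inferred only from the two values in the log excerpt):
# the "dead count" printed by the kernel equals THRESHOLD * 2000 ms.
dead_count_ms() {
    echo $(( $1 * 2000 ))
}
```

For example, `fence_seconds 301` prints 600, and `dead_count_ms 301` prints 602000, the value dbsrv-2 reports as "our count".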



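    In summary, it is now clear that all of the configuration is correct.
    The cause of the failure is this:
    Before the o2cb service started, for some reason the IP address that o2cb depends on could not be reached in time. The wait exceeded its time limit, and startup failed. Once the machine had finished booting, the network was working normally, so mounting the ocfs2 partition manually succeeded.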

    III. Solving the problem
    1. Information from Oracle Metalink

    Quote
    The problem here is that network layer not becoming fully functional even  after /etc/init.d/network script is done executing. The proposed patch is a  work around and is not fixing a problem in o2cb script.


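    2. Solution

    Quote
    a) Make sure all configuration files are correct and identical on both nodes;
    b) Make sure the system clocks of the two servers do not differ too much
    (time synchronization can be used);
    c) In the cluster.conf file used by o2cb, use the heartbeat IP rather than the public IP;
    d) Modify the /etc/init.d/o2cb script and add a sleep delay at the very beginning, to wait for the network to come up;
    e) If it still does not work, put the startup commands in /etc/rc.local: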
    mount -t ocfs2 -o datavolume,nointr /dev/sdc1 /oradata
    /etc/init.d/init.crs start
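Options d) and e) both amount to waiting until the network is usable. Instead of a fixed sleep, one possible variation is to poll the peer's heartbeat IP and give up after a bounded number of tries. The sketch below is not part of the original fix; `wait_for` is a hypothetical helper, and the retry count and delay are arbitrary.

```shell
#!/bin/sh
# wait_for MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, at most MAX_TRIES
# times, sleeping DELAY_SECONDS after each failed attempt.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
wait_for() {
    wf_max=$1
    wf_delay=$2
    shift 2
    wf_try=0
    while [ "$wf_try" -lt "$wf_max" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        wf_try=$((wf_try + 1))
        sleep "$wf_delay"
    done
    return 1
}
```

On dbsrv-2, a line in /etc/rc.local could then look like `wait_for 30 2 ping -c 1 -W 1 172.20.3.1 && mount -t ocfs2 -o datavolume,nointr /dev/sdc1 /oradata`, where 172.20.3.1 is the heartbeat IP of dbsrv-1 from cluster.conf. A successful ping only shows that the link is up, not that o2net on the peer is ready, so this remains a workaround in the same spirit as the sleep.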



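    IV. Known possible causes
    1. Disk-related
    For example, when iSCSI, Firewire, or similar is used for the disk enclosure, long read times can trigger a timeout and cause the problem;
    2. Network-related
    If o2cb uses the public IP, the switch may not start passing traffic promptly after the NIC driver is loaded (Cisco switches in particular), so IP communication fails;
    If o2cb uses the heartbeat IP, some NICs are not activated immediately after the driver is loaded and cannot link up with the peer NIC in time, which causes the failure.
    Overall, the causes are mostly hardware-related.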

  • Original article: https://www.cnblogs.com/tianlesoftware/p/3610354.html