• DS4700 Controller Reboot: Root Cause Analysis


    DS4700 Controller Reboot Fault: Root Cause Analysis

     

     

    Version history:

    Version   Description      Date
    1.0       Initial draft    2017/5/10

     

     

     

     

    Note: The content of this document is drawn from official IBM documentation and may be used as a recommendation.


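    Chapter 1           Environment

    1.1 Controller firmware

    Both the controller firmware and the disk drive firmware are currently at the latest officially recommended versions.

    1.2 Interface information

    According to colleagues on site, the host channel ports of controller A and controller B are each directly attached to the hosts.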

    1.3 Host information

    host_alias  host_type                                           host_group     Controller  Logical_Drive_Name
    windown1_a  Windows 2000/Server 2003/Server 2008 Non-Clustered  windows_group  B           4,5
    windown2_a  Windows 2000/Server 2003/Server 2008 Non-Clustered  windows_group  B           4,5
    app1_hba    Solaris (with or without MPXIO)                     app_group      A           3
    app2_hba    Solaris (with or without MPXIO)                     app_group      A           1,2,6
    dbhba1      Solaris (with or without MPXIO)                     db_group       A           1,2,6
    db2_hba     Solaris (with or without MPXIO)                     db_group       A           1,2,6
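
    Because the hosts are directly attached (section 1.2), a reboot of one controller affects the hosts mapped to it. The sketch below encodes the host table as data and lists the hosts behind a given controller. The `HOSTS` list and the `hosts_on` helper are illustrative names, not part of any IBM tool.

```python
# Host mapping copied from the host information table.
# Each entry: (host_alias, host_group, controller, logical drive numbers)
HOSTS = [
    ("windown1_a", "windows_group", "B", [4, 5]),
    ("windown2_a", "windows_group", "B", [4, 5]),
    ("app1_hba", "app_group", "A", [3]),
    ("app2_hba", "app_group", "A", [1, 2, 6]),
    ("dbhba1", "db_group", "A", [1, 2, 6]),
    ("db2_hba", "db_group", "A", [1, 2, 6]),
]


def hosts_on(controller):
    """Return the host aliases mapped to the given controller ("A" or "B")."""
    return [alias for alias, _group, ctrl, _drives in HOSTS if ctrl == controller]


print(hosts_on("A"))
print(hosts_on("B"))
```

    For controller A this yields the four app and db hosts, which matches the services reported as interrupted.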

     

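    Chapter 2          Symptoms

    On 5/10 (2017-5-10), controller A rebooted, interrupting the app and db services attached to it. The affected hosts are:

    10.1.121.129    M01-HQ-SV013-DB1   Primary database
    10.1.121.130    M01-HQ-SV013-DB2   Standby database (DG database)
    10.1.121.132    M01-HQ-SV014-APP1  Application server (primary)
    10.1.121.133    M01-HQ-SV014-APP2  Application server (standby)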

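    Chapter 3          Root cause analysis

    Each host has only a single link to the storage and no multipathing is configured on the host side, so the link was lost when the controller rebooted.

    The storage event log contains the following reset events for controller A: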

    Date/Time: 15-2-3 8:38:19

    Sequence number: 4698

    Event type: 400F

    Event category: Internal

    Priority: Informational

    Description: Controller reset by its alternate

    Event specific codes: 0/0/0

    Component type: Controller

    Component location: Enclosure 85, Slot 1

     

    Date/Time: 17-5-10 7:47:44

    Sequence number: 6300

    Event type: 400F

    Event category: Internal

    Priority: Informational

    Description: Controller reset by its alternate

    Event specific codes: 0/0/0

    Component type: Controller

    Component location: Enclosure 85, Slot 1
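
    Entries in this "Key: value" layout are easy to process mechanically. The following is a minimal sketch, assuming the log has been saved as plain text in the form shown above; `parse_events` is a hypothetical helper, not a Storage Manager API, and the sample contains only some of the fields.

```python
from datetime import datetime


def parse_events(log_text):
    """Split an event log dump into one dict per event, ignoring blank lines."""
    events, current = [], {}
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        # A "Date/Time" field marks the start of a new event record.
        if key == "Date/Time" and current:
            events.append(current)
            current = {}
        current[key] = value
    if current:
        events.append(current)
    return events


SAMPLE = """
Date/Time: 15-2-3 8:38:19
Sequence number: 4698
Event type: 400F
Description: Controller reset by its alternate

Date/Time: 17-5-10 7:47:44
Sequence number: 6300
Event type: 400F
Description: Controller reset by its alternate
"""

# Keep only the "Controller reset by its alternate" events (event type 400F).
resets = [e for e in parse_events(SAMPLE) if e["Event type"] == "400F"]
reset_times = [datetime.strptime(e["Date/Time"], "%y-%m-%d %H:%M:%S") for e in resets]
for when in reset_times:
    print(when.isoformat())
```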

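    From 2015-2-3 to 2017-5-10 the interval is 827 days.

    Next, look at the dates of the two most recent "Start-of-day routine begun" events on controller A:

    Date/Time: 15-2-10 17:25:12

    Date/Time: 17-5-10 7:47:04

    From 2015-2-10 to 2017-5-10 the interval is 820 days.

    From this it can be concluded that the reboot was caused by the storage subsystem's 820/825-day timer issue.

    The storage checks each controller's uptime on an 820/825-day cycle (820 days for controller A, 825 days for controller B).

    Controller A last ran this check (its previous start-of-day) on 2015/2/10, exactly 820 days before 2017/5/10.

    Controller A's previous reset was on 2015/2/3, exactly 827 days before 2017/5/10, so controller A rebooted.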
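
    The two intervals can be checked with a few lines of date arithmetic. This is only a verification sketch; the variable names are illustrative and the dates are the ones from the controller A log.

```python
from datetime import date

# Dates taken from the controller A event log.
previous_reset = date(2015, 2, 3)          # earlier "Controller reset by its alternate"
previous_start_of_day = date(2015, 2, 10)  # earlier "Start-of-day routine begun"
incident = date(2017, 5, 10)               # reset under investigation

days_since_reset = (incident - previous_reset).days
days_since_start_of_day = (incident - previous_start_of_day).days

print(days_since_reset)         # 827
print(days_since_start_of_day)  # 820
```

    The 820-day figure measured from the previous start-of-day matches the controller A threshold given in the IBM technote in the appendix.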

     

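    In addition, the historical logs show that controller B has rebooted 4 times in 6 years, and that 2 host-side FC ports on controller B are not connected to any host. If SFP modules are installed in those ports, we recommend fitting plugs or removing the SFP modules.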

     

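    Chapter 4          Recommendations

    1.     Redesign the storage links to provide highly available, redundant connections.

    2.     Given the 820/825-day design of the DS4K series, perform a preventive reboot before the timer expires.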
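
    To plan the preventive reboot in recommendation 2, the expected forced-reboot date can be projected from the last start-of-day date. The sketch below uses the 820/825-day thresholds from the IBM technote; the function names and the 30-day safety margin are hypothetical choices, not IBM guidance.

```python
from datetime import date, timedelta

# Forced-reboot thresholds in days, per IBM RETAIN tip H193288.
FORCED_REBOOT_DAYS = {"A": 820, "B": 825}


def forced_reboot_date(controller, last_boot):
    """Date on which the controller is expected to reboot itself."""
    return last_boot + timedelta(days=FORCED_REBOOT_DAYS[controller])


def maintenance_deadline(controller, last_boot, margin_days=30):
    """Latest date for a planned reboot; margin_days is an assumed safety margin."""
    return forced_reboot_date(controller, last_boot) - timedelta(days=margin_days)


# Controller A last started on 2017-5-10 (the incident described in this report).
last_boot_a = date(2017, 5, 10)
print(forced_reboot_date("A", last_boot_a))
print(maintenance_deadline("A", last_boot_a))
```

    Applied to the previous start-of-day date of 2015-2-10, the same function returns 2017-5-10, the date of the incident.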

     

    Chapter 5          Appendix

    Notes on the DS4K 820/825-day behavior

    H193288: DS3000/DS4000/DS5000 controller will reboot every 820 or 825 days

    5.1 Technote (troubleshooting)



    5.2 Problem (Abstract)

    RETAIN tip: H193288

    5.3 Symptom

    The IBM System Storage DS3000, DS4000, and DS5000 families of storage subsystem controllers will reboot every 820 days for controller A or 825 days for controller B, if the controller firmware is not upgraded or already rebooted within that time period.

    Affected configurations

    The system may be any of the following IBM servers:

    ·       DS4100 (FAStT100) Dual-Controller Storage Server, type 1724, any model

    ·       DS4100 (FAStT100) Single-Controller Storage Server, type 1724, any model

    ·       DS4200 Storage Server, type 1814, any model

    ·       DS4300 (FAStT600) Dual Controller and Turbo Storage Server, type 1722, any model

    ·       DS4300 (FAStT600) Single Controller Storage Server, type 1722, any model

    ·       DS4400 (FAStT700) Storage Server, type 1742, any model

    ·       DS4500 (FAStT900) Storage Server, type 1742, any model

    ·       DS4700 Storage Server, type 1814, any model

    ·       DS4700 Storage Server, type 1814 (DC power supplies), any model

    ·       DS4800 Storage Server, type 1815, any model

    ·       DS5020 Disk Controller (1814-20A), any model

    ·       DS5100 Storage Controller, type 1818, any model

    ·       DS5300 Storage Controller, type 1818, any model

    ·       FAStT 200 Storage Server, type 3542, any model

    ·       FAStT500 RAID Controller, type 3552, any model

    ·       FAStT500, type 3552, any model

    ·       IBM System Storage DS3200, type 1726, any model

    ·       IBM System Storage DS3300, type 1726, any model

    ·       IBM System Storage DS3400, type 1726, any model

    ·       IBM System Storage DS3512, type 1746, any model

    ·       IBM System Storage DS3524, type 1746, any model

    ·       IBM System Storage DS3950 Express, type 1814, any model


    The system is configured with one or more of the following IBM Options:

    ·       BladeCenter Boot Disk System (1726-22B), any model


    This tip is not software specific.

    Solution

    For the DS3500, DCS3700, and DCS3860, this issue is fixed in the 8.2x release. For all other products, this is a permanent restriction and there will be no solution.

    Workaround

    Whenever a controller is rebooted, the firmware will reset the timer mechanism, giving the controllers another 828.5 days on the timer. The next reboots will occur at 820 days for controller A or 825 days for controller B.

    The way to avoid these unexpected reboots is with a controller firmware upgrade, since the process of upgrading controller firmware will reboot the controllers, thereby, resetting the timer mechanism. This also allows for the reboots to be scheduled at a convenient time for the customer's environment.

    Upgrading firmware to the levels below is also recommended to reduce the possibility of the controller reboots happening at same time.
    DS3000 - 07.35.41.00 or higher
    DS4000 - 07.15.07.00 or higher
    DS5000 - 07.30.21.00 or higher

    IBM's best recommended practice is to maintain the environment with regular firmware upgrades, at least once per year, to leverage the enhancements implemented in firmware and provide the best possible quality, performance, and availability of the system.

    If these recommended best practices are followed, then the reboot behavior will not be observed.

    Regularly scheduled maintenance of controller firmware will reset the timer, since this process reboots the controller. A reboot, for any other reason, will also cause the timer to be reset.

    Additional information

    The current design of the DS3000, DS4000, and DS5000 controller operating system contains a separate timer for each controller. Each timer rolls over after 828.5 days. In order to keep the timer from rolling over, the controller is designed to reboot after 825.5 days to reset the timer. These timers are independent of each other, however, there is a possibility that the controllers could reboot at the same time. Firmware levels 07.35.41.00, 07.15.07.00, and 07.30.21.00 were changed to stagger the controller reboots - controller A will reboot at 820 days and controller B will reboot at 825 days. This eliminates the simultaneous controller reboot condition, and allows the two redundant controllers to protect each other using the normal failover/failback operations.
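
    The technote does not say what kind of timer rolls over at 828.5 days. As a purely illustrative assumption, an unsigned 32-bit tick counter incremented 60 times per second would wrap after almost exactly that long. The counter width and tick rate below are hypothetical and are not taken from IBM documentation; only the 820-day and 825-day thresholds come from the technote.

```python
TICKS_PER_SECOND = 60      # assumed tick rate (hypothetical)
COUNTER_BITS = 32          # assumed counter width (hypothetical)
SECONDS_PER_DAY = 86_400

# Days until a counter of this width wraps at this tick rate.
rollover_days = 2 ** COUNTER_BITS / TICKS_PER_SECOND / SECONDS_PER_DAY
print(round(rollover_days, 1))  # 828.5

# Headroom between the staggered reboot thresholds and the assumed rollover.
REBOOT_DAYS = {"A": 820, "B": 825}
margins = {ctrl: round(rollover_days - days, 1) for ctrl, days in REBOOT_DAYS.items()}
print(margins)
```

    Whatever the real mechanism is, both staggered thresholds fall before the 828.5-day rollover and are 5 days apart, which is what lets one controller cover for the other during its reboot.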

    A properly maintained DS3000, DS4000, and DS5000 system includes periodic firmware upgrades. These firmware upgrades should never allow the controllers to get to the point where the timer rolls over.

    IBM highly recommends to periodically upgrade controller firmware. Firmware upgrades should be part of a yearly Change Management plan.

     

     

    Cross reference information

    Segment               Product                       Component  Platform  Version  Edition
    Disk Storage Systems  DS3950
    Disk Storage Systems  DS4200
    Disk Storage Systems  DS4700
    Disk Storage Systems  DS4800
    Disk Storage Systems  DS5020
    Disk Storage Systems  DS5100
    Disk Storage Systems  DS3200
    Disk Storage Systems  DS3300
    Disk Storage Systems  DS3400
    Disk Storage Systems  BladeCenter Boot Disk System
    Disk Storage Systems  DS3500 (DS3512 - DS3524)
    Disk Storage Systems  DS4100
    Disk Storage Systems  DS4300
    Disk Storage Systems  DS4400
    Disk Storage Systems  DS4500
    Disk Storage Systems  FAStT500 Storage Server
    Disk Storage Systems  DCS3700
    Disk Storage Systems  System Storage DCS3860

     

     

  • Original article: https://www.cnblogs.com/jonathanyue/p/9301155.html