• Using storcli64 and smartctl to locate disk fault information


    How to map a disk's physical slot to its OS device name

    From Lin.Wang

    Section One : Introduction

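    storcli is the successor to megacli. On Dell servers the equivalent tool is perccli, and its usage is identical.

    smartctl reads the SMART information kept by a disk's controller chip.

    lsscsi lists the SCSI devices known to the system (the same kind of information exposed under /proc/scsi/scsi). It is not covered in detail in this document.

    All of these are common tools for inspecting disks, and they help when troubleshooting disk state and RAID card problems.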

    Section Two : Install package

    Install storcli or perccli, then symlink the binary into /usr/bin/ so the command is easy to run:

    ln -s /opt/MegaRAID/storcli/storcli64 /usr/bin/

    ln -s /opt/MegaRAID/perccli/perccli64 /usr/bin/
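
    Scripts that wrap these tools need to know which of the two binaries is actually installed. The following is a minimal sketch of that check, assuming the binaries are reachable on PATH under the names perccli64 and storcli64; the function name find_raid_cli is hypothetical.

```python
import shutil

# Binary names as used in the symlink commands above.
CANDIDATES = ("perccli64", "storcli64")


def find_raid_cli(candidates=CANDIDATES, which=shutil.which):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None
```

    On a host where neither symlink exists, find_raid_cli() returns None, which a wrapper script can treat as "tool not installed".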

    Section Three : Step

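    To find the physical slot that corresponds to the OS device /dev/sdf, proceed as follows:

    1. Run perccli64 /c0/eall/sall show to list the physical drives:

      img-/c0/eall/sall

      The figure shows four JBOD disks. From experience, JBOD disks are generally assigned device names before RAID virtual drives, so the JBOD disks run from /dev/sda to /dev/sdd and the RAID virtual drives start at /dev/sde.

      DG stands for drive group and reflects the order in which groups were created when the RAID was configured. The figure shows that 32:4 and 32:5 belong to the same drive group.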

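    2. Run perccli64 /c0/vall show to see how each DG maps to a VD:

    img-/c0/vall

    The DG/VD column pairs each RAID drive group with the virtual drive the OS sees. If the server has only RAID virtual drives, VD0 is normally /dev/sda, VD1 is /dev/sdb, and so on. If the server also has JBOD disks, the RAID virtual drives are ordered after them. In this example VD0 = /dev/sde, so /dev/sdf is VD 1, which corresponds to DG 1.

    Back in img-/c0/eall/sall, the drive with DG = 1 has DID = 6. DID is the device ID, which is needed again later. Its slot number (Slt) is also 6, which is the seventh bay on the server (bays are numbered from 0). This gives the physical slot of /dev/sdf.

    Conversely, starting from a drive whose fault LED is lit on the server, the same mapping can be followed backwards to find the OS device name.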
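
    The lookup described in these two steps can be sketched in a few lines. This is only an illustration under the same assumption the text makes (JBOD disks are named first, then VDs in order); the function names and the two small tables are hypothetical and hold values copied from the example output.

```python
import string

# Illustrative data copied from the example output: VD -> DG and DG -> EID:Slt.
VD_TO_DG = {0: 0, 1: 1}
DG_TO_SLOTS = {0: ["32:4", "32:5"], 1: ["32:6"]}


def device_index(dev):
    """Return the 0-based position of an sdX name (sda -> 0, sdf -> 5)."""
    name = dev.rsplit("/", 1)[-1]
    letters = name[2:]
    if not name.startswith("sd") or not letters or any(
        ch not in string.ascii_lowercase for ch in letters
    ):
        raise ValueError(f"not an sdX device name: {dev!r}")
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("a") + 1)
    return index - 1


def device_to_vd(dev, jbod_count):
    """Return the VD number for dev, or None if dev falls in the JBOD range."""
    index = device_index(dev)
    if index < jbod_count:
        return None
    return index - jbod_count


def slots_for_device(dev, jbod_count):
    """Return the EID:Slt entries behind dev, using the tables above."""
    vd = device_to_vd(dev, jbod_count)
    if vd is None:
        return []
    return DG_TO_SLOTS[VD_TO_DG[vd]]
```

    With four JBOD disks, slots_for_device("/dev/sdf", 4) resolves to VD 1, DG 1 and the drive at 32:6. The result is only as reliable as the naming-order assumption, so the locate LED remains the final check.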

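    Note:

    If the server has no JBOD disks and everything is RAID, the mapping shown by /c0/vall is enough to establish the relationship.

    In practice you can also run perccli64 /c0/e32/s6 start locate and perccli64 /c0/e32/s6 stop locate to switch the drive LED on and off and confirm that the mapping is correct.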

    Section Four : storcli/perccli Usage

    Show controller information

    perccli64 show ctrlcount shows how many controllers (RAID cards) are present

    perccli64 show displays RAID card information

    [root@node-15 ~]# perccli64 show
    Status Code = 0
    Status = Success
    Description = None
    
    Number of Controllers = 1
    Host Name = node-15.domain.tld
    Operating System  = Linux3.10.0-327.20.1.es2.el7.x86_64
    
    System Overview :
    ===============
    
    ------------------------------------------------------------------------
    Ctl Model        Ports PDs DGs DNOpt VDs VNOpt BBU sPR DS EHS ASOs Hlth 
    ------------------------------------------------------------------------
      0 PERCH730Mini     8  16  11     0  11     0 Opt On  3  N      0 Opt  
    ------------------------------------------------------------------------
    
    Ctl=Controller Index|DGs=Drive groups|VDs=Virtual drives|Fld=Failed
    PDs=Physical drives|DNOpt=DG NotOptimal|VNOpt=VD NotOptimal|Opt=Optimal
    Msng=Missing|Dgd=Degraded|NdAtn=Need Attention|Unkwn=Unknown
    sPR=Scheduled Patrol Read|DS=DimmerSwitch|EHS=Emergency Hot Spare
    Y=Yes|N=No|ASOs=Advanced Software Options|BBU=Battery backup unit
    Hlth=Health|Safe=Safe-mode boot
    

    The output shows a single RAID card: controller 0, addressed as /c0.

    storcli64 /c0 show

    [root@node-15 ~]# perccli64 /c0 show
    Generating detailed summary of the adapter, it may take a while to complete.
    
    Controller = 0
    Status = Success
    Description = None
    
    Product Name = PERC H730 Mini
    Serial Number = 663021Z
    SAS Address =  51866da066153000
    PCI Address = 00:03:00:00
    System Time = 01/10/2019 20:48:38
    Mfg. Date = 06/17/16
    Controller Time = 01/10/2019 12:44:21
    FW Package Build = 25.4.0.0017
    BIOS Version = 6.29.00.0_4.16.07.00_0x06120100
    FW Version = 4.260.00-6259
    Driver Name = megaraid_sas
    Driver Version = 06.807.10.00-rh1
    Current Personality = RAID-Mode
    Vendor Id = 0x1000
    Device Id = 0x5D
    SubVendor Id = 0x1028
    SubDevice Id = 0x1F49
    Host Interface = PCI-E
    Device Interface = SAS-12G
    Bus Number = 3
    Device Number = 0
    Function Number = 0
    Drive Groups = 11
    
    TOPOLOGY :
    ========
    
    ---------------------------------------------------------------------------
    DG Arr Row EID:Slot DID Type  State BT     Size PDC  PI SED DS3  FSpace TR 
    ---------------------------------------------------------------------------
     0 -   -   -        -   RAID1 Optl  N  931.0 GB dflt N  N   dflt N      N  
     0 0   -   -        -   RAID1 Optl  N  931.0 GB dflt N  N   dflt N      N  
     0 0   0   32:4     4   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     0 0   1   32:5     5   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     1 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     1 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     1 0   0   32:6     6   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     2 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     2 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     2 0   0   32:7     7   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     3 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     3 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     3 0   0   32:8     8   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     4 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     4 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     4 0   0   32:9     9   DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     5 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     5 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     5 0   0   32:10    10  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     6 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     6 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     6 0   0   32:11    11  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     7 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     7 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     7 0   0   32:12    12  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     8 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     8 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     8 0   0   32:13    13  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
     9 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     9 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
     9 0   0   32:14    14  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
    10 -   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
    10 0   -   -        -   RAID0 Optl  N  931.0 GB dflt N  N   dflt N      N  
    10 0   0   32:15    15  DRIVE Onln  N  931.0 GB dflt N  N   dflt -      N  
    ---------------------------------------------------------------------------
    
    DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
    DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Dgrd=Degraded
    Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
    PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
    DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
    TR=Transport Ready
    
    Virtual Drives = 11
    
    VD LIST :
    =======
    
    -------------------------------------------------------------
    DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name 
    -------------------------------------------------------------
    0/0   RAID1 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    1/1   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    2/2   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    3/3   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    4/4   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    5/5   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    6/6   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    7/7   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    8/8   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    9/9   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    10/10 RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    -------------------------------------------------------------
    
    Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
    Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
    Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
    FWB=Force WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
    Check Consistency
    
    Physical Drives = 16
    
    PD LIST :
    =======
    
    ----------------------------------------------------------------------------
    EID:Slt DID State DG      Size Intf Med SED PI SeSz Model                Sp 
    ----------------------------------------------------------------------------
    32:0      0 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U  
    32:1      1 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U  
    32:2      2 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U  
    32:3      3 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U  
    32:4      4 Onln  0   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:5      5 Onln  0   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:6      6 Onln  1   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:7      7 Onln  2   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:8      8 Onln  3   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:9      9 Onln  4   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:10    10 Onln  5   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:11    11 Onln  6   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:12    12 Onln  7   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:13    13 Onln  8   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:14    14 Onln  9   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:15    15 Onln  10  931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    ----------------------------------------------------------------------------
    
    EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
    DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
    UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
    Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
    SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
    UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
    CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
    
    
    BBU_Info :
    ========
    
    ----------------------------------------------
    Model State   RetentionTime Temp Mode MfgDate 
    ----------------------------------------------
    BBU   Optimal 0 hour(s)     38C  -    0/00/00 
    ----------------------------------------------
    
    Show each disk's Device ID, Slot No. and DriveGroup
    [root@node-15 ~]# perccli64 /c0/eall/sall show
    Controller = 0
    Status = Success
    Description = Show Drive Information Succeeded.
    
    
    Drive Information :
    =================
    
    ----------------------------------------------------------------------------
    EID:Slt DID State DG      Size Intf Med SED PI SeSz Model                Sp 
    ----------------------------------------------------------------------------
    32:0      0 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U  
    32:1      1 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U  
    32:2      2 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U  
    32:3      3 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U  
    32:4      4 Onln  0   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:5      5 Onln  0   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:6      6 Onln  1   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:7      7 Onln  2   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:8      8 Onln  3   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:9      9 Onln  4   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:10    10 Onln  5   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:11    11 Onln  6   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:12    12 Onln  7   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:13    13 Onln  8   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:14    14 Onln  9   931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    32:15    15 Onln  10  931.0 GB SATA HDD N   N  512B ST91000640NS         U  
    ----------------------------------------------------------------------------
    
    EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
    DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
    UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
    Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
    SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
    UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
    CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
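
    When many drives are involved it can help to turn this table into records rather than read it by eye. The sketch below assumes the column layout shown in the output above (eleven fixed columns, then the model, then Sp) and uses a few rows copied from it; parse_pd_rows is a hypothetical helper, not part of perccli.

```python
import re

# A few rows copied from the drive table above.
SAMPLE = """\
EID:Slt DID State DG      Size Intf Med SED PI SeSz Model                Sp
32:3      3 JBOD  -  185.75 GB SATA SSD N   N  512B INTEL SSDSC2BX200G4R U
32:5      5 Onln  0   931.0 GB SATA HDD N   N  512B ST91000640NS         U
32:6      6 Onln  1   931.0 GB SATA HDD N   N  512B ST91000640NS         U
"""

# Data rows start with "<EID>:<Slt>"; headers and legend lines do not.
ROW = re.compile(r"^\d+:\d+\s")


def parse_pd_rows(text):
    """Parse drive rows into dicts; the model may contain spaces."""
    drives = []
    for line in text.splitlines():
        line = line.strip()
        if not ROW.match(line):
            continue
        tok = line.split()
        if len(tok) < 13:
            continue  # not the layout this sketch expects
        eid, slot = tok[0].split(":")
        drives.append({
            "eid": int(eid),
            "slot": int(slot),
            "did": int(tok[1]),
            "state": tok[2],
            "dg": None if tok[3] == "-" else int(tok[3]),
            "size": f"{tok[4]} {tok[5]}",
            "model": " ".join(tok[11:-1]),
        })
    return drives
```

    Counting the records whose state is JBOD gives the offset used when mapping device names to VDs, and filtering on dg gives the slots behind a drive group.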
    

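    Note:

    From experience, under the default udev rules on CentOS, JBOD disks are named before RAID virtual drives (if the configuration is changed online, the JBOD disks move to the front after a reboot). lsscsi shows that, under the same RAID controller, JBOD disks have a lower channel value than RAID virtual drives. In the output below, the second number in the first field is 0 for JBOD and 2 for RAID.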

    [root@SZVPN-2 udev]# lsscsi
    [0:0:24:0]   disk    IBM-ESXS MBF2600RC        SB2C  /dev/sda 
    [0:2:0:0]    disk    IBM      ServeRAID M5110e 3.19  /dev/sdb 
    [0:2:1:0]    disk    IBM      ServeRAID M5110e 3.19  /dev/sdc 
    

    In addition, the scsi_level that udev sees for a JBOD disk is higher than for a RAID virtual drive.

    udevadm info -a -p /sys/class/block/sdx | grep scsi_level

    In my tests scsi_level was 7 for JBOD and 6 for RAID.

    The corresponding udev rules file is /lib/udev/rules.d/60-persistent-storage.rules

    scsi_level: ATTRS{scsi_level}=="[6-9]*"
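
    The channel observation above can be applied mechanically to lsscsi output. The sketch below assumes the channel values 0 (JBOD) and 2 (RAID) seen in this one example; they are the author's observation on a single controller rather than a general rule, and classify is a hypothetical helper.

```python
import re

# lsscsi output copied from the example above.
SAMPLE = """\
[0:0:24:0]   disk    IBM-ESXS MBF2600RC        SB2C  /dev/sda
[0:2:0:0]    disk    IBM      ServeRAID M5110e 3.19  /dev/sdb
[0:2:1:0]    disk    IBM      ServeRAID M5110e 3.19  /dev/sdc
"""

# [host:channel:target:lun], device type, free-form vendor/model, device node.
LINE = re.compile(r"^\[(\d+):(\d+):(\d+):(\d+)\]\s+\S+\s+.*?(/dev/\S+)$")

# Channel numbers observed in the example; treat them as an assumption.
CHANNEL_KIND = {0: "jbod", 2: "raid"}


def classify(text):
    """Map each device node to 'jbod', 'raid' or 'unknown' by channel."""
    result = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            channel = int(match.group(2))
            result[match.group(5)] = CHANNEL_KIND.get(channel, "unknown")
    return result
```

    For the sample, /dev/sda is classified as JBOD and /dev/sdb and /dev/sdc as RAID. Any other channel value is reported as unknown rather than guessed.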

    Show information for a specific drive
    [root@node-15 ~]# perccli64 /c0/e32/s6 show all
    Controller = 0
    Status = Success
    Description = Show Drive Information Succeeded.
    
    
    Drive /c0/e32/s6 :
    ================
    
    -------------------------------------------------------------------
    EID:Slt DID State DG     Size Intf Med SED PI SeSz Model        Sp 
    -------------------------------------------------------------------
    32:6      6 Onln   1 931.0 GB SATA HDD N   N  512B ST91000640NS U  
    -------------------------------------------------------------------
    
    EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
    DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
    UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
    Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
    SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
    UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
    CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
    
    
    Drive /c0/e32/s6 - Detailed Information :
    =======================================
    
    Drive /c0/e32/s6 State :
    ======================
    Shield Counter = 0
    Media Error Count = 46431                *** clearly a problem: 46431 media errors ***
    Other Error Count = 0
    Drive Temperature =  31C (87.80 F)	
    Predictive Failure Count = 126            *** 126 predicted failures ***
    S.M.A.R.T alert flagged by drive = Yes
    
    
    Drive /c0/e32/s6 Device attributes :
    ==================================
    SN = 9XGA228L
    Manufacturer Id = ATA     
    Model Number = ST91000640NS
    NAND Vendor = NA
    WWN = 5000c500918f2f8a
    Firmware Revision =     AA63
    Raw size = 931.512 GB [0x74706db0 Sectors]
    Coerced size = 931.0 GB [0x74600000 Sectors]
    Non Coerced size = 931.012 GB [0x74606db0 Sectors]
    Device Speed = 6.0Gb/s
    Link Speed = 12.0Gb/s
    NCQ setting = N/A
    Write Cache = Enabled
    Logical Sector Size = 512B
    Physical Sector Size = 512B
    Connector Name = 00 
    
    
    Drive /c0/e32/s6 Policies/Settings :
    ==================================
    Drive position = DriveGroup:1, Span:0, Row:0
    Enclosure position = 0
    Connected Port Number = 0(path0) 
    Sequence Number = 2
    Commissioned Spare = No
    Emergency Spare = No
    Last Predictive Failure Event Sequence Number = 95183    *** sequence number of the last predictive failure event ***
    Successful diagnostics completion on = N/A
    SED Capable = No
    SED Enabled = No
    Secured = No
    Cryptographic Erase Capable = No
    Locked = No
    Needs EKM Attention = No
    PI Eligible = No
    Certified = Yes
    Wide Port Capable = No
    
    Port Information :
    ================
    
    -----------------------------------------
    Port Status Linkspeed SAS address        
    -----------------------------------------
       0 Active 12.0Gb/s  0x500056b33fefe586 
    -----------------------------------------
    
    
    Inquiry Data = 
    5a 0c ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00 
    00 00 00 00 20 20 20 20 20 20 20 20 20 20 20 20 
    58 39 41 47 32 32 4c 38 00 00 00 00 04 00 20 20 
    20 20 41 41 33 36 54 53 31 39 30 30 36 30 30 34 
    53 4e 20 20 20 20 20 20 20 20 20 20 20 20 20 20 
    20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 80 
    00 40 00 2f 00 40 00 02 00 02 07 00 ff 3f 10 00 
    3f 00 10 fc fb 00 10 00 ff ff ff 0f 00 00 07 00 
    

    Note:

    The per-drive details show media errors, which indicates that the disk has a problem.
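
    The counters that gave this drive away can be pulled out of the show all output automatically. The sketch below assumes the "Name = value" layout shown above; the rule that any non-zero media error or predictive failure count is suspicious is an assumption made for illustration, not a vendor threshold.

```python
import re

# Lines copied from the drive state section above, without the annotations.
SAMPLE = """\
Shield Counter = 0
Media Error Count = 46431
Other Error Count = 0
Drive Temperature =  31C (87.80 F)
Predictive Failure Count = 126
S.M.A.R.T alert flagged by drive = Yes
"""

COUNTER = re.compile(
    r"^\s*(Media Error Count|Other Error Count|Predictive Failure Count)"
    r"\s*=\s*(\d+)"
)


def parse_counters(text):
    """Return the error counters found in the text as a dict of ints."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def looks_faulty(counters):
    """Assumed rule: any media error or predicted failure is a red flag."""
    return (
        counters.get("Media Error Count", 0) > 0
        or counters.get("Predictive Failure Count", 0) > 0
    )
```

    Running parse_counters over the output of every slot and filtering with looks_faulty gives a quick shortlist of drives to inspect by hand.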

    Show the mapping between RAID virtual drives and OS devices
    [root@node-15 ~]# perccli64 /c0/vall show
    Controller = 0
    Status = Success
    Description = None
    
    
    Virtual Drives :
    ==============
    
    -------------------------------------------------------------
    DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name 
    -------------------------------------------------------------
    0/0   RAID1 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    1/1   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    2/2   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    3/3   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    4/4   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    5/5   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    6/6   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    7/7   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    8/8   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    9/9   RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    10/10 RAID0 Optl  RW     Yes     RWBD  -   OFF 931.0 GB      
    -------------------------------------------------------------
    
    Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
    Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
    Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
    FWB=Force WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
    Check Consistency
    

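    Note:

    VD: generally taken as the order of the device in the OS. If there are only RAID virtual drives, VD=0 is /dev/sda, VD=1 is /dev/sdb, and so on. If there are JBOD disks, they come first: for example, if the JBOD disks run up to /dev/sdc, VD0 is /dev/sdd, and so on.
    DG: the order in which drive groups were configured on the RAID card.

    RAID card log collection commands

    storcli64 /c0 show time shows the RAID controller's time

    storcli64 /c0 show alilog logfile=node-x.alilog collects the alilog, which contains all the logs

    storcli64 /c0 show all logfile=node-x.all.log collects the full RAID card information

    storcli64 /c0 show badblocks shows bad block information

    perccli64 /c0 show events filter=fatal shows events at the fatal level, which helps to spot disk or RAID card failures

    perccli64 /c0 show cc shows the consistency check settings. RAID levels that keep redundant data across several disks (RAID 1 and above) need consistency checks, whereas a single-disk RAID 0 probably does not. Whether the check affects performance is unclear.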
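
    When logs have to be gathered from several nodes, the commands above can be generated per node instead of typed by hand. This is a sketch only: it builds the argument lists exactly as listed above and leaves running them to the caller; log_commands and run_all are hypothetical names, and the node name is used for the log file names in the same way as node-x above.

```python
import subprocess


def log_commands(node, cli="storcli64", ctrl=0):
    """Build the collection commands listed above for one controller."""
    target = f"/c{ctrl}"
    return [
        [cli, target, "show", "time"],
        [cli, target, "show", "alilog", f"logfile={node}.alilog"],
        [cli, target, "show", "all", f"logfile={node}.all.log"],
        [cli, target, "show", "badblocks"],
        [cli, target, "show", "events", "filter=fatal"],
        [cli, target, "show", "cc"],
    ]


def run_all(commands):
    """Run each command in turn and return the exit codes (needs the CLI installed)."""
    return [subprocess.run(cmd, check=False).returncode for cmd in commands]
```

    For example, log_commands("node-15", cli="perccli64") yields the same six commands for a Dell host. Keeping the commands as lists avoids shell quoting problems when node names contain unusual characters.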

    Section Five : Smartctl Get Error info of Disks

    Common Commands Usage Description

    --scan Scan for devices

    --scan-open Scan for devices and try to open each device

    -x, --xall Show all information for device

    -a, --all Show all SMART information for device

    -i, --info Show identity information for device

    -d TYPE, --device=TYPE Specify device type to one of: ata, scsi, nvme[,NSID], sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,p][,x][,N], usbprolific, usbsunplus, marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, aacraid,H,L,ID, cciss,N, auto, test

    -s VALUE, --smart=VALUE Enable/disable SMART on device (on/off)

    -o VALUE, --offlineauto=VALUE(ATA) Enable/disable automatic offline testing on device (on/off)

    -S VALUE, --saveauto=VALUE(ATA) Enable/disable Attribute autosave on device (on/off)

    -H, --health Show device SMART health status

    -c, --capabilities(ATA,NVMe) Show device SMART capabilities

    -A, --attributes Show device SMART vendor-specific Attributes and values

    -l TYPE, --log=TYPE Show device log. TYPE: error, selftest, selective, directory[,g|s],
      xerror[,N][,error], xselftest[,N][,selftest],
      background, sasphy[,reset], sataphy[,reset],
      scttemp[sts,hist], scttempint,N[,p],
      scterc[,N,M], devstat[,N], ssd,
      gplog,N[,RANGE], smartlog,N[,RANGE],
      nvmelog,N,SIZE

    -t TEST, --test=TEST Run test. TEST: offline, short, long, conveyance, force, vendor,N,
      select,M-N, pending,N, afterselect,[on|off]

    -X, --abort Abort any non-captive test on device

    Get info for /dev/sdf

    List all devices
    [root@node-15 ~]# smartctl --scan
    /dev/sda -d scsi # /dev/sda, SCSI device
    /dev/sdb -d scsi # /dev/sdb, SCSI device
    /dev/sdc -d scsi # /dev/sdc, SCSI device
    /dev/sdd -d scsi # /dev/sdd, SCSI device
    /dev/sde -d scsi # /dev/sde, SCSI device
    /dev/sdf -d scsi # /dev/sdf, SCSI device
    /dev/sdg -d scsi # /dev/sdg, SCSI device
    /dev/sdh -d scsi # /dev/sdh, SCSI device
    /dev/sdi -d scsi # /dev/sdi, SCSI device
    /dev/sdj -d scsi # /dev/sdj, SCSI device
    /dev/sdk -d scsi # /dev/sdk, SCSI device
    /dev/sdl -d scsi # /dev/sdl, SCSI device
    /dev/sdm -d scsi # /dev/sdm, SCSI device
    /dev/sdn -d scsi # /dev/sdn, SCSI device
    /dev/sdo -d scsi # /dev/sdo, SCSI device
    /dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
    /dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
    /dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
    /dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
    /dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
    /dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
    /dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
    /dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
    /dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
    /dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
    /dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
    /dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
    /dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
    /dev/bus/0 -d megaraid,13 # /dev/bus/0 [megaraid_disk_13], SCSI device
    /dev/bus/0 -d megaraid,14 # /dev/bus/0 [megaraid_disk_14], SCSI device
    /dev/bus/0 -d megaraid,15 # /dev/bus/0 [megaraid_disk_15], SCSI device
    

    Note:

    The earlier sections showed that /dev/sdf has DID (device ID) 6 in perccli, which corresponds to /dev/bus/0 -d megaraid,6
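
    Going from a DID to the matching smartctl arguments can be done by reading the scan output. The sketch below assumes the "device -d type # comment" layout shown above; parse_scan and smartctl_argv are hypothetical helpers. It uses the device node reported by the scan (/dev/bus/0), whereas the commands in this post pass /dev/sdf together with the same -d megaraid,6 option.

```python
# A few lines copied from the smartctl --scan output above.
SAMPLE = """\
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
"""


def parse_scan(text):
    """Return (device, type) pairs from smartctl --scan style lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) == 3 and fields[1] == "-d":
            entries.append((fields[0], fields[2]))
    return entries


def smartctl_argv(entries, did, option="-i"):
    """Build a smartctl command line for the drive with the given DID."""
    wanted = f"megaraid,{did}"
    for device, dev_type in entries:
        if dev_type == wanted:
            return ["smartctl", option, "-d", dev_type, device]
    return None
```

    smartctl_argv returns None when the DID does not appear in the scan, which is a useful early warning that the slot mapping from the previous section is wrong.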

    Show disk information
    [root@node-15 ~]# smartctl -i -d megaraid,6 /dev/sdf
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Constellation.2 (SATA)
    Device Model:     ST91000640NS
    Serial Number:    9XGA228L
    LU WWN Device Id: 5 000c50 0918f2f8a
    Add. Product Id:  DELL(tm)
    Firmware Version: AA63
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Form Factor:      2.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Fri Jan 11 11:28:46 2019 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
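    Show disk attributes

    This output is generally used to check the disk's overall health indicators.

    The fields in the output below are as follows:

    • ID: the attribute ID, usually a number between 1 and 255 (decimal or hexadecimal).
    • ATTRIBUTE_NAME: the attribute name defined by the disk manufacturer.
    • FLAG: the attribute handling flags (can be ignored).
    • VALUE: one of the most important columns. It is the normalized value of the attribute, between 1 and 253, where 253 is the best case and 1 the worst. Depending on the attribute and the manufacturer, the initial VALUE may be set to 100 or 200.
    • WORST: the lowest VALUE ever recorded.
    • THRESH: the lowest value WORST may reach before the attribute is reported as failed. Once WORST drops to THRESH or below, the attribute is flagged as failed (see WHEN_FAILED).
    • TYPE: the attribute type (Pre-fail or Old_age). A Pre-fail attribute is a critical one that takes part in the disk's overall SMART health assessment (PASSED/FAILED); if any Pre-fail attribute fails, the disk should be considered about to fail. An Old_age attribute is non-critical (for example normal wear) and does not by itself mean the disk is failing.
    • UPDATED: how often the attribute is updated. Always means it is also updated during normal operation; Offline means it is updated only when offline tests run on the disk.
    • WHEN_FAILED: set to "FAILING_NOW" if VALUE is less than or equal to THRESH, to "In_the_past" if WORST is less than or equal to THRESH, and to "-" otherwise. With "FAILING_NOW", back up important files as soon as possible, especially if the attribute is of type Pre-fail. "In_the_past" means the attribute failed at some point but is fine at the time of the check. "-" means the attribute has never failed.
    • RAW_VALUE: the raw value defined by the manufacturer; VALUE is derived from it.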
    [root@node-15 ~]# smartctl -A -d megaraid,6 /dev/sdf  
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x010f   081   038   044    Pre-fail  Always   In_the_past 151546765
      3 Spin_Up_Time            0x0103   094   094   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       21
      5 Reallocated_Sector_Ct   0x0133   100   100   036    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       338813105
      9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       18784
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       21
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1710
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   069   053   045    Old_age   Always       -       31 (Min/Max 24/40)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       19
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       852
    194 Temperature_Celsius     0x0022   031   047   000    Old_age   Always       -       31 (0 14 0 0 0)
    195 Hardware_ECC_Recovered  0x001a   117   099   000    Old_age   Always       -       151546765
    197 Current_Pending_Sector  0x0012   084   084   000    Old_age   Always       -       688
    198 Offline_Uncorrectable   0x0010   084   084   000    Old_age   Offline      -       688
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       8093 (164 214 0)
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1870535293
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1530387871
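
    The field definitions above translate directly into a small check over this table. The sketch below assumes the ten-column layout shown in the output and applies the WHEN_FAILED rule exactly as described in the field list; it is a simplification of what smartctl itself does, and parse_attributes and when_failed are hypothetical helpers.

```python
# Three rows copied from the attribute table above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x010f   081   038   044    Pre-fail  Always   In_the_past 151546765
  5 Reallocated_Sector_Ct   0x0133   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   084   084   000    Old_age   Always       -       688
"""


def parse_attributes(text):
    """Parse attribute rows; RAW_VALUE keeps any trailing text such as '(Min/Max ...)'."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 9)
        if len(parts) != 10 or not parts[0].isdigit():
            continue
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "type": parts[6],
            "when_failed": parts[8],
            "raw": parts[9].strip(),
        })
    return rows


def when_failed(value, worst, thresh):
    """Apply the WHEN_FAILED rule from the field list above."""
    if value <= thresh:
        return "FAILING_NOW"
    if worst <= thresh:
        return "In_the_past"
    return "-"
```

    For the first row, VALUE 81 is above THRESH 44 but WORST 38 is not, so the rule yields In_the_past, matching what smartctl printed. Raw counters such as Current_Pending_Sector (688 here) are not covered by this rule and are worth checking separately.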
    
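    Show the disk health status

    Note:

    In the result below, the overall assessment is PASSED, meaning the disk can still be used. However, one marginal attribute is listed with WORST < THRESH, TYPE Pre-fail and WHEN_FAILED In_the_past, which suggests the disk is likely to fail soon.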

    [root@node-15 ~]# smartctl -H -d megaraid,6 /dev/sdf  
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF READ SMART DATA SECTION ===
    SMART Status not supported: ATA return descriptor not supported by controller firmware
    SMART overall-health self-assessment test result: PASSED
    Warning: This result is based on an Attribute check.
    Please note the following marginal Attributes:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x010f   081   038   044    Pre-fail  Always   In_the_past 151546765
    
    Show the disk error log
    [root@node-15 ~]# smartctl -l error -d megaraid,6 /dev/sdf
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF READ SMART DATA SECTION ===
    SMART Error Log Version: 1
    ATA Error Count: 46431 (device log contains only the most recent five errors)
            CR = Command Register [HEX]
            FR = Features Register [HEX]
            SC = Sector Count Register [HEX]
            SN = Sector Number Register [HEX]
            CL = Cylinder Low Register [HEX]
            CH = Cylinder High Register [HEX]
            DH = Device/Head Register [HEX]
            DC = Device Command Register [HEX]
            ER = Error register [HEX]
            ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    
    Error 46431 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      42 00 00 ff ff ff 4f 00  46d+15:15:32.968  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:29.901  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:26.825  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:23.965  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:20.905  READ VERIFY SECTOR(S) EXT
    
    Error 46430 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      42 00 00 ff ff ff 4f 00  46d+15:15:29.901  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:26.825  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:23.965  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:20.905  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:18.093  READ VERIFY SECTOR(S) EXT
    
    Error 46429 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      42 00 00 ff ff ff 4f 00  46d+15:15:26.825  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:23.965  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:20.905  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:18.093  READ VERIFY SECTOR(S) EXT
      b0 da 00 00 4f c2 00 00  46d+15:15:17.838  SMART RETURN STATUS
    
    Error 46428 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      42 00 00 ff ff ff 4f 00  46d+15:15:23.965  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:20.905  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:18.093  READ VERIFY SECTOR(S) EXT
      b0 da 00 00 4f c2 00 00  46d+15:15:17.838  SMART RETURN STATUS
      2f 00 01 e0 00 00 40 00  46d+15:15:17.703  READ LOG EXT
    
    Error 46427 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      42 00 00 ff ff ff 4f 00  46d+15:15:20.905  READ VERIFY SECTOR(S) EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:18.093  READ VERIFY SECTOR(S) EXT
      b0 da 00 00 4f c2 00 00  46d+15:15:17.838  SMART RETURN STATUS
      2f 00 01 e0 00 00 40 00  46d+15:15:17.703  READ LOG EXT
      42 00 00 ff ff ff 4f 00  46d+15:15:15.276  READ VERIFY SECTOR(S) EXT
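
    The error log is long but regular, so the entries can be summarised with two patterns. The sketch below assumes the "Error N occurred at ..." and "Error: UNC at LBA = ..." lines appear as in the output above; parse_error_log is a hypothetical helper.

```python
import re

# Header and register lines copied from two entries of the log above.
SAMPLE = """\
Error 46431 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
Error 46430 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
"""

HEADER = re.compile(r"^Error (\d+) occurred at disk power-on lifetime: (\d+) hours")
DETAIL = re.compile(r"Error: (\w+) at LBA = (0x[0-9a-fA-F]+) = (\d+)")


def parse_error_log(text):
    """Return one dict per logged error with its number, hours, kind and LBA."""
    errors = []
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            errors.append({
                "number": int(header.group(1)),
                "hours": int(header.group(2)),
                "kind": None,
                "lba": None,
            })
            continue
        detail = DETAIL.search(line)
        if detail and errors:
            errors[-1]["kind"] = detail.group(1)
            errors[-1]["lba"] = int(detail.group(2), 16)
    return errors
```

    Each entry carries the power-on hour at which it occurred, so 18640 hours can be checked against the "776 days + 16 hours" shown in the log (776 * 24 + 16 = 18640). All five logged errors here are UNC (uncorrectable) read errors at the same reported LBA.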
    
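    Additional notes
    • If SMART is not enabled on a disk, it can be enabled with smartctl -s on followed by the device.
    • If smartctl -i returns little information even though SMART support is reported as available and enabled, a self-test (-t short or -t long) may be needed before data can be read. The test does not destroy data on the disk, but offline tests are generally not suitable for disks serving storage workloads.
    • The -x and -a options collect more complete disk information.
    • smartmontools can also run as a service (smartd), configured in /etc/smartmontools/smartd.conf. I have not looked into this yet and will update this post when I have.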
  • Original article: https://www.cnblogs.com/wangl-blog/p/10839635.html