• Cause of an abnormal hang on an Intel 82599 NIC


    Background:

    On a production server, the network link dropped without warning and SSH connections failed.

    Initial diagnosis:

    The kernel log contained the following NIC errors:

    Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 14 not cleared within the polling period

    Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 15 not cleared within the polling period

    Jan 24 11:52:43 localhost kernel: bonding: bond5: link status definitely down for interface eth0, disabling it

    Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: detected SFP+: 5

    Jan 24 11:52:43 localhost kernel: ixgbe 0000:84:00.0: eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX

    Jan 24 11:52:43 localhost kernel: bond5: link status definitely up for interface eth0, 10000 Mbps full duplex.

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Detected Tx Unit Hang

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx_buffer_info[next_to_clean]

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: tx hang 448 detected on queue 6, resetting adapter

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: Reset adapter

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period

    Jan 24 11:52:47 localhost kernel: ixgbe 0000:84:00.0: eth0: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period
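
The symptoms in the log above can be counted mechanically, which helps when the same pattern has to be found across many hosts. The following is a minimal sketch, not part of the original investigation: the regular expressions are copied from the messages quoted above, while the category names and the function `classify_ixgbe_log` are hypothetical labels chosen for illustration.

```python
import re
from collections import Counter

# Patterns copied from the kernel messages quoted above.
# The dictionary keys are our own labels, not ixgbe terminology.
PATTERNS = {
    "rx_disable_timeout": re.compile(r"RXDCTL\.ENABLE on Rx queue (\d+) not cleared"),
    "tx_hang": re.compile(r"Detected Tx Unit Hang"),
    "adapter_reset": re.compile(r"Reset adapter"),
    "link_down": re.compile(r"link status definitely down"),
}


def classify_ixgbe_log(lines):
    """Count how often each known symptom appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed log location): python classify.py < /var/log/messages
    for name, count in classify_ixgbe_log(sys.stdin).items():
        print(f"{name}: {count}")
```

A Tx hang followed by a reset in which the Rx queues cannot be disabled is the combination seen here. On its own it does not identify the root cause.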

    PCI information for the NIC:

    # lspci -vvv -s 84:00.0
    84:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
            Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
            Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
            Interrupt: pin A routed to IRQ 16
            Region 0: Memory at f7e20000 (64-bit, non-prefetchable) [disabled] [size=128K]
            Region 2: I/O ports at f020 [disabled] [size=32]
            Region 4: Memory at f7e44000 (64-bit, non-prefetchable) [disabled] [size=16K]
            Capabilities: [40] Power Management version 3
                    Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
            Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                    Address: 0000000000000000  Data: 0000
                    Masking: 00000000  Pending: 00000000
            Capabilities: [70] MSI-X: Enable- Count=64 Masked-
                    Vector table: BAR=4 offset=00000000
                    PBA: BAR=4 offset=00002000
            Capabilities: [a0] Express (v2) Endpoint, MSI 00
                    DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                    DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                            MaxPayload 128 bytes, MaxReadReq 512 bytes
                    DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
                    LnkCap: Port #4, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <2us, L1 <32us
                            ClockPM- Surprise- LLActRep- BwNot-
                    LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                    LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                    DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
                    DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                    LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                             Compliance De-emphasis: -6dB
                    LnkSta2: Current De-emphasis Level: -6dB
            Capabilities: [e0] Vital Product Data
                    Unknown small resource type 06, will not decode more.
            Capabilities: [100] Advanced Error Reporting
                    UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                    UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                    UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                    CESta:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                    CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                    AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
            Capabilities: [140] Device Serial Number 98-f5-37-ff-ff-e3-64-73
            Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
                    ARICap: MFVC- ACS-, Next Function: 1
                    ARICtl: MFVC- ACS-, Function Group: 0
            Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
                    IOVCap: Migration-, Interrupt Message Number: 000
                    IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                    IOVSta: Migration-
                    Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                    VF offset: 384, stride: 2, Device ID: 10ed
                    Supported Page Size: 00000553, System Page Size: 00000001
                    Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
                    Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
                    VF Migration: offset: 00000000, BIR: 0
            Kernel driver in use: ixgbe
            Kernel modules: ixgbe
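
One detail in the output above is easy to miss: the `Control:` line shows `Mem-` and `BusMaster-`, and every region is marked `[disabled]`. The sketch below shows one way to pull those flags out of `lspci -vvv` text so the check can be scripted. The helper names `parse_flags` and `mmio_disabled` are hypothetical, and the parser only assumes the `Name+` / `Name-` token format visible above.

```python
def parse_flags(line):
    """Turn 'Control: I/O- Mem+ BusMaster+' into {'I/O': False, 'Mem': True, ...}."""
    _, _, rest = line.partition(":")
    flags = {}
    for token in rest.split():
        # Each flag is printed as its name followed by '+' (set) or '-' (clear).
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return flags


def mmio_disabled(control_line):
    """Return True when memory-space decoding is switched off for the device."""
    return not parse_flags(control_line).get("Mem", False)


if __name__ == "__main__":
    line = ("Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- "
            "ParErr- Stepping- SERR- FastB2B- DisINTx-")
    print(mmio_disabled(line))
```

With memory decoding off, the device does not respond to accesses to its BARs, which fits a register dump that returns nothing useful.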

    NIC register dump:

    # ethtool -d  eth0
    0x042A4: LINKS (Link Status register)                 0xFFFFFFFF
           Link Status:                                   up
           Link Speed:                                    10G
    0x05080: FCTRL (Filter Control register)              0xFFFFFFFF
           Receive Flow Control Packets:                  enabled
           Receive Priority Flow Control Packets:         enabled
           Discard Pause Frames:                          enabled
           Pass MAC Control Frames:                       enabled
           Broadcast Accept:                              enabled
           Unicast Promiscuous:                           enabled
           Multicast Promiscuous:                         enabled
           Store Bad Packets:                             enabled
    0x05088: VLNCTRL (VLAN Control register)              0xFFFFFFFF
           VLAN Mode:                                     enabled
           VLAN Filter:                                   enabled
    0x02100: SRRCTL0 (Split and Replic Rx Control 0)      0xFFFFFFFF
           Receive Buffer Size:                           16KB
    0x03D00: RMCS (Receive Music Control register)        0xFFFFFFFF
           Transmit Flow Control:                         enabled
           Priority Flow Control:                         enabled
    0x04250: HLREG0 (Highlander Control 0 register)       0xFFFFFFFF
           Transmit CRC:                                  enabled
           Receive CRC Strip:                             enabled
           Jumbo Frames:                                  enabled
           Pad Short Frames:                              enabled
           Loopback:                                      enabled
    0x00000: CTRL        (Device Control)                 0xFFFFFFFF
    0x00008: STATUS      (Device Status)                  0xFFFFFFFF
    0x00018: CTRL_EXT    (Extended Device Control)        0xFFFFFFFF
    0x00020: ESDP        (Extended SDP Control)           0xFFFFFFFF
    0x00028: EODSDP      (Extended OD SDP Control)        0xFFFFFFFF
    0x00200: LEDCTL      (LED Control)                    0xFFFFFFFF

    ........

    0x01010: RDH00       (Receive Descriptor Head 00)     0xFFFFFFFF
    0x01050: RDH01       (Receive Descriptor Head 01)     0xFFFFFFFF
    0x01090: RDH02       (Receive Descriptor Head 02)     0xFFFFFFFF
    0x010D0: RDH03       (Receive Descriptor Head 03)     0xFFFFFFFF
    0x01110: RDH04       (Receive Descriptor Head 04)     0xFFFFFFFF

    ..........

    0x01028: RXDCTL00    (Receive Descriptor Control 00)  0xFFFFFFFF
    0x01068: RXDCTL01    (Receive Descriptor Control 01)  0xFFFFFFFF
    0x010A8: RXDCTL02    (Receive Descriptor Control 02)  0xFFFFFFFF

    ........

    0x06010: TDH00       (Transmit Descriptor Head 00)    0xFFFFFFFF
    0x06050: TDH01       (Transmit Descriptor Head 01)    0xFFFFFFFF
    0x06090: TDH02       (Transmit Descriptor Head 02)    0xFFFFFFFF
    0x060D0: TDH03       (Transmit Descriptor Head 03)    0xFFFFFFFF
    0x06110: TDH04       (Transmit Descriptor Head 04)    0xFFFFFFFF
    0x06150: TDH05       (Transmit Descriptor Head 05)    0xFFFFFFFF
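
Every register in the dump above reads 0xFFFFFFFF, so the decoded fields ("up", "10G", "enabled") are artifacts of decoding all-ones and carry no information. A minimal sketch of an automated check for this condition follows. It assumes the `offset: NAME (description) value` layout shown above, and the function names `parse_register_dump` and `all_ones` are hypothetical.

```python
import re

# Matches lines such as:
#   0x01010: RDH00       (Receive Descriptor Head 00)     0xFFFFFFFF
REG_LINE = re.compile(r"^\s*(0x[0-9A-Fa-f]+):\s+(\S+).*?\s(0x[0-9A-Fa-f]{8})\s*$")


def parse_register_dump(text):
    """Map register name -> 32-bit value for every register line in the dump."""
    regs = {}
    for line in text.splitlines():
        match = REG_LINE.match(line)
        if match:
            regs[match.group(2)] = int(match.group(3), 16)
    return regs


def all_ones(regs):
    """True when at least one register was parsed and all of them read 0xFFFFFFFF."""
    return bool(regs) and all(value == 0xFFFFFFFF for value in regs.values())


if __name__ == "__main__":
    import subprocess

    # Assumes the interface is named eth0, as in the dump above.
    dump = subprocess.run(["ethtool", "-d", "eth0"],
                          capture_output=True, text=True).stdout
    print("all registers read 0xFFFFFFFF:", all_ones(parse_register_dump(dump)))
```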

    Possible cause:

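    The BAR0 address looks fine, but every register now reads 0xFFFFFFFF. The 82599 registers were normal at first and turned into all Fs after the machine had been running for a while (about 10 hours). The lspci output is consistent with this: it reports Mem- and BusMaster- and marks every region [disabled], and a read that the device does not answer typically comes back as all ones.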

    The likely cause is a poor physical contact on the PCIe interface.
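
To tell "the device has dropped off the bus entirely" apart from "the device still answers configuration reads but memory decoding is off", the first bytes of PCI configuration space can be decoded directly. The sketch below relies on the standard header layout (vendor ID at offset 0, device ID at offset 2, command register at offset 4, all little-endian) and on the sysfs `config` file that Linux exposes per device. The function names are hypothetical, and the bus address is the one from the log above. Since lspci could still name the device in this case, a valid vendor ID would be expected here, though that was not verified in the original investigation.

```python
import struct


def decode_pci_header(raw):
    """Decode vendor ID, device ID and command register from config space bytes."""
    vendor_id, device_id, command = struct.unpack_from("<HHH", raw, 0)
    return {
        # A vendor ID of 0xFFFF means even configuration reads are failing.
        "present": vendor_id != 0xFFFF,
        "vendor_id": vendor_id,
        "device_id": device_id,
        "io_space": bool(command & 0x1),      # command register bit 0
        "memory_space": bool(command & 0x2),  # command register bit 1
        "bus_master": bool(command & 0x4),    # command register bit 2
    }


def read_pci_header(bdf):
    """Read and decode the header of a device such as '0000:84:00.0' via sysfs."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as config:
        return decode_pci_header(config.read(6))


if __name__ == "__main__":
    print(read_pci_header("0000:84:00.0"))
```

Polling this periodically and logging the moment `memory_space` or `present` flips could help correlate the failure with other events, such as AER messages or temperature changes.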

  • Original article: https://www.cnblogs.com/smith9527/p/10348953.html