• Using the EDAC tools to detect server memory faults [repost]


    Reposted from: https://www.cnblogs.com/luckyall/p/11225772.html

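    With the spread of virtualization and in-memory stores such as Redis and BDB, more and more servers are configured with large amounts of RAM. Take the DELL R620: in a dual-CPU configuration its 24 memory slots support as much as 960GB. Diagnosing faults in memory with error correction (ECC, registered DIMMs) is a real headache: a faulty module can keep running for months or even years, but with bad luck the machine can go down at any moment. Fortunately, Linux offers edac-utils, a diagnostic tool for memory error correction that can be used to check a server for latent memory faults.
    The following uses CentOS as an example to show how to use edac-utils.
    Before using edac-utils you need to understand the server's hardware layout. Taking the DELL R620 as the example (other models such as the HP DL360P G8 and IBM X3650 M4 also use E5-2600 series CPUs and the C600 series chipset, so they are broadly similar), the relationship between each CPU's memory controller, its channels and the memory slots is as follows.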

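    Processor 0 (one memory controller)
    Channel 0: memory slots A1, A5 and A9
    Channel 1: memory slots A2, A6 and A10
    Channel 2: memory slots A3, A7 and A11
    Channel 3: memory slots A4, A8 and A12

    Processor 1 (one memory controller)
    Channel 0: memory slots B1, B5 and B9
    Channel 1: memory slots B2, B6 and B10
    Channel 2: memory slots B3, B7 and B11
    Channel 3: memory slots B4, B8 and B12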
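
    The slot layout above follows a regular pattern, so it can be captured in a few lines. The Python sketch below is only an illustration: the function name `slot_label` and the rule "slot number = DIMM index × channels + channel + 1" are inferred from the R620 table above, not taken from Dell or EDAC documentation, and other boards may number their slots differently.

```python
def slot_label(cpu: int, channel: int, dimm: int, channels_per_cpu: int = 4) -> str:
    """Return the slot label (e.g. "A5") for a CPU/channel/DIMM triple.

    Assumption: the R620-style layout listed above, where CPU 0 owns bank A,
    CPU 1 owns bank B, and slots are numbered across the channels first.
    """
    if cpu not in (0, 1):
        raise ValueError("only the two-socket layout above is modeled")
    if not 0 <= channel < channels_per_cpu:
        raise ValueError("channel out of range")
    if dimm < 0:
        raise ValueError("dimm index must be non-negative")
    bank = "AB"[cpu]
    return f"{bank}{dimm * channels_per_cpu + channel + 1}"


if __name__ == "__main__":
    # Channel 1 of processor 0 should hold A2, A6 and A10.
    print([slot_label(0, 1, d) for d in range(3)])
```

    Passing `channels_per_cpu=3` makes the same rule reproduce the three-channel numbering (A1 to A6, B1 to B6) used in the 12-DIMM listing later in this article.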

    1. Install the edac-utils tool

    yum install -y libsysfs edac-utils
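    2. Run the detection command to view the error-correction report. In the listing below, each DIMM line is annotated with its matching slot: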

    edac-util -v

    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12

    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
    mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#2: B9
    mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#2: B10
    mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#2: B11
    mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#2: B12

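    Here mc0 denotes memory controller 0, CPU_SrcID#0 denotes source CPU 0, Chan#0 denotes channel 0 and DIMM#0 denotes memory slot 0 on that channel; Corrected Errors is the number of errors that have already been corrected. Using the CPU channel and slot mapping listed earlier, each line returned by edac-util can be labeled with its physical slot.
    In this case the result was 6312 corrected errors in slot A1, 6459 in slot B1 and 535 in slot B3. These three DIMMs show latent faults, so the next step is to contact the vendor and have them replaced.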
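
    Labeling the output by hand gets tedious on a fleet of machines, so the lookup can be scripted. The Python sketch below is a minimal illustration, not part of edac-utils: the names `LINE_RE`, `corrected_errors` and `SAMPLE` are made up here, the line format is copied from the listings in this article, the slot rule is the R620 layout assumed above, and the sample lines are reconstructed from the three counts quoted above rather than captured from a real machine.

```python
import re

# Matches per-DIMM lines in the format shown in this article, e.g.
# "mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors".
LINE_RE = re.compile(
    r"mc(?P<mc>\d+): csrow(?P<csrow>\d+): "
    r"CPU_SrcID#(?P<cpu>\d+)_Ha#\d+_Chan#(?P<chan>\d+)_DIMM#(?P<dimm>\d+): "
    r"(?P<count>\d+) Corrected Errors"
)


def corrected_errors(report: str, channels_per_cpu: int = 4) -> dict:
    """Map slot labels to corrected-error counts, skipping DIMMs with zero errors."""
    result = {}
    for line in report.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # summary and "Uncorrected Errors" lines are ignored here
        count = int(match["count"])
        if count == 0:
            continue
        # Assumption: CPU 0 -> bank A, CPU 1 -> bank B, numbered channel-first.
        bank = chr(ord("A") + int(match["cpu"]))
        number = int(match["dimm"]) * channels_per_cpu + int(match["chan"]) + 1
        label = f"{bank}{number}"
        result[label] = result.get(label, 0) + count
    return result


# Reconstructed sample input (hypothetical, built from the counts quoted above).
SAMPLE = """\
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 6312 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 6459 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 535 Corrected Errors
"""

if __name__ == "__main__":
    for slot, count in sorted(corrected_errors(SAMPLE).items()):
        print(f"{slot}: {count} corrected errors")
```

    On a real host the report text could come from running `edac-util -v` and capturing its output. Boards that report the shorter `CPU#0Channel#0_DIMM#0` form would need a different pattern.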

    Mapping for 12 DIMMs
    mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
    mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
    mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
    mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
    mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
    mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

    mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
    mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
    mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
    mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
    mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
    mc1: csrow1: CPU#1Channel#2_DIMM#1: B6

    Mapping for 20 DIMMs
    mc0: 0 Uncorrected Errors with no DIMM info
    mc0: 0 Corrected Errors with no DIMM info
    mc0: csrow0: 0 Uncorrected Errors
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors A1
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors B1
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors C1
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors D1
    mc0: csrow1: 0 Uncorrected Errors
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors A2
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors B2
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors C2
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors D2
    mc0: csrow2: 0 Uncorrected Errors
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors A3
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors B3
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors C3
    mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors D3
    mc1: 0 Uncorrected Errors with no DIMM info
    mc1: 0 Corrected Errors with no DIMM info
    mc1: csrow0: 0 Uncorrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors 
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors 
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
    mc1: csrow1: 0 Uncorrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

    Mapping for a 4x16 configuration
    mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors 8a
    mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors 5b
    mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors 2c
    mc0: csrow1: 0 Uncorrected Errors
    mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors 7d
    mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors 4e
    mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors 1f
    mc0: csrow2: 0 Uncorrected Errors
    mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors 6G
    mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors 3h

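    [Author] 张昺华
    ["大饼教你学" course series] https://edu.csdn.net/course/detail/10393
    [Sina Weibo] 张昺华--sky
    [Twitter] @sky2030_
    [WeChat official account] 张昺华
    Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a link to the original must be given in a prominent place on the page; otherwise the right to pursue legal liability is reserved.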
  • Original URL: https://www.cnblogs.com/sky-heaven/p/13528181.html