Ceph status reports warning: pool rbd has many more objects per pg than average (too few pgs?)


    Locating the problem

    [root@lab8106 ~]# ceph -s
        cluster fa7ec1a1-662a-4ba3-b478-7cb570482b62
         health HEALTH_WARN
                pool rbd has many more objects per pg than average (too few pgs?)
         monmap e1: 1 mons at {lab8106=192.168.8.106:6789/0}
                election epoch 30, quorum 0 lab8106
         osdmap e157: 2 osds: 2 up, 2 in
                flags sortbitwise
          pgmap v1023: 417 pgs, 13 pools, 18519 MB data, 15920 objects
                18668 MB used, 538 GB / 556 GB avail
                     417 active+clean
    

    The cluster is showing this warning: pool rbd has many more objects per pg than average (too few pgs?). In the Hammer release the same condition is reported as pool rbd has too few pgs.

    First, look at the detailed health information:

    [root@lab8106 ~]# ceph health detail
    HEALTH_WARN pool rbd has many more objects per pg than average (too few pgs?); mon.lab8106 low disk space
    pool rbd objects per pg (1912) is more than 50.3158 times cluster average (38)
    

    Next, check the object counts of each pool:

    [root@lab8106 ~]# ceph df
    GLOBAL:
        SIZE     AVAIL     RAW USED     %RAW USED 
        556G      538G       18668M          3.28 
    POOLS:
        NAME       ID     USED       %USED     MAX AVAIL     OBJECTS 
        rbd        6      16071M      2.82          536G       15296 
        pool1      7        204M      0.04          536G          52 
        pool2      8        184M      0.03          536G          47 
        pool3      9        188M      0.03          536G          48 
        pool4      10       192M      0.03          536G          49 
        pool5      11       204M      0.04          536G          52 
        pool6      12       148M      0.03          536G          38 
        pool7      13       184M      0.03          536G          47 
        pool8      14       200M      0.04          536G          51 
        pool9      15       200M      0.04          536G          51 
        pool10     16       248M      0.04          536G          63 
        pool11     17       232M      0.04          536G          59 
        pool12     18       264M      0.05          536G          67
    

    Check the pg count of each pool:

    [root@lab8106 ~]# ceph osd dump|grep pool
    pool 6 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 132 flags hashpspool stripe_width 0
    pool 7 'pool1' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 134 flags hashpspool stripe_width 0
    pool 8 'pool2' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 136 flags hashpspool stripe_width 0
    pool 9 'pool3' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 138 flags hashpspool stripe_width 0
    pool 10 'pool4' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 140 flags hashpspool stripe_width 0
    pool 11 'pool5' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 142 flags hashpspool stripe_width 0
    pool 12 'pool6' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 144 flags hashpspool stripe_width 0
    pool 13 'pool7' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 146 flags hashpspool stripe_width 0
    pool 14 'pool8' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 148 flags hashpspool stripe_width 0
    pool 15 'pool9' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 150 flags hashpspool stripe_width 0
    pool 16 'pool10' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 152 flags hashpspool stripe_width 0
    pool 17 'pool11' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 154 flags hashpspool stripe_width 0
    pool 18 'pool12' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 200 pgp_num 200 last_change 156 flags hashpspool stripe_width 0
    

    Let's see how these numbers are derived:

    pool rbd objects per pg (1912) is more than 50.3158 times cluster average (38)

    rbd objects_per_pg = 15296 / 8 = 1912

    cluster average objects_per_pg = 15920 / 417 ≈ 38

    ratio = rbd objects_per_pg / cluster average = 1912 / 38 ≈ 50.3158
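    The monitor's arithmetic can be reproduced with a few lines of Python (integer division, as in the monitor code; the input numbers come from the `ceph df` and `ceph -s` output above):

```python
# Reproduce the monitor's per-pool skew arithmetic for this cluster.
rbd_objects = 15296          # objects in pool 'rbd' (from ceph df)
rbd_pg_num = 8               # pg_num of pool 'rbd' (from ceph osd dump)
total_objects = 15920        # cluster-wide objects (from the pgmap line)
total_pgs = 417              # cluster-wide pgs

objects_per_pg = rbd_objects // rbd_pg_num            # integer division, as in C++
average_objects_per_pg = total_objects // total_pgs
ratio = objects_per_pg / average_objects_per_pg

print(objects_per_pg, average_objects_per_pg, round(ratio, 4))
# prints: 1912 38 50.3158
```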

    In other words, when the other pools hold few objects, a pool with many objects spread over few pgs skews far above the cluster average, and the warning fires. Here is the check in the source code:

    https://github.com/ceph/ceph/blob/master/src/mon/PGMonitor.cc

    int average_objects_per_pg = pg_map.pg_sum.stats.sum.num_objects / pg_map.pg_stat.size();
    if (average_objects_per_pg > 0 &&
        pg_map.pg_sum.stats.sum.num_objects >= g_conf->mon_pg_warn_min_objects &&
        p->second.stats.sum.num_objects >= g_conf->mon_pg_warn_min_pool_objects) {
      int objects_per_pg = p->second.stats.sum.num_objects / pi->get_pg_num();
      float ratio = (float)objects_per_pg / (float)average_objects_per_pg;
      if (g_conf->mon_pg_warn_max_object_skew > 0 &&
          ratio > g_conf->mon_pg_warn_max_object_skew) {
        ostringstream ss;
        ss << "pool " << name << " has many more objects per pg than average (too few pgs?)";
        summary.push_back(make_pair(HEALTH_WARN, ss.str()));
        if (detail) {
          ostringstream ss;
          ss << "pool " << name << " objects per pg ("
             << objects_per_pg << ") is more than " << ratio << " times cluster average ("
             << average_objects_per_pg << ")";
          detail->push_back(make_pair(HEALTH_WARN, ss.str()));
        }
      }
    }
    

    The check is gated by these config options (defaults shown):

    mon_pg_warn_min_objects = 10000      // warn only if the whole cluster holds at least 10000 objects

    mon_pg_warn_min_pool_objects = 1000  // warn only if the pool itself holds at least 1000 objects

    mon_pg_warn_max_object_skew = 10     // maximum allowed ratio of the pool's objects-per-pg to the cluster average
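    As a sketch, the gating logic above can be mirrored in Python (thresholds are the defaults just listed; the pool numbers come from this cluster's `ceph df` output):

```python
# Python sketch of the PGMonitor.cc check quoted above; the thresholds
# are the default values of the three mon_pg_warn_* options.
MON_PG_WARN_MIN_OBJECTS = 10000
MON_PG_WARN_MIN_POOL_OBJECTS = 1000
MON_PG_WARN_MAX_OBJECT_SKEW = 10.0

def check_pool_skew(pool_objects, pool_pg_num, cluster_objects, cluster_pgs):
    """Return the warning detail string, or None if the pool passes."""
    average_objects_per_pg = cluster_objects // cluster_pgs
    if (average_objects_per_pg <= 0
            or cluster_objects < MON_PG_WARN_MIN_OBJECTS
            or pool_objects < MON_PG_WARN_MIN_POOL_OBJECTS):
        return None
    objects_per_pg = pool_objects // pool_pg_num
    ratio = objects_per_pg / average_objects_per_pg
    if MON_PG_WARN_MAX_OBJECT_SKEW > 0 and ratio > MON_PG_WARN_MAX_OBJECT_SKEW:
        return ("objects per pg (%d) is more than %g times cluster average (%d)"
                % (objects_per_pg, ratio, average_objects_per_pg))
    return None

# pool 'rbd': 15296 objects over 8 pgs -> triggers the warning
print(check_pool_skew(15296, 8, 15920, 417))
# pool12: 67 objects over 200 pgs -> below mon_pg_warn_min_pool_objects, no warning
print(check_pool_skew(67, 200, 15920, 417))
```

    The second call shows why the tiny pools never warn even though their object counts are far below average: the per-pool minimum of 1000 objects filters them out.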

    Fixing the problem

    There are three ways to clear this warning:

    • Delete unused pools
      If the cluster contains pools that are no longer used but still carry a relatively high pg count, deleting them raises the cluster-wide average objects per pg, which brings the ratio back under mon_pg_warn_max_object_skew and clears the warning.

    • Increase the pg count of the flagged pool
      It may simply be that the pool was created with too few pgs. Increasing its pg_num and pgp_num lowers its objects per pg, again bringing the ratio under the threshold.

    • Raise mon_pg_warn_max_object_skew
      If the cluster already has enough pgs and adding more could destabilize it, you can raise this option (default 10) to silence the warning.
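    The last two options map onto standard ceph CLI commands on a pre-Luminous cluster like this one; the pg count of 128 and the threshold of 20 below are illustrative values, not recommendations:

```shell
# Option 2: grow the flagged pool's pg count (128 is an example;
# size it for your OSD count, typically a power of two).
ceph osd pool set rbd pg_num 128
ceph osd pool set rbd pgp_num 128

# Option 3: raise the skew threshold at runtime (default 10), then
# persist it in ceph.conf under [mon] so it survives a restart:
#   mon_pg_warn_max_object_skew = 20
ceph tell mon.* injectargs '--mon_pg_warn_max_object_skew 20'
```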

    Summary

    This warning compares a pool's objects-per-pg count against the cluster-wide average; when the skew is too large, the warning is raised.

    Diagnostic steps:

    ceph health detail
    ceph df
    ceph osd dump | grep pool
    

    mon_pg_warn_max_object_skew = 10.0

    The warning appears when ((objects / pg_num) in the affected pool) / (objects / pg_num in the entire cluster) exceeds 10.0.

    Change log

    Why     Who            When
    Created 武汉-运维-磨渣  2016-07-27
    Original post: https://www.cnblogs.com/zphj1987/p/13575362.html