• Analysis: why the hadoop balancer failed to even out datanode storage


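      A while ago the disk usage across the datanodes of our hadoop cluster had become badly uneven, mainly because two machines added to the cluster later had much larger disks, so the cluster needed rebalancing. I ran the following command:

    bin/start-balancer.sh -threshold 10

      The log output was as follows: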

    Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
    Mar 10, 2014 11:03:40 AM          0                 0 KB            614.5 GB              20 GB
    Mar 10, 2014 11:03:41 AM          1                 0 KB            614.5 GB              20 GB
    Mar 10, 2014 11:03:42 AM          2               443 KB            614.5 GB              20 GB
    Mar 10, 2014 11:03:43 AM          3               443 KB            614.5 GB              20 GB
    Mar 10, 2014 11:03:44 AM          4            891.85 KB            614.5 GB              20 GB
    Mar 10, 2014 11:03:45 AM          5            891.85 KB            614.5 GB              20 GB
    Mar 10, 2014 11:03:46 AM          6            891.85 KB            614.5 GB              20 GB
    Mar 10, 2014 11:03:47 AM          7            891.85 KB           614.49 GB              20 GB
    Mar 10, 2014 11:03:48 AM          8            891.85 KB           614.49 GB              20 GB
    No block has been moved for 5 iterations. Exiting...
    Balancing took 10.023 seconds

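    Clearly the balancer had already worked out how much data needed to move, yet it moved less than 1 MB of the 614.5 GB before giving up. Why?

    hadoop-mysql-balancer-master.log contained no Error or Warning entries, so the only option left was to read the source code.

    It turns out the hadoop balancer checks every block before moving it. The conditions are spelled out in the following code: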
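    The exit message in the log can be read as a simple stall counter. The block below is a minimal sketch of that idea, not the Hadoop source: the class `StallDetector` and the method `exitIteration` are hypothetical names, the threshold of 5 is taken from the log message, and it assumes that the "Bytes Already Moved" column shows the running total at the start of each iteration and that the counter resets whenever an iteration moves any bytes.

```java
import java.util.Arrays;
import java.util.List;

public class StallDetector {
    // Taken from the log line "No block has been moved for 5 iterations".
    static final int MAX_IDLE_ITERATIONS = 5;

    /**
     * Returns the index of the iteration after which the loop gives up,
     * or -1 if the input never contains enough consecutive idle iterations.
     * Assumption: any progress at all resets the idle counter.
     */
    static int exitIteration(List<Long> bytesMovedPerIteration) {
        int idle = 0;
        for (int i = 0; i < bytesMovedPerIteration.size(); i++) {
            if (bytesMovedPerIteration.get(i) > 0) {
                idle = 0;               // progress was made, start counting again
            } else {
                idle++;
                if (idle >= MAX_IDLE_ITERATIONS) {
                    return i;           // fifth idle iteration in a row
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Approximate per-iteration deltas derived from the log above:
        // about 443 KB moved during iteration 1, about 448.85 KB during iteration 3.
        List<Long> deltas = Arrays.asList(
                0L, 453_632L, 0L, 459_622L, 0L, 0L, 0L, 0L, 0L);
        System.out.println("gives up after iteration " + exitIteration(deltas));
    }
}
```

    Under these assumptions the counter reaches 5 at iteration 8, which matches the last line of the log before the exit message.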

    /* Decide if it is OK to move the given block from source to target
     * A block is a good candidate if
     * 1. the block is not in the process of being moved/has not been moved;
     * 2. the block does not have a replica on the target;
     * 3. doing the move does not reduce the number of racks that the block has
     */
    
    private boolean isGoodBlockCandidate(Source source, 
          BalancerDatanode target, BalancerBlock block) {
        // check if the block is moved or not
        if (movedBlocks.contains(block)) {
            return false;
        }
        if (block.isLocatedOnDatanode(target)) {
          return false;
        }
    
        boolean goodBlock = false;
        if (cluster.isOnSameRack(source.getDatanode(), target.getDatanode())) {
          // good if source and target are on the same rack
          goodBlock = true;
        } else {
          boolean notOnSameRack = true;
          synchronized (block) {
            for (BalancerDatanode loc : block.locations) {
              if (cluster.isOnSameRack(loc.datanode, target.datanode)) {
                notOnSameRack = false;
                break;
              }
            }
          }
          if (notOnSameRack) {
            // good if target is not on the same rack as any replica
            goodBlock = true;
          } else {
            // good if source is on the same rack as one of the replicas
            for (BalancerDatanode loc : block.locations) {
              if (loc != source && 
                  cluster.isOnSameRack(loc.datanode, source.datanode)) {
                goodBlock = true;
                break;
              }
            }
          }
        }
        return goodBlock;
      }
      

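    I checked the three conditions above one by one to find out why the blocks were not being moved:

    (1) The block has not already been moved during this balancing run: satisfied.

    (2) The block does not already have a replica on the target machine: to be verified.

    (3) Moving the block must not reduce the number of racks that hold a replica of it. Note that what counts here is the number of distinct racks, not the total number of replicas: a block whose 2 replicas sit on the same rack occupies only one rack. Our cluster was configured without a rack-awareness script, so by default every node is on the same rack and this condition is satisfied.

    So I went to the cluster to verify the second condition, and sure enough, many blocks already had a replica on the 2 machines added later. There was nothing left to move to them, since they already held those blocks, so the balancer gave up and exited.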
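    The effect of condition (2) on a single-rack cluster can be sketched as follows. This is a simplified model, not the Hadoop implementation: `CandidateCheck`, `Block`, and the node names are hypothetical, and the rack check is reduced to a constant because every node is assumed to be on the same rack, which is the case where `isGoodBlockCandidate` accepts the move as soon as source and target share a rack.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CandidateCheck {
    /** Hypothetical stand-in for a block: an id plus the nodes holding its replicas. */
    static final class Block {
        final String id;
        final Set<String> replicaNodes;

        Block(String id, Set<String> replicaNodes) {
            this.id = id;
            this.replicaNodes = replicaNodes;
        }
    }

    /** Simplified version of the three checks for a cluster with a single rack. */
    static boolean isGoodCandidate(Block block, String target, Set<String> alreadyMoved) {
        if (alreadyMoved.contains(block.id)) {
            return false;               // condition 1: block was already moved
        }
        if (block.replicaNodes.contains(target)) {
            return false;               // condition 2: target already holds a replica
        }
        return true;                    // condition 3: always true with one rack
    }

    static Set<String> nodes(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        String target = "new-node-1";
        List<Block> blocks = Arrays.asList(
                new Block("blk_1", nodes("old-node-1", "new-node-1")),
                new Block("blk_2", nodes("old-node-2", "new-node-1")),
                new Block("blk_3", nodes("old-node-1", "old-node-2")));
        Set<String> alreadyMoved = new HashSet<>();
        for (Block b : blocks) {
            System.out.println(b.id + " movable to " + target + ": "
                    + isGoodCandidate(b, target, alreadyMoved));
        }
    }
}
```

    In this toy data set only `blk_3` is accepted. If most blocks on the overloaded nodes look like `blk_1` and `blk_2`, almost every candidate is rejected, which is consistent with the balancer moving only a few hundred KB before it exits.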

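    Solution:

    1. Use the following command

    bin/hadoop fs -setrep -R 2 /

    to set the replication factor of every file in the cluster uniformly to the value configured in hdfs-site.xml:

    <property>
    <name>dfs.replication</name>
    <value>2</value>
    </property>

    Then restart the hadoop cluster, wait until hadoop has finished verifying the blocks, and run the balancer again. That resolved the problem.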

  • Original article: https://www.cnblogs.com/wuren/p/3622856.html