• DataNode迁移方案


    DataNode迁移方案

    目标

    由于外界因素的影响,需要将原有dn所在节点的机器从A机房换到B机房,其中会涉及到主机名和IP的改变.最终的目标是迁移之后对集群不造成大影响,
    服务依然可用,数据不发生丢失.

    相关知识

    因为在dn迁移的时候,必定会导致迁移节点停止心跳,如果超过心跳检查超时时间,此节点就会被任务是dead node,为了平衡副本数,会造成集群内大量
    的block块复制的现象,如果不想要在短时间内不是节点成为dead node,需要人工把心跳超时检查时间设大.namenode超时心跳检测时间算法如下:

      DatanodeManager(final BlockManager blockManager, final Namesystem namesystem,
          final Configuration conf) throws IOException {
        ....
    
        final long heartbeatIntervalSeconds = conf.getLong(
            DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY,
            DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_DEFAULT);
        final int heartbeatRecheckInterval = conf.getInt(
            DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_KEY, 
            DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_DEFAULT); // 5 minutes
        this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval
            + 10 * 1000 * heartbeatIntervalSeconds;
        ....

    核心公式如下:

    this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval + 10 * 1000 * heartbeatIntervalSeconds;

    heartbeatRecheckInterval心跳检测时间默认300s,心跳间隔时间3s,所以默认超时时间2*300+10*3=630s.因此需要将前者配置加大.

    <name>dfs.namenode.heartbeat.recheck-interval</name> 
    <value>10800000</value> 
    <source>hdfs-default.xml</source> 
    </property>

    在此地可以调大为3小时,根据使用场景进行变化
    * 更新standby namenode的hdfs-site.xml的配置,并重启.
    * 等待standby namenode退出safemode之后,再stop active namenode,并更新配置,并重启.

    但是此种方案适用于Datanode不涉及主机名和IP的变化的情况.

    下面是涉及主机名,IP变化的迁移方案:

    步骤1:Datanode迁移测试

    在dn做迁移之前,进行测试文件的上传和rpc请求的测试,与后面进行对比
    首先上传1个test文件

    bin/hadoop fs -put test.txt /tmp

    保证此文件所在的block必然会存在于此节点上,用-cat命令进行查看

    bin/hadoop fs -cat /tmp/test.txt
    • 测试完毕,此时停止dn,并使用jps命令查看dn是否真正停止.
    • 并观察namenode的web界面上将显示迁移节点在超过630s后被认为是dead node.
    • 之后在namenoded的web界面的Number of Under-Replicated Blocks指标将会显示出正在进行拷贝的block副本数,表明目前有大量的block的块
      在进行副本复制.

    步骤3:Datanode重启

    • 重新启动dn,查看datanode log日志
      ,因为dn在初次启动的时候由于缓存的dfsUsed值超过600s会过期,需要重新执行du命令扫描上面的磁盘块进行dfsUsed使用量的计算,会消耗
      几分钟的时间(如果是立即stop,并马上start datanode,将会非常快.)
    2016-01-06 16:05:08,118 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added new volume: DS-70097061-42f8-4c33-ac27-2a6ca21e60d4
    2016-01-06 16:05:08,118 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added volume - /home/data/data/hadoop/dfs/data/data12/current, StorageType: DISK
    2016-01-06 16:05:08,176 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Registered FSDatasetState MBean
    2016-01-06 16:05:08,177 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding block pool BP-1942012336-xx.xx.xx.xx-1406726500544
    2016-01-06 16:05:08,178 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data2/current...
    2016-01-06 16:05:08,179 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data3/current...
    2016-01-06 16:05:08,179 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data4/current...
    2016-01-06 16:05:08,179 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data5/current...
    2016-01-06 16:05:08,180 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data6/current...
    2016-01-06 16:05:08,180 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data7/current...
    2016-01-06 16:05:08,180 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data8/current...
    2016-01-06 16:05:08,180 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data9/current...
    2016-01-06 16:05:08,181 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data10/current...
    2016-01-06 16:05:08,181 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data11/current...
    2016-01-06 16:05:08,181 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on volume /home/data/data/hadoop/dfs/data/data12/current...
    2016-01-06 16:09:49,646 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data7/current: 281466ms
    2016-01-06 16:09:54,235 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data9/current: 286054ms
    2016-01-06 16:09:57,859 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data2/current: 289680ms
    2016-01-06 16:10:00,333 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data5/current: 292153ms
    2016-01-06 16:10:05,696 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data8/current: 297516ms
    2016-01-06 16:10:11,229 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data6/current: 303049ms
    2016-01-06 16:10:28,075 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data12/current: 319894ms
    2016-01-06 16:10:33,017 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data4/current: 324838ms
    2016-01-06 16:10:40,177 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data10/current: 331996ms
    2016-01-06 16:10:44,882 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data3/current: 336703ms
    2016-01-06 16:11:14,241 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1942012336-xx.xx.xx.xx-1406726500544 on /home/data/data/hadoop/dfs/data/data11/current: 366060ms
    2016-01-06 16:11:14,242 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Total time to scan all replicas for block pool BP-1942012336-xx.xx.xx.xx-1406726500544: 366065ms

    出现上述Time taken的日志代表磁盘扫描操作结束,dn启动成功了.

    • dn成功启动后,namenode的页面上的的Number of Under-Replicated Blocks指标将会重新变为0,代表不需要进行多余块的复制.
    • 在此节点上重新执行bin/hadoop fs -cat /tmp/test.txt命令,测试文件内容是否能够查看,测试结束后删除测试文件

    总结

    • 对于更换主机名和IP的DataNode迁移操作而言,只要在可控的时间内恢复dn服务,
      不会对集群造成大的影响,只会在迁移节点成为dead node的状态时会出现短暂块复制的现象.
    • 对于不变化主机名和IP操作的DataNode迁移操作,只要加大heartbeat recheck时间,使其在短时间内不成为dead node,然后进行恢复,
      将不会对集群造成影响.
  • 相关阅读:
    收藏
    计算矩阵连乘
    关于sublime text
    关于拓扑排序(topologicalsort)
    生成最小树prim算法
    矩阵转置的两种算法
    android wifi热点 socket通信
    AsyncTask异步任务类使用学习
    数据库操作学习
    android 监听短信并发送到服务器
  • 原文地址:https://www.cnblogs.com/bianqi/p/12183807.html
Copyright © 2020-2023  润新知