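Hadoop 2.0 supports HA, which makes it possible to upgrade HDFS online without stopping the HDFS service.
Note that rolling upgrade is only supported from Hadoop 2.4.0 onwards.
JNs are relatively stable and in most cases do not need to be upgraded when HDFS is upgraded. The rolling upgrade procedure described here covers only NNs and DNs, not JNs or ZKNs.
This test used a non-federated cluster with Kerberos authentication (keeping the existing configuration is enough; no extra adjustment is needed), upgraded from Hadoop 2.7.7 to Hadoop 2.8.5.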
Pre-upgrade checks
Check that the HDFS service is currently healthy
[hadoop@hadoop001 ~]$ hdfs dfsadmin -report # check for abnormal DataNodes
[hadoop@hadoop001 ~]$ hdfs fsck / # check that the HDFS file system is healthy
..................................................................................................Status: HEALTHY
Total size: 20242735151 B (Total open files size: 332 B)
Total dirs: 821
Total files: 1198
Total symlinks: 0 (Files currently being written: 5)
Total blocks (validated): 1122 (avg. block size 18041653 B) (Total open file blocks (not validated): 4)
Minimally replicated blocks: 1122 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 67 (5.9714794 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 469 (12.2294655 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Mon Sep 16 11:09:05 CST 2019 in 91 milliseconds
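The health check above can be turned into a go/no-go test before the upgrade starts. The following is a minimal sketch, not part of the original procedure: fsck_ok is a hypothetical helper name, and it assumes the summary contains a "Status: HEALTHY" line and a "Corrupt blocks:" line as in the output above.

```shell
# Hypothetical helper: reads an fsck report on stdin and succeeds only if
# the report says HEALTHY and the corrupt block count is exactly 0.
fsck_ok() {
    local report
    report=$(cat)
    printf '%s\n' "$report" | grep -Eq 'Status:[[:space:]]*HEALTHY' || return 1
    printf '%s\n' "$report" | grep -Eq 'Corrupt blocks:[[:space:]]*0[[:space:]]*$' || return 1
}

# Example use on a live cluster (not executed here):
#   hdfs fsck / | fsck_ok && echo "fsck clean, safe to continue"
```

Under-replicated blocks are deliberately not treated as a failure in this sketch, since the run above proceeded with some present.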
Check that the active and standby NameNodes are healthy
[hadoop@hadoop001 ~]$ hdfs haadmin -getServiceState nn1 # check the active/standby state
standby
[hadoop@hadoop001 ~]$ hdfs haadmin -getServiceState nn2 # check the active/standby state
active
[hadoop@hadoop001 ~]$ ssh hadoop002
Last login: Thu Sep 5 19:01:30 2019 from 172.16.40.43
[hadoop@hadoop002 ~]$ hadoop-daemon.sh stop namenode # stop the active NameNode to check that failover works
stopping namenode
[hadoop@hadoop002 ~]$ exit
logout
Connection to hadoop002 closed.
[hadoop@hadoop001 ~]$ hdfs haadmin -getServiceState nn2 # check that failover worked
19/09/16 11:14:02 INFO ipc.Client: Retrying connect to server: hadoop002:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
Operation failed: Call From hadoop001 to hadoop002:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
[hadoop@hadoop001 ~]$ hdfs haadmin -getServiceState nn1 # check that failover worked
active
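The two getServiceState checks at the start of this step can be combined into a single test that the pair is in a sane state. This is a small sketch under the assumption that nn1 and nn2 are the configured NameNode IDs, as in the commands above; ha_pair_ok is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only when one state is "active" and the
# other is "standby". Anything else (two actives, an empty state from an
# unreachable NameNode, and so on) is treated as a failure.
ha_pair_ok() {
    case "$1/$2" in
        active/standby|standby/active) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on a live cluster (not executed here):
#   ha_pair_ok "$(hdfs haadmin -getServiceState nn1)" \
#              "$(hdfs haadmin -getServiceState nn2)" || echo "HA pair is not healthy"
```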
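Metadata backup
Back up the metadata on both the active and standby NameNodes, including the JournalNode edit logs. This step is optional, but it is a safeguard in case the upgrade fails or has to be rolled back.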
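One simple way to take such a backup is a timestamped tar archive of each metadata directory. The sketch below is an illustration only: backup_metadata_dir is a hypothetical helper, and the paths in the usage comment are placeholders. The real directories are whatever dfs.namenode.name.dir and dfs.journalnode.edits.dir point to on each node.

```shell
# Hypothetical helper: archive one metadata directory into DEST_DIR under
# a timestamped name and print the path of the archive that was written.
backup_metadata_dir() {
    local src=$1 dest_dir=$2
    local archive
    archive="$dest_dir/$(basename "$src")-$(date +%Y%m%d%H%M%S).tar.gz"
    mkdir -p "$dest_dir" || return 1
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    printf '%s\n' "$archive"
}

# Example use (placeholder paths, not executed here):
#   backup_metadata_dir /path/to/namenode/name /path/to/backup
#   backup_metadata_dir /path/to/journalnode/edits /path/to/backup
```

The archive is most consistent when it is taken while the directory is not being written to, for example on the standby NameNode.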
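Before preparing the upgrade, HDFS must be out of safe mode. If it is in safe mode, do not force it out manually; it normally leaves safe mode on its own once the file system check has completed.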
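The safe mode state can be checked with hdfs dfsadmin -safemode get before going further. A minimal sketch, assuming the command prints one "Safe mode is OFF" or "Safe mode is ON" line per NameNode; safemode_is_off is a hypothetical helper name.

```shell
# Hypothetical helper: reads the output of "hdfs dfsadmin -safemode get"
# on stdin. Succeeds only if at least one NameNode reports OFF and no
# NameNode reports ON.
safemode_is_off() {
    local out
    out=$(cat)
    printf '%s\n' "$out" | grep -q 'Safe mode is OFF' || return 1
    if printf '%s\n' "$out" | grep -q 'Safe mode is ON'; then
        return 1
    fi
    return 0
}

# Example use on a live cluster (not executed here):
#   hdfs dfsadmin -safemode get | safemode_is_off || echo "still in safe mode, wait"
```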
Prepare the upgrade
[hadoop@hadoop001 ~]$ hdfs dfsadmin -rollingUpgrade prepare
PREPARE rolling upgrade ...
Preparing for upgrade. Data is being saved for rollback.
Run "dfsadmin -rollingUpgrade query" to check the status
for proceeding with rolling upgrade
Block Pool ID: BP-686481837-192.168.40.42-1563178388776
Start Time: Mon Sep 16 11:38:36 CST 2019 (=1568605116802)
Finalize Time:
# The metadata directory now contains the rollback image files
-rw-r--r-- 1 hadoop users 155196 Sep 16 11:38 fsimage_rollback_0000000000002953069
-rw-r--r-- 1 hadoop users 71 Sep 16 11:38 fsimage_rollback_0000000000002953069.md5
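Check whether HDFS is now in the rolling upgrade state. If it is not, that is, if the query returns the output shown below, stop and investigate before continuing.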
[hadoop@hadoop001 ~]$ hdfs dfsadmin -rollingUpgrade query
QUERY rolling upgrade ...
There is no rolling upgrade in progress or rolling upgrade has already been finalized.
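The official rolling upgrade guide describes re-running the query until the "Proceed with rolling upgrade" message appears, which indicates that the rollback image is ready. That wait can be scripted roughly as follows; rolling_upgrade_ready is a hypothetical helper name and the 10 second interval in the usage comment is an arbitrary choice.

```shell
# Hypothetical helper: reads "hdfs dfsadmin -rollingUpgrade query" output
# on stdin and succeeds once the rollback image is reported as ready.
rolling_upgrade_ready() {
    grep -q 'Proceed with rolling upgrade'
}

# Example use on a live cluster (not executed here):
#   until hdfs dfsadmin -rollingUpgrade query | rolling_upgrade_ready; do
#       sleep 10
#   done
```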
Upgrade the NameNodes
# Upgrade the active and standby nodes
Stop the service
[hadoop@hadoop01 ~]$ hadoop-daemon.sh stop namenode
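# With the service stopped, replace the Hadoop installation directory and sync the configuration files. If Ranger is in use, re-run its enable script. Check that the other node has switched to the active state.
Replace the installation with the newer version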
[hadoop@hadoop001 core]$ mv hadoop/ hadoop-2.7.7
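[hadoop@hadoop01 core]$ scp -r hadoop hadoop@192.168.40.41:$PWD # in this test the 2.8.5 package was copied over from another cluster with scp; normally you would extract hadoop-2.8.5.tar.gz into this directory
Replace the configuration files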
[hadoop@hadoop01 etc]$ pwd
/opt/beh/core/hadoop/etc
[hadoop@hadoop01 etc]$ mv hadoop/ hadoop-2.8.5
[hadoop@hadoop01 etc]$ cp -r /opt/beh/core/hadoop-2.7.7/etc/hadoop/ .
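Note: if this node also runs a JournalNode, restart the JournalNode and ZKFC first. They need no separate upgrade step. After they start, check their logs for errors.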
[hadoop@hadoop001 hadoop]$ hadoop-daemon.sh stop journalnode
stopping journalnode
[hadoop@hadoop001 hadoop]$ hadoop-daemon.sh start journalnode
starting journalnode, logging to /opt/beh/logs/hadoop/hadoop-hadoop-journalnode-hadoop001.out
[hadoop@hadoop001 hadoop]$ hadoop-daemon.sh stop zkfc
stopping zkfc
[hadoop@hadoop001 hadoop]$ hadoop-daemon.sh start zkfc
starting zkfc, logging to /opt/beh/logs/hadoop/hadoop-hadoop-zkfc-hadoop001.out
[hadoop@hadoop001 hadoop]$
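Checking the logs after the restart can be reduced to scanning the tail of each log for ERROR or FATAL entries. The sketch below assumes the usual log4j line layout ("date time LEVEL class: message") and assumes that each daemon writes a .log file next to the .out file shown above; recent_log_errors is a hypothetical helper name.

```shell
# Hypothetical helper: print ERROR/FATAL lines found in the last N lines
# (default 200) of a log file. Exit status is 0 if any were found and
# non-zero if the tail is clean.
recent_log_errors() {
    local file=$1 lines=${2:-200}
    tail -n "$lines" "$file" | grep -E ' (ERROR|FATAL) '
}

# Example use (assumed log file name, not executed here):
#   recent_log_errors /opt/beh/logs/hadoop/hadoop-hadoop-journalnode-hadoop001.log
```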
Upgrade the NameNode
[hadoop@hadoop001 hadoop]$ hdfs namenode -rollingUpgrade started # let it run until it leaves safe mode
The reported blocks 3200 has reached the threshold 1.0000 of total blocks 3200. The number of live datanodes 9 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically in 9 seconds.
19/09/11 16:43:55 INFO hdfs.StateChange: STATE* Leaving safe mode after 33 secs
19/09/11 16:43:55 INFO hdfs.StateChange: STATE* Safe mode is OFF
19/09/11 16:43:55 INFO hdfs.StateChange: STATE* Network topology has 1 racks and 9 datanodes
19/09/11 16:43:55 INFO hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
Press Ctrl+C to stop the foreground process, then start the NameNode as a daemon
[hadoop@hadoop001 current]$ hadoop-daemon.sh start namenode
Query the rolling upgrade status
[hadoop@hadoop001 hadoop]$ hdfs dfsadmin -rollingUpgrade query
QUERY rolling upgrade ...
Proceed with rolling upgrade:
Block Pool ID: BP-686481837-192.168.40.42-1563178388776
Start Time: Mon Sep 16 11:38:36 CST 2019 (=1568605116802)
Finalize Time:
After the upgrade, the NameNode startup log contains messages like the following:
2019-09-11 16:56:38,972 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Reported DataNode version '2.7.7' of DN DatanodeRegistration(0.0.0.0:50010, datanodeUuid=eb9a5cc2-e1e0-4d65-98c2-596a39336f36, infoPort=0, infoSecurePort=50475, ipcPort=50020, storageInfo=lv=-56;cid=CID-1f4dc3a9-7d17-46f7-9a0f-02578c683842;nsid=1047123487;c=0) does not match NameNode version '2.8.5'. Note: This is normal during a rolling upgrade.
# Upgrade the other NameNode by repeating the steps above. Remember to restart the JournalNode first after replacing the installation directory.
[hadoop@hadoop002 core]$ mv hadoop/ hadoop-2.7.7
[hadoop@hadoop001 core]$ scp -r hadoop hadoop@hadoop002:$PWD
[hadoop@hadoop002 hadoop]$ hadoop-daemon.sh stop journalnode
[hadoop@hadoop002 hadoop]$ hadoop-daemon.sh start journalnode
[hadoop@hadoop002 hadoop]$ hadoop-daemon.sh stop zkfc
[hadoop@hadoop002 hadoop]$ hadoop-daemon.sh start zkfc
[hadoop@hadoop002 current]$ hadoop-daemon.sh stop namenode
[hadoop@hadoop002 hadoop]$ hdfs haadmin -getAllServiceState # check that the NameNode failed over correctly
hadoop001:8020 active
19/09/16 17:54:02 INFO ipc.Client: Retrying connect to server: hadoop002:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
[hadoop@hadoop002 hadoop]$ hdfs namenode -rollingUpgrade started
[hadoop@hadoop002 current]$ hadoop-daemon.sh start namenode
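Upgrade the DataNodes
To upgrade a DataNode, first replace the Hadoop directory with the newer version while keeping the existing configuration files (that is, overwrite the new $HADOOP_HOME/etc/hadoop directory with the old one), then run the following: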
[hadoop@hadoop002 core]$ hdfs dfsadmin -shutdownDatanode hadoop003:50020 upgrade
Active NameNode log output:
2019-09-11 17:09:04,216 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(blk_1073766114_25448, newGS=25470, newLength=83, newNodes=[192.168.40.15:50010, 192.168.40.14:50010, 192.168.40.21:50010], client=DFSClient_NONMAPREDUCE_1761359517_14)
2019-09-11 17:09:04,217 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(blk_1073766114_25448 => blk_1073766114_25470) success
Once the upgrade shutdown has been processed, the DataNode stops its service
[hadoop@hadoop01 ~]$ hdfs dfsadmin -getDatanodeInfo hadoop003:50020 # check whether the DataNode has shut down, i.e. the command returns a connection error
On hadoop003, start the DataNode
[hadoop@hadoop003 ~]$ hadoop-daemon.sh start datanode
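The shutdown check above amounts to polling until getDatanodeInfo stops succeeding. A minimal sketch of that wait follows, assuming the command exits non-zero once the DataNode is unreachable; wait_until_fails is a hypothetical helper, and the retry count and delay in the usage comment are arbitrary.

```shell
# Hypothetical helper: re-run a command until it fails.
#   $1 = maximum number of tries, $2 = seconds to sleep between tries,
#   remaining arguments = the command to run.
# Returns 0 as soon as the command fails, 1 if it never failed.
wait_until_fails() {
    local max=$1 delay=$2
    shift 2
    local i=0
    while [ "$i" -lt "$max" ]; do
        if ! "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example use on a live cluster (not executed here):
#   wait_until_fails 30 2 hdfs dfsadmin -getDatanodeInfo hadoop003:50020 \
#       && ssh hadoop@hadoop003 'hadoop-daemon.sh start datanode'
```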
VERSION file before the upgrade (the file changes during the upgrade)
[hadoop@hadoop003 current]$ more VERSION
#Tue Aug 27 16:27:14 CST 2019
storageID=DS-824be616-eee5-4954-88a4-c752de40e7e2
clusterID=CID-1f4dc3a9-7d17-46f7-9a0f-02578c683842
cTime=0
datanodeUuid=ebf453db-e511-4c50-a76c-f87fd83db864
storageType=DATA_NODE
layoutVersion=-56
VERSION file after the upgrade
[hadoop@hadoop03 current]$ more VERSION
#Wed Sep 11 17:25:31 CST 2019
storageID=DS-824be616-eee5-4954-88a4-c752de40e7e2
clusterID=CID-1f4dc3a9-7d17-46f7-9a0f-02578c683842
cTime=0
datanodeUuid=ebf453db-e511-4c50-a76c-f87fd83db864
storageType=DATA_NODE
layoutVersion=-57
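The only field that differs between the two files, apart from the timestamp comment, is layoutVersion (-56 before, -57 after). A small sketch for reading it out of a VERSION file, for example to confirm which DataNodes have already been upgraded; layout_version is a hypothetical helper name.

```shell
# Hypothetical helper: print the layoutVersion value from a VERSION file.
layout_version() {
    sed -n 's/^layoutVersion=//p' "$1"
}

# Example use, run from the DataNode storage "current" directory
# (not executed here):
#   layout_version VERSION
```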
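Repeat the steps above until every DataNode in the cluster has been upgraded.
Notes on starting YARN
This applies to Kerberos-enabled clusters only; skip it otherwise. Change the ownership and permissions of the file below, then restart YARN. (Note that the Hadoop home directory must have mode 755; with 750, tenants cannot submit YARN jobs.)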
chown root:hadoop /opt/hadoop/bin/container-executor
chmod 6050 /opt/hadoop/bin/container-executor
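Because the installation directory is replaced during the upgrade, these permissions are easy to lose, so it can help to verify them on each node afterwards. The sketch below assumes GNU coreutils stat (Linux); has_mode is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if FILE has exactly the given octal mode.
# Relies on GNU stat, whose %a format prints the mode in octal.
has_mode() {
    [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Example use (not executed here):
#   has_mode /opt/hadoop/bin/container-executor 6050 || echo "wrong mode"
#   stat -c '%U:%G' /opt/hadoop/bin/container-executor   # expected: root:hadoop
```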
Distribute the Spark shuffle jar to every node in the cluster to support Hive on Spark
scp spark-2.0.0-yarn-shuffle.jar hadoop@hadoop01:/opt/hadoop/share/hadoop/yarn/lib
Restart the cluster's YARN service
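Finalize the rolling upgrade
Once finalize has been run, the rollback image in the metadata directories of both NameNodes is deleted and the cluster can no longer be rolled back to the previous version. It is advisable to let the cluster run on the new version for a while before running the command below.
[hadoop@hadoop01 hadoop]$ hdfs dfsadmin -rollingUpgrade finalize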
FINALIZE rolling upgrade ...
Rolling upgrade is finalized.
Block Pool ID: BP-261222913-172.16.13.12-1564812628651
Start Time: Wed Sep 11 15:41:21 CST 2019 (=1568187681874)
Finalize Time: Wed Sep 11 17:54:22 CST 2019 (=1568195662530)
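Whether a rollback is still possible can be checked by looking for fsimage_rollback_* files in the NameNode metadata directory, the same files that appeared after the prepare step. A minimal sketch; has_rollback_image is a hypothetical helper name and the directory in the usage comment is a placeholder for the current directory under dfs.namenode.name.dir.

```shell
# Hypothetical helper: succeeds if DIR contains at least one
# fsimage_rollback_* file, i.e. a rollback image still exists.
has_rollback_image() {
    local dir=$1 f
    for f in "$dir"/fsimage_rollback_*; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Example use (placeholder path, not executed here):
#   has_rollback_image /path/to/namenode/name/current \
#       && echo "rollback image present, not finalized yet"
```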
Appendix
DataNode upgrade script
#!/bin/bash
# Rolls the new Hadoop build out to every DataNode host listed in ~/datanode (one host per line).
CORE_HOME=/opt/beh/core
hosts=$(cat ~/datanode)
for host in $hosts
do
  # Show the running Java processes, then move the old installation aside.
  ssh "hadoop@$host" "source ~/.bashrc;
  echo ------------------------------------------------------------;
  jps;
  mv $CORE_HOME/hadoop $CORE_HOME/hadoop-2.7.7"
  # Copy the new build over, then restore the old configuration into it.
  scp -r ~/hadoop "hadoop@$host:$CORE_HOME"
  ssh "hadoop@$host" "rm -r $CORE_HOME/hadoop/etc/hadoop;
  cp -r $CORE_HOME/hadoop-2.7.7/etc/hadoop $CORE_HOME/hadoop/etc;
  sudo chown root:hadoop $CORE_HOME/hadoop/bin/container-executor;
  sudo chmod 6050 $CORE_HOME/hadoop/bin/container-executor"
  # Print the shutdown and check commands; this script does not run them itself.
  echo " hdfs dfsadmin -shutdownDatanode $host:50020 upgrade"
  echo " hdfs dfsadmin -getDatanodeInfo $host:50020"
  # Ask for confirmation before starting the DataNode and restarting the NodeManager.
  if whiptail --title "exec update line" --yesno " hdfs dfsadmin -shutdownDatanode $host:50020 upgrade" 10 60; then
    ssh "hadoop@$host" "source ~/.bashrc;
    hadoop-daemon.sh start datanode; yarn-daemon.sh stop nodemanager; yarn-daemon.sh start nodemanager;
    echo ------------------------------------------------------------;
    jps"
  else
    echo "no update"
  fi
done
References
http://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html