1 Detailed exception
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /wm1/link/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/003993.sst
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:181)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:245)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:562)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:609)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /wm1/link/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/003993.sst
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:950)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:937)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:210)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    ... 5 more
2020-01-06 10:14:24,136 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at ****。****
************************************************************/
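I found what looked like the relevant directory, /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state, which contained the following files:
    005615.sst
    005638.log
    005640.log
    CURRENT
    LOCK
    MANIFEST-004397
The file named in the exception, 003993.sst, is not among them. I removed all of these files and restarted the NodeManager, and it started successfully.

Looking back, the cause may have been this: while this NodeManager was stopped, I added a new NodeManager to the cluster, which increased the number of role instances. When the failed NodeManager was started again, it tried to recover from its stored state, found during the check against the database that the counts did not match, and failed to start. That is why I deleted the files in the directory above.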
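
The cleanup step above can also be done by moving the files aside instead of deleting them, so that they remain available for later inspection. The following is a minimal Python sketch under that assumption. The helper name `backup_and_clear_state_store` and the backup location are hypothetical and not part of Hadoop, and the NodeManager is assumed to be stopped while it runs.

```python
import shutil
import time
from pathlib import Path


def backup_and_clear_state_store(state_dir, backup_root):
    """Move every entry out of state_dir into a new timestamped backup directory.

    Hypothetical helper for illustration; it is not part of Hadoop.
    Returns the path of the backup directory that was created.
    """
    state_dir = Path(state_dir)
    if not state_dir.is_dir():
        raise FileNotFoundError(f"state store directory not found: {state_dir}")

    # A timestamped name keeps backups from separate runs apart.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = Path(backup_root) / f"yarn-nm-state-{stamp}"
    backup_dir.mkdir(parents=True, exist_ok=False)

    # Move the .sst, .log, CURRENT, LOCK and MANIFEST-* files alike,
    # leaving the state store directory itself in place but empty.
    for entry in sorted(state_dir.iterdir()):
        shutil.move(str(entry), str(backup_dir / entry.name))

    return backup_dir
```

With the NodeManager stopped, calling `backup_and_clear_state_store("/var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state", "/var/tmp/nm-state-backup")` leaves the directory empty, as in the manual fix; the second argument is an example path only. The backup root should sit outside the state store directory.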