• NameNode dies after its writes to the JournalNodes time out


    The cluster was perfectly fine yesterday, but this morning it was down again. Luckily it's only the test cluster at home...

    First I checked the NameNode state and found that only one of the two NameNodes was still alive, so I hurried over to the dead one and looked at the log under logs/:

    2016-08-09 16:33:51,526 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:33:52,169 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
    2016-08-09 16:33:52,526 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:33:53,527 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:33:54,529 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9004 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:33:55,530 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:33:56,531 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:33:57,533 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:33:58,533 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:33:59,534 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:34:00,536 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:34:01,537 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:34:02,538 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:34:03,540 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:34:04,541 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
    2016-08-09 16:34:05,525 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.80.248.17:8486, 10.80.248.18:8486, 10.80.248.19:8486], stream=QuorumOutputStream starting at txid 2947))
    java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1221)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1158)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1238)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6344)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:933)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
    2016-08-09 16:34:05,526 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 2947
    2016-08-09 16:34:05,600 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
    2016-08-09 16:34:05,733 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at ut07/10.80.248.17
    ************************************************************/

    Those are the key lines from hadoop-hadooptest-namenode-ut07.log at the moment the NameNode exited. They show the NameNode timing out while writing to the JournalNodes; the default timeout is 20 seconds, and once it fires the NameNode invokes the terminate method of ExitUtil, which boils down to a System.exit() of the whole process.
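
    To make that path concrete, here is a minimal, self-contained sketch -- my own simplification of what the stack trace shows, not the actual Hadoop source (the class and method bodies are made up; only the names AsyncLoggerSet.waitForWriteQuorum, FSEditLog and ExitUtil come from the trace). The writer gives sendEdits up to dfs.qjournal.write-txns.timeout.ms (the "timeout=20000 ms" in the log) for a majority of the three JournalNodes to ack; if the quorum never arrives, the resulting IOException is treated as a fatal flush failure of a required journal:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    // Simplified stand-in for AsyncLoggerSet.waitForWriteQuorum and the
    // fatal handling in FSEditLog.logSync -- shape only, not Hadoop code.
    public class QuorumWriteSketch {

        static final long WRITE_TXNS_TIMEOUT_MS = 20000; // the 20 s default in the log

        // Succeed once a majority (2 of 3) of the per-JournalNode acks have
        // arrived before the shared deadline; otherwise report a timeout.
        static void waitForWriteQuorum(List<CompletableFuture<Void>> acks)
                throws IOException {
            long deadline = System.currentTimeMillis() + WRITE_TXNS_TIMEOUT_MS;
            int succeeded = 0;
            for (CompletableFuture<Void> ack : acks) {
                try {
                    long remaining = Math.max(deadline - System.currentTimeMillis(), 0);
                    ack.get(remaining, TimeUnit.MILLISECONDS);
                    succeeded++;
                } catch (Exception e) {
                    // This JournalNode is slow or dead; one straggler is
                    // tolerated as long as a majority still acks in time.
                }
                if (succeeded > acks.size() / 2) {
                    return; // quorum reached: the flush succeeds
                }
            }
            throw new IOException("Timed out waiting " + WRITE_TXNS_TIMEOUT_MS
                    + "ms for a quorum of nodes to respond.");
        }

        public static void main(String[] args) {
            // One JournalNode acks immediately (10.80.248.17 in the log);
            // the other two never answer, so a 2-of-3 quorum is impossible.
            List<CompletableFuture<Void>> acks = Arrays.asList(
                    CompletableFuture.completedFuture(null),
                    new CompletableFuture<Void>(),
                    new CompletableFuture<Void>());
            try {
                waitForWriteQuorum(acks);
            } catch (IOException e) {
                // FSEditLog logs FATAL "flush failed for required journal" and
                // calls ExitUtil.terminate(1, ...), which is System.exit(1)
                // underneath -- hence "Exiting with status 1" in the log.
                System.err.println("FATAL: " + e.getMessage());
                System.exit(1);
            }
        }
    }

    Run it and, after the 20-second wait, it prints the FATAL line and exits with status 1, just like the NameNode did.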

    As for why a perfectly healthy cluster suddenly hits write timeouts, I'm not sure yet. Some people say large-file HDFS operations can trigger long full-GC pauses on the NameNode, which then make the JournalNode writes time out. But I barely touched HDFS between yesterday and today, so the root cause still needs more digging...

    Either way, let's get the cluster back up first; I still need it.

    In real production environments this kind of timeout is actually quite easy to hit, so it's worth raising the default 20 s timeout to something larger, say 60 s.

    We can add the following property to hdfs-site.xml under hadoop/etc/hadoop:

    <property>
        <name>dfs.qjournal.write-txns.timeout.ms</name>
        <value>60000</value>
    </property>

    I picked this configuration up from someone else's blog. Oddly enough, I couldn't find any mention of this property in the official hdfs-default.xml documentation on the Hadoop site...

    http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
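
    The key does exist, though -- just in the source instead of the docs: it is defined in org.apache.hadoop.hdfs.DFSConfigKeys with a default of 20000 ms. If you want to confirm your override was actually picked up after editing hdfs-site.xml, a tiny probe like this works (the class name is mine; run it with the Hadoop client jars and your conf directory on the classpath, e.g. via `hadoop CheckQjmTimeout` after compiling):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.HdfsConfiguration;

    // Prints the effective QJM write timeout, so you can confirm that the
    // hdfs-site.xml override above was picked up. HdfsConfiguration loads
    // hdfs-default.xml plus hdfs-site.xml from the classpath; the 20000 ms
    // fallback mirrors the hard-coded default in DFSConfigKeys.
    public class CheckQjmTimeout {
        public static void main(String[] args) {
            Configuration conf = new HdfsConfiguration();
            System.out.println("dfs.qjournal.write-txns.timeout.ms = "
                    + conf.getInt("dfs.qjournal.write-txns.timeout.ms", 20000)
                    + " ms");
        }
    }

    As far as I can tell from DFSConfigKeys, the QJM client also has a family of sibling timeouts next to this one (start-segment, finalize-segment, select-input-streams and friends) that share the same 20 s default and can be raised the same way if your JournalNodes are genuinely slow.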

    Finally, remember to restart the whole cluster so the new setting takes effect.

    Friendly reminder: if you're running Flume, remember to restart the Flume cluster too~

  • Original post: https://www.cnblogs.com/xyliao/p/5755438.html