• [Storm] java.io.FileNotFoundException: File '../stormconf.ser' does not exist


    This bug will kill supervisors

    Affects Version/s: 0.9.2-incubating, 0.9.3, 0.9.4 

    Fix Version/s: 0.10.0, 0.9.5

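    Background

    Recently I noticed that, not long after a new Storm cluster was set up, more than half of the Supervisors had quietly died. The logs of the dead Supervisors showed the exception java.io.FileNotFoundException: File '../stormconf.ser' does not exist. Most of the answers found online say:

        Clear the files under the { storm.local.dir } directory and restart.

    But this treats the symptom, not the cause: a restart gets things running again, yet it does not explain why the problem occurs.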
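
    The quoted workaround can be sketched in Python as follows. This is a minimal sketch, assuming the supervisor has already been stopped; the function name `clear_local_dir` is illustrative and not part of Storm, and the directory to pass in is whatever storm.local.dir points to in your configuration.

```python
import shutil
from pathlib import Path


def clear_local_dir(local_dir: str) -> int:
    """Delete everything inside local_dir but keep the directory itself.

    Returns the number of top-level entries removed.
    (Hypothetical helper for illustration; not a Storm API.)
    """
    root = Path(local_dir)
    removed = 0
    # Snapshot the listing first so deletions do not disturb the iteration.
    for entry in list(root.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed
```

    After the directory is emptied, the supervisor is restarted and downloads the topology code again.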

    It later turned out that STORM-130 addresses this problem. The scenario for reproducing it:

    1) Run a Storm cluster with at least 2 supervisors with 4 slots each
    2) Deploy a topology that uses 4 workers; the topology will be distributed so that each supervisor has two workers
    3) Kill one of the supervisors, let's say supervisor1
    4) Wait until the topology re-balances to occupy 4 workers on supervisor2
    5) Now bring up supervisor1; it goes through the cycle of cleaning up old topology code
    6) Nimbus re-balances the topology, which triggers the supervisor's sync-processes method
    7) sync-processes tries to launch a worker for the topology whose code data was deleted when the supervisor started, causing it to throw the exception above

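    Cause

    The sync-processes function mentioned in the scenario above is one of the functions run by the supervisor. The Supervisor runs these two functions in the background: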

    • synchronize-supervisor: This is called whenever assignments in Zookeeper change and also every 10 seconds. 
      • Downloads code from Nimbus for topologies assigned to this machine for which it doesn't have the code yet. 
      • Writes into local filesystem what this node is supposed to be running. It writes a map from port -> LocalAssignment. LocalAssignment contains a topology id as well as the list of task ids for that worker. 
    • sync-processes: Reads from the LFS what synchronize-supervisor wrote and compares that to what's actually running on the machine. It then starts/stops worker processes as necessary to synchronize. 
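
    The division of labour between the two functions can be sketched in Python. This is a simplified model under stated assumptions, not Storm's actual (Clojure) implementation: plain dicts stand in for ZooKeeper and the local filesystem state, a set stands in for downloaded code, and all names below are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalAssignment:
    """Illustrative stand-in: a topology id plus the task ids for one worker."""
    topology_id: str
    task_ids: tuple


def synchronize_supervisor(zk_assignments, downloaded_code, local_state):
    """Fetch missing code, then record the port -> LocalAssignment map."""
    for assignment in zk_assignments.values():
        if assignment.topology_id not in downloaded_code:
            # Stands in for downloading the topology code from Nimbus.
            downloaded_code.add(assignment.topology_id)
    # Write what this node is supposed to be running.
    local_state.clear()
    local_state.update(zk_assignments)


def sync_processes(local_state, running):
    """Compare local state with running workers.

    Returns (ports to start, ports to stop).
    """
    to_start = sorted(p for p, a in local_state.items() if running.get(p) != a)
    to_stop = sorted(p for p, a in running.items() if local_state.get(p) != a)
    return to_start, to_stop
```

    Note that the only link between the two functions in this model is `local_state`; nothing stops one from running while the other is acting on an older snapshot.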

    As the descriptions show, synchronize-supervisor and sync-processes coordinate through the LFS. The key reason for the bug is that the "synchronize-supervisor" thread, which is responsible for downloading and removing files, and the "sync-processes" thread, which is responsible for starting worker processes, run asynchronously.

    In synchronize-supervisor, the supervisor reads assignment information from ZooKeeper, downloads the necessary files from Nimbus, and writes local state. In another thread, the sync-processes function reads that local state to launch worker processes. If synchronize-supervisor is called again before the worker process has started, and the topology's assignment has changed in the meantime (caused by a rebalance, a worker timeout, etc.) so that the worker assigned to this supervisor has moved to another supervisor, synchronize-supervisor removes the now-unnecessary files (the jar file, the ser file, etc.). When the worker launched by "sync-processes" then starts, the ser file no longer exists, and this issue occurs.
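
    The interleaving described above can be replayed deterministically in Python. This is a standalone sketch rather than Storm code: a temporary directory stands in for storm.local.dir, the function names mirror the Storm functions only for readability, and the steps are run in sequence to force the unlucky ordering that two real threads would only hit occasionally.

```python
import shutil
import tempfile
from pathlib import Path


def synchronize_supervisor(assigned, code_root, local_state):
    """Download code for assigned topologies and delete code for all others."""
    for topo in assigned:
        topo_dir = code_root / topo
        topo_dir.mkdir(exist_ok=True)
        # Stands in for the files downloaded from Nimbus.
        (topo_dir / "stormconf.ser").write_bytes(b"conf")
    for topo_dir in list(code_root.iterdir()):
        if topo_dir.name not in assigned:
            shutil.rmtree(topo_dir)
    local_state["assigned"] = set(assigned)


def launch_worker(topo, code_root):
    """Read the topology conf; raises FileNotFoundError if it was removed."""
    return (code_root / topo / "stormconf.ser").read_bytes()


def replay_race():
    with tempfile.TemporaryDirectory() as tmp:
        code_root = Path(tmp)
        local_state = {}
        # 1. synchronize-supervisor sees the assignment and downloads the code.
        synchronize_supervisor({"topo-1"}, code_root, local_state)
        # 2. sync-processes reads local state and decides to launch a worker.
        pending = sorted(local_state["assigned"])
        # 3. Before the launch happens, the assignment moves away and
        #    synchronize-supervisor runs again, deleting the files.
        synchronize_supervisor(set(), code_root, local_state)
        # 4. The launch proceeds from the stale decision made in step 2.
        try:
            launch_worker(pending[0], code_root)
        except FileNotFoundError:
            return "FileNotFoundError"
        return "launched"
```

    Calling `replay_race()` ends in the `FileNotFoundError` branch, the same failure mode as the missing stormconf.ser in the supervisor log.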

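    Possible solutions

    • Switch to a different Storm version (the fix versions listed above are 0.9.5 and 0.10.0)
    • Tune parameters
      • Change the "synchronize-supervisor" thread loop time to something longer than the default of 10 sec, such as 30 sec.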
      • supervisor.worker.timeout.secs: 30 -> 5

    References:

    • https://issues.apache.org/jira/browse/STORM-130
    • http://storm.apache.org/documentation/Lifecycle-of-a-topology.html

     

  • Original article: https://www.cnblogs.com/qingwen/p/4997302.html