• YARN application max attempts in depth: smooth ApplicationMaster restarts for long-running services


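    When developing a long-running service on YARN, fault tolerance needs attention. This post examines one parameter that governs ApplicationMaster (AM) restarts and explains how to set it so that the AM can restart smoothly.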

    yarn-site.xml has the following parameter:

    /**
       * The maximum number of application attempts.
       * It's a global setting for all application masters.
       */
    yarn.resourcemanager.am.max-attempts

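    This is a global limit on the number of AM attempts. When submitting an application to YARN, you can also set a maximum number of attempts for that individual application: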

    /**
       * Set the number of max attempts of the application to be submitted. WARNING:
       * it should be no larger than the global number of max attempts in the Yarn
       * configuration.
       * @param maxAppAttempts the number of max attempts of the application
       * to be submitted.
       */
      @Public
      @Stable
      public abstract void setMaxAppAttempts(int maxAppAttempts);

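    When an attempt fails and the application was submitted with keep-containers-across-attempts enabled (getKeepContainersAcrossApplicationAttempts() in the code below), the ResourceManager (RM) decides whether the containers of the previous attempt are retained: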

    boolean keepContainersAcrossAppAttempts = false;
    switch (finalAttemptState) {
      case FINISHED:
      {
        appEvent = new RMAppFinishedAttemptEvent(applicationId,
            appAttempt.getDiagnostics());
      }
      break;
      case KILLED:
      {
        // don't leave the tracking URL pointing to a non-existent AM
        appAttempt.setTrackingUrlToRMAppPage();
        appAttempt.invalidateAMHostAndPort();
        appEvent =
            new RMAppFailedAttemptEvent(applicationId,
                RMAppEventType.ATTEMPT_KILLED,
                "Application killed by user.", false);
      }
      break;
      case FAILED:
      {
        // don't leave the tracking URL pointing to a non-existent AM
        appAttempt.setTrackingUrlToRMAppPage();
        appAttempt.invalidateAMHostAndPort();
    
        if (appAttempt.submissionContext
          .getKeepContainersAcrossApplicationAttempts()
            && !appAttempt.submissionContext.getUnmanagedAM()) {
          // See if we should retain containers for non-unmanaged applications
          if (!appAttempt.shouldCountTowardsMaxAttemptRetry()) {
            // Preemption, hardware failures and NM resync don't count towards
            // app-failures and so we should retain containers.
            keepContainersAcrossAppAttempts = true;
          } else if (!appAttempt.maybeLastAttempt) {
            // Not preemption, hardware failures or NM resync.
            // Not last-attempt too - keep containers.
            keepContainersAcrossAppAttempts = true;
          }
        }
        appEvent =
            new RMAppFailedAttemptEvent(applicationId,
              RMAppEventType.ATTEMPT_FAILED, appAttempt.getDiagnostics(),
              keepContainersAcrossAppAttempts);
    
      }
    }
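
The FAILED branch above boils down to a small boolean decision. The following is a minimal stand-alone sketch of that decision, not the RM's actual code: the class name KeepContainersModel and its parameter names are hypothetical, and the four inputs stand in for the values the RM reads from the submission context and the attempt.

```java
/** Hypothetical stand-alone model of the FAILED branch shown above. */
final class KeepContainersModel {

    /**
     * @param keepFlagSet          value of getKeepContainersAcrossApplicationAttempts()
     * @param unmanagedAm          value of getUnmanagedAM()
     * @param countsTowardsLimit   value of shouldCountTowardsMaxAttemptRetry()
     * @param maybeLastAttempt     the attempt's maybeLastAttempt flag
     * @return whether containers of the failed attempt are retained
     */
    static boolean keepContainers(boolean keepFlagSet, boolean unmanagedAm,
                                  boolean countsTowardsLimit, boolean maybeLastAttempt) {
        if (!keepFlagSet || unmanagedAm) {
            return false; // feature not requested, or AM is unmanaged
        }
        if (!countsTowardsLimit) {
            return true;  // preemption, hardware failure, NM resync
        }
        return !maybeLastAttempt; // ordinary failure: keep unless this may be the last attempt
    }

    public static void main(String[] args) {
        System.out.println(keepContainers(true, false, true, false)); // prints "true"
        System.out.println(keepContainers(true, false, true, true));  // prints "false"
        System.out.println(keepContainers(true, false, false, true)); // prints "true"
        System.out.println(keepContainers(false, false, true, false)); // prints "false"
    }
}
```

The second call is the case that matters for long-running services: an ordinary failure on an attempt that may be the last one does not retain containers.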

    Note the variable appAttempt.maybeLastAttempt. How does the RM decide whether the current attempt is the last one?

    private void createNewAttempt() {
      ApplicationAttemptId appAttemptId =
          ApplicationAttemptId.newInstance(applicationId, attempts.size() + 1);
      RMAppAttempt attempt =
          new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
            submissionContext, conf,
            // The newly created attempt maybe last attempt if (number of
            // previously failed attempts(which should not include Preempted,
            // hardware error and NM resync) + 1) equal to the max-attempt
            // limit.
            maxAppAttempts == (getNumFailedAppAttempts() + 1), amReq);
      attempts.put(appAttemptId, attempt);
      currentAttempt = attempt;
    }

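    Each time a new attempt is constructed, the expression maxAppAttempts == (getNumFailedAppAttempts() + 1) checks whether the number of failed attempts so far, plus one for the new attempt, has reached the maxAppAttempts limit.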
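
To see how this flag evolves over an application's lifetime, here is a small stand-alone simulation. It is only a sketch: LastAttemptModel and flagsAtCreation are hypothetical names, and each entry of the input array says whether that attempt's failure counts towards the limit (false models preemption, hardware failure or NM resync).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the maybeLastAttempt flag computed in createNewAttempt(). */
final class LastAttemptModel {

    /** Same comparison as in createNewAttempt(): counted failures + 1 == limit. */
    static boolean maybeLastAttempt(int maxAppAttempts, int countedFailures) {
        return maxAppAttempts == countedFailures + 1;
    }

    /**
     * Returns the flag each attempt receives when it is created.
     * failureCounts[i] is true if attempt i+1 fails in a way that counts
     * towards the limit, false otherwise.
     */
    static List<Boolean> flagsAtCreation(int maxAppAttempts, boolean[] failureCounts) {
        List<Boolean> flags = new ArrayList<>();
        int countedFailures = 0;
        for (boolean counts : failureCounts) {
            flags.add(maybeLastAttempt(maxAppAttempts, countedFailures));
            if (counts) {
                countedFailures++;
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        // Limit 3; attempt 2 is preempted, so it does not consume the budget.
        boolean[] history = {true, false, true, true};
        System.out.println(flagsAtCreation(3, history)); // prints "[false, false, false, true]"
    }
}
```

With a limit of 3, the preempted second attempt does not consume the budget, so only the fourth attempt is flagged as possibly the last.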

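    maxAppAttempts itself is derived from the global and the per-application settings. A valid per-application value (between 1 and the global value) is used as is; otherwise the global value applies. In effect this is the minimum of the two: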

    int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
        YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS);
    int individualMaxAppAttempts = submissionContext.getMaxAppAttempts();
    if (individualMaxAppAttempts <= 0 ||
        individualMaxAppAttempts > globalMaxAppAttempts) {
      this.maxAppAttempts = globalMaxAppAttempts;
      LOG.warn("The specific max attempts: " + individualMaxAppAttempts
          + " for application: " + applicationId.getId()
          + " is invalid, because it is out of the range [1, "
          + globalMaxAppAttempts + "]. Use the global max attempts instead.");
    } else {
      this.maxAppAttempts = individualMaxAppAttempts;
    }
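
The range check above can be reproduced without any Hadoop dependency. The sketch below uses a hypothetical class EffectiveMaxAttempts with a resolve method that applies the same condition, which makes it easy to see what happens when the per-application value is larger than the global one.

```java
/** Hypothetical stand-alone model of how the effective max-attempts value is chosen. */
final class EffectiveMaxAttempts {

    /**
     * A per-application value outside [1, globalMax] is ignored and the
     * global value is used instead; otherwise the per-application value wins.
     */
    static int resolve(int globalMax, int individualMax) {
        if (individualMax <= 0 || individualMax > globalMax) {
            return globalMax;
        }
        return individualMax;
    }

    public static void main(String[] args) {
        System.out.println(resolve(2, 10000));     // prints "2": request exceeds the global cap
        System.out.println(resolve(10000, 10000)); // prints "10000"
        System.out.println(resolve(10000, 3));     // prints "3"
        System.out.println(resolve(10000, 0));     // prints "10000": per-application value not set
    }
}
```

The first call shows why raising only the per-application value is not enough: a small global limit silently overrides it.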

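    Summary:

    If you want the AM to keep restarting and to take over the containers of the previous attempt, set yarn.resourcemanager.am.max-attempts to a large value, for example 10000. Containers are not retained when a failing attempt may be the last one, so the limit needs to stay well above the number of counted failures. When submitting the application, also set the maximum attempts on the submission context, along with the window after which the failure count is refreshed. This is generally enough for a long-running service on YARN.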
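
The failure window mentioned in the summary is not shown in the excerpts above. The sketch below is an assumption about how such a window behaves, not code from the post: it supposes that a failure only counts if it finished within the last validityIntervalMs milliseconds, and that a non-positive interval disables the window. FailureWindowModel and all of its parameter names are hypothetical.

```java
/** Hypothetical model of a failure-validity window (assumed behavior, see lead-in). */
final class FailureWindowModel {

    /**
     * Counts failures that are still "fresh".
     *
     * @param failureFinishTimesMs finish times of counted failed attempts, in ms
     * @param nowMs                current time in ms
     * @param validityIntervalMs   window length in ms; <= 0 means no window (assumption)
     */
    static int countedFailures(long[] failureFinishTimesMs, long nowMs,
                               long validityIntervalMs) {
        int counted = 0;
        for (long finish : failureFinishTimesMs) {
            if (validityIntervalMs <= 0 || finish > nowMs - validityIntervalMs) {
                counted++;
            }
        }
        return counted;
    }

    public static void main(String[] args) {
        long[] failures = {0L, 1_000L, 50_000L};
        System.out.println(countedFailures(failures, 60_000L, 20_000L)); // prints "1"
        System.out.println(countedFailures(failures, 60_000L, 0L));      // prints "3"
    }
}
```

Under this assumption, old failures age out of the count, so a service that fails only occasionally never approaches the max-attempts limit.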

  • Original article: https://www.cnblogs.com/yanghuahui/p/4911276.html