别被官方文档迷惑了！这篇文章帮你详解yarn公平调度

欢迎大家前往腾讯云+社区，获取更多腾讯海量技术实践干货哦~

本文由@edwinhzhang发表于云+社区专栏

FairScheduler是yarn常用的调度器，但是仅仅参考官方文档，有很多参数和概念文档里没有详细说明，但是这些参明显会影响到集群的正常运行。本文的主要目的是通过梳理代码将关键参数的功能理清楚。下面列出官方文档中常用的参数：

yarn.scheduler.fair.preemption.cluster-utilization-threshold	The utilization threshold after which preemption kicks in. The utilization is computed as the maximum ratio of usage to capacity among all resources. Defaults to 0.8f.
yarn.scheduler.fair.update-interval-ms	The interval at which to lock the scheduler and recalculate fair shares, recalculate demand, and check whether anything is due for preemption. Defaults to 500 ms.
maxAMShare	limit the fraction of the queue’s fair share that can be used to run application masters. This property can only be used for leaf queues. For example, if set to 1.0f, then AMs in the leaf queue can take up to 100% of both the memory and CPU fair share. The value of -1.0f will disable this feature and the amShare will not be checked. The default value is 0.5f.
minSharePreemptionTimeout	number of seconds the queue is under its minimum share before it will try to preempt containers to take resources from other queues. If not set, the queue will inherit the value from its parent queue.
fairSharePreemptionTimeout	number of seconds the queue is under its fair share threshold before it will try to preempt containers to take resources from other queues. If not set, the queue will inherit the value from its parent queue.
fairSharePreemptionThreshold	If the queue waits fairSharePreemptionTimeout without receiving fairSharePreemptionThreshold*fairShare resources, it is allowed to preempt containers to take resources from other queues. If not set, the queue will inherit the value from its parent queue.

在上述参数描述中，timeout等参数值没有给出默认值，没有告知不设置会怎样。minShare,fairShare等概念也没有说清楚，很容易让人云里雾里。关于这些参数和概念的详细解释，在下面的分析中一一给出。

FairScheduler整体结构

图（1） FairScheduler 运行流程图

公平调度器的运行流程就是RM去启动FairScheduler,SchedulerDispatcher两个服务，这两个服务各自负责update线程，handle线程。

update线程有两个任务：（1）更新各个队列的资源（Instantaneous Fair Share）,（2）判断各个leaf队列是否需要抢占资源（如果开启抢占功能）

handle线程主要是处理一些事件响应，比如集群增加节点，队列增加APP，队列删除APP，APP更新container等。

FairScheduler类图

图（2） FairScheduler相关类图

队列继承模块：yarn通过树形结构来管理队列。从管理资源角度来看，树的根节点root队列（FSParentQueue）,非根节点（FSParentQueue），叶子节点（FSLeaf）,app任务（FSAppAttempt，公平调度器角度的App）都是抽象的资源，它们都实现了Schedulable接口，都是一个可调度资源对象。它们都有自己的fair share（队列的资源量）方法（这里又用到了fair share概念），weight属性（权重）、minShare属性（最小资源量）、maxShare属性(最大资源量)，priority属性(优先级)、resourceUsage属性（资源使用量属性）以及资源需求量属性(demand)，同时也都实现了preemptContainer抢占资源的方法，assignContainer方法（为一个ACCEPTED的APP分配AM的container）。

public interface Schedulable {
  /**
   * Name of job/queue, used for debugging as well as for breaking ties in
   * scheduling order deterministically.
   */
  public String getName();

  /**
   * Maximum number of resources required by this Schedulable. This is defined as
   * number of currently utilized resources + number of unlaunched resources (that
   * are either not yet launched or need to be speculated).
   */
  public Resource getDemand();

  /** Get the aggregate amount of resources consumed by the schedulable. */
  public Resource getResourceUsage();

  /** Minimum Resource share assigned to the schedulable. */
  public Resource getMinShare();

  /** Maximum Resource share assigned to the schedulable. */
  public Resource getMaxShare();

  /** Job/queue weight in fair sharing. */
  public ResourceWeights getWeights();

  /** Start time for jobs in FIFO queues; meaningless for QueueSchedulables.*/
  public long getStartTime();

 /** Job priority for jobs in FIFO queues; meaningless for QueueSchedulables. */
  public Priority getPriority();

  /** Refresh the Schedulable's demand and those of its children if any. */
  public void updateDemand();

  /**
   * Assign a container on this node if possible, and return the amount of
   * resources assigned.
   */
  public Resource assignContainer(FSSchedulerNode node);

  /**
   * Preempt a container from this Schedulable if possible.
   */
  public RMContainer preemptContainer();

  /** Get the fair share assigned to this Schedulable. */
  public Resource getFairShare();

  /** Assign a fair share to this Schedulable. */
  public void setFairShare(Resource fairShare);
}

队列运行模块：从类图角度描述公平调度的工作原理。SchedulerEventDispatcher类负责管理handle线程。FairScheduler类管理update线程，通过QueueManager获取所有队列信息。

我们从Instantaneous Fair Share 和Steady Fair Share 这两个yarn的基本概念开始进行代码分析。

Fair Share指的都是Yarn根据每个队列的权重、最大，最小可运行资源计算的得到的可以分配给这个队列的最大可用资源。本文描述的是公平调度，公平调度的默认策略FairSharePolicy的规则是single-resource，即只关注内存资源这一项指标。

Steady Fair Share：是每个队列内存资源量的固定理论值。Steady Fair Share在RM初期工作后不再轻易改变，只有后续在增加节点（addNode）时才会重新计算。RM的初期工作也是handle线程把集群的每个节点添加到调度器中（addNode）。

Instantaneous Fair Share：是每个队列的内存资源量的实际值，是在动态变化的。yarn里的fair share如果没有专门指代，都是指的的Instantaneous Fair Share。

1 Steady Fair Share计算方式

图（3） steady fair share 计算流程

handle线程如果接收到NODE_ADDED事件，会去调用addNode方法。

  private synchronized void addNode(RMNode node) {
    FSSchedulerNode schedulerNode = new FSSchedulerNode(node, usePortForNodeName);
    nodes.put(node.getNodeID(), schedulerNode);
    //将该节点的内存加入到集群总资源
    Resources.addTo(clusterResource, schedulerNode.getTotalResource());
    //更新available资源
    updateRootQueueMetrics();
    //更新一个container的最大分配，就是UI界面里的MAX（如果没有记错的话）
    updateMaximumAllocation(schedulerNode, true);

    //设置root队列的steadyFailr=clusterResource的总资源
    queueMgr.getRootQueue().setSteadyFairShare(clusterResource);
    //重新计算SteadyShares
    queueMgr.getRootQueue().recomputeSteadyShares();
    LOG.info("Added node " + node.getNodeAddress() +
        " cluster capacity: " + clusterResource);
  }

recomputeSteadyShares 使用广度优先遍历计算每个队列的内存资源量，直到叶子节点。

 public void recomputeSteadyShares() {
    //广度遍历整个队列树
    //此时getSteadyFairShare 为clusterResource
    policy.computeSteadyShares(childQueues, getSteadyFairShare());
    for (FSQueue childQueue : childQueues) {
      childQueue.getMetrics().setSteadyFairShare(childQueue.getSteadyFairShare());
      if (childQueue instanceof FSParentQueue) {
        ((FSParentQueue) childQueue).recomputeSteadyShares();
      }
    }
  }

computeSteadyShares方法计算每个队列应该分配到的内存资源，总体来说是根据每个队列的权重值去分配，权重大的队列分配到的资源更多，权重小的队列分配到得资源少。但是实际的细节还会受到其他因素影响，是因为每队列有minResources和maxResources两个参数来限制资源的上下限。computeSteadyShares最终去调用computeSharesInternal方法。比如以下图为例：

图中的数字是权重，假如有600G的总资源，parent=300G,leaf1=300G,leaf2=210G,leaf3=70G。

图（4） yarn队列权重

computeSharesInternal方法概括来说就是通过二分查找法寻找到一个资源比重值R（weight-to-slots），使用这个R为每个队列分配资源（在该方法里队列的类型是Schedulable,再次说明队列是一个资源对象），公式是steadyFairShare=R * QueueWeights。

computeSharesInternal是计算Steady Fair Share 和Instantaneous Fair Share共用的方法，根据参数isSteadyShare来区别计算。

之所以要做的这么复杂，是因为队列不是单纯的按照比例来分配资源的（单纯按权重比例，需要maxR,minR都不设置。maxR的默认值是0x7fffffff，minR默认值是0）。如果设置了maxR,minR,按比例分到的资源小于minR,那么必须满足minR。按比例分到的资源大于maxR，那么必须满足maxR。因此想要找到一个R（weight-to-slots）来尽可能满足：

R*（Queue1Weights + Queue2Weights+...+QueueNWeights） <=totalResource
R*QueueWeights >= minShare
R*QueueWeights <= maxShare

注：QueueNWeights为队列各自的权重，minShare和maxShare即各个队列的minResources和maxResources

computcomputeSharesInternal详细来说分为四个步骤：

确定可用资源：totalResources = min(totalResources-takenResources(fixedShare), totalMaxShare)
确定R上下限
二分查找法逼近R
使用R设置fair Share

  private static void computeSharesInternal(
      Collection<? extends Schedulable> allSchedulables,
      Resource totalResources, ResourceType type, boolean isSteadyShare) {

    Collection<Schedulable> schedulables = new ArrayList<Schedulable>();
    //第一步
    //排除有固定资源不能动的队列,并得出固定内存资源
    int takenResources = handleFixedFairShares(
        allSchedulables, schedulables, isSteadyShare, type);

    if (schedulables.isEmpty()) {
      return;
    }
    // Find an upper bound on R that we can use in our binary search. We start
    // at R = 1 and double it until we have either used all the resources or we
    // have met all Schedulables' max shares.
    int totalMaxShare = 0;
    //遍历schedulables（非固定fixed队列），将各个队列的资源相加得到totalMaxShare
    for (Schedulable sched : schedulables) {
      int maxShare = getResourceValue(sched.getMaxShare(), type);
      totalMaxShare = (int) Math.min((long)maxShare + (long)totalMaxShare,
          Integer.MAX_VALUE);
      if (totalMaxShare == Integer.MAX_VALUE) {
        break;
      }
    }
    //总资源要减去fiexd share
    int totalResource = Math.max((getResourceValue(totalResources, type) -
        takenResources), 0);
    //队列所拥有的最大资源是有集群总资源和每个队列的MaxResource双重限制
    totalResource = Math.min(totalMaxShare, totalResource);
    //第二步:设置R的上下限
    double rMax = 1.0;
    while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
        < totalResource) {
      rMax *= 2.0;
    }

    //第三步：二分法逼近合理R值
    // Perform the binary search for up to COMPUTE_FAIR_SHARES_ITERATIONS steps
    double left = 0;
    double right = rMax;
    for (int i = 0; i < COMPUTE_FAIR_SHARES_ITERATIONS; i++) {
      double mid = (left + right) / 2.0;
      int plannedResourceUsed = resourceUsedWithWeightToResourceRatio(
          mid, schedulables, type);
      if (plannedResourceUsed == totalResource) {
        right = mid;
        break;
      } else if (plannedResourceUsed < totalResource) {
        left = mid;
      } else {
        right = mid;
      }
    }
    //第四步：使用R值设置，确定各个非fixed队列的fairShar,意味着只有活跃队列可以分资源
    // Set the fair shares based on the value of R we've converged to
    for (Schedulable sched : schedulables) {
      if (isSteadyShare) {
        setResourceValue(computeShare(sched, right, type),
            ((FSQueue) sched).getSteadyFairShare(), type);
      } else {
        setResourceValue(
            computeShare(sched, right, type), sched.getFairShare(), type);
      }
    }
  }

(1) 确定可用资源

handleFixedFairShares方法来统计出所有fixed队列的fixed内存资源（fixedShare）相加，并且fixed队列排除掉不得瓜分系统资源。yarn确定fixed队列的标准如下：

  private static int getFairShareIfFixed(Schedulable sched,
      boolean isSteadyShare, ResourceType type) {

    //如果队列的maxShare <=0  则是fixed队列，fixdShare=0
    if (getResourceValue(sched.getMaxShare(), type) <= 0) {
      return 0;
    }

    //如果是计算Instantaneous Fair Share,并且该队列内没有APP再跑，
    // 则是fixed队列，fixdShare=0
    if (!isSteadyShare &&
        (sched instanceof FSQueue) && !((FSQueue)sched).isActive()) {
      return 0;
    }

    //如果队列weight<=0,则是fixed队列
    //如果对列minShare <=0,fixdShare=0,否则fixdShare=minShare
    if (sched.getWeights().getWeight(type) <= 0) {
      int minShare = getResourceValue(sched.getMinShare(), type);
      return (minShare <= 0) ? 0 : minShare;
    }

    return -1;
  }

(2)确定R上下限

R的下限为1.0，R的上限是由resourceUsedWithWeightToResourceRatio方法来确定。该方法确定的资源值W，第一步中确定的可用资源值T：W>=T时，R才能确定。

//根据R值去计算每个队列应该分配的资源
  private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
      Collection<? extends Schedulable> schedulables, ResourceType type) {
    int resourcesTaken = 0;
    for (Schedulable sched : schedulables) {
      int share = computeShare(sched, w2rRatio, type);
      resourcesTaken += share;
    }
    return resourcesTaken;
  }
 private static int computeShare(Schedulable sched, double w2rRatio,
      ResourceType type) {
    //share=R*weight,type是内存
    double share = sched.getWeights().getWeight(type) * w2rRatio;
    share = Math.max(share, getResourceValue(sched.getMinShare(), type));
    share = Math.min(share, getResourceValue(sched.getMaxShare(), type));
    return (int) share;
  }

（3）二分查找法逼近R

满足下面两个条件中的一个即可终止二分查找：

W == T(步骤2中的W和T)
超过25次（COMPUTE_FAIR_SHARES_ITERATIONS）

（4）使用R设置fair share

设置fair share时，可以看到区分了Steady Fair Share 和Instantaneous Fair Share。

  for (Schedulable sched : schedulables) {
      if (isSteadyShare) {
        setResourceValue(computeShare(sched, right, type),
            ((FSQueue) sched).getSteadyFairShare(), type);
      } else {
        setResourceValue(
            computeShare(sched, right, type), sched.getFairShare(), type);
      }
    }

2 Instaneous Fair Share计算方式

图（5）Instaneous Fair Share 计算流程

该计算方式与steady fair的计算调用栈是一致的，最终都要使用到computeSharesInternal方法，唯一不同的是计算的时机不一样。steady fair只有在addNode的时候才会重新计算一次，而Instantaneous Fair Share是由update线程定期去更新。

此处强调的一点是，在上文中我们已经分析如果是计算Instantaneous Fair Share，并且队列为空，那么该队列就是fixed队列，也就是非活跃队列，那么计算fair share时，该队列是不会去瓜分集群的内存资源。

而update线程的更新频率就是由 yarn.scheduler.fair.update-interval-ms来决定的。

private class UpdateThread extends Thread {

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          //yarn.scheduler.fair.update-interval-ms
          Thread.sleep(updateInterval);
          long start = getClock().getTime();
          // 更新Instantaneous Fair Share
          update();
          //抢占资源
          preemptTasksIfNecessary();
          long duration = getClock().getTime() - start;
          fsOpDurations.addUpdateThreadRunDuration(duration);
        } catch (InterruptedException ie) {
          LOG.warn("Update thread interrupted. Exiting.");
          return;
        } catch (Exception e) {
          LOG.error("Exception in fair scheduler UpdateThread", e);
        }
      }
    }
  }

3 maxAMShare意义

handle线程如果接收到NODE_UPDATE事件，如果（1）该node的机器内存资源满足条件，（2）并且有ACCEPTED状态的Application，那么将会为该待运行的APP的AM分配一个container，使该APP在所处的queue中跑起来。但在分配之前还需要一道检查canRuunAppAM。能否通过canRuunAppAM,就是由maxAMShare参数限制。

  public boolean canRunAppAM(Resource amResource) {
    //默认是0.5f
    float maxAMShare =
        scheduler.getAllocationConfiguration().getQueueMaxAMShare(getName());
    if (Math.abs(maxAMShare - -1.0f) < 0.0001) {
      return true;
    }
    //该队的maxAMResource=maxAMShare * fair share(Instantaneous Fair Share)
    Resource maxAMResource = Resources.multiply(getFairShare(), maxAMShare);
    //amResourceUsage是该队列已经在运行的App的AM所占资源累加和
    Resource ifRunAMResource = Resources.add(amResourceUsage, amResource);
    //查看当前ifRunAMResource是否超过maxAMResource
    return !policy
        .checkIfAMResourceUsageOverLimit(ifRunAMResource, maxAMResource);
  }

上面代码我们用公式来描述：

队列中运行的APP为An，每个APP的AM占用资源为R
ACCEPTED状态（待运行）的APP的AM大小为R1
队列的fair share为QueFS
队列的maxAMResource=maxAMShare * QueFS
ifRunAMResource=A1.R+A2.R+...+An.R+R1
ifRunAMResource > maxAMResource，则该队列不能接纳待运行的APP

之所以要关注这个参数，是因为EMR很多客户在使用公平队列时会反映集群的总资源没有用满，但是还有APP在排队，没有跑起来，如下图所示：

图（6） APP阻塞实例

公平调度默认策略不关心Core的资源，只关心Memory。图中Memory用了292G，还有53.6G的内存没用，APP就可以阻塞。原因就是default队列所有运行中APP的AM资源总和超过了（345.6 * 0.5），导致APP阻塞。

总结

通过分析fair share的计算流程，搞清楚yarn的基本概念和部分参数，从下面的表格对比中，我们也可以看到官方的文档对概念和参数的描述是比较难懂的。剩余的参数放在第二篇-公平调度之抢占中分析。

	官方描述	总结
Steady Fair Share	The queue’s steady fair share of resources. These shares consider all the queues irrespective of whether they are active (have running applications) or not. These are computed less frequently and change only when the configuration or capacity changes.They are meant to provide visibility into resources the user can expect, and hence displayed in the Web UI.	每个非fixed队列内存资源量的固定理论值。Steady Fair Share在RM初期工作后不再轻易改变，只有后续在增加节点改编配置（addNode）时才会重新计算。RM的初期工作也是handle线程把集群的每个节点添加到调度器中（addNode）。
Instantaneous Fair Share	The queue’s instantaneous fair share of resources. These shares consider only actives queues (those with running applications), and are used for scheduling decisions. Queues may be allocated resources beyond their shares when other queues aren’t using them. A queue whose resource consumption lies at or below its instantaneous fair share will never have its containers preempted.	每个非fixed队列(活跃队列)的内存资源量的实际值，是在动态变化的，由update线程去定时更新队列的fair share。yarn里的fair share如果没有专门指代，都是指的的Instantaneous Fair Share。
yarn.scheduler.fair.update-interval-ms	The interval at which to lock the scheduler and recalculate fair shares, recalculate demand, and check whether anything is due for preemption. Defaults to 500 ms.	update线程的间隔时间，该线程的工作是1更新fair share，2检查是否需要抢占资源。
maxAMShare	limit the fraction of the queue’s fair share that can be used to run application masters. This property can only be used for leaf queues. For example, if set to 1.0f, then AMs in the leaf queue can take up to 100% of both the memory and CPU fair share. The value of -1.0f will disable this feature and the amShare will not be checked. The default value is 0.5f.	队列所有运行中的APP的AM资源总和必须不能超过maxAMShare * fair share

问答
如何将yarn 升级到特定版本？
相关阅读
Yarn与Mesos
Spark on Yarn | Spark，从入门到精通
 YARN三大模块介绍
 【每日课程推荐】机器学习实战！快速入门在线广告业务及CTR相应知识

此文已由作者授权腾讯云+社区发布，更多原文请点击

搜索关注公众号「云加社区」，第一时间获取技术干货，关注后回复1024 送你一份技术课程大礼包！

海量技术实践经验，尽在云加社区！

相关阅读:
LeetCode 64. 最小路径和
 LeetCode 344. 反转字符串
 LeetCode 162. 寻找峰值
 LeetCode 227. 基本计算器 II
LeetCode 232. 用栈实现队列
 LeetCode 160. 相交链表
 LeetCode 112. 路径总和
 谈谈反作弊风控的实践经验
 LeetCode 704. 二分查找
 Hive SQL rank()/dense_rank()/row_number()的区别
原文地址：https://www.cnblogs.com/qcloud1001/p/9753183.html