Kubernetes 学习（九）Kubernetes 源码阅读之正式篇------核心组件之 Scheduler

0. 前言

继续上一篇博客阅读 Kubernetes 源码，参照《k8s 源码阅读》首先学习 Kubernetes 的一些核心组件，首先是 kube-scheduler
本文严重参考原文：《k8s 源码阅读》之 2.2 章节：scheduler，加入部分自己阅读的体会作为自己的阅读笔记
感谢《k8s 源码阅读》的作者们辛苦编写教材，在此郑重表示感谢，望大家多多支持！~

1. 整体设计

1.1 概述

官网描述：
- The Kubernetes scheduler runs as a process alongside the other master components such as the API server.
- Its interface to the API server is to watch for Pods with an empty PodSpec.NodeName, and for each Pod,
- it posts a binding indicating where the Pod should be scheduled.
Scheduler 是相对独立的一个组件，主动访问 API server，寻找等待调度的 Pod（PodSpec.NodeName 为空）
然后通过一系列调度算法寻找哪个 Node 适合跑这个 Pod
然后将这个 Pod 和 Node 的绑定关系发给 API server，即整个调度流程

1.2 源码层次

cmd/kube-scheduler/scheduler.go：main() 函数入口位置，在 Scheduler 过程开始前的一系列初始化工作
pkg/scheduler/scheduler.go：调度框架整体逻辑，抽象调用各个算法的 Interface
pkg/scheduler/core/generic_scheduler.go：计算哪些 Node适合跑哪些 Pod 的具体算法

1.3 调度流程

通过一系列的 Predicates（预选）过滤掉不能运行 Pod 的 Node
- 比如一个 Pod 需要 500M 的内存，有些节点剩余内存只有 100M 了，就会被剔除
通过一系列的 Priorities（优选）给剩下的 Node 排一个分数，寻找能够运行 Pod 的 Node 中最合适的一个
得分最高的一个 Node 胜出，获得了运行 Pod 的资格
若是上述预选过程中没有找到任何一个合适的 Node，则进入抢占模式，移除部分低优先级的 Pod，再选择 Node

1.4 Predicates 和 Priorities 策略

Predicates 是用于过滤不合适 Node的策略
Priorities 是用于计算 Node 分数的策略（作用在通过 Predicates 过滤的 Node上）
Kubernetes 默认内建了一些 Predicates 和 Priorities 策略，代码分别在：
- pkg/scheduler/algorithm/predicates/predicates.go
- pkg/scheduler/algorithm/priorities/

1.5 调度策略的修改

默认调度策略是通过 defaultPredicates() 和 defaultPriorities() 函数定义的
代码在 pkg/scheduler/algorithmprovider/defaults/defaults.go
可以通过命令行 flag --policy-config-file 来覆盖默认行为
- 可以修改 pkg/scheduler/algorithm/predicates/predicates.go 和 /pkg/scheduler/algorithm/priorities/，然后注册到 defaultPredicates() 和 defaultPriorities() 来实现
- 或者通过配置文件：

{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
    {"name" : "PodFitsHostPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
    ],
"priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
    ],
"hardPodAffinitySymmetricWeight" : 10,
"alwaysCheckAllPredicates" : false
}

View Code

2. 启动前逻辑

Scheduler 可以分为三层，第一层是调度器启动前的逻辑，包括：命令行参数解析、参数校验、调度器初始化等一系列逻辑

2.1 Cobra

2.1.1 简介

Cobra 既是一个创建强大的现代化命令行程序的库，又是一个用于生成应用和命令行文件的程序。有很多流行的 Golang 项目用了 Cobra（Kubernetes 和 Docker）

2.1.2 具体使用

Go 1.11 版本以上先暂时关闭 go modules，安装 Cobra：go get -u github.com/spf13/cobra/cobra
- 若是网卡，可能导致安装不成功，可以先下载报错的依赖包，再重新 go get 安装
- 安装成功后会看到 $GOPATH/
进入 $GOPATH/src 目录，初始化项目：cobra init myapp --pkg-name myapp

$ cobra init myapp --pkg-name myapp
Your Cobra applicaton is ready at
/home/ao/go/src/myapp
$ ls myapp 
cmd  LICENSE  main.go
$ pwd
/home/ao/go/src

View Code

本地可以看到一个 main.go 和一个 cmd 目录
查看 cmd/root.go：

package cmd

import (
  "fmt"
  "os"
  "github.com/spf13/cobra"

  homedir "github.com/mitchellh/go-homedir"
  "github.com/spf13/viper"

)


var cfgFile string


// rootCmd represents the base command when called without any subcommands
var rootCmd = &cobra.Command{
  Use:   "myapp",
  Short: "A brief description of your application",
  Long: `A longer description that spans multiple lines and likely contains
examples and usage of using your application. For example:

Cobra is a CLI library for Go that empowers applications.
This application is a tool to generate the needed files
to quickly create a Cobra application.`,
  // Uncomment the following line if your bare application
  // has an action associated with it:
  //    Run: func(cmd *cobra.Command, args []string) { },
}

View Code

查看 main.go：

package main

import "myapp/cmd"

func main() {
  cmd.Execute()
}

View Code

main.go 里 import了 myapp/cmd（也就是 root.go 文件）
- 在 Execute 里面调用了 rootCmd.Execute() 方法
- rootCmd 是*cobra.Command 类型
添加新的命令：cobra add version

$ cobra add version                
version created at /home/ao/go/src/myapp%

View Code

查看 cmd/version.go：
- init() 函数里面调用 rootCmd.AddCommand(versionCmd)，子命令就是 version

package cmd

import (
    "fmt"

    "github.com/spf13/cobra"
)

// versionCmd represents the version command
var versionCmd = &cobra.Command{
    Use:   "version",
    Short: "A brief description of your command",
    Long: `A longer description that spans multiple lines and likely contains examples
and usage of using your command. For example:

Cobra is a CLI library for Go that empowers applications.
This application is a tool to generate the needed files
to quickly create a Cobra application.`,
    Run: func(cmd *cobra.Command, args []string) {
        fmt.Println("version called")
    },
}

func init() {
    rootCmd.AddCommand(versionCmd)
    ......
}

View Code

添加多级子命令：

$ cobra add server 
server created at /home/ao/go/src/myapp%                                                                                         
$ cobra add create -p serverCmd
create created at /home/ao/go/src/myapp%

View Code

查看 cmd/create.go 文件：
- init() 函数调用 serverCmd.AddCommand(createCmd)，表示 create 为子命令

package cmd

import (
    "fmt"

    "github.com/spf13/cobra"
)

// createCmd represents the create command
var createCmd = &cobra.Command{
    Use:   "create",
    Short: "A brief description of your command",
    Long: `A longer description that spans multiple lines and likely contains examples
and usage of using your command. For example:

Cobra is a CLI library for Go that empowers applications.
This application is a tool to generate the needed files
to quickly create a Cobra application.`,
    Run: func(cmd *cobra.Command, args []string) {
        fmt.Println("create called")
    },
}

func init() {
    serverCmd.AddCommand(createCmd)
    ......
}

View Code

2.2 Scheduler 的 main 函数

下面正式进入 Kubernetes 的 Scheduler 源码阅读
查看 cmd/kube-scheduler/scheduler.go，删除非主干代码：

func main() {
    command := app.NewSchedulerCommand()
    if err := command.Execute(); err != nil {
        fmt.Fprintf(os.Stderr, "%v
", err)
        os.Exit(1)
    }
}

View Code

查看定义 NewSchedulerCommand 的 cmd/kube-scheduler/app/server.go，删除非主干代码：

/ NewSchedulerCommand creates a *cobra.Command object with default parameters
func NewSchedulerCommand() *cobra.Command {
    cmd := &cobra.Command{
        Use: "kube-scheduler",
        Long: `The Kubernetes scheduler is a policy-rich, topology-aware,
workload-specific function that significantly impacts availability, performance,
and capacity. The scheduler needs to take into account individual and collective
resource requirements, quality of service requirements, hardware/software/policy
constraints, affinity and anti-affinity specifications, data locality, inter-workload
interference, deadlines, and so on. Workload-specific requirements will be exposed
through the API as necessary.`,
        Run: func(cmd *cobra.Command, args []string) {
            if err := runCommand(cmd, args, opts); err != nil {
                fmt.Fprintf(os.Stderr, "%v
", err)
                os.Exit(1)
            }
        },
    }
    return cmd
}

View Code

Schduler 启动时调用了 runCommand(cmd, args, opts)，查看 cmd/kube-scheduler/app/server.go 关于 runCommand 定义，删除非主干代码：

// runCommand runs the scheduler.
func runCommand(cmd *cobra.Command, args []string, opts *options.Options) error {
    c, err := opts.Config()
    stopCh := make(chan struct{})
    // Get the completed config
    cc := c.Complete()
    return Run(cc, stopCh)
}

View Code

处理配置问题后调用了一个 Run() 函数，Run() 的作用是基于给定的配置启动 Scheduler，它只会在出错时或者 channel stopCh 被关闭时才退出，查看 cmd/kube-scheduler/app/server.go 关于 Run 定义，删除非主干代码：

// Run executes the scheduler based on the given configuration. It only return on error or when stopCh is closed.
func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}) error {
    // Create the scheduler.
    sched, err := scheduler.New(cc.Client,
        cc.InformerFactory.Core().V1().Nodes(),
        stopCh,
        scheduler.WithName(cc.ComponentConfig.SchedulerName))

    // Prepare a reusable runCommand function.
    run := func(ctx context.Context) {
        sched.Run()
        <-ctx.Done()
    }

    ctx, cancel := context.WithCancel(context.TODO()) 
    defer cancel()

    go func() {
        select {
        case <-stopCh:
            cancel()
        case <-ctx.Done():
        }
    }()

    // Leader election is disabled, so runCommand inline until done.
    run(ctx)
    return fmt.Errorf("finished without leader elect")
}

View Code

最终是要跑 sched.Run() 这个方法来启动 Scheduler，sched.Run() 方法就是 Scheduler 框架真正运行的逻辑
上述有一个 sched 变量，linux 里经常会看到一些软件叫 ***d，d 也就是 daemon，守护进程的意思，也就是一直跑在后台的一个程序
- 这里的 sched 也就是指 “Scheduler Daemon“
- sched 的其实是 *Scheduler 类型，定义在 pkg/scheduler/scheduler.go：
  - Scheduler 监控新创建的未被调度的 Pods，然后尝试寻找合适的 Node，回写一个绑定关系到 API Server
  - 这里也可以体会到 Daemon 的感觉，我们平时搭建的 k8s 集群中运行着一个 Daemon 进程叫做 kube-scheduler，在程序里面也就对应这样一个对象：Scheduler

// Scheduler watches for new unscheduled pods. It attempts to find
// nodes that they fit on and writes bindings back to the api server.
type Scheduler struct {
    config *factory.Config
}

View Code

Scheduler 结构体中的 Config 属性对象定义在 pkg/scheduler/factory/factory.go，主要属性如下：

// Config is an implementation of the Scheduler's configured input data.
type Config struct {
    // It is expected that changes made via SchedulerCache will be observed
    // by NodeLister and Algorithm.
    SchedulerCache schedulerinternalcache.Cache
    // Ecache is used for optimistically invalid affected cache items after
    // successfully binding a pod
    Ecache     *equivalence.Cache
    NodeLister algorithm.NodeLister
    Algorithm  algorithm.ScheduleAlgorithm
    GetBinder  func(pod *v1.Pod) Binder
    // PodConditionUpdater is used only in case of scheduling errors. If we succeed
    // with scheduling, PodScheduled condition will be updated in apiserver in /bind
    // handler so that binding and setting PodCondition it is atomic.
    PodConditionUpdater PodConditionUpdater
    // PodPreemptor is used to evict pods and update pod annotations.
    PodPreemptor PodPreemptor

    // NextPod should be a function that blocks until the next pod
    // is available. We don't use a channel for this, because scheduling
    // a pod may take some amount of time and we don't want pods to get
    // stale while they sit in a channel.
    NextPod func() *v1.Pod

    // SchedulingQueue holds pods to be scheduled
    SchedulingQueue internalqueue.SchedulingQueue
}

View Code

3. 整体调度框架

3.1 启动

调度的第二层：负责 Scheduler 除了具体 Node 过滤算法外的工作逻辑
Scheduler 对象绑定了一个 Run() 方法，定义如下：

// Run begins watching and scheduling. It waits for cache to be synced, then starts a goroutine and returns immediately.
func (sched *Scheduler) Run() {
    if !sched.config.WaitForCacheSync() {
        return
    }
    go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}

View Code

Run 函数在 cmd/kube-scheduler/app/server.go 调用：

    // Prepare a reusable runCommand function.
    run := func(ctx context.Context) {
        sched.Run()
        <-ctx.Done()
    }

View Code

调用了 sched.Run() 之后就等待 ctx.Done()
wait.Until 实现：每隔 n 时间调用 f 一次，除非 channel c 被关闭
- 这里的 n 就是 0，也就是一直调用，前一次调用返回下一次调用就开始了
- f 指 sched.scheduleOne，c 指 sched.config.StopEverything

3.2 一个 Pod 的基本调度流程

scheduleOne 实现 1 个 Pod 的完整调度工作流
这个过程是顺序执行的，也就是非并发的
- 即前一个 Pod 的 scheduleOne 一完成，下一个 Pod 的 ScheduleOne 立马接着执行
scheduleOne 的主要逻辑：

func (sched *Scheduler) scheduleOne() {
    pod := sched.config.NextPod()
    suggestedHost, err := sched.schedule(pod)
    if err != nil {
        if fitError, ok := err.(*core.FitError); ok {
            preemptionStartTime := time.Now()
            sched.preempt(pod, fitError)
        }
        return
    }
    assumedPod := pod.DeepCopy()
    allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
    err = sched.assume(assumedPod, suggestedHost)
    go func() {
        err := sched.bind(assumedPod, &v1.Binding{
            ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
            Target: v1.ObjectReference{
                Kind: "Node",
                Name: suggestedHost,
            },
        })
    }()
}

View Code

参考流程如下：

3.3 Schedule 算法计算合适 Node

主流程核心步骤是 suggestedHost, err := sched.schedule(pod)
这里完成了非抢占模式下 Node 的计算，包括预选过程、优选过程等
sched.schedule(pod) 方法定义见 pkg/scheduler/scheduler.go：

// schedule implements the scheduling algorithm and returns the suggested host.
func (sched *Scheduler) schedule(pod *v1.Pod) (string, error) {
    host, err := sched.config.Algorithm.Schedule(pod, sched.config.NodeLister)
    if err != nil {
        pod = pod.DeepCopy()
        sched.config.Error(pod, err)
        sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "%v", err)
        sched.config.PodConditionUpdater.Update(pod, &v1.PodCondition{
            Type:          v1.PodScheduled,
            Status:        v1.ConditionFalse,
            LastProbeTime: metav1.Now(),
            Reason:        v1.PodReasonUnschedulable,
            Message:       err.Error(),
        })
        return "", err
    }
    return host, err
}

View Code

里面调用了 sched.config.Algorithm.Schedule() 方法（本身为接口），定义见 pkg/scheduler/algorithm/scheduler_interface.go：

type ScheduleAlgorithm interface {
    Schedule(*v1.Pod, NodeLister) (selectedMachine string, err error)
    Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, cleanupNominatedPods []*v1.Pod, err error)
    Predicates() map[string]FitPredicate
    Prioritizers() []PriorityConfig
}

View Code

这个接口有 4 个方法，实现 ScheduleAlgorithm 接口的对象实现如何调度 Pods 到 Nodes 上
默认的实现是 pkg/scheduler/core/generic_scheduler.go 的 genericScheduler 对象
- Schedule()：给定 Pod 和 Nodes，计算出一个适合跑 Pod 的 Node 并返回
- Preempt()：抢占模式
- Predicates()：预选过程
- Prioritizers()：优选过程

4. 一般调度过程

6. 参考文献

《k8s 源码阅读》之 2.2 章节：scheduler

相关阅读:
算法练习-寻找和为定值的两个数
 算法练习-字符串全排列
 算法练习-最长回文子串
 判断一点是否在三角形的外接圆内
 用递归方法计算行列式的值
 算法练习-回文判断
 算法练习-字符串转换成整数(实现atoi函数)
算法练习-字符串包含
 数据结构-队列
 结构体（或者联合体）变量的成员在内存里是如何分布的
原文地址：https://www.cnblogs.com/wangao1236/p/11045674.html