Docker source code analysis, part 6 (based on version 1.8.2): the docker run startup process


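The previous article gave a rough overview of how a Docker container is created, mostly from the filesystem's point of view: the steps needed when creating a container, such as building the RootFS and setting up volumes. This article looks at what happens once a container has been created: how it is actually started and run.

The focus here is the docker daemon side. As the earlier articles showed, the client-side code is easy to find in api/client/run.go. Roughly, it first creates a container with the createContainer() method from the previous article, then starts it by calling cli.call("POST", "/containers/"+createResponse.ID+"/start", nil, nil). In api/server/server.go the request is routed by the mapping "/containers/{name:.*}/start": s.postContainersStart, and the handler postContainersStart is in api/server/container.go: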

func (s *Server) postContainersStart(version version.Version, w http.ResponseWriter, r *http.Request, vars map[string]string) error {
    if vars == nil {
        return fmt.Errorf("Missing parameter")
    }

    var hostConfig *runconfig.HostConfig
    if r.Body != nil && (r.ContentLength > 0 || r.ContentLength == -1) {
        if err := checkForJSON(r); err != nil {
            return err
        }
        c, err := runconfig.DecodeHostConfig(r.Body)
        if err != nil {
            return err
        }
        hostConfig = c
    }

    if err := s.daemon.ContainerStart(vars["name"], hostConfig); err != nil {
        if err.Error() == "Container already started" {
            w.WriteHeader(http.StatusNotModified)
            return nil
        }
        return err
    }
    w.WriteHeader(http.StatusNoContent)
    return nil
}

The logic is straightforward: parse the parameters from the request, call s.daemon.ContainerStart(vars["name"], hostConfig) to start the container, and write the result to the response. The real work happens in s.daemon.ContainerStart(), in daemon/start.go:

func (daemon *Daemon) ContainerStart(name string, hostConfig *runconfig.HostConfig) error {
    container, err := daemon.Get(name)
    if err != nil {
        return err
    }

    if container.IsPaused() {
        return fmt.Errorf("Cannot start a paused container, try unpause instead.")
    }

    if container.IsRunning() {
        return fmt.Errorf("Container already started")
    }

    // Windows does not have the backwards compatibility issue here.
    if runtime.GOOS != "windows" {
        // This is kept for backward compatibility - hostconfig should be passed when
        // creating a container, not during start.
        if hostConfig != nil {
            if err := daemon.setHostConfig(container, hostConfig); err != nil {
                return err
            }
        }
    } else {
        if hostConfig != nil {
            return fmt.Errorf("Supplying a hostconfig on start is not supported. It should be supplied on create")
        }
    }

    // check if hostConfig is in line with the current system settings.
    // It may happen cgroups are umounted or the like.
    if _, err = daemon.verifyContainerSettings(container.hostConfig, nil); err != nil {
        return err
    }

    if err := container.Start(); err != nil {
        return fmt.Errorf("Cannot start container %s: %s", name, err)
    }

    return nil
}

First, the container is looked up by the name that was passed in, via daemon.Get() (daemon/daemon.go):

func (daemon *Daemon) Get(prefixOrName string) (*Container, error) {
    if containerByID := daemon.containers.Get(prefixOrName); containerByID != nil {
        // prefix is an exact match to a full container ID
        return containerByID, nil
    }

    // GetByName will match only an exact name provided; we ignore errors
    if containerByName, _ := daemon.GetByName(prefixOrName); containerByName != nil {
        // prefix is an exact match to a full container Name
        return containerByName, nil
    }

    containerId, indexError := daemon.idIndex.Get(prefixOrName)
    if indexError != nil {
        return nil, indexError
    }
    return daemon.containers.Get(containerId), nil
}

Get() first looks in daemon.containers, treating the argument as a full container ID, to see whether the container already exists. daemon.containers is a contStore struct:

type contStore struct {
    s map[string]*Container
    sync.Mutex
}
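As a minimal sketch of how such a mutex-guarded map is typically used, the snippet below adds Add and Get methods to a struct of the same shape. The Container type here is a hypothetical stand-in with only an ID, and the method bodies are an assumption for illustration, not the daemon's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// Container is a hypothetical stand-in for the daemon's Container type.
type Container struct {
	ID string
}

// contStore has the same shape as the struct above: a map guarded by a mutex.
type contStore struct {
	s map[string]*Container
	sync.Mutex
}

// Add registers a container under its full ID.
func (c *contStore) Add(id string, cont *Container) {
	c.Lock()
	c.s[id] = cont
	c.Unlock()
}

// Get returns the container for a full ID, or nil if the ID is unknown.
func (c *contStore) Get(id string) *Container {
	c.Lock()
	res := c.s[id]
	c.Unlock()
	return res
}

func main() {
	store := &contStore{s: make(map[string]*Container)}
	store.Add("abc123", &Container{ID: "abc123"})
	fmt.Println(store.Get("abc123") != nil, store.Get("missing") == nil)
}
```

Returning nil for an unknown ID is what lets Get() in the daemon fall through to the next lookup strategy.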

Next it tries GetByName, which is also in daemon/daemon.go:

func (daemon *Daemon) GetByName(name string) (*Container, error) {
    fullName, err := GetFullContainerName(name)
    if err != nil {
        return nil, err
    }
    entity := daemon.containerGraph.Get(fullName)
    if entity == nil {
        return nil, fmt.Errorf("Could not find entity for %s", name)
    }
    e := daemon.containers.Get(entity.ID())
    if e == nil {
        return nil, fmt.Errorf("Could not find container for entity id %s", entity.ID())
    }
    return e, nil
}

daemon.containerGraph is of type graphdb.Database (in pkg/graphdb/graphdb.go):

type Database struct {
    conn *sql.DB
    mux  sync.RWMutex
}

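Database stores containers and the relationships between them. At present it is a sqlite3 database. As the code below shows, its path is config.Root joined with linkgraph.db, which is /var/lib/docker/linkgraph.db with the default root directory. It is created and handed to the daemon while NewDaemon builds the daemon instance: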

graphdbPath := filepath.Join(config.Root, "linkgraph.db")
graph, err := graphdb.NewSqliteConn(graphdbPath)
if err != nil {
    return nil, err
}
d.containerGraph = graph

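The database has two main tables, Entity and Edge. Each container corresponds to one entity, stored in the Entity table; each named relationship between an entity and its parent entity is stored in the Edge table. Each table also has a matching struct in the code: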

// Entity with a unique id.
type Entity struct {
    id string
}

// An Edge connects two entities together.
type Edge struct {
    EntityID string
    Name     string
    ParentID string
}

The table-creation statements may make this more concrete:

    createEntityTable = `
    CREATE TABLE IF NOT EXISTS entity (
        id text NOT NULL PRIMARY KEY
    );`

    createEdgeTable = `
    CREATE TABLE IF NOT EXISTS edge (
        "entity_id" text NOT NULL,
        "parent_id" text NULL,
        "name" text NOT NULL,
        CONSTRAINT "parent_fk" FOREIGN KEY ("parent_id") REFERENCES "entity" ("id"),
        CONSTRAINT "entity_fk" FOREIGN KEY ("entity_id") REFERENCES "entity" ("id")
    );
    `

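If GetByName finds nothing either, the last step is to look the argument up with daemon.idIndex.Get(). This idIndex is the same kind of structure as the image idIndex from the previous article: a trie, used to resolve an ID prefix to a full ID.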
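The three-stage lookup in Get() (exact ID, then exact name, then ID prefix) can be sketched as follows. This is a simplified illustration: the registry type, its maps and the error values are hypothetical, and a linear prefix scan stands in for the trie that idIndex really uses.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical error values for this sketch.
var (
	errNotFound  = errors.New("no such id")
	errAmbiguous = errors.New("multiple IDs found with provided prefix")
)

// registry is a hypothetical stand-in for the daemon's lookup state.
type registry struct {
	byID   map[string]bool   // set of full container IDs
	byName map[string]string // container name -> full ID
}

// resolve returns the full ID for an exact ID, an exact name or a unique ID prefix.
func (r *registry) resolve(prefixOrName string) (string, error) {
	// 1. Exact match on a full container ID.
	if r.byID[prefixOrName] {
		return prefixOrName, nil
	}
	// 2. Exact match on a container name.
	if id, ok := r.byName[prefixOrName]; ok {
		return id, nil
	}
	// 3. Unique ID prefix (linear scan here; the daemon uses a trie).
	match := ""
	for id := range r.byID {
		if strings.HasPrefix(id, prefixOrName) {
			if match != "" {
				return "", errAmbiguous
			}
			match = id
		}
	}
	if match == "" {
		return "", errNotFound
	}
	return match, nil
}

func main() {
	r := &registry{
		byID:   map[string]bool{"4e5f6a": true, "4e9911": true},
		byName: map[string]string{"web": "4e5f6a"},
	}
	for _, q := range []string{"4e5f6a", "web", "4e9", "4e", "zzz"} {
		id, err := r.resolve(q)
		fmt.Println(q, id, err)
	}
}
```

An ambiguous prefix is reported as an error rather than resolved arbitrarily, which is why a prefix that is too short fails when several IDs share it.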

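Back in ContainerStart(): once the container has been obtained, the function checks whether it is paused or already running. If it is neither, and after some settings have been verified (port mapping settings, the exec driver, whether the kernel supports CPU shares, IO weight and so on), it calls container.Start() (daemon/container.go) to start the container: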

func (container *Container) Start() (err error) {
    container.Lock()
    defer container.Unlock()

    if container.Running {
        return nil
    }

    if container.removalInProgress || container.Dead {
        return fmt.Errorf("Container is marked for removal and cannot be started.")
    }

    // if we encounter an error during start we need to ensure that any other
    // setup has been cleaned up properly
    defer func() {
        if err != nil {
            container.setError(err)
            // if no one else has set it, make sure we don't leave it at zero
            if container.ExitCode == 0 {
                container.ExitCode = 128
            }
            container.toDisk()
            container.cleanup()
            container.LogEvent("die")
        }
    }()

    if err := container.Mount(); err != nil {
        return err
    }

    // Make sure NetworkMode has an acceptable value. We do this to ensure
    // backwards API compatibility.
    container.hostConfig = runconfig.SetDefaultNetModeIfBlank(container.hostConfig)

    if err := container.initializeNetworking(); err != nil {
        return err
    }
    linkedEnv, err := container.setupLinkedContainers()
    if err != nil {
        return err
    }
    if err := container.setupWorkingDirectory(); err != nil {
        return err
    }
    env := container.createDaemonEnvironment(linkedEnv)
    if err := populateCommand(container, env); err != nil {
        return err
    }

    mounts, err := container.setupMounts()
    if err != nil {
        return err
    }

    container.command.Mounts = mounts
    return container.waitForStart()
}

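The deferred func() handles cleanup when starting the container fails: it records the error, makes sure the exit code is non-zero, saves the state to disk, cleans up and logs a "die" event.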
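The pattern relies on Go's named return values: the deferred closure runs after the return value has been assigned, so it can see whether any step failed. A minimal sketch, in which fakeContainer, its fields and errMount are hypothetical names used only for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// errMount is a hypothetical failure used by the example.
var errMount = errors.New("mount failed")

// fakeContainer is a hypothetical stand-in for the daemon's Container.
type fakeContainer struct {
	ExitCode int
	cleaned  bool
}

// start runs the setup steps in order. Because err is a named return value,
// the deferred closure observes the error returned by any step.
func (c *fakeContainer) start(steps ...func() error) (err error) {
	defer func() {
		if err != nil {
			// Do not leave the exit code at zero after a failed start.
			if c.ExitCode == 0 {
				c.ExitCode = 128
			}
			c.cleaned = true // stands in for toDisk(), cleanup() and LogEvent("die")
		}
	}()

	for _, step := range steps {
		if err := step(); err != nil {
			// The shadowed err is copied into the named result on return.
			return err
		}
	}
	return nil
}

func main() {
	ok := &fakeContainer{}
	fmt.Println(ok.start(func() error { return nil }), ok.ExitCode, ok.cleaned)

	bad := &fakeContainer{}
	fmt.Println(bad.start(func() error { return errMount }), bad.ExitCode, bad.cleaned)
}
```

Putting the cleanup in a single deferred closure means every early `return err` in the function gets the same treatment without repeating the cleanup code.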

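container.Mount() mounts the container's filesystem (aufs in this setup).

initializeNetworking() initializes networking. Docker has three main network modes: bridge (each container has its own network stack), host (the container shares the host's network stack) and container (the container shares another container's network stack; my guess is that this is what a Kubernetes pod uses). There is also a none mode that leaves the container without networking. The mode is chosen from the parameters in config and hostConfig, and the libnetwork package is then called to set up the network. Docker networking will get an article of its own later.

container.setupLinkedContainers() collects information from the containers connected through --link and returns it as environment variables (a []string in which each element looks like "NAME=xxxx").

setupWorkingDirectory() creates the working directory in which the container's command will run.

createDaemonEnvironment() combines the container's own environment variables with the linkedEnv from the previous step (append) and returns the result.

populateCommand(container, env) mainly configures what the container's execdriver (which ultimately starts the container) needs: the network mode, the namespaces (pid, ipc, uts), resource limits and so on. It also sets the Command to run inside the container; the Command holds the start command for the container's process.

container.setupMounts() returns all of the container's mount points.

Finally, container.waitForStart() is called to start the container: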

func (container *Container) waitForStart() error {
    container.monitor = newContainerMonitor(container, container.hostConfig.RestartPolicy)

    // block until we either receive an error from the initial start of the container's
    // process or until the process is running in the container
    select {
    case <-container.monitor.startSignal:
    case err := <-promise.Go(container.monitor.Start):
        return err
    }

    return nil
}

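It first instantiates a containerMonitor. The monitor's main job is to watch the first process in the container; if that process does not run successfully, the monitor can restart it according to the configured restart policy (RestartPolicy).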
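The select in waitForStart() races two events: the start signal, and an early error from the monitor. A small sketch of that pattern follows. goPromise imitates what promise.Go does (run a function in a goroutine and deliver its result on a channel); the monitor type, its run field and errStart are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errStart is a hypothetical start-up failure used by the example.
var errStart = errors.New("exec: command not found")

// goPromise runs f in a goroutine and delivers its result on a channel.
func goPromise(f func() error) chan error {
	ch := make(chan error, 1)
	go func() { ch <- f() }()
	return ch
}

// monitor is a hypothetical stand-in for containerMonitor.
type monitor struct {
	startSignal chan struct{}
	run         func(m *monitor) error // blocks for as long as the process runs
}

// waitForStart returns nil once the process has started, or the error
// produced if the monitor fails before signalling the start.
func waitForStart(m *monitor) error {
	select {
	case <-m.startSignal:
	case err := <-goPromise(func() error { return m.run(m) }):
		return err
	}
	return nil
}

func main() {
	stop := make(chan struct{})
	started := &monitor{
		startSignal: make(chan struct{}),
		run: func(m *monitor) error {
			close(m.startSignal) // what the start callback does
			<-stop               // the process keeps running
			return nil
		},
	}
	fmt.Println(waitForStart(started))
	close(stop)

	failed := &monitor{
		startSignal: make(chan struct{}),
		run:         func(m *monitor) error { return errStart },
	}
	fmt.Println(waitForStart(failed))
}
```

The caller is unblocked as soon as the process is up, while the monitor goroutine keeps running for the lifetime of the container.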

Looking at the monitor's Start() function (daemon/monitor.go), the key part is

m.container.daemon.Run(m.container, pipes, m.callback)

The Run method is in daemon/daemon.go:

func (daemon *Daemon) Run(c *Container, pipes *execdriver.Pipes, startCallback execdriver.StartCallback) (execdriver.ExitStatus, error) {
    return daemon.execDriver.Run(c.command, pipes, startCallback)
}

Docker has two exec drivers, lxc and native. lxc is the older driver; native is the default and uses libcontainer. So this Run ends up calling the Run() method in daemon/execdriver/native/driver.go:

func (d *Driver) Run(c *execdriver.Command, pipes *execdriver.Pipes, startCallback execdriver.StartCallback) (execdriver.ExitStatus, error) {
    // take the Command and populate the libcontainer.Config from it
    container, err := d.createContainer(c)
    if err != nil {
        return execdriver.ExitStatus{ExitCode: -1}, err
    }

    p := &libcontainer.Process{
        Args: append([]string{c.ProcessConfig.Entrypoint}, c.ProcessConfig.Arguments...),
        Env:  c.ProcessConfig.Env,
        Cwd:  c.WorkingDir,
        User: c.ProcessConfig.User,
    }

    if err := setupPipes(container, &c.ProcessConfig, p, pipes); err != nil {
        return execdriver.ExitStatus{ExitCode: -1}, err
    }

    cont, err := d.factory.Create(c.ID, container)
    if err != nil {
        return execdriver.ExitStatus{ExitCode: -1}, err
    }
    d.Lock()
    d.activeContainers[c.ID] = cont
    d.Unlock()
    defer func() {
        cont.Destroy()
        d.cleanContainer(c.ID)
    }()

    if err := cont.Start(p); err != nil {
        return execdriver.ExitStatus{ExitCode: -1}, err
    }

    if startCallback != nil {
        pid, err := p.Pid()
        if err != nil {
            p.Signal(os.Kill)
            p.Wait()
            return execdriver.ExitStatus{ExitCode: -1}, err
        }
        startCallback(&c.ProcessConfig, pid)
    }

    oom := notifyOnOOM(cont)
    waitF := p.Wait
    if nss := cont.Config().Namespaces; !nss.Contains(configs.NEWPID) {
        // we need such hack for tracking processes with inherited fds,
        // because cmd.Wait() waiting for all streams to be copied
        waitF = waitInPIDHost(p, cont)
    }
    ps, err := waitF()
    if err != nil {
        execErr, ok := err.(*exec.ExitError)
        if !ok {
            return execdriver.ExitStatus{ExitCode: -1}, err
        }
        ps = execErr.ProcessState
    }
    cont.Destroy()
    _, oomKill := <-oom
    return execdriver.ExitStatus{ExitCode: utils.ExitStatus(ps.Sys().(syscall.WaitStatus)), OOMKilled: oomKill}, nil
}

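d.createContainer(c) builds the configuration a container needs from the command: Capabilities, Namespaces, Cgroups, mount points and so on. It first generates the fixed part of the configuration from a template (daemon/execdriver/native/template/default_template.go) and then sets up the container-specific namespaces according to the command.

Next it instantiates a libcontainer.Process{}. Its Args field is the combination of the entrypoint and cmd arguments supplied by the user, and is part of what the container's first process (initProcess) will eventually run.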
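The Args expression in Run() prepends the entrypoint to the argument list. A tiny sketch of that composition, where processArgs is a hypothetical helper name:

```go
package main

import "fmt"

// processArgs builds the argv of the container's first process:
// the entrypoint followed by its arguments.
func processArgs(entrypoint string, arguments []string) []string {
	// Starting from a fresh one-element slice means the result
	// never shares a backing array with arguments.
	return append([]string{entrypoint}, arguments...)
}

func main() {
	args := processArgs("/bin/sh", []string{"-c", "echo hello"})
	fmt.Printf("%q\n", args)
}
```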

setupPipes(container, &c.ProcessConfig, p, pipes) binds the standard input and output of the container object (pipes) to the libcontainer.Process (the container's future init process, i.e. the variable p), so that the initial process's input and output can be reached.

cont, err := d.factory.Create(c.ID, container) calls driver.factory (vendor/src/github.com/opencontainers/runc/libcontainer/factory_linux.go) to instantiate a Linux container with the following structure:

linuxContainer{
    id:            id,
    root:          containerRoot,
    config:        config,
    initPath:      l.InitPath,
    initArgs:      l.InitArgs,
    criuPath:      l.CriuPath,
    cgroupManager: l.NewCgroupsManager(config.Cgroups, nil),
}

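This linuxContainer type is different from the earlier container type; it belongs to the execdriver. The main fields: id is the container ID; initPath is the path to dockerinit; initArgs holds the arguments for dockerinit; criuPath is used for checkpointing the container; cgroupManager manages the resources of the container's processes.

dockerinit deserves a note. It is a fixed binary and the first executable run once a container comes up. Its job is to set up mounts, initialize the network stack and so on inside the new namespaces. It is also responsible for executing the user-specified entrypoint and cmd, which run in the same process as dockerinit.

cont.Start(p) runs the libcontainer.Process created earlier through the linuxContainer; this step is covered in detail below.

What follows is routine: invoke the callback, watch for the container being OOM-killed (through the cgroup manager), wait with p.Wait() for the libcontainer.Process to finish, then call Destroy to tear down the linuxContainer and return the exit status.

Now for a closer look at linuxContainer's Start (vendor/src/github.com/opencontainers/runc/libcontainer/container_linux.go):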

func (c *linuxContainer) Start(process *Process) error {
    c.m.Lock()
    defer c.m.Unlock()
    status, err := c.currentStatus()
    if err != nil {
        return err
    }
    doInit := status == Destroyed
    parent, err := c.newParentProcess(process, doInit)
    if err != nil {
        return newSystemError(err)
    }
    if err := parent.start(); err != nil {
        // terminate the process to ensure that it properly is reaped.
        if err := parent.terminate(); err != nil {
            logrus.Warn(err)
        }
        return newSystemError(err)
    }
    process.ops = parent
    if doInit {
        c.updateState(parent)
    }
    return nil
}

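The job of Start() is to launch the container's first process, initProcess. When the docker daemon starts a new container it is really forking a new process (one with its own namespaces, which is what isolates containers from each other). This process is also the container's initial process, and it carries out the sequence dockerinit, entrypoint, cmd.

status, err := c.currentStatus() first checks whether the container's initial process already exists; if it does not, the Destroyed status is returned.

parent, err := c.newParentProcess(process, doInit) prepares the new process. Here is the code for newParentProcess: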

func (c *linuxContainer) newParentProcess(p *Process, doInit bool) (parentProcess, error) {
    parentPipe, childPipe, err := newPipe()
    if err != nil {
        return nil, newSystemError(err)
    }
    cmd, err := c.commandTemplate(p, childPipe)
    if err != nil {
        return nil, newSystemError(err)
    }
    if !doInit {
        return c.newSetnsProcess(p, cmd, parentPipe, childPipe), nil
    }
    return c.newInitProcess(p, cmd, parentPipe, childPipe)
}

func (c *linuxContainer) commandTemplate(p *Process, childPipe *os.File) (*exec.Cmd, error) {
    cmd := &exec.Cmd{
        Path: c.initPath,
        Args: c.initArgs,
    }
    cmd.Stdin = p.Stdin
    cmd.Stdout = p.Stdout
    cmd.Stderr = p.Stderr
    cmd.Dir = c.config.Rootfs
    if cmd.SysProcAttr == nil {
        cmd.SysProcAttr = &syscall.SysProcAttr{}
    }
    cmd.ExtraFiles = append(p.ExtraFiles, childPipe)
    cmd.Env = append(cmd.Env, fmt.Sprintf("_LIBCONTAINER_INITPIPE=%d", stdioFdCount+len(cmd.ExtraFiles)-1))
    if c.config.ParentDeathSignal > 0 {
        cmd.SysProcAttr.Pdeathsig = syscall.Signal(c.config.ParentDeathSignal)
    }
    return cmd, nil
}

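The two functions above are related: the first calls the second.

newParentProcess first calls parentPipe, childPipe, err := newPipe() to create a socket pair that acts as a pipe. This pipe is the channel through which the docker daemon talks to the future dockerinit. As mentioned above, dockerinit initializes important resources inside the new namespaces, but some of those resources have to be requested by the docker daemon on the host, for example a veth pair. After the daemon has created them in its own namespace, it passes the data to dockerinit through this pipe.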
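A socket pair of this kind can be created from Go with syscall.Socketpair. The sketch below is an illustration for Unix-like systems rather than libcontainer's exact code; roundTrip is a hypothetical helper that writes on one end and reads on the other within a single process.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

// newPipe returns the two ends of a connected Unix socket pair as files.
func newPipe() (parent, child *os.File, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return nil, nil, err
	}
	return os.NewFile(uintptr(fds[0]), "parent"), os.NewFile(uintptr(fds[1]), "child"), nil
}

// roundTrip writes msg on the parent end and reads it back on the child end.
func roundTrip(msg string) (string, error) {
	parent, child, err := newPipe()
	if err != nil {
		return "", err
	}
	defer parent.Close()
	defer child.Close()

	if _, err := parent.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(child, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := roundTrip("veth pair ready")
	fmt.Println(got, err)
}
```

Unlike a plain pipe, a socket pair is bidirectional, which matters here because the daemon both sends configuration to the child and reads an error report back on the same descriptor.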

Next comes cmd, err := c.commandTemplate(p, childPipe). Its main job is to wrap dockerinit and its arguments in Go's exec.Cmd type:

&exec.Cmd{
    Path: c.initPath,
    Args: c.initArgs,
}

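This Cmd is the process that will actually be executed. commandTemplate also binds Cmd's standard input and output to those of the libcontainer.Process (which were bound to the container object earlier), and adds the childPipe end of the pipe to the files the Cmd keeps open (ExtraFiles).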
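The _LIBCONTAINER_INITPIPE value follows from how os/exec numbers inherited files: entry i of ExtraFiles becomes file descriptor 3+i in the child, after stdin, stdout and stderr. A short sketch of the arithmetic, where initPipeEnv is a hypothetical helper that takes the number of extra files present before the pipe is appended:

```go
package main

import "fmt"

// stdioFdCount covers stdin, stdout and stderr (fds 0, 1 and 2).
const stdioFdCount = 3

// initPipeEnv returns the environment entry that tells the child which
// descriptor carries the init pipe. The pipe is appended after the
// existing extra files, so it is the last entry of ExtraFiles.
func initPipeEnv(existingExtraFiles int) string {
	extraFiles := existingExtraFiles + 1 // after appending childPipe
	return fmt.Sprintf("_LIBCONTAINER_INITPIPE=%d", stdioFdCount+extraFiles-1)
}

func main() {
	fmt.Println(initPipeEnv(0)) // no other extra files: the pipe is fd 3
	fmt.Println(initPipeEnv(2)) // two extra files first: the pipe is fd 5
}
```

Passing the number through the environment lets the child find the pipe without any fixed descriptor convention.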

newParentProcess then returns newInitProcess(p, cmd, parentPipe, childPipe), which in essence returns an initProcess (vendor/src/github.com/opencontainers/runc/libcontainer/process_linux.go):

initProcess{
    cmd:        cmd,
    childPipe:  childPipe,
    parentPipe: parentPipe,
    manager:    c.cgroupManager,
    config:     c.newInitConfig(p),
}

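Here cmd is the exec.Cmd wrapped earlier; childPipe has already been attached to cmd's file descriptors; parentPipe is the other end of the pipe; manager is the cgroup manager that controls resources; and config is the configuration of the libcontainer.Process (including entrypoint and cmd) converted into settings that will be sent through parentPipe to cmd's childPipe and finally run by dockerinit, as described below.

Back in Start(), parent is an initProcess, and the next thing that happens is a call to its start() method: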

func (p *initProcess) start() error {
    defer p.parentPipe.Close()
    err := p.cmd.Start()
    p.childPipe.Close()
    if err != nil {
        return newSystemError(err)
    }
    fds, err := getPipeFds(p.pid())
    if err != nil {
        return newSystemError(err)
    }
    p.setExternalDescriptors(fds)
    if err := p.manager.Apply(p.pid()); err != nil {
        return newSystemError(err)
    }
    defer func() {
        if err != nil {
            // TODO: should not be the responsibility to call here
            p.manager.Destroy()
        }
    }()
    if err := p.createNetworkInterfaces(); err != nil {
        return newSystemError(err)
    }
    if err := p.sendConfig(); err != nil {
        return newSystemError(err)
    }
    // wait for the child process to fully complete and receive an error message
    // if one was encoutered
    var ierr *genericError
    if err := json.NewDecoder(p.parentPipe).Decode(&ierr); err != nil && err != io.EOF {
        return newSystemError(err)
    }
    if ierr != nil {
        return newSystemError(ierr)
    }
    return nil
}

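The main steps: p.cmd.Start() runs the cmd command.

p.manager.Apply(p.pid()): once cmd is running it is a new process, the first process in the container, and has a pid. That pid is added to the cgroup configuration so that child processes later forked by the initial process are also subject to the cgroup resource limits.

createNetworkInterfaces() builds the network configuration for the process and puts it into config.

p.sendConfig() sends the configuration (network configuration, entrypoint, cmd and so on) through parentPipe to the cmd process, where dockerinit acts on it.

json.NewDecoder(p.parentPipe).Decode(&ierr) waits to learn whether cmd ran into a problem.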
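The exchange in the last two steps can be sketched as a small protocol: the parent sends a JSON config, and the child either writes back a JSON error or simply closes its end, which the parent sees as io.EOF and treats as success. The sketch below runs both sides in one process over net.Pipe instead of a socket pair; initConfig, initError, child and startInit are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net"
)

// initConfig is a hypothetical, reduced version of the config sent to the child.
type initConfig struct {
	Args []string
	Env  []string
}

// initError is a hypothetical error report sent back by the child.
type initError struct {
	Message string
}

func (e *initError) Error() string { return e.Message }

// child plays the role of dockerinit: read the config, run setup,
// report a failure over the pipe, and close the pipe in every case.
func child(pipe net.Conn, setup func(initConfig) error) {
	defer pipe.Close()
	var cfg initConfig
	if err := json.NewDecoder(pipe).Decode(&cfg); err != nil {
		json.NewEncoder(pipe).Encode(&initError{Message: err.Error()})
		return
	}
	if err := setup(cfg); err != nil {
		json.NewEncoder(pipe).Encode(&initError{Message: err.Error()})
	}
}

// startInit plays the role of initProcess.start(): send the config, then
// wait for either an error report or EOF (the child closed its end).
func startInit(cfg initConfig, setup func(initConfig) error) error {
	parentPipe, childPipe := net.Pipe()
	defer parentPipe.Close()
	go child(childPipe, setup)

	if err := json.NewEncoder(parentPipe).Encode(cfg); err != nil {
		return err
	}
	var ierr *initError
	if err := json.NewDecoder(parentPipe).Decode(&ierr); err != nil && err != io.EOF {
		return err
	}
	if ierr != nil {
		return ierr
	}
	return nil
}

func main() {
	cfg := initConfig{Args: []string{"/bin/sh", "-c", "echo hi"}}
	fmt.Println(startInit(cfg, func(initConfig) error { return nil }))
	fmt.Println(startInit(cfg, func(initConfig) error {
		return &initError{Message: "mount rootfs failed"}
	}))
}
```

Treating EOF as success keeps the happy path free of any acknowledgement message: the child only has to write something when setup goes wrong.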

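To sum up, starting a container works like this: docker packs the container's main configuration into a Command and hands it to the execdriver (libcontainer). From the Command, libcontainer produces a libcontainer.Process and a linuxContainer, and the linuxContainer runs the libcontainer.Process. Running it means building an os/exec Cmd (which contains dockerinit), starting dockerinit, and then running entrypoint and cmd.

That is all the analysis for now before the New Year; next up are swarm, Kubernetes and Docker networking.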
Original article: https://www.cnblogs.com/yuhan-TB/p/5118122.html
