• nsenter


    https://x3fwy.bitcron.com/post/runc-malicious-container-escape

    The `nsenter` package contains `import "C"`, so it is built with
    [cgo](https://golang.org/cmd/cgo/). In cgo, if the import of "C" is immediately preceded
    by a comment, that comment, called the preamble, is used as a header when compiling the
    C parts of the package. The preamble of `nsenter` defines a C constructor that calls
    `nsexec()`, so any binary that imports `nsenter` runs `nsexec()` at startup, before the
    Go runtime. In runc the package is imported only in `init.go`, and `nsexec()` returns
    early unless the parent has passed it an init pipe, so the C code does its real work
    only when the runc `init` command is invoked.
    
    Because `nsexec()` must be run before the Go runtime in order to use the
    Linux kernel namespace, you must `import` this library into a package if
    you plan to use `libcontainer` directly. Otherwise Go will not execute
    the `nsexec()` constructor, which means that the re-exec will not cause
    the namespaces to be joined. You can import it like this:
    
    ```go
    import _ "github.com/opencontainers/runc/libcontainer/nsenter"
    ```

    init.go

    func init() {
            if len(os.Args) > 1 && os.Args[1] == "init" {
                    runtime.GOMAXPROCS(1)
                    runtime.LockOSThread()
    
                    level := os.Getenv("_LIBCONTAINER_LOGLEVEL")
                    logLevel, err := logrus.ParseLevel(level)
                    if err != nil {
                            panic(fmt.Sprintf("libcontainer: failed to parse log level: %q: %v", level, err))
                    }
    
                    err = logs.ConfigureLogging(logs.Config{
                            LogPipeFd: os.Getenv("_LIBCONTAINER_LOGPIPE"),
                            LogFormat: "json",
                            LogLevel:  logLevel,
                    })
                    if err != nil {
                            panic(fmt.Sprintf("libcontainer: failed to configure logging: %v", err))
                    }
                    logrus.Debugf("child process in init() and child pid is %d", os.Getpid())
            }
    }
    
    var initCommand = cli.Command{
            Name:  "init",
            Usage: `initialize the namespaces and launch the process (do not call it outside of runc)`,
            Action: func(context *cli.Context) error {
                    factory, _ := libcontainer.New("")
                    if err := factory.StartInitialization(); err != nil {
                            // as the error is sent back to the parent there is no need to log
                            // or write it to stderr because the parent process will handle this
                            os.Exit(1)
                    }
                    panic("libcontainer: container init failed to exec")
            },
    }

    main.go

    app.Commands = []cli.Command{
            checkpointCommand,
            createCommand,
            deleteCommand,
            eventsCommand,
            execCommand,
            initCommand,
            killCommand,
            listCommand,
            pauseCommand,
            psCommand,
            restoreCommand,
            resumeCommand,
            runCommand,
            specCommand,
            startCommand,
            stateCommand,
            updateCommand,
    }

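    Analysis of the nsenter module

    The nsenter module mainly handles namespace management (joining the current process to specified namespaces or creating new ones), uid and gid mapping management, and console management.

    It is implemented in two languages, Go and C. The code is in:

    libcontainer/nsenter, with the core implementation in libcontainer/nsenter/nsexec.c.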

    Module entry point

    package nsenter

    /*
    #cgo CFLAGS: -Wall
    extern void nsexec();
    void __attribute__((constructor)) init(void) {
            nsexec();
    }
    */
    import "C"

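    When a package does import _ "github.com/opencontainers/runc/libcontainer/nsenter", the C part of nsenter is compiled into the resulting executable. The C code here defines a constructor function, init(void). A function marked with the constructor attribute runs before main, so init(void) runs as soon as the executable starts, before the Go runtime is initialized. As a result, nsexec() is effectively the first code to execute.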
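
    The constructor behaviour described above can be checked in isolation. The following is a minimal sketch, assuming GCC or Clang; `fake_nsexec`, `early_init`, `early_init_done` and `early_init_ran` are hypothetical names for illustration and are not part of runc.

    ```c
    #include <assert.h>

    /* Hypothetical flag, set by the constructor so that main() can observe it. */
    static int early_init_done = 0;

    /* Stand-in for nsexec(); the real function is defined in nsexec.c. */
    static void fake_nsexec(void)
    {
        early_init_done = 1;
    }

    /* GCC and Clang call functions with this attribute before main() starts. */
    static void __attribute__((constructor)) early_init(void)
    {
        fake_nsexec();
    }

    /* Returns 1 once the constructor has run, 0 otherwise. */
    int early_init_ran(void)
    {
        return early_init_done;
    }

    /* Usage from main(): assert(early_init_ran() == 1); */
    ```

    By the time main() is entered the flag is already set, which is the property nsenter relies on to get `nsexec()` running ahead of the Go runtime.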

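    The nsexec function

    Its main tasks are:

    1. Set up the log pipe, used to forward logs;
    2. Set up the init pipe, used to pass in configuration data such as namespaces and to send back the child PIDs;
    3. Ensure the binary is cloned, which addresses CVE-2019-5736 by closing the hole exposed through /proc/self/exe;
    4. Read and parse the namespace and other data passed in through the init pipe;
    5. Update the OOM settings;
    6. Perform the double fork.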
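
    Steps 1 and 2 depend on the parent handing pipe file descriptors to the init process. A minimal sketch of one way to receive them, assuming the descriptor number arrives as a decimal string in an environment variable (as `_LIBCONTAINER_LOGPIPE` does in the `init.go` code above). `fd_from_env` is a hypothetical helper written for this example, not the code in nsexec.c.

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <limits.h>
    #include <stdlib.h>

    /*
     * Parse a file descriptor number from the environment variable `name`.
     * Returns -1 if the variable is unset, empty, or not a non-negative int.
     */
    int fd_from_env(const char *name)
    {
        const char *value;
        char *end = NULL;
        long fd;

        assert(name != NULL);
        value = getenv(name);
        if (value == NULL || *value == '\0')
            return -1;

        errno = 0;
        fd = strtol(value, &end, 10);
        /* Reject overflow, trailing garbage, and out-of-range values. */
        if (errno != 0 || *end != '\0' || fd < 0 || fd > INT_MAX)
            return -1;
        return (int)fd;
    }
    ```

    A caller can treat -1 as "no pipe was passed" and return immediately, which keeps the early startup code harmless in processes that are not the container init.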

    ensure clone binary

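    On the first run, the contents of the original binary are copied into memory. Every later execution uses that in-memory copy. This removes the vulnerability that arises when the binary is modified while it is running.

    Implementation (analysis still pending): clone_binary.c, ensure_cloned_binary()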

    static int clone_binary(void)
    {
        int binfd, execfd;
        struct stat statbuf = {};
        size_t sent = 0;
        int fdtype = EFD_NONE;
    
        /*
         * Before we resort to copying, let's try creating an ro-binfd in one shot
         * by getting a handle for a read-only bind-mount of the execfd.
         */
        execfd = try_bindfd();
        if (execfd >= 0)
            return execfd;
    
        /*
         * Dammit, that didn't work -- time to copy the binary to a safe place we
         * can seal the contents.
         */
        execfd = make_execfd(&fdtype);
        if (execfd < 0 || fdtype == EFD_NONE)
            return -ENOTRECOVERABLE;
    
        binfd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
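
    The copy-and-seal idea can be sketched on its own. This is a simplified illustration, assuming Linux with `memfd_create(2)` and file sealing available. It does not reproduce the `try_bindfd()` and `make_execfd()` paths from the excerpt above, and `sealed_copy_fd` is a hypothetical name.

    ```c
    #define _GNU_SOURCE
    #include <assert.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Copy the file at `path` into an in-memory file, seal it against any
     * further modification, and return its descriptor (or -1 on error).
     */
    int sealed_copy_fd(const char *path)
    {
        struct stat st;
        off_t copied = 0;
        int src, memfd;

        assert(path != NULL);
        src = open(path, O_RDONLY | O_CLOEXEC);
        if (src < 0)
            return -1;
        if (fstat(src, &st) < 0)
            goto fail_src;

        memfd = memfd_create("sealed-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
        if (memfd < 0)
            goto fail_src;

        /* sendfile() may copy less than requested, so loop until done. */
        while (copied < st.st_size) {
            ssize_t n = sendfile(memfd, src, NULL, (size_t)(st.st_size - copied));
            if (n <= 0)
                goto fail_both;
            copied += n;
        }

        /* After sealing, the contents can no longer be written or resized. */
        if (fcntl(memfd, F_ADD_SEALS,
                  F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0)
            goto fail_both;

        close(src);
        return memfd;

    fail_both:
        close(memfd);
    fail_src:
        close(src);
        return -1;
    }

    /* Usage from main(): int fd = sealed_copy_fd("/proc/self/exe"); assert(fd >= 0); */
    ```

    Re-executing from such a descriptor means a process inside the container can no longer reach the host binary through /proc/self/exe and overwrite it; it only ever sees the sealed copy.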

     

    double clone

    nsexec clones the process twice.

    The reason two clones are needed is explained in this comment:

    /*
    * Okay, so this is quite annoying.
    *
    * In order for this unsharing code to be more extensible we need to split
    * up unshare(CLONE_NEWUSER) and clone() in various ways. The ideal case
    * would be if we did clone(CLONE_NEWUSER) and the other namespaces
    * separately, but because of SELinux issues we cannot really do that. But
    * we cannot just dump the namespace flags into clone(...) because several
    * usecases (such as rootless containers) require more granularity around
    * the namespace setup. In addition, some older kernels had issues where
    * CLONE_NEWUSER wasn't handled before other namespaces (but we cannot
    * handle this while also dealing with SELinux so we choose SELinux support
    * over broken kernel support).
    *
    * However, if we unshare(2) the user namespace *before* we clone(2), then
    * all hell breaks loose.
    *
    * The parent no longer has permissions to do many things (unshare(2) drops
    * all capabilities in your old namespace), and the container cannot be set
    * up to have more than one {uid,gid} mapping. This is obviously less than
    * ideal. In order to fix this, we have to first clone(2) and then unshare.
    *
    * Unfortunately, it's not as simple as that. We have to fork to enter the
    * PID namespace (the PID namespace only applies to children). Since we'll
    * have to double-fork, this clone_parent() call won't be able to get the
    * PID of the _actual_ init process (without doing more synchronisation than
    * I can deal with at the moment). So we'll just get the parent to send it
    * for us, the only job of this process is to update
    * /proc/pid/{setgroups,uid_map,gid_map}.
    *
    * And as a result of the above, we also need to setns(2) in the first child
    * because if we join a PID namespace in the topmost parent then our child
    * will be in that namespace (and it will not be able to give us a PID value
    * that makes sense without resorting to sending things with cmsg).
    *
    * This also deals with an older issue caused by dumping cloneflags into
    * clone(2): On old kernels, CLONE_PARENT didn't work with CLONE_NEWPID, so
    * we have to unshare(2) before clone(2) in order to do this. This was fixed
    * in upstream commit 1f7f4dde5c945f41a7abc2285be43d918029ecc5, and was
    * introduced by 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e. As far as we're
    * aware, the last mainline kernel which had this bug was Linux 3.12.
    * However, we cannot comment on which kernels the broken patch was
    * backported to.
    *
    * -- Aleksa "what has my life come to?" Sarai
    */

    Including the parent, three processes are involved. They interact in the following sequence:

    Title: How to clone init process
    Parent->Child: clone first child
    Note right of Child: join namespace and unshare newuser
    Child->Parent: send SYNC_USERMAP_PLS
    Note left of Parent: update groups,uid and gid
    Parent->Child: send SYNC_USERMAP_ACK
    Note right of Child: unshare other namespace, except cgroup
    Child->GrandChild: clone grand child
    Child->Parent: send SYNC_RECVPID_PLS
    Note left of Parent: get pid of children
    Parent->Child: send SYNC_RECVPID_ACK
    Note left of Parent: send pid of children to its own parent (the runc create process)
    Child->Parent: send SYNC_CHILD_READY
    Note right of Child: finish
    Parent->GrandChild: send SYNC_GRANDCHILD
    Note left of Parent: wait SYNC_CHILD_READY from GrandChild
    Note right of GrandChild: set sid,uid,gid
    Note right of GrandChild: unshare cgroup namespace
    GrandChild->Parent: send SYNC_CHILD_READY
    Note left of Parent: finish
    Note right of GrandChild: let go runtime take over process
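
    The request/acknowledge pattern in this sequence can be sketched with a socket pair and a single fork. This is a reduced illustration, not the nsexec.c implementation: it covers only the Parent/Child exchange of SYNC_USERMAP_PLS, SYNC_USERMAP_ACK and SYNC_CHILD_READY, skips the RECVPID step and the grandchild, and uses made-up numeric values and hypothetical helper names (`send_msg`, `expect_msg`, `run_usermap_handshake`).

    ```c
    #include <assert.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Message names follow the diagram; the numeric values are made up here. */
    enum sync_msg {
        SYNC_USERMAP_PLS = 1,
        SYNC_USERMAP_ACK = 2,
        SYNC_CHILD_READY = 3,
    };

    static int send_msg(int fd, enum sync_msg msg)
    {
        int v = (int)msg;
        return write(fd, &v, sizeof(v)) == (ssize_t)sizeof(v) ? 0 : -1;
    }

    static int expect_msg(int fd, enum sync_msg want)
    {
        int v = 0;
        if (read(fd, &v, sizeof(v)) != (ssize_t)sizeof(v))
            return -1;
        return v == (int)want ? 0 : -1;
    }

    /* Runs the reduced handshake; returns 0 if both sides completed it. */
    int run_usermap_handshake(void)
    {
        int sv[2], status = 0, ok;
        pid_t child;

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
            return -1;

        child = fork();
        if (child < 0) {
            close(sv[0]);
            close(sv[1]);
            return -1;
        }

        if (child == 0) {
            /* Child: request the mappings, wait for the ack, report ready. */
            close(sv[0]);
            if (send_msg(sv[1], SYNC_USERMAP_PLS) < 0 ||
                expect_msg(sv[1], SYNC_USERMAP_ACK) < 0 ||
                send_msg(sv[1], SYNC_CHILD_READY) < 0)
                _exit(1);
            _exit(0);
        }

        /* Parent: the uid/gid maps would be written between request and ack. */
        close(sv[1]);
        ok = expect_msg(sv[0], SYNC_USERMAP_PLS) == 0 &&
             send_msg(sv[0], SYNC_USERMAP_ACK) == 0 &&
             expect_msg(sv[0], SYNC_CHILD_READY) == 0;
        close(sv[0]);

        if (waitpid(child, &status, 0) < 0)
            return -1;
        return (ok && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
    }

    /* Usage from main(): assert(run_usermap_handshake() == 0); */
    ```

    The child blocks on the acknowledgement, so it cannot continue until the parent has finished the work that only the parent has the privileges to do. That ordering is what the PLS/ACK pairs in the diagram enforce.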
  • Original article: https://www.cnblogs.com/dream397/p/14093596.html