nsenter

https://x3fwy.bitcron.com/post/runc-malicious-container-escape

The `nsenter` package contains an `import "C"` statement and is built with
[cgo](https://golang.org/cmd/cgo/). In cgo, if the import of "C" is immediately preceded
by a comment, that comment, called the preamble, is used as a header when compiling the
C parts of the package. The preamble in `nsenter` declares a C constructor that calls
`nsexec()`, so `nsexec()` runs at program start in every binary that imports the package.
In runc, `nsenter` is imported only in `init.go`, so this C code does its work every time
the runc `init` command is invoked.

Because `nsexec()` must run before the Go runtime starts in order to set up
Linux kernel namespaces, you must `import` this library into a package if
you plan to use `libcontainer` directly. Otherwise Go will not execute
the `nsexec()` constructor, which means that the re-exec will not cause
the namespaces to be joined. You can import it like this:

```go
import _ "github.com/opencontainers/runc/libcontainer/nsenter"
```

init.go

func init() {
        if len(os.Args) > 1 && os.Args[1] == "init" {
                runtime.GOMAXPROCS(1)
                runtime.LockOSThread()

                level := os.Getenv("_LIBCONTAINER_LOGLEVEL")
                logLevel, err := logrus.ParseLevel(level)
                if err != nil {
                        panic(fmt.Sprintf("libcontainer: failed to parse log level: %q: %v", level, err))
                }

                err = logs.ConfigureLogging(logs.Config{
                        LogPipeFd: os.Getenv("_LIBCONTAINER_LOGPIPE"),
                        LogFormat: "json",
                        LogLevel:  logLevel,
                })
                if err != nil {
                        panic(fmt.Sprintf("libcontainer: failed to configure logging: %v", err))
                }
                logrus.Debugf("child process in init() and child pid is %d", os.Getpid())
        }
}

var initCommand = cli.Command{
        Name:  "init",
        Usage: `initialize the namespaces and launch the process (do not call it outside of runc)`,
        Action: func(context *cli.Context) error {
                factory, _ := libcontainer.New("")
                if err := factory.StartInitialization(); err != nil {
                        // as the error is sent back to the parent there is no need to log
                        // or write it to stderr because the parent process will handle this
                        os.Exit(1)
                }
                panic("libcontainer: container init failed to exec")
        },
}

main.go

app.Commands = []cli.Command{
        checkpointCommand,
        createCommand,
        deleteCommand,
        eventsCommand,
        execCommand,
        initCommand,
        killCommand,
        listCommand,
        pauseCommand,
        psCommand,
        restoreCommand,
        resumeCommand,
        runCommand,
        specCommand,
        startCommand,
        stateCommand,
        updateCommand,
}

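Analysis of the nsenter module

The nsenter module mainly deals with namespace management (joining the current process to specified namespaces or creating new ones), uid and gid mapping management, and console management.

It is implemented in both Go and C. The code lives in:

libcontainer/nsenter, with the core implementation in libcontainer/nsenter/nsexec.c.

Module entry point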

package nsenter

/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
	nsexec();
}
*/
import "C"

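When a package contains import _ "github.com/opencontainers/runc/libcontainer/nsenter", the C part of the package is compiled and linked into the resulting executable. The C code here defines a constructor, init(void). By the semantics of C constructors, a constructor runs before main, so init(void) runs as soon as the executable starts. As a result, nsexec() executes before the Go runtime and before any Go code.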
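
The ordering described above can be checked with a small standalone C sketch. It is not taken from runc; the names `early_ran`, `early_init`, `constructor_has_run` and `check_constructor_order` are hypothetical and exist only for this illustration. It assumes GCC or Clang, which support `__attribute__((constructor))`.

```c
#include <assert.h>
#include <stdio.h>

/* Flag flipped by the constructor. Static storage is zero-initialised
 * before any code in the program runs. */
static int early_ran = 0;

/* Functions marked as constructors are called before main() is entered. */
__attribute__((constructor)) static void early_init(void)
{
	early_ran = 1;
}

/* Returns 1 if the constructor has already executed, 0 otherwise. */
int constructor_has_run(void)
{
	return early_ran;
}

/* Aborts if the constructor has not run yet; prints a confirmation otherwise. */
void check_constructor_order(void)
{
	assert(early_ran == 1);
	puts("constructor ran before main");
}
```

Calling `check_constructor_order()` as the first statement of `main()` should pass the assertion, because `early_init()` has already set the flag by then. The cgo preamble in nsenter relies on the same mechanism to get `nsexec()` running ahead of Go code.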

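The nsexec function

Its main responsibilities are:

Set up the log pipe, used to forward log messages;
Set up the init pipe, used to receive namespace and other configuration data and to send back the child process PIDs;
Ensure a cloned binary, which mitigates CVE-2019-5736 by closing the security hole exposed through /proc/self/exe;
Read and parse the namespace and other configuration data passed in over the init pipe;
Update the OOM configuration;
Perform the double fork.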
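
As a rough illustration of the first two steps, the pipes are handed to the process as file descriptor numbers stored in environment variables (the init.go excerpt above reads `_LIBCONTAINER_LOGPIPE` this way). The sketch below shows one way such a value could be parsed in C. It is an assumption-laden example, not the code from nsexec.c: the helper name `fd_from_env` and its return conventions are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a file descriptor number from the environment variable `name`.
 * Returns the descriptor on success, -1 if the variable is unset or
 * empty, and -2 if its value is not a valid non-negative integer.
 */
int fd_from_env(const char *name)
{
	const char *val;
	char *end = NULL;
	long fd;

	assert(name != NULL);

	val = getenv(name);
	if (val == NULL || *val == '\0')
		return -1;

	errno = 0;
	fd = strtol(val, &end, 10);
	if (errno != 0 || *end != '\0' || fd < 0 || fd > INT_MAX)
		return -2;

	return (int)fd;
}
```

A caller could treat -1 as "this process was not started as a container init, so return to the Go runtime without doing anything" and -2 as a fatal configuration error. That split is a design choice of this sketch.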

ensure clone binary

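On first run, the contents of the original binary are copied into memory, and all subsequent executions use that in-memory copy. This removes the vulnerability that arises when the binary is modified while it is running.

The implementation, still to be analyzed in detail, is ensure_cloned_binary() in clone_binary.c. An excerpt from clone_binary():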

static int clone_binary(void)
{
    int binfd, execfd;
    struct stat statbuf = {};
    size_t sent = 0;
    int fdtype = EFD_NONE;

    /*
     * Before we resort to copying, let's try creating an ro-binfd in one shot
     * by getting a handle for a read-only bind-mount of the execfd.
     */
    execfd = try_bindfd();
    if (execfd >= 0)
        return execfd;

    /*
     * Dammit, that didn't work -- time to copy the binary to a safe place we
     * can seal the contents.
     */
    execfd = make_execfd(&fdtype);
    if (execfd < 0 || fdtype == EFD_NONE)
        return -ENOTRECOVERABLE;

    binfd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
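
The excerpt above stops where the copy begins. A minimal sketch of the copy-and-seal idea follows, assuming Linux with glibc 2.27 or later so that `memfd_create` is available. It is not runc's implementation: `sealed_copy_of_self`, `copy_all` and `COPY_SEALS` are hypothetical names, the choice of seals is an assumption of this sketch, and `EINTR` handling and the other fallbacks hinted at by `make_execfd()` are left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Seals that forbid any later change to the copy (assumed set). */
#define COPY_SEALS (F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)

/* Copy everything from src to dst. Returns 0 on success, -1 on error. */
static int copy_all(int src, int dst)
{
	char buf[65536];
	ssize_t n;

	assert(src >= 0 && dst >= 0);
	while ((n = read(src, buf, sizeof(buf))) > 0) {
		ssize_t done = 0;
		while (done < n) {
			ssize_t w = write(dst, buf + done, (size_t)(n - done));
			if (w <= 0)
				return -1;
			done += w;
		}
	}
	return n < 0 ? -1 : 0;
}

/*
 * Copy the running executable into an anonymous in-memory file and seal
 * it. Returns the sealed descriptor, or -1 on error.
 */
int sealed_copy_of_self(void)
{
	int src, dst;

	src = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
	if (src < 0)
		return -1;

	dst = memfd_create("self-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (dst < 0) {
		close(src);
		return -1;
	}

	if (copy_all(src, dst) < 0 || fcntl(dst, F_ADD_SEALS, COPY_SEALS) < 0) {
		close(src);
		close(dst);
		return -1;
	}

	close(src);
	return dst;
}
```

Once the seals are applied, writes to the descriptor are rejected by the kernel, so a process that later reaches the file through /proc/self/exe cannot modify it. A caller would typically re-execute itself from the returned descriptor, for example with `fexecve(3)`.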

double clone

nsexec clones the process twice.

The reason two clones are needed is explained in this comment from the source:

/*
	 * Okay, so this is quite annoying.
	 *
	 * In order for this unsharing code to be more extensible we need to split
	 * up unshare(CLONE_NEWUSER) and clone() in various ways. The ideal case
	 * would be if we did clone(CLONE_NEWUSER) and the other namespaces
	 * separately, but because of SELinux issues we cannot really do that. But
	 * we cannot just dump the namespace flags into clone(...) because several
	 * usecases (such as rootless containers) require more granularity around
	 * the namespace setup. In addition, some older kernels had issues where
	 * CLONE_NEWUSER wasn't handled before other namespaces (but we cannot
	 * handle this while also dealing with SELinux so we choose SELinux support
	 * over broken kernel support).
	 *
	 * However, if we unshare(2) the user namespace *before* we clone(2), then
	 * all hell breaks loose.
	 *
	 * The parent no longer has permissions to do many things (unshare(2) drops
	 * all capabilities in your old namespace), and the container cannot be set
	 * up to have more than one {uid,gid} mapping. This is obviously less than
	 * ideal. In order to fix this, we have to first clone(2) and then unshare.
	 *
	 * Unfortunately, it's not as simple as that. We have to fork to enter the
	 * PID namespace (the PID namespace only applies to children). Since we'll
	 * have to double-fork, this clone_parent() call won't be able to get the
	 * PID of the _actual_ init process (without doing more synchronisation than
	 * I can deal with at the moment). So we'll just get the parent to send it
	 * for us, the only job of this process is to update
	 * /proc/pid/{setgroups,uid_map,gid_map}.
	 *
	 * And as a result of the above, we also need to setns(2) in the first child
	 * because if we join a PID namespace in the topmost parent then our child
	 * will be in that namespace (and it will not be able to give us a PID value
	 * that makes sense without resorting to sending things with cmsg).
	 *
	 * This also deals with an older issue caused by dumping cloneflags into
	 * clone(2): On old kernels, CLONE_PARENT didn't work with CLONE_NEWPID, so
	 * we have to unshare(2) before clone(2) in order to do this. This was fixed
	 * in upstream commit 1f7f4dde5c945f41a7abc2285be43d918029ecc5, and was
	 * introduced by 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e. As far as we're
	 * aware, the last mainline kernel which had this bug was Linux 3.12.
	 * However, we cannot comment on which kernels the broken patch was
	 * backported to.
	 *
	 * -- Aleksa "what has my life come to?" Sarai
	 */

Counting the parent, three processes are involved. Their interaction sequence is as follows:

Title: How to clone init process
Parent->Child: clone first child
Note right of Child: join namespaces and unshare the user namespace
Child->Parent: send SYNC_USERMAP_PLS
Note left of Parent: update setgroups, uid_map and gid_map
Parent->Child: send SYNC_USERMAP_ACK
Note right of Child: unshare the other namespaces, except cgroup
Child->GrandChild: clone grandchild
Child->Parent: send SYNC_RECVPID_PLS
Note left of Parent: get the PIDs of the children
Parent->Child: send SYNC_RECVPID_ACK
Note left of Parent: send the PIDs of the children to its own parent (the runc create process)
Child->Parent: send SYNC_CHILD_READY
Note right of Child: finish
Parent->GrandChild: send SYNC_GRANDCHILD
Note left of Parent: wait for SYNC_CHILD_READY from GrandChild
Note right of GrandChild: set sid, uid, gid
Note right of GrandChild: unshare cgroup namespace
GrandChild->Parent: send SYNC_CHILD_READY
Note left of Parent: finish
Note right of GrandChild: let the Go runtime take over the process
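
The first exchange in the sequence above (SYNC_USERMAP_PLS, SYNC_USERMAP_ACK, then SYNC_CHILD_READY) can be sketched with a socket pair and a single `fork()`. This is a simplified illustration, not the code from nsexec.c. The numeric values of the messages are made up, the helper names are hypothetical, no namespaces are touched, and the grandchild is left out entirely.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative values only, not the constants used by nsexec.c. */
enum sync_msg {
	SYNC_USERMAP_PLS = 1,
	SYNC_USERMAP_ACK,
	SYNC_CHILD_READY,
};

/* Send one message. Returns 0 on success, -1 on error. */
static int send_msg(int fd, enum sync_msg msg)
{
	assert(fd >= 0);
	return write(fd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg) ? 0 : -1;
}

/* Receive one message and check that it is the expected one. */
static int expect_msg(int fd, enum sync_msg want)
{
	enum sync_msg got;

	assert(fd >= 0);
	if (read(fd, &got, sizeof(got)) != (ssize_t)sizeof(got))
		return -1;
	return got == want ? 0 : -1;
}

/* Run the parent/child handshake once. Returns 0 if both sides agree. */
int run_usermap_handshake(void)
{
	int sv[2], status, ok;
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: ask for the mappings, wait for the ack, report ready. */
		close(sv[0]);
		if (send_msg(sv[1], SYNC_USERMAP_PLS) < 0 ||
		    expect_msg(sv[1], SYNC_USERMAP_ACK) < 0 ||
		    send_msg(sv[1], SYNC_CHILD_READY) < 0)
			_exit(1);
		_exit(0);
	}

	/* Parent: a real parent would write the uid/gid maps before the ack. */
	close(sv[1]);
	ok = expect_msg(sv[0], SYNC_USERMAP_PLS) == 0 &&
	     send_msg(sv[0], SYNC_USERMAP_ACK) == 0 &&
	     expect_msg(sv[0], SYNC_CHILD_READY) == 0;
	close(sv[0]);

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return ok && WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

Each side blocks on `read()` until its peer has finished the previous step, which is what keeps the parent from acknowledging before the request arrives and the child from continuing before the mappings are in place.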

Original article: https://www.cnblogs.com/dream397/p/14093596.html