https://terassyi.net/posts/2020/04/14/gvisor.html
https://wenboshen.org/posts/2018-12-25-gvisor-inside.html
System calls
For Linux kernel, Anatomy of a system call, part 1 gives a good overview of how syscall is handled in kernel. MSR_LSTAR is a Model-Specific Registers, used to hold “Target RIP for the called procedure when SYSCALL is executed in 64-bit mode”, details in Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 4: Model-Specific Registers Table 2-2. On the latest kernel v4.20, syscall_init
sets MSR_LSTAR to be entry_SYSCALL_64
, which will jump to syscall according to the syscall number at do_syscall_64
.
For gvisor, from How gvisor trap to syscall handler in kvm platform, “On the KVM platform, system call interception works much like a normal OS. When running in guest mode, the platform sets MSR_LSTAR to a system call handler sysenter, which is invoked whenever an application (or the sentry itself) executes a SYSCALL instruction.”
SyscallTable
is a struct. All the implemented syscalls are listed in var AMD64
.
How does KVM system call redirection work?
- During setup, the sentry sets LSTAR to the syscall handler, sysenter (Just like any OS).
- In SwitchToUser, the sentry calls sysret (or iret), which saves the sentry stack state and executes SYSRET to switch to ring 3 and execute user code.
- User code eventually executes SYSCALL, and the core switches to (guest) ring 0 and jumps to sysenter.
- sysenter restores the sentry stack state and returns. This effectively makes the sysret() call in SwitchToUser "return", and the sentry runs the remainder of SwitchToUser.
- This ultimately makes Platform.Switch return in the core sentry, where we ultimately handle the syscall by selecting an implementation from the syscall table.
WARNING: DATA RACE Write at 0x00c00014f0c8 by goroutine 332: gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*queue).WriteFromBlocks() pkg/sentry/fs/tty/queue.go:246 +0x2a6 gvisor.googlesource.com/gvisor/pkg/sentry/safemem.Writer.WriteFromBlocks-fm() pkg/sentry/safemem/io.go:46 +0x75 gvisor.googlesource.com/gvisor/pkg/sentry/mm.(*MemoryManager).withInternalMappings() pkg/sentry/mm/io.go:503 +0x8ac gvisor.googlesource.com/gvisor/pkg/sentry/mm.(*MemoryManager).withVecInternalMappings() pkg/sentry/mm/io.go:572 +0x964 gvisor.googlesource.com/gvisor/pkg/sentry/mm.(*MemoryManager).CopyInTo() pkg/sentry/mm/io.go:309 +0x1f1 gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*queue).write() pkg/sentry/usermem/usermem.go:543 +0x164 gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*lineDiscipline).inputQueueWrite() pkg/sentry/fs/tty/line_discipline.go:205 +0x147 gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*masterFileOperations).Write() pkg/sentry/fs/tty/master.go:141 +0x11c gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*File).Writev() pkg/sentry/fs/file.go:314 +0x1fc gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.writev() pkg/sentry/syscalls/linux/sys_write.go:261 +0xe0 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Write() pkg/sentry/syscalls/linux/sys_write.go:71 +0x293 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall() pkg/sentry/kernel/task_syscall.go:165 +0x407 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke() pkg/sentry/kernel/task_syscall.go:283 +0xb4 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter() pkg/sentry/kernel/task_syscall.go:244 +0x109 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall() pkg/sentry/kernel/task_syscall.go:219 +0x1b6 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute() pkg/sentry/kernel/task_run.go:215 +0x1852 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run() pkg/sentry/kernel/task_run.go:91 +0x2e5 Previous read at 0x00c00014f0c8 by goroutine 113: gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*lineDiscipline).masterReadiness() pkg/sentry/fs/tty/queue.go:121 +0x43 gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*masterFileOperations).Readiness() pkg/sentry/fs/tty/master.go:131 +0x71 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls.(*PollFD).initReadiness() pkg/sentry/fs/file.go:199 +0x2d0 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls.Poll() pkg/sentry/syscalls/polling.go:96 +0x139 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.doPoll() pkg/sentry/syscalls/linux/sys_poll.go:70 +0x2ac gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Ppoll() pkg/sentry/syscalls/linux/sys_poll.go:343 +0x113 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall() pkg/sentry/kernel/task_syscall.go:165 +0x407 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke() pkg/sentry/kernel/task_syscall.go:283 +0xb4 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter() pkg/sentry/kernel/task_syscall.go:244 +0x109 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall() pkg/sentry/kernel/task_syscall.go:219 +0x1b6 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute() pkg/sentry/kernel/task_run.go:215 +0x1852 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run() pkg/sentry/kernel/task_run.go:91 +0x2e5
gvisor/pkg/sentry/kernel/syscalls.go
// SyscallTable is a lookup table of system calls. // // Note that a SyscallTable is not savable directly. Instead, they are saved as // an OS/Arch pair and lookup happens again on restore. type SyscallTable struct { // OS is the operating system that this syscall table implements. OS abi.OS // Arch is the architecture that this syscall table targets. Arch arch.Arch // The OS version that this syscall table implements. Version Version // AuditNumber is a numeric constant that represents the syscall table. If // non-zero, auditNumber must be one of the AUDIT_ARCH_* values defined by // linux/audit.h. AuditNumber uint32 // Table is the collection of functions. Table map[uintptr]Syscall // lookup is a fixed-size array that holds the syscalls (indexed by // their numbers). It is used for fast look ups. lookup []SyscallFn // Emulate is a collection of instruction addresses to emulate. The // keys are addresses, and the values are system call numbers. Emulate map[usermem.Addr]uintptr // The function to call in case of a missing system call. Missing MissingFn // Stracer traces this syscall table. Stracer Stracer // External is used to handle an external callback. External func(*Kernel) // ExternalFilterBefore is called before External is called before the syscall is executed. // External is not called if it returns false. ExternalFilterBefore func(*Task, uintptr, arch.SyscallArguments) bool // ExternalFilterAfter is called before External is called after the syscall is executed. // External is not called if it returns false. ExternalFilterAfter func(*Task, uintptr, arch.SyscallArguments) bool // FeatureEnable stores the strace and one-shot enable bits. FeatureEnable SyscallFlagsTable }
// Init initializes the system call table. // // This should normally be called only during registration. func (s *SyscallTable) Init() { if s.Table == nil { // Ensure non-nil lookup table. s.Table = make(map[uintptr]Syscall) } if s.Emulate == nil { // Ensure non-nil emulate table. s.Emulate = make(map[usermem.Addr]uintptr) } max := s.MaxSysno() // Checked during RegisterSyscallTable. // Initialize the fast-lookup table. s.lookup = make([]SyscallFn, max+1) for num, sc := range s.Table { s.lookup[num] = sc.Fn //syscll生成lookup } // Initialize all features. s.FeatureEnable.init(s.Table, max) }
gvisor/pkg/sentry/syscalls/linux/linux64.go
// AMD64 is a table of Linux amd64 syscall API with the corresponding syscall // numbers from Linux 4.4. var AMD64 = &kernel.SyscallTable{ OS: abi.Linux, Arch: arch.AMD64, Version: kernel.Version{ // Version 4.4 is chosen as a stable, longterm version of Linux, which // guides the interface provided by this syscall table. The build // version is that for a clean build with default kernel config, at 5 // minutes after v4.4 was tagged. Sysname: LinuxSysname, Release: LinuxRelease, Version: LinuxVersion, }, AuditNumber: linux.AUDIT_ARCH_X86_64, Table: map[uintptr]kernel.Syscall{ 0: syscalls.Supported("read", Read), 1: syscalls.Supported("write", Write), 2: syscalls.PartiallySupported("open", Open, "Options O_DIRECT, O_NOATIME, O_PATH, O_TMPFILE, O_SYNC are not supported.", nil), 3: syscalls.Supported("close", Close), 4: syscalls.Supported("stat", Stat), 5: syscalls.Supported("fstat", Fstat), 6: syscalls.Supported("lstat", Lstat), 7: syscalls.Supported("poll", Poll), 8: syscalls.Supported("lseek", Lseek), 9: syscalls.PartiallySupported("mmap", Mmap, "Generally supported with exceptions. Options MAP_FIXED_NOREPLACE, MAP_SHARED_VALIDATE, MAP_SYNC MAP_GROWSDOWN, MAP_HUGETLB are not supported.", nil), 10: syscalls.Supported("mprotect", Mprotect), 11: syscalls.Supported("munmap", Munmap), 12: syscalls.Supported("brk", Brk), 13: syscalls.Supported("rt_sigaction", RtSigaction), 14: syscalls.Supported("rt_sigprocmask", RtSigprocmask), 15: syscalls.Supported("rt_sigreturn", RtSigreturn), 16: syscalls.PartiallySupported("ioctl", Ioctl, "Only a few ioctls are implemented for backing devices and file systems.", nil), 17: syscalls.Supported("pread64", Pread64), 18: syscalls.Supported("pwrite64", Pwrite64), 19: syscalls.Supported("readv", Readv), 20: syscalls.Supported("writev", Writev), 21: syscalls.Supported("access", Access), 22: syscalls.Supported("pipe", Pipe), 23: syscalls.Supported("select", Select), 24: syscalls.Supported("sched_yield", SchedYield), 25: syscalls.Supported("mremap", Mremap),
goroutine 974 [running]: panic(0x10a1140, 0xc00043c070) GOROOT/src/runtime/panic.go:1064 +0x470 fp=0xc000985508 sp=0xc000985450 pc=0x437030 gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*inodeRefs).IncRef(0xc000d6e008) bazel-out/k8-fastbuild-ST-3bfd66f45e612c1a5c797474a25664e227d81bf914f3b08a40e00b2e2692afa4/bin/pkg/sentry/fsimpl/tmpfs/inode_refs.go:88 +0x18c fp=0xc000985580 sp=0xc000985508 pc=0x92828c gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*inode).incRef(...) pkg/sentry/fsimpl/tmpfs/tmpfs.go:512 gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*dentry).IncRef(0xc0000c4aa0) pkg/sentry/fsimpl/tmpfs/tmpfs.go:357 +0x49 fp=0xc000985598 sp=0xc000985580 pc=0x92ef89 gvisor.dev/gvisor/pkg/sentry/vfs.(*Dentry).IncRef(...) pkg/sentry/vfs/dentry.go:150 gvisor.dev/gvisor/pkg/sentry/vfs.(*FileDescription).Init(0xc000d66500, 0x140d420, 0xc000d66500, 0xc000008241, 0xc000532660, 0xc0000c4aa0, 0xc000985624, 0x47a03f, 0xc000557358) pkg/sentry/vfs/file_description.go:151 +0x167 fp=0xc0009855c0 sp=0xc000985598 pc=0x7d3c87 gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*dentry).open(0xc0000c4aa0, 0x1402d60, 0xc000bdaa80, 0xc000d6a000, 0xc000985878, 0x1, 0x0, 0x0, 0x0) pkg/sentry/fsimpl/tmpfs/filesystem.go:584 +0x1dd fp=0xc000985660 sp=0xc0009855c0 pc=0x923abd gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*filesystem).OpenAt(0xc000557300, 0x1402d60, 0xc000bdaa80, 0xc000d6a000, 0x8241, 0x0, 0x0, 0x0) pkg/sentry/fsimpl/tmpfs/filesystem.go:519 +0xa1e fp=0xc000985858 sp=0xc000985660 pc=0x92309e gvisor.dev/gvisor/pkg/sentry/vfs.(*VirtualFilesystem).OpenAt(0xc000228908, 0x1402d60, 0xc000bdaa80, 0xc000cec300, 0xc000985aa0, 0xc000985a88, 0x100, 0xc000532420, 0xc0002ac000) pkg/sentry/vfs/vfs.go:515 +0x1ee fp=0xc0009859e8 sp=0xc000985858 pc=0x7ebe6e gvisor.dev/gvisor/pkg/sentry/syscalls/linux/vfs2.openat(0xc000bdaa80, 0x2b4bffffff9c, 0x20000180, 0x241, 0x0, 0x0, 0x0, 0x0, 0x0) pkg/sentry/syscalls/linux/vfs2/filesystem.go:219 +0x2bc fp=0xc000985b38 sp=0xc0009859e8 pc=0xe4d2bc gvisor.dev/gvisor/pkg/sentry/syscalls/linux/vfs2.Creat(0xc000bdaa80, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...) pkg/sentry/syscalls/linux/vfs2/filesystem.go:200 +0x71 fp=0xc000985b90 sp=0xc000985b38 pc=0xe4cfb1 gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).executeSyscall(0xc000bdaa80, 0x55, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0xea72d7, 0x1272f60, ...) pkg/sentry/kernel/task_syscall.go:116 +0x1b9 fp=0xc000985c50 sp=0xc000985b90 pc=0xa470f9 gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke(0xc000bdaa80, 0x55, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) pkg/sentry/kernel/task_syscall.go:291 +0x70 fp=0xc000985cd8 sp=0xc000985c50 pc=0xa48410 gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter(0xc000bdaa80, 0x55, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) pkg/sentry/kernel/task_syscall.go:238 +0xb4 fp=0xc000985d38 sp=0xc000985cd8 pc=0xa47eb4 gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscall(0xc000bdaa80, 0x2, 0xc000bdaa80) pkg/sentry/kernel/task_syscall.go:205 +0x198 fp=0xc000985e08 sp=0xc000985d38 pc=0xa47798 gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute(0x0, 0xc000bdaa80, 0x13d5ba0, 0x0) pkg/sentry/kernel/task_run.go:327 +0xd8c fp=0xc000985f60 sp=0xc000985e08 pc=0xa3a10c gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run(0xc000bdaa80, 0x2d) pkg/sentry/kernel/task_run.go:100 +0x1e2 fp=0xc000985fd0 sp=0xc000985f60 pc=0xa38c02 runtime.goexit() src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000985fd8 sp=0xc000985fd0 pc=0x4705a1 created by gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).Start pkg/sentry/kernel/task_start.go:374 +0x116
pkg/sentry/syscalls/linux/sys_thread.go
// Fork implements Linux syscall fork(2). func Fork(t *kernel.Task, args arch.SyscallArguments) (uintptr, *kernel.SyscallControl, error) { // "A call to fork() is equivalent to a call to clone(2) specifying flags // as just SIGCHLD." - fork(2) return clone(t, int(syscall.SIGCHLD), 0, 0, 0, 0) }
pkg/sentry/kernel/task_clone.go
func clone(t *kernel.Task, flags int, stack usermem.Addr, parentTID usermem.Addr, childTID usermem.Addr, tls usermem.Addr) (uintptr, *kernel.SyscallControl, error) { opts := kernel.CloneOptions{ SharingOptions: kernel.SharingOptions{ NewAddressSpace: flags&syscall.CLONE_VM == 0, NewSignalHandlers: flags&syscall.CLONE_SIGHAND == 0, NewThreadGroup: flags&syscall.CLONE_THREAD == 0, TerminationSignal: linux.Signal(flags & exitSignalMask), NewPIDNamespace: flags&syscall.CLONE_NEWPID == syscall.CLONE_NEWPID, NewUserNamespace: flags&syscall.CLONE_NEWUSER == syscall.CLONE_NEWUSER, NewNetworkNamespace: flags&syscall.CLONE_NEWNET == syscall.CLONE_NEWNET, NewFiles: flags&syscall.CLONE_FILES == 0, NewFSContext: flags&syscall.CLONE_FS == 0, NewUTSNamespace: flags&syscall.CLONE_NEWUTS == syscall.CLONE_NEWUTS, NewIPCNamespace: flags&syscall.CLONE_NEWIPC == syscall.CLONE_NEWIPC, }, Stack: stack, SetTLS: flags&syscall.CLONE_SETTLS == syscall.CLONE_SETTLS, TLS: tls, ChildClearTID: flags&syscall.CLONE_CHILD_CLEARTID == syscall.CLONE_CHILD_CLEARTID, ChildSetTID: flags&syscall.CLONE_CHILD_SETTID == syscall.CLONE_CHILD_SETTID, ChildTID: childTID, ParentSetTID: flags&syscall.CLONE_PARENT_SETTID == syscall.CLONE_PARENT_SETTID, ParentTID: parentTID, Vfork: flags&syscall.CLONE_VFORK == syscall.CLONE_VFORK, Untraced: flags&syscall.CLONE_UNTRACED == syscall.CLONE_UNTRACED, InheritTracer: flags&syscall.CLONE_PTRACE == syscall.CLONE_PTRACE, } ntid, ctrl, err := t.Clone(&opts) return uintptr(ntid), ctrl, err }
log.Infof("Process should have started...") l.watchdog.Start() return l.k.Start()
ype Loader struct { // k is the kernel. k *kernel.Kernel
// // threadID a dummy value set to the task's TID in the root PID namespace to // make it visible in stack dumps. A goroutine for a given task can be identified // searching for Task.run()'s argument value. func (t *Task) run(threadID uintptr) { atomic.StoreInt64(&t.goid, goid.Get()) // Construct t.blockingTimer here. We do this here because we can't // reconstruct t.blockingTimer during restore in Task.afterLoad(), because // kernel.timekeeper.SetClocks() hasn't been called yet. blockingTimerNotifier, blockingTimerChan := ktime.NewChannelNotifier() t.blockingTimer = ktime.NewTimer(t.k.MonotonicClock(), blockingTimerNotifier) defer t.blockingTimer.Destroy() t.blockingTimerChan = blockingTimerChan // Activate our address space. t.Activate() // The corresponding t.Deactivate occurs in the exit path // (runExitMain.execute) so that when // Platform.CooperativelySharesAddressSpace() == true, we give up the // AddressSpace before the task goroutine finishes executing. // If this is a newly-started task, it should check for participation in // group stops. If this is a task resuming after restore, it was // interrupted by saving. In either case, the task is initially // interrupted. t.interruptSelf() for { // Explanation for this ordering: // // - A freshly-started task that is stopped should not do anything // before it enters the stop. // // - If taskRunState.execute returns nil, the task goroutine should // exit without checking for a stop. // // - Task.Start won't start Task.run if t.runState is nil, so this // ordering is safe. t.doStop() t.runState = t.runState.execute(t) if t.runState == nil { t.accountTaskGoroutineEnter(TaskGoroutineNonexistent) t.goroutineStopped.Done() t.tg.liveGoroutines.Done() t.tg.pidns.owner.liveGoroutines.Done() t.tg.pidns.owner.runningGoroutines.Done() t.p.Release() // Deferring this store triggers a false positive in the race // detector (https://github.com/golang/go/issues/42599). atomic.StoreInt64(&t.goid, 0) // Keep argument alive because stack trace for dead variables may not be correct. runtime.KeepAlive(threadID) return } } }
func (ts *TaskSet) newTask(cfg *TaskConfig) (*Task, error) { tg := cfg.ThreadGroup image := cfg.TaskImage t := &Task{ taskNode: taskNode{ tg: tg, parent: cfg.Parent, children: make(map[*Task]struct{}), }, runState: (*runApp)(nil), interruptChan: make(chan struct{}, 1), signalMask: cfg.SignalMask, signalStack: arch.SignalStack{Flags: arch.SignalStackFlagDisable}, image: *image, fsContext: cfg.FSContext, fdTable: cfg.FDTable, p: cfg.Kernel.Platform.NewContext(), k: cfg.Kernel, ptraceTracees: make(map[*Task]struct{}), allowedCPUMask: cfg.AllowedCPUMask.Copy(), ioUsage: &usage.IO{}, niceness: cfg.Niceness, netns: cfg.NetworkNamespace, utsns: cfg.UTSNamespace, ipcns: cfg.IPCNamespace, abstractSockets: cfg.AbstractSocketNamespace, mountNamespaceVFS2: cfg.MountNamespaceVFS2, rseqCPU: -1, rseqAddr: cfg.RSeqAddr, rseqSignature: cfg.RSeqSignature, futexWaiter: futex.NewWaiter(), containerID: cfg.ContainerID, }
pkg/sentry/kernel/task_run.go
func (*runApp) execute(t *Task) taskRunState { ... switch err { case nil: // Handle application system call. return t.doSyscall() ... }
doSyscall pkg/sentry/kernel/task_syscall.go
// doSyscall is the entry point for an invocation of a system call specified by // the current state of t's registers. // // The syscall path is very hot; avoid defer. func (t *Task) doSyscall() taskRunState { sysno := t.Arch().SyscallNo() args := t.Arch().SyscallArgs() // Tracers expect to see this between when the task traps into the kernel // to perform a syscall and when the syscall is actually invoked. // This useless-looking temporary is needed because Go. tmp := uintptr(syscall.ENOSYS) t.Arch().SetReturn(-tmp) // Check seccomp filters. The nil check is for performance (as seccomp use // is rare), not needed for correctness. if t.syscallFilters.Load() != nil { switch r := t.checkSeccompSyscall(int32(sysno), args, usermem.Addr(t.Arch().IP())); r { case linux.SECCOMP_RET_ERRNO, linux.SECCOMP_RET_TRAP: t.Debugf("Syscall %d: denied by seccomp", sysno) return (*runSyscallExit)(nil) case linux.SECCOMP_RET_ALLOW: // ok case linux.SECCOMP_RET_KILL_THREAD: t.Debugf("Syscall %d: killed by seccomp", sysno) t.PrepareExit(ExitStatus{Signo: int(linux.SIGSYS)}) return (*runExit)(nil) case linux.SECCOMP_RET_TRACE: t.Debugf("Syscall %d: stopping for PTRACE_EVENT_SECCOMP", sysno) return (*runSyscallAfterPtraceEventSeccomp)(nil) default: panic(fmt.Sprintf("Unknown seccomp result %d", r)) } } return t.doSyscallEnter(sysno, args) }
func (t *Task) executeSyscall(sysno uintptr, args arch.SyscallArguments) (rval uintptr, ctrl *SyscallControl, err error) { s := t.SyscallTable() fe := s.FeatureEnable.Word(sysno) var straceContext interface{} if bits.IsAnyOn32(fe, StraceEnableBits) { straceContext = s.Stracer.SyscallEnter(t, sysno, args, fe) } if bits.IsOn32(fe, ExternalBeforeEnable) && (s.ExternalFilterBefore == nil || s.ExternalFilterBefore(t, sysno, args)) { t.invokeExternal() // Ensure we check for stops, then invoke the syscall again. ctrl = ctrlStopAndReinvokeSyscall } else { fn := s.Lookup(sysno) if fn != nil { // Call our syscall implementation. rval, ctrl, err = fn(t, args) } else { // Use the missing function if not found. rval, err = t.SyscallTable().Missing(t, sysno, args) } } if bits.IsOn32(fe, ExternalAfterEnable) && (s.ExternalFilterAfter == nil || s.ExternalFilterAfter(t, sysno, args)) { t.invokeExternal() // Don't reinvoke the syscall. } if bits.IsAnyOn32(fe, StraceEnableBits) { s.Stracer.SyscallExit(straceContext, t, sysno, rval, err) } return }
pkg/sentry/syscalls/linux/vfs2/fscontext.go
// Chdir implements Linux syscall chdir(2). func Chdir(t *kernel.Task, args arch.SyscallArguments) (uintptr, *kernel.SyscallControl, error) { addr := args[0].Pointer() path, err := copyInPath(t, addr) if err != nil { return 0, nil, err } tpop, err := getTaskPathOperation(t, linux.AT_FDCWD, path, disallowEmptyPath, followFinalSymlink) if err != nil { return 0, nil, err } defer tpop.Release(t) vd, err := t.Kernel().VFS().GetDentryAt(t, t.Credentials(), &tpop.pop, &vfs.GetDentryOptions{ CheckSearchable: true, }) if err != nil { return 0, nil, err } t.FSContext().SetWorkingDirectoryVFS2(t, vd) vd.DecRef(t) return 0, nil, nil }
root@cloud:~/onlyGvisor# ps -elf | grep 947729
4 S nobody 947729 947703 0 80 0 - 68452015012 sys_po 11:17 ? 00:00:00 runsc-sandbox --root=/var/run/docker/runtime-runsc-kvm/moby --log=/run/containerd/io.containerd.runtime.v1.linux/moby/b0b3e6a9f9c469275fe320d9b2b433902337cd66993b793ac79121e911d5bf88/log.json --log-format=json --platform=kvm --log-fd=3 boot --bundle=/run/containerd/io.containerd.runtime.v1.linux/moby/b0b3e6a9f9c469275fe320d9b2b433902337cd66993b793ac79121e911d5bf88 --controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --io-fds=9 --io-fds=10 --io-fds=11 --device-fd=12 --stdio-fds=13 --stdio-fds=14 --stdio-fds=15 --pidns=true --cpu-num 64 b0b3e6a9f9c469275fe320d9b2b433902337cd66993b793ac79121e911d5bf88
0 S root 948093 947631 0 80 0 - 1418 pipe_r 11:33 pts/3 00:00:00 grep --color=auto 947729
root@cloud:~/onlyGvisor# docker inspect test2 | grep Pid | head -n 1
"Pid": 947729,
root@cloud:~/onlyGvisor#
gdb Socket
root@cloud:/gvisor# docker run --runtime=runsc-kvm --rm --name=test -d alpine sleep 1000 1076ade686c4ccea6e8c40e6d6881e4f5e9c403ff21aab9febf4557218a10e17 root@cloud:/gvisor# docker inspect test | grep Pid | head -n 1 "Pid": 927424, root@cloud:/gvisor# docker exec -it test ping 8.8.8.8 PING 8.8.8.8 (8.8.8.8): 56 data bytes 64 bytes from 8.8.8.8: seq=1 ttl=42 time=57.058 ms 64 bytes from 8.8.8.8: seq=2 ttl=42 time=56.148 ms 64 bytes from 8.8.8.8: seq=3 ttl=42 time=56.321 ms 64 bytes from 8.8.8.8: seq=4 ttl=42 time=69.416 ms 64 bytes from 8.8.8.8: seq=5 ttl=42 time=55.813 ms 64 bytes from 8.8.8.8: seq=6 ttl=42 time=68.444 ms 64 bytes from 8.8.8.8: seq=7 ttl=42 time=56.031 ms ^C --- 8.8.8.8 ping statistics --- 8 packets transmitted, 7 packets received, 12% packet loss round-trip min/avg/max = 55.813/59.890/69.416 ms root@cloud:/gvisor# docker exec -it test ping 8.8.8.8 PING 8.8.8.8 (8.8.8.8): 56 data bytes 64 bytes from 8.8.8.8: seq=0 ttl=42 time=111.545 ms 64 bytes from 8.8.8.8: seq=1 ttl=42 time=55.150 ms 64 bytes from 8.8.8.8: seq=2 ttl=42 time=55.362 ms 64 bytes from 8.8.8.8: seq=3 ttl=42 time=58.652 ms 64 bytes from 8.8.8.8: seq=4 ttl=42 time=56.521 ms 64 bytes from 8.8.8.8: seq=5 ttl=42 time=55.958 ms 64 bytes from 8.8.8.8: seq=6 ttl=42 time=55.386 ms 64 bytes from 8.8.8.8: seq=7 ttl=42 time=54.869 ms 64 bytes from 8.8.8.8: seq=8 ttl=42 time=54.373 ms 64 bytes from 8.8.8.8: seq=9 ttl=42 time=74.912 ms 64 bytes from 8.8.8.8: seq=10 ttl=42 time=55.755 ms
root@cloud:/mycontainer# dlv attach 927424 Type 'help' for list of commands. (dlv) b Socket Command failed: Location "Socket" ambiguous: golang.org/x/sys/unix.Socket, syscall.Socket, type..eq.gvisor.dev/gvisor/pkg/unet.Socket, gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket, gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*provider).Socket… (dlv) c received SIGINT, stopping process (will not forward signal) > syscall.Syscall6() src/syscall/asm_linux_arm64.s:43 (PC: 0x8dccc) Warning: debugging optimized function (dlv) b linux.Socket Breakpoint 1 set at 0x587f30 for gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket() pkg/sentry/syscalls/linux/sys_socket.go:172 (dlv) b netstack.Socket Command failed: Location "netstack.Socket" ambiguous: gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*provider).Socket, gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*providerVFS2).Socket… (dlv) b netstack.(*provider).Socket Breakpoint 2 set at 0x647270 for gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*provider).Socket() pkg/sentry/socket/netstack/provider.go:94 (dlv) b netstack.(*providerVFS2).Socket Breakpoint 3 set at 0x647960 for gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*providerVFS2).Socket() pkg/sentry/socket/netstack/provider_vfs2.go:38 (dlv) c > gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket() pkg/sentry/syscalls/linux/sys_socket.go:172 (hits goroutine(306):1 total:1) (PC: 0x587f30) Warning: debugging optimized function (dlv) bt 0 0x0000000000587f30 in gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket at pkg/sentry/syscalls/linux/sys_socket.go:172 1 0x0000000000522ea4 in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).executeSyscall at pkg/sentry/kernel/task_syscall.go:104 2 0x0000000000523c5c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke at pkg/sentry/kernel/task_syscall.go:239 3 0x00000000005238dc in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter at pkg/sentry/kernel/task_syscall.go:199 4 0x00000000005233e0 in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscall at pkg/sentry/kernel/task_syscall.go:174 5 0x0000000000518e00 in gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute at pkg/sentry/kernel/task_run.go:282 6 0x0000000000517d9c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run at pkg/sentry/kernel/task_run.go:97 7 0x0000000000077c84 in runtime.goexit at src/runtime/asm_arm64.s:1136
entersyscall + exitsyscall
entersyscall() bluepill(c) vector = c.CPU.SwitchToUser(switchOpts) exitsyscall()
//go:linkname entersyscall runtime.entersyscall
func entersyscall() ------
//go:linkname exitsyscall runtime.exitsyscall
func exitsyscall()
gvisor中 entersyscall 和exitsyscall使用的是runtime的
oot@cloud:~/onlyGvisor# dlv attach 947729 Type 'help' for list of commands. (dlv) b entersyscall Breakpoint 1 set at 0x72780 for runtime.entersyscall() GOROOT/src/runtime/proc.go:3126 (dlv) c > runtime.entersyscall() GOROOT/src/runtime/proc.go:3126 (hits goroutine(203):1 total:1) (PC: 0x72780) Warning: debugging optimized function (dlv) bt 0 0x0000000000072780 in runtime.entersyscall at GOROOT/src/runtime/proc.go:3126 1 0x000000000008dcb0 in syscall.Syscall6 at src/syscall/asm_linux_arm64.s:35 2 0x00000000005458e8 in gvisor.dev/gvisor/pkg/fdnotifier.epollWait at pkg/fdnotifier/poll_unsafe.go:76 3 0x0000000000545564 in gvisor.dev/gvisor/pkg/fdnotifier.(*notifier).waitAndNotify at pkg/fdnotifier/fdnotifier.go:149 4 0x0000000000077c84 in runtime.goexit at src/runtime/asm_arm64.s:1136 (dlv) clearall Breakpoint 1 cleared at 0x72780 for runtime.entersyscall() GOROOT/src/runtime/proc.go:3126 (dlv) b pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248 Breakpoint 2 set at 0x87f504 for gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser() pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248 (dlv) c > gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser() pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248 (hits goroutine(94):1 total:1) (PC: 0x87f504) Warning: debugging optimized function (dlv) bt 0 0x000000000087f504 in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser at pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248 1 0x000000000087bb1c in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*context).Switch at pkg/sentry/platform/kvm/context.go:75 2 0x00000000005186d0 in gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute at pkg/sentry/kernel/task_run.go:271 3 0x0000000000517d9c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run at pkg/sentry/kernel/task_run.go:97 4 0x0000000000077c84 in runtime.goexit at src/runtime/asm_arm64.s:1136 (dlv) s > gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser() pkg/sentry/platform/kvm/machine_arm64_unsafe.go:249 (PC: 0x87f508) Warning: debugging optimized function (dlv) bt 0 0x000000000087f508 in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser at pkg/sentry/platform/kvm/machine_arm64_unsafe.go:249 1 0x000000000087bb1c in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*context).Switch at pkg/sentry/platform/kvm/context.go:75 2 0x00000000005186d0 in gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute at pkg/sentry/kernel/task_run.go:271 3 0x0000000000517d9c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run at pkg/sentry/kernel/task_run.go:97 4 0x0000000000077c84 in runtime.goexit at src/runtime/asm_arm64.s:1136 (dlv) s Stopped at: 0x881b80 =>no source available