Reading Notes: How to Add a System Call to OpenSolaris


Original author: Eric Schrock
Original post: http://blogs.sun.com/roller/page/eschrock

Translator and annotator: Badcoffee
Email: blog.oliver@gmail.com
Blog: http://blog.csdn.net/yayong
July 2005

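Foreword: Adding a simple system call to an operating system is a good way to get familiar with its kernel code. This article gives the basic steps for adding a system call to the Solaris kernel.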

How to add a system call to OpenSolaris

When I first started in the Solaris group, I was faced with two equally difficult tasks: learning the development model, and understanding the source code. For both these tasks, the recommended method is usually picking a small bug and working through the process. For the curious, the first bug I putback to ON was 4912227 (ptree call returns zero on failure), a simple bug with near zero risk. It was the first step down a very long road.

As another first step, someone suggested adding a very simple system call to the kernel. This turned out to be a whole lot harder than one would expect, and has so many subtle aspects that experienced Solaris engineers (myself included) still miss some of the necessary changes. With that in mind, I thought a reasonable first OpenSolaris blog would be describing exactly how to add a new system call to the kernel.

For the purposes of this post, we will assume that it's a simple system call that lives in the generic kernel code, and we'll put the code into an existing file to avoid having to deal with Makefiles. The goal is to print an arbitrary message to the console whenever the system call is issued.

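Notes:
1. Solaris development poses two difficulties: learning the Solaris development model (that is, the process side of things) and understanding the Solaris source code. A good way to get familiar with the process is to pick a very small Solaris bug and work it through.
2. For understanding the Solaris source code, a good start is adding a very simple system call. This is harder than it sounds: there are many subtle details that even experienced Solaris engineers miss. The author takes this as the starting point and describes how to add a system call to the Solaris kernel.
3. To keep things as simple as possible, the author puts the new system call's code into an existing source file, which avoids changing any Makefile. The new system call simply prints an arbitrary message to the console when it is invoked.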

1. Picking a syscall number

Before writing any real code, we first have to pick a number that will represent our system call. The main source of documentation here is syscall.h, which describes all the available system call numbers, as well as which ones are reserved. The maximum number of syscalls is currently 256 (NSYSCALL), which doesn't leave much space for new ones. This could theoretically be extended - I believe the hard limit is in the size of sysset_t, whose 16 integers must be able to represent a complete bitmask of all system calls. This puts our actual limit at 16*32, or 512, system calls. But for the purposes of our tutorial, we'll pick system call number 56, which is currently unused. For my own amusement, we'll name our (my?) system call 'schrock'. So first we add the following line to syscall.h:
```
#define SYS_uadmin      55
#define SYS_schrock     56
#define SYS_utssys      57
```
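Notes:
4. Step 1: choose a system call number and add a definition for it to syscall.h, the header that lists all system call numbers currently available on the system.
5. The maximum number of system calls is defined by NSYSCALL in systm.h. Its current value is 256, and most of those numbers are already taken, so little room is left for new calls.
6. In theory the maximum could be raised, but sysset_t imposes a limit: it is the bitmask covering all system calls, and its definition in syscall.h shows that it holds at most 16*32 = 512 bits.
7. To keep things simple, the author uses number 56, which happens to be unused, and names the system call "schrock".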
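
The 16*32 = 512 limit is easier to see with a small user-space sketch of such a bitmask. This is not the kernel's sysset_t or its accessor macros: demo_sysset_t, demo_set_add() and demo_set_contains() are hypothetical names, and the bit numbering used here (bit index equals the system call number) is an assumption that may differ from the real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-word system call bitmask. */
#define	DEMO_WORDS		16
#define	DEMO_BITS_PER_WORD	32
#define	DEMO_MAX_SYSCALLS	(DEMO_WORDS * DEMO_BITS_PER_WORD)	/* 512 */

typedef struct {
	uint32_t word[DEMO_WORDS];
} demo_sysset_t;

/* Mark system call `num` in the set; returns 0, or -1 if out of range. */
static int
demo_set_add(demo_sysset_t *set, int num)
{
	assert(set != NULL);
	if (num < 0 || num >= DEMO_MAX_SYSCALLS)
		return (-1);
	set->word[num / DEMO_BITS_PER_WORD] |=
	    (uint32_t)1 << (num % DEMO_BITS_PER_WORD);
	return (0);
}

/* Return 1 if system call `num` is in the set, 0 otherwise. */
static int
demo_set_contains(const demo_sysset_t *set, int num)
{
	assert(set != NULL);
	if (num < 0 || num >= DEMO_MAX_SYSCALLS)
		return (0);
	return ((int)((set->word[num / DEMO_BITS_PER_WORD] >>
	    (num % DEMO_BITS_PER_WORD)) & 1));
}
```

With this layout, number 56 lands in word 1 at bit 24, and any number of 512 or more has no bit to occupy, which is the hard limit the article refers to.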

2. Writing the syscall handler

Next, we have to actually add the function that will get called when we invoke the system call. What we should really do is add a new file schrock.c to usr/src/uts/common/syscall, but I'm trying to avoid Makefiles. Instead, we'll just stick it in getpid.c:
```
#include <sys/cmn_err.h>

int
schrock(void *arg)
{
	char	buf[1024];
	size_t	len;

	if (copyinstr(arg, buf, sizeof (buf), &len) != 0)
		return (set_errno(EFAULT));

	cmn_err(CE_WARN, "%s", buf);

	return (0);
}
```
Note that declaring a buffer of 1024 bytes on the stack is a very bad thing to do in the kernel. We have limited stack space, and a stack overflow will result in a panic. We also don't check that the length of the string was less than our scratch space. But this will suffice for illustrative purposes. The cmn_err() function is the simplest way to display messages from the kernel.

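Notes:
8. Step 2: implement the system call function. To avoid modifying a Makefile, the author adds the new schrock call to getpid.c. The implementation is simple: it prints a caller-supplied string to the console.
9. The function declares a 1024-byte buffer, which is allocated on the kernel stack. Kernel stack space is very limited, so a buffer this large is bad practice; a stack overflow causes a system panic. In general, to avoid exhausting the kernel stack, keep the stack usage of both local variables and nested function calls in mind.
10. The OpenSolaris source shows that copyinstr() copies a null-terminated string from user space into kernel space. Its prototype is: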
```
int copyinstr(const char *uaddr, char *kaddr, size_t maxlength,
    size_t *lencopied);
```
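Here the first and second parameters are the source string in user space and the destination buffer in kernel space; the third is the size of the destination buffer; and the fourth receives the number of bytes actually copied.
11. cmn_err() plays a role similar to printk() on Linux: it writes kernel messages to the console.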
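
The handler above treats every copyinstr() failure as EFAULT and never looks at the length. The contract of a bounded string copy can be tried out in user space with the sketch below. It is only an analogue, not the kernel routine: demo_copystr() is a hypothetical name, and both the ENAMETOOLONG return for an over-long string and the NUL-inclusive length are assumptions of this sketch that should be checked against the kernel source before relying on them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * User-space analogue of a bounded string copy (NOT the kernel copyinstr()).
 * Copies at most maxlength bytes, including the terminating NUL.
 * Returns 0 on success, or ENAMETOOLONG if the string does not fit.
 * On success, *lencopied (if non-NULL) is the byte count, NUL included.
 */
static int
demo_copystr(const char *src, char *dst, size_t maxlength, size_t *lencopied)
{
	size_t i;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < maxlength; i++) {
		dst[i] = src[i];
		if (src[i] == '\0') {
			if (lencopied != NULL)
				*lencopied = i + 1;
			return (0);
		}
	}
	/* Ran out of room before reaching the terminator. */
	return (ENAMETOOLONG);
}
```

A caller that distinguishes the two outcomes can report "string too long" separately from a bad address, instead of folding everything into one error code.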

3. Adding an entry to the syscall table

We need to place an entry in the system call table. This table lives in sysent.c, and makes heavy use of macros to simplify the source. Our system call takes a single argument and returns an integer, so we'll need to use the SYSENT_CI macro. We need to add a prototype for our syscall, and add an entry to the sysent and sysent32 tables:
```
int     rename();
void    rexit();
int     schrock();
int     semsys();
int     setgid();

/* ... */

        /* 54 */ SYSENT_CI("ioctl",             ioctl,          3),
        /* 55 */ SYSENT_CI("uadmin",            uadmin,         3),
        /* 56 */ SYSENT_CI("schrock",		schrock,	1),
        /* 57 */ IF_LP64(
                        SYSENT_2CI("utssys",    utssys64,       4),
                        SYSENT_2CI("utssys",    utssys32,       4)),

/* ... */

        /* 54 */ SYSENT_CI("ioctl",             ioctl,          3),
        /* 55 */ SYSENT_CI("uadmin",            uadmin,         3),
        /* 56 */ SYSENT_CI("schrock",		schrock,	1),
        /* 57 */ SYSENT_2CI("utssys",           utssys32,       4),
```
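Notes:
12. Step 3: add an entry to the system call table. The table lives in sysent.c, which uses many macros to keep the source compact.
13. sysent and sysent32 hold the system call tables. The following comment can be found in sysent.c: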
```
/*
* This table is the switch used to transfer to the appropriate
* routine for processing a system call.  Each row contains the
* number of arguments expected, a switch that tells systrap()
* in trap.c whether a setjmp() is not necessary, and a pointer
* to the routine.
*/
```
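As the entries show, each row is written with the system call's name, a pointer to its handler function, and its argument count.
sysent32 is the table of 32-bit system calls that a 64-bit kernel uses.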
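
The idea of a table indexed by system call number can be sketched in user space as follows. This is a simplified illustration, not the kernel's struct sysent or its trap path: demo_sysent, demo_table, demo_dispatch() and the table size of 64 are all hypothetical, and the handler here just returns the length of its string argument so that the result is observable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified handler type: one pointer argument, int result. */
typedef int (*demo_handler_t)(void *arg);

struct demo_sysent {
	int		narg;		/* number of arguments expected */
	demo_handler_t	handler;	/* routine that implements the call */
};

/* Fallback for numbers that have no handler installed. */
static int
demo_nosys(void *arg)
{
	(void) arg;
	return (-1);
}

/* Stand-in handler: returns the length of the message it is given. */
static int
demo_schrock(void *arg)
{
	return ((int)strlen((const char *)arg));
}

#define	DEMO_NSYSCALL	64

/* Rows not listed are zero-filled, so their handler pointer is NULL. */
static const struct demo_sysent demo_table[DEMO_NSYSCALL] = {
	[56] = { 1, demo_schrock },
};

/* Look the number up in the table and call the matching routine. */
static int
demo_dispatch(int num, void *arg)
{
	if (num < 0 || num >= DEMO_NSYSCALL || demo_table[num].handler == NULL)
		return (demo_nosys(arg));
	return (demo_table[num].handler(arg));
}
```

Dispatching number 56 reaches demo_schrock(), while any number without an entry falls through to demo_nosys(). That is why a new call needs its own row at exactly the index chosen in syscall.h.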

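14. Because the new call returns a single value of type int, which is 32 bits wide in both the LP64 and ILP32 models, the SYSENT_CI macro is the right one. The relevant definitions are in sysent.c: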
```
/* returns a 64-bit quantity for both ABIs */
#define SYSENT_C(name, call, narg)      \
{ (narg), SE_64RVAL, NULL, NULL, (llfcn_t)(call) }

/* returns one 32-bit value for both ABIs: r_val1 */
#define SYSENT_CI(name, call, narg)     \
{ (narg), SE_32RVAL1, NULL, NULL, (llfcn_t)(call) }

/* returns 2 32-bit values: r_val1 & r_val2 */
#define SYSENT_2CI(name, call, narg)    \
{ (narg), SE_32RVAL1|SE_32RVAL2, NULL, NULL, (llfcn_t)(call) }
```
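As these show, the macro to use depends on the type and number of the system call's return values; this example needs SYSENT_CI.
SYSENT_CI takes three arguments: the name of the call as a string, the handler function (stored as an llfcn_t function pointer), and the number of arguments. So besides adding entries to the sysent and sysent32 tables, schrock() must also be declared.

4. /etc/name_to_sysnum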

At this point, we could write a program to invoke our system call, but the point here is to illustrate everything that needs to be done to integrate a system call, so we can't ignore the little things. One of these little things is /etc/name_to_sysnum, which provides a mapping between system call names and numbers, and is used by dtrace(1M), truss(1), and friends. Of course, there is one version for x86 and one for SPARC, so you will have to add the following line to both the Intel and SPARC versions:
```
ioctl                   54
uadmin                  55
schrock                 56
utssys                  57
fdsync                  58
```
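Notes:
15. Step 4: add the new system call's number to /etc/name_to_sysnum. The main work is already done at this point, and a program could invoke the new system call right away, but the tutorial sets out to cover every step needed to integrate a system call, so these details cannot be skipped.
16. /etc/name_to_sysnum provides the mapping between system call names and numbers for programs such as dtrace(1M) and truss(1). Both the Intel and SPARC versions of the file need to be changed.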
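
The file format shown above is just a name followed by a number on each line, so a consumer can resolve a name with very little code. The sketch below shows one way to do that; it is not how dtrace(1M) or truss(1) actually read the file, and demo_parse_line() and demo_lookup() are hypothetical helpers that work on lines already held in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define	DEMO_NAME_MAX	32

/* Parse one "name number" line; returns 1 on success, 0 otherwise. */
static int
demo_parse_line(const char *line, char *name, int *num)
{
	assert(line != NULL && name != NULL && num != NULL);
	/* %31s leaves room for the terminating NUL in a 32-byte buffer. */
	return (sscanf(line, "%31s %d", name, num) == 2);
}

/* Return the number mapped to `want`, or -1 if no line matches. */
static int
demo_lookup(const char *const lines[], size_t nlines, const char *want)
{
	char name[DEMO_NAME_MAX];
	int num;
	size_t i;

	for (i = 0; i < nlines; i++) {
		if (demo_parse_line(lines[i], name, &num) &&
		    strcmp(name, want) == 0)
			return (num);
	}
	return (-1);
}
```

If the schrock line is missing from the file, a lookup like this simply fails, which is the kind of gap that shows up later as a tool being unable to name the call.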

5. truss(1)

Truss does fancy decoding of system call arguments. In order to do this, we need to maintain a table in truss that describes the type of each argument for every syscall. This table is found in systable.c. Since our syscall takes a single string, we add the following entry:
```
{"ioctl",       3, DEC, NOV, DEC, IOC, IOA},                    /*  54 */
{"uadmin",      3, DEC, NOV, DEC, DEC, DEC},                    /*  55 */
{"schrock",     1, DEC, NOV, STG},                              /*  56 */
{"utssys",      4, DEC, NOV, HEX, DEC, UTS, HEX},               /*  57 */
{"fdsync",      2, DEC, NOV, DEC, FFG},                         /*  58 */
```
Don't worry too much about the different constants. But be sure to read up on the truss source code if you're adding a complicated system call.

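Notes:
17. Step 5: so that truss(1) can decode the arguments of the new system call, add a matching entry to the systable array in systable.c.
18. systable is a table maintained by truss(1) that describes, for each system call, the number of arguments and the output format of the return values and arguments. It is defined as follows: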
```
const struct systable systable[] = {
{ NULL,		8, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX},
{"_exit",	1, DEC, NOV, DEC},				/*   1 */
{"forkall",	0, DEC, NOV},					/*   2 */
{"read",	3, DEC, NOV, DEC, IOB, UNS},			/*   3 */
{"write",	3, DEC, NOV, DEC, IOB, UNS},			/*   4 */
{"open",	3, DEC, NOV, STG, OPN, OCT},			/*   5 */
..............
{"cladm",	3, DEC, NOV, CLC, CLF, HEX},			/* 253 */
{ NULL,		8, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX},
{"umount2",	2, DEC, NOV, STG, MTF},				/* 255 */
{ NULL, -1, DEC, NOV},
};
```
Each row describes one system call and corresponds to the systable structure, which is defined as follows:
```
struct systable {
const char *name;	/* name of system call */
short	nargs;		/* number of arguments */
char	rval[2];	/* return value types */
char	arg[8];		/* argument types */
};
```
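So in each row of systable, the first value is the call name, the second is the number of arguments, the third and fourth describe the return values, and the remaining values (up to eight) describe the arguments. With this description, truss(1) knows the format of every system call's arguments and return values and can print them correctly. The entry for the new system call is: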
```
{"schrock",     1, DEC, NOV, STG},                              /*  56 */
```
Combined with the definitions of DEC, NOV, and STG in print.h, this explains what the line means.
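
To make the role of the per-argument codes concrete, here is a small sketch of a formatter driven by such a row. It is not truss's code: the DEMO_* codes are hypothetical stand-ins (the real values are in print.h), union demo_arg and demo_format_call() are invented for illustration, and the output format is this sketch's own rather than truss's exact output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical formatting codes; the real ones are defined in print.h. */
enum { DEMO_NOV, DEMO_DEC, DEMO_HEX, DEMO_STG };

/* Same field layout as the systable structure quoted above. */
struct demo_systable {
	const char *name;	/* name of system call */
	short	nargs;		/* number of arguments */
	char	rval[2];	/* return value types */
	char	arg[8];		/* argument types */
};

static const struct demo_systable demo_schrock_entry =
	{ "schrock", 1, { DEMO_DEC, DEMO_NOV }, { DEMO_STG } };

/* An argument is either an integer or a pointer to a string. */
union demo_arg {
	long n;
	const char *s;
};

/* Render "name(arg, ...)" into out; returns 0, or -1 if it does not fit. */
static int
demo_format_call(const struct demo_systable *e, const union demo_arg *args,
    char *out, size_t outsz)
{
	size_t used;
	int i, n;

	n = snprintf(out, outsz, "%s(", e->name);
	if (n < 0 || (size_t)n >= outsz)
		return (-1);
	used = (size_t)n;
	for (i = 0; i < e->nargs; i++) {
		const char *sep = (i > 0) ? ", " : "";

		switch (e->arg[i]) {
		case DEMO_STG:	/* string argument, printed in quotes */
			n = snprintf(out + used, outsz - used, "%s\"%s\"",
			    sep, args[i].s);
			break;
		case DEMO_HEX:	/* integer printed in hexadecimal */
			n = snprintf(out + used, outsz - used, "%s0x%lX",
			    sep, (unsigned long)args[i].n);
			break;
		default:	/* everything else printed in decimal */
			n = snprintf(out + used, outsz - used, "%s%ld",
			    sep, args[i].n);
			break;
		}
		if (n < 0 || (size_t)n >= outsz - used)
			return (-1);
		used += (size_t)n;
	}
	n = snprintf(out + used, outsz - used, ")");
	if (n < 0 || (size_t)n >= outsz - used)
		return (-1);
	return (0);
}
```

Because the row marks the single argument as a string, the formatter prints the message itself rather than a raw pointer value. Without the row, a tool would have no way to know the argument is a string.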

6. proc_names.c

This is the file that gets missed the most often when adding a new syscall. Libproc uses the table in proc_names.c to translate between system call numbers and names. Why it doesn't make use of /etc/name_to_sysnum is anybody's guess, but for now you have to update the systable array in this file:
```
        "ioctl",                /* 54 */
        "uadmin",               /* 55 */
        "schrock",              /* 56 */
        "utssys",               /* 57 */
        "fdsync",               /* 58 */
```
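Notes:
19. Step 6: so that libproc recognizes the new system call, add the corresponding line to proc_names.c. This step is the one most often forgotten. As for why libproc defines its own mapping between system call names and numbers instead of using /etc/name_to_sysnum, probably only its author knows.
20. libproc is a set of interfaces that Solaris provides for accessing the proc file system; the commands described in proc(1) use them. The interfaces live in the libproc.so shared library. For the proc file system itself, see proc(4).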
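
A name table of this kind is a plain array of strings indexed by system call number. The sketch below shows what a lookup against it might look like and what happens when an entry is missing. It is an illustration only: demo_names and demo_sysname() are hypothetical, and the "SYS#n" placeholder for unknown numbers is an assumption of this sketch rather than documented libproc behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical excerpt of a number-to-name table; gaps are NULL. */
static const char *const demo_names[] = {
	[54] = "ioctl",
	[55] = "uadmin",
	[56] = "schrock",
	[57] = "utssys",
};

#define	DEMO_NNAMES	(sizeof (demo_names) / sizeof (demo_names[0]))

/* Write the name for `num` into buf, or a numeric placeholder if unknown. */
static char *
demo_sysname(int num, char *buf, size_t bufsz)
{
	assert(buf != NULL && bufsz > 0);
	if (num >= 0 && (size_t)num < DEMO_NNAMES && demo_names[num] != NULL)
		(void) snprintf(buf, bufsz, "%s", demo_names[num]);
	else
		(void) snprintf(buf, bufsz, "SYS#%d", num);
	return (buf);
}
```

If the "schrock" row were left out, a lookup for 56 would produce only the numeric placeholder, which is roughly the symptom to expect when this file is forgotten.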

7. Putting it all together

Finally, everything is in place. We can test our system call with a simple program:
```
#include <sys/syscall.h>

int
main(int argc, char **argv)
{
	syscall(SYS_schrock, "OpenSolaris Rules!");
	return (0);
}
```
If we run this on our system, we'll see the following output on the console:
```
June 14 13:42:21 halcyon genunix: WARNING: OpenSolaris Rules!
```
Because we did all the extra work, we can actually observe the behavior using truss(1), mdb(1), or dtrace(1M). As you can see, adding a system call is not as easy as it should be. One of the ideas that has been floating around for a while is the Grand Unified Syscall(tm) project, which would centralize all this information as well as provide type information for the DTrace syscall provider. But until that happens, we'll have to deal with this process.

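Notes:
21. Finally, write a small program to test the new system call. The article skips an important and fairly involved step here: rebuilding the OpenSolaris kernel and then installing or updating OpenSolaris so that the new call becomes available. Because every required change was made, the new system call can not only be invoked but also observed with the OpenSolaris debugging tools, such as truss(1), mdb(1), and dtrace(1M).
22. At the end of the article, the author mentions a possible future improvement to OpenSolaris: centralizing all system call definitions and providing system call type information to the DTrace syscall provider. Until that work is done, adding a new system call means walking through the whole process described here.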

Source: https://www.cnblogs.com/ainima/p/6330828.html
