• ARM Linux System Calls: A Detailed Walkthrough


     

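    On Linux, most of the interface between user-mode processes and hardware devices is implemented through system calls issued to the kernel.

    System calls are services provided by the operating system. User programs invoke them to use the services the kernel offers. Executing a system call traps the program into the kernel, and on ARM that trap is performed by the swi software interrupt instruction.

    1 Users can invoke system calls in two ways:

    The first way is through C library functions, which include both the library's wrapper functions for system calls and other ordinary functions.

    The second way is to use the _syscall macros. Kernels up to 2.6.18 define seven _syscall macros in include/asm-i386/unistd.h: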

    _syscall0(type,name)  
    _syscall1(type,name,type1,arg1)  
    _syscall2(type,name,type1,arg1,type2,arg2)  
    _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3)  
    _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4)  
    _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,type5,arg5)  
    _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,type5,arg5,type6,arg6) 

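    Here type is the return type of the generated system call function, name is the name of the system call, and typeN and argN are the type and name of the Nth parameter. The number of such pairs equals the digit that follows _syscall.

    These macros create a function called name; the digit after _syscall gives the number of parameters the function takes.

    For example, the sysinfo system call, which retrieves overall system statistics, is defined with the _syscall macro as: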

    _syscall1(int, sysinfo, struct sysinfo *, info); 

    After expansion it becomes:

    int sysinfo(struct sysinfo * info)  
    {
      long __res;
      __asm__ volatile("int $0x80" : "=a" (__res) : "0" (116),"b" ((long)(info)));

      do {
        if ((unsigned long)(__res) >= (unsigned long)(-(128 + 1)))
        {
          errno = -(__res);
          __res = -1;
        }

        return (int) (__res);
      } while (0);
    }

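    As you can see, _syscall1(int, sysinfo, struct sysinfo *, info) expands into a function named sysinfo. The macro argument int becomes the return type, and the arguments struct sysinfo * and info form the function's single parameter.

    Once a program file has defined the system call it needs with a _syscall macro, the code that follows can invoke that system call directly by name. Below is an example that uses the sysinfo system call.

    Listing 5.1  Example using the sysinfo system call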

    #include <stdio.h>              /* for printf */
    #include <stdlib.h>
    #include <errno.h>
    #include <linux/unistd.h>       /* for the _syscall macros */
    #include <linux/kernel.h>       /* for struct sysinfo */

    _syscall1(int, sysinfo, struct sysinfo *, info);

    int main(void)
    {
      struct sysinfo s_info;
      int error;

      error = sysinfo(&s_info);
      printf("code error = %d\n", error);
      printf("Uptime = %lds\nLoad: 1 min %lu / 5 min %lu / 15 min %lu\n"
             "RAM: total %lu / free %lu / shared %lu\n"
             "Memory in buffers = %lu\nSwap: total %lu / free %lu\n"
             "Number of processes = %d\n",
             s_info.uptime, s_info.loads[0], s_info.loads[1], s_info.loads[2],
             s_info.totalram, s_info.freeram, s_info.sharedram,
             s_info.bufferram, s_info.totalswap, s_info.freeswap,
             s_info.procs);
      exit(EXIT_SUCCESS);
    }

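    Starting with 2.6.19, however, the _syscall macros were removed. Instead we use the syscall function, which invokes a system call given its number and a list of arguments.

    The prototype of syscall is: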

    int syscall(int number, ...); 

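    Here number is the system call number, followed in order by all the arguments of that system call. Below is an example that invokes the gettid system call.

    Listing 5.2  Example using the gettid system call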

    #include <unistd.h> 
    #include <sys/syscall.h> 
    #include <sys/types.h> 
    
    #define __NR_gettid      224  
    
    int main(int argc, char *argv[])  
    {       
        pid_t tid;  
      
        tid = syscall(__NR_gettid);  
    }

    Most system calls have a SYS_ symbolic constant that maps to their system call number, so the syscall line in the listing above can be rewritten as:

    tid = syscall(SYS_gettid);  
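
    As a slightly fuller sketch of the same call, the function below wraps syscall(SYS_gettid) and checks the result. The helper name demo_gettid is made up for this illustration. SYS_gettid comes from <sys/syscall.h>, so no hand-written __NR_gettid definition is needed. The sketch assumes a Linux system with glibc or a compatible C library.

```c
#define _GNU_SOURCE          /* makes unistd.h declare syscall() */
#include <assert.h>
#include <sys/syscall.h>     /* SYS_gettid */
#include <sys/types.h>       /* pid_t */
#include <unistd.h>          /* syscall() */

/* Hypothetical helper: return the caller's thread ID via the raw system call. */
static pid_t demo_gettid(void)
{
    long ret = syscall(SYS_gettid);

    /* gettid always succeeds, so a non-positive value would mean a broken setup. */
    assert(ret > 0);
    return (pid_t)ret;
}
```

    In a single-threaded process the value returned by demo_gettid() is the same as the one returned by getpid().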

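    2 How system calls differ from application programming interfaces (APIs)

    An application programming interface (API) differs from a system call in that the former is only a function definition that specifies how to obtain a given service, while the latter is an explicit request to the kernel made through a software interrupt. The POSIX standard addresses APIs, not system calls. Unix systems provide programmers with many API library functions. Some of the APIs defined by the libc standard C library refer to wrapper routines, whose only purpose is to issue a system call. Usually each system call has a corresponding wrapper routine, and the wrapper routine defines the API that applications use. The converse does not hold: an API does not have to correspond to any particular system call. From the programmer's point of view the distinction between an API and a system call is irrelevant; the only things that matter are the function name, the parameter types, and the meaning of the return code. From the kernel designer's point of view the distinction does matter, because system calls belong to the kernel while user-mode library functions do not.

    Most wrapper routines return an integer whose meaning depends on the corresponding system call. A return value of -1 usually means that the kernel could not satisfy the process's request. A failure in the system call handler may be caused by invalid parameters, a lack of available resources, a hardware problem, and so on. The errno variable, defined in the libc library, holds the specific error code. Each error code is defined as a constant macro.

    When a user-mode process invokes a system call, the CPU switches to kernel mode and starts executing a kernel function. Because the kernel implements many different system calls, the process must pass a parameter called the system call number to identify the one it wants. Every system call returns an integer value. The conventions for these return values differ from those for wrapper routines. In the kernel, a positive value or 0 means the system call completed successfully, while a negative value indicates an error condition. In the latter case the value is the negated error code that the wrapper must hand back to the application in the errno variable.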
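
    The two return conventions can be put side by side in a few lines of C. This is a minimal sketch, not the code of any particular C library: the helper names are hypothetical, and the bound of 4095 is an assumption reflecting the range Linux reserves for error codes (the old i386 macro quoted earlier used 129).

```c
#include <assert.h>
#include <errno.h>

/* Assumed bound: the top 4095 values of the unsigned range encode errors. */
#define DEMO_MAX_ERRNO 4095UL

/* Hypothetical helper: turn a raw kernel return value into the
 * "-1 and errno" convention used by C library wrappers. */
static long demo_wrap_result(long raw)
{
    if ((unsigned long)raw >= (unsigned long)-DEMO_MAX_ERRNO) {
        errno = (int)-raw;   /* kernel returned -E..., store +E... */
        return -1;
    }
    return raw;              /* zero or positive: success */
}

/* Example: -ENOENT from the kernel becomes -1 with errno == ENOENT,
 * while a successful result such as a file descriptor passes through. */
static void demo_wrap_example(void)
{
    long r = demo_wrap_result(-ENOENT);

    assert(r == -1 && errno == ENOENT);
    assert(demo_wrap_result(3) == 3);
}
```

    The unsigned comparison is what lets a single test separate small negative error codes from every valid result, including large values that would look negative if compared as signed numbers.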

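    3 How a system call executes

    ARM Linux uses the SWI instruction to enter kernel space from user space, so let us look at that instruction first. SWI generates a software interrupt: the processor switches from User mode to Supervisor mode, saves CPSR into the Supervisor-mode SPSR, and branches to the SWI vector. SWI can also be used in other modes, in which case the processor switches to Supervisor mode in the same way. The instruction format is: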

    SWI{cond} immed_24

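    where:

    immed_24  a 24-bit immediate, an integer in the range 0 to 16777215.

    When SWI is used, parameters are usually passed in one of the following two ways so that the SWI exception handler can provide the requested service. Both are software conventions. To obtain the 24-bit immediate, the SWI exception handler has to read the SWI instruction that caused the software interrupt.

    1) The 24-bit immediate in the instruction specifies the type of service requested, and the parameters are passed in general-purpose registers. For example: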

    MOV R0,#34
    SWI 12

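    2) The 24-bit immediate in the instruction is ignored. The type of service requested is determined by the value of register R0, and the parameters are passed in other general-purpose registers. For example: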

    MOV R0, #12
    MOV R1, #34
    SWI 0

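    In the SWI exception handler, the immediate is extracted as follows. First determine whether the SWI instruction that caused the software interrupt is an ARM instruction or a Thumb instruction, which can be found by examining SPSR. Then obtain the address of the SWI instruction, which can be derived from the LR register. Finally read the instruction and extract the immediate (the low 24 bits of an ARM-state instruction).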
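
    The last step, decoding the instruction word, can be sketched in C. The function names are hypothetical and the sketch covers only ARM-state instructions. In a real handler the word would be read from the address LR - 4; here it is simply passed in as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Return nonzero if the 32-bit ARM instruction word is a SWI
 * (bits 27..24 are all set). */
static int demo_is_arm_swi(uint32_t insn)
{
    return (insn & 0x0f000000u) == 0x0f000000u;
}

/* Extract the 24-bit immediate from an ARM-state SWI instruction word. */
static uint32_t demo_swi_immediate(uint32_t insn)
{
    assert(demo_is_arm_swi(insn));
    return insn & 0x00ffffffu;
}

/* Example: "SWI 12" assembled with the AL condition is 0xEF00000C. */
static void demo_swi_example(void)
{
    assert(demo_swi_immediate(0xEF00000Cu) == 12);
}
```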

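    Entering a system call from user space

    Normally the code we write invokes system calls through the C library's wrappers. Taking open in uClibc 0.9.30 as an example, let us trace step by step how this wrapper ends up issuing the system call. include/fcntl.h contains the definition: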

    # define open open64

    So open is really just an alias for open64.

    In libc/sysdeps/linux/common/open64.c we find:

    extern __typeof(open64) __libc_open64;
    extern __typeof(open) __libc_open;

    So open64 in turn is only an alias for __libc_open64, which is defined in the same file:

    libc_hidden_proto(__libc_open64)
    int __libc_open64 (const char *file, int oflag, ...)
    {
        mode_t mode = 0;
    
        if (oflag & O_CREAT)
        {
           va_list arg;
           va_start (arg, oflag);
           mode = va_arg (arg, mode_t);
           va_end (arg);
        }
     
        return __libc_open(file, oflag | O_LARGEFILE, mode);
    }
    libc_hidden_def(__libc_open64)

    In the end __libc_open64 calls __libc_open, which is defined in libc/sysdeps/linux/common/open.c:

    libc_hidden_proto(__libc_open)
    int __libc_open(const char *file, int oflag, ...)
    {
       mode_t mode = 0;
    
       if (oflag & O_CREAT) {
          va_list arg;
          va_start (arg, oflag);
          mode = va_arg (arg, mode_t);
          va_end (arg);
       }
    
       return __syscall_open(file, oflag, mode);
    }
    libc_hidden_def(__libc_open)

    __syscall_open is defined in the same file:

    static __inline__ _syscall3(int, __syscall_open, const char *, file, int, flags, __kernel_mode_t, mode)

    In libc/sysdeps/linux/arm/bits/syscalls.h we find:

    #undef _syscall3
    #define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \
    type name(type1 arg1,type2 arg2,type3 arg3) \
    { \
    return (type) (INLINE_SYSCALL(name, 3, arg1, arg2, arg3)); \
    }

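    This macro defines a function. Its first argument is the function's return type, the second is the function name, and the remaining arguments, as their names suggest, are the types and names of the function's parameters. __syscall_open is therefore: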

    int __syscall_open (const char * file,int flags, __kernel_mode_t mode)
    {
        return (int) (INLINE_SYSCALL(__syscall_open, 3, file, flags, mode));
    }

    INLINE_SYSCALL is a macro defined in the same file:

    #undef INLINE_SYSCALL
    #define INLINE_SYSCALL(name, nr, args...)                \
      ({ unsigned int _inline_sys_result = INTERNAL_SYSCALL (name, , nr, args);  \
         if (__builtin_expect (INTERNAL_SYSCALL_ERROR_P (_inline_sys_result, ), 0))  \
           {                                                 \
             __set_errno (INTERNAL_SYSCALL_ERRNO (_inline_sys_result, ));    \
             _inline_sys_result = (unsigned int) -1;         \
           }                                                 \
         (int) _inline_sys_result; })

    #undef INTERNAL_SYSCALL
    #if !defined(__thumb__)
    #if defined(__ARM_EABI__)
    #define INTERNAL_SYSCALL(name, err, nr, args...)         \
      ({unsigned int __sys_result;                           \
         {                                                   \
           register int _a1 __asm__ ("r0"), _nr __asm__ ("r7");    \
           LOAD_ARGS_##nr (args)                             \
           _nr = SYS_ify(name);                              \
           __asm__ __volatile__ ("swi  0x0   @ syscall " #name  \
                  : "=r" (_a1)                               \
                  : "r" (_nr) ASM_ARGS_##nr                  \
                  : "memory");                               \
           __sys_result = _a1;                               \
         }                                                   \
         (int) __sys_result; })
    #else /* defined(__ARM_EABI__) */

    #define INTERNAL_SYSCALL(name, err, nr, args...)         \
      ({ unsigned int __sys_result;                          \
         {                                                   \
           register int _a1 __asm__ ("a1");                  \
           LOAD_ARGS_##nr (args)                             \
           __asm__ __volatile__ ("swi  %1 @ syscall " #name  \
               : "=r" (_a1)                                  \
               : "i" (SYS_ify(name)) ASM_ARGS_##nr           \
               : "memory");                                  \
           __sys_result = _a1;                               \
         }                                                   \
         (int) __sys_result; })
    #endif
    #else /* !defined(__thumb__) */
    /* We can't use push/pop inside the asm because that breaks
       unwinding (ie. thread cancellation).
     */
    #define INTERNAL_SYSCALL(name, err, nr, args...)         \
      ({ unsigned int __sys_result;                          \
        {                                                    \
          int _sys_buf[2];                                   \
          register int _a1 __asm__ ("a1");                   \
          register int *_v3 __asm__ ("v3") = _sys_buf;       \
          *_v3 = (int) (SYS_ify(name));                      \
          LOAD_ARGS_##nr (args)                              \
          __asm__ __volatile__ ("str   r7, [v3, #4]\n"       \
              "\tldr   r7, [v3]\n"                           \
              "\tswi   0  @ syscall " #name "\n"             \
              "\tldr   r7, [v3, #4]"                         \
              : "=r" (_a1)                                   \
              : "r" (_v3) ASM_ARGS_##nr                      \
              : "memory");                                   \
          __sys_result = _a1;                                \
        }                                                    \
        (int) __sys_result; })
    #endif /*!defined(__thumb__)*/

    The LOAD_ARGS macros from the same file are reproduced here as well:

    #define LOAD_ARGS_0()
    #define ASM_ARGS_0
    #define LOAD_ARGS_1(a1)                \
      _a1 = (int) (a1);                    \
      LOAD_ARGS_0 ()
    #define ASM_ARGS_1 ASM_ARGS_0, "r" (_a1)
    #define LOAD_ARGS_2(a1, a2)            \
      register int _a2 __asm__ ("a2") = (int) (a2);    \
      LOAD_ARGS_1 (a1)
    #define ASM_ARGS_2 ASM_ARGS_1, "r" (_a2)
    #define LOAD_ARGS_3(a1, a2, a3)        \
      register int _a3 __asm__ ("a3") = (int) (a3);    \
      LOAD_ARGS_2 (a1, a2)

    These macros load each argument into its register. The SYS_ify macro yields the system call number:

    #define SYS_ify(syscall_name)  (__NR_##syscall_name)

    That gives __NR___syscall_open, whose definition can be found in libc/sysdeps/linux/common/open.c:

    #define __NR___syscall_open __NR_open

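    __NR_open is defined in the kernel headers. In the EABI case the system call number is placed in register r7, while the arguments are passed in much the same way as for an ordinary function call.

    Note the EABI conditional here. What is EABI? ABI stands for Application Binary Interface. Under the newer EABI specification the system call number is placed in register r7, whereas under the old OABI it is carried as the SWI immediate; that is, the original calling convention (Old ABI) passes the call number inside the swi instruction itself. The system call numbers used by the two conventions also differ, as can be seen in the kernel file arch/arm/include/asm/unistd.h: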

    #define __NR_OABI_SYSCALL_BASE 0x900000

    #if defined(__thumb__) || defined(__ARM_EABI__)
    #define __NR_SYSCALL_BASE 0
    #else
    #define __NR_SYSCALL_BASE __NR_OABI_SYSCALL_BASE
    #endif

    /*
     * This file contains the system call numbers.
     */

    #define __NR_restart_syscall (__NR_SYSCALL_BASE+ 0)
    #define __NR_exit (__NR_SYSCALL_BASE+ 1)
    #define __NR_fork (__NR_SYSCALL_BASE+ 2)
    #define __NR_read (__NR_SYSCALL_BASE+ 3)
    #define __NR_write (__NR_SYSCALL_BASE+ 4)
    #define __NR_open (__NR_SYSCALL_BASE+ 5)
    ……
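
    To make the difference concrete, the sketch below shows what user space hands to the kernel for open under each convention. All names prefixed with demo_ or DEMO_ are hypothetical. The instruction words assume ARM state and the AL condition code, so a SWI instruction is 0xEF000000 plus its immediate.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_OABI_SYSCALL_BASE 0x900000u  /* same value as __NR_OABI_SYSCALL_BASE */
#define DEMO_NR_OPEN           5u         /* offset of open in the list above */

/* EABI: the instruction is always "swi 0" and the number travels in r7. */
struct demo_eabi_call {
    uint32_t swi_word;
    uint32_t r7;
};

/* OABI: the full number (base + offset) is encoded in the SWI immediate. */
static uint32_t demo_oabi_swi_word(uint32_t offset)
{
    assert(DEMO_OABI_SYSCALL_BASE + offset <= 0x00ffffffu); /* must fit in 24 bits */
    return 0xEF000000u | (DEMO_OABI_SYSCALL_BASE + offset);
}

static struct demo_eabi_call demo_make_eabi_call(uint32_t offset)
{
    struct demo_eabi_call c = { 0xEF000000u, offset };
    return c;
}

/* Example: open is "swi 0x900005" under OABI, and "swi 0" with r7 = 5 under EABI. */
static void demo_abi_example(void)
{
    struct demo_eabi_call c = demo_make_eabi_call(DEMO_NR_OPEN);

    assert(demo_oabi_swi_word(DEMO_NR_OPEN) == 0xEF900005u);
    assert(c.swi_word == 0xEF000000u && c.r7 == 5u);
}
```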

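    Next, let us look at how the operating system handles the system call. We return to the ARM Linux exception vector table, because when swi executes, the processor fetches the handler address from the vector table and jumps to the corresponding routine. In arch/arm/kernel/entry-armv.S: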

    /*
     * We group all the following data together to optimise
     * for CPUs with separate I & D caches.
     */
        .align    5
    
    .LCvswi:
        .word    vector_swi
    
        .globl    __stubs_end
    __stubs_end:
    
        .equ    stubs_offset, __vectors_start + 0x200 - __stubs_start
    
        .globl    __vectors_start
    __vectors_start:
     ARM(    swi    SYS_ERROR0    )
     THUMB(    svc    #0        )
     THUMB(    nop            )
        W(b)    vector_und + stubs_offset
        W(ldr)    pc, .LCvswi + stubs_offset
        W(b)    vector_pabt + stubs_offset
        W(b)    vector_dabt + stubs_offset
        W(b)    vector_addrexcptn + stubs_offset
        W(b)    vector_irq + stubs_offset
        W(b)    vector_fiq + stubs_offset
    
        .globl    __vectors_end
    __vectors_end:

    .LCvswi is defined in the same file as:

    .LCvswi:
       .word vector_swi

    So the routine vector_swi is what finally handles the system call. Let us look at vector_swi, defined in arch/arm/kernel/entry-common.S:

    /*=============================================================================
     * SWI handler
     *-----------------------------------------------------------------------------
     */
    
        /* If we're optimising for StrongARM the resulting code won't 
           run on an ARM7 and we can save a couple of instructions.  
                                    --pb */
    #ifdef CONFIG_CPU_ARM710
    #define A710(code...) code
    .Larm710bug:
        ldmia    sp, {r0 - lr}^            @ Get calling r0 - lr
        mov    r0, r0
        add    sp, sp, #S_FRAME_SIZE
        subs    pc, lr, #4
    #else
    #define A710(code...)
    #endif
    
        .align    5
    ENTRY(vector_swi)
        sub    sp, sp, #S_FRAME_SIZE
        stmia    sp, {r0 - r12}            @ Calling r0 - r12
     ARM(    add    r8, sp, #S_PC        )
     ARM(    stmdb    r8, {sp, lr}^        )    @ Calling sp, lr
     THUMB(    mov    r8, sp            )
     THUMB(    store_user_sp_lr r8, r10, S_SP    )    @ calling sp, lr
        mrs    r8, spsr            @ called from non-FIQ mode, so ok.
        str    lr, [sp, #S_PC]            @ Save calling PC
        str    r8, [sp, #S_PSR]        @ Save CPSR
        str    r0, [sp, #S_OLD_R0]        @ Save OLD_R0
        zero_fp
    
        /*
         * Get the system call number.
         */
    
    #if defined(CONFIG_OABI_COMPAT)
    
        /*
         * If we have CONFIG_OABI_COMPAT then we need to look at the swi
         * value to determine if it is an EABI or an old ABI call.
         */
    #ifdef CONFIG_ARM_THUMB
        tst    r8, #PSR_T_BIT
        movne    r10, #0                @ no thumb OABI emulation
        ldreq    r10, [lr, #-4]            @ get SWI instruction
    #else
        ldr    r10, [lr, #-4]            @ get SWI instruction
      A710(    and    ip, r10, #0x0f000000        @ check for SWI        )
      A710(    teq    ip, #0x0f000000                        )
      A710(    bne    .Larm710bug                        )
    #endif
    #ifdef CONFIG_CPU_ENDIAN_BE8
        rev    r10, r10            @ little endian instruction
    #endif
    
    #elif defined(CONFIG_AEABI)
    
        /*
         * Pure EABI user space always put syscall number into scno (r7).
         */
      A710(    ldr    ip, [lr, #-4]            @ get SWI instruction    )
      A710(    and    ip, ip, #0x0f000000        @ check for SWI        )
      A710(    teq    ip, #0x0f000000                        )
      A710(    bne    .Larm710bug                        )
    
    #elif defined(CONFIG_ARM_THUMB)
    
        /* Legacy ABI only, possibly thumb mode. */
        tst    r8, #PSR_T_BIT            @ this is SPSR from save_user_regs
        addne    scno, r7, #__NR_SYSCALL_BASE    @ put OS number in
        ldreq    scno, [lr, #-4]
    
    #else
    
        /* Legacy ABI only. */
        ldr    scno, [lr, #-4]            @ get SWI instruction
      A710(    and    ip, scno, #0x0f000000        @ check for SWI        )
      A710(    teq    ip, #0x0f000000                        )
      A710(    bne    .Larm710bug                        )
    
    #endif
    
    #ifdef CONFIG_ALIGNMENT_TRAP
        ldr    ip, __cr_alignment
        ldr    ip, [ip]
        mcr    p15, 0, ip, c1, c0        @ update control register
    #endif
        enable_irq

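        // tsk is an alias for register r9, defined in arch/arm/kernel/entry-header.S:
        // tsk .req   r9     @current thread_info

        // Get the base address of the thread_info structure.

        get_thread_info tsk

        // tbl is an alias for register r8, defined in arch/arm/kernel/entry-header.S:

        // tbl  .req   r8     @syscall table pointer,

        // It holds the pointer to the system call table, which is used further below.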

        adr    tbl, sys_call_table        @ load syscall table pointer
    
    #if defined(CONFIG_OABI_COMPAT)
        /*
         * If the swi argument is zero, this is an EABI call and we do nothing.
         *
         * If this is an old ABI call, get the syscall number into scno and
         * get the old ABI syscall table address.
         */
        bics    r10, r10, #0xff000000
        eorne    scno, r10, #__NR_OABI_SYSCALL_BASE
        ldrne    tbl, =sys_oabi_call_table
    #elif !defined(CONFIG_AEABI)
        // scno is an alias for register r7
        bic    scno, scno, #0xff000000        @ mask off SWI op-code
        eor    scno, scno, #__NR_SYSCALL_BASE    @ check OS number
    #endif

        ldr    r10, [tsk, #TI_FLAGS]        @ check for syscall tracing
        stmdb    sp!, {r4, r5}            @ push fifth and sixth args

    #ifdef CONFIG_SECCOMP
        tst    r10, #_TIF_SECCOMP
        beq    1f
        mov    r0, scno
        bl    __secure_computing
        add    r0, sp, #S_R0 + S_OFF        @ pointer to regs
        ldmia    r0, {r0 - r3}            @ have to reload r0 - r3
    1:
    #endif

        tst    r10, #_TIF_SYSCALL_TRACE        @ are we tracing syscalls?
        bne    __sys_trace

        cmp    scno, #NR_syscalls        @ check upper syscall limit
        adr    lr, BSYM(ret_fast_syscall)    @ return address
        ldrcc    pc, [tbl, scno, lsl #2]        @ call sys_* routine

        add    r1, sp, #S_OFF

        // why is also an alias for register r8
    2:    mov    why, #0                @ no longer a real syscall

        cmp    scno, #(__ARM_NR_BASE - __NR_SYSCALL_BASE)
        eor    r0, scno, #__NR_SYSCALL_BASE    @ put OS number back
        bcs    arm_syscall    
        b    sys_ni_syscall            @ not private func
    ENDPROC(vector_swi)
    
        /*
         * This is the really slow path.  We're going to be doing
         * context switches, and waiting for our parent to respond.
         */
    __sys_trace:
        mov    r2, scno
        add    r1, sp, #S_OFF
        mov    r0, #0                @ trace entry [IP = 0]
        bl    syscall_trace
    
        adr    lr, BSYM(__sys_trace_return)    @ return address
        mov    scno, r0            @ syscall number (possibly new)
        add    r1, sp, #S_R0 + S_OFF        @ pointer to regs
        cmp    scno, #NR_syscalls        @ check upper syscall limit
        ldmccia    r1, {r0 - r3}            @ have to reload r0 - r3
        ldrcc    pc, [tbl, scno, lsl #2]        @ call sys_* routine
        b    2b
    
    __sys_trace_return:
        str    r0, [sp, #S_R0 + S_OFF]!    @ save returned r0
        mov    r2, scno
        mov    r1, sp
        mov    r0, #1                @ trace exit [IP = 1]
        bl    syscall_trace
        b    ret_slow_syscall
    
        .align    5
    #ifdef CONFIG_ALIGNMENT_TRAP
        .type    __cr_alignment, #object
    __cr_alignment:
        .word    cr_alignment
    #endif
        .ltorg
    
    /*
     * This is the syscall table declaration for native ABI syscalls.
     * With EABI a couple syscalls are obsolete and defined as sys_ni_syscall.
     */
    #define ABI(native, compat) native
    #ifdef CONFIG_AEABI
    #define OBSOLETE(syscall) sys_ni_syscall
    #else
    #define OBSOLETE(syscall) syscall
    #endif
    
        .type    sys_call_table, #object
    ENTRY(sys_call_table)
    #include "calls.S"
    #undef ABI
    #undef OBSOLETE

    zero_fp above is a macro defined in arch/arm/kernel/entry-header.S:

        .macro zero_fp
    #ifdef CONFIG_FRAME_POINTER
        mov   fp, #0
    #endif
        .endm

    // fp is register r11.

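        Like every exception handler, the first thing it does is save the context. Immediately after that it obtains the system call number.

        It then uses the system call number as an index into the system call table. If the number is valid, the corresponding handler is called through the instruction ldrcc  pc, [tbl, scno, lsl #2] shown above, and the call returns through the routine ret_fast_syscall.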
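
        The bounds check followed by an indexed jump can be modelled in C with a table of function pointers. This is only an illustrative sketch: the table contents and every demo_ name are invented, and the out-of-range path is reduced to a single "not implemented" function, whereas the kernel code above distinguishes ARM-private numbers from the rest.

```c
#include <assert.h>
#include <errno.h>

typedef long (*demo_syscall_fn)(long a0, long a1, long a2);

/* Stand-in for an unimplemented call: report -ENOSYS. */
static long demo_sys_ni(long a0, long a1, long a2)
{
    (void)a0; (void)a1; (void)a2;
    return -ENOSYS;
}

/* Invented example call: add the first two arguments. */
static long demo_sys_add(long a0, long a1, long a2)
{
    (void)a2;
    return a0 + a1;
}

static const demo_syscall_fn demo_table[] = { demo_sys_ni, demo_sys_add };
#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/* Mirrors "cmp scno, #NR_syscalls" followed by "ldrcc pc, [tbl, scno, lsl #2]":
 * the table is indexed only when the number is below the limit. */
static long demo_dispatch(unsigned long scno, long a0, long a1, long a2)
{
    if (scno < DEMO_NR_SYSCALLS)
        return demo_table[scno](a0, a1, a2);
    return demo_sys_ni(a0, a1, a2);   /* simplified out-of-range path */
}

static void demo_dispatch_example(void)
{
    assert(demo_dispatch(1, 2, 3, 0) == 5);
    assert(demo_dispatch(7, 2, 3, 0) == -ENOSYS);
}
```

        The "lsl #2" in the assembly multiplies the index by 4, the size of one table entry on 32-bit ARM; in C the array subscript performs the same scaling.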

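        Here we continue the discussion of the ABI. First consider two configuration macros: CONFIG_OABI_COMPAT, which means compatibility with the old ABI, and CONFIG_AEABI, which selects EABI as the current convention. As far as the code is concerned, both may be defined, neither may be defined, or either one alone. Let us see how the kernel handles this. sys_call_table is a jump table in the kernel that stores a series of function pointers, each pointing to a system call function such as sys_open. A system call is carried out by using the system call number (normally the index into the table) to find which kernel function to call and then running that function.

    First, for the old ABI the kernel builds a separate system call table, sys_oabi_call_table. In compatibility mode there are therefore two system call tables: old ABI system calls run the functions in sys_oabi_call_table, and EABI system calls use the function pointers in sys_call_table.
    There are four possible configurations:
    First, both macros are configured: the behavior is as described above.
    Second, only CONFIG_OABI_COMPAT is configured: calls made the old ABI way use sys_oabi_call_table and calls made the EABI way use sys_call_table, which is essentially the same as the first case; the first case is just more explicit.
    Third, only CONFIG_AEABI is configured: sys_oabi_call_table does not exist and old ABI calls are not supported. Only EABI calls are possible, using sys_call_table.

    Fourth, neither is configured: the system allows only old ABI calls by default, but sys_oabi_call_table does not exist, so the calls are ultimately dispatched through sys_call_table.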
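
    For the compatibility case, the decision made by the bics / eorne / ldrne sequence in vector_swi can be modelled as follows. This is a hedged sketch with hypothetical names; it takes the trapping ARM instruction word and the user's r7 as inputs and reports which table and which number would be used.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_OABI_SYSCALL_BASE 0x900000u  /* same value as __NR_OABI_SYSCALL_BASE */

enum demo_table { DEMO_SYS_CALL_TABLE, DEMO_SYS_OABI_CALL_TABLE };

struct demo_choice {
    enum demo_table table;   /* which table the call is dispatched through */
    uint32_t scno;           /* index into that table */
};

/* Model of the CONFIG_OABI_COMPAT path. */
static struct demo_choice demo_select(uint32_t swi_word, uint32_t r7)
{
    struct demo_choice c = { DEMO_SYS_CALL_TABLE, r7 };   /* EABI default */
    uint32_t imm = swi_word & ~0xff000000u;      /* bics  r10, r10, #0xff000000 */

    if (imm != 0) {                              /* non-zero immediate: old ABI */
        c.scno = imm ^ DEMO_OABI_SYSCALL_BASE;   /* eorne scno, r10, #base */
        c.table = DEMO_SYS_OABI_CALL_TABLE;      /* ldrne tbl, =sys_oabi_call_table */
    }
    return c;
}

static void demo_select_example(void)
{
    struct demo_choice oabi = demo_select(0xEF900005u, 0);  /* swi 0x900005 */
    struct demo_choice eabi = demo_select(0xEF000000u, 5);  /* swi 0, r7 = 5 */

    assert(oabi.table == DEMO_SYS_OABI_CALL_TABLE && oabi.scno == 5);
    assert(eabi.table == DEMO_SYS_CALL_TABLE && eabi.scno == 5);
}
```

    Both calls end up with index 5, the slot of open, but they are looked up in different tables.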

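    Depending on the ABI, the kernel loads the base address of the appropriate system call table into the tbl register, that is, r8. Next we look at the system call tables. As mentioned, there are two of them, both in arch/arm/kernel/entry-common.S: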

    /*
     * This is the syscall table declaration for native ABI syscalls.
     * With EABI a couple syscalls are obsolete and defined as sys_ni_syscall.
     */
    #define ABI(native, compat) native
    #ifdef CONFIG_AEABI
    #define OBSOLETE(syscall) sys_ni_syscall
    #else
    #define OBSOLETE(syscall) syscall
    #endif
    
        .type    sys_call_table, #object
    ENTRY(sys_call_table)
    #include "calls.S"
    #undef ABI
    #undef OBSOLETE

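    The other one is sys_oabi_call_table, which is declared when CONFIG_OABI_COMPAT is enabled:

    /*
     * Let's declare a second syscall table for old ABI binaries
     * using the compatibility syscall entries.
     */
    #define ABI(native, compat) compat
    #define OBSOLETE(syscall) syscall

        .type    sys_oabi_call_table, #object
    ENTRY(sys_oabi_call_table)
    #include "calls.S"
    #undef ABI
    #undef OBSOLETE

    The two declarations look almost identical: both include calls.S and differ only in how ABI() and OBSOLETE() are defined, so one list yields the native entries for sys_call_table and the compatibility entries for sys_oabi_call_table. This use of the #include preprocessor directive is rather interesting: the content of each system call table is simply the entire arch/arm/kernel/calls.S file, which begins as follows (it is too long to list in full):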

    /*
     *  linux/arch/arm/kernel/calls.S
     *
     *  Copyright (C) 1995-2005 Russell King
     *
     * This program is free software; you can redistribute it and/or modify
     * it under the terms of the GNU General Public License version 2 as
     * published by the Free Software Foundation.
     *
     *  This file is included thrice in entry-common.S
     */
    /* 0 */        CALL(sys_restart_syscall)
            CALL(sys_exit)
            CALL(sys_fork_wrapper)
            CALL(sys_read)
            CALL(sys_write)
    /* 5 */        CALL(sys_open)
            CALL(sys_close)
            CALL(sys_ni_syscall)        /* was sys_waitpid */
            CALL(sys_creat)
            CALL(sys_link)
                    ...

    The CALL() macro is defined in the same file, arch/arm/kernel/entry-common.S:

        .equ NR_syscalls,0
    #define CALL(x) .equ NR_syscalls,NR_syscalls+1
    #include "calls.S"
    #undef CALL
    #define CALL(x) .long x
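
    The trick of expanding one list twice with two different CALL() definitions can be reproduced in plain C. The sketch below is an analogy rather than kernel code: a list macro named DEMO_CALLS stands in for calls.S, and the three demo_sys_ functions are invented placeholders.

```c
#include <assert.h>

static long demo_sys_restart(void) { return 0; }
static long demo_sys_exit(void)    { return 1; }
static long demo_sys_fork(void)    { return 2; }

/* Stand-in for calls.S: one CALL() entry per system call, in table order. */
#define DEMO_CALLS \
    CALL(demo_sys_restart) \
    CALL(demo_sys_exit) \
    CALL(demo_sys_fork)

/* First pass: every CALL() contributes 1, giving the number of entries
 * (the counterpart of ".equ NR_syscalls,NR_syscalls+1"). */
#define CALL(x) + 1
enum { DEMO_NR_SYSCALLS = 0 DEMO_CALLS };
#undef CALL

/* Second pass: every CALL() emits a function pointer (like ".long x"). */
#define CALL(x) x,
static long (*const demo_call_table[])(void) = { DEMO_CALLS };
#undef CALL

static void demo_xmacro_example(void)
{
    assert(DEMO_NR_SYSCALLS == 3);
    assert(sizeof(demo_call_table) / sizeof(demo_call_table[0])
           == (unsigned long)DEMO_NR_SYSCALLS);
    assert(demo_call_table[2]() == 2);
}
```

    Because the count and the table are generated from the same list, adding an entry updates both and they cannot drift apart.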

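    One last remark: searching for sys_open will not turn up the definition of the open system call, because system call functions are defined through macros. open, for example, is defined in fs/open.c like this: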

    SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
    {
        long ret;
    
        if (force_o_largefile())
            flags |= O_LARGEFILE;
    
        ret = do_sys_open(AT_FDCWD, filename, flags, mode);
        /* avoid REGPARM breakage on x86: */
        asmlinkage_protect(3, ret, filename, flags, mode);
        return ret;
    }

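    Back in vector_swi: if the system call number is out of range, execution reaches label 2, where numbers at or above __ARM_NR_BASE are passed to arm_syscall and all others go to sys_ni_syscall. arm_syscall is defined in arch/arm/kernel/traps.c: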

    /*
     * Handle all unrecognised system calls.
     *  0x9f0000 - 0x9fffff are some more esoteric system calls
     */
    #define NR(x) ((__ARM_NR_##x) - __ARM_NR_BASE)
    asmlinkage int arm_syscall(int no, struct pt_regs *regs)
    {
        struct thread_info *thread = current_thread_info();
        siginfo_t info;
    
        if ((no >> 16) != (__ARM_NR_BASE>> 16))
            return bad_syscall(no, regs);
    
        switch (no & 0xffff) {
        case 0: /* branch through 0 */
            info.si_signo = SIGSEGV;
            info.si_errno = 0;
            info.si_code  = SEGV_MAPERR;
            info.si_addr  = NULL;
    
            arm_notify_die("branch through zero", regs, &info, 0, 0);
            return 0;
    
        case NR(breakpoint): /* SWI BREAK_POINT */
            regs->ARM_pc -= thumb_mode(regs) ? 2 : 4;
            ptrace_break(current, regs);
            return regs->ARM_r0;
    
        /*
         * Flush a region from virtual address 'r0' to virtual address 'r1'
         * _exclusive_.  There is no alignment requirement on either address;
         * user space does not need to know the hardware cache layout.
         *
         * r2 contains flags.  It should ALWAYS be passed as ZERO until it
         * is defined to be something else.  For now we ignore it, but may
         * the fires of hell burn in your belly if you break this rule. ;)
         *
         * (at a later date, we may want to allow this call to not flush
         * various aspects of the cache.  Passing '0' will guarantee that
         * everything necessary gets flushed to maintain consistency in
         * the specified region).
         */
        case NR(cacheflush):
            do_cache_op(regs->ARM_r0, regs->ARM_r1, regs->ARM_r2);
            return 0;
    
        case NR(usr26):
            if (!(elf_hwcap & HWCAP_26BIT))
                break;
            regs->ARM_cpsr &= ~MODE32_BIT;
            return regs->ARM_r0;
    
        case NR(usr32):
            if (!(elf_hwcap & HWCAP_26BIT))
                break;
            regs->ARM_cpsr |= MODE32_BIT;
            return regs->ARM_r0;
    
        case NR(set_tls):
            thread->tp_value = regs->ARM_r0;
            if (tls_emu)
                return 0;
            if (has_tls_reg) {
                asm ("mcr p15, 0, %0, c13, c0, 3"
                    : : "r" (regs->ARM_r0));
            } else {
                /*
                 * User space must never try to access this directly.
                 * Expect your app to break eventually if you do so.
                 * The user helper at 0xffff0fe0 must be used instead.
                 * (see entry-armv.S for details)
                 */
                *((unsigned int *)0xffff0ff0) = regs->ARM_r0;
            }
            return 0;
    
    #ifdef CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG
        /*
         * Atomically store r1 in *r2 if *r2 is equal to r0 for user space.
         * Return zero in r0 if *MEM was changed or non-zero if no exchange
         * happened.  Also set the user C flag accordingly.
         * If access permissions have to be fixed up then non-zero is
         * returned and the operation has to be re-attempted.
         *
         * *NOTE*: This is a ghost syscall private to the kernel.  Only the
         * __kuser_cmpxchg code in entry-armv.S should be aware of its
         * existence.  Don't ever use this from user code.
         */
        case NR(cmpxchg):
        for (;;) {
            extern void do_DataAbort(unsigned long addr, unsigned int fsr,
                         struct pt_regs *regs);
            unsigned long val;
            unsigned long addr = regs->ARM_r2;
            struct mm_struct *mm = current->mm;
            pgd_t *pgd; pmd_t *pmd; pte_t *pte;
            spinlock_t *ptl;
    
            regs->ARM_cpsr &= ~PSR_C_BIT;
            down_read(&mm->mmap_sem);
            pgd = pgd_offset(mm, addr);
            if (!pgd_present(*pgd))
                goto bad_access;
            pmd = pmd_offset(pgd, addr);
            if (!pmd_present(*pmd))
                goto bad_access;
            pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
            if (!pte_present(*pte) || !pte_dirty(*pte)) {
                pte_unmap_unlock(pte, ptl);
                goto bad_access;
            }
            val = *(unsigned long *)addr;
            val -= regs->ARM_r0;
            if (val == 0) {
                *(unsigned long *)addr = regs->ARM_r1;
                regs->ARM_cpsr |= PSR_C_BIT;
            }
            pte_unmap_unlock(pte, ptl);
            up_read(&mm->mmap_sem);
            return val;
    
            bad_access:
            up_read(&mm->mmap_sem);
            /* simulate a write access fault */
            do_DataAbort(addr, 15 + (1 << 11), regs);
        }
    #endif
    
        default:
            /* Calls 9f00xx..9f07ff are defined to return -ENOSYS
               if not implemented, rather than raising SIGILL.  This
               way the calling program can gracefully determine whether
               a feature is supported.  */
            if ((no & 0xffff) <= 0x7ff)
                return -ENOSYS;
            break;
        }
    #ifdef CONFIG_DEBUG_USER
        /*
         * experience shows that these seem to indicate that
         * something catastrophic has happened
         */
        if (user_debug & UDBG_SYSCALL) {
            printk("[%d] %s: arm syscall %d\n",
                   task_pid_nr(current), current->comm, no);
            dump_instr("", regs);
            if (user_mode(regs)) {
                __show_regs(regs);
                c_backtrace(regs->ARM_fp, processor_mode(regs));
            }
        }
    #endif
        info.si_signo = SIGILL;
        info.si_errno = 0;
        info.si_code  = ILL_ILLTRP;
        info.si_addr  = (void __user *)instruction_pointer(regs) -
                 (thumb_mode(regs) ? 2 : 4);
    
        arm_notify_die("Oops - bad syscall(2)", regs, &info, no, 0);
        return 0;
    }

    Then there is sys_ni_syscall, defined in kernel/sys_ni.c. All it does is return the error code ENOSYS to user space:

    /*  we can't #include <linux/syscalls.h> here,
        but tell gcc to not warn with -Wmissing-prototypes  */
    asmlinkage long sys_ni_syscall(void);
    
    /*
     * Non-implemented system calls get redirected here.
     */
    asmlinkage long sys_ni_syscall(void)
    {
        return -ENOSYS;
    }

    Whether or not the system call number is valid, the return is ultimately made through the ret_fast_syscall routine, also in arch/arm/kernel/entry-common.S:

        .align    5
    /*
     * This is the fast syscall return path.  We do as little as
     * possible here, and this includes saving r0 back into the SVC
     * stack.
     */
    ret_fast_syscall:
     UNWIND(.fnstart    )
     UNWIND(.cantunwind    )
        disable_irq                @ disable interrupts
        ldr    r1, [tsk, #TI_FLAGS]
        tst    r1, #_TIF_WORK_MASK
        bne    fast_work_pending
    #if defined(CONFIG_IRQSOFF_TRACER)
        asm_trace_hardirqs_on
    #endif
    
        /* perform architecture specific actions before user return */
        arch_ret_to_user r1, lr
    
        restore_user_regs fast = 1, offset = S_OFF
     UNWIND(.fnend        )

    4 Macros for declaring system calls

    The interface for defining system call functions in Linux:

    1.SYSCALL_DEFINE1~6 (include/linux/syscalls.h)

    #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)

    2.SYSCALL_DEFINEx

    #ifdef CONFIG_FTRACE_SYSCALLS
    #define SYSCALL_DEFINEx(x, sname, ...)                \
        static const char *types_##sname[] = {            \
            __SC_STR_TDECL##x(__VA_ARGS__)            \
        };                            \
        static const char *args_##sname[] = {            \
            __SC_STR_ADECL##x(__VA_ARGS__)            \
        };                            \
        SYSCALL_METADATA(sname, x);                \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
    #else
    #define SYSCALL_DEFINEx(x, sname, ...)                \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
    #endif

    3.__SYSCALL_DEFINEx

    #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS

    #define SYSCALL_DEFINE(name) static inline long SYSC_##name

    #define __SYSCALL_DEFINEx(x, name, ...)                    \
        asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__));        \
        static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__));    \
        asmlinkage long SyS##name(__SC_LONG##x(__VA_ARGS__))        \
        {                                \
            __SC_TEST##x(__VA_ARGS__);                \
            return (long) SYSC##name(__SC_CAST##x(__VA_ARGS__));    \
        }                                \
        SYSCALL_ALIAS(sys##name, SyS##name);                \
        static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__))

    #else /* CONFIG_HAVE_SYSCALL_WRAPPERS */

    #define SYSCALL_DEFINE(name) asmlinkage long sys_##name
    #define __SYSCALL_DEFINEx(x, name, ...)                    \
        asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__))

    #endif /* CONFIG_HAVE_SYSCALL_WRAPPERS */

    4. Macros whose names begin with __SC_

    #define __SC_DECL1(t1, a1)    t1 a1
    #define __SC_DECL2(t2, a2, ...) t2 a2, __SC_DECL1(__VA_ARGS__)
    #define __SC_DECL3(t3, a3, ...) t3 a3, __SC_DECL2(__VA_ARGS__)
    #define __SC_DECL4(t4, a4, ...) t4 a4, __SC_DECL3(__VA_ARGS__)
    #define __SC_DECL5(t5, a5, ...) t5 a5, __SC_DECL4(__VA_ARGS__)
    #define __SC_DECL6(t6, a6, ...) t6 a6, __SC_DECL5(__VA_ARGS__)
    
    #define __SC_LONG1(t1, a1)     long a1
    #define __SC_LONG2(t2, a2, ...) long a2, __SC_LONG1(__VA_ARGS__)
    #define __SC_LONG3(t3, a3, ...) long a3, __SC_LONG2(__VA_ARGS__)
    #define __SC_LONG4(t4, a4, ...) long a4, __SC_LONG3(__VA_ARGS__)
    #define __SC_LONG5(t5, a5, ...) long a5, __SC_LONG4(__VA_ARGS__)
    #define __SC_LONG6(t6, a6, ...) long a6, __SC_LONG5(__VA_ARGS__)
    
    #define __SC_CAST1(t1, a1)    (t1) a1
    #define __SC_CAST2(t2, a2, ...) (t2) a2, __SC_CAST1(__VA_ARGS__)
    #define __SC_CAST3(t3, a3, ...) (t3) a3, __SC_CAST2(__VA_ARGS__)
    #define __SC_CAST4(t4, a4, ...) (t4) a4, __SC_CAST3(__VA_ARGS__)
    #define __SC_CAST5(t5, a5, ...) (t5) a5, __SC_CAST4(__VA_ARGS__)
    #define __SC_CAST6(t6, a6, ...) (t6) a6, __SC_CAST5(__VA_ARGS__)
    ...
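    The recursive __SC_ macros can be tried out in user space. The following is a minimal sketch, not kernel code: the DEMO_ prefix and the demo_ functions are hypothetical names (identifiers starting with __ are reserved outside the kernel), and only the DECL and LONG families up to three arguments are reproduced.

```c
#include <assert.h>

/* User-space sketch of the recursive __SC_DECLn / __SC_LONGn pattern.
 * All DEMO_ and demo_ names are hypothetical stand-ins. */

/* Each level peels off one (type, name) pair and recurses on the rest. */
#define DEMO_DECL1(t1, a1)      t1 a1
#define DEMO_DECL2(t2, a2, ...) t2 a2, DEMO_DECL1(__VA_ARGS__)
#define DEMO_DECL3(t3, a3, ...) t3 a3, DEMO_DECL2(__VA_ARGS__)

/* Same walk, but every parameter is declared as long. */
#define DEMO_LONG1(t1, a1)      long a1
#define DEMO_LONG2(t2, a2, ...) long a2, DEMO_LONG1(__VA_ARGS__)
#define DEMO_LONG3(t3, a3, ...) long a3, DEMO_LONG2(__VA_ARGS__)

/* Expands to: long demo_sum3(int a, unsigned int b, short c) */
long demo_sum3(DEMO_DECL3(int, a, unsigned int, b, short, c))
{
    return (long)a + (long)b + (long)c;
}

/* Expands to: long demo_sum3_long(long a, long b, long c) */
long demo_sum3_long(DEMO_LONG3(int, a, unsigned int, b, short, c))
{
    return a + b + c;
}

/* Small usage check for both generated signatures. */
void demo_decl_check(void)
{
    assert(demo_sum3(1, 2u, 3) == 6);
    assert(demo_sum3_long(10, 20, 30) == 60);
}
```

    The type arguments are simply dropped by the LONG family, which is how the same argument list can yield both a typed prototype and an all-long prototype.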

    5. Walking through SYSCALL_DEFINE1(close, unsigned int, fd)

    By the definition #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__), the invocation SYSCALL_DEFINE1(close, unsigned int, fd)

    expands to SYSCALL_DEFINEx(1, _close, __VA_ARGS__) (## is the token-pasting operator). By the definition of SYSCALL_DEFINEx, this in turn

    expands to __SYSCALL_DEFINEx(1, _close, __VA_ARGS__), which by the definition of __SYSCALL_DEFINEx becomes:

    /* __SYSCALL_DEFINEx(1, _close, ...) expands to: */
        asmlinkage long sys_close(__SC_DECL1(__VA_ARGS__));
        static inline long SYSC_close(__SC_DECL1(__VA_ARGS__));
        asmlinkage long SyS_close(__SC_LONG1(__VA_ARGS__))
        {
            __SC_TEST1(__VA_ARGS__);
            return (long) SYSC_close(__SC_CAST1(__VA_ARGS__));
        }
        SYSCALL_ALIAS(sys_close, SyS_close);
        static inline long SYSC_close(__SC_DECL1(__VA_ARGS__))

    Here __VA_ARGS__ stands for the variadic macro arguments, which in this case are: unsigned int, fd

    Expanding the __SC_ macros gives:

    /* __SYSCALL_DEFINEx(1, _close, ...) after expanding the __SC_ macros: */
        asmlinkage long sys_close(unsigned int fd);
        static inline long SYSC_close(unsigned int fd);
        asmlinkage long SyS_close(long fd)
        {
            BUILD_BUG_ON(sizeof(unsigned int) > sizeof(long));
            return (long) SYSC_close((unsigned int)fd);
        }
        SYSCALL_ALIAS(sys_close, SyS_close);
        static inline long SYSC_close(unsigned int fd)

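    This declares the function sys_close.

    It defines the function SyS_close, whose body calls SYSC_close and returns its return value.

    The SYSCALL_ALIAS macro:

    #define SYSCALL_ALIAS(alias, name)                    \
        asm ("\t.globl " #alias "\n\t.set " #alias ", " #name)

    inserts assembler directives that make sys_close an alias of SyS_close, so calling sys_close is the same as calling SyS_close.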

    (# is the preprocessor's stringification operator: it turns a macro argument into a string literal.)
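    To make the role of # concrete, here is a small user-space sketch. It does not emit any assembly; it only builds, as an ordinary C string, the text that a macro shaped like SYSCALL_ALIAS would pass to asm(). The DEMO_ macros and demo_ functions are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <string.h>

/* # turns a macro argument into a string literal; ## pastes two tokens. */
#define DEMO_STR(x)       #x
#define DEMO_PASTE(a, b)  a##b

/* Same string construction as SYSCALL_ALIAS, but left as a plain string:
 * adjacent string literals are concatenated by the compiler. */
#define DEMO_ALIAS_TEXT(alias, name) \
    "\t.globl " #alias "\n\t.set " #alias ", " #name

/* Returns the assembler text for aliasing sys_close to SyS_close. */
const char *demo_alias_text(void)
{
    return DEMO_ALIAS_TEXT(sys_close, SyS_close);
}

/* DEMO_PASTE(sys, _close) becomes the single identifier sys_close. */
int demo_paste(void)
{
    int DEMO_PASTE(sys, _close) = 7;
    return sys_close;
}

/* Small usage check covering both operators. */
void demo_stringify_check(void)
{
    assert(strcmp(DEMO_STR(sys_close), "sys_close") == 0);
    assert(strcmp(demo_alias_text(),
                  "\t.globl sys_close\n\t.set sys_close, SyS_close") == 0);
    assert(demo_paste() == 7);
}
```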

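    The BUILD_BUG_ON macro is a compile-time check: the build fails if its condition is true.

    The last line is the header of the definition of SYSC_close.

    That is why a SYSCALL_DEFINE1 invocation is immediately followed by a function body enclosed in {}.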
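    The whole wrapper pattern can be imitated in user space. The sketch below is a simplified assumption-laden model, not the kernel's code: all DEMO_ and demo_ names are hypothetical, asmlinkage and SYSCALL_ALIAS are left out, and DEMO_BUILD_BUG_ON is a simplified stand-in for the kernel's BUILD_BUG_ON.

```c
#include <assert.h>

/* Compile-time check in the style of BUILD_BUG_ON: a true condition
 * yields a negative array size, which is a compile error. */
#define DEMO_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

#define DEMO_DECL1(t1, a1) t1 a1
#define DEMO_LONG1(t1, a1) long a1
#define DEMO_CAST1(t1, a1) (t1) a1
#define DEMO_TEST1(t1, a1) DEMO_BUILD_BUG_ON(sizeof(t1) > sizeof(long))

#define DEMO_SYSCALL_DEFINE1(name, ...) \
    DEMO_SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)

/* The macro ends with an unterminated function header, so the { } block
 * written after the invocation becomes the body of SYSC_<name>. */
#define DEMO_SYSCALL_DEFINEx(x, name, ...)                        \
    static inline long SYSC##name(DEMO_DECL##x(__VA_ARGS__));     \
    long SyS##name(DEMO_LONG##x(__VA_ARGS__))                     \
    {                                                             \
        DEMO_TEST##x(__VA_ARGS__);                                \
        return (long) SYSC##name(DEMO_CAST##x(__VA_ARGS__));      \
    }                                                             \
    static inline long SYSC##name(DEMO_DECL##x(__VA_ARGS__))

/* Hypothetical stand-in for close(): pretends only descriptor 3 is open. */
DEMO_SYSCALL_DEFINE1(demo_close, unsigned int, fd)
{
    return fd == 3 ? 0 : -9; /* -9 mirrors -EBADF on Linux */
}

/* The long-typed wrapper is the entry point; it casts and forwards. */
void demo_wrapper_check(void)
{
    assert(SyS_demo_close(3L) == 0);
    assert(SyS_demo_close(4L) == -9);
}
```

    The body written by the programmer ends up in SYSC_demo_close, while callers enter through SyS_demo_close, which takes every argument as long and casts it back to the declared type.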

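    6. From the analysis in 5 we can infer:

    The '1' in SYSCALL_DEFINE1 means that sys_close takes one argument.

    Likewise, the '?' in SYSCALL_DEFINE? means that sys_name takes '?' arguments.

    7. System call functions are defined with the SYSCALL_DEFINE macros.

    Their external declarations are in the header include/linux/syscalls.h.

    5 Adding a new system call

    First, open arch/arm/kernel/calls.S and append a pointer to the system call function at the end, for example: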

    CALL(sys_set_senda)

    A side note on NR_syscalls: this constant is the total number of system calls. In newer kernel versions it can be found in arch/arm/kernel/entry-common.S:

       .equ NR_syscalls,0
    #define CALL(x) .equ NR_syscalls,NR_syscalls+1
    #include "calls.S"
    #undef CALL
    #define CALL(x) .long x

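    Quite clever, isn't it? Every system call added to the call table automatically increments NR_syscalls by one. NR_syscalls is computed here first, and then the CALL(x) macro is redefined, so the construction of the system call table later in the file is not affected.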
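    The same two-pass idea can be expressed in C. The sketch below is an analogue rather than the kernel's mechanism: the kernel includes calls.S twice in assembler, whereas here the list is a macro (DEMO_CALLS) that is expanded twice with a different per-entry macro each time. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical call list standing in for calls.S. */
#define DEMO_CALLS(CALL) \
    CALL(demo_sys_zero)  \
    CALL(demo_sys_one)   \
    CALL(demo_sys_two)

static long demo_sys_zero(void) { return 0; }
static long demo_sys_one(void)  { return 1; }
static long demo_sys_two(void)  { return 2; }

/* Pass 1: every entry contributes "+ 1", giving the entry count. */
#define DEMO_COUNT(x) + 1
enum { DEMO_NR_SYSCALLS = 0 DEMO_CALLS(DEMO_COUNT) };

/* Pass 2: the same list now emits one table slot per entry. */
#define DEMO_ENTRY(x) x,
typedef long (*demo_syscall_fn)(void);
static const demo_syscall_fn demo_call_table[] = { DEMO_CALLS(DEMO_ENTRY) };

/* Number of slots actually emitted by pass 2. */
size_t demo_table_len(void)
{
    return sizeof(demo_call_table) / sizeof(demo_call_table[0]);
}

/* Calls the handler stored in slot idx (idx must be in range). */
long demo_call(int idx)
{
    return demo_call_table[idx]();
}

/* The count and the table stay in sync because both come from one list. */
void demo_count_check(void)
{
    assert(DEMO_NR_SYSCALLS == 3);
    assert(demo_table_len() == (size_t)DEMO_NR_SYSCALLS);
    assert(demo_call(2) == 2);
}
```

    Adding a fourth CALL(...) line to the list would grow both the count and the table without touching anything else.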

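    Second, open include/asm-arm/unistd.h and add a macro for the system call number. It seems this step could be skipped, because the numbers defined here are mainly used by C libraries such as uClibc and glibc. For example:

        #define __NR_plan_set_senda             (__NR_SYSCALL_BASE+365)

    For backward compatibility, system calls can only be added, never removed. New numbers must also be assigned in order, because the number is used as the index into the table built from calls.S; a mismatch will make the kernel misbehave.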
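    Why the order matters can be seen in a small user-space model of table dispatch. This is only a sketch under assumed names: DEMO_SYSCALL_BASE, the DEMO_NR_ numbers, and the handlers are all hypothetical, and the real dispatch is done in assembly in entry-common.S.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_SYSCALL_BASE 0                        /* hypothetical base */
#define DEMO_NR_first  (DEMO_SYSCALL_BASE + 0)
#define DEMO_NR_second (DEMO_SYSCALL_BASE + 1)

static long demo_first(long arg)  { return arg + 100; }
static long demo_second(long arg) { return arg + 200; }

typedef long (*demo_handler)(long);

/* Slot i must hold the handler for number DEMO_SYSCALL_BASE + i. */
static const demo_handler demo_table[] = { demo_first, demo_second };
#define DEMO_NR_SYSCALLS ((long)(sizeof(demo_table) / sizeof(demo_table[0])))

/* Looks up the handler by number; unknown numbers yield -ENOSYS. */
long demo_dispatch(long nr, long arg)
{
    long idx = nr - DEMO_SYSCALL_BASE;
    if (idx < 0 || idx >= DEMO_NR_SYSCALLS)
        return -ENOSYS;
    return demo_table[idx](arg);
}

/* Numbers map to handlers purely by position in the table. */
void demo_dispatch_check(void)
{
    assert(demo_dispatch(DEMO_NR_first, 1) == 101);
    assert(demo_dispatch(DEMO_NR_second, 1) == 201);
    assert(demo_dispatch(DEMO_NR_SYSCALLS, 0) == -ENOSYS);
}
```

    If the two table entries were swapped without swapping the numbers, DEMO_NR_first would silently run demo_second, which is the kind of mismatch the ordering rule guards against.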

    Third, implement the system call, that is, write the body of the newly added call. For example:

    SYSCALL_DEFINE1(set_senda, int, iset)
    {
           if(iset)
              UART_PUT_CR(&at91_port[2],AT91C_US_SENDA);
           else
              UART_PUT_CR(&at91_port[2],AT91C_US_RSTSTA);
    
           return 0;
    }

    Fourth, open include/linux/syscalls.h and add the function declaration:

    asmlinkage long sys_set_senda(int iset);

    Fifth, invoke the system call from an application; uClibc's implementation can serve as a reference.
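    One common way to reach a system call that has no C library wrapper is the generic syscall() function declared in <unistd.h>. The sketch below assumes a Linux system with glibc or uClibc; the demo_ functions are hypothetical helpers. Because the number for the new call only exists on a kernel carrying the patch, the runnable part uses getpid instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>        /* syscall(), getpid() */

/* Invokes a system call by number with no arguments. */
long demo_raw_syscall0(long nr)
{
    return syscall(nr);
}

/* For the call added in the steps above, an application would pass the
 * number defined in unistd.h plus the argument, along the lines of
 * syscall(__NR_plan_set_senda, 1). That is only valid on a patched
 * kernel, so this sketch issues getpid by number instead. */
long demo_getpid(void)
{
    return demo_raw_syscall0(SYS_getpid);
}

/* The raw call and the library wrapper should agree. */
void demo_syscall_check(void)
{
    assert(demo_getpid() == (long)getpid());
}
```

    On failure syscall() returns -1 and sets errno, so a real caller of the new system call would check for that before using the result.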

    Sixth, done.

  • Original article: https://www.cnblogs.com/cslunatic/p/3655970.html