| -*-Mode: outline-*- |
| |
| Light-weight System Calls for IA-64 |
| ----------------------------------- |
| |
| Started: 13-Jan-2003 |
| Last update: 27-Sep-2003 |
| |
| David Mosberger-Tang |
| <davidm@hpl.hp.com> |
| |
| Using the "epc" instruction effectively introduces a new mode of |
| execution to the ia64 linux kernel. We call this mode the |
| "fsys-mode". To recap, the normal states of execution are: |
| |
| - kernel mode: |
| Both the register stack and the memory stack have been |
| switched over to kernel memory. The user-level state is saved |
| in a pt-regs structure at the top of the kernel memory stack. |
| |
| - user mode: |
| Both the register stack and the kernel stack are in |
| user memory. The user-level state is contained in the |
| CPU registers. |
| |
| - bank 0 interruption-handling mode: |
| This is the non-interruptible state which all |
| interruption-handlers start execution in. The user-level |
| state remains in the CPU registers and some kernel state may |
| be stored in bank 0 of registers r16-r31. |
| |
| In contrast, fsys-mode has the following special properties: |
| |
| - execution is at privilege level 0 (most-privileged) |
| |
| - CPU registers may contain a mixture of user-level and kernel-level |
| state (it is the responsibility of the kernel to ensure that no |
| security-sensitive kernel-level state is leaked back to |
| user-level) |
| |
| - execution is interruptible and preemptible (an fsys-mode handler |
| can disable interrupts and avoid all other interruption-sources |
| to avoid preemption) |
| |
| - neither the memory-stack nor the register-stack can be trusted while |
| in fsys-mode (they point to the user-level stacks, which may |
| be invalid, or completely bogus addresses) |
| |
| In summary, fsys-mode is much more similar to running in user-mode |
| than it is to running in kernel-mode. Of course, given that the |
| privilege level is at level 0, this means that fsys-mode requires some |
| care (see below). |
| |
| |
| * How to tell fsys-mode |
| |
| Linux operates in fsys-mode when (a) the privilege level is 0 (most |
| privileged) and (b) the stacks have NOT been switched to kernel memory |
| yet. For convenience, the header file <asm-ia64/ptrace.h> provides |
| three macros: |
| |
| user_mode(regs) |
| user_stack(task,regs) |
| fsys_mode(task,regs) |
| |
| The "regs" argument is a pointer to a pt_regs structure. The "task" |
| argument is a pointer to the task structure to which the "regs" |
| pointer belongs to. user_mode() returns TRUE if the CPU state pointed |
| to by "regs" was executing in user mode (privilege level 3). |
| user_stack() returns TRUE if the state pointed to by "regs" was |
| executing on the user-level stack(s). Finally, fsys_mode() returns |
| TRUE if the CPU state pointed to by "regs" was executing in fsys-mode. |
| The fsys_mode() macro is equivalent to the expression: |
| |
| !user_mode(regs) && user_stack(task,regs) |
| |
| * How to write an fsyscall handler |
| |
| The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers |
| (fsyscall_table). This table contains one entry for each system call. |
| By default, a system call is handled by fsys_fallback_syscall(). This |
| routine takes care of entering (full) kernel mode and calling the |
| normal Linux system call handler. For performance-critical system |
| calls, it is possible to write a hand-tuned fsyscall_handler. For |
| example, fsys.S contains fsys_getpid(), which is a hand-tuned version |
| of the getpid() system call. |
| |
| The entry and exit-state of an fsyscall handler is as follows: |
| |
| ** Machine state on entry to fsyscall handler: |
| |
| - r10 = 0 |
| - r11 = saved ar.pfs (a user-level value) |
| - r15 = system call number |
| - r16 = "current" task pointer (in normal kernel-mode, this is in r13) |
| - r32-r39 = system call arguments |
| - b6 = return address (a user-level value) |
| - ar.pfs = previous frame-state (a user-level value) |
| - PSR.be = cleared to zero (i.e., little-endian byte order is in effect) |
| - all other registers may contain values passed in from user-mode |
| |
| ** Required machine state on exit to fsyscall handler: |
| |
| - r11 = saved ar.pfs (as passed into the fsyscall handler) |
| - r15 = system call number (as passed into the fsyscall handler) |
| - r32-r39 = system call arguments (as passed into the fsyscall handler) |
| - b6 = return address (as passed into the fsyscall handler) |
| - ar.pfs = previous frame-state (as passed into the fsyscall handler) |
| |
| Fsyscall handlers can execute with very little overhead, but with that |
| speed comes a set of restrictions: |
| |
| o Fsyscall-handlers MUST check for any pending work in the flags |
| member of the thread-info structure and if any of the |
| TIF_ALLWORK_MASK flags are set, the handler needs to fall back on |
| doing a full system call (by calling fsys_fallback_syscall). |
| |
| o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11, |
| r15, b6, and ar.pfs) because they will be needed in case of a |
| system call restart. Of course, all "preserved" registers also |
| must be preserved, in accordance to the normal calling conventions. |
| |
| o Fsyscall-handlers MUST check argument registers for containing a |
| NaT value before using them in any way that could trigger a |
| NaT-consumption fault. If a system call argument is found to |
| contain a NaT value, an fsyscall-handler may return immediately |
| with r8=EINVAL, r10=-1. |
| |
| o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform |
| any other operation that would trigger mandatory RSE |
| (register-stack engine) traffic. |
| |
| o Fsyscall-handlers MUST NOT write to any stacked registers because |
| it is not safe to assume that user-level called a handler with the |
| proper number of arguments. |
| |
| o Fsyscall-handlers need to be careful when accessing per-CPU variables: |
| unless proper safe-guards are taken (e.g., interruptions are avoided), |
| execution may be pre-empted and resumed on another CPU at any given |
| time. |
| |
| o Fsyscall-handlers must be careful not to leak sensitive kernel' |
| information back to user-level. In particular, before returning to |
| user-level, care needs to be taken to clear any scratch registers |
| that could contain sensitive information (note that the current |
| task pointer is not considered sensitive: it's already exposed |
| through ar.k6). |
| |
| o Fsyscall-handlers MUST NOT access user-memory without first |
| validating access-permission (this can be done typically via |
| probe.r.fault and/or probe.w.fault) and without guarding against |
| memory access exceptions (this can be done with the EX() macros |
| defined by asmmacro.h). |
| |
| The above restrictions may seem draconian, but remember that it's |
| possible to trade off some of the restrictions by paying a slightly |
| higher overhead. For example, if an fsyscall-handler could benefit |
| from the shadow register bank, it could temporarily disable PSR.i and |
| PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as |
| needed. In other words, following the above rules yields extremely |
| fast system call execution (while fully preserving system call |
| semantics), but there is also a lot of flexibility in handling more |
| complicated cases. |
| |
| * Signal handling |
| |
| The delivery of (asynchronous) signals must be delayed until fsys-mode |
| is exited. This is accomplished with the help of the lower-privilege |
| transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user() |
| checks whether the interrupted task was in fsys-mode and, if so, sets |
| PSR.lp and returns immediately. When fsys-mode is exited via the |
| "br.ret" instruction that lowers the privilege level, a trap will |
| occur. The trap handler clears PSR.lp again and returns immediately. |
| The kernel exit path then checks for and delivers any pending signals. |
| |
| * PSR Handling |
| |
| The "epc" instruction doesn't change the contents of PSR at all. This |
| is in contrast to a regular interruption, which clears almost all |
| bits. Because of that, some care needs to be taken to ensure things |
| work as expected. The following discussion describes how each PSR bit |
| is handled. |
| |
| PSR.be Cleared when entering fsys-mode. A srlz.d instruction is used |
| to ensure the CPU is in little-endian mode before the first |
| load/store instruction is executed. PSR.be is normally NOT |
| restored upon return from an fsys-mode handler. In other |
| words, user-level code must not rely on PSR.be being preserved |
| across a system call. |
| PSR.up Unchanged. |
| PSR.ac Unchanged. |
| PSR.mfl Unchanged. Note: fsys-mode handlers must not write-registers! |
| PSR.mfh Unchanged. Note: fsys-mode handlers must not write-registers! |
| PSR.ic Unchanged. Note: fsys-mode handlers can clear the bit, if needed. |
| PSR.i Unchanged. Note: fsys-mode handlers can clear the bit, if needed. |
| PSR.pk Unchanged. |
| PSR.dt Unchanged. |
| PSR.dfl Unchanged. Note: fsys-mode handlers must not write-registers! |
| PSR.dfh Unchanged. Note: fsys-mode handlers must not write-registers! |
| PSR.sp Unchanged. |
| PSR.pp Unchanged. |
| PSR.di Unchanged. |
| PSR.si Unchanged. |
| PSR.db Unchanged. The kernel prevents user-level from setting a hardware |
| breakpoint that triggers at any privilege level other than 3 (user-mode). |
| PSR.lp Unchanged. |
| PSR.tb Lazy redirect. If a taken-branch trap occurs while in |
| fsys-mode, the trap-handler modifies the saved machine state |
| such that execution resumes in the gate page at |
| syscall_via_break(), with privilege level 3. Note: the |
| taken branch would occur on the branch invoking the |
| fsyscall-handler, at which point, by definition, a syscall |
| restart is still safe. If the system call number is invalid, |
| the fsys-mode handler will return directly to user-level. This |
| return will trigger a taken-branch trap, but since the trap is |
| taken _after_ restoring the privilege level, the CPU has already |
| left fsys-mode, so no special treatment is needed. |
| PSR.rt Unchanged. |
| PSR.cpl Cleared to 0. |
| PSR.is Unchanged (guaranteed to be 0 on entry to the gate page). |
| PSR.mc Unchanged. |
| PSR.it Unchanged (guaranteed to be 1). |
| PSR.id Unchanged. Note: the ia64 linux kernel never sets this bit. |
| PSR.da Unchanged. Note: the ia64 linux kernel never sets this bit. |
| PSR.dd Unchanged. Note: the ia64 linux kernel never sets this bit. |
| PSR.ss Lazy redirect. If set, "epc" will cause a Single Step Trap to |
| be taken. The trap handler then modifies the saved machine |
| state such that execution resumes in the gate page at |
| syscall_via_break(), with privilege level 3. |
| PSR.ri Unchanged. |
| PSR.ed Unchanged. Note: This bit could only have an effect if an fsys-mode |
| handler performed a speculative load that gets NaTted. If so, this |
| would be the normal & expected behavior, so no special treatment is |
| needed. |
| PSR.bn Unchanged. Note: fsys-mode handlers may clear the bit, if needed. |
| Doing so requires clearing PSR.i and PSR.ic as well. |
| PSR.ia Unchanged. Note: the ia64 linux kernel never sets this bit. |
| |
| * Using fast system calls |
| |
| To use fast system calls, userspace applications need simply call |
| __kernel_syscall_via_epc(). For example |
| |
| -- example fgettimeofday() call -- |
| -- fgettimeofday.S -- |
| |
| #include <asm/asmmacro.h> |
| |
| GLOBAL_ENTRY(fgettimeofday) |
| .prologue |
| .save ar.pfs, r11 |
| mov r11 = ar.pfs |
| .body |
| |
| mov r2 = 0xa000000000020660;; // gate address |
| // found by inspection of System.map for the |
| // __kernel_syscall_via_epc() function. See |
| // below for how to do this for real. |
| |
| mov b7 = r2 |
| mov r15 = 1087 // gettimeofday syscall |
| ;; |
| br.call.sptk.many b6 = b7 |
| ;; |
| |
| .restore sp |
| |
| mov ar.pfs = r11 |
| br.ret.sptk.many rp;; // return to caller |
| END(fgettimeofday) |
| |
| -- end fgettimeofday.S -- |
| |
| In reality, getting the gate address is accomplished by two extra |
| values passed via the ELF auxiliary vector (include/asm-ia64/elf.h) |
| |
| o AT_SYSINFO : is the address of __kernel_syscall_via_epc() |
| o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO |
| |
| The ELF DSO is a pre-linked library that is mapped in by the kernel at |
| the gate page. It is a proper ELF shared object so, with a dynamic |
| loader that recognises the library, you should be able to make calls to |
| the exported functions within it as with any other shared library. |
| AT_SYSINFO points into the kernel DSO at the |
| __kernel_syscall_via_epc() function for historical reasons (it was |
| used before the kernel DSO) and as a convenience. |