1. Intel(R) MPX Overview
========================

Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
introduced into Intel Architecture. Intel MPX provides hardware features
that can be used in conjunction with compiler changes to check memory
references, for those references whose compile-time normal intentions are
usurped at runtime due to buffer overflow or underflow.

You can tell if your CPU supports MPX by looking in /proc/cpuinfo:

	cat /proc/cpuinfo | grep ' mpx '
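
The same check can be made from a program. The following is a minimal
sketch, not part of any MPX tooling: line_has_flag() and cpu_has_flag()
are hypothetical helper names, and the code assumes the usual
/proc/cpuinfo layout in which a "flags" line lists space-separated
feature names.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if 'flag' appears as a whole, space-delimited word in 'line'. */
bool line_has_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = line;

    if (len == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts && ends)
            return true;
        p += len;
    }
    return false;
}

/* Scan /proc/cpuinfo for a "flags" line that lists 'flag'. */
bool cpu_has_flag(const char *flag)
{
    char line[8192];
    bool found = false;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (!f)
        return false;
    while (!found && fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5) == 0)
            found = line_has_flag(line, flag);
    }
    fclose(f);
    return found;
}
```

Calling cpu_has_flag("mpx") then mirrors the grep above. Matching whole
words avoids false positives from other flags that merely contain the
string "mpx".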

For more information, please refer to Intel(R) Architecture Instruction
Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection
Extensions.

Note: As of December 2014, no hardware with MPX is available, but it is
possible to use SDE (Intel(R) Software Development Emulator) instead, which
can be downloaded from
http://software.intel.com/en-us/articles/intel-software-development-emulator


2. How to take advantage of MPX
===============================

For MPX to work, changes are required in the kernel, binutils and compiler.
No source changes are required for applications, just a recompile.

Many moving parts have to line up for this to work. The following
is how we expect the compiler, application and kernel to work together.

1) The application developer compiles with -fmpx. The compiler adds the
   instrumentation as well as some setup code that is called early after
   the app starts. The new instruction prefixes are no-ops on old CPUs.
2) That setup code allocates (virtual) space for the "bounds directory",
   points the "bndcfgu" register to the directory (it must also set the
   valid bit) and notifies the kernel (via the new
   prctl(PR_MPX_ENABLE_MANAGEMENT)) that the app will be using MPX. The
   app must be careful not to access the bounds tables between the time
   when it populates "bndcfgu" and when it calls the prctl(). This might
   be hard to guarantee if the app is compiled with MPX. You can add
   "__attribute__((bnd_legacy))" to the function to disable MPX
   instrumentation to help guarantee this. Also be careful not to call
   out to any other code which might be MPX-instrumented.
3) The kernel detects that the CPU has MPX, allows the new prctl() to
   succeed, and notes the location of the bounds directory. Userspace is
   expected to keep the bounds directory at that location. We note it
   instead of reading it each time because the 'xsave' operation needed
   to access the bounds directory register is expensive.
4) If the application needs to spill bounds out of the 4 registers, it
   issues a bndstx instruction. Since the bounds directory is empty at
   this point, a bounds fault (#BR) is raised, the kernel allocates a
   bounds table (in the user address space) and makes the relevant entry
   in the bounds directory point to the new table.
5) If the application violates the bounds specified in the bounds
   registers, a separate kind of #BR is raised which will deliver a signal
   with information about the violation in the 'struct siginfo'.
6) Whenever memory is freed, we know that it can no longer contain valid
   pointers, and we attempt to free the associated space in the bounds
   tables. If an entire table becomes unused, we will attempt to free
   the table and remove the entry in the directory.

To summarize, there are essentially three things interacting here:

GCC with -fmpx:
 * enables annotation of code with MPX instructions and prefixes
 * inserts code early in the application to call into the "gcc runtime"
GCC MPX Runtime:
 * checks for hardware MPX support in cpuid leaf
 * allocates virtual space for the bounds directory (malloc() essentially)
 * points the hardware BNDCFGU register at the directory
 * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to
   start managing the bounds directories
Kernel MPX Code:
 * checks for hardware MPX support in cpuid leaf
 * handles #BR exceptions and sends SIGSEGV to the app when it violates
   bounds, like during a buffer overflow
 * when bounds are spilled into an unallocated bounds table, notices
   this in the #BR exception, allocates the virtual space, then updates
   the bounds directory to point to the new table. It keeps special
   track of the memory with a VM_MPX flag.
 * frees unused bounds tables at the time that the memory they described
   is unmapped


3. How the MPX kernel code works
================================

Handling #BR faults caused by MPX
---------------------------------

When MPX is enabled, there are 2 new situations that can generate
#BR faults:
  * new bounds tables (BT) need to be allocated to save bounds.
  * a bounds violation is caused by MPX instructions.

We hook the #BR handler to handle these two new situations.

On-demand kernel allocation of bounds tables
--------------------------------------------

MPX only has 4 hardware registers for storing bounds information. If
MPX-enabled code needs more than these 4 registers, it needs to spill
them somewhere. It has two special instructions for this which allow
the bounds to be moved between the bounds registers and some new "bounds
tables".

With MPX, #BR exceptions are conceptually similar to a page fault: the
MPX hardware raises them both on bounds violations and when the tables
are not present. The kernel handles the #BR exceptions for not-present
tables by carving the space out of the process's normal address space
and then pointing the bounds directory at it.

The tables need to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register points to
memory. Any direct kernel involvement (like a syscall) to access the
tables would obviously destroy performance.
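
To make the directory/table relationship concrete, here is a minimal
sketch of how the address at which a pointer is stored could be split
into a bounds directory index and a bounds table index. The 64-bit
layout constants (28 directory index bits, 17 table index bits, 8-byte
directory entries, 32-byte table entries) are assumptions that this
document does not state, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed 64-bit layout; none of these values is given in this document. */
#define BD_INDEX_BITS   28  /* directory indexed by address bits 47:20 */
#define BT_INDEX_BITS   17  /* table indexed by address bits 19:3 */
#define IGNORED_BITS    3   /* stored pointers assumed 8-byte aligned */
#define BD_ENTRY_BYTES  8
#define BT_ENTRY_BYTES  32

/* Index of the bounds directory entry for a pointer stored at ptr_addr. */
uint64_t bd_index(uint64_t ptr_addr)
{
    return (ptr_addr >> (BT_INDEX_BITS + IGNORED_BITS)) &
           ((1ULL << BD_INDEX_BITS) - 1);
}

/* Index of the entry inside the bounds table selected by bd_index(). */
uint64_t bt_index(uint64_t ptr_addr)
{
    return (ptr_addr >> IGNORED_BITS) & ((1ULL << BT_INDEX_BITS) - 1);
}

/* Total size of the bounds directory under the assumed layout. */
uint64_t bd_size_bytes(void)
{
    return (1ULL << BD_INDEX_BITS) * BD_ENTRY_BYTES;
}

/* Size of a single bounds table under the assumed layout. */
uint64_t bt_size_bytes(void)
{
    return (1ULL << BT_INDEX_BITS) * BT_ENTRY_BYTES;
}
```

Under these assumptions the directory is 2GB and each table is 4MB
while covering 1MB of address space, which lines up with the 2GB and 4x
figures quoted in the questions and answers in this section.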

Why not do this in userspace? MPX does not strictly require anything in
the kernel. It can theoretically be done completely from userspace. Here
are a few ways this could be done. We don't think any of them are practical
in the real world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so that we
   never have to allocate them?
A: An MPX-enabled application may create a lot of bounds tables in the
   process address space to save bounds information. These tables can take
   up huge swaths of memory (as much as 80% of the memory on the system)
   even if we clean them up aggressively. In the worst-case scenario, the
   tables can be 4x the size of the data structure being tracked. IOW, a
   1-page structure can require 4 bounds-table pages. An X-GB virtual
   area needs 4*X GB of virtual space, plus 2GB for the bounds directory.
   If we were to preallocate them for the 128TB of user virtual address
   space, we would need to reserve 512TB+2GB, which is larger than the
   entire virtual address space today. This means they cannot be reserved
   ahead of time. Also, a single process's pre-populated bounds directory
   consumes 2GB of virtual *AND* physical memory. IOW, it's completely
   infeasible to prepopulate bounds directories.

Q: Can we preallocate bounds table space at the same time memory is
   allocated which might contain pointers that might eventually need
   bounds tables?
A: This would work if we could hook the site of each and every memory
   allocation syscall. This can be done for small, constrained applications.
   But it isn't practical at a larger scale, since a given app has no
   way of controlling how all the parts of the app might allocate memory
   (think libraries). The kernel is really the only place to intercept
   these calls.

Q: Could a bounds fault be handed to userspace and the tables allocated
   there in a signal handler instead of in the kernel?
A: mmap() is not on the list of async-signal-safe functions, and even
   if mmap() would work, it still requires locking or nasty tricks to
   keep track of the allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
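
The arithmetic behind the first answer above can be checked with a few
lines of C. This is only a worked example of the 4x-plus-2GB rule quoted
there; mpx_reserve_bytes() is a hypothetical name, and the 256TB figure
used in the note below assumes a 48-bit virtual address space.

```c
#include <stdint.h>

#define GB (1ULL << 30)
#define TB (1ULL << 40)

/*
 * Worst-case virtual space needed to pre-reserve bounds metadata for
 * 'tracked_bytes' of address space: 4x for the bounds tables plus a
 * fixed 2GB for the bounds directory.
 */
uint64_t mpx_reserve_bytes(uint64_t tracked_bytes)
{
    return 4 * tracked_bytes + 2 * GB;
}
```

mpx_reserve_bytes(128 * TB) evaluates to 512TB + 2GB, which exceeds
the 256TB that a 48-bit address space can hold in total.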

Decoding MPX instructions
-------------------------

If a #BR is generated due to a bounds violation caused by MPX, we need
to decode the MPX instruction to get the violation address and store
this address in the extended struct siginfo.

The _sigfault field of struct siginfo is extended as follows:

	/* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
	struct {
		void __user *_addr; /* faulting insn/memory ref. */
	#ifdef __ARCH_SI_TRAPNO
		int _trapno;	/* TRAP # which caused the signal */
	#endif
		short _addr_lsb; /* LSB of the reported address */
		struct {
			void __user *_lower;
			void __user *_upper;
		} _addr_bnd;
	} _sigfault;

The '_addr' field refers to the violation address, and the new
'_addr_bnd' field refers to the upper/lower bounds when a #BR is caused.

Glibc will also be updated to support this new siginfo, so users can
get the violation address and bounds when bounds violations occur.

Cleanup unused bounds tables
----------------------------

When a BNDSTX instruction attempts to save bounds to a bounds directory
entry marked as invalid, a #BR is generated. This is an indication that
no bounds table exists for this entry. In this case the fault handler
will allocate a new bounds table on demand.

Since the kernel allocated those tables on-demand without userspace
knowledge, it is also responsible for freeing them when the associated
mappings go away.

The solution is to hook do_munmap() to check whether the process is
MPX-enabled. If so, the bounds tables covering the virtual address
region being unmapped are freed as well.
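
The following simplified sketch shows one way to work out which bounds
tables become entirely unused when a range is unmapped. It is not the
kernel's algorithm: it assumes each bounds table covers 1MB of address
space (a figure this document does not state), and
tables_fully_unmapped() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define BT_COVERAGE_SHIFT 20    /* assumed: each bounds table covers 1MB */

/*
 * Find the bounds directory indices [*first, *last] of the tables whose
 * whole coverage lies inside the unmapped range [start, end).
 * Returns false if no table is fully covered.
 */
bool tables_fully_unmapped(uint64_t start, uint64_t end,
                           uint64_t *first, uint64_t *last)
{
    uint64_t span = 1ULL << BT_COVERAGE_SHIFT;
    uint64_t lo = (start + span - 1) >> BT_COVERAGE_SHIFT; /* round up */
    uint64_t hi = end >> BT_COVERAGE_SHIFT;                /* round down */

    if (hi <= lo)
        return false;

    *first = lo;
    *last = hi - 1;
    return true;
}
```

A table that is only partly covered by the unmapped range may still
describe live memory, so this sketch leaves it alone. How such partial
overlaps are treated is outside what the sketch models.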

Adding new prctl commands
-------------------------

Two new prctl commands are added to enable and disable MPX bounds tables
management in the kernel.

	#define PR_MPX_ENABLE_MANAGEMENT	43
	#define PR_MPX_DISABLE_MANAGEMENT	44

The runtime library in userspace is responsible for allocating the
bounds directory, so the kernel has to use the XSAVE instruction to get
the base of the bounds directory from the BNDCFGU register.

But XSAVE is expected to be very expensive. As a performance
optimization, the kernel gets the base of the bounds directory once,
during PR_MPX_ENABLE_MANAGEMENT command execution, and saves it in
struct mm_struct for future use.
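
From userspace, issuing these commands might look like the sketch below.
It assumes the remaining prctl() arguments are passed as zero, and the
wrapper names are hypothetical. It does not set up the bounds directory
or the "bndcfgu" register, which the runtime must do before enabling
management.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values as listed in this document, in case the headers lack them. */
#ifndef PR_MPX_ENABLE_MANAGEMENT
#define PR_MPX_ENABLE_MANAGEMENT   43
#endif
#ifndef PR_MPX_DISABLE_MANAGEMENT
#define PR_MPX_DISABLE_MANAGEMENT  44
#endif

/* Ask the kernel to manage bounds tables; returns 0 or -errno. */
int mpx_enable_management(void)
{
    if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0) == -1)
        return -errno;
    return 0;
}

/* Tell the kernel to stop managing bounds tables; returns 0 or -errno. */
int mpx_disable_management(void)
{
    if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0) == -1)
        return -errno;
    return 0;
}
```

A negative return value is expected on kernels or CPUs without MPX
support, so a caller would normally fall back to running without
kernel-managed bounds tables.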


4. Special rules
================

1) If userspace is requesting help from the kernel to do the management
of bounds tables, it may not create or modify entries in the bounds directory.

Certainly users can allocate bounds tables and forcibly point the bounds
directory at them through the XSAVE instruction, and then set the valid
bit of the bounds entry to make this entry valid. But the kernel will
decline to assist in managing these tables.

2) Userspace may not take multiple bounds directory entries and point
them at the same bounds table.

This is allowed architecturally. For more information, see "Intel(R)
Architecture Instruction Set Extensions Programming Reference" (9.3.4).

However, if users did this, the kernel might be fooled into unmapping an
in-use bounds table since it does not recognize sharing.