=================
KVM VCPU Requests
=================

Overview
========

KVM supports an internal API enabling threads to request a VCPU thread to
perform some activity.  For example, a thread may request a VCPU to flush
its TLB with a VCPU request.  The API consists of the following functions::

  /* Check if any requests are pending for VCPU @vcpu. */
  bool kvm_request_pending(struct kvm_vcpu *vcpu);

  /* Check if VCPU @vcpu has request @req pending. */
  bool kvm_test_request(int req, struct kvm_vcpu *vcpu);

  /* Clear request @req for VCPU @vcpu. */
  void kvm_clear_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Check if VCPU @vcpu has request @req pending. When the request is
   * pending it will be cleared and a memory barrier, which pairs with
   * another in kvm_make_request(), will be issued.
   */
  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
   * with another in kvm_check_request(), prior to setting the request.
   */
  void kvm_make_request(int req, struct kvm_vcpu *vcpu);

  /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

Typically a requester wants the VCPU to perform the activity as soon
as possible after making the request.  This means most requests
(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
and kvm_make_all_cpus_request() has the kicking of all VCPUs built
into it.
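
For example, a minimal sketch of requesting a TLB flush of a single VCPU
and then kicking it, using only the calls named above::

  kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);  /* set the request bit */
  kvm_vcpu_kick(vcpu);                        /* make the VCPU notice it soon */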

VCPU Kicks
----------

The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
order to perform some KVM maintenance.  To do so, an IPI is sent, forcing
a guest mode exit.  However, a VCPU thread may not be in guest mode at the
time of the kick.  Therefore, depending on the mode and state of the VCPU
thread, there are two other actions a kick may take.  All three actions
are listed below:

1) Send an IPI.  This forces a guest mode exit.
2) Wake a sleeping VCPU.  Sleeping VCPUs are VCPU threads outside guest
   mode that wait on waitqueues.  Waking them removes the threads from
   the waitqueues, allowing the threads to run again.  This behavior
   may be suppressed, see KVM_REQUEST_NO_WAKEUP below.
3) Do nothing.  When the VCPU is not in guest mode and the VCPU thread is
   not sleeping, then there is nothing to do.

VCPU Mode
---------

VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
VCPU thread is running in guest mode or not, as well as some specific
states outside guest mode.  The architecture may use ``vcpu->mode`` to
ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
even to ensure IPI acknowledgements are waited upon (see "Waiting for
Acknowledgements").  The following modes are defined:

OUTSIDE_GUEST_MODE

  The VCPU thread is outside guest mode.

IN_GUEST_MODE

  The VCPU thread is in guest mode.

EXITING_GUEST_MODE

  The VCPU thread is transitioning from IN_GUEST_MODE to
  OUTSIDE_GUEST_MODE.

READING_SHADOW_PAGE_TABLES

  The VCPU thread is outside guest mode, but it wants the sender of
  certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
  thread is done reading the page tables.
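
For reference, these modes are values of an enum in
include/linux/kvm_host.h; a sketch of its shape follows (the header is the
authoritative definition)::

  enum {
      OUTSIDE_GUEST_MODE,
      IN_GUEST_MODE,
      EXITING_GUEST_MODE,
      READING_SHADOW_PAGE_TABLES,
  };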

VCPU Request Internals
======================

VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
This means general bitops, like those documented in [atomic-ops]_, could
also be used, e.g. ::

  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);

However, VCPU request users should refrain from doing so, as it would
break the abstraction.  The first 8 bits are reserved for architecture
independent requests; all additional bits are available for architecture
dependent requests.
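
As a rough sketch of how the API wraps these bitops, two of the helpers
might look as follows (simplified; the real implementations in
include/linux/kvm_host.h are authoritative)::

  static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
  {
      /* Publish any associated state before setting the bit; pairs
       * with the barrier in kvm_check_request(). */
      smp_wmb();
      set_bit(req & KVM_REQUEST_MASK, &vcpu->requests);
  }

  static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
  {
      if (!kvm_test_request(req, vcpu))
          return false;

      kvm_clear_request(req, vcpu);
      /* Pairs with the smp_wmb() in kvm_make_request(). */
      smp_mb__after_atomic();
      return true;
  }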

Architecture Independent Requests
---------------------------------

KVM_REQ_TLB_FLUSH

  KVM's common MMU notifier may need to flush all of a guest's TLB
  entries, calling kvm_flush_remote_tlbs() to do so.  Architectures that
  choose to use the common kvm_flush_remote_tlbs() implementation will
  need to handle this VCPU request (a handling sketch follows this list).

KVM_REQ_MMU_RELOAD

  When shadow page tables are used and memory slots are removed, it's
  necessary to inform each VCPU to completely refresh the tables.  This
  request is used for that.

KVM_REQ_PENDING_TIMER

  This request may be made from a timer handler run on the host on behalf
  of a VCPU.  It informs the VCPU thread to inject a timer interrupt.

KVM_REQ_UNHALT

  This request may be made from the KVM common function kvm_vcpu_block(),
  which is used to emulate an instruction that causes a CPU to halt until
  one of an architecture-specific set of events and/or interrupts is
  received (determined by checking kvm_arch_vcpu_runnable()).  When that
  event or interrupt arrives kvm_vcpu_block() makes the request.  This is
  in contrast to when kvm_vcpu_block() returns due to any other reason,
  such as a pending signal, which does not indicate the VCPU's halt
  emulation should stop, and therefore does not make the request.
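
As an illustration, a hypothetical architecture's request handling, run
before each guest entry, might consume the requests above as follows (the
arch_*() helper names are made up for the example)::

  if (kvm_request_pending(vcpu)) {
      if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
          arch_flush_guest_tlb(vcpu);
      if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
          arch_reload_shadow_page_tables(vcpu);
      if (kvm_check_request(KVM_REQ_PENDING_TIMER, vcpu))
          arch_inject_timer_interrupt(vcpu);
  }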

KVM_REQUEST_MASK
----------------

VCPU requests should be masked by KVM_REQUEST_MASK before using them with
bitops.  This is because only the lower 8 bits are used to represent the
request's number.  The upper bits are used as flags.  Currently only two
flags are defined.

VCPU Request Flags
------------------

KVM_REQUEST_NO_WAKEUP

  This flag is applied to requests that only need immediate attention
  from VCPUs running in guest mode.  That is, sleeping VCPUs do not need
  to be awakened for these requests.  Sleeping VCPUs will handle the
  requests when they are awakened later for some other reason.

KVM_REQUEST_WAIT

  When requests with this flag are made with kvm_make_all_cpus_request(),
  the caller will wait for each VCPU to acknowledge its IPI before
  proceeding.  This flag only applies to VCPUs that would receive IPIs.
  If, for example, the VCPU is sleeping, so no IPI is necessary, then
  the requesting thread does not wait.  This means that this flag may be
  safely combined with KVM_REQUEST_NO_WAKEUP.  See "Waiting for
  Acknowledgements" for more information about requests with
  KVM_REQUEST_WAIT.
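
Flags are typically OR'ed into a request's number when the request is
defined, following the layout described under KVM_REQUEST_MASK.  A
hypothetical architecture-dependent request wanting both behaviors might
be defined as follows (the name and bit number are illustrative only)::

  #define KVM_REQ_EXAMPLE_FLUSH  (8 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)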

VCPU Requests with Associated State
===================================

Requesters that want the receiving VCPU to handle new state need to ensure
the newly written state is observable to the receiving VCPU thread's CPU
by the time it observes the request.  This means a write memory barrier
must be inserted after writing the new state and before setting the VCPU
request bit.  Additionally, on the receiving VCPU thread's side, a
corresponding read barrier must be inserted after reading the request bit
and before proceeding to read the new state associated with it.  See
scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
[memory-barriers]_.

The pair of functions, kvm_check_request() and kvm_make_request(), provide
the memory barriers, allowing this requirement to be handled internally by
the API.
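
For example, a minimal sketch of the pattern; the request name, the
``new_cfg`` field, and the apply_new_cfg() handler are all hypothetical::

  /* Requesting thread: write the new state, then make the request. */
  vcpu->arch.new_cfg = cfg;
  kvm_make_request(KVM_REQ_EXAMPLE, vcpu);   /* write barrier before the bit */
  kvm_vcpu_kick(vcpu);

  /* Receiving VCPU thread, e.g. in its request handling loop. */
  if (kvm_check_request(KVM_REQ_EXAMPLE, vcpu))   /* read barrier after the bit */
      apply_new_cfg(vcpu, vcpu->arch.new_cfg);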

Ensuring Requests Are Seen
==========================

When making requests to VCPUs, we want to avoid the receiving VCPU
executing in guest mode for an arbitrarily long time without handling the
request.  We can be sure this won't happen as long as we ensure the VCPU
thread checks kvm_request_pending() before entering guest mode and that a
kick will send an IPI to force an exit from guest mode when necessary.
Extra care must be taken to cover the period after the VCPU thread's last
kvm_request_pending() check and before it has entered guest mode, as kick
IPIs will only trigger guest mode exits for VCPU threads that are in guest
mode or at least have already disabled interrupts in order to prepare to
enter guest mode.  This means that an optimized implementation (see "IPI
Reduction") must be certain when it's safe to not send the IPI.  One
solution, which all architectures except s390 apply, is to:

- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
  the last kvm_request_pending() check;
- enable interrupts atomically when entering the guest.

This solution also requires memory barriers to be placed carefully in both
the requesting thread and the receiving VCPU.  With the memory barriers we
can exclude the possibility of a VCPU thread observing
!kvm_request_pending() on its last check and then not receiving an IPI for
the next request made of it, even if the request is made immediately after
the check.  This is done by way of the Dekker memory barrier pattern
(scenario 10 of [lwn-mb]_).  As the Dekker pattern requires two variables,
this solution pairs ``vcpu->mode`` with ``vcpu->requests``.  Substituting
them into the pattern gives::

  CPU1                                    CPU2
  =================                       =================
  local_irq_disable();
  WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
  smp_mb();                               smp_mb();
  if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
                                              IN_GUEST_MODE) {
      ...abort guest entry...                 ...send IPI...
  }                                       }

As stated above, the IPI is only useful for VCPU threads in guest mode or
that have already disabled interrupts.  This is why this specific case of
the Dekker pattern has been extended to disable interrupts before setting
``vcpu->mode`` to IN_GUEST_MODE.  WRITE_ONCE() and READ_ONCE() are used to
pedantically implement the memory barrier pattern, guaranteeing the
compiler doesn't interfere with ``vcpu->mode``'s carefully planned
accesses.
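
Written as C, the requester column above is roughly what kvm_make_request()
followed by kvm_vcpu_kick() boils down to (a minimal sketch that elides the
sleeping-VCPU wakeup and the cross-CPU bookkeeping the real kick performs)::

  kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
  smp_mb();   /* order the request bit against the vcpu->mode read */
  if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
      smp_send_reschedule(vcpu->cpu);   /* the IPI forces a guest mode exit */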

IPI Reduction
-------------

As only one IPI is needed to get a VCPU to check for any/all requests,
they may be coalesced.  This is easily done by having the first
IPI-sending kick also change the VCPU mode to something !IN_GUEST_MODE.
The transitional state, EXITING_GUEST_MODE, is used for this purpose.
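
The mode change is done with an atomic compare-and-exchange, so only the
first kick needs to send the IPI; a sketch modeled on the
kvm_vcpu_exiting_guest_mode() helper in include/linux/kvm_host.h::

  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
  {
      /*
       * Move IN_GUEST_MODE to EXITING_GUEST_MODE and return the old mode;
       * later kicks observe !IN_GUEST_MODE and can skip sending an IPI.
       */
      return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
  }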

Waiting for Acknowledgements
----------------------------

Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
be sent, and the acknowledgements to be waited upon, even when the target
VCPU threads are in modes other than IN_GUEST_MODE.  For example, one case
is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
is set after disabling interrupts.  To support these cases, the
KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
checking that the VCPU is IN_GUEST_MODE to checking that it is not
OUTSIDE_GUEST_MODE.
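
The per-VCPU decision could be sketched as follows (a simplification of
the logic applied by kvm_make_all_cpus_request(), which also batches the
IPIs; the helper name is illustrative)::

  static bool example_request_needs_ipi(struct kvm_vcpu *vcpu, unsigned int req)
  {
      int mode = READ_ONCE(vcpu->mode);

      if (req & KVM_REQUEST_WAIT)
          return mode != OUTSIDE_GUEST_MODE;

      return mode == IN_GUEST_MODE;
  }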

Request-less VCPU Kicks
-----------------------

As the determination of whether or not to send an IPI depends on the
two-variable Dekker memory barrier pattern, it's clear that request-less
VCPU kicks are almost never correct.  Without the assurance that a
non-IPI-generating kick will still result in an action by the receiving
VCPU, as the final kvm_request_pending() check does for
request-accompanying kicks, the kick may not do anything useful at all.
If, for instance, a request-less kick was made to a VCPU that was just
about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then the
VCPU thread may continue its entry without actually having done whatever
it was the kick was meant to initiate.

One exception is x86's posted interrupt mechanism.  In this case, however,
even the request-less VCPU kick is coupled with the same
local_irq_disable() + smp_mb() pattern described above; the ON bit
(Outstanding Notification) in the posted interrupt descriptor takes the
role of ``vcpu->requests``.  When sending a posted interrupt, PIR.ON is
set before reading ``vcpu->mode``; dually, in the VCPU thread,
vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
IN_GUEST_MODE.

Additional Considerations
=========================

Sleeping VCPUs
--------------

VCPU threads may need to consider requests before and/or after calling
functions that may put them to sleep, e.g. kvm_vcpu_block().  Whether they
do or not, and, if they do, which requests need consideration, is
architecture dependent.  kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
to check if it should awaken.  One reason to do so is to provide
architectures a function where requests may be checked if necessary.
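
For instance, a hypothetical architecture's kvm_arch_vcpu_runnable() might
wake the VCPU when a timer request is pending (the arch helper below is
made up for the example)::

  int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
  {
      return kvm_test_request(KVM_REQ_PENDING_TIMER, vcpu) ||
             arch_interrupt_pending(vcpu);   /* hypothetical helper */
  }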

Clearing Requests
-----------------

Generally it only makes sense for the receiving VCPU thread to clear a
request.  However, in some circumstances the requesting thread and the
receiving VCPU thread execute serially, such as when they are the same
thread, or when they are using some form of concurrency control to
temporarily execute synchronously.  In those cases it's known that the
request may be cleared immediately, rather than waiting for the receiving
VCPU thread to handle the request in VCPU RUN.  The only current examples
of this are kvm_vcpu_block() calls made by VCPUs to block themselves.  A
possible side-effect of that call is to make the KVM_REQ_UNHALT request,
which may then be cleared immediately when the VCPU returns from the call.
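
A minimal sketch of that pattern, as it might appear in a VCPU's own halt
emulation path (the arch helper is hypothetical)::

  kvm_vcpu_block(vcpu);
  if (kvm_check_request(KVM_REQ_UNHALT, vcpu))
      arch_leave_halted_state(vcpu);   /* hypothetical; resume running the guest */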

References
==========

.. [atomic-ops] Documentation/core-api/atomic_ops.rst
.. [memory-barriers] Documentation/memory-barriers.txt
.. [lwn-mb] https://lwn.net/Articles/573436/