| KVM Lock Overview |
| ================= |
| |
| 1. Acquisition Orders |
| --------------------- |
| |
| The acquisition orders for mutexes are as follows: |
| |
| - kvm->lock is taken outside vcpu->mutex |
| |
| - kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock |
| |
| - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring |
| them together is quite rare. |
| |
| On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock. |
| |
| Everything else is a leaf: no other lock is taken inside the critical |
| sections. |
| |
| 2: Exception |
| ------------ |
| |
| Fast page fault: |
| |
| Fast page fault is the fast path which fixes the guest page fault out of |
| the mmu-lock on x86. Currently, the page fault can be fast in one of the |
| following two cases: |
| |
| 1. Access Tracking: The SPTE is not present, but it is marked for access |
| tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to |
| restore the saved R/X bits. This is described in more detail later below. |
| |
| 2. Write-Protection: The SPTE is present and the fault is |
| caused by write-protect. That means we just need to change the W bit of the |
| spte. |
| |
| What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and |
| SPTE_MMU_WRITEABLE bit on the spte: |
| - SPTE_HOST_WRITEABLE means the gfn is writable on host. |
| - SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when |
| the gfn is writable on guest mmu and it is not write-protected by shadow |
| page write-protection. |
| |
| On fast page fault path, we will use cmpxchg to atomically set the spte W |
| bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or |
| restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This |
| is safe because whenever changing these bits can be detected by cmpxchg. |
| |
| But we need carefully check these cases: |
| 1): The mapping from gfn to pfn |
| The mapping from gfn to pfn may be changed since we can only ensure the pfn |
| is not changed during cmpxchg. This is a ABA problem, for example, below case |
| will happen: |
| |
| At the beginning: |
| gpte = gfn1 |
| gfn1 is mapped to pfn1 on host |
| spte is the shadow page table entry corresponding with gpte and |
| spte = pfn1 |
| |
| VCPU 0 VCPU0 |
| on fast page fault path: |
| |
| old_spte = *spte; |
| pfn1 is swapped out: |
| spte = 0; |
| |
| pfn1 is re-alloced for gfn2. |
| |
| gpte is changed to point to |
| gfn2 by the guest: |
| spte = pfn1; |
| |
| if (cmpxchg(spte, old_spte, old_spte+W) |
| mark_page_dirty(vcpu->kvm, gfn1) |
| OOPS!!! |
| |
| We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap. |
| |
| For direct sp, we can easily avoid it since the spte of direct sp is fixed |
| to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic() |
| to pin gfn to pfn, because after gfn_to_pfn_atomic(): |
| - We have held the refcount of pfn that means the pfn can not be freed and |
| be reused for another gfn. |
| - The pfn is writable that means it can not be shared between different gfns |
| by KSM. |
| |
| Then, we can ensure the dirty bitmaps is correctly set for a gfn. |
| |
| Currently, to simplify the whole things, we disable fast page fault for |
| indirect shadow page. |
| |
| 2): Dirty bit tracking |
| In the origin code, the spte can be fast updated (non-atomically) if the |
| spte is read-only and the Accessed bit has already been set since the |
| Accessed bit and Dirty bit can not be lost. |
| |
| But it is not true after fast page fault since the spte can be marked |
| writable between reading spte and updating spte. Like below case: |
| |
| At the beginning: |
| spte.W = 0 |
| spte.Accessed = 1 |
| |
| VCPU 0 VCPU0 |
| In mmu_spte_clear_track_bits(): |
| |
| old_spte = *spte; |
| |
| /* 'if' condition is satisfied. */ |
| if (old_spte.Accessed == 1 && |
| old_spte.W == 0) |
| spte = 0ull; |
| on fast page fault path: |
| spte.W = 1 |
| memory write on the spte: |
| spte.Dirty = 1 |
| |
| |
| else |
| old_spte = xchg(spte, 0ull) |
| |
| |
| if (old_spte.Accessed == 1) |
| kvm_set_pfn_accessed(spte.pfn); |
| if (old_spte.Dirty == 1) |
| kvm_set_pfn_dirty(spte.pfn); |
| OOPS!!! |
| |
| The Dirty bit is lost in this case. |
| |
| In order to avoid this kind of issue, we always treat the spte as "volatile" |
| if it can be updated out of mmu-lock, see spte_has_volatile_bits(), it means, |
| the spte is always atomically updated in this case. |
| |
| 3): flush tlbs due to spte updated |
| If the spte is updated from writable to readonly, we should flush all TLBs, |
| otherwise rmap_write_protect will find a read-only spte, even though the |
| writable spte might be cached on a CPU's TLB. |
| |
| As mentioned before, the spte can be updated to writable out of mmu-lock on |
| fast page fault path, in order to easily audit the path, we see if TLBs need |
| be flushed caused by this reason in mmu_spte_update() since this is a common |
| function to update spte (present -> present). |
| |
| Since the spte is "volatile" if it can be updated out of mmu-lock, we always |
| atomically update the spte, the race caused by fast page fault can be avoided, |
| See the comments in spte_has_volatile_bits() and mmu_spte_update(). |
| |
| Lockless Access Tracking: |
| |
| This is used for Intel CPUs that are using EPT but do not support the EPT A/D |
| bits. In this case, when the KVM MMU notifier is called to track accesses to a |
| page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present |
| by clearing the RWX bits in the PTE and storing the original R & X bits in |
| some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the |
| PTE (using the ignored bit 62). When the VM tries to access the page later on, |
| a fault is generated and the fast page fault mechanism described above is used |
| to atomically restore the PTE to a Present state. The W bit is not saved when |
| the PTE is marked for access tracking and during restoration to the Present |
| state, the W bit is set depending on whether or not it was a write access. If |
| it wasn't, then the W bit will remain clear until a write access happens, at |
| which time it will be set using the Dirty tracking mechanism described above. |
| |
| 3. Reference |
| ------------ |
| |
| Name: kvm_lock |
| Type: mutex |
| Arch: any |
| Protects: - vm_list |
| |
| Name: kvm_count_lock |
| Type: raw_spinlock_t |
| Arch: any |
| Protects: - hardware virtualization enable/disable |
| Comment: 'raw' because hardware enabling/disabling must be atomic /wrt |
| migration. |
| |
| Name: kvm_arch::tsc_write_lock |
| Type: raw_spinlock |
| Arch: x86 |
| Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} |
| - tsc offset in vmcb |
| Comment: 'raw' because updating the tsc offsets must not be preempted. |
| |
| Name: kvm->mmu_lock |
| Type: spinlock_t |
| Arch: any |
| Protects: -shadow page/shadow tlb entry |
| Comment: it is a spinlock since it is used in mmu notifier. |
| |
| Name: kvm->srcu |
| Type: srcu lock |
| Arch: any |
| Protects: - kvm->memslots |
| - kvm->buses |
| Comment: The srcu read lock must be held while accessing memslots (e.g. |
| when using gfn_to_* functions) and while accessing in-kernel |
| MMIO/PIO address->device structure mapping (kvm->buses). |
| The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu |
| if it is needed by multiple functions. |
| |
| Name: blocked_vcpu_on_cpu_lock |
| Type: spinlock_t |
| Arch: x86 |
| Protects: blocked_vcpu_on_cpu |
| Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts. |
| When VT-d posted-interrupts is supported and the VM has assigned |
| devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu |
| protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues |
| wakeup notification event since external interrupts from the |
| assigned devices happens, we will find the vCPU on the list to |
| wakeup. |