diff options
Diffstat (limited to 'Documentation/virt/kvm/locking.txt')
| -rw-r--r-- | Documentation/virt/kvm/locking.txt | 215 | 
1 files changed, 215 insertions, 0 deletions
| diff --git a/Documentation/virt/kvm/locking.txt b/Documentation/virt/kvm/locking.txt new file mode 100644 index 000000000000..635cd6eaf714 --- /dev/null +++ b/Documentation/virt/kvm/locking.txt @@ -0,0 +1,215 @@ +KVM Lock Overview +================= + +1. Acquisition Orders +--------------------- + +The acquisition orders for mutexes are as follows: + +- kvm->lock is taken outside vcpu->mutex + +- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock + +- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring +  them together is quite rare. + +On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock. + +Everything else is a leaf: no other lock is taken inside the critical +sections. + +2: Exception +------------ + +Fast page fault: + +Fast page fault is the fast path which fixes the guest page fault out of +the mmu-lock on x86. Currently, the page fault can be fast in one of the +following two cases: + +1. Access Tracking: The SPTE is not present, but it is marked for access +tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to +restore the saved R/X bits. This is described in more detail later below. + +2. Write-Protection: The SPTE is present and the fault is +caused by write-protect. That means we just need to change the W bit of the  +spte. + +What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and +SPTE_MMU_WRITEABLE bit on the spte: +- SPTE_HOST_WRITEABLE means the gfn is writable on host. +- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when +  the gfn is writable on guest mmu and it is not write-protected by shadow +  page write-protection. + +On fast page fault path, we will use cmpxchg to atomically set the spte W +bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or  +restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This +is safe because whenever changing these bits can be detected by cmpxchg. + +But we need carefully check these cases: +1): The mapping from gfn to pfn +The mapping from gfn to pfn may be changed since we can only ensure the pfn +is not changed during cmpxchg. This is a ABA problem, for example, below case +will happen: + +At the beginning: +gpte = gfn1 +gfn1 is mapped to pfn1 on host +spte is the shadow page table entry corresponding with gpte and +spte = pfn1 + +   VCPU 0                           VCPU0 +on fast page fault path: + +   old_spte = *spte; +                                 pfn1 is swapped out: +                                    spte = 0; + +                                 pfn1 is re-alloced for gfn2. + +                                 gpte is changed to point to +                                 gfn2 by the guest: +                                    spte = pfn1; + +   if (cmpxchg(spte, old_spte, old_spte+W) +	mark_page_dirty(vcpu->kvm, gfn1) +             OOPS!!! + +We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap. + +For direct sp, we can easily avoid it since the spte of direct sp is fixed +to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic() +to pin gfn to pfn, because after gfn_to_pfn_atomic(): +- We have held the refcount of pfn that means the pfn can not be freed and +  be reused for another gfn. +- The pfn is writable that means it can not be shared between different gfns +  by KSM. + +Then, we can ensure the dirty bitmaps is correctly set for a gfn. + +Currently, to simplify the whole things, we disable fast page fault for +indirect shadow page. + +2): Dirty bit tracking +In the origin code, the spte can be fast updated (non-atomically) if the +spte is read-only and the Accessed bit has already been set since the +Accessed bit and Dirty bit can not be lost. + +But it is not true after fast page fault since the spte can be marked +writable between reading spte and updating spte. Like below case: + +At the beginning: +spte.W = 0 +spte.Accessed = 1 + +   VCPU 0                                       VCPU0 +In mmu_spte_clear_track_bits(): + +   old_spte = *spte; + +   /* 'if' condition is satisfied. */ +   if (old_spte.Accessed == 1 && +        old_spte.W == 0) +      spte = 0ull; +                                         on fast page fault path: +                                             spte.W = 1 +                                         memory write on the spte: +                                             spte.Dirty = 1 + + +   else +      old_spte = xchg(spte, 0ull) + + +   if (old_spte.Accessed == 1) +      kvm_set_pfn_accessed(spte.pfn); +   if (old_spte.Dirty == 1) +      kvm_set_pfn_dirty(spte.pfn); +      OOPS!!! + +The Dirty bit is lost in this case. + +In order to avoid this kind of issue, we always treat the spte as "volatile" +if it can be updated out of mmu-lock, see spte_has_volatile_bits(), it means, +the spte is always atomically updated in this case. + +3): flush tlbs due to spte updated +If the spte is updated from writable to readonly, we should flush all TLBs, +otherwise rmap_write_protect will find a read-only spte, even though the +writable spte might be cached on a CPU's TLB. + +As mentioned before, the spte can be updated to writable out of mmu-lock on +fast page fault path, in order to easily audit the path, we see if TLBs need +be flushed caused by this reason in mmu_spte_update() since this is a common +function to update spte (present -> present). + +Since the spte is "volatile" if it can be updated out of mmu-lock, we always +atomically update the spte, the race caused by fast page fault can be avoided, +See the comments in spte_has_volatile_bits() and mmu_spte_update(). + +Lockless Access Tracking: + +This is used for Intel CPUs that are using EPT but do not support the EPT A/D +bits. In this case, when the KVM MMU notifier is called to track accesses to a +page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present +by clearing the RWX bits in the PTE and storing the original R & X bits in +some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the +PTE (using the ignored bit 62). When the VM tries to access the page later on, +a fault is generated and the fast page fault mechanism described above is used +to atomically restore the PTE to a Present state. The W bit is not saved when +the PTE is marked for access tracking and during restoration to the Present +state, the W bit is set depending on whether or not it was a write access. If +it wasn't, then the W bit will remain clear until a write access happens, at  +which time it will be set using the Dirty tracking mechanism described above. + +3. Reference +------------ + +Name:		kvm_lock +Type:		mutex +Arch:		any +Protects:	- vm_list + +Name:		kvm_count_lock +Type:		raw_spinlock_t +Arch:		any +Protects:	- hardware virtualization enable/disable +Comment:	'raw' because hardware enabling/disabling must be atomic /wrt +		migration. + +Name:		kvm_arch::tsc_write_lock +Type:		raw_spinlock +Arch:		x86 +Protects:	- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} +		- tsc offset in vmcb +Comment:	'raw' because updating the tsc offsets must not be preempted. + +Name:		kvm->mmu_lock +Type:		spinlock_t +Arch:		any +Protects:	-shadow page/shadow tlb entry +Comment:	it is a spinlock since it is used in mmu notifier. + +Name:		kvm->srcu +Type:		srcu lock +Arch:		any +Protects:	- kvm->memslots +		- kvm->buses +Comment:	The srcu read lock must be held while accessing memslots (e.g. +		when using gfn_to_* functions) and while accessing in-kernel +		MMIO/PIO address->device structure mapping (kvm->buses). +		The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu +		if it is needed by multiple functions. + +Name:		blocked_vcpu_on_cpu_lock +Type:		spinlock_t +Arch:		x86 +Protects:	blocked_vcpu_on_cpu +Comment:	This is a per-CPU lock and it is used for VT-d posted-interrupts. +		When VT-d posted-interrupts is supported and the VM has assigned +		devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu +		protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues +		wakeup notification event since external interrupts from the +		assigned devices happens, we will find the vCPU on the list to +		wakeup. | 
