summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
11 daysKVM: x86: clarify leave_smm() return valuePaolo Bonzini
The return value of vmx_leave_smm() is unrelated from that of nested_vmx_enter_non_root_mode(). Check explicitly for success (which happens to be 0) and return 1 just like everywhere else in vmx_leave_smm(). Likewise, in svm_leave_smm() return 0/1 instead of the 0/1/-errno returned by tenter_svm_guest_mode(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysKVM: SVM: check validity of VMCB controls when returning from SMMPaolo Bonzini
The VMCB12 is stored in guest memory and can be mangled while in SMM; it is then reloaded by svm_leave_smm(), but it is not checked again for validity. Move the cached vmcb12 control and save consistency checks out of svm_set_nested_state() and into a helper, and reuse it in svm_leave_smm(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysKVM: VMX: check validity of VMCS controls when returning from SMMPaolo Bonzini
The VMCS12 is not available while in SMM. However, it can be overwritten if userspace manages to trigger copy_enlightened_to_vmcs12() - for example via KVM_GET_NESTED_STATE. Because of this, the VMCS12 has to be checked for validity before it is used to generate the VMCS02. Move the check code out of vmx_set_nested_state() (the other "not a VMLAUNCH/VMRESUME" path that emulates a nested vmentry) and reuse it in vmx_leave_smm(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysKVM: SVM: Set/clear CR8 write interception when AVIC is (de)activatedSean Christopherson
Explicitly set/clear CR8 write interception when AVIC is (de)activated to fix a bug where KVM leaves the interception enabled after AVIC is activated. E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8 will remain intercepted in perpetuity. On its own, the dangling CR8 intercept is "just" a performance issue, but combined with the TPR sync bug fixed by commit d02e48830e3f ("KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the danging intercept is fatal to Windows guests as the TPR seen by hardware gets wildly out of sync with reality. Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in KVM's world. I.e. there's no need to trigger update_cr8_intercept(), this is firmly an SVM implementation flaw/detail. WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should never enter the guest with AVIC enabled and CR8 writes intercepted. Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Cc: Naveen N Rao (AMD) <naveen@kernel.org> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260203190711.458413-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> [Squash fix to avic_deactivate_vmcb. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysKVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APICSean Christopherson
Initialize all per-vCPU AVIC control fields in the VMCB if AVIC is enabled in KVM and the VM has an in-kernel local APIC, i.e. if it's _possible_ the vCPU could activate AVIC at any point in its lifecycle. Configuring the VMCB if and only if AVIC is active "works" purely because of optimizations in kvm_create_lapic() to speculatively set apicv_active if AVIC is enabled *and* to defer updates until the first KVM_RUN. In quotes because KVM likely won't do the right thing if kvm_apicv_activated() is false, i.e. if a vCPU is created while APICv is inhibited at the VM level for whatever reason. E.g. if the inhibit is *removed* before KVM_REQ_APICV_UPDATE is handled in KVM_RUN, then __kvm_vcpu_update_apicv() will elide calls to vendor code due to seeing "apicv_active == activate". Cleaning up the initialization code will also allow fixing a bug where KVM incorrectly leaves CR8 interception enabled when AVIC is activated without creating a mess with respect to whether AVIC is activated or not. Cc: stable@vger.kernel.org Fixes: 67034bb9dd5e ("KVM: SVM: Add irqchip_split() checks before enabling AVIC") Fixes: 6c3e4422dd20 ("svm: Add support for dynamic APICv") Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260203190711.458413-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysKVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMMJim Mattson
Add KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM to allow L1 to set FREEZE_IN_SMM in vmcs12's GUEST_IA32_DEBUGCTL field, as permitted prior to commit 6b1dd26544d0 ("KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest"). Enable the quirk by default for backwards compatibility (like all quirks); userspace can disable it via KVM_CAP_DISABLE_QUIRKS2 for consistency with the constraints on WRMSR(IA32_DEBUGCTL). Note that the quirk only bypasses the consistency check. The vmcs02 bit is still owned by the host, and PMCs are not frozen during virtualized SMM. In particular, if a host administrator decides that PMCs should not be frozen during physical SMM, then L1 has no say in the matter. Fixes: 095686e6fcb4 ("KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter") Cc: stable@vger.kernel.org Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260205231537.1278753-1-jmattson@google.com [sean: tag for stable@, clean-up and fix goofs in the comment and docs] Signed-off-by: Sean Christopherson <seanjc@google.com> [Rename quirk. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysKVM: x86: Fix SRCU list traversal in kvm_fire_mask_notifiers()Li RongQing
The mask_notifier_list is protected by kvm->irq_srcu, but the traversal in kvm_fire_mask_notifiers() incorrectly uses hlist_for_each_entry_rcu(). This leads to lockdep warnings because the standard RCU iterator expects to be under rcu_read_lock(), not SRCU. Replace the RCU variant with hlist_for_each_entry_srcu() and provide the proper srcu_read_lock_held() annotation to ensure correct synchronization and silence lockdep. Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://patch.msgid.link/20260204091206.2617-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysKVM: VMX: Fix a wrong MSR update in add_atomic_switch_msr()Namhyung Kim
The previous change had a bug to update a guest MSR with a host value. Fixes: c3d6a7210a4de9096 ("KVM: VMX: Dedup code for adding MSR to VMCS's auto list") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260220220216.389475-1-namhyung@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysKVM: x86: hyper-v: Validate all GVAs during PV TLB flushManuel Andreas
In KVM guests with Hyper-V hypercalls enabled, the hypercalls HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX allow a guest to request invalidation of portions of a virtual TLB. For this, the hypercall parameter includes a list of GVAs that are supposed to be invalidated. Currently, only the base GVA is checked to be canonical. In reality, this check needs to be performed for the entire range of GVAs, as checking only the base GVA enables guests running on Intel hardware to trigger a WARN_ONCE in the host (see Fixes commit below). Move the check for non-canonical addresses to be performed for every GVA of the supplied range to avoid the splat, and to be more in line with the Hyper-V specification, since, although unlikely, a range starting with an invalid GVA may still contain GVAs that are valid. Fixes: fa787ac07b3c ("KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush") Signed-off-by: Manuel Andreas <manuel.andreas@tum.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://patch.msgid.link/00a7a31b-573b-4d92-91f8-7d7e2f88ea48@tum.de [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysKVM: x86: synthesize CPUID bits only if CPU capability is setCarlos López
KVM incorrectly synthesizes CPUID bits for KVM-only leaves, as the following branch in kvm_cpu_cap_init() is never taken: if (leaf < NCAPINTS) kvm_cpu_caps[leaf] &= kernel_cpu_caps[leaf]; This means that bits set via SYNTHESIZED_F() for KVM-only leaves are unconditionally set. This for example can cause issues for SEV-SNP guests running on Family 19h CPUs, as TSA_SQ_NO and TSA_L1_NO are always enabled by KVM in 80000021[ECX]. When userspace issues a SNP_LAUNCH_UPDATE command to update the CPUID page for the guest, SNP firmware will explicitly reject the command if the page sets sets these bits on vulnerable CPUs. To fix this, check in SYNTHESIZED_F() that the corresponding X86 capability is set before adding it to to kvm_cpu_cap_features. Fixes: 31272abd5974 ("KVM: SVM: Advertise TSA CPUID bits to guests") Link: https://lore.kernel.org/all/20260208164233.30405-1-clopez@suse.de/ Signed-off-by: Carlos López <clopez@suse.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://patch.msgid.link/20260209153108.70667-2-clopez@suse.de Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
11 daysMerge tag 'kvm-x86-generic-7.0-rc3' of https://github.com/kvm-x86/linux into ↵Paolo Bonzini
HEAD KVM generic changes for 7.0 - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being unnecessary and confusing, triggered compiler warnings due to -Wflex-array-member-not-at-end. - Document that vcpu->mutex is take outside of kvm->slots_lock and kvm->slots_arch_lock, which is intentional and desirable despite being rather unintuitive.
2026-02-28KVM: always define KVM_CAP_SYNC_MMUPaolo Bonzini
KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always available. Move the definition from individual architectures to common code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-02-28KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIERPaolo Bonzini
All architectures now use MMU notifier for KVM page table management. Remove the Kconfig symbol and the code that is used when it is disabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-11Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM mediated PMU support for 6.20 Add support for mediated PMUs, where KVM gives the guest full ownership of PMU hardware (contexted switched around the fastpath run loop) and allows direct access to data MSRs and PMCs (restricted by the vPMU model), but intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state. To keep overall complexity reasonable, mediated PMU usage is all or nothing for a given instance of KVM (controlled via module param). The Mediated PMU is disabled default, partly to maintain backwards compatilibity for existing setup, partly because there are tradeoffs when running with a mediated PMU that may be non-starters for some use cases, e.g. the host loses the ability to profile guests with mediated PMUs, the fastpath run loop is also a blind spot, entry/exit transitions are more expensive, etc. Versus the emulated PMU, where KVM is "just another perf user", the mediated PMU delivers more accurate profiling and monitoring (no risk of contention and thus dropped events), with significantly less overhead (fewer exits and faster emulation/programming of event selectors) E.g. when running Specint-2017 on a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from within the guest: Perf command: a. basic-sampling: perf record -F 1000 -e 6-instructions -a --overwrite b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite Guest performance overhead: --------------------------------------------------------------------------- | Test case | emulated vPMU | all passthrough | passthrough with | | | | | event filters | --------------------------------------------------------------------------- | basic-sampling | 33.62% | 4.24% | 6.21% | --------------------------------------------------------------------------- | multiplex-sampling | 79.32% | 7.34% | 10.45% | ---------------------------------------------------------------------------
2026-02-11Merge tag 'kvm-x86-apic-6.20' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 APIC-ish changes for 6.20 - Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when creating a vCPU-specific mapping of guest memory. - Clean up KVM's handling of marking mapped vCPU pages dirty. - Drop a pile of *ancient* sanity checks hidden behind in KVM's unused ASSERT() macro, most of which could be trivially triggered by the guest and/or user, and all of which were useless. - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it more obvious what the weird parameter is used for, and to allow burying the RTC shenanigans behind CONFIG_KVM_IOAPIC=y. - Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y. - Add a regression test for recent APICv update fixes. - Rework KVM's handling of VMCS updates while L2 is active to temporarily switch to vmcs01 instead of deferring the update until the next nested VM-Exit. The deferred updates approach directly contributed to several bugs, was proving to be a maintenance burden due to the difficulty in auditing the correctness of deferred updates, and was polluting "struct nested_vmx" with a growing pile of booleans. - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv() to consolidate the updates, and to co-locate SVI updates with the updates for KVM's own cache of ISR information. - Drop a dead function declaration.
2026-02-11Merge tag 'kvm-x86-gmem-6.20' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM guest_memfd changes for 6.20 - Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage handling, and instead rely on SNP (the only user of the tracking) to do its own tracking via the RMP. - Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to avoid non-trivial complexity for a non-existent usecase (and because in-place conversion simply can't support unaligned sources). - When populating guest_memfd memory, GUP the source page in common code and pass the refcounted page to the vendor callback, instead of letting vendor code do the heavy lifting. Doing so avoids a looming deadlock bug with in-place due an AB-BA conflict betwee mmap_lock and guest_memfd's filemap invalidate lock.
2026-02-09Merge tag 'kvm-x86-misc-6.20' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.20 - Disallow changing the virtual CPU model if L2 is active, for all the same reasons KVM disallows change the model after the first KVM_RUN. - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that were advertised as supported to userspace when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled. - Fix a bug where KVM would attempt to read protect guest state (CR3) when configuring an async #PF entry. - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Explicitly allow the few exports that are intended for external usage. - Ignore -EBUSY when checking nested events after a vCPU exits blocking as the WARN is user-triggerable, and because exiting to userspace on -EBUSY does more harm than good in pretty much every situation. - Throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an unwinnable game. - Add support for new Intel instructions that don't require anything beyond enumerating feature flags to userspace. - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2. - Add WARNs to guard against modifying KVM's CPU caps outside of the intended setup flow, as nested VMX in particular is sensitive to unexpected changes in KVM's golden configuration. - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts when the suppression feature is enabled by the guest (currently limited to split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an option as some userspaces have come to rely on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective of whether or not userspace I/O APIC supports Directed EOIs). - Minor cleanups.
2026-02-09Merge tag 'kvm-x86-svm-6.20' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SVM changes for 6.20 - Drop a user-triggerable WARN on nested_svm_load_cr3() failure. - Add support for virtualizing ERAPS. Note, correct virtualization of ERAPS relies on an upcoming, publicly announced change in the APM to reduce the set of conditions where hardware (i.e. KVM) *must* flush the RAP. - Ignore nSVM intercepts for instructions that are not supported according to L1's virtual CPU model. - Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath for EPT Misconfig. - Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM, and allow userspace to restore nested state with GIF=0. - Treat exit_code as an unsigned 64-bit value through all of KVM. - Add support for fetching SNP certificates from userspace. - Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD or VMSAVE on behalf of L2. - Misc fixes and cleanups.
2026-02-09Merge tag 'kvm-x86-vmx-6.20' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.20 - Fix an SGX bug where KVM would incorrectly try to handle EPCM #PFs by always relecting EPCM #PFs back into the guest. KVM doesn't shadow EPCM entries, and so EPCM violations cannot be due to KVM interference, and can't be resolved by KVM. - Fix a bug where KVM would register its posted interrupt wakeup handler even if loading kvm-intel.ko ultimately failed. - Disallow access to vmcb12 fields that aren't fully supported, mostly to avoid weirdness and complexity for FRED and other features, where KVM wants enable VMCS shadowing for fields that conditionally exist. - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or refuses to online a CPU) due to a VMCS config mismatch.
2026-01-30KVM: x86: Add x2APIC "features" to control EOI broadcast suppressionKhushit Shah
Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated by userspace), which KVM completely mishandles. When x2APIC support was first added, KVM incorrectly advertised and "enabled" Suppress EOI Broadcast, without fully supporting the I/O APIC side of the equation, i.e. without adding directed EOI to KVM's in-kernel I/O APIC. That flaw was carried over to split IRQCHIP support, i.e. KVM advertised support for Suppress EOI Broadcasts irrespective of whether or not the userspace I/O APIC implementation supported directed EOIs. Even worse, KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without support for directed EOI came to rely on the "spurious" broadcasts. KVM "fixed" the in-kernel I/O APIC implementation by completely disabling support for Suppress EOI Broadcasts in commit 0bcc3fb95b97 ("KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but didn't do anything to remedy userspace I/O APIC implementations. KVM's bogus handling of Suppress EOI Broadcast is problematic when the guest relies on interrupts being masked in the I/O APIC until well after the initial local APIC EOI. E.g. Windows with Credential Guard enabled handles interrupts in the following order: 1. Interrupt for L2 arrives. 2. L1 APIC EOIs the interrupt. 3. L1 resumes L2 and injects the interrupt. 4. L2 EOIs after servicing. 5. L1 performs the I/O APIC EOI. Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt storm, e.g. if the IRQ line is still asserted and userspace reacts to the EOI by re-injecting the IRQ, because the guest doesn't de-assert the line until step #4, and doesn't expect the interrupt to be re-enabled until step #5. Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way of knowing if the userspace I/O APIC supports directed EOIs, i.e. suppressing EOI broadcasts would result in interrupts being stuck masked in the userspace I/O APIC due to step #5 being ignored by userspace. And fully disabling support for Suppress EOI Broadcast is also undesirable, as picking up the fix would require a guest reboot, *and* more importantly would change the virtual CPU model exposed to the guest without any buy-in from userspace. Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to explicitly enable or disable support for Suppress EOI Broadcasts. This gives userspace control over the virtual CPU model exposed to the guest, as KVM should never have enabled support for Suppress EOI Broadcast without userspace opt-in. Not setting either flag will result in legacy quirky behavior for backward compatibility. Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel I/O APIC, as KVM's history/support is just as tragic. E.g. it's not clear that commit c806a6ad35bf ("KVM: x86: call irq notifiers with directed EOI") was entirely correct, i.e. it may have simply papered over the lack of Directed EOI emulation in the I/O APIC. Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI is to support Directed EOI (KVM's name) irrespective of guest CPU vendor. Fixes: 7543a635aa09 ("KVM: x86: Add KVM exit for IOAPIC EOIs") Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com Cc: stable@vger.kernel.org Suggested-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Khushit Shah <khushit.shah@nutanix.com> Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com [sean: clean up minor formatting goofs and fix a comment typo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30KVM: x86: Harden against unexpected adjustments to kvm_cpu_capsSean Christopherson
Add a flag to track when KVM is actively configuring its CPU caps, and WARN if a cap is set or cleared if KVM isn't in its configuration stage. Modifying CPU caps after {svm,vmx}_set_cpu_caps() can be fatal to KVM, as vendor setup code expects the CPU caps to be frozen at that point, e.g. will do additional configuration based on the caps. Rename kvm_set_cpu_caps() to kvm_initialize_cpu_caps() to pair with the new "finalize", and to make it more obvious that KVM's CPU caps aren't fully configured within the function. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30KVM: VMX: Print out "bad" offsets+value on VMCS config mismatchSean Christopherson
When kvm-intel.ko refuses to load due to a mismatched VMCS config, print all mismatching offsets+values to make it easier to debug goofs during development, and to make it at least feasible to triage failures that occur during production. E.g. if a physical core is flaky or is running with the "wrong" microcode patch loaded, then a CPU can get a legitimate mismatch even without KVM bugs. Print the mismatches as 32-bit values as a compromise between hand coding every field (to provide precise information) and printing individual bytes (requires more effort to deduce the mismatch bit(s)). All fields in the VMCS config are either 32-bit or 64-bit values, i.e. in many cases, printing 32-bit values will be 100% precise, and in the others it's close enough, especially when considering that MSR values are split into EDX:EAX anyways. E.g. on mismatch CET entry/exit controls, KVM will print: kvm_intel: VMCS config on CPU 0 doesn't match reference config: Offset 76 REF = 0x107fffff, CPU0 = 0x007fffff, mismatch = 0x10000000 Offset 84 REF = 0x0010f3ff, CPU0 = 0x0000f3ff, mismatch = 0x00100000 Opportunistically tweak the wording on the initial error message to say "mismatch" instead of "inconsistent", as the VMCS config itself isn't inconsistent, and the wording conflates the cross-CPU compatibility check with the error_on_inconsistent_vmcs_config knob that treats inconsistent VMCS configurations as errors (e.g. if a CPU supports CET entry controls but no CET exit controls). Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30KVM: x86: Explicitly configure supported XSS from {svm,vmx}_set_cpu_caps()Sean Christopherson
Explicitly configure KVM's supported XSS as part of each vendor's setup flow to fix a bug where clearing SHSTK and IBT in kvm_cpu_caps, e.g. due to lack of CET XFEATURE support, makes kvm-intel.ko unloadable when nested VMX is enabled, i.e. when nested=1. The late clearing results in nested_vmx_setup_{entry,exit}_ctls() clearing VM_{ENTRY,EXIT}_LOAD_CET_STATE when nested_vmx_setup_ctls_msrs() runs during the CPU compatibility checks, ultimately leading to a mismatched VMCS config due to the reference config having the CET bits set, but every CPU's "local" config having the bits cleared. Note, kvm_caps.supported_{xcr0,xss} are unconditionally initialized by kvm_x86_vendor_init(), before calling into vendor code, and not referenced between ops->hardware_setup() and their current/old location. Fixes: 69cc3e886582 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER") Cc: stable@vger.kernel.org Cc: Mathias Krause <minipli@grsecurity.net> Cc: John Allen <john.allen@amd.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Chao Gao <chao.gao@intel.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-26KVM: nVMX: Remove explicit filtering of GUEST_INTR_STATUS from shadow VMCS ↵Sean Christopherson
fields Drop KVM's filtering of GUEST_INTR_STATUS when generating the shadow VMCS bitmap now that KVM drops GUEST_INTR_STATUS from the set of supported vmcs12 fields if the field isn't supported by hardware, and initialization of the shadow VMCS fields omits unsupported vmcs12 fields. Note, there is technically a small functional change here, as the vmcs12 filtering only requires support for Virtual Interrupt Delivery, whereas the shadow VMCS code being removed required "full" APICv support, i.e. required Virtual Interrupt Delivery *and* APIC Register Virtualizaton *and* Posted Interrupt support. Opportunistically tweak the comment to more precisely explain why the PML and VMX preemption timer fields need to be explicitly checked. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20260115173427.716021-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-26KVM: nVMX: Disallow access to vmcs12 fields that aren't supported by "hardware"Sean Christopherson
Disallow access (VMREAD/VMWRITE), both emulated and via a shadow VMCS, to VMCS fields that the loaded incarnation of KVM doesn't support, e.g. due to lack of hardware support, as a middle ground between allowing access to any vmcs12 field defined by KVM (current behavior) and gating access based on the userspace-defined vCPU model (the most functionally correct, but very costly, implementation). Disallowing access to unsupported fields helps a tiny bit in terms of closing the virtualization hole (see below), but the main motivation is to avoid having to weed out unsupported fields when synchronizing between vmcs12 and a shadow VMCS. Because shadow VMCS accesses are done via VMREAD and VMWRITE, KVM _must_ filter out unsupported fields (or eat VMREAD/VMWRITE failures), and filtering out just shadow VMCS fields is about the same amount of effort, and arguably much more confusing. As a bonus, this also fixes a KVM-Unit-Test failure bug when running on _hardware_ without support for TSC Scaling, which fails with the same signature as the bug fixed by commit ba1f82456ba8 ("KVM: nVMX: Dynamically compute max VMCS index for vmcs12"): FAIL: VMX_VMCS_ENUM.MAX_INDEX expected: 19, actual: 17 Dynamically computing the max VMCS index only resolved the issue where KVM was hardcoding max index, but for CPUs with TSC Scaling, that was "good enough". Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xin Li <xin@zytor.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/all/20251026201911.505204-22-xin@zytor.com Link: https://lore.kernel.org/all/YR2Tf9WPNEzrE7Xg@google.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20260115173427.716021-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23KVM: x86: Add SRCU protection for reading PDPTRs in __get_sregs2()Vasiliy Kovalev
Add SRCU read-side protection when reading PDPTR registers in __get_sregs2(). Reading PDPTRs may trigger access to guest memory: kvm_pdptr_read() -> svm_cache_reg() -> load_pdptrs() -> kvm_vcpu_read_guest_page() -> kvm_vcpu_gfn_to_memslot() kvm_vcpu_gfn_to_memslot() dereferences memslots via __kvm_memslots(), which uses srcu_dereference_check() and requires either kvm->srcu or kvm->slots_lock to be held. Currently only vcpu->mutex is held, triggering lockdep warning: ============================= WARNING: suspicious RCU usage in kvm_vcpu_gfn_to_memslot 6.12.59+ #3 Not tainted include/linux/kvm_host.h:1062 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by syz.5.1717/15100: #0: ff1100002f4b00b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x1d5/0x1590 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0xf0/0x120 lib/dump_stack.c:120 lockdep_rcu_suspicious+0x1e3/0x270 kernel/locking/lockdep.c:6824 __kvm_memslots include/linux/kvm_host.h:1062 [inline] __kvm_memslots include/linux/kvm_host.h:1059 [inline] kvm_vcpu_memslots include/linux/kvm_host.h:1076 [inline] kvm_vcpu_gfn_to_memslot+0x518/0x5e0 virt/kvm/kvm_main.c:2617 kvm_vcpu_read_guest_page+0x27/0x50 virt/kvm/kvm_main.c:3302 load_pdptrs+0xff/0x4b0 arch/x86/kvm/x86.c:1065 svm_cache_reg+0x1c9/0x230 arch/x86/kvm/svm/svm.c:1688 kvm_pdptr_read arch/x86/kvm/kvm_cache_regs.h:141 [inline] __get_sregs2 arch/x86/kvm/x86.c:11784 [inline] kvm_arch_vcpu_ioctl+0x3e20/0x4aa0 arch/x86/kvm/x86.c:6279 kvm_vcpu_ioctl+0x856/0x1590 virt/kvm/kvm_main.c:4663 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl fs/ioctl.c:893 [inline] __x64_sys_ioctl+0x18b/0x210 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xbd/0x1d0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Fixes: 6dba94035203 ("KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2") Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org> Link: https://patch.msgid.link/20260123222801.646123-1-kovalev@altlinux.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23KVM: x86: Advertise AVX10_VNNI_INT CPUID to userspaceZhao Liu
Define and advertise AVX10_VNNI_INT CPUID to userspace when it's supported by the host. AVX10_VNNI_INT (0x24.0x1.ECX[bit 2]) is a discrete feature bit introduced on Intel Diamond Rapids, which enumerates the support for EVEX VPDP* instructions for INT8/INT16 [*]. Since this feature has no actual kernel usages, define it as a KVM-only feature in reverse_cpuid.h. Advertise new CPUID subleaf 0x24.0x1 with AVX10_VNNI_INT bit to userspace for guest use. It's safe since no additional enabling work is needed in the host kernel. [*]: Intel Advanced Vector Extensions 10.2 Architecture Specification (rev 5.0). Tested-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251120050720.931449-5-zhao1.liu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23KVM: x86: Advertise AVX10.2 CPUID to userspaceZhao Liu
Bump up the maximum supported AVX10 version and enumerate AVX10.2 to userspace when it's supported by the host. Intel AVX10 Version 2 (Intel AVX10.2) includes a suite of new instructions delivering new AI features and performance, accelerated media processing, expanded Web Assembly, and Cryptography support, along with enhancements to existing legacy instructions for completeness and efficiency, and it is enumerated as version 2 in CPUID 0x24.0x0.EBX[bits 0-7] [1]. AVX10.2 has no current kernel usage and requires no additional host kernel enabling work (based on AVX10.1 support) and provides no new VMX controls [2]. Moreover, since AVX10.2 is the superset of AVX10.1, there's no need to worry about AVX10.1 and AVX10.2 compatibility issues in KVM. Therefore, it's safe to advertise AVX10.2 version to userspace directly if host supports AVX10.2. [1]: Intel Advanced Vector Extensions 10.2 Architecture Specification (rev 5.0). [2]: Note: Since AVX10.2 spec (rev 4.0), it has been declared "AVX10/512 will be used in all Intel products, supporting vector lengths of 128, 256, and 512 in all product lines", and the VMX support (in earlier revisions) for AVX10/256 guest on AVX10/512 host has been dropped. Tested-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251120050720.931449-4-zhao1.liu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23KVM: x86: Advertise AMX CPUIDs in subleaf 0x1E.0x1 to userspaceZhao Liu
Define and advertise AMX CPUIDs (0x1E.0x1) to userspace when the leaf is supported by the host. Intel Diamond Rapids adds new AMX instructions to support new formats and memory operations [*], and introduces the CPUID subleaf 0x1E.0x1 to centralize the discrete AMX feature bits within EAX. Since these AMX features have no actual kernel usages, define them as KVM-only features in reverse_cpuid.h. In addition to the new features, CPUID 0x1E.0x1.EAX[bits 0-3] are aliaseed positions of existing AMX feature bits distributed across the 0x7 leaves. To avoid duplicate feature names, name these aliases with an *_ALIAS suffix, and define them in reverse_cpuid.h as KVM-only features as well. Advertise new CPUID subleaf 0x1E.0x1 with its AMX CPUID feature bits to userspace for guest use. It's safe since no additional enabling work is needed in the host kernel. [*]: Intel Architecture Instruction Set Extensions and Future Features (rev.059). Tested-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251120050720.931449-3-zhao1.liu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23KVM: x86: Advertise MOVRS CPUID to userspaceZhao Liu
Define the feature flag for MOVRS and advertise support to userspace when the feature is supported by the host. MOVRS is a new set of instructions introduced in the Intel platform Diamond Rapids, to provide load instructions that carry a read-shared hint. Functionally, MOVRS family is equivalent to existing load instructions, but its read-shared hint indicates that the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written (assuming it is ever written in the future). This hint could optimize the behavior of the caches, especially shared caches, for this data for future reads by multiple processors. Additionally, MOVRS family also includes a software prefetch instruction, PREFETCHRST2, that carries the same read-shared hint. [*] MOVRS family is enumerated by CPUID single-bit (0x7.0x1.EAX[bit 31]). Since it's on a densely-populated CPUID leaf and some other bits on this leaf have kernel usages, define this new feature in cpufeatures.h, but hide it in /proc/cpuinfo due to lack of current kernel usage. Advertise MOVRS bit to userspace directly. It's safe, since there's no new VMX controls or additional host enabling required for guests to use this feature. [*]: Intel Architecture Instruction Set Extensions and Future Features (rev.059). Tested-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251120050720.931449-2-zhao1.liu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23KVM: SEV: Add KVM_SEV_SNP_ENABLE_REQ_CERTS commandMichael Roth
Introduce a new command for KVM_MEMORY_ENCRYPT_OP ioctl that can be used to enable fetching of endorsement key certificates from userspace via the new KVM_EXIT_SNP_REQ_CERTS exit type. Also introduce a new KVM_X86_SEV_SNP_REQ_CERTS KVM device attribute so that userspace can query whether the kernel supports the new command/exit. Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Link: https://patch.msgid.link/20260109231732.1160759-3-michael.roth@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23KVM: Introduce KVM_EXIT_SNP_REQ_CERTS for SNP certificate-fetchingMichael Roth
For SEV-SNP, the host can optionally provide a certificate table to the guest when it issues an attestation request to firmware (see GHCB 2.0 specification regarding "SNP Extended Guest Requests"). This certificate table can then be used to verify the endorsement key used by firmware to sign the attestation report. While it is possible for guests to obtain the certificates through other means, handling it via the host provides more flexibility in being able to keep the certificate data in sync with the endorsement key throughout host-side operations that might resulting in the endorsement key changing. In the case of KVM, userspace will be responsible for fetching the certificate table and keeping it in sync with any modifications to the endorsement key by other userspace management tools. Define a new KVM_EXIT_SNP_REQ_CERTS event where userspace is provided with the GPA of the buffer the guest has provided as part of the attestation request so that userspace can write the certificate data into it while relying on filesystem-based locking to keep the certificates up-to-date relative to the endorsement keys installed/utilized by firmware at the time the certificates are fetched. [Melody: Update the documentation scheme about how file locking is expected to happen.] Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Dionna Glaze <dionnaglaze@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Melody Wang <huibo.wang@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Link: https://patch.msgid.link/20260109231732.1160759-2-michael.roth@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23KVM: x86: Drop WARN on INIT/SIPI being blocked when vCPU is in Wait-For-SIPISean Christopherson
Drop the sanity check in kvm_apic_accept_events() that attempts to detect KVM bugs by asserting that a vCPU isn't in Wait-For-SIPI if INIT/SIPI are blocked, because if INIT is blocked, then it should be impossible for a vCPU to get into WFS in the first place. Unfortunately, syzbot is smarter than KVM (and its maintainers), and circumvented the guards put in place by commit 0fe3e8d804fd ("KVM: x86: Move INIT_RECEIVED vs. INIT/SIPI blocked check to KVM_RUN") by swapping the order and stuffing VMXON after INIT, and then triggering kvm_apic_accept_events() by way of KVM_GET_MP_STATE. Simply drop the WARN as it hasn't detected any meaningful KVM bugs in years (if ever?), and preventing userspace from clobbering guest state is generally a non-goal. More importantly, fully closing the hole would likely require enforcing some amount of ordering in KVM's ioctls, which is a much bigger risk than simply deleting the WARN. Reported-by: syzbot+59f2c3a3fc4f6c09b8cd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6925da1b.a70a0220.d98e3.00b0.GAE@google.com Link: https://patch.msgid.link/20260123022816.2283567-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-16KVM: VMX: Add a wrapper around ROL16() to get a vmcs12 from a field encodingSean Christopherson
Add a wrapper macro, ENC_TO_VMCS12_IDX(), to get a vmcs12 index given a field encoding in anticipation of adding a macro to get from a vmcs12 index back to the field encoding. And because open coding ROL16(n, 6) everywhere is gross. No functional change intended. Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20260115173427.716021-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-16KVM: nVMX: Setup VMX MSRs on loading CPU during nested_vmx_hardware_setup()Sean Christopherson
Move the call to nested_vmx_setup_ctls_msrs() from vmx_hardware_setup() to nested_vmx_hardware_setup() so that the nested code can deal with ordering dependencies without having to straddle vmx_hardware_setup() and nested_vmx_hardware_setup(). Specifically, an upcoming change will sanitize the vmcs12 fields based on hardware support, and that code needs to run _before_ the MSRs are configured, because the lovely vmcs_enum MSR depends on the max support vmcs12 field. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20260115173427.716021-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-15KVM: guest_memfd: GUP source pages prior to populating guest memoryMichael Roth
Currently the post-populate callbacks handle copying source pages into private GPA ranges backed by guest_memfd, where kvm_gmem_populate() acquires the filemap invalidate lock, then calls a post-populate callback which may issue a get_user_pages() on the source pages prior to copying them into the private GPA (e.g. TDX). This will not be compatible with in-place conversion, where the userspace page fault path will attempt to acquire the filemap invalidate lock while holding the mm->mmap_lock, leading to a potential ABBA deadlock. Address this by hoisting the GUP above the filemap invalidate lock so that these page faults path can be taken early, prior to acquiring the filemap invalidate lock. It's not currently clear whether this issue is reachable with the current implementation of guest_memfd, which doesn't support in-place conversion, however it does provide a consistent mechanism to provide stable source/target PFNs to callbacks rather than punting to vendor-specific code, which allows for more commonality across architectures, which may be worthwhile even without in-place conversion. As part of this change, also begin enforcing that the 'src' argument to kvm_gmem_populate() must be page-aligned, as this greatly reduces the complexity around how the post-populate callbacks are implemented, and since no current in-tree users support using a non-page-aligned 'src' argument. Suggested-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Vishal Annapurve <vannapurve@google.com> Signed-off-by: Vishal Annapurve <vannapurve@google.com> Tested-by: Vishal Annapurve <vannapurve@google.com> Tested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://patch.msgid.link/20260108214622.1084057-7-michael.roth@amd.com [sean: avoid local "p" variable] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-15KVM: SEV: Document/enforce page-alignment for KVM_SEV_SNP_LAUNCH_UPDATEMichael Roth
In the past, KVM_SEV_SNP_LAUNCH_UPDATE accepted a non-page-aligned 'uaddr' parameter to copy data from, but continuing to support this with new functionality like in-place conversion and hugepages in the pipeline has proven to be more trouble than it is worth, since there are no known users that have been identified who use a non-page-aligned 'uaddr' parameter. Rather than locking guest_memfd into continuing to support this, go ahead and document page-alignment as a requirement and begin enforcing this in the handling function. Reviewed-by: Vishal Annapurve <vannapurve@google.com> Tested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Link: https://patch.msgid.link/20260108214622.1084057-5-michael.roth@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-15KVM: guest_memfd: Remove partial hugepage handling from kvm_gmem_populate()Michael Roth
kvm_gmem_populate(), and the associated post-populate callbacks, have some limited support for dealing with guests backed by hugepages by passing the order information along to each post-populate callback and iterating through the pages passed to kvm_gmem_populate() in hugepage-chunks. However, guest_memfd doesn't yet support hugepages, and in most cases additional changes in the kvm_gmem_populate() path would also be needed to actually allow for this functionality. This makes the existing code unnecessarily complex, and makes changes difficult to work through upstream due to theoretical impacts on hugepage support that can't be considered properly without an actual hugepage implementation to reference. So for now, remove what's there so changes for things like in-place conversion can be implemented/reviewed more efficiently. Suggested-by: Vishal Annapurve <vannapurve@google.com> Co-developed-by: Vishal Annapurve <vannapurve@google.com> Signed-off-by: Vishal Annapurve <vannapurve@google.com> Tested-by: Vishal Annapurve <vannapurve@google.com> Tested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://patch.msgid.link/20260108214622.1084057-3-michael.roth@amd.com [sean: check for !IS_ERR() before checking folio_order()] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-14KVM: SVM: Stop toggling virtual VMSAVE/VMLOAD on intercept recalcYosry Ahmed
Virtual VMSAVE/VMLOAD enablement (i.e. VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK) is set/cleared by svm_recalc_instruction_intercepts() when the intercepts are cleared/set. This is unnecessary because the bit is meaningless when intercepts are set and KVM emulates the instructions. Initialize the bit in vmcb01 base on vls, and keep it unchanged. This is similar-ish to how vGIF is handled. It is enabled in init_vmcb() if vgif=1 and remains unchanged when the STGI intercept is enabled (e.g. for NMI windows). This fixes a bug in svm_recalc_instruction_intercepts(). The intercepts for VMSAVE/VMLOAD are always toggled in vmcb01, but VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK is toggled in the current VMCB, which could be vmcb02 instead of vmcb01 if L2 is active. Virtual VMSAVE/VMLOAD enablement in vmcb02 is separately controlled by nested_vmcb02_prepare_control() based on the vCPU features and VMCB12, and if intercepts are needed they are set by recalc_intercepts(). The bug is benign though. Not toggling the bit for vmcb01 is harmless because it's useless anyway. For vmcb02: - The bit could be incorrectly cleared when intercepts are set in vmcb01. This is harmless because VMSAVE/VMLOAD will be emulated by KVM anyway. - The bit could be incorrectly set when the intercepts are cleared in vmcb01. However, if the bit was originally clear in vmcb02, then recalc_intercepts() will enable in the intercepts in vmcb02 anyway and VMSAVE/VMLOAD will be emulated by KVM. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260110004821.3411245-3-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-14KVM: nSVM: Always use vmcb01 in VMLOAD/VMSAVE emulationYosry Ahmed
Commit cc3ed80ae69f ("KVM: nSVM: always use vmcb01 to for vmsave/vmload of guest state") made KVM always use vmcb01 for the fields controlled by VMSAVE/VMLOAD, but it missed updating the VMLOAD/VMSAVE emulation code to always use vmcb01. As a result, if VMSAVE/VMLOAD is executed by an L2 guest and is not intercepted by L1, KVM will mistakenly use vmcb02. Always use vmcb01 instead of the current VMCB. Fixes: cc3ed80ae69f ("KVM: nSVM: always use vmcb01 to for vmsave/vmload of guest state") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260110004821.3411245-2-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-14KVM: VMX: Don't register posted interrupt wakeup handler if alloc_kvm_area() ↵Hou Wenlong
fails Unregistering the posted interrupt wakeup handler only happens during hardware unsetup. Therefore, if alloc_kvm_area() fails and continue to register the posted interrupt wakeup handler, this will leave the global posted interrupt wakeup handler pointer in an incorrect state. Although it should not be an issue, it's still better to change it. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Fixes: ec5a4919fa7b ("KVM: VMX: Unregister posted interrupt wakeup handler on hardware unsetup") Link: https://patch.msgid.link/0ac6908b608cf80eab7437004334fedd0f5f5317.1768304590.git.houwenlong.hwl@antgroup.com [sean: use a goto] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-14KVM: x86: Assert that non-MSI doesn't have bypass vCPU when deleting producerSean Christopherson
When disconnecting a non-MSI irqfd from an IRQ bypass producer, WARN if the irqfd is configured for IRQ bypass and set its IRTE back to remapped mode to harden against kernel/KVM bugs (keeping the irqfd in bypass mode is often fatal to the host). Deactivating an irqfd (removing it from the list of irqfds), updating irqfd routes, and the code in question are all mutually exclusive (all run under irqfds.lock). If an irqfd is configured for bypass, and the irqfd is deassigned at the same time IRQ routing is updated (to change the routing to non-MSI), then either kvm_arch_update_irqfd_routing() should process the irqfd routing change and put the IRTE into remapped mode (routing update "wins"), or kvm_arch_irq_bypass_del_producer() should see the MSI routing info (deactivation "wins"). Link: https://patch.msgid.link/20260113174606.104978-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-14KVM: SVM: Check vCPU ID against max x2AVIC ID if and only if x2AVIC is enabledSean Christopherson
When allocating the AVIC backing page, only check one of the max AVIC vs. x2AVIC ID based on whether or not x2AVIC is enabled. Doing so fixes a bug where KVM incorrectly inhibits AVIC if x2AVIC is _disabled_ and any vCPU with a non-zero APIC ID is created, as x2avic_max_physical_id is left '0' when x2AVIC is disabled. Fixes: 940fc47cfb0d ("KVM: SVM: Add AVIC support for 4k vCPUs in x2AVIC mode") Cc: stable@vger.kernel.org Cc: Naveen N Rao (AMD) <naveen@kernel.org> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://patch.msgid.link/20260112232805.1512361-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-14KVM: nSVM: Drop redundant/wrong comment in nested_vmcb02_prepare_save()Yosry Ahmed
The comment above DR6 and DR7 initializations is redundant, because the entire function follows the same pattern of only initializing the fields in vmcb02 if the vmcb12 changed or the fields are dirty, which handles the first execution case. Also, the comment refers to new_vmcb12 as new_vmcs12. Just drop the comment. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260113172807.2178526-1-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-14KVM: VMX: Remove declaration of nested_mark_vmcs12_pages_dirty()Binbin Wu
Remove the declaration of nested_mark_vmcs12_pages_dirty() from the header file since it has been moved and renamed to nested_vmx_mark_all_vmcs12_pages_dirty(), which is a static function. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260113084748.1714633-1-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-14KVM: SVM: Fix an off-by-one typo in the comment for enabling AVIC by defaultSean Christopherson
Fix a goof in the comment that documents KVM's logic for enabling AVIC by default to reference Zen5+ as family 0x1A (Zen5), not family 0x19 (Zen4). The code is correct (checks for _greater_ than 0x19), only the comment is flawed. Opportunistically tweak the check too, even though it's already correct, so that both the comment and the code reference 0x1A, and so that the checks are "ascending", i.e. check Zen4 and then Zen5+. No functional change intended. Fixes: ca2967de5a5b ("KVM: SVM: Enable AVIC by default for Zen4+ if x2AVIC is support") Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://patch.msgid.link/20260109035037.1015073-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>