| Age | Commit message (Collapse) | Author |
|
The return value of vmx_leave_smm() is unrelated from that of
nested_vmx_enter_non_root_mode(). Check explicitly for success
(which happens to be 0) and return 1 just like everywhere
else in vmx_leave_smm().
Likewise, in svm_leave_smm() return 0/1 instead of the 0/1/-errno
returned by tenter_svm_guest_mode().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The VMCB12 is stored in guest memory and can be mangled while in SMM; it
is then reloaded by svm_leave_smm(), but it is not checked again for
validity.
Move the cached vmcb12 control and save consistency checks out of
svm_set_nested_state() and into a helper, and reuse it in
svm_leave_smm().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The VMCS12 is not available while in SMM. However, it can be overwritten
if userspace manages to trigger copy_enlightened_to_vmcs12() - for example
via KVM_GET_NESTED_STATE.
Because of this, the VMCS12 has to be checked for validity before it is
used to generate the VMCS02. Move the check code out of vmx_set_nested_state()
(the other "not a VMLAUNCH/VMRESUME" path that emulates a nested vmentry)
and reuse it in vmx_leave_smm().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Explicitly set/clear CR8 write interception when AVIC is (de)activated to
fix a bug where KVM leaves the interception enabled after AVIC is
activated. E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8
will remain intercepted in perpetuity.
On its own, the dangling CR8 intercept is "just" a performance issue, but
combined with the TPR sync bug fixed by commit d02e48830e3f ("KVM: SVM:
Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the danging
intercept is fatal to Windows guests as the TPR seen by hardware gets
wildly out of sync with reality.
Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored
when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in
KVM's world. I.e. there's no need to trigger update_cr8_intercept(), this
is firmly an SVM implementation flaw/detail.
WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should
never enter the guest with AVIC enabled and CR8 writes intercepted.
Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Cc: Naveen N Rao (AMD) <naveen@kernel.org>
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Squash fix to avic_deactivate_vmcb. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Initialize all per-vCPU AVIC control fields in the VMCB if AVIC is enabled
in KVM and the VM has an in-kernel local APIC, i.e. if it's _possible_ the
vCPU could activate AVIC at any point in its lifecycle. Configuring the
VMCB if and only if AVIC is active "works" purely because of optimizations
in kvm_create_lapic() to speculatively set apicv_active if AVIC is enabled
*and* to defer updates until the first KVM_RUN. In quotes because KVM
likely won't do the right thing if kvm_apicv_activated() is false, i.e. if
a vCPU is created while APICv is inhibited at the VM level for whatever
reason. E.g. if the inhibit is *removed* before KVM_REQ_APICV_UPDATE is
handled in KVM_RUN, then __kvm_vcpu_update_apicv() will elide calls to
vendor code due to seeing "apicv_active == activate".
Cleaning up the initialization code will also allow fixing a bug where KVM
incorrectly leaves CR8 interception enabled when AVIC is activated without
creating a mess with respect to whether AVIC is activated or not.
Cc: stable@vger.kernel.org
Fixes: 67034bb9dd5e ("KVM: SVM: Add irqchip_split() checks before enabling AVIC")
Fixes: 6c3e4422dd20 ("svm: Add support for dynamic APICv")
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM to allow L1 to set
FREEZE_IN_SMM in vmcs12's GUEST_IA32_DEBUGCTL field, as permitted
prior to commit 6b1dd26544d0 ("KVM: VMX: Preserve host's
DEBUGCTLMSR_FREEZE_IN_SMM while running the guest"). Enable the quirk
by default for backwards compatibility (like all quirks); userspace
can disable it via KVM_CAP_DISABLE_QUIRKS2 for consistency with the
constraints on WRMSR(IA32_DEBUGCTL).
Note that the quirk only bypasses the consistency check. The vmcs02 bit is
still owned by the host, and PMCs are not frozen during virtualized SMM.
In particular, if a host administrator decides that PMCs should not be
frozen during physical SMM, then L1 has no say in the matter.
Fixes: 095686e6fcb4 ("KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter")
Cc: stable@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260205231537.1278753-1-jmattson@google.com
[sean: tag for stable@, clean-up and fix goofs in the comment and docs]
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Rename quirk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The mask_notifier_list is protected by kvm->irq_srcu, but the traversal
in kvm_fire_mask_notifiers() incorrectly uses hlist_for_each_entry_rcu().
This leads to lockdep warnings because the standard RCU iterator expects
to be under rcu_read_lock(), not SRCU.
Replace the RCU variant with hlist_for_each_entry_srcu() and provide
the proper srcu_read_lock_held() annotation to ensure correct
synchronization and silence lockdep.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20260204091206.2617-1-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The previous change had a bug to update a guest MSR with a host value.
Fixes: c3d6a7210a4de9096 ("KVM: VMX: Dedup code for adding MSR to VMCS's auto list")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20260220220216.389475-1-namhyung@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
In KVM guests with Hyper-V hypercalls enabled, the hypercalls
HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX
allow a guest to request invalidation of portions of a virtual TLB.
For this, the hypercall parameter includes a list of GVAs that are supposed
to be invalidated.
Currently, only the base GVA is checked to be canonical. In reality, this
check needs to be performed for the entire range of GVAs, as checking only
the base GVA enables guests running on Intel hardware to trigger a
WARN_ONCE in the host (see Fixes commit below).
Move the check for non-canonical addresses to be performed for every GVA
of the supplied range to avoid the splat, and to be more in line with the
Hyper-V specification, since, although unlikely, a range starting with an
invalid GVA may still contain GVAs that are valid.
Fixes: fa787ac07b3c ("KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush")
Signed-off-by: Manuel Andreas <manuel.andreas@tum.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/00a7a31b-573b-4d92-91f8-7d7e2f88ea48@tum.de
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM incorrectly synthesizes CPUID bits for KVM-only leaves, as the
following branch in kvm_cpu_cap_init() is never taken:
if (leaf < NCAPINTS)
kvm_cpu_caps[leaf] &= kernel_cpu_caps[leaf];
This means that bits set via SYNTHESIZED_F() for KVM-only leaves are
unconditionally set. This for example can cause issues for SEV-SNP
guests running on Family 19h CPUs, as TSA_SQ_NO and TSA_L1_NO are
always enabled by KVM in 80000021[ECX]. When userspace issues a
SNP_LAUNCH_UPDATE command to update the CPUID page for the guest, SNP
firmware will explicitly reject the command if the page sets sets these
bits on vulnerable CPUs.
To fix this, check in SYNTHESIZED_F() that the corresponding X86
capability is set before adding it to to kvm_cpu_cap_features.
Fixes: 31272abd5974 ("KVM: SVM: Advertise TSA CPUID bits to guests")
Link: https://lore.kernel.org/all/20260208164233.30405-1-clopez@suse.de/
Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://patch.msgid.link/20260209153108.70667-2-clopez@suse.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
HEAD
KVM generic changes for 7.0
- Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being
unnecessary and confusing, triggered compiler warnings due to
-Wflex-array-member-not-at-end.
- Document that vcpu->mutex is take outside of kvm->slots_lock and
kvm->slots_arch_lock, which is intentional and desirable despite being
rather unintuitive.
|
|
KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always
available. Move the definition from individual architectures to common
code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
All architectures now use MMU notifier for KVM page table management.
Remove the Kconfig symbol and the code that is used when it is
disabled.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
KVM mediated PMU support for 6.20
Add support for mediated PMUs, where KVM gives the guest full ownership of PMU
hardware (contexted switched around the fastpath run loop) and allows direct
access to data MSRs and PMCs (restricted by the vPMU model), but intercepts
access to control registers, e.g. to enforce event filtering and to prevent the
guest from profiling sensitive host state.
To keep overall complexity reasonable, mediated PMU usage is all or nothing
for a given instance of KVM (controlled via module param). The Mediated PMU
is disabled default, partly to maintain backwards compatilibity for existing
setup, partly because there are tradeoffs when running with a mediated PMU that
may be non-starters for some use cases, e.g. the host loses the ability to
profile guests with mediated PMUs, the fastpath run loop is also a blind spot,
entry/exit transitions are more expensive, etc.
Versus the emulated PMU, where KVM is "just another perf user", the mediated
PMU delivers more accurate profiling and monitoring (no risk of contention and
thus dropped events), with significantly less overhead (fewer exits and faster
emulation/programming of event selectors) E.g. when running Specint-2017 on
a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from
within the guest:
Perf command:
a. basic-sampling: perf record -F 1000 -e 6-instructions -a --overwrite
b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite
Guest performance overhead:
---------------------------------------------------------------------------
| Test case | emulated vPMU | all passthrough | passthrough with |
| | | | event filters |
---------------------------------------------------------------------------
| basic-sampling | 33.62% | 4.24% | 6.21% |
---------------------------------------------------------------------------
| multiplex-sampling | 79.32% | 7.34% | 10.45% |
---------------------------------------------------------------------------
|
|
KVM x86 APIC-ish changes for 6.20
- Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when
creating a vCPU-specific mapping of guest memory.
- Clean up KVM's handling of marking mapped vCPU pages dirty.
- Drop a pile of *ancient* sanity checks hidden behind in KVM's unused
ASSERT() macro, most of which could be trivially triggered by the guest
and/or user, and all of which were useless.
- Fold "struct dest_map" into its sole user, "struct rtc_status", to make it
more obvious what the weird parameter is used for, and to allow burying the
RTC shenanigans behind CONFIG_KVM_IOAPIC=y.
- Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y.
- Add a regression test for recent APICv update fixes.
- Rework KVM's handling of VMCS updates while L2 is active to temporarily
switch to vmcs01 instead of deferring the update until the next nested
VM-Exit. The deferred updates approach directly contributed to several
bugs, was proving to be a maintenance burden due to the difficulty in
auditing the correctness of deferred updates, and was polluting
"struct nested_vmx" with a growing pile of booleans.
- Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv()
to consolidate the updates, and to co-locate SVI updates with the updates
for KVM's own cache of ISR information.
- Drop a dead function declaration.
|
|
KVM guest_memfd changes for 6.20
- Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage
handling, and instead rely on SNP (the only user of the tracking) to do its
own tracking via the RMP.
- Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE
and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to
avoid non-trivial complexity for a non-existent usecase (and because
in-place conversion simply can't support unaligned sources).
- When populating guest_memfd memory, GUP the source page in common code and
pass the refcounted page to the vendor callback, instead of letting vendor
code do the heavy lifting. Doing so avoids a looming deadlock bug with
in-place due an AB-BA conflict betwee mmap_lock and guest_memfd's filemap
invalidate lock.
|
|
KVM x86 misc changes for 6.20
- Disallow changing the virtual CPU model if L2 is active, for all the same
reasons KVM disallows change the model after the first KVM_RUN.
- Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that
were advertised as supported to userspace when running with
KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled.
- Fix a bug where KVM would attempt to read protect guest state (CR3) when
configuring an async #PF entry.
- Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86
only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Explicitly allow
the few exports that are intended for external usage.
- Ignore -EBUSY when checking nested events after a vCPU exits blocking as
the WARN is user-triggerable, and because exiting to userspace on -EBUSY
does more harm than good in pretty much every situation.
- Throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is
in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an
unwinnable game.
- Add support for new Intel instructions that don't require anything beyond
enumerating feature flags to userspace.
- Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.
- Add WARNs to guard against modifying KVM's CPU caps outside of the intended
setup flow, as nested VMX in particular is sensitive to unexpected changes
in KVM's golden configuration.
- Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts
when the suppression feature is enabled by the guest (currently limited to
split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor
Suppress EOI Broadcasts isn't an option as some userspaces have come to rely
on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective
of whether or not userspace I/O APIC supports Directed EOIs).
- Minor cleanups.
|
|
KVM SVM changes for 6.20
- Drop a user-triggerable WARN on nested_svm_load_cr3() failure.
- Add support for virtualizing ERAPS. Note, correct virtualization of ERAPS
relies on an upcoming, publicly announced change in the APM to reduce the
set of conditions where hardware (i.e. KVM) *must* flush the RAP.
- Ignore nSVM intercepts for instructions that are not supported according to
L1's virtual CPU model.
- Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath
for EPT Misconfig.
- Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM,
and allow userspace to restore nested state with GIF=0.
- Treat exit_code as an unsigned 64-bit value through all of KVM.
- Add support for fetching SNP certificates from userspace.
- Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD
or VMSAVE on behalf of L2.
- Misc fixes and cleanups.
|
|
KVM VMX changes for 6.20
- Fix an SGX bug where KVM would incorrectly try to handle EPCM #PFs by always
relecting EPCM #PFs back into the guest. KVM doesn't shadow EPCM entries,
and so EPCM violations cannot be due to KVM interference, and can't be
resolved by KVM.
- Fix a bug where KVM would register its posted interrupt wakeup handler even
if loading kvm-intel.ko ultimately failed.
- Disallow access to vmcb12 fields that aren't fully supported, mostly to
avoid weirdness and complexity for FRED and other features, where KVM wants
enable VMCS shadowing for fields that conditionally exist.
- Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or
refuses to online a CPU) due to a VMCS config mismatch.
|
|
Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support
for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated
by userspace), which KVM completely mishandles. When x2APIC support was
first added, KVM incorrectly advertised and "enabled" Suppress EOI
Broadcast, without fully supporting the I/O APIC side of the equation,
i.e. without adding directed EOI to KVM's in-kernel I/O APIC.
That flaw was carried over to split IRQCHIP support, i.e. KVM advertised
support for Suppress EOI Broadcasts irrespective of whether or not the
userspace I/O APIC implementation supported directed EOIs. Even worse,
KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without
support for directed EOI came to rely on the "spurious" broadcasts.
KVM "fixed" the in-kernel I/O APIC implementation by completely disabling
support for Suppress EOI Broadcasts in commit 0bcc3fb95b97 ("KVM: lapic:
stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but
didn't do anything to remedy userspace I/O APIC implementations.
KVM's bogus handling of Suppress EOI Broadcast is problematic when the
guest relies on interrupts being masked in the I/O APIC until well after
the initial local APIC EOI. E.g. Windows with Credential Guard enabled
handles interrupts in the following order:
1. Interrupt for L2 arrives.
2. L1 APIC EOIs the interrupt.
3. L1 resumes L2 and injects the interrupt.
4. L2 EOIs after servicing.
5. L1 performs the I/O APIC EOI.
Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt
storm, e.g. if the IRQ line is still asserted and userspace reacts to the
EOI by re-injecting the IRQ, because the guest doesn't de-assert the line
until step #4, and doesn't expect the interrupt to be re-enabled until
step #5.
Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way
of knowing if the userspace I/O APIC supports directed EOIs, i.e.
suppressing EOI broadcasts would result in interrupts being stuck masked
in the userspace I/O APIC due to step #5 being ignored by userspace. And
fully disabling support for Suppress EOI Broadcast is also undesirable, as
picking up the fix would require a guest reboot, *and* more importantly
would change the virtual CPU model exposed to the guest without any buy-in
from userspace.
Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and
KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to
explicitly enable or disable support for Suppress EOI Broadcasts. This
gives userspace control over the virtual CPU model exposed to the guest,
as KVM should never have enabled support for Suppress EOI Broadcast without
userspace opt-in. Not setting either flag will result in legacy quirky
behavior for backward compatibility.
Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel
I/O APIC, as KVM's history/support is just as tragic. E.g. it's not clear
that commit c806a6ad35bf ("KVM: x86: call irq notifiers with directed EOI")
was entirely correct, i.e. it may have simply papered over the lack of
Directed EOI emulation in the I/O APIC.
Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's
APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI
is to support Directed EOI (KVM's name) irrespective of guest CPU vendor.
Fixes: 7543a635aa09 ("KVM: x86: Add KVM exit for IOAPIC EOIs")
Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com
Cc: stable@vger.kernel.org
Suggested-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Khushit Shah <khushit.shah@nutanix.com>
Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com
[sean: clean up minor formatting goofs and fix a comment typo]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a flag to track when KVM is actively configuring its CPU caps, and
WARN if a cap is set or cleared if KVM isn't in its configuration stage.
Modifying CPU caps after {svm,vmx}_set_cpu_caps() can be fatal to KVM, as
vendor setup code expects the CPU caps to be frozen at that point, e.g.
will do additional configuration based on the caps.
Rename kvm_set_cpu_caps() to kvm_initialize_cpu_caps() to pair with the
new "finalize", and to make it more obvious that KVM's CPU caps aren't
fully configured within the function.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260128014310.3255561-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When kvm-intel.ko refuses to load due to a mismatched VMCS config, print
all mismatching offsets+values to make it easier to debug goofs during
development, and to make it at least feasible to triage failures that
occur during production. E.g. if a physical core is flaky or is running
with the "wrong" microcode patch loaded, then a CPU can get a legitimate
mismatch even without KVM bugs.
Print the mismatches as 32-bit values as a compromise between hand coding
every field (to provide precise information) and printing individual bytes
(requires more effort to deduce the mismatch bit(s)). All fields in the
VMCS config are either 32-bit or 64-bit values, i.e. in many cases,
printing 32-bit values will be 100% precise, and in the others it's close
enough, especially when considering that MSR values are split into EDX:EAX
anyways.
E.g. on mismatch CET entry/exit controls, KVM will print:
kvm_intel: VMCS config on CPU 0 doesn't match reference config:
Offset 76 REF = 0x107fffff, CPU0 = 0x007fffff, mismatch = 0x10000000
Offset 84 REF = 0x0010f3ff, CPU0 = 0x0000f3ff, mismatch = 0x00100000
Opportunistically tweak the wording on the initial error message to say
"mismatch" instead of "inconsistent", as the VMCS config itself isn't
inconsistent, and the wording conflates the cross-CPU compatibility check
with the error_on_inconsistent_vmcs_config knob that treats inconsistent
VMCS configurations as errors (e.g. if a CPU supports CET entry controls
but no CET exit controls).
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260128014310.3255561-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly configure KVM's supported XSS as part of each vendor's setup
flow to fix a bug where clearing SHSTK and IBT in kvm_cpu_caps, e.g. due
to lack of CET XFEATURE support, makes kvm-intel.ko unloadable when nested
VMX is enabled, i.e. when nested=1. The late clearing results in
nested_vmx_setup_{entry,exit}_ctls() clearing VM_{ENTRY,EXIT}_LOAD_CET_STATE
when nested_vmx_setup_ctls_msrs() runs during the CPU compatibility checks,
ultimately leading to a mismatched VMCS config due to the reference config
having the CET bits set, but every CPU's "local" config having the bits
cleared.
Note, kvm_caps.supported_{xcr0,xss} are unconditionally initialized by
kvm_x86_vendor_init(), before calling into vendor code, and not referenced
between ops->hardware_setup() and their current/old location.
Fixes: 69cc3e886582 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER")
Cc: stable@vger.kernel.org
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: John Allen <john.allen@amd.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Chao Gao <chao.gao@intel.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260128014310.3255561-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
fields
Drop KVM's filtering of GUEST_INTR_STATUS when generating the shadow VMCS
bitmap now that KVM drops GUEST_INTR_STATUS from the set of supported
vmcs12 fields if the field isn't supported by hardware, and initialization
of the shadow VMCS fields omits unsupported vmcs12 fields.
Note, there is technically a small functional change here, as the vmcs12
filtering only requires support for Virtual Interrupt Delivery, whereas
the shadow VMCS code being removed required "full" APICv support, i.e.
required Virtual Interrupt Delivery *and* APIC Register Virtualizaton *and*
Posted Interrupt support.
Opportunistically tweak the comment to more precisely explain why the
PML and VMX preemption timer fields need to be explicitly checked.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260115173427.716021-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disallow access (VMREAD/VMWRITE), both emulated and via a shadow VMCS, to
VMCS fields that the loaded incarnation of KVM doesn't support, e.g. due
to lack of hardware support, as a middle ground between allowing access to
any vmcs12 field defined by KVM (current behavior) and gating access based
on the userspace-defined vCPU model (the most functionally correct, but
very costly, implementation).
Disallowing access to unsupported fields helps a tiny bit in terms of
closing the virtualization hole (see below), but the main motivation is to
avoid having to weed out unsupported fields when synchronizing between
vmcs12 and a shadow VMCS. Because shadow VMCS accesses are done via
VMREAD and VMWRITE, KVM _must_ filter out unsupported fields (or eat
VMREAD/VMWRITE failures), and filtering out just shadow VMCS fields is
about the same amount of effort, and arguably much more confusing.
As a bonus, this also fixes a KVM-Unit-Test failure bug when running on
_hardware_ without support for TSC Scaling, which fails with the same
signature as the bug fixed by commit ba1f82456ba8 ("KVM: nVMX: Dynamically
compute max VMCS index for vmcs12"):
FAIL: VMX_VMCS_ENUM.MAX_INDEX expected: 19, actual: 17
Dynamically computing the max VMCS index only resolved the issue where KVM
was hardcoding max index, but for CPUs with TSC Scaling, that was "good
enough".
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li <xin@zytor.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/all/20251026201911.505204-22-xin@zytor.com
Link: https://lore.kernel.org/all/YR2Tf9WPNEzrE7Xg@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260115173427.716021-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add SRCU read-side protection when reading PDPTR registers in
__get_sregs2().
Reading PDPTRs may trigger access to guest memory:
kvm_pdptr_read() -> svm_cache_reg() -> load_pdptrs() ->
kvm_vcpu_read_guest_page() -> kvm_vcpu_gfn_to_memslot()
kvm_vcpu_gfn_to_memslot() dereferences memslots via __kvm_memslots(),
which uses srcu_dereference_check() and requires either kvm->srcu or
kvm->slots_lock to be held. Currently only vcpu->mutex is held,
triggering lockdep warning:
=============================
WARNING: suspicious RCU usage in kvm_vcpu_gfn_to_memslot
6.12.59+ #3 Not tainted
include/linux/kvm_host.h:1062 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz.5.1717/15100:
#0: ff1100002f4b00b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x1d5/0x1590
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0xf0/0x120 lib/dump_stack.c:120
lockdep_rcu_suspicious+0x1e3/0x270 kernel/locking/lockdep.c:6824
__kvm_memslots include/linux/kvm_host.h:1062 [inline]
__kvm_memslots include/linux/kvm_host.h:1059 [inline]
kvm_vcpu_memslots include/linux/kvm_host.h:1076 [inline]
kvm_vcpu_gfn_to_memslot+0x518/0x5e0 virt/kvm/kvm_main.c:2617
kvm_vcpu_read_guest_page+0x27/0x50 virt/kvm/kvm_main.c:3302
load_pdptrs+0xff/0x4b0 arch/x86/kvm/x86.c:1065
svm_cache_reg+0x1c9/0x230 arch/x86/kvm/svm/svm.c:1688
kvm_pdptr_read arch/x86/kvm/kvm_cache_regs.h:141 [inline]
__get_sregs2 arch/x86/kvm/x86.c:11784 [inline]
kvm_arch_vcpu_ioctl+0x3e20/0x4aa0 arch/x86/kvm/x86.c:6279
kvm_vcpu_ioctl+0x856/0x1590 virt/kvm/kvm_main.c:4663
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl fs/ioctl.c:893 [inline]
__x64_sys_ioctl+0x18b/0x210 fs/ioctl.c:893
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xbd/0x1d0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Fixes: 6dba94035203 ("KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2")
Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org>
Link: https://patch.msgid.link/20260123222801.646123-1-kovalev@altlinux.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Define and advertise AVX10_VNNI_INT CPUID to userspace when it's supported
by the host.
AVX10_VNNI_INT (0x24.0x1.ECX[bit 2]) is a discrete feature bit
introduced on Intel Diamond Rapids, which enumerates the support for
EVEX VPDP* instructions for INT8/INT16 [*].
Since this feature has no actual kernel usages, define it as a KVM-only
feature in reverse_cpuid.h.
Advertise new CPUID subleaf 0x24.0x1 with AVX10_VNNI_INT bit to
userspace for guest use. It's safe since no additional enabling work
is needed in the host kernel.
[*]: Intel Advanced Vector Extensions 10.2 Architecture Specification
(rev 5.0).
Tested-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251120050720.931449-5-zhao1.liu@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Bump up the maximum supported AVX10 version and enumerate AVX10.2 to
userspace when it's supported by the host.
Intel AVX10 Version 2 (Intel AVX10.2) includes a suite of new
instructions delivering new AI features and performance, accelerated
media processing, expanded Web Assembly, and Cryptography support, along
with enhancements to existing legacy instructions for completeness and
efficiency, and it is enumerated as version 2 in CPUID 0x24.0x0.EBX[bits
0-7] [1].
AVX10.2 has no current kernel usage and requires no additional host
kernel enabling work (based on AVX10.1 support) and provides no new
VMX controls [2]. Moreover, since AVX10.2 is the superset of AVX10.1,
there's no need to worry about AVX10.1 and AVX10.2 compatibility issues
in KVM.
Therefore, it's safe to advertise AVX10.2 version to userspace directly
if host supports AVX10.2.
[1]: Intel Advanced Vector Extensions 10.2 Architecture Specification
(rev 5.0).
[2]: Note: Since AVX10.2 spec (rev 4.0), it has been declared "AVX10/512
will be used in all Intel products, supporting vector lengths of
128, 256, and 512 in all product lines", and the VMX support (in
earlier revisions) for AVX10/256 guest on AVX10/512 host has been
dropped.
Tested-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251120050720.931449-4-zhao1.liu@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Define and advertise AMX CPUIDs (0x1E.0x1) to userspace when the leaf is
supported by the host.
Intel Diamond Rapids adds new AMX instructions to support new formats
and memory operations [*], and introduces the CPUID subleaf 0x1E.0x1
to centralize the discrete AMX feature bits within EAX.
Since these AMX features have no actual kernel usages, define them as
KVM-only features in reverse_cpuid.h.
In addition to the new features, CPUID 0x1E.0x1.EAX[bits 0-3] are
aliaseed positions of existing AMX feature bits distributed across the
0x7 leaves. To avoid duplicate feature names, name these aliases with an
*_ALIAS suffix, and define them in reverse_cpuid.h as KVM-only features
as well.
Advertise new CPUID subleaf 0x1E.0x1 with its AMX CPUID feature bits to
userspace for guest use. It's safe since no additional enabling work
is needed in the host kernel.
[*]: Intel Architecture Instruction Set Extensions and Future Features
(rev.059).
Tested-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251120050720.931449-3-zhao1.liu@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Define the feature flag for MOVRS and advertise support to userspace when
the feature is supported by the host.
MOVRS is a new set of instructions introduced in the Intel platform
Diamond Rapids, to provide load instructions that carry a read-shared
hint.
Functionally, MOVRS family is equivalent to existing load instructions,
but its read-shared hint indicates that the source memory location is
likely to become read-shared by multiple processors, i.e., read in the
future by at least one other processor before it is written (assuming it
is ever written in the future). This hint could optimize the behavior of
the caches, especially shared caches, for this data for future reads by
multiple processors. Additionally, MOVRS family also includes a software
prefetch instruction, PREFETCHRST2, that carries the same read-shared
hint. [*]
MOVRS family is enumerated by CPUID single-bit (0x7.0x1.EAX[bit 31]).
Since it's on a densely-populated CPUID leaf and some other bits on
this leaf have kernel usages, define this new feature in cpufeatures.h,
but hide it in /proc/cpuinfo due to lack of current kernel usage.
Advertise MOVRS bit to userspace directly. It's safe, since there's no
new VMX controls or additional host enabling required for guests to use
this feature.
[*]: Intel Architecture Instruction Set Extensions and Future Features
(rev.059).
Tested-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251120050720.931449-2-zhao1.liu@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Introduce a new command for KVM_MEMORY_ENCRYPT_OP ioctl that can be used
to enable fetching of endorsement key certificates from userspace via
the new KVM_EXIT_SNP_REQ_CERTS exit type. Also introduce a new
KVM_X86_SEV_SNP_REQ_CERTS KVM device attribute so that userspace can
query whether the kernel supports the new command/exit.
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Link: https://patch.msgid.link/20260109231732.1160759-3-michael.roth@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
For SEV-SNP, the host can optionally provide a certificate table to the
guest when it issues an attestation request to firmware (see GHCB 2.0
specification regarding "SNP Extended Guest Requests"). This certificate
table can then be used to verify the endorsement key used by firmware to
sign the attestation report.
While it is possible for guests to obtain the certificates through other
means, handling it via the host provides more flexibility in being able
to keep the certificate data in sync with the endorsement key throughout
host-side operations that might resulting in the endorsement key
changing.
In the case of KVM, userspace will be responsible for fetching the
certificate table and keeping it in sync with any modifications to the
endorsement key by other userspace management tools. Define a new
KVM_EXIT_SNP_REQ_CERTS event where userspace is provided with the GPA of
the buffer the guest has provided as part of the attestation request so
that userspace can write the certificate data into it while relying on
filesystem-based locking to keep the certificates up-to-date relative to
the endorsement keys installed/utilized by firmware at the time the
certificates are fetched.
[Melody: Update the documentation scheme about how file locking is
expected to happen.]
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Dionna Glaze <dionnaglaze@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Melody Wang <huibo.wang@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Link: https://patch.msgid.link/20260109231732.1160759-2-michael.roth@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the sanity check in kvm_apic_accept_events() that attempts to detect
KVM bugs by asserting that a vCPU isn't in Wait-For-SIPI if INIT/SIPI are
blocked, because if INIT is blocked, then it should be impossible for a
vCPU to get into WFS in the first place. Unfortunately, syzbot is smarter
than KVM (and its maintainers), and circumvented the guards put in place
by commit 0fe3e8d804fd ("KVM: x86: Move INIT_RECEIVED vs. INIT/SIPI blocked
check to KVM_RUN") by swapping the order and stuffing VMXON after INIT, and
then triggering kvm_apic_accept_events() by way of KVM_GET_MP_STATE.
Simply drop the WARN as it hasn't detected any meaningful KVM bugs in
years (if ever?), and preventing userspace from clobbering guest state is
generally a non-goal. More importantly, fully closing the hole would
likely require enforcing some amount of ordering in KVM's ioctls, which is
a much bigger risk than simply deleting the WARN.
Reported-by: syzbot+59f2c3a3fc4f6c09b8cd@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6925da1b.a70a0220.d98e3.00b0.GAE@google.com
Link: https://patch.msgid.link/20260123022816.2283567-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a wrapper macro, ENC_TO_VMCS12_IDX(), to get a vmcs12 index given a
field encoding in anticipation of adding a macro to get from a vmcs12 index
back to the field encoding. And because open coding ROL16(n, 6) everywhere
is gross.
No functional change intended.
Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260115173427.716021-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the call to nested_vmx_setup_ctls_msrs() from vmx_hardware_setup() to
nested_vmx_hardware_setup() so that the nested code can deal with ordering
dependencies without having to straddle vmx_hardware_setup() and
nested_vmx_hardware_setup(). Specifically, an upcoming change will
sanitize the vmcs12 fields based on hardware support, and that code needs
to run _before_ the MSRs are configured, because the lovely vmcs_enum MSR
depends on the max support vmcs12 field.
No functional change intended.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260115173427.716021-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Currently the post-populate callbacks handle copying source pages into
private GPA ranges backed by guest_memfd, where kvm_gmem_populate()
acquires the filemap invalidate lock, then calls a post-populate
callback which may issue a get_user_pages() on the source pages prior to
copying them into the private GPA (e.g. TDX).
This will not be compatible with in-place conversion, where the
userspace page fault path will attempt to acquire the filemap invalidate
lock while holding the mm->mmap_lock, leading to a potential ABBA
deadlock.
Address this by hoisting the GUP above the filemap invalidate lock so
that these page faults path can be taken early, prior to acquiring the
filemap invalidate lock.
It's not currently clear whether this issue is reachable with the
current implementation of guest_memfd, which doesn't support in-place
conversion, however it does provide a consistent mechanism to provide
stable source/target PFNs to callbacks rather than punting to
vendor-specific code, which allows for more commonality across
architectures, which may be worthwhile even without in-place conversion.
As part of this change, also begin enforcing that the 'src' argument to
kvm_gmem_populate() must be page-aligned, as this greatly reduces the
complexity around how the post-populate callbacks are implemented, and
since no current in-tree users support using a non-page-aligned 'src'
argument.
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260108214622.1084057-7-michael.roth@amd.com
[sean: avoid local "p" variable]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In the past, KVM_SEV_SNP_LAUNCH_UPDATE accepted a non-page-aligned
'uaddr' parameter to copy data from, but continuing to support this with
new functionality like in-place conversion and hugepages in the pipeline
has proven to be more trouble than it is worth, since there are no known
users that have been identified who use a non-page-aligned 'uaddr'
parameter.
Rather than locking guest_memfd into continuing to support this, go
ahead and document page-alignment as a requirement and begin enforcing
this in the handling function.
Reviewed-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Link: https://patch.msgid.link/20260108214622.1084057-5-michael.roth@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
kvm_gmem_populate(), and the associated post-populate callbacks, have
some limited support for dealing with guests backed by hugepages by
passing the order information along to each post-populate callback and
iterating through the pages passed to kvm_gmem_populate() in
hugepage-chunks.
However, guest_memfd doesn't yet support hugepages, and in most cases
additional changes in the kvm_gmem_populate() path would also be needed
to actually allow for this functionality.
This makes the existing code unnecessarily complex, and makes changes
difficult to work through upstream due to theoretical impacts on
hugepage support that can't be considered properly without an actual
hugepage implementation to reference. So for now, remove what's there
so changes for things like in-place conversion can be
implemented/reviewed more efficiently.
Suggested-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260108214622.1084057-3-michael.roth@amd.com
[sean: check for !IS_ERR() before checking folio_order()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Virtual VMSAVE/VMLOAD enablement (i.e.
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK) is set/cleared by
svm_recalc_instruction_intercepts() when the intercepts are cleared/set.
This is unnecessary because the bit is meaningless when intercepts are
set and KVM emulates the instructions. Initialize the bit in vmcb01 base
on vls, and keep it unchanged.
This is similar-ish to how vGIF is handled. It is enabled in init_vmcb()
if vgif=1 and remains unchanged when the STGI intercept is enabled (e.g.
for NMI windows).
This fixes a bug in svm_recalc_instruction_intercepts(). The intercepts
for VMSAVE/VMLOAD are always toggled in vmcb01, but
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK is toggled in the current VMCB, which
could be vmcb02 instead of vmcb01 if L2 is active.
Virtual VMSAVE/VMLOAD enablement in vmcb02 is separately controlled by
nested_vmcb02_prepare_control() based on the vCPU features and VMCB12,
and if intercepts are needed they are set by recalc_intercepts().
The bug is benign though. Not toggling the bit for vmcb01 is harmless
because it's useless anyway. For vmcb02:
- The bit could be incorrectly cleared when intercepts are set in
vmcb01. This is harmless because VMSAVE/VMLOAD will be emulated by KVM
anyway.
- The bit could be incorrectly set when the intercepts are cleared in
vmcb01. However, if the bit was originally clear in vmcb02, then
recalc_intercepts() will enable in the intercepts in vmcb02 anyway and
VMSAVE/VMLOAD will be emulated by KVM.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260110004821.3411245-3-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Commit cc3ed80ae69f ("KVM: nSVM: always use vmcb01 to for vmsave/vmload
of guest state") made KVM always use vmcb01 for the fields controlled by
VMSAVE/VMLOAD, but it missed updating the VMLOAD/VMSAVE emulation code
to always use vmcb01.
As a result, if VMSAVE/VMLOAD is executed by an L2 guest and is not
intercepted by L1, KVM will mistakenly use vmcb02. Always use vmcb01
instead of the current VMCB.
Fixes: cc3ed80ae69f ("KVM: nSVM: always use vmcb01 to for vmsave/vmload of guest state")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260110004821.3411245-2-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
fails
Unregistering the posted interrupt wakeup handler only happens during
hardware unsetup. Therefore, if alloc_kvm_area() fails and continue to
register the posted interrupt wakeup handler, this will leave the global
posted interrupt wakeup handler pointer in an incorrect state. Although
it should not be an issue, it's still better to change it.
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Fixes: ec5a4919fa7b ("KVM: VMX: Unregister posted interrupt wakeup handler on hardware unsetup")
Link: https://patch.msgid.link/0ac6908b608cf80eab7437004334fedd0f5f5317.1768304590.git.houwenlong.hwl@antgroup.com
[sean: use a goto]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When disconnecting a non-MSI irqfd from an IRQ bypass producer, WARN if
the irqfd is configured for IRQ bypass and set its IRTE back to remapped
mode to harden against kernel/KVM bugs (keeping the irqfd in bypass mode
is often fatal to the host).
Deactivating an irqfd (removing it from the list of irqfds), updating
irqfd routes, and the code in question are all mutually exclusive (all
run under irqfds.lock). If an irqfd is configured for bypass, and the
irqfd is deassigned at the same time IRQ routing is updated (to change the
routing to non-MSI), then either kvm_arch_update_irqfd_routing() should
process the irqfd routing change and put the IRTE into remapped mode
(routing update "wins"), or kvm_arch_irq_bypass_del_producer() should see
the MSI routing info (deactivation "wins").
Link: https://patch.msgid.link/20260113174606.104978-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When allocating the AVIC backing page, only check one of the max AVIC vs.
x2AVIC ID based on whether or not x2AVIC is enabled. Doing so fixes a bug
where KVM incorrectly inhibits AVIC if x2AVIC is _disabled_ and any vCPU
with a non-zero APIC ID is created, as x2avic_max_physical_id is left '0'
when x2AVIC is disabled.
Fixes: 940fc47cfb0d ("KVM: SVM: Add AVIC support for 4k vCPUs in x2AVIC mode")
Cc: stable@vger.kernel.org
Cc: Naveen N Rao (AMD) <naveen@kernel.org>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://patch.msgid.link/20260112232805.1512361-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The comment above DR6 and DR7 initializations is redundant, because the
entire function follows the same pattern of only initializing the fields
in vmcb02 if the vmcb12 changed or the fields are dirty, which handles
the first execution case.
Also, the comment refers to new_vmcb12 as new_vmcs12. Just drop the
comment.
No functional change intended.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260113172807.2178526-1-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove the declaration of nested_mark_vmcs12_pages_dirty() from the
header file since it has been moved and renamed to
nested_vmx_mark_all_vmcs12_pages_dirty(), which is a static function.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260113084748.1714633-1-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fix a goof in the comment that documents KVM's logic for enabling AVIC by
default to reference Zen5+ as family 0x1A (Zen5), not family 0x19 (Zen4).
The code is correct (checks for _greater_ than 0x19), only the comment is
flawed.
Opportunistically tweak the check too, even though it's already correct,
so that both the comment and the code reference 0x1A, and so that the
checks are "ascending", i.e. check Zen4 and then Zen5+.
No functional change intended.
Fixes: ca2967de5a5b ("KVM: SVM: Enable AVIC by default for Zen4+ if x2AVIC is support")
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://patch.msgid.link/20260109035037.1015073-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|