From 4793f990ea1523309a58d8fb5237b3a815e6f537 Mon Sep 17 00:00:00 2001
From: Jim Mattson
Date: Tue, 30 Sep 2025 17:14:07 -0700
Subject: KVM: x86: Advertise EferLmsleUnsupported to userspace

CPUID.80000008H:EBX.EferLmsleUnsupported[bit 20] is a defeature bit.
When this bit is clear, EFER.LMSLE is supported. When this bit is set,
EFER.LMSLE is unsupported.

KVM has never _emulated_ EFER.LMSLE, so KVM cannot truly support a
0-setting of this bit. However, KVM has allowed the guest to enable
EFER.LMSLE in hardware since commit eec4b140c924 ("KVM: SVM: Allow
EFER.LMSLE to be set with nested svm"), i.e. KVM partially virtualizes
long-mode segment limits _if_ they are supported by the underlying
hardware.

Pass through the bit in KVM_GET_SUPPORTED_CPUID to advertise the
unavailability of EFER.LMSLE to userspace based on the raw underlying
hardware. Attempting to enable EFER.LMSLE on such CPUs simply doesn't
work, e.g. immediately crashes on VMRUN.

Signed-off-by: Jim Mattson
Reviewed-by: Nikunj A Dadhania
Reviewed-by: Yosry Ahmed
Link: https://lore.kernel.org/r/20251001001529.1119031-2-jmattson@google.com
[sean: add context about partial virtualization, use PASSTHROUGH_F]
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 4091a776e37a..6bdf868c8f8e 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
 #define X86_FEATURE_AMD_STIBP (13*32+15) /* Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_AMD_STIBP_ALWAYS_ON (13*32+17) /* Single Thread Indirect Branch Predictors always-on preferred */
 #define X86_FEATURE_AMD_IBRS_SAME_MODE (13*32+19) /* Indirect Branch Restricted Speculation same mode protection*/
+#define X86_FEATURE_EFER_LMSLE_MBZ (13*32+20) /* EFER.LMSLE must be zero */
 #define X86_FEATURE_AMD_PPIN (13*32+23) /* "amd_ppin" Protected Processor Inventory Number */
 #define X86_FEATURE_AMD_SSBD (13*32+24) /* Speculative Store Bypass Disable */
 #define X86_FEATURE_VIRT_SSBD (13*32+25) /* "virt_ssbd" Virtualized Speculative Store Bypass Disable */
--
cgit v1.2.3

From ca11d9d35e958d2b4020d1a360ddc277be3ee86c Mon Sep 17 00:00:00 2001
From: Naveen N Rao
Date: Fri, 5 Sep 2025 00:03:04 +0530
Subject: KVM: SVM: Expand AVIC_PHYSICAL_MAX_INDEX_MASK to be a 12-bit field

In the latest APM describing AVIC support for 4k vCPUs, VMCB
AVIC_PHYSICAL_MAX_INDEX (Offset 0xF8) and EXITINFO2.Index are both
updated from 9-bit wide to 12-bit wide fields unconditionally (i.e.,
regardless of AVIC support for 4k vCPUs). Expand
AVIC_PHYSICAL_MAX_INDEX_MASK accordingly.

While AVIC_PHYSICAL_MAX_INDEX_MASK is updated to a 12-bit field, KVM
will limit the max vCPU/APIC ID based on the maximum supported on a
specific processor and enforce that limit during vCPU creation. I.e.,
KVM doesn't need to rely on the mask to ensure that the max APIC ID
being programmed in the VMCB is in range. The additional bits (11:9)
were previously marked reserved and were never set/read by older
processors.
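
For illustration, the widened mask only changes which bits of the VMCB's
AVIC physical max index field may be programmed; a minimal sketch (the
helper below is illustrative, not actual KVM code — only the mask and the
avic_physical_id VMCB field are real):

#include <linux/bits.h>

#define AVIC_PHYSICAL_MAX_INDEX_MASK GENMASK_ULL(11, 0)

/* Illustrative helper: program the max physical index into the VMCB. */
static void avic_set_physical_max_index(struct vmcb *vmcb, u64 max_index)
{
        /* KVM separately enforces the per-platform limit at vCPU creation;
         * the mask only reflects the architecturally defined field width.
         */
        vmcb->control.avic_physical_id &= ~AVIC_PHYSICAL_MAX_INDEX_MASK;
        vmcb->control.avic_physical_id |= max_index & AVIC_PHYSICAL_MAX_INDEX_MASK;
}
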

Signed-off-by: Naveen N Rao (AMD)
Link: https://lore.kernel.org/r/a24ae953cea716bf9c56c136f7ca4bf5e97b1080.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/svm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 17f6c3fedeee..d227e710c6b4 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -279,7 +279,7 @@ enum avic_ipi_failure_cause {
 	AVIC_IPI_FAILURE_INVALID_IPI_VECTOR,
 };

-#define AVIC_PHYSICAL_MAX_INDEX_MASK GENMASK_ULL(8, 0)
+#define AVIC_PHYSICAL_MAX_INDEX_MASK GENMASK_ULL(11, 0)

 /*
  * For AVIC, the max index allowed for physical APIC ID table is 0xfe (254), as
--
cgit v1.2.3

From 5d0316e25defee47c8c7b1b6324244f630b6621f Mon Sep 17 00:00:00 2001
From: Naveen N Rao
Date: Fri, 5 Sep 2025 00:03:06 +0530
Subject: x86/cpufeatures: Add X86_FEATURE_X2AVIC_EXT

Add CPUID feature bit for x2AVIC extension that enables AMD SVM to
support up to 4096 vCPUs in x2AVIC mode. The primary change is in the
size of the AVIC Physical ID table, which can now go up to 8 contiguous
4k pages. The number of pages allocated for the AVIC Physical ID table
is determined by the maximum APIC ID for a guest. AVIC hardware is
enhanced to look up Physical ID table entries for vCPUs > 512 to locate
the target APIC backing page and the host APIC ID of the physical core
on which the guest vCPU is running.

Signed-off-by: Naveen N Rao (AMD)
Acked-by: Borislav Petkov (AMD)
Link: https://lore.kernel.org/r/e5c9c471ab99a130bf9b728b77050ab308cf8624.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6bdf868c8f8e..7129eb44adad 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -500,6 +500,7 @@
 #define X86_FEATURE_IBPB_EXIT_TO_USER (21*32+14) /* Use IBPB on exit-to-userspace, see VMSCAPE bug */
 #define X86_FEATURE_ABMC (21*32+15) /* Assignable Bandwidth Monitoring Counters */
 #define X86_FEATURE_MSR_IMM (21*32+16) /* MSR immediate form instructions */
+#define X86_FEATURE_X2AVIC_EXT (21*32+17) /* AMD SVM x2AVIC support for 4k vCPUs */

 /*
  * BUG word(s)
--
cgit v1.2.3

From 940fc47cfb0d78b0f5db1f71dfce07c4711b9457 Mon Sep 17 00:00:00 2001
From: Naveen N Rao
Date: Fri, 5 Sep 2025 00:03:07 +0530
Subject: KVM: SVM: Add AVIC support for 4k vCPUs in x2AVIC mode

With AVIC support for 4k vCPUs, the maximum supported physical ID in
x2AVIC mode is 4095. Since this is no longer fixed, introduce a variable
(x2avic_max_physical_id) to capture the maximum supported physical ID on
the current platform and use that in place of the existing macro
(X2AVIC_MAX_PHYSICAL_ID).

With AVIC support for 4k vCPUs, the AVIC Physical ID table is no longer
a single page and can occupy up to 8 contiguous 4k pages. Since AVIC
hardware accesses to the physical ID table are limited by the physical
max index programmed in the VMCB, it is sufficient to allocate only as
many pages as are required to have a physical table entry for the max
guest APIC ID. Since the guest APIC mode is not available at this point,
provision for the maximum possible x2AVIC ID. For this purpose, add a
variant of avic_get_max_physical_id() that works with a NULL vCPU
pointer and returns the max x2AVIC ID.

Wrap this in a new helper for obtaining the allocation order.

To make it easy to identify support for 4k vCPUs in x2AVIC mode, update
the kernel log message to include the maximum number of vCPUs supported.
Do this on all platforms supporting x2AVIC since it is useful to know
what is supported on a specific platform.

Co-developed-by: Suravee Suthikulpanit
Signed-off-by: Suravee Suthikulpanit
Signed-off-by: Naveen N Rao (AMD)
Link: https://lore.kernel.org/r/7fc5962f6da028f7dd3c79dbbd5c574fa02c99dd.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/svm.h | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index d227e710c6b4..e69b6d0dedcf 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -289,11 +289,14 @@
 /*
  * For x2AVIC, the max index allowed for physical APIC ID table is 0x1ff (511).
+ * With X86_FEATURE_X2AVIC_EXT, the max index is increased to 0xfff (4095).
  */
 #define X2AVIC_MAX_PHYSICAL_ID 0x1FFUL
+#define X2AVIC_4K_MAX_PHYSICAL_ID 0xFFFUL

 static_assert((AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == AVIC_MAX_PHYSICAL_ID);
 static_assert((X2AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_MAX_PHYSICAL_ID);
+static_assert((X2AVIC_4K_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_4K_MAX_PHYSICAL_ID);

 #define SVM_SEV_FEAT_SNP_ACTIVE BIT(0)
 #define SVM_SEV_FEAT_RESTRICTED_INJECTION BIT(3)
--
cgit v1.2.3

From 65a70164aba7c5d3b37a2d1e04a8d19c9d980994 Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 30 Oct 2025 11:50:03 -0700
Subject: KVM: x86: Add a helper to dedup reporting of unhandled VM-Exits

Add and use a helper, kvm_prepare_unexpected_reason_exit(), to dedup the
code that fills the exit reason and CPU when KVM encounters a VM-Exit
that KVM doesn't know how to handle.

Reviewed-by: Yao Yuan
Reviewed-by: Yosry Ahmed
Reviewed-by: Pankaj Gupta
Reviewed-by: Xiaoyao Li
Reviewed-by: Binbin Wu
Acked-by: Kai Huang
Link: https://patch.msgid.link/20251030185004.3372256-1-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48598d017d6f..4fbe4b7ce1da 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2167,6 +2167,7 @@ void __kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu,
 void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
 void kvm_prepare_event_vectoring_exit(struct kvm_vcpu *vcpu, gpa_t gpa);
+void kvm_prepare_unexpected_reason_exit(struct kvm_vcpu *vcpu, u64 exit_reason);

 void kvm_enable_efer_bits(u64);
 bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
--
cgit v1.2.3

From 7139c860650535cf517a9344212ce56a21ba37f5 Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 30 Oct 2025 13:09:34 -0700
Subject: KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte()

Drop the return code from kvm_x86_ops.remove_external_spte(), a.k.a.
tdx_sept_remove_private_spte(), as KVM simply does a KVM_BUG_ON() on
failure, and that KVM_BUG_ON() is redundant since all error paths in TDX
also do a KVM_BUG_ON().

Opportunistically pass the spte instead of the pfn, as the API is
clearly about removing an spte.
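
A rough sketch of the resulting hook shape (the body is illustrative; the
real tdx_sept_remove_private_spte() differs in detail):

/* Sketch: no return code, mirror SPTE passed instead of the pfn. */
static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
                                         enum pg_level level, u64 mirror_spte)
{
        /* The pfn is derived on demand, e.g. via spte_to_pfn(mirror_spte),
         * instead of being passed by the caller.
         */
        if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
                return;

        /* ... tear down the S-EPT mapping via SEAMCALL wrappers, with each
         * failure path doing KVM_BUG_ON(), hence no meaningful return code ...
         */
        tdx_track(kvm);         /* flush TLBs, per the hook's contract */
}
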

Suggested-by: Rick Edgecombe
Reviewed-by: Binbin Wu
Reviewed-by: Kai Huang
Reviewed-by: Yan Zhao
Tested-by: Yan Zhao
Tested-by: Kai Huang
Link: https://patch.msgid.link/20251030200951.3402865-12-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48598d017d6f..b5867f8fe6ce 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1855,8 +1855,8 @@ struct kvm_x86_ops {
 				void *external_spt);

 	/* Update external page table from spte getting removed, and flush TLB. */
-	int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				    kvm_pfn_t pfn_for_gfn);
+	void (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				     u64 mirror_spte);

 	bool (*has_wbinvd_exit)(void);
--
cgit v1.2.3

From b9d5cf6de0b6bd3bdd8957b5fea5a243e43c928f Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 30 Oct 2025 13:09:35 -0700
Subject: KVM: TDX: WARN if mirror SPTE doesn't have full RWX when creating S-EPT mapping

Pass in the mirror_spte to kvm_x86_ops.set_external_spte() to provide
symmetry with .remove_external_spte(), and assert in TDX that the mirror
SPTE is shadow-present with full RWX permissions (the TDX-Module doesn't
allow the hypervisor to control protections).

Reviewed-by: Binbin Wu
Reviewed-by: Kai Huang
Reviewed-by: Yan Zhao
Tested-by: Yan Zhao
Tested-by: Kai Huang
Link: https://patch.msgid.link/20251030200951.3402865-13-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b5867f8fe6ce..87a5f5100b1d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1848,7 +1848,7 @@ struct kvm_x86_ops {
 				void *external_spt);
 	/* Update the external page table from spte getting set. */
 	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				 kvm_pfn_t pfn_for_gfn);
+				 u64 mirror_spte);

 	/* Update external page tables for page table about to be freed. */
 	int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
--
cgit v1.2.3

From 94428e3ba3258fc2862db3f9999e548d5a2d2a2a Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 30 Oct 2025 13:09:46 -0700
Subject: KVM: TDX: Convert INIT_MEM_REGION and INIT_VCPU to "unlocked" vCPU ioctl

Handle the KVM_TDX_INIT_MEM_REGION and KVM_TDX_INIT_VCPU vCPU sub-ioctls
in the unlocked variant, i.e. outside of vcpu->mutex, in anticipation of
taking kvm->lock along with all other vCPU mutexes, at which point the
sub-ioctls _must_ start without vcpu->mutex held.

No functional change intended.
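
A minimal sketch of how the new hook slots into dispatch, i.e. trying the
unlocked variant before taking vcpu->mutex (the dispatch function and the
-ENOIOCTLCMD fall-through convention are assumptions for illustration;
only the two hooks are real):

static long vcpu_mem_enc_ioctl_example(struct kvm_vcpu *vcpu, void __user *argp)
{
        long r;

        /* First give the unlocked variant a shot, outside of vcpu->mutex. */
        r = kvm_x86_call(vcpu_mem_enc_unlocked_ioctl)(vcpu, argp);
        if (r != -ENOIOCTLCMD)
                return r;       /* handled without vcpu->mutex held */

        if (mutex_lock_killable(&vcpu->mutex))
                return -EINTR;
        r = kvm_x86_call(vcpu_mem_enc_ioctl)(vcpu, argp);
        mutex_unlock(&vcpu->mutex);
        return r;
}
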

Reviewed-by: Kai Huang
Co-developed-by: Yan Zhao
Signed-off-by: Yan Zhao
Reviewed-by: Binbin Wu
Reviewed-by: Yan Zhao
Tested-by: Yan Zhao
Tested-by: Kai Huang
Link: https://patch.msgid.link/20251030200951.3402865-24-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm-x86-ops.h | 1 +
 arch/x86/include/asm/kvm_host.h    | 1 +
 2 files changed, 2 insertions(+)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index fdf178443f85..de709fb5bd76 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -128,6 +128,7 @@ KVM_X86_OP(enable_smi_window)
 KVM_X86_OP_OPTIONAL(dev_get_attr)
 KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
+KVM_X86_OP_OPTIONAL(vcpu_mem_enc_unlocked_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_register_region)
 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
 KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 87a5f5100b1d..2bfae1cfa514 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1914,6 +1914,7 @@ struct kvm_x86_ops {
 	int (*dev_get_attr)(u32 group, u64 attr, u64 *val);
 	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
 	int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
+	int (*vcpu_mem_enc_unlocked_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
 	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
--
cgit v1.2.3

From c0711f8c610e1634ed54fb04da1e82252730306a Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 30 Oct 2025 12:15:25 -0700
Subject: KVM: TDX: Explicitly set user-return MSRs that *may* be clobbered by the TDX-Module

Set all user-return MSRs to their post-TD-exit value when preparing to
run a TDX vCPU to ensure the value that KVM expects to be loaded after
running the vCPU is indeed the value that's loaded in hardware. If the
TDX-Module doesn't actually enter the guest, i.e. doesn't do VM-Enter,
then it won't "restore" VMM state, i.e. won't clobber user-return MSRs
to their expected post-run values, in which case simply updating KVM's
"cached" value will effectively corrupt the cache due to hardware still
holding the original value.

In theory, KVM could conditionally update the current user-return value
if and only if tdh_vp_enter() succeeds, but in practice "success"
doesn't guarantee the TDX-Module actually entered the guest, e.g. if
the TDX-Module synthesizes an EPT Violation because it suspects a
zero-step attack.

Force-load the expected values instead of trying to decipher whether or
not the TDX-Module restored/clobbered MSRs, as the risk doesn't justify
the benefits. Effectively avoiding four WRMSRs once per run loop (even
if the vCPU is scheduled out, user-return MSRs only need to be reloaded
if the CPU exits to userspace or runs a non-TDX vCPU) is likely in the
noise when amortized over all entries, given the cost of running a TDX
vCPU. E.g. the cost of the WRMSRs is somewhere between ~300 and ~500
cycles, whereas the cost of a _single_ roundtrip to/from a TDX guest is
thousands of cycles.
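
Conceptually, the fix replaces "update the cache after the fact" with
"write the full MSR+cache state up front"; a sketch (the MSR list and the
zeroed post-exit values are assumptions for illustration, not the literal
TDX code):

static void tdx_force_load_user_return_msrs(void)
{
        /* Assumed list/values for illustration. */
        static const u32 tdx_uret_msrs[] = {
                MSR_SYSCALL_MASK, MSR_STAR, MSR_LSTAR, MSR_TSC_AUX,
        };
        int i, slot;

        for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
                slot = kvm_find_user_return_msr(tdx_uret_msrs[i]);
                if (slot < 0)
                        continue;
                /* Writes hardware *and* the cache, so the two can't diverge
                 * even if the TDX-Module never does VM-Enter.
                 */
                kvm_set_user_return_msr(slot, 0, -1ull);
        }
}
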

Fixes: e0b4f31a3c65 ("KVM: TDX: restore user ret MSRs")
Cc: stable@vger.kernel.org
Cc: Yan Zhao
Cc: Xiaoyao Li
Cc: Rick Edgecombe
Reviewed-by: Xiaoyao Li
Link: https://patch.msgid.link/20251030191528.3380553-2-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h | 1 -
 1 file changed, 1 deletion(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4fbe4b7ce1da..a557c504c1a4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2379,7 +2379,6 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
 int kvm_add_user_return_msr(u32 msr);
 int kvm_find_user_return_msr(u32 msr);
 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
-void kvm_user_return_msr_update_cache(unsigned int index, u64 val);
 u64 kvm_get_user_return_msr(unsigned int slot);

 static inline bool kvm_is_supported_user_return_msr(u32 msr)
--
cgit v1.2.3

From 4da3768e1820cf15cced390242d8789aed34f54d Mon Sep 17 00:00:00 2001
From: Omar Sandoval
Date: Tue, 4 Nov 2025 09:55:26 -0800
Subject: KVM: SVM: Don't skip unrelated instruction if INT3/INTO is replaced

When re-injecting a soft interrupt from an INT3, INTO, or (select) INTn
instruction, discard the exception and retry the instruction if the
code stream is changed (e.g. by a different vCPU) between when the CPU
executes the instruction and when KVM decodes the instruction to get
the next RIP. As effectively predicted by commit 6ef88d6e36c2 ("KVM:
SVM: Re-inject INT3/INTO instead of retrying the instruction"), failure
to verify that the correct INTn instruction was decoded can effectively
clobber guest state due to decoding the wrong instruction and thus
specifying the wrong next RIP.

The bug most often manifests as "Oops: int3" panics on static branch
checks in Linux guests. Enabling or disabling a static branch in Linux
uses the kernel's "text poke" code patching mechanism. To modify code
while other CPUs may be executing that code, Linux (temporarily)
replaces the first byte of the original instruction with an int3
(opcode 0xcc), then patches in the new code stream except for the first
byte, and finally replaces the int3 with the first byte of the new code
stream. If a CPU hits the int3, i.e. executes the code while it's being
modified, then the guest kernel must look up the RIP to determine how
to handle the #BP, e.g. by emulating the new instruction. If the RIP is
incorrect, then this lookup fails and the guest kernel panics.

The bug reproduces almost instantly by hacking the guest kernel to
repeatedly check a static branch[1] while running a drgn script[2] on
the host to constantly swap out the memory containing the guest's TSS.

[1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
[2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b
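
The EMULTYPE plumbing in the diff below lets SVM ask the emulator to skip
the instruction only if it matches the event being re-injected; roughly
(a usage sketch with assumed failure semantics, not the exact SVM call
site):

static int svm_skip_soft_int_insn_example(struct kvm_vcpu *vcpu, u8 vector)
{
        int emul_type = EMULTYPE_SKIP | EMULTYPE_SKIP_SOFT_INT |
                        EMULTYPE_SET_SOFT_INT_VECTOR(vector);

        /* If the decoded instruction could not have generated 'vector',
         * e.g. because the code stream was patched out from under the
         * vCPU, drop the event and re-execute the (new) instruction
         * instead of re-injecting with a bogus next RIP.
         */
        if (!kvm_emulate_instruction(vcpu, emul_type)) {
                kvm_clear_interrupt_queue(vcpu);
                return 0;
        }
        return 1;
}
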

Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson
Signed-off-by: Omar Sandoval
Link: https://patch.msgid.link/1cc6dcdf36e3add7ee7c8d90ad58414eeb6c3d34.1762278762.git.osandov@fb.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h | 9 +++++++++
 1 file changed, 9 insertions(+)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48598d017d6f..974d64bf0a4d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2143,6 +2143,11 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
  *			the gfn, i.e. retrying the instruction will hit a
  *			!PRESENT fault, which results in a new shadow page
  *			and sends KVM back to square one.
+ *
+ * EMULTYPE_SKIP_SOFT_INT - Set in combination with EMULTYPE_SKIP to only skip
+ *			an instruction if it could generate a given software
+ *			interrupt, which must be encoded via
+ *			EMULTYPE_SET_SOFT_INT_VECTOR().
  */
 #define EMULTYPE_NO_DECODE (1 << 0)
 #define EMULTYPE_TRAP_UD (1 << 1)
@@ -2153,6 +2158,10 @@
 #define EMULTYPE_PF (1 << 6)
 #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
 #define EMULTYPE_WRITE_PF_TO_SP (1 << 8)
+#define EMULTYPE_SKIP_SOFT_INT (1 << 9)
+
+#define EMULTYPE_SET_SOFT_INT_VECTOR(v) ((u32)((v) & 0xff) << 16)
+#define EMULTYPE_GET_SOFT_INT_VECTOR(e) (((e) >> 16) & 0xff)

 static inline bool kvm_can_emulate_event_vectoring(int emul_type)
 {
--
cgit v1.2.3

From 7a61d61396b97fd6bb9b9bde321c68513034ad11 Mon Sep 17 00:00:00 2001
From: Tom Lendacky
Date: Mon, 27 Oct 2025 14:33:51 -0500
Subject: KVM: SEV: Publish supported SEV-SNP policy bits

Define the set of policy bits that KVM currently knows do not require
any implementation support within KVM. Provide this value to userspace
via the KVM_GET_DEVICE_ATTR ioctl.

Signed-off-by: Tom Lendacky
Link: https://patch.msgid.link/c596f7529518f3f826a57970029451d9385949e5.1761593632.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/uapi/asm/kvm.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index d420c9c066d4..7ceff6583652 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -502,6 +502,7 @@ struct kvm_sync_regs {
 /* vendor-specific groups and attributes for system fd */
 #define KVM_X86_GRP_SEV 1
 # define KVM_X86_SEV_VMSA_FEATURES 0
+# define KVM_X86_SNP_POLICY_BITS 1

 struct kvm_vmx_nested_state_data {
 	__u8 vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
--
cgit v1.2.3

From 43ddbf16edf5c1790684b32d5eb920a1b0eea285 Mon Sep 17 00:00:00 2001
From: Lei Chen
Date: Tue, 19 Aug 2025 23:20:25 +0800
Subject: Revert "x86: kvm: introduce periodic global clock updates"

This reverts commit 332967a3eac06f6379283cf155c84fe7cd0537c2.

Commit 332967a3eac0 ("x86: kvm: introduce periodic global clock
updates") introduced a 300s interval work to sync ntp corrections
across all vcpus. Since commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK
base to monotonic raw clock") switched kvmclock to the monotonic raw
clock, NTP corrections no longer apply to kvmclock and the periodic
sync serves no purpose.
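
Conceptually, kvmclock now advances at a fixed TSC-derived rate that host
NTP slewing never alters, so a periodic "global update" has nothing to
propagate; the guest-side read boils down to the following (field names
per the pvclock ABI; the function itself is a sketch):

static u64 pvclock_read_ns_example(const struct pvclock_vcpu_time_info *ti, u64 tsc)
{
        u64 delta = tsc - ti->tsc_timestamp;

        /* tsc_to_system_mul/tsc_shift are fixed scaling factors derived
         * from the TSC frequency; NTP corrections on the host do not
         * change them.
         */
        return ti->system_time +
               pvclock_scale_delta(delta, ti->tsc_to_system_mul, ti->tsc_shift);
}
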

Signed-off-by: Lei Chen
Link: https://patch.msgid.link/20250819152027.1687487-2-lei.chen@smartx.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h | 1 -
 1 file changed, 1 deletion(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a557c504c1a4..adc8ea9d391b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1457,7 +1457,6 @@ struct kvm_arch {
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
 	struct delayed_work kvmclock_update_work;
-	struct delayed_work kvmclock_sync_work;

 #ifdef CONFIG_KVM_HYPERV
 	struct kvm_hv hyperv;
--
cgit v1.2.3

From 446fcce2a52b533c543dabba26777813c347577c Mon Sep 17 00:00:00 2001
From: Lei Chen
Date: Tue, 19 Aug 2025 23:20:26 +0800
Subject: Revert "x86: kvm: rate-limit global clock updates"

This reverts commit 7e44e4495a398eb553ce561f29f9148f40a3448f.

Commit 7e44e4495a39 ("x86: kvm: rate-limit global clock updates")
intends to use kvmclock_update_work to sync ntp corrections across all
vcpus' kvmclocks, and is based on commit 0061d53daf26f ("KVM: x86:
limit difference between kvmclock updates"). Since kvmclock has been
switched to mono raw, this commit can be reverted.

Signed-off-by: Lei Chen
Link: https://patch.msgid.link/20250819152027.1687487-3-lei.chen@smartx.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h | 1 -
 1 file changed, 1 deletion(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index adc8ea9d391b..692ec922945f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1456,7 +1456,6 @@ struct kvm_arch {
 	bool use_master_clock;
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
-	struct delayed_work kvmclock_update_work;

 #ifdef CONFIG_KVM_HYPERV
 	struct kvm_hv hyperv;
--
cgit v1.2.3

From afb99ffbd5825bfbc88e8052bc5514e6ebdfec5e Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 13 Nov 2025 15:37:40 -0800
Subject: x86/bugs: Decouple ALTERNATIVE usage from VERW macro definition

Decouple the use of ALTERNATIVE from the encoding of VERW to clear CPU
buffers so that KVM can use ALTERNATIVE_2 to handle "always clear
buffers" and "clear if guest can access host MMIO" in a single
statement.

No functional change intended.

Reviewed-by: Brendan Jackman
Reviewed-by: Pawan Gupta
Link: https://patch.msgid.link/20251113233746.1703361-4-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/nospec-branch.h | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 08ed5a2e46a5..cb36a8ea00d3 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -308,24 +308,29 @@
  * CFLAGS.ZF.
  * Note: Only the memory operand variant of VERW clears the CPU buffers.
  */
-.macro __CLEAR_CPU_BUFFERS feature
 #ifdef CONFIG_X86_64
-	ALTERNATIVE "", "verw x86_verw_sel(%rip)", \feature
+#define VERW verw x86_verw_sel(%rip)
 #else
-	/*
-	 * In 32bit mode, the memory operand must be a %cs reference. The data
-	 * segments may not be usable (vm86 mode), and the stack segment may not
-	 * be flat (ESPFIX32).
-	 */
-	ALTERNATIVE "", "verw %cs:x86_verw_sel", \feature
+/*
+ * In 32bit mode, the memory operand must be a %cs reference. The data segments
+ * may not be usable (vm86 mode), and the stack segment may not be flat (ESPFIX32).
+ */
+#define VERW verw %cs:x86_verw_sel
 #endif
-.endm

+/*
+ * Provide a stringified VERW macro for simple usage, and a non-stringified
+ * VERW macro for use in more elaborate sequences, e.g. to encode a conditional
+ * VERW within an ALTERNATIVE.
+ */
+#define __CLEAR_CPU_BUFFERS __stringify(VERW)
+
+/* If necessary, emit VERW on exit-to-userspace to clear CPU buffers. */
 #define CLEAR_CPU_BUFFERS \
-	__CLEAR_CPU_BUFFERS X86_FEATURE_CLEAR_CPU_BUF
+	ALTERNATIVE "", __CLEAR_CPU_BUFFERS, X86_FEATURE_CLEAR_CPU_BUF

 #define VM_CLEAR_CPU_BUFFERS \
-	__CLEAR_CPU_BUFFERS X86_FEATURE_CLEAR_CPU_BUF_VM
+	ALTERNATIVE "", __CLEAR_CPU_BUFFERS, X86_FEATURE_CLEAR_CPU_BUF_VM

 #ifdef CONFIG_X86_64
 .macro CLEAR_BRANCH_HISTORY
--
cgit v1.2.3

From f6106d41ec84e552a5e8adda1f8741cab96a5425 Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 13 Nov 2025 15:37:41 -0800
Subject: x86/bugs: Use an x86 feature to track the MMIO Stale Data mitigation

Convert the MMIO Stale Data mitigation tracking from a static branch
into an x86 feature flag so that it can be used via ALTERNATIVE_2 in
KVM.

No functional change intended.

Reviewed-by: Pawan Gupta
Reviewed-by: Brendan Jackman
Link: https://patch.msgid.link/20251113233746.1703361-5-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/cpufeatures.h   | 5 +++++
 arch/x86/include/asm/nospec-branch.h | 2 --
 2 files changed, 5 insertions(+), 2 deletions(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 4091a776e37a..fc5698844a0b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -499,6 +499,11 @@
 #define X86_FEATURE_IBPB_EXIT_TO_USER (21*32+14) /* Use IBPB on exit-to-userspace, see VMSCAPE bug */
 #define X86_FEATURE_ABMC (21*32+15) /* Assignable Bandwidth Monitoring Counters */
 #define X86_FEATURE_MSR_IMM (21*32+16) /* MSR immediate form instructions */
+#define X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO (21*32+17) /*
+						      * Clear CPU buffers before VM-Enter if the vCPU
+						      * can access host MMIO (ignored for all intents
+						      * and purposes if CLEAR_CPU_BUF_VM is set).
+						      */

 /*
  * BUG word(s)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index cb36a8ea00d3..afdcdf40e414 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -585,8 +585,6 @@ DECLARE_STATIC_KEY_FALSE(cpu_buf_idle_clear);

 DECLARE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);

-DECLARE_STATIC_KEY_FALSE(cpu_buf_vm_clear);
-
 extern u16 x86_verw_sel;

 #include
--
cgit v1.2.3

From fc704b578976ed6a937f419e611d5ae92c939826 Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 13 Nov 2025 15:37:43 -0800
Subject: x86/bugs: KVM: Move VM_CLEAR_CPU_BUFFERS into SVM as SVM_CLEAR_CPU_BUFFERS

Now that VMX encodes its own sequence for clearing CPU buffers, move
VM_CLEAR_CPU_BUFFERS into SVM to minimize the chances of KVM botching a
mitigation in the future, e.g. using VM_CLEAR_CPU_BUFFERS instead of
checking multiple mitigation flags.

No functional change intended.
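
Per the subject, the macro simply moves into SVM; a sketch of the
SVM-private definition (the exact file/location in KVM's SVM code is
assumed), reusing the same ALTERNATIVE pattern removed from
nospec-branch.h below:

/* Assumed placement in KVM's SVM code; the pattern matches the removed
 * VM_CLEAR_CPU_BUFFERS.
 */
#define SVM_CLEAR_CPU_BUFFERS \
	ALTERNATIVE "", __CLEAR_CPU_BUFFERS, X86_FEATURE_CLEAR_CPU_BUF_VM
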

Reviewed-by: Brendan Jackman
Acked-by: Borislav Petkov (AMD)
Link: https://patch.msgid.link/20251113233746.1703361-7-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/nospec-branch.h | 3 ---
 1 file changed, 3 deletions(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index afdcdf40e414..a6526c5be5ca 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -329,9 +329,6 @@
 #define CLEAR_CPU_BUFFERS \
 	ALTERNATIVE "", __CLEAR_CPU_BUFFERS, X86_FEATURE_CLEAR_CPU_BUF

-#define VM_CLEAR_CPU_BUFFERS \
-	ALTERNATIVE "", __CLEAR_CPU_BUFFERS, X86_FEATURE_CLEAR_CPU_BUF_VM
-
 #ifdef CONFIG_X86_64
 .macro CLEAR_BRANCH_HISTORY
 	ALTERNATIVE "", "call clear_bhb_loop", X86_FEATURE_CLEAR_BHB_LOOP
--
cgit v1.2.3

From 05bd63959a9d682e5e765f950bb6b356d3d6d27a Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 13 Nov 2025 15:37:45 -0800
Subject: KVM: VMX: Disable L1TF L1 data cache flush if CONFIG_CPU_MITIGATIONS=n

Disable support for flushing the L1 data cache to mitigate L1TF if CPU
mitigations are disabled for the entire kernel. KVM's mitigation of
L1TF is in no way special enough to justify ignoring
CONFIG_CPU_MITIGATIONS=n.

Deliberately use CPU_MITIGATIONS instead of the more precise
MITIGATION_L1TF, as MITIGATION_L1TF only controls the default behavior,
i.e. CONFIG_MITIGATION_L1TF=n doesn't completely disable L1TF
mitigations in the kernel.

Keep the vmentry_l1d_flush module param to avoid breaking existing
setups, and leverage the .set path to alert the user to the fact that
vmentry_l1d_flush will be ignored. Don't bother validating the incoming
value; if an admin misconfigures vmentry_l1d_flush, the fact that the
bad configuration won't be detected when running with
CONFIG_CPU_MITIGATIONS=n is likely the least of their worries.

Reviewed-by: Brendan Jackman
Link: https://patch.msgid.link/20251113233746.1703361-9-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/hardirq.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index f00c09ffe6a9..6b6d472baa0b 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -5,7 +5,7 @@
 #include

 typedef struct {
-#if IS_ENABLED(CONFIG_KVM_INTEL)
+#if IS_ENABLED(CONFIG_CPU_MITIGATIONS) && IS_ENABLED(CONFIG_KVM_INTEL)
 	u8 kvm_cpu_l1tf_flush_l1d;
 #endif
 	unsigned int __nmi_count; /* arch dependent */
@@ -68,7 +68,7 @@ extern u64 arch_irq_stat(void);
 DECLARE_PER_CPU_CACHE_HOT(u16, __softirq_pending);
 #define local_softirq_pending_ref __softirq_pending

-#if IS_ENABLED(CONFIG_KVM_INTEL)
+#if IS_ENABLED(CONFIG_CPU_MITIGATIONS) && IS_ENABLED(CONFIG_KVM_INTEL)
 /*
  * This function is called from noinstr interrupt contexts
  * and must be inlined to not get instrumentation.
--
cgit v1.2.3

From 38ee66cb1845dbf1e97c5e5d3db01ae4513f66a9 Mon Sep 17 00:00:00 2001
From: Brendan Jackman
Date: Thu, 13 Nov 2025 15:37:46 -0800
Subject: KVM: x86: Unify L1TF flushing under per-CPU variable

Currently the need to flush L1D for L1TF is tracked by two bits: one
per-CPU and one per-vCPU. The per-vCPU bit is always set when the vCPU
shows up on a core, so there is no interesting state that's truly
per-vCPU. Indeed, this is a requirement, since L1D is a part of the
physical CPU.

So simplify this by combining the two bits.
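
With the bits combined, the tracking reduces to the existing per-CPU flag
in irq_stat; conceptually (helper names are illustrative, the field is
the real kvm_cpu_l1tf_flush_l1d shown in the hardirq.h diff above):

static inline void kvm_request_l1tf_flush_example(void)
{
	/* Raw write: may run with preemption enabled, see below. */
	raw_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
}

static inline bool kvm_consume_l1tf_flush_example(void)
{
	bool flush = __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);

	__this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);
	return flush;
}
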

The vCPU bit was being written from preemption-enabled regions. To play
nice with those cases, wrap all calls from KVM and use a raw write so
that requesting a flush with preemption enabled doesn't trigger what
would effectively be DEBUG_PREEMPT false positives. Preemption doesn't
need to be disabled, as kvm_arch_vcpu_load() will mark the new CPU as
needing a flush if the vCPU task is migrated, or if userspace runs the
vCPU on a different task.

Signed-off-by: Brendan Jackman
[sean: put raw write in KVM instead of in a hardirq.h variant]
Link: https://patch.msgid.link/20251113233746.1703361-10-seanjc@google.com
Signed-off-by: Sean Christopherson
---
 arch/x86/include/asm/kvm_host.h | 3 ---
 1 file changed, 3 deletions(-)

(limited to 'arch/x86/include')

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 692ec922945f..103af57e1060 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1055,9 +1055,6 @@ struct kvm_vcpu_arch {
 	/* be preempted when it's in kernel-mode(cpl=0) */
 	bool preempted_in_kernel;

-	/* Flush the L1 Data cache for L1TF mitigation on VMENTER */
-	bool l1tf_flush_l1d;
-
 	/* Host CPU on which VM-entry was most recently attempted */
 	int last_vmentry_cpu;
--
cgit v1.2.3