| Age | Commit message (Collapse) | Author |
|
When running as an SEV+ guest, treat SVM as unsupported even if CPUID (and
other reporting, e.g. MSRs) enumerate support for SVM, as KVM doesn't
support nested virtualization within an SEV VM (KVM would need to
explicitly share all VMCBs and other assets with the untrusted host), let
alone running nested VMs within SEV-ES+ guests (e.g. emulating VMLOAD,
VMSAVE, and VMRUN all require access to guest register state). And outside
of KVM, there is no in-tree user of SVM enabling.
Arguably, the hypervisor/VMM (e.g. QEMU) should clear SVM from guest CPUID
for SEV VMs, especially for SEV-ES+, but super duper technically, it's
feasible to run nested VMs in SEV+ guests (with many caveats). More
importantly, Linux-as-a-guest has played nice with SVM being advertised to
SEV+ guests for a long time.
Treating SVM as unsupported fixes a regression where a clean shutdown of
an SEV-ES+ guest degrades into an abrupt termination. Due to a gnarly
virtualization hole in SEV-ES (the architecture), where EFER must NOT be
intercepted by the hypervisor (because the untrusted hypervisor can't set
e.g. EFER.LME on behalf o the guest), the _host's_ EFER.SVME is visible to
the guest. Because EFER.SVME must be always '1' while in guest mode,
Linux-the-guest sees EFER.SVME=1 even when _its_ EFER.SVME is '0', thinks
it has enabled virtualization, and ultimately can cause
x86_svm_emergency_disable_virtualization_cpu() to execute STGI to ensure
GIF is enabled. Executing STGI _should_ be fine, except Linux is a also
wee bit paranoid when running as an SEV-ES guest.
Because L0 sees EFER.SVME=0 for the guest, a well-behaved L0 hypervisor
will intercept STGI (to inject #UD), and thus generate a #VC on the STGI.
Which, again, should be fine. Unfortunately, vc_check_opcode_bytes() fails
to account for STGI and other SVM instructions, throws a fatal error, and
triggers a termination request. In a perfect world, the #VC handler would
be more forgiving of unknown intercepts, especially when the #VC happened
on an instruction with exception fixup. For now, just fix the immediate
regression.
Fixes: 428afac5a8ea ("KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem")
Reported-by: Srikanth Aithal <sraithal@amd.com>
Closes: https://lore.kernel.org/all/c820e242-9f3a-4210-b414-19d11b022404@amd.com
Link: https://patch.msgid.link/20260409191341.1932853-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Implement a per-CPU refcounting scheme so that "users" of hardware
virtualization, e.g. KVM and the future TDX code, can co-exist without
pulling the rug out from under each other. E.g. if KVM were to disable
VMX on module unload or when the last KVM VM was destroyed, SEAMCALLs from
the TDX subsystem would #UD and panic the kernel.
Disable preemption in the get/put APIs to ensure virtualization is fully
enabled/disabled before returning to the caller. E.g. if the task were
preempted after a 0=>1 transition, the new task would see a 1=>2 and thus
return without enabling virtualization. Explicitly disable preemption
instead of requiring the caller to do so, because the need to disable
preemption is an artifact of the implementation. E.g. from KVM's
perspective there is no _need_ to disable preemption as KVM guarantees the
pCPU on which it is running is stable (but preemption is enabled).
Opportunistically abstract away SVM vs. VMX in the public APIs by using
X86_FEATURE_{SVM,VMX} to communicate what technology the caller wants to
enable and use.
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the majority of the code related to disabling hardware virtualization
in emergency from KVM into the virt subsystem so that virt can take full
ownership of the state of SVM/VMX. This will allow refcounting usage of
SVM/VMX so that KVM and the TDX subsystem can enable VMX without stomping
on each other.
To route the emergency callback to the "right" vendor code, add to avoid
mixing vendor and generic code, implement a x86_virt_ops structure to
track the emergency callback, along with the SVM vs. VMX (vs. "none")
feature that is active.
To avoid having to choose between SVM and VMX, simply refuse to enable
either if both are somehow supported. No known CPU supports both SVM and
VMX, and it's comically unlikely such a CPU will ever exist.
Leave KVM's clearing of loaded VMCSes and MSR_VM_HSAVE_PA in KVM, via a
callback explicitly scoped to KVM. Loading VMCSes and saving/restoring
host state are firmly tied to running VMs, and thus are (a) KVM's
responsibility and (b) operations that are still exclusively reserved for
KVM (as far as in-tree code is concerned). I.e. the contract being
established is that non-KVM subsystems can utilize virtualization, but for
all intents and purposes cannot act as full-blown hypervisors.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the innermost EFER.SVME logic out of KVM and into to core x86 to land
the SVM support alongside VMX support. This will allow providing a more
unified API from the kernel to KVM, and will allow moving the bulk of the
emergency disabling insanity out of KVM without having a weird split
between kernel and KVM for SVM vs. VMX.
No functional change intended.
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the innermost VMXON+VMXOFF logic out of KVM and into to core x86 so
that TDX can (eventually) force VMXON without having to rely on KVM being
loaded, e.g. to do SEAMCALLs during initialization.
Opportunistically update the comment regarding emergency disabling via NMI
to clarify that virt_rebooting will be set by _another_ emergency callback,
i.e. that virt_rebooting doesn't need to be set before VMCLEAR, only
before _this_ invocation does VMXOFF.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
If allocating and configuring a root VMCS fails, clear X86_FEATURE_VMX in
all CPUs so that KVM doesn't need to manually check root_vmcs. As added
bonuses, clearing VMX will reflect that VMX is unusable in /proc/cpuinfo,
and will avoid a futile auto-probe of kvm-intel.ko.
WARN if allocating a root VMCS page fails, e.g. to help users figure out
why VMX is broken in the unlikely scenario something goes sideways during
boot (and because the allocation should succeed unless there's a kernel
bug). Tweak KVM's error message to suggest checking kernel logs if VMX is
unsupported (in addition to checking BIOS).
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allocate the root VMCS (misleading called "vmxarea" and "kvm_area" in KVM)
for each possible CPU during early boot CPU bringup, before early TDX
initialization, so that TDX can eventually do VMXON on-demand (to make
SEAMCALLs) without needing to load kvm-intel.ko. Allocate the pages early
on, e.g. instead of trying to do so on-demand, to avoid having to juggle
allocation failures at runtime.
Opportunistically rename the per-CPU pointers to better reflect the role
of the VMCS. Use Intel's "root VMCS" terminology, e.g. from various VMCS
patents[1][2] and older SDMs, not the more opaque "VMXON region" used in
recent versions of the SDM. While it's possible the VMCS passed to VMXON
no longer serves as _the_ root VMCS on modern CPUs, it is still in effect
a "root mode VMCS", as described in the patents.
Link: https://patentimages.storage.googleapis.com/c7/e4/32/d7a7def5580667/WO2013101191A1.pdf [1]
Link: https://patentimages.storage.googleapis.com/13/f6/8d/1361fab8c33373/US20080163205A1.pdf [2]
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move "kvm_rebooting" to the kernel, exported for KVM, as one of many steps
towards extracting the innermost VMXON and EFER.SVME management logic out
of KVM and into to core x86.
For lack of a better name, call the new file "hw.c", to yield "virt
hardware" when combined with its parent directory.
No functional change intended.
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|