|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.0, take #4
- Clear the pending exception state from a vcpu coming out of
reset, as it could otherwise affect the first instruction
executed in the guest.
- Fix the address translation emulation code to set the Hardware
Access bit on the correct PTE instead of some other location.
|
|
Using "(u64 __user *)hva + offset" to get the virtual addresses of S1/S2
descriptors looks really wrong if offset is not zero. What we want to get
for swapping is hva + offset, not hva + offset*8. ;-)
Fix it.
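To make the scaling concrete (a minimal sketch, assuming hva is an
unsigned long as elsewhere in the kernel):
  u64 __user *bad  = (u64 __user *)hva + offset;   /* hva + offset * 8 */
  u64 __user *good = (u64 __user *)(hva + offset); /* hva + offset */
Pointer arithmetic on a u64 pointer advances in sizeof(u64) == 8 byte
steps, so the byte offset must be added before the cast.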
Fixes: f6927b41d573 ("KVM: arm64: Add helper for swapping guest descriptor")
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260317115748.47332-1-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
|
|
Our vcpu reset suffers from a particularly interesting flaw, as it
does not correctly deal with state that will have an effect on the
execution flow out of reset.
Take the following completely random example, never seen in the wild
and that never resulted in a couple of sleepless nights: /s
- vcpu-A issues a PSCI_CPU_OFF using the SMC conduit
- SMC being a trapped instruction (as opposed to HVC which is always
normally executed), we annotate the vcpu as needing to skip the
next instruction, which is the SMC itself
- vcpu-A is now safely off
- vcpu-B issues a PSCI_CPU_ON for vcpu-A, providing a starting PC
- vcpu-A gets reset, gets the new PC, and is sent on its merry way
- right at the point of entering the guest, we notice that a PC
increment is pending (remember the earlier SMC?)
- vcpu-A skips its first instruction...
What could possibly go wrong?
Well, I'm glad you asked. For pKVM as a NV guest, that first instruction
is extremely significant, as it indicates whether the CPU is booting
or resuming. Having skipped that instruction, nothing makes any sense
anymore, and CPU hotplugging fails.
This is all caused by the decoupling of PC update from the handling
of an exception that triggers such update, making it non-obvious
what affects what when.
Fix this train wreck by discarding all the PC-affecting state on
vcpu reset.
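A minimal sketch of the approach, assuming the PENDING_EXCEPTION and
INCREMENT_PC vcpu flags KVM/arm64 uses to track PC-affecting state (the
helper name is illustrative, and the surrounding reset code is elided):
  static void kvm_vcpu_discard_pc_state(struct kvm_vcpu *vcpu)
  {
      /* Drop any stale PC-affecting state before the vcpu runs. */
      vcpu_clear_flag(vcpu, PENDING_EXCEPTION);
      vcpu_clear_flag(vcpu, INCREMENT_PC);
  }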
Fixes: f5e30680616ab ("KVM: arm64: Move __adjust_pc out of line")
Cc: stable@vger.kernel.org
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://patch.msgid.link/20260312140850.822968-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.0, take #3
- Correctly handle deactivation of out-of-LRs interrupts by
starting the EOIcount deactivation walk *after* the last irq
that made it into an LR. This avoids deactivating irqs that
are in the LRs and that the vcpu hasn't deactivated yet.
- Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when
pKVM is already enabled -- not only is this not possible (pKVM
will reject the call), it is also useless: this can only
happen for a CPU that has already booted once, and the capability
will not change.
|
|
HEAD
KVM generic changes for 7.0
- Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being
unnecessary and confusing, triggered compiler warnings due to
-Wflex-array-member-not-at-end.
- Document that vcpu->mutex is taken outside of kvm->slots_lock and
kvm->slots_arch_lock, which is intentional and desirable despite being
rather unintuitive.
|
|
Valentine reports that their guests fail to boot correctly, losing
interrupts, and indicates that the wrong interrupt gets deactivated.
What happens here is that if the maintenance interrupt is slow enough
to kick us out of the guest, extra interrupts can be activated from
the LRs. We then exit and proceed to handle EOIcount deactivations,
picking active interrupts from the AP list. But we start from the
top of the list, potentially deactivating interrupts that were in
the LRs, while EOIcount only denotes deactivation of interrupts that
are not present in an LR.
Solve this by tracking the last interrupt that made it into the LRs,
and starting the EOIcount deactivation walk *after* that interrupt.
Since this only makes sense while the vcpu is loaded, stash this
in the per-CPU host state.
Huge thanks to Valentine for doing all the detective work and
providing an initial patch.
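A sketch of the walk, using the host_data_ptr() per-CPU accessor; the
last_lr_irq field and the deactivation body are illustrative:
  struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
  struct vgic_irq *irq, *last = *host_data_ptr(last_lr_irq); /* hypothetical field */
  bool past_last = !last;	/* nothing in an LR? start from the top */
  list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list) {
      if (!past_last) {
          past_last = (irq == last);
          continue;	/* still covered by an LR, don't touch it */
      }
      /* deactivate irq, as accounted for by EOIcount... */
  }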
Fixes: 3cfd59f81e0f3 ("KVM: arm64: GICv3: Handle LR overflow when EOImode==0")
Fixes: 281c6c06e2a7b ("KVM: arm64: GICv2: Handle LR overflow when EOImode==0")
Reported-by: Valentine Burley <valentine.burley@collabora.com>
Tested-by: Valentine Burley <valentine.burley@collabora.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20260307115955.369455-1-valentine.burley@collabora.com
Link: https://patch.msgid.link/20260307191151.3781182-1-maz@kernel.org
Cc: stable@vger.kernel.org
|
|
We already have an ISB in __kvm_at() to make the address translation result
visible to subsequent reads of PAR_EL1. Remove the redundant one right
after it.
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260306074422.47694-1-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
When user_mem_abort() handles a nested stage-2 fault, it truncates
vma_pagesize to respect the guest's mapping size. However, the local
variable vma_shift is never updated to match this new size.
If the underlying host page turns out to be hardware poisoned,
kvm_send_hwpoison_signal() is called with the original, larger
vma_shift instead of the actual mapping size. This signals incorrect
poison boundaries to userspace and breaks hugepage memory poison
containment for nested VMs.
Update vma_shift to match the truncated vma_pagesize when operating
on behalf of a nested hypervisor.
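The shape of the fix (a sketch; vma_pagesize is a power of two, and the
truncation logic itself is elided):
  if (nested) {
      /* vma_pagesize was truncated to the guest's mapping size,
       * so recompute the shift to match: */
      vma_shift = __ffs(vma_pagesize);
  }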
Fixes: fd276e71d1e7 ("KVM: arm64: nv: Handle shadow stage 2 page faults")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260304162222.836152-3-tabba@google.com
[maz: simplified vma_shift assignment from the original patch]
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
When a guest performs an atomic/exclusive operation on memory lacking
the required attributes, user_mem_abort() injects a data abort and
returns early. However, it fails to release the reference to the
host page acquired via __kvm_faultin_pfn().
A malicious guest could repeatedly trigger this fault, leaking host
page references and eventually causing host memory exhaustion (OOM).
Fix this by consolidating the early error returns to a new out_put_page
label that correctly calls kvm_release_page_unused().
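The shape of the consolidated error path (a sketch; the triggering
condition, return value, and surrounding logic are illustrative):
  if (unsupported_atomic_access) {	/* illustrative condition */
      kvm_inject_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
      ret = 1;	/* illustrative return value */
      goto out_put_page;	/* was an early return, leaking the page ref */
  }
  ...
  out_put_page:
      kvm_release_page_unused(page);
      return ret;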
Fixes: 2937aeec9dc5 ("KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory")
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>
Link: https://patch.msgid.link/20260304162222.836152-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Failure to read the descriptor (because it is outside of a memslot) should
result in a SEA being injected in the guest.
Suggested-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/86ms1m9lp3.wl-maz@kernel.org
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-4-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
As per R_BFHQH,
" When an Address size fault is generated, the reported fault code
indicates one of the following:
If the fault was generated due to the TTBR_ELx used in the translation
having nonzero address bits above the OA size, then a fault at level 0. "
Fix the reported Address size fault level as being 0 if the base address is
wrongly programmed by L1.
Fixes: 61e30b9eef7f ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic")
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-3-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
check_base_s2_limits() checks the validity of SL0 and inputsize against
ia_size (inputsize again!) but the pseudocode from DDI0487 G.a
AArch64.TranslationTableWalk() says that we should check against the
implemented PA size.
We would otherwise fail to walk S2 with a valid configuration. E.g.,
granule size = 4KB, inputsize = 40 bits, initial lookup level = 0 (no
concatenation) on a system with a 48-bit PA range is allowed by the
architecture.
Fix it by obtaining the PA size via kvm_get_pa_bits(). Note that
kvm_get_pa_bits() returns the fixed limit for now and should eventually
reflect the per-VM PARange (one day!). Given that the configured PARange
should not be greater than kvm_ipa_limit, this at least fixes the problem
described above.
While at it, inject a level 0 translation fault to guest if
check_base_s2_limits() fails, as per the pseudocode.
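Illustration of the corrected bound (a sketch; stride and level come
from the decoded granule and SL0):
  /* Address bits resolvable when starting the walk at this level: */
  int max_bits = stride * (4 - level) + wi->pgshift;
  /* Check against the implemented PA size, not against inputsize: */
  if (max_bits > kvm_get_pa_bits(kvm))
      return -EFAULT;
With the example above (4KB granule, stride = 9, level 0), max_bits = 48,
which is fine for a 48-bit PA range but was wrongly rejected when compared
against the 40-bit inputsize.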
Fixes: 61e30b9eef7f ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic")
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-2-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
If, for any odd reason, we cannot converge to a mapping size that is
completely contained in a memblock region, we fail to install a S2
mapping and go back to the faulting instruction. Rinse, repeat.
This happens when faulting in regions that are smaller than a page
or that do not have PAGE_SIZE-aligned boundaries (as witnessed on
an O6 board that refuses to boot in protected mode).
In this situation, fallback to using a PAGE_SIZE mapping anyway --
it isn't like we can go any lower.
Fixes: e728e705802fe ("KVM: arm64: Adjust range correctly during host stage-2 faults")
Link: https://lore.kernel.org/r/86wlzr77cn.wl-maz@kernel.org
Cc: stable@vger.kernel.org
Cc: Quentin Perret <qperret@google.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://patch.msgid.link/20260305132751.2928138-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
If vgic_allocate_private_irqs_locked() fails for any odd reason,
we exit kvm_vgic_create() early, leaving dist->rd_regions uninitialised.
kvm_vgic_dist_destroy() then comes along and walks into the weeds
trying to free the RDs. Got to love this stuff.
Solve it by moving all the static initialisation early, and making
sure that if we fail halfway, we're in reasonable shape to
perform the rest of the teardown. While at it, reset the vgic model
on failure, just in case...
Reported-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com
Tested-by: syzbot+f6a46b038fc243ac0175@syzkaller.appspotmail.com
Fixes: b3aa9283c0c50 ("KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()")
Link: https://lore.kernel.org/r/69a2d58c.050a0220.3a55be.003b.GAE@google.com
Link: https://patch.msgid.link/20260228164559.936268-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
|
|
Pull kvm fixes from Paolo Bonzini:
"Arm:
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature
set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not be idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered... [his words, not mine]
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Generic:
- Remove internal Kconfigs that are now set on all architectures
- Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all
architectures finally enable it in Linux 7.0"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: always define KVM_CAP_SYNC_MMU
KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: arm64: Deduplicate ASID retrieval code
irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag
KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs
KVM: arm64: Fix protected mode handling of pages larger than 4kB
KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type
KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state()
KVM: arm64: Fix ID register initialization for non-protected pKVM guests
KVM: arm64: Optimise away S1POE handling when not supported by host
KVM: arm64: Hide S1POE from guests when not supported by the host
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.0, take #1
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not be idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered...
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
|
|
KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always
available. Move the definition from individual architectures to common
code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
All architectures now use MMU notifier for KVM page table management.
Remove the Kconfig symbol and the code that is used when it is
disabled.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several
errata where broadcast TLBI;DSB sequences don't provide all the
architecturally required synchronization. The workaround performs more
work than necessary, and can have significant overhead. This patch
optimizes the workaround, as explained below.
The workaround was originally added for Qualcomm Falkor erratum 1009 in
commit:
d9ff80f83ecb ("arm64: Work around Falkor erratum 1009")
As noted in the message for that commit, the workaround is applied even
in cases where it is not strictly necessary.
The workaround was later reused without changes for:
* Arm Cortex-A76 erratum #1286807
SDEN v33: https://developer.arm.com/documentation/SDEN-885749/33-0/
* Arm Cortex-A55 erratum #2441007
SDEN v16: https://developer.arm.com/documentation/SDEN-859338/1600/
* Arm Cortex-A510 erratum #2441009
SDEN v19: https://developer.arm.com/documentation/SDEN-1873351/1900/
The important details to note are as follows:
1. All relevant errata only affect the ordering and/or completion of
memory accesses which have been translated by an invalidated TLB
entry. The actual invalidation of TLB entries is unaffected.
2. The existing workaround is applied to both broadcast and local TLB
invalidation, whereas for all relevant errata it is only necessary to
apply a workaround for broadcast invalidation.
3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI
sequence, whereas for all relevant errata it is only necessary to
execute a single additional TLBI;DSB sequence after any number of
TLBIs are completed by a DSB.
For example, for a sequence of batched TLBIs:
TLBI <op1>[, <arg1>]
TLBI <op2>[, <arg2>]
TLBI <op3>[, <arg3>]
DSB ISH
... the existing workaround will expand this to:
TLBI <op1>[, <arg1>]
DSB ISH // additional
TLBI <op1>[, <arg1>] // additional
TLBI <op2>[, <arg2>]
DSB ISH // additional
TLBI <op2>[, <arg2>] // additional
TLBI <op3>[, <arg3>]
DSB ISH // additional
TLBI <op3>[, <arg3>] // additional
DSB ISH
... whereas it is sufficient to have:
TLBI <op1>[, <arg1>]
TLBI <op2>[, <arg2>]
TLBI <op3>[, <arg3>]
DSB ISH
TLBI <opX>[, <argX>] // additional
DSB ISH // additional
Using a single additional TLBI and DSB at the end of the sequence can
have significantly lower overhead as each DSB which completes a TLBI
must synchronize with other PEs in the system, with potential
performance effects both locally and system-wide.
4. The existing workaround repeats each specific TLBI operation, whereas
for all relevant errata it is sufficient for the additional TLBI to
use *any* operation which will be broadcast, regardless of which
translation regime or stage of translation the operation applies to.
For example, for a single TLBI:
TLBI ALLE2IS
DSB ISH
... the existing workaround will expand this to:
TLBI ALLE2IS
DSB ISH
TLBI ALLE2IS // additional
DSB ISH // additional
... whereas it is sufficient to have:
TLBI ALLE2IS
DSB ISH
TLBI VALE1IS, XZR // additional
DSB ISH // additional
As the additional TLBI doesn't have to match a specific earlier TLBI,
the additional TLBI can be implemented in separate code, with no
memory of the earlier TLBIs. The additional TLBI can also use a
cheaper TLBI operation.
5. The existing workaround is applied to both Stage-1 and Stage-2 TLB
invalidation, whereas for all relevant errata it is only necessary to
apply a workaround for Stage-1 invalidation.
Architecturally, TLBI operations which invalidate only Stage-2
information (e.g. IPAS2E1IS) are not required to invalidate TLB
entries which combine information from Stage-1 and Stage-2
translation table entries, and consequently may not complete memory
accesses translated by those combined entries. In these cases,
completion of memory accesses is only guaranteed after subsequent
invalidation of Stage-1 information (e.g. VMALLE1IS).
Taking the above points into account, this patch reworks the workaround
logic to reduce overhead:
* New __tlbi_sync_s1ish() and __tlbi_sync_s1ish_hyp() functions are
added and used in place of any dsb(ish) which is used to complete
broadcast Stage-1 TLB maintenance. When the
ARM64_WORKAROUND_REPEAT_TLBI workaround is enabled, these helpers will
execute an additional TLBI;DSB sequence (sketched after this list).
For consistency, it might make sense to add __tlbi_sync_*() helpers
for local and stage 2 maintenance. For now I've left those with
open-coded dsb() to keep the diff small.
* The duplication of TLBIs in __TLBI_0() and __TLBI_1() is removed. This
is no longer needed as the necessary synchronization will happen in
__tlbi_sync_s1ish() or __tlbi_sync_s1ish_hyp().
* The additional TLBI operation is chosen to have minimal impact:
- __tlbi_sync_s1ish() uses "TLBI VALE1IS, XZR". This is only used at
EL1 or at EL2 with {E2H,TGE}=={1,1}, where it will target an unused
entry for the reserved ASID in the kernel's own translation regime,
and have no adverse effect.
- __tlbi_sync_s1ish_hyp() uses "TLBI VALE2IS, XZR". This is only used
in hyp code, where it will target an unused entry in the hyp code's
TTBR0 mapping, and should have no adverse effect.
* As __TLBI_0() and __TLBI_1() no longer replace each TLBI with a
TLBI;DSB;TLBI sequence, batching TLBIs is worthwhile, and there's no
need for arch_tlbbatch_should_defer() to consider
ARM64_WORKAROUND_REPEAT_TLBI.
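The shape of the new helper, per the description above (a sketch; the
actual alternatives plumbing may differ):
  static inline void __tlbi_sync_s1ish(void)
  {
      dsb(ish);
      if (alternative_has_cap_unlikely(ARM64_WORKAROUND_REPEAT_TLBI)) {
          /* Any broadcast S1 operation will do; VALE1IS with XZR
           * targets an unused entry for the reserved ASID. */
          __tlbi(vale1is, 0);
          dsb(ish);
      }
  }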
When building defconfig with GCC 15.1.0, compared to v6.19-rc1, this
patch saves ~1KiB of text, makes the vmlinux ~42KiB smaller, and makes
the resulting Image 64KiB smaller:
| [mark@lakrids:~/src/linux]% size vmlinux-*
| text data bss dec hex filename
| 21179831 19660919 708216 41548966 279fca6 vmlinux-after
| 21181075 19660903 708216 41550194 27a0172 vmlinux-before
| [mark@lakrids:~/src/linux]% ls -l vmlinux-*
| -rwxr-xr-x 1 mark mark 157771472 Feb 4 12:05 vmlinux-after
| -rwxr-xr-x 1 mark mark 157815432 Feb 4 12:05 vmlinux-before
| [mark@lakrids:~/src/linux]% ls -l Image-*
| -rw-r--r-- 1 mark mark 41007616 Feb 4 12:05 Image-after
| -rw-r--r-- 1 mark mark 41073152 Feb 4 12:05 Image-before
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
We currently have three versions of the ASID retrieval code, one
in the S1 walker, and two in the VNCR handling (although the last
two are limited to the EL2&0 translation regime).
Make this code common, and take this opportunity to also simplify
the code a bit while switching over to the TTBRx_EL1_ASID macro.
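The common helper boils down to a field extraction (a sketch; the helper
name is illustrative, and TTBRx_EL1_ASID is the macro mentioned above):
  static u16 ttbr_to_asid(u64 ttbr)
  {
      return SYS_FIELD_GET(TTBRx_EL1, ASID, ttbr);
  }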
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260225104718.14209-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Commit 0c4762e26879 ("KVM: arm64: nv: Avoid NV stage-2 code when NV is
not supported") added an early return to several functions in
arch/arm64/kvm/nested.c to prevent a UBSAN shift-out-of-bounds error
when accessing the pgt union for non-nested VMs.
However, this early return was inadvertently applied to
kvm_arch_flush_shadow_all() as well, causing it to skip the call to
kvm_uninit_stage2_mmu(kvm) for all non-nested VMs.
For pKVM, skipping this teardown means the host never unshares the
guest's memory with the EL2 hypervisor. When the host kernel later
recycles these leaked pages for a new VM, it attempts to re-share them.
The hypervisor correctly rejects this with -EPERM, triggering a host
WARN_ON and hanging the guest.
Fix this by dropping the early return from kvm_arch_flush_shadow_all().
The for-loop guarding the nested MMU cleanup already bounds itself when
nested_mmus_size == 0, allowing execution to proceed to
kvm_uninit_stage2_mmu() as intended.
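The resulting shape (a sketch, with the per-MMU teardown body elided):
  void kvm_arch_flush_shadow_all(struct kvm *kvm)
  {
      int i;
      /* Self-bounding for non-NV VMs, where nested_mmus_size == 0: */
      for (i = 0; i < kvm->arch.nested_mmus_size; i++)
          ;	/* tear down the shadow stage-2 MMU at index i */
      kvm_uninit_stage2_mmu(kvm);	/* no longer skipped */
  }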
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/60916cb6-f460-4751-b910-f63c58700ad0@sirena.org.uk/
Fixes: 0c4762e26879 ("KVM: arm64: nv: Avoid NV stage-2 code when NV is not supported")
Signed-off-by: Fuad Tabba <tabba@google.com>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://patch.msgid.link/20260222083352.89503-1-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Since 3669ddd8fa8b5 ("KVM: arm64: Add a range to pkvm_mappings"),
pKVM tracks the memory that has been mapped into a guest in a
side data structure. Crucially, it uses it to find out whether
a page has already been mapped, and therefore refuses to map it
twice. So far, so good.
However, this very patch completely breaks non-4kB page support,
with guests being unable to boot. The most obvious symptom is that
we take the same fault repeatedly, never making forward progress.
A quick investigation shows that this is because of the above
rejection code.
As it turns out, there are multiple issues at play:
- while the HPFAR_EL2 register gives you the faulting IPA minus
the bottom 12 bits, it will still give you the extra bits that
are part of the page offset for anything larger than 4kB,
even for a level-3 mapping
- pkvm_pgtable_stage2_map() assumes that the address passed as
a parameter is aligned to the size of the intended mapping
- the faulting address is only aligned for a non-page mapping
When the planets are suitably aligned (pun intended), the guest
faults on a page by accessing it past the bottom 4kB, and extra bits
get set in the HPFAR_EL2 register. If this results in a page mapping
(which is likely with large granule sizes), nothing aligns it further
down, and pkvm_mapping_iter_first() finds an intersection that
doesn't really exist. We assume this is a spurious fault and return
-EAGAIN. And again...
This doesn't hit outside of the protected code, as the page table
code always aligns the IPA down to a page boundary, hiding the issue
for everyone else.
Fix it by always forcing the alignment of the faulting address on
vma_pagesize, irrespective of the value of vma_pagesize.
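The fix amounts to a single alignment (a sketch; fault_ipa and
vma_pagesize as used in user_mem_abort()):
  /* HPFAR_EL2 only strips the bottom 12 bits, so force the faulting
   * IPA down to the mapping granularity, even for a page mapping: */
  fault_ipa &= ~(vma_pagesize - 1);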
Fixes: 3669ddd8fa8b5 ("KVM: arm64: Add a range to pkvm_mappings")
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260222141000.3084258-1-maz@kernel.org
Cc: stable@vger.kernel.org
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "struct gic_kvm_info", but the returned type,
while matching, is const qualified. To get them exactly matching, just
use the dereferenced pointer for the sizeof().
Link: https://patch.msgid.link/20260206223022.it.052-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "struct gic_kvm_info", but the returned type,
while matching, is const qualified. To get them exactly matching, just
use the dereferenced pointer for the sizeof().
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20260206223022.it.052-kees@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The `sve_state` pointer in `hyp_vcpu->vcpu.arch` is initialized as a
hypervisor virtual address during vCPU initialization in
`pkvm_vcpu_init_sve()`.
`unpin_host_sve_state()` calls `kern_hyp_va()` on this address. Since
`kern_hyp_va()` is idempotent, it's not a bug. However, it is
unnecessary and potentially confusing. Remove the redundant conversion.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260213143815.1732675-5-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
In protected mode, the hypervisor maintains a separate instance of
the `kvm` structure for each VM. For non-protected VMs, this structure is
initialized from the host's `kvm` state.
Currently, `pkvm_init_features_from_host()` copies the
`KVM_ARCH_FLAG_ID_REGS_INITIALIZED` flag from the host without the
underlying `id_regs` data being initialized. This results in the
hypervisor seeing the flag as set while the ID registers remain zeroed.
Consequently, `kvm_has_feat()` checks at EL2 fail (return 0) for
non-protected VMs. This breaks logic that relies on feature detection,
such as `ctxt_has_tcrx()` for TCR2_EL1 support. As a result, certain
system registers (e.g., TCR2_EL1, PIR_EL1, POR_EL1) are not
saved/restored during the world switch, which could lead to state
corruption.
Fix this by explicitly copying the ID registers from the host `kvm` to
the hypervisor `kvm` for non-protected VMs during initialization, since
we trust the host with its non-protected guests' features. Also ensure
`KVM_ARCH_FLAG_ID_REGS_INITIALIZED` is cleared initially in
`pkvm_init_features_from_host` so that `vm_copy_id_regs` can properly
initialize them and set the flag once done.
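A sketch of the resulting initialisation order (vm_copy_id_regs and the
flag as named above; the exact call sites are elided):
  /* Don't blindly inherit the host's INITIALIZED flag... */
  clear_bit(KVM_ARCH_FLAG_ID_REGS_INITIALIZED, &hyp_vm->kvm.arch.flags);
  /* ...copy the ID registers from the host, then set the flag. */
  vm_copy_id_regs(hyp_vm, host_kvm);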
Fixes: 41d6028e28bd ("KVM: arm64: Convert the SVE guest vcpu flag to a vm flag")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260213143815.1732675-4-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
When CONFIG_ARM64_POE is disabled, KVM does not save/restore POR_EL1.
However, ID_AA64MMFR3_EL1 sanitisation currently exposes the feature to
guests whenever the hardware supports it, ignoring the host kernel
configuration.
If a guest detects this feature and attempts to use it, the host will
fail to context-switch POR_EL1, potentially leading to state corruption.
Fix this by masking ID_AA64MMFR3_EL1.S1POE in the sanitised system
registers, preventing KVM from advertising the feature when the host
does not support it (i.e. system_supports_poe() is false).
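The gist of the masking (a sketch against the ID_AA64MMFR3_EL1
sanitisation; the mask name follows the generated sysreg definitions):
  /* Hide S1POE if the host kernel can't context-switch POR_EL1. */
  if (!system_supports_poe())
      val &= ~ID_AA64MMFR3_EL1_S1POE_MASK;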
Fixes: 70ed7238297f ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260213143815.1732675-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
KVM mediated PMU support for 6.20
Add support for mediated PMUs, where KVM gives the guest full ownership of PMU
hardware (context-switched around the fastpath run loop) and allows direct
access to data MSRs and PMCs (restricted by the vPMU model), but intercepts
access to control registers, e.g. to enforce event filtering and to prevent the
guest from profiling sensitive host state.
To keep overall complexity reasonable, mediated PMU usage is all or nothing
for a given instance of KVM (controlled via module param). The mediated PMU
is disabled by default, partly to maintain backwards compatibility for existing
setups, partly because there are tradeoffs when running with a mediated PMU that
may be non-starters for some use cases, e.g. the host loses the ability to
profile guests with mediated PMUs, the fastpath run loop is also a blind spot,
entry/exit transitions are more expensive, etc.
Versus the emulated PMU, where KVM is "just another perf user", the mediated
PMU delivers more accurate profiling and monitoring (no risk of contention and
thus dropped events), with significantly less overhead (fewer exits and faster
emulation/programming of event selectors). E.g. when running Specint-2017 on
a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from
within the guest:
Perf command:
a. basic-sampling: perf record -F 1000 -e 6-instructions -a --overwrite
b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite
Guest performance overhead:
---------------------------------------------------------------------------
| Test case | emulated vPMU | all passthrough | passthrough with |
| | | | event filters |
---------------------------------------------------------------------------
| basic-sampling | 33.62% | 4.24% | 6.21% |
---------------------------------------------------------------------------
| multiplex-sampling | 79.32% | 7.34% | 10.45% |
---------------------------------------------------------------------------
|
|
* kvm-arm64/misc-6.20:
: .
: Misc KVM/arm64 changes for 6.20
:
: - Trivial FPSIMD cleanups
:
: - Calculate hyp VA size only once, avoiding potential mapping issues when
: VA bits is smaller than expected
:
: - Silence sparse warning for the HYP stack base
:
: - Fix error checking when handling FFA_VERSION
:
: - Add missing trap configuration for DBGWCR15_EL1
:
: - Don't try to deal with nested S2 when NV isn't enabled for a guest
:
: - Various spelling fixes
: .
KVM: arm64: nv: Avoid NV stage-2 code when NV is not supported
KVM: arm64: Fix various comments
KVM: arm64: nv: Add trap config for DBGWCR<15>_EL1
KVM: arm64: Fix error checking for FFA_VERSION
KVM: arm64: Fix missing <asm/stackpage/nvhe.h> include
KVM: arm64: Calculate hyp VA size only once
KVM: arm64: Remove ISB after writing FPEXC32_EL2
KVM: arm64: Shuffle KVM_HOST_DATA_FLAG_* indices
KVM: arm64: Fix comment in fpsimd_lazy_switch_to_host()
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/resx:
: .
: Add infrastructure to deal with the full gamut of RESx bits
: for NV. As a result, it is now possible to have the expected
: semantics for some bits such as SCTLR_EL2.SPAN.
: .
KVM: arm64: Add debugfs file dumping computed RESx values
KVM: arm64: Add sanitisation to SCTLR_EL2
KVM: arm64: Remove all traces of HCR_EL2.MIOCNCE
KVM: arm64: Remove all traces of FEAT_TME
KVM: arm64: Simplify handling of full register invalid constraint
KVM: arm64: Get rid of FIXED_VALUE altogether
KVM: arm64: Simplify handling of HCR_EL2.E2H RESx
KVM: arm64: Move RESx into individual register descriptors
KVM: arm64: Add RES1_WHEN_E2Hx constraints as configuration flags
KVM: arm64: Add REQUIRES_E2H1 constraint as configuration flags
KVM: arm64: Simplify FIXED_VALUE handling
KVM: arm64: Convert HCR_EL2.RW to AS_RES1
KVM: arm64: Correctly handle SCTLR_EL1 RES1 bits for unsupported features
KVM: arm64: Allow RES1 bits to be inferred from configuration
KVM: arm64: Inherit RESx bits from FGT register descriptors
KVM: arm64: Extend unified RESx handling to runtime sanitisation
KVM: arm64: Introduce data structure tracking both RES0 and RES1 bits
KVM: arm64: Introduce standalone FGU computing primitive
KVM: arm64: Remove duplicate configuration for SCTLR_EL1.{EE,E0E}
arm64: Convert SCTLR_EL2 to sysreg infrastructure
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/debugfs-fixes:
: .
: Cleanup of the debugfs iterators, which are way more complicated
: than they ought to be, courtesy of Fuad Tabba. From the cover letter:
:
: "This series refactors the debugfs implementations for `idregs` and
: `vgic-state` to use standard `seq_file` iterator patterns.
:
: The existing implementations relied on storing iterator state within
: global VM structures (`kvm_arch` and `vgic_dist`). This approach
: prevented concurrent reads of the debugfs files (returning -EBUSY) and
: created improper dependencies between transient file operations and
: long-lived VM state."
: .
KVM: arm64: Use standard seq_file iterator for vgic-debug debugfs
KVM: arm64: Reimplement vgic-debug XArray iteration
KVM: arm64: Use standard seq_file iterator for idregs debugfs
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/gicv5-prologue:
: .
: Prologue to GICv5 support, courtesy of Sascha Bischoff.
:
: This is preliminary work that sets the scene for the full-blown
: support.
: .
irqchip/gic-v5: Check if impl is virt capable
KVM: arm64: gic: Set vgic_model before initing private IRQs
arm64/sysreg: Drop ICH_HFGRTR_EL2.ICC_HAPR_EL1 and make RES1
KVM: arm64: gic-v3: Switch vGIC-v3 to use generated ICH_VMCR_EL2
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/gicv3-tdir-fixes:
: .
: Address two trapping-related issues when running legacy (i.e. GICv3)
: guests on GICv5 hosts, courtesy of Sascha Bischoff.
: .
KVM: arm64: Correct test for ICH_HCR_EL2_TDIR cap for GICv5 hosts
KVM: arm64: gic: Enable GICv3 CPUIF trapping on GICv5 hosts if required
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/fwb-for-all:
: .
: Allow pKVM's host stage-2 mappings to use the Force Write Back version
: of the memory attributes by using the "pass-through" encoding.
:
: This avoids having two separate encodings for S2 on a given platform.
: .
KVM: arm64: Simplify PAGE_S2_MEMATTR
KVM: arm64: Kill KVM_PGTABLE_S2_NOFWB
KVM: arm64: Switch pKVM host S2 over to KVM_PGTABLE_S2_AS_S1
KVM: arm64: Add KVM_PGTABLE_S2_AS_S1 flag
arm64: Add MT_S2{,_FWB}_AS_S1 encodings
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/pkvm-no-mte:
: .
: pKVM updates preventing the host from using MTE-related system
: registers when the feature is disabled from the kernel
: command-line (arm64.nomte), courtesy of Fuad Tabba.
:
: From the cover letter:
:
: "If MTE is supported by the hardware (and is enabled at EL3), it remains
: available to lower exception levels by default. Disabling it in the host
: kernel (e.g., via 'arm64.nomte') only stops the kernel from advertising
: the feature; it does not physically disable MTE in the hardware.
:
: The ability to disable MTE in the host kernel is used by some systems,
: such as Android, so that the physical memory otherwise used as tag
: storage can be used for other things (i.e. treated just like the rest of
: memory). In this scenario, a malicious host could still access tags in
: pages donated to a guest using MTE instructions (e.g., STG and LDG),
: bypassing the kernel's configuration."
: .
KVM: arm64: Use kvm_has_mte() in pKVM trap initialization
KVM: arm64: Inject UNDEF when accessing MTE sysregs with MTE disabled
KVM: arm64: Trap MTE access and discovery when MTE is disabled
KVM: arm64: Remove dead code resetting HCR_EL2 for pKVM
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Computing RESx values is hard. Verifying that they are correct is
harder. Add a debugfs file called "resx" that will dump all the RESx
values for a given VM.
I found it useful, maybe you will too.
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-21-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Sanitise SCTLR_EL2 the usual way. The most important aspect of
this is that we benefit from SCTLR_EL2.SPAN being RES1 when
HCR_EL2.E2H==0.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-20-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
MIOCNCE had the potential to eat your data, and also was never
implemented by anyone. It's been retrospectively removed from
the architecture, and we're happy to follow that lead.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-19-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
FEAT_TME has been dropped from the architecture. Retrospectively.
I'm sure someone is crying somewhere, but most of us won't.
Clean-up time.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-18-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Now that we embed the RESx bits in the register description, it becomes
easier to deal with registers that are simply not valid, as their
existence is not satisfied by the configuration (SCTLR2_ELx without
FEAT_SCTLR2, for example). Such registers essentially become RES0 for
any bit that wasn't already advertised as RESx.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-17-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We have now killed every occurrence of FIXED_VALUE, and we can therefore
drop the whole infrastructure. Good riddance.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-16-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Now that we can link the RESx behaviour with the value of HCR_EL2.E2H,
we can trivially express the tautological constraint that makes E2H
a reserved value at all times.
Fun, isn't it?
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-15-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Instead of hacking the RES1 bits at runtime, move them into the
register descriptors. This makes it significantly nicer.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-14-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
"Thanks" to VHE, SCTLR_EL2 radically changes shape depending on the
value of HCR_EL2.E2H, as a lot of the bits that didn't have much
meaning with E2H=0 start impacting EL0 with E2H=1.
This has a direct impact on the RESx behaviour of these bits, and
we need a way to express them.
For this purpose, introduce two new constraints that, when the
controlling feature is not present, force the field to RES1 depending
on the value of E2H. Note that RES0 is still implicit.
This allows diverging RESx values depending on the value of E2H,
something that is required by a bunch of SCTLR_EL2 bits.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-13-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
A bunch of EL2 configurations are very similar to their EL1 counterparts,
with the added constraint that HCR_EL2.E2H is 1.
For us, this means HCR_EL2.E2H being RES1, which is something we can
statically evaluate.
Add a REQUIRES_E2H1 constraint, which allows us to express conditions
in a much simpler way (without extra code). Existing occurrences are
converted, before we add a lot more.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-12-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The FIXED_VALUE qualifier (mostly used for HCR_EL2) is pointlessly
complicated, as it tries to piggy-back on the previous RES0 handling
while being done in a different phase, on different data.
Instead, make it an integral part of the RESx computation, and allow
it to directly set RESx bits. This is much easier to understand.
It also paves the way for some additional changes that will allow
the full removal of the FIXED_VALUE handling.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-11-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Now that we have the AS_RES1 constraint, it becomes trivial to express
the HCR_EL2.RW behaviour.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-10-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
A bunch of SCTLR_EL1 bits must be set to RES1 when the controlling
feature is not present. Add the AS_RES1 qualifier where needed.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-9-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|