linux-toradex.git/arch/x86, branch v6.19

x86/vmware: Fix hypercall clobbers

2026-02-06T22:51:03+00:00

Fedora QA reported the following panic:

  BUG: unable to handle page fault for address: 0000000040003e54
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20251119-3.fc43 11/19/2025
  RIP: 0010:vmware_hypercall4.constprop.0+0x52/0x90
  ..
  Call Trace:
   vmmouse_report_events+0x13e/0x1b0
   psmouse_handle_byte+0x15/0x60
   ps2_interrupt+0x8a/0xd0
   ...

because the QEMU VMware mouse emulation is buggy, and clears the top 32
bits of %rdi that the kernel kept a pointer in.

The QEMU vmmouse driver saves and restores the register state in a
"uint32_t data[6];" and as a result restores the state with the high
bits all cleared.

RDI originally contained the value of a valid kernel stack address
(0xff5eeb3240003e54).  After the vmware hypercall it now contains
0x40003e54, and we get a page fault as a result when it is dereferenced.
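
The truncation can be reproduced outside of QEMU.  The following is a
minimal sketch, not the QEMU code: the function name, the register count
and the slot index used for RDI are assumptions made for illustration.  It
only shows that a save/restore through 32-bit slots drops the upper half
of a 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: six saved registers, RDI in the last slot. */
enum { SKETCH_REG_COUNT = 6, SKETCH_REG_RDI = 5 };

/*
 * Save 64-bit register values into 32-bit slots and restore them,
 * modelling a save area declared as "uint32_t data[6];".
 */
void sketch_roundtrip_through_u32(uint64_t regs[SKETCH_REG_COUNT])
{
	uint32_t data[SKETCH_REG_COUNT];
	int i;

	for (i = 0; i < SKETCH_REG_COUNT; i++)
		data[i] = (uint32_t)regs[i];	/* upper 32 bits are lost here */

	for (i = 0; i < SKETCH_REG_COUNT; i++)
		regs[i] = data[i];		/* zero-extended on restore */
}
```

Feeding 0xff5eeb3240003e54 through the round trip yields 0x40003e54, the
faulting address in the report above.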

The proper fix would be in QEMU, but this works around the issue in the
kernel to keep old setups working.  Older kernels simply happened not to
keep any state in %rdi across the hypercall.

In theory this same issue exists for all the hypercalls in the vmmouse
driver; in practice it has only been seen with vmware_hypercall3() and
vmware_hypercall4().  For now, just mark RDI/RSI as clobbered for those
two calls.  This should have a minimal effect on code generation overall
as it should be rare for the compiler to want to make RDI/RSI live
across hypercalls.

Reported-by: Justin Forbes 
Link: https://lore.kernel.org/all/99a9c69a-fc1a-43b7-8d1e-c42d6493b41f@broadcom.com/
Signed-off-by: Josh Poimboeuf 
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Merge tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2026-02-05T00:04:00+00:00

Pull misc fixes from Andrew Morton:
 "Five hotfixes.  Two are cc:stable, two are for MM.

  All are singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Documentation: document liveupdate cmdline parameter
  mm, shmem: prevent infinite loop on truncate race
  mailmap: update Alexander Mikhalitsyn's emails
  liveupdate: luo_file: do not clear serialized_data on unfreeze
  x86/kfence: fix booting on 32bit non-PAE systems

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

2026-02-04T18:38:56+00:00

Pull KVM fixes from Paolo Bonzini:

 - Fix a bug where AVIC is incorrectly inhibited when running with
   x2AVIC disabled via module param (or on a system without x2AVIC)

 - Fix a dangling device posted IRQs bug by explicitly checking if the
   irqfd is still active (on the list) when handling an eventfd signal,
   instead of zeroing the irqfd's routing information when the irqfd is
   deassigned.

   Zeroing the irqfd's routing info causes arm64 and x86 to not
   disable posting for the IRQ (kvm_arch_irq_bypass_del_producer() looks
   for an MSI), incorrectly leaving the IRQ in posted mode (and leading
   to use-after-free and memory leaks on AMD in particular).

   This is both the most pressing and scariest, but it's been in -next
   for a while.

 - Disable FORTIFY_SOURCE for KVM selftests to prevent the compiler from
   generating calls to the checked versions of memset() and friends,
   which leads to unexpected page faults in guest code due to e.g.
   __memset_chk@plt not being resolved.

 - Explicitly configure the supported XSS capabilities from within
   {svm,vmx}_set_cpu_caps() to fix a bug where VMX will compute the
   reference VMCS configuration with SHSTK and IBT enabled, but then
   compute each CPU's local config with SHSTK and IBT disabled if not all
   CET xfeatures are enabled, e.g. if the kernel is built with
   X86_KERNEL_IBT=n.

   The mismatch in features results in differing nVMX settings, and
   ultimately causes kvm-intel.ko to refuse to load with nested=1.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Explicitly configure supported XSS from {svm,vmx}_set_cpu_caps()
  KVM: selftests: Add -U_FORTIFY_SOURCE to avoid some unpredictable test failures
  KVM: x86: Assert that non-MSI doesn't have bypass vCPU when deleting producer
  KVM: Don't clobber irqfd routing type when deassigning irqfd
  KVM: SVM: Check vCPU ID against max x2AVIC ID if and only if x2AVIC is enabled

Merge tag 'kvm-x86-fixes-6.19-rc8' of https://github.com/kvm-x86/linux into HEAD

2026-02-04T17:30:32+00:00

Final KVM fixes for 6.19:

 - Fix a bug where AVIC is incorrectly inhibited when running with x2AVIC
   disabled via module param (or on a system without x2AVIC).

 - Fix a dangling device posted IRQs bug by explicitly checking if the irqfd is
   still active (on the list) when handling an eventfd signal, instead of
   zeroing the irqfd's routing information when the irqfd is deassigned.
   Zeroing the irqfd's routing info causes arm64 and x86 to not disable
   posting for the IRQ (kvm_arch_irq_bypass_del_producer() looks for an MSI),
   incorrectly leaving the IRQ in posted mode (and leading to use-after-free
   and memory leaks on AMD in particular).

   This is both the most pressing and scariest, but it's been in -next for
   a while.

 - Disable FORTIFY_SOURCE for KVM selftests to prevent the compiler from
   generating calls to the checked versions of memset() and friends, which
   leads to unexpected page faults in guest code due to e.g. __memset_chk@plt
   not being resolved.

 - Explicitly configure the supported XSS from within {svm,vmx}_set_cpu_caps()
   to fix a bug where VMX will compute the reference VMCS configuration with
   SHSTK and IBT enabled, but then compute each CPU's local config with SHSTK
   and IBT disabled if not all CET xfeatures are enabled, e.g. if the kernel is
   built with X86_KERNEL_IBT=n.  The mismatch in features results in differing
   nVMX settings, and ultimately causes kvm-intel.ko to refuse to load with
   nested=1.

x86/kfence: fix booting on 32bit non-PAE systems

2026-02-03T02:43:55+00:00

The original patch inverted the PTE unconditionally to avoid
L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level
paging.

Adjust the logic to use the flip_protnone_guard() helper, which is a nop
on 2-level paging but inverts the address bits in all other paging modes.
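
A simplified model of that behaviour, for illustration only: this is not
the kernel helper, and the names and the 64-bit address mask below are
assumptions.  Toggling the present bit also inverts the address bits,
unless the paging mode does not use inversion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_PRESENT	0x1ULL
/* Assumed address field: bits 12..51 of a 64-bit PTE. */
#define SKETCH_PFN_MASK		0x000ffffffffff000ULL

/*
 * Toggle the present bit of a PTE value.  When the paging mode uses
 * inversion, flip the address bits as well so that a non-present PTE
 * never carries a usable frame address.
 */
uint64_t sketch_toggle_present(uint64_t pte, bool invert_supported)
{
	uint64_t next = pte ^ SKETCH_PAGE_PRESENT;

	if (invert_supported)
		next ^= SKETCH_PFN_MASK;

	return next;	/* unchanged address bits when inversion is off */
}
```

Applying the toggle twice returns the original value in either mode,
which is what makes the round trip between present and non-present safe.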

This doesn't matter for the Xen aspect of the original change.  Linux no
longer supports running 32bit PV under Xen, and Xen doesn't support
running any 32bit PV guests without using PAE paging.

Link: https://lkml.kernel.org/r/20260126211046.2096622-1-andrew.cooper3@citrix.com
Fixes: b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
Reported-by: Ryusuke Konishi 
Closes: https://lore.kernel.org/lkml/CAKFNMokwjw68ubYQM9WkzOuH51wLznHpEOMSqtMoV1Rn9JV_gw@mail.gmail.com/
Signed-off-by: Andrew Cooper 
Tested-by: Ryusuke Konishi 
Tested-by: Borislav Petkov (AMD) 
Cc: Alexander Potapenko 
Cc: Marco Elver 
Cc: Dmitry Vyukov 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 
Cc: Jann Horn 
Cc: 
Signed-off-by: Andrew Morton

KVM: x86: Explicitly configure supported XSS from {svm,vmx}_set_cpu_caps()

2026-01-30T21:27:33+00:00

Explicitly configure KVM's supported XSS as part of each vendor's setup
flow to fix a bug where clearing SHSTK and IBT in kvm_cpu_caps, e.g. due
to lack of CET XFEATURE support, prevents kvm-intel.ko from loading when
nested VMX is enabled, i.e. when nested=1.  The late clearing results in
nested_vmx_setup_{entry,exit}_ctls() clearing VM_{ENTRY,EXIT}_LOAD_CET_STATE
when nested_vmx_setup_ctls_msrs() runs during the CPU compatibility checks,
ultimately leading to a mismatched VMCS config due to the reference config
having the CET bits set, but every CPU's "local" config having the bits
cleared.

Note, kvm_caps.supported_{xcr0,xss} are unconditionally initialized by
kvm_x86_vendor_init(), before calling into vendor code, and not referenced
between ops->hardware_setup() and their current/old location.

Fixes: 69cc3e886582 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER")
Cc: stable@vger.kernel.org
Cc: Mathias Krause 
Cc: John Allen 
Cc: Rick Edgecombe 
Cc: Chao Gao 
Cc: Binbin Wu 
Cc: Xiaoyao Li 
Reviewed-by: Xiaoyao Li 
Reviewed-by: Binbin Wu 
Link: https://patch.msgid.link/20260128014310.3255561-2-seanjc@google.com
Signed-off-by: Sean Christopherson

Merge tag 'perf-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2026-01-24T17:24:17+00:00

Pull perf events fixes from Ingo Molnar:

 - Fix mmap_count warning & bug when creating a group member event
   with the PERF_FLAG_FD_OUTPUT flag

 - Disable the sample period == 1 branch events BTS optimization
   on guests, because BTS is not virtualized

* tag 'perf-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Do not enable BTS for guests
  perf: Fix refcount warning on event->mmap_count increment

x86: make page fault handling disable interrupts properly

2026-01-23T00:49:17+00:00

There's a big comment in the x86 do_page_fault() about our interrupt
disabling code:

    * User address page fault handling might have reenabled
    * interrupts. Fixing up all potential exit points of
    * do_user_addr_fault() and its leaf functions is just not
    * doable w/o creating an unholy mess or turning the code
    * upside down.

but it turns out that comment is subtly wrong, and the code as a result
is also wrong.

Because it's certainly true that we may have re-enabled interrupts when
handling user page faults.  And it's most certainly true that we don't
want to bother fixing up all the cases.

But what isn't true is that it's limited to user address page faults.

The confusion stems from the fact that we have logic here that depends
on the address range of the access, but other code then depends on the
_context_ the access was done in.  The two are not related, even though
both of them are about user-vs-kernel.

In other words, both user and kernel addresses can cause interrupts to
have been enabled (e.g. when __bad_area_nosemaphore() gets called for user
accesses to kernel addresses).  As a result we should make sure to
disable interrupts again regardless of the address range before
returning to the low-level fault handling code.

The __bad_area_nosemaphore() code actually did disable interrupts again
after enabling them, just not consistently.  Ironically, as noted in the
original comment, fixing up all the cases is just not worth it, when the
simple solution is to just do it unconditionally in one single place.

So remove the incomplete case that unsuccessfully tried to do what the
comment said was "not doable" in commit ca4c6a9858c2 ("x86/traps: Make
interrupt enable/disable symmetric in C code"), and just make it do the
simple and straightforward thing.
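
The difference can be modelled in a few lines.  This is a toy sketch, not
the x86 fault code: the interrupt flag is a plain variable, and all names
are hypothetical.  The handler enables interrupts based on the context of
the access, while the old exit path disabled them based on the address
range.

```c
#include <assert.h>
#include <stdbool.h>

static bool sketch_irqs_enabled;	/* stand-in for the CPU interrupt flag */

bool sketch_irqs_on(void)
{
	return sketch_irqs_enabled;
}

/* The handler may enable interrupts depending on the access context. */
static void sketch_handle_fault(bool user_context)
{
	if (user_context)
		sketch_irqs_enabled = true;
}

/* Old shape: only the user-address path disabled interrupts again. */
void sketch_page_fault_old(bool kernel_address, bool user_context)
{
	sketch_irqs_enabled = false;	/* faults are entered with IRQs off */
	sketch_handle_fault(user_context);
	if (!kernel_address)
		sketch_irqs_enabled = false;
}

/* New shape: disable unconditionally in one place before returning. */
void sketch_page_fault_new(bool kernel_address, bool user_context)
{
	(void)kernel_address;		/* address range no longer matters */
	sketch_irqs_enabled = false;
	sketch_handle_fault(user_context);
	sketch_irqs_enabled = false;
}
```

In this model a user-context access to a kernel address leaves interrupts
enabled with the old shape and disabled with the new one.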

Signed-off-by: Cedric Xing 
Reviewed-by: Dave Hansen 
Fixes: ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code")
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Signed-off-by: Linus Torvalds

perf/x86/intel: Do not enable BTS for guests

2026-01-21T15:28:59+00:00

By default when users program perf to sample branch instructions
(PERF_COUNT_HW_BRANCH_INSTRUCTIONS) with a sample period of 1, perf
interprets this as a special case and enables BTS (Branch Trace Store)
as an optimization to avoid taking an interrupt on every branch.

Since BTS doesn't virtualize, this optimization doesn't make sense when
the request originates from a guest. Add an additional check that
prevents this optimization for virtualized events (exclude_host).
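
The resulting decision can be sketched as below.  The struct, the enum
value and the function name are placeholders for illustration, not the
perf ABI or the actual driver code; only the three conditions come from
the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder event type; the value is arbitrary in this sketch. */
enum { SKETCH_BRANCH_INSTRUCTIONS = 4 };

/* Hypothetical subset of the event attributes that matter here. */
struct sketch_event_attr {
	uint64_t config;
	uint64_t sample_period;
	bool exclude_host;	/* set for guest-only (virtualized) events */
};

/* Decide whether the BTS shortcut may be used for this event. */
bool sketch_use_bts(const struct sketch_event_attr *attr)
{
	if (attr->config != SKETCH_BRANCH_INSTRUCTIONS)
		return false;
	if (attr->sample_period != 1)
		return false;
	if (attr->exclude_host)		/* BTS does not virtualize */
		return false;
	return true;
}
```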

Reported-by: Jan H. Schönherr 
Suggested-by: Peter Zijlstra 
Signed-off-by: Fernand Sieber 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: 
Link: https://patch.msgid.link/20251211183604.868641-1-sieberf@amazon.com

x86/kfence: avoid writing L1TF-vulnerable PTEs

2026-01-19T20:30:02+00:00

For native, the choice of PTE is fine.  There's real memory backing the
non-present PTE.  However, for XenPV, Xen complains:

  (XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing

To explain, some background on XenPV pagetables:

  Xen PV guests control their own pagetables; they choose the new
  PTE value, and use hypercalls to make changes so Xen can audit for
  safety.

  In addition to a regular reference count, Xen also maintains a type
  reference count.  e.g.  SegDesc (referenced by vGDT/vLDT), Writable
  (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a lower
  pagetable level).  This is in order to prevent e.g.  a page being
  inserted into the pagetables for which the guest has a writable mapping.

  For non-present mappings, all other bits become software accessible,
  and typically contain metadata rather than a real frame address.  There is
  nothing that a reference count could sensibly be tied to.  As such, even
  if Xen could recognise the address as currently safe, nothing would
  prevent that frame from changing owner to another VM in the future.

  When Xen detects a PV guest writing a L1TF-PTE, it responds by
  activating shadow paging.  This is normally only used for the live phase
  of migration, and comes with a reasonable overhead.

KFENCE only cares about getting #PF to catch wild accesses; it doesn't
care about the value for non-present mappings.  Use a fully inverted PTE,
to avoid hitting the slow path when running under Xen.

While adjusting the logic, take the opportunity to skip all actions if the
PTE is already in the right state, halve the number of PVOps callouts, and
skip TLB maintenance on a !P -> P transition, which benefits non-Xen cases
too.
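
The control flow described above can be sketched as a small state
function.  This is an illustrative model under stated assumptions, not
the KFENCE implementation: the names, the result struct and the 64-bit
address mask are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KF_SKETCH_PRESENT	0x1ULL
/* Assumed address field: bits 12..51 of a 64-bit PTE. */
#define KF_SKETCH_PFN_MASK	0x000ffffffffff000ULL

struct kf_sketch_result {
	uint64_t pte;		/* resulting PTE value */
	bool wrote_pte;		/* whether the PTE had to be written */
	bool flushed_tlb;	/* whether TLB maintenance was needed */
};

/* Move a PTE to the requested present/non-present state. */
struct kf_sketch_result kf_sketch_set_present(uint64_t pte, bool want_present)
{
	struct kf_sketch_result r = { pte, false, false };
	bool is_present = (pte & KF_SKETCH_PRESENT) != 0;

	if (is_present == want_present)
		return r;	/* already in the right state: do nothing */

	/* Toggle present and fully invert the address bits in one write. */
	r.pte = pte ^ (KF_SKETCH_PRESENT | KF_SKETCH_PFN_MASK);
	r.wrote_pte = true;

	/* Only P -> !P can leave a stale translation in the TLB. */
	r.flushed_tlb = !want_present;
	return r;
}
```

A repeated request for the same state returns without a write or a flush,
and a !P -> P transition writes the PTE but skips the flush.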

Link: https://lkml.kernel.org/r/20260106180426.710013-1-andrew.cooper3@citrix.com
Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
Signed-off-by: Andrew Cooper 
Tested-by: Marco Elver 
Cc: Alexander Potapenko 
Cc: Marco Elver 
Cc: Dmitry Vyukov 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 
Cc: Jann Horn 
Cc: 
Signed-off-by: Andrew Morton