path: root/arch/x86/include/asm
Age   Commit message   Author
6 hours   Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm   Linus Torvalds
Pull KVM updates from Paolo Bonzini:
"Loongarch:
 - Add more CPUCFG mask bits
 - Improve feature detection
 - Add lazy load support for FPU and binary translation (LBT) register state
 - Fix return value for memory reads from and writes to in-kernel devices
 - Add support for detecting preemption from within a guest
 - Add KVM steal time test case to tools/selftests
ARM:
 - Add support for FEAT_IDST, allowing ID registers that are not implemented to be reported as a normal trap rather than as an UNDEF exception
 - Add sanitisation of the VTCR_EL2 register, fixing a number of UXN/PXN/XN bugs in the process
 - Full handling of RESx bits, instead of only RES0, and resulting in SCTLR_EL2 being added to the list of sanitised registers
 - More pKVM fixes for features that are not supposed to be exposed to guests
 - Make sure that MTE being disabled on the pKVM host doesn't give it the ability to attack the hypervisor
 - Allow pKVM's host stage-2 mappings to use the Force Write Back version of the memory attributes by using the "pass-through" encoding
 - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the guest
 - Preliminary work for guest GICv5 support
 - A bunch of debugfs fixes, removing pointless custom iterators stored in guest data structures
 - A small set of FPSIMD cleanups
 - Selftest fixes addressing the incorrect alignment of page allocation
 - Other assorted low-impact fixes and spelling fixes
RISC-V:
 - Fixes for issues discovered by KVM API fuzzing in kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and kvm_riscv_vcpu_aia_imsic_update()
 - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM
 - Transparent huge page support for hypervisor page tables
 - Adjust the number of available guest irq files based on MMIO register sizes found in the device tree or the ACPI tables
 - Add RISC-V specific paging modes to KVM selftests
 - Detect paging mode at runtime for selftests
s390:
 - Performance improvement for vSIE (aka nested virtualization)
 - Completely new memory management. s390 was a special snowflake that enlisted help from the architecture's page table management to build hypervisor page tables, in particular enabling sharing the last level of page tables. This however was a lot of code (~3K lines) in order to support KVM, and also blocked several features. The biggest advantage is that the page size of userspace is completely independent of the page size used by the guest: userspace can mix normal pages, THPs and hugetlbfs as it sees fit, and in fact transparent hugepages were not possible before. It's also now possible to have nested guests and guests with huge pages running on the same host
 - Maintainership change for s390 vfio-pci
 - Small quality of life improvement for protected guests
x86:
 - Add support for giving the guest full ownership of PMU hardware (context switched around the fastpath run loop) and allowing direct access to data MSRs and PMCs (restricted by the vPMU model). KVM still intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state. This is more accurate, since it has no risk of contention and thus dropped events, and also has significantly less overhead.
   For more information, see the commit message for merge commit bf2c3138ae36 ("Merge tag 'kvm-x86-pmu-6.20' ...")
 - Disallow changing the virtual CPU model if L2 is active, for all the same reasons KVM disallows changing the model after the first KVM_RUN
 - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled, even if those were advertised as supported to userspace
 - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs, where KVM would attempt to read CR3 when configuring an async #PF entry
 - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Only a few exports are intended for external usage, and those are allowed explicitly
 - When checking nested events after a vCPU is unblocked, ignore -EBUSY instead of WARNing. Userspace can sometimes put the vCPU into what should be an impossible state, and a spurious exit to userspace on -EBUSY does not really do anything to solve the issue
 - Also throw in the towel and drop the WARN on INIT/SIPI being blocked when a vCPU is in Wait-For-SIPI, which also resulted in playing whack-a-mole with syzkaller stuffing architecturally impossible states into KVM
 - Add support for new Intel instructions that don't require anything beyond enumerating feature flags to userspace
 - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2
 - Add WARNs to guard against modifying KVM's CPU caps outside of the intended setup flow, as nested VMX in particular is sensitive to unexpected changes in KVM's golden configuration
 - Add a quirk to allow userspace to opt in to actually suppress EOI broadcasts when the suppression feature is enabled by the guest (currently limited to split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an option as some userspaces have come to rely on KVM's buggy behavior (KVM advertises Suppress EOI Broadcast irrespective of whether or not the userspace I/O APIC supports Directed EOIs)
 - Clean up KVM's handling of marking mapped vCPU pages dirty
 - Drop a pile of *ancient* sanity checks hidden behind KVM's unused ASSERT() macro, most of which could be trivially triggered by the guest and/or user, and all of which were useless
 - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it more obvious what the weird parameter is used for, and to allow dropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n
 - Bury all of ioapic.h, i8254.h and related ioctls (including KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y
 - Add a regression test for recent APICv update fixes
 - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv() to consolidate the updates, and to co-locate SVI updates with the updates for KVM's own cache of ISR information
 - Drop a dead function declaration
 - Minor cleanups
x86 (Intel):
 - Rework KVM's handling of VMCS updates while L2 is active to temporarily switch to vmcs01 instead of deferring the update until the next nested VM-Exit. The deferred updates approach directly contributed to several bugs, was proving to be a maintenance burden due to the difficulty in auditing the correctness of deferred updates, and was polluting "struct nested_vmx" with a growing pile of booleans
 - Fix an SGX bug where KVM would incorrectly try to handle EPCM page faults, and instead always reflect them into the guest. Since KVM doesn't shadow EPCM entries, EPCM violations cannot be due to KVM interference and can't be resolved by KVM
 - Fix a bug where KVM would register its posted interrupt wakeup handler even if loading kvm-intel.ko ultimately failed
 - Disallow access to vmcs12 fields that aren't fully supported, mostly to avoid weirdness and complexity for FRED and other features, where KVM wants to enable VMCS shadowing for fields that conditionally exist
 - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or refuses to online a CPU) due to a VMCS config mismatch
x86 (AMD):
 - Drop a user-triggerable WARN on nested_svm_load_cr3() failure
 - Add support for virtualizing ERAPS. Note, correct virtualization of ERAPS relies on an upcoming, publicly announced change in the APM to reduce the set of conditions where hardware (i.e. KVM) *must* flush the RAP
 - Ignore nSVM intercepts for instructions that are not supported according to L1's virtual CPU model
 - Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath for EPT Misconfig
 - Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM, and allow userspace to restore nested state with GIF=0
 - Treat exit_code as an unsigned 64-bit value through all of KVM
 - Add support for fetching SNP certificates from userspace
 - Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD or VMSAVE on behalf of L2
 - Misc fixes and cleanups
x86 selftests:
 - Add a regression test for TPR<=>CR8 synchronization and IRQ masking
 - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU support, and extend x86's infrastructure to support EPT and NPT (for L2 guests)
 - Extend several nested VMX tests to also cover nested SVM
 - Add a selftest for nested VMLOAD/VMSAVE
 - Rework the nested dirty log test, originally added as a regression test for PML where KVM logged L2 GPAs instead of L1 GPAs, to improve test coverage and to hopefully make the test easier to understand and maintain
guest_memfd:
 - Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage handling. SEV/SNP was the only user of the tracking and it can do it via the RMP
 - Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to avoid non-trivial complexity for something that no known VMM seems to be doing and to avoid an API special case for in-place conversion, which simply can't support unaligned sources
 - When populating guest_memfd memory, GUP the source page in common code and pass the refcounted page to the vendor callback, instead of letting vendor code do the heavy lifting. Doing so avoids a looming deadlock bug with in-place conversion due to an AB-BA conflict between mmap_lock and guest_memfd's filemap invalidate lock
Generic:
 - Fix a bug where KVM would ignore the vCPU's selected address space when creating a vCPU-specific mapping of guest memory.
Actually this bug could not be hit even on x86, the only architecture with multiple address spaces, but it's a bug nevertheless" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits) KVM: s390: Increase permitted SE header size to 1 MiB MAINTAINERS: Replace backup for s390 vfio-pci KVM: s390: vsie: Fix race in acquire_gmap_shadow() KVM: s390: vsie: Fix race in walk_guest_tables() KVM: s390: Use guest address to mark guest page dirty irqchip/riscv-imsic: Adjust the number of available guest irq files RISC-V: KVM: Transparent huge page support RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test RISC-V: KVM: Allow Zalasr extensions for Guest/VM KVM: riscv: selftests: Add riscv vm satp modes KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr() RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr() RISC-V: KVM: Remove unnecessary 'ret' assignment KVM: s390: Add explicit padding to struct kvm_s390_keyop KVM: LoongArch: selftests: Add steal time test case LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side LoongArch: KVM: Add paravirt preempt feature in hypervisor side ...
30 hours   Merge tag 'mm-stable-2026-02-11-19-22' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
 - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev). It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process.
 - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky)
 - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated (David Hildenbrand)
 - "memcg cleanups" tidies up some memcg code (Chen Ridong)
 - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park)
 - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang)
 - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu)
 - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai)
 - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg)
 - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport)
 - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes)
 - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka)
 - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt)
 - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang)
 - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory between the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park)
 - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)
 - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov)
 - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park)
 - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khugepaged and fixes a scan limit accounting issue (Shivank Garg)
 - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function.
Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...
2 days   Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD   Paolo Bonzini
KVM mediated PMU support for 6.20

Add support for mediated PMUs, where KVM gives the guest full ownership of PMU hardware (context switched around the fastpath run loop) and allows direct access to data MSRs and PMCs (restricted by the vPMU model), but intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state.

To keep overall complexity reasonable, mediated PMU usage is all or nothing for a given instance of KVM (controlled via module param). The mediated PMU is disabled by default, partly to maintain backwards compatibility for existing setups, partly because there are tradeoffs when running with a mediated PMU that may be non-starters for some use cases, e.g. the host loses the ability to profile guests with mediated PMUs, the fastpath run loop is also a blind spot, entry/exit transitions are more expensive, etc.

Versus the emulated PMU, where KVM is "just another perf user", the mediated PMU delivers more accurate profiling and monitoring (no risk of contention and thus dropped events), with significantly less overhead (fewer exits and faster emulation/programming of event selectors).

E.g. when running Specint-2017 on a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from within the guest:

Perf command:
  a. basic-sampling:     perf record -F 1000 -e 6-instructions -a --overwrite
  b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

Guest performance overhead:
  ----------------------------------------------------------------------------
  | Test case          | emulated vPMU | all passthrough | passthrough with  |
  |                    |               |                 | event filters     |
  ----------------------------------------------------------------------------
  | basic-sampling     | 33.62%        | 4.24%           | 6.21%             |
  ----------------------------------------------------------------------------
  | multiplex-sampling | 79.32%        | 7.34%           | 10.45%            |
  ----------------------------------------------------------------------------
2 days   Merge tag 'kvm-x86-apic-6.20' of https://github.com/kvm-x86/linux into HEAD   Paolo Bonzini
KVM x86 APIC-ish changes for 6.20

 - Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when creating a vCPU-specific mapping of guest memory.
 - Clean up KVM's handling of marking mapped vCPU pages dirty.
 - Drop a pile of *ancient* sanity checks hidden behind KVM's unused ASSERT() macro, most of which could be trivially triggered by the guest and/or user, and all of which were useless.
 - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it more obvious what the weird parameter is used for, and to allow burying the RTC shenanigans behind CONFIG_KVM_IOAPIC=y.
 - Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y.
 - Add a regression test for recent APICv update fixes.
 - Rework KVM's handling of VMCS updates while L2 is active to temporarily switch to vmcs01 instead of deferring the update until the next nested VM-Exit. The deferred updates approach directly contributed to several bugs, was proving to be a maintenance burden due to the difficulty in auditing the correctness of deferred updates, and was polluting "struct nested_vmx" with a growing pile of booleans.
 - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv() to consolidate the updates, and to co-locate SVI updates with the updates for KVM's own cache of ISR information.
 - Drop a dead function declaration.
3 days   Merge tag 'x86_misc_for_7.0-rc1' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Dave Hansen: "The usual smattering of x86/misc changes. The IPv6 patch in here surprised me in a couple of ways. First, the function it inlines is able to eat a lot more CPU time than I would have expected. Second, the inlining does not seem to bloat the kernel, at least in the configs folks have tested. - Inline x86-specific IPv6 checksum helper - Update IOMMU docs to use stable identifiers - Print unhashed pointers on fatal stack overflows" * tag 'x86_misc_for_7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/traps: Print unhashed pointers on stack overflow Documentation/x86: Update IOMMU spec references to use stable identifiers x86/lib: Inline csum_ipv6_magic()
3 days   Merge tag 'x86_entry_for_7.0-rc1' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry code updates from Dave Hansen:
"This is entirely composed of a set of long overdue VDSO cleanups. They make the VDSO build much more logical and zap quite a bit of old cruft. It also results in a coveted net-code-removal diffstat"
* tag 'x86_entry_for_7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry/vdso: Add vdso2c to .gitignore x86/entry/vdso32: Omit '.cfi_offset eflags' for LLVM < 16 MAINTAINERS: Adjust vdso file entry in INTEL SGX x86/entry/vdso/selftest: Update location of vgetrandom-chacha.S x86/entry/vdso: Fix filtering of vdso compiler flags x86/entry/vdso: Update the object paths for "make vdso_install" x86/entry/vdso32: When using int $0x80, use it directly x86/cpufeature: Replace X86_FEATURE_SYSENTER32 with X86_FEATURE_SYSFAST32 x86/vdso: Abstract out vdso system call internals x86/entry/vdso: Include GNU_PROPERTY and GNU_STACK PHDRs x86/entry/vdso32: Remove open-coded DWARF in sigreturn.S x86/entry/vdso32: Remove SYSCALL_ENTER_KERNEL macro in sigreturn.S x86/entry/vdso32: Don't rely on int80_landing_pad for adjusting ip x86/entry/vdso: Refactor the vdso build x86/entry/vdso: Move vdso2c to arch/x86/tools x86/entry/vdso: Rename vdso_image_* to vdso*_image
3 days   Merge tag 'x86_sev_for_v7.0_rc1' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Make the SEV internal header really internal and carve out the SVSM-specific code into a separate compilation unit, along with other cleanups and fixups [ TLA translation service: 'SEV' is AMD's 'Secure Encrypted Virtualization' and SVSM is an ETLA ('Enhanced TLA') for 'Secure VM Service Module'. Some of us have trouble keeping track of this all and need all the help we can get ] * tag 'x86_sev_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Don't emit BSS_DECRYPTED section unless it is in use x86/sev: Use kfree_sensitive() when freeing a SNP message descriptor x86/sev: Rename sev_es_ghcb_handle_msr() to __vc_handle_msr() x86/sev: Carve out the SVSM code into a separate compilation unit x86/sev: Add internal header guards x86/sev: Move the internal header
3 days   Merge tag 'x86_paravirt_for_v7.0_rc1' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt updates from Borislav Petkov:
 - A nice cleanup to the paravirt code containing a unification of the paravirt clock interface, taming the include hell by splitting the pv_ops structure and removing a bunch of obsolete code (Juergen Gross)
* tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted() x86/paravirt: Remove trailing semicolons from alternative asm templates x86/pvlocks: Move paravirt spinlock functions into own header x86/paravirt: Specify pv_ops array in paravirt macros x86/paravirt: Allow pv-calls outside paravirt.h objtool: Allow multiple pv_ops arrays x86/xen: Drop xen_mmu_ops x86/xen: Drop xen_cpu_ops x86/xen: Drop xen_irq_ops x86/paravirt: Move pv_native_*() prototypes to paravirt.c x86/paravirt: Introduce new paravirt-base.h header x86/paravirt: Move paravirt_sched_clock() related code into tsc.c x86/paravirt: Use common code for paravirt_steal_clock() riscv/paravirt: Use common code for paravirt_steal_clock() loongarch/paravirt: Use common code for paravirt_steal_clock() arm64/paravirt: Use common code for paravirt_steal_clock() arm/paravirt: Use common code for paravirt_steal_clock() sched: Move clock related paravirt code to kernel/sched paravirt: Remove asm/paravirt_api_clock.h x86/paravirt: Move thunk macros to paravirt_types.h ...
3 days   Merge tag 'x86_cleanups_for_v7.0_rc1' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: - The usual set of cleanups and simplifications all over the tree * tag 'x86_cleanups_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/segment: Use MOVL when reading segment registers selftests/x86: Clean up sysret_rip coding style x86/mm: Hide mm_free_global_asid() definition under CONFIG_BROADCAST_TLB_FLUSH x86/crash: Use set_memory_p() instead of __set_memory_prot() x86/CPU/AMD: Simplify the spectral chicken fix x86/platform/olpc: Replace strcpy() with strscpy() in xo15_sci_add() x86/split_lock: Remove dead string when split_lock_detect=fatal
3 days   Merge tag 'x86-irq-2026-02-09' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 irq updates from Thomas Gleixner: "Trivial cleanups for the posted MSI interrupt handling" * tag 'x86-irq-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/irq_remapping: Sanitize posted_msi_supported() x86/irq: Cleanup posted MSI code
3 days   Merge tag 'timers-vdso-2026-02-09' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull VDSO updates from Thomas Gleixner:
 - Provide the missing 64-bit variant of clock_getres(). This allows the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and finally the removal of 32-bit time types from the kernel and UAPI.
 - Remove the useless and broken getcpu_cache from the VDSO. The intention was to provide a trivial way to retrieve the CPU number from the VDSO, but as the VDSO data is per process there is no way to make it work.
 - Switch get/put_unaligned() from packed struct to memcpy(). The packed struct violates strict aliasing rules, which requires passing -fno-strict-aliasing to the compiler. As these are scalar values, __builtin_memcpy() turns them into simple loads and stores.
 - Use __typeof_unqual__() for __unqual_scalar_typeof(). The get/put_unaligned() changes triggered a new sparse warning when __beNN types are used with get/put_unaligned(), as sparse builds add a special 'bitwise' attribute to them which prevents sparse from evaluating the _Generic in __unqual_scalar_typeof(). Newer sparse versions support __typeof_unqual__() which avoids the problem, but requires a recent sparse install. So this adds a sanity check to sparse builds, which validates that sparse is available and capable of handling it.
 - Force inline __cvdso_clock_getres_common(). Compilers sometimes un-inline aggressively, which results in function call overhead and problems with automatic stack variable initialization. Interestingly enough, the force inlining results in smaller code than the un-inlined variant produced by GCC when optimizing for size.
* tag 'timers-vdso-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: vdso/gettimeofday: Force inlining of __cvdso_clock_getres_common() x86/percpu: Make CONFIG_USE_X86_SEG_SUPPORT work with sparse compiler: Use __typeof_unqual__() for __unqual_scalar_typeof() powerpc/vdso: Provide clock_getres_time64() tools headers: Remove unneeded ignoring of warnings in unaligned.h tools headers: Update the linux/unaligned.h copy with the kernel sources vdso: Switch get/put_unaligned() from packed struct to memcpy() parisc: Inline a type punning version of get_unaligned_le32() vdso: Remove struct getcpu_cache MIPS: vdso: Provide getres_time64() for 32-bit ABIs arm64: vdso32: Provide clock_getres_time64() ARM: VDSO: Provide clock_getres_time64() ARM: VDSO: Patch out __vdso_clock_getres() if unavailable x86/vdso: Provide clock_getres_time64() for x86-32 selftests: vDSO: vdso_test_abi: Add test for clock_getres_time64() selftests: vDSO: vdso_test_abi: Use UAPI system call numbers selftests: vDSO: vdso_config: Add configurations for clock_getres_time64() vdso: Add prototype for __vdso_clock_getres_time64()
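[ Illustrative note: the get/put_unaligned() switch described above can be sketched as below. This is not the kernel's actual linux/unaligned.h code and the helper names are made up; it only shows why a memcpy-based accessor needs no -fno-strict-aliasing: a memcpy of a scalar is lowered by the compiler to a plain load or store where the architecture allows it. ]

  #include <stdint.h>
  #include <string.h>

  /* Sketch only -- not the kernel implementation. */
  static inline uint32_t example_get_unaligned_u32(const void *p)
  {
          uint32_t val;

          memcpy(&val, p, sizeof(val));   /* becomes a single load where legal */
          return val;
  }

  static inline void example_put_unaligned_u32(uint32_t val, void *p)
  {
          memcpy(p, &val, sizeof(val));   /* becomes a single store where legal */
  }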
3 days   Merge tag 'x86-boot-2026-02-09' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/boot updates from Ingo Molnar: - x86/acpi: Add acpi=spcr to use SPCR-provided default console (Shenghao Yang) - x86/acpi/boot: Correct the acpi_is_processor_usable() check again (Yazen Ghannam) - Refresh the x86 memory map (e820 table) handling code, and make the printouts a bit more informative (Ingo Molnar) * tag 'x86-boot-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) x86/acpi: Add acpi=spcr to use SPCR-provided default console x86/boot/e820: Use <linux/sizes.h> symbols for literals x86/boot/e820: Make sure e820_search_gap() finds all gaps x86/boot/e820: Simplify the e820__range_remove() API x86/boot/e820: Remove e820__range_remove()'s unused return parameter x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables x86/boot/e820: Standardize __init/__initdata tag placement x86/boot/e820: Simplify & clarify __e820__range_add() a bit x86/boot/e820: Rename gap_start/gap_size to max_gap_start/max_gap_start in e820_search_gap() et al x86/boot/e820: Change e820_search_gap() to search for the highest-address PCI gap x86/boot/e820: Clean up e820__setup_pci_gap()/e820_search_gap() a bit x86/boot/e820: Change struct e820_table::nr_entries type from __u32 to u32 x86/boot/e820: Standardize e820 table index variable types under 'u32' x86/boot/e820: Standardize e820 table index variable names under 'idx' x86/boot/e820: Remove unnecessary header inclusions x86/boot/e820: Clean up __refdata use a bit x86/boot/e820: Clean up __e820__range_add() a bit x86/boot/e820: Improve e820_print_type() messages x86/boot/e820: Clean up confusing and self-contradictory verbiage around E820 related resource allocations x86/boot/e820: Remove pointless early_panic() indirection ...
3 days   Merge tag 'perf-core-2026-02-09' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance event updates from Ingo Molnar:
"x86 PMU driver updates:
 - Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs (Dapeng Mi). Compared to previous iterations of the Intel PMU code, there's been a lot of changes, which center around three main areas:
     - Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the Off-Core Response (OCR) facility
     - New PEBS data source encoding layout
     - Support the new "RDPMC user disable" feature
 - Likewise, a large series adds uncore PMU support for Intel Diamond Rapids (DMR) CPUs (Zide Chen). This centers around these four main areas:
     - DMR may have two Integrated I/O and Memory Hub (IMH) dies, separate from the compute tile (CBB) dies. Each CBB and each IMH die has its own discovery domain.
     - Unlike prior CPUs that retrieve the global discovery table portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON discovery and MSR for CBB PMON discovery.
     - DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR, PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.
     - IIO free-running counters in DMR are MMIO-based, unlike SPR.
 - Also add missing PMON units for Intel Panther Lake, and support Nova Lake (NVL), which largely maps to Panther Lake (Zide Chen)
 - KVM integration: Add support for mediated vPMUs (by Kan Liang and Sean Christopherson, with fixes and cleanups by Peter Zijlstra, Sandipan Das and Mingwei Zhang)
 - Add Intel cstate driver support for Wildcat Lake (WCL) CPUs, which are a low-power variant of Panther Lake (Zide Chen)
 - Add core, cstate and MSR PMU support for the Airmont NP Intel CPU (aka MaxLinear Lightning Mountain), which maps to the existing Airmont code (Martin Schiller)
Performance enhancements:
 - Speed up kexec shutdown by avoiding unnecessary cross CPU calls (Jan H.
Schönherr) - Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim) User-space stack unwinding support: - Various cleanups and refactorings in preparation to generalize the unwinding code for other architectures (Jens Remus) Uprobes updates: - Transition from kmap_atomic to kmap_local_page (Keke Ming) - Fix incorrect lockdep condition in filter_chain() (Breno Leitao) - Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov) Misc fixes and cleanups: - s390: Remove kvm_types.h from Kbuild (Randy Dunlap) - x86/intel/uncore: Convert comma to semicolon (Chen Ni) - x86/uncore: Clean up const mismatch (Greg Kroah-Hartman) - x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)" * tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) s390: remove kvm_types.h from Kbuild uprobes: Fix incorrect lockdep condition in filter_chain() x86/ibs: Fix typo in dc_l2tlb_miss comment x86/uprobes: Fix XOL allocation failure for 32-bit tasks perf/x86/intel/uncore: Convert comma to semicolon perf/x86/intel: Add support for rdpmc user disable feature perf/x86: Use macros to replace magic numbers in attr_rdpmc perf/x86/intel: Add core PMU support for Novalake perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL perf/x86/intel: Add core PMU support for DMR perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL perf/core: Fix slow perf_event_task_exit() with LBR callstacks perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls uprobes: use kmap_local_page() for temporary page mappings arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() perf/x86/intel/uncore: Add Nova Lake support ...
3 days   Merge tag 'bpf-next-7.0' of ...   Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Support associating BPF program with struct_ops (Amery Hung) - Switch BPF local storage to rqspinlock and remove recursion detection counters which were causing false positives (Amery Hung) - Fix live registers marking for indirect jumps (Anton Protopopov) - Introduce execution context detection BPF helpers (Changwoo Min) - Improve verifier precision for 32bit sign extension pattern (Cupertino Miranda) - Optimize BTF type lookup by sorting vmlinux BTF and doing binary search (Donglin Peng) - Allow states pruning for misc/invalid slots in iterator loops (Eduard Zingerman) - In preparation for ASAN support in BPF arenas teach libbpf to move global BPF variables to the end of the region and enable arena kfuncs while holding locks (Emil Tsalapatis) - Introduce support for implicit arguments in kfuncs and migrate a number of them to new API. This is a prerequisite for cgroup sub-schedulers in sched-ext (Ihor Solodrai) - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen) - Fix ORC stack unwind from kprobe_multi (Jiri Olsa) - Speed up fentry attach by using single ftrace direct ops in BPF trampolines (Jiri Olsa) - Require frozen map for calculating map hash (KP Singh) - Fix lock entry creation in TAS fallback in rqspinlock (Kumar Kartikeya Dwivedi) - Allow user space to select cpu in lookup/update operations on per-cpu array and hash maps (Leon Hwang) - Make kfuncs return trusted pointers by default (Matt Bobrowski) - Introduce "fsession" support where single BPF program is executed upon entry and exit from traced kernel function (Menglong Dong) - Allow bpf_timer and bpf_wq use in all programs types (Mykyta Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei Starovoitov) - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their definition across the tree (Puranjay Mohan) - Allow BPF arena calls from non-sleepable context (Puranjay Mohan) - Improve register id comparison logic in the verifier and extend linked registers with negative offsets (Puranjay Mohan) - In preparation for BPF-OOM introduce kfuncs to access memcg events (Roman Gushchin) - Use CFI compatible destructor kfunc type (Sami Tolvanen) - Add bitwise tracking for BPF_END in the verifier (Tianci Cao) - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou Tang) - Make BPF selftests work with 64k page size (Yonghong Song) * tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits) selftests/bpf: Fix outdated test on storage->smap selftests/bpf: Choose another percpu variable in bpf for btf_dump test selftests/bpf: Remove test_task_storage_map_stress_lookup selftests/bpf: Update task_local_storage/task_storage_nodeadlock test selftests/bpf: Update task_local_storage/recursion test selftests/bpf: Update sk_storage_omem_uncharge test bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} bpf: Support lockless unlink when freeing map or local storage bpf: Prepare for bpf_selem_unlink_nofail() bpf: Remove unused percpu counter from bpf_local_storage_map_free bpf: Remove cgroup local storage percpu counter bpf: Remove task local storage percpu counter bpf: Change local_storage->lock and b->lock to rqspinlock bpf: Convert bpf_selem_unlink to failable bpf: Convert bpf_selem_link_map to failable bpf: Convert bpf_selem_unlink_map to failable bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage selftests/xsk: fix number of Tx 
frags in invalid packet selftests/xsk: properly handle batch ending in the middle of a packet bpf: Prevent reentrance into call_rcu_tasks_trace() ...
4 days   Merge tag 'kvm-x86-misc-6.20' of https://github.com/kvm-x86/linux into HEAD   Paolo Bonzini
KVM x86 misc changes for 6.20

 - Disallow changing the virtual CPU model if L2 is active, for all the same reasons KVM disallows changing the model after the first KVM_RUN.
 - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that were advertised as supported to userspace when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled.
 - Fix a bug where KVM would attempt to read protected guest state (CR3) when configuring an async #PF entry.
 - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Explicitly allow the few exports that are intended for external usage.
 - Ignore -EBUSY when checking nested events after a vCPU exits blocking as the WARN is user-triggerable, and because exiting to userspace on -EBUSY does more harm than good in pretty much every situation.
 - Throw in the towel and drop the WARN on INIT/SIPI being blocked when a vCPU is in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an unwinnable game.
 - Add support for new Intel instructions that don't require anything beyond enumerating feature flags to userspace.
 - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.
 - Add WARNs to guard against modifying KVM's CPU caps outside of the intended setup flow, as nested VMX in particular is sensitive to unexpected changes in KVM's golden configuration.
 - Add a quirk to allow userspace to opt in to actually suppress EOI broadcasts when the suppression feature is enabled by the guest (currently limited to split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an option as some userspaces have come to rely on KVM's buggy behavior (KVM advertises Suppress EOI Broadcast irrespective of whether or not the userspace I/O APIC supports Directed EOIs).
 - Minor cleanups.
4 days   Merge tag 'kvm-x86-svm-6.20' of https://github.com/kvm-x86/linux into HEAD   Paolo Bonzini
KVM SVM changes for 6.20 - Drop a user-triggerable WARN on nested_svm_load_cr3() failure. - Add support for virtualizing ERAPS. Note, correct virtualization of ERAPS relies on an upcoming, publicly announced change in the APM to reduce the set of conditions where hardware (i.e. KVM) *must* flush the RAP. - Ignore nSVM intercepts for instructions that are not supported according to L1's virtual CPU model. - Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath for EPT Misconfig. - Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM, and allow userspace to restore nested state with GIF=0. - Treat exit_code as an unsigned 64-bit value through all of KVM. - Add support for fetching SNP certificates from userspace. - Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD or VMSAVE on behalf of L2. - Misc fixes and cleanups.
7 days   x86/vmware: Fix hypercall clobbers   Josh Poimboeuf
Fedora QA reported the following panic:

  BUG: unable to handle page fault for address: 0000000040003e54
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20251119-3.fc43 11/19/2025
  RIP: 0010:vmware_hypercall4.constprop.0+0x52/0x90
  ..
  Call Trace:
   vmmouse_report_events+0x13e/0x1b0
   psmouse_handle_byte+0x15/0x60
   ps2_interrupt+0x8a/0xd0
   ...

because the QEMU VMware mouse emulation is buggy, and clears the top 32 bits of %rdi that the kernel kept a pointer in. The QEMU vmmouse driver saves and restores the register state in a "uint32_t data[6];" and as a result restores the state with the high bits all cleared.

RDI originally contained the value of a valid kernel stack address (0xff5eeb3240003e54). After the vmware hypercall it now contains 0x40003e54, and we get a page fault as a result when it is dereferenced.

The proper fix would be in QEMU, but this works around the issue in the kernel to keep old setups working, as old kernels happened not to keep any state in %rdi over the hypercall.

In theory this same issue exists for all the hypercalls in the vmmouse driver; in practice it has only been seen with vmware_hypercall3() and vmware_hypercall4(). For now, just mark RDI/RSI as clobbered for those two calls. This should have a minimal effect on code generation overall as it should be rare for the compiler to want to make RDI/RSI live across hypercalls.

Reported-by: Justin Forbes <jforbes@fedoraproject.org>
Link: https://lore.kernel.org/all/99a9c69a-fc1a-43b7-8d1e-c42d6493b41f@broadcom.com/
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
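[ Illustrative note: the shape of the fix can be sketched as below. This is not the real vmware_hypercall4() from asm/vmware.h; the wrapper name is made up and most of the real in/out register plumbing is omitted. The point is the clobber list: naming "rdi" and "rsi" stops the compiler from keeping live values (such as the pointer that QEMU's vmmouse emulation truncates) in those registers across the hypercall. ]

  /* Sketch only -- simplified VMware backdoor call with RDI/RSI clobbers. */
  static inline unsigned long example_vmware_hypercall(unsigned long cmd,
                                                       unsigned long in1)
  {
          unsigned long out0;

          asm volatile("movw $0x5658, %%dx; inl (%%dx), %%eax"
                       : "=a" (out0)
                       : "a" (0x564D5868UL), "b" (in1), "c" (cmd)
                       : "rdi", "rsi", "dx", "memory"); /* RDI/RSI marked clobbered */
          return out0;
  }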
11 days   x86/kfence: fix booting on 32bit non-PAE systems   Andrew Cooper
The original patch inverted the PTE unconditionally to avoid L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level paging. Adjust the logic to use the flip_protnone_guard() helper, which is a nop on 2-level paging but inverts the address bits in all other paging modes. This doesn't matter for the Xen aspect of the original change. Linux no longer supports running 32bit PV under Xen, and Xen doesn't support running any 32bit PV guests without using PAE paging. Link: https://lkml.kernel.org/r/20260126211046.2096622-1-andrew.cooper3@citrix.com Fixes: b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs") Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Closes: https://lore.kernel.org/lkml/CAKFNMokwjw68ubYQM9WkzOuH51wLznHpEOMSqtMoV1Rn9JV_gw@mail.gmail.com/ Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
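[ Illustrative note: a conceptual sketch of the PTE inversion the fix above is about; the helper and mask names here are made up and this is not the kernel's flip_protnone_guard(). The idea is that a not-present PTE has its physical-address bits inverted so an L1TF speculative load cannot observe a valid physical address, while 2-level (32-bit non-PAE) paging has no high address bits to hide and the value must be left untouched, which is why the unconditional inversion broke such systems. ]

  /* Sketch only -- assumed PFN mask, not the real x86 definition. */
  #define EXAMPLE_PTE_PFN_MASK  0x000ffffffffff000ULL

  static unsigned long long example_invert_not_present(unsigned long long pteval,
                                                       int paging_levels)
  {
          if (paging_levels == 2)                 /* 32-bit non-PAE: nothing to hide */
                  return pteval;
          return pteval ^ EXAMPLE_PTE_PFN_MASK;   /* scramble the physical address */
  }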
11 days   x86/ibs: Fix typo in dc_l2tlb_miss comment   Xiang-Bin Shi
The comment for dc_l2tlb_miss incorrectly describes it as a "hit". This contradicts the field name and the actual bit definition. Fix the comment to correctly describe it as a "miss". Signed-off-by: Xiang-Bin Shi <eric91102091@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260123075719.160734-1-eric91102091@gmail.com
2026-01-30   x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path   Jiri Olsa
Mahe reported a missing function from the stack trace on top of a kprobe multi program. The missing function is the very first one in the stacktrace, the one that the bpf program is attached to.

  # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
  Attaching 1 probe...
        do_syscall_64+134
        entry_SYSCALL_64_after_hwframe+118

('*' is used for kprobe_multi attachment)

The reason is that the previous change (the Fixes commit) fixed stack unwind for tracepoint, but removed the attached function address from the stack trace on top of kprobe multi programs, which I also overlooked in the related test (check following patch). The tracepoint and kprobe_multi have different stack setup, but use the same unwind path. I think it's better to keep the previous change, which fixed tracepoint unwind, and instead change the kprobe multi unwind as explained below.

The bpf program stack unwind calls perf_callchain_kernel for the kernel portion and it follows two unwind paths based on the X86_EFLAGS_FIXED bit in pt_regs.flags. When the bit is set we unwind from the stack represented by the pt_regs argument, otherwise we unwind the currently executed stack up to the 'first_frame' boundary. The 'first_frame' value is taken from the regs.rsp value, but the ftrace_caller and ftrace_regs_caller (ftrace trampoline) functions set regs.rsp to the previous stack frame, so we skip the attached function entry.

If we switch kprobe_multi unwind to use the X86_EFLAGS_FIXED bit, we set the start of the unwind to the attached function address. As another benefit we also cut extra unwind cycles needed to reach the 'first_frame' boundary. The speedup can be measured with trigger bench for a kprobe_multi program with stacktrace support.

- trigger bench with stacktrace on current code:
  kprobe-multi   : 0.810 ± 0.001M/s
  kretprobe-multi: 0.808 ± 0.001M/s

- and with the fix:
  kprobe-multi   : 1.264 ± 0.001M/s
  kretprobe-multi: 1.401 ± 0.002M/s

With the fix, the entry probe stacktrace:

  # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
  Attaching 1 probe...
        __x64_sys_newuname+9
        do_syscall_64+134
        entry_SYSCALL_64_after_hwframe+118

The return probe skips the attached function, because it's no longer on the stack at the point of the unwind, which is the same way a standard kretprobe works.

  # bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
  Attaching 1 probe...
        do_syscall_64+134
        entry_SYSCALL_64_after_hwframe+118

Fixes: 6d08340d1e35 ("Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"")
Reported-by: Mahe Tardy <mahe.tardy@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20260126211837.472802-3-jolsa@kernel.org
2026-01-30   KVM: x86: Add x2APIC "features" to control EOI broadcast suppression   Khushit Shah
Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated by userspace), which KVM completely mishandles. When x2APIC support was first added, KVM incorrectly advertised and "enabled" Suppress EOI Broadcast, without fully supporting the I/O APIC side of the equation, i.e. without adding directed EOI to KVM's in-kernel I/O APIC. That flaw was carried over to split IRQCHIP support, i.e. KVM advertised support for Suppress EOI Broadcasts irrespective of whether or not the userspace I/O APIC implementation supported directed EOIs. Even worse, KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without support for directed EOI came to rely on the "spurious" broadcasts. KVM "fixed" the in-kernel I/O APIC implementation by completely disabling support for Suppress EOI Broadcasts in commit 0bcc3fb95b97 ("KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but didn't do anything to remedy userspace I/O APIC implementations. KVM's bogus handling of Suppress EOI Broadcast is problematic when the guest relies on interrupts being masked in the I/O APIC until well after the initial local APIC EOI. E.g. Windows with Credential Guard enabled handles interrupts in the following order: 1. Interrupt for L2 arrives. 2. L1 APIC EOIs the interrupt. 3. L1 resumes L2 and injects the interrupt. 4. L2 EOIs after servicing. 5. L1 performs the I/O APIC EOI. Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt storm, e.g. if the IRQ line is still asserted and userspace reacts to the EOI by re-injecting the IRQ, because the guest doesn't de-assert the line until step #4, and doesn't expect the interrupt to be re-enabled until step #5. Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way of knowing if the userspace I/O APIC supports directed EOIs, i.e. suppressing EOI broadcasts would result in interrupts being stuck masked in the userspace I/O APIC due to step #5 being ignored by userspace. And fully disabling support for Suppress EOI Broadcast is also undesirable, as picking up the fix would require a guest reboot, *and* more importantly would change the virtual CPU model exposed to the guest without any buy-in from userspace. Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to explicitly enable or disable support for Suppress EOI Broadcasts. This gives userspace control over the virtual CPU model exposed to the guest, as KVM should never have enabled support for Suppress EOI Broadcast without userspace opt-in. Not setting either flag will result in legacy quirky behavior for backward compatibility. Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel I/O APIC, as KVM's history/support is just as tragic. E.g. it's not clear that commit c806a6ad35bf ("KVM: x86: call irq notifiers with directed EOI") was entirely correct, i.e. it may have simply papered over the lack of Directed EOI emulation in the I/O APIC. Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI is to support Directed EOI (KVM's name) irrespective of guest CPU vendor. 
Fixes: 7543a635aa09 ("KVM: x86: Add KVM exit for IOAPIC EOIs") Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com Cc: stable@vger.kernel.org Suggested-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Khushit Shah <khushit.shah@nutanix.com> Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com [sean: clean up minor formatting goofs and fix a comment typo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
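[ Illustrative note: the new flags ride on the existing KVM_CAP_X2APIC_API capability, which is enabled per VM with KVM_ENABLE_CAP. Below is a hedged userspace sketch, not code from the patch; error handling is omitted and whether to OR in the pre-existing KVM_X2APIC_API_* flags is up to the VMM. A VMM whose I/O APIC emulation supports Directed EOI could opt in as shown; one without Directed EOI support would pass KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST instead, and leaving both flags clear keeps the legacy quirky behavior. ]

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Sketch only -- opt this VM in to real EOI-broadcast suppression. */
  static int example_enable_suppress_eoi_broadcast(int vm_fd)
  {
          struct kvm_enable_cap cap = {
                  .cap  = KVM_CAP_X2APIC_API,
                  .args = { KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST },
          };

          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }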
2026-01-26   mm: provide address parameter to p{te,md,ud}_user_accessible_page()   Rohan McLure
On several powerpc platforms, a page table entry may not imply whether the relevant mapping is for userspace or kernelspace. Instead, such platforms infer this by the address which is being accessed. Add an additional address argument to each of these routines in order to provide support for page table check on powerpc. [ajd@linux.ibm.com: rebase on arm64 changes] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-9-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
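[ Illustrative note: a sketch of the interface shape described above, not the actual diff. On x86 the user/kernel distinction lives in the PTE, so the new address argument can be ignored there; a powerpc-style platform would key off the address instead (is_user_address() is an assumed placeholder, not a real helper). ]

  /* Sketch only -- signature shape inferred from the commit description. */
  static inline bool pte_user_accessible_page(pte_t pte, unsigned long addr)
  {
  #ifdef CONFIG_X86
          return (pte_val(pte) & _PAGE_PRESENT) && (pte_val(pte) & _PAGE_USER);
  #else
          return pte_present(pte) && is_user_address(addr);       /* assumed helper */
  #endif
  }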
2026-01-26   mm/page_table_check: reinstate address parameter in ...   Rohan McLure
[__]page_table_check_pte_clear() This reverts commit aa232204c468 ("mm/page_table_check: remove unused parameter in [__]page_table_check_pte_clear"). Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. [ajd@linux.ibm.com: rebase, fix additional occurrence and loop handling] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-8-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26   mm/page_table_check: reinstate address parameter in ...   Rohan McLure
[__]page_table_check_pmd_clear() This reverts commit 1831414cd729 ("mm/page_table_check: remove unused parameter in [__]page_table_check_pmd_clear"). Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. [ajd@linux.ibm.com: rebase on arm64 changes] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-7-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26mm/page_table_check: reinstate address parameter in ↵Rohan McLure
[__]page_table_check_pud_clear() This reverts commit 931c38e16499 ("mm/page_table_check: remove unused parameter in [__]page_table_check_pud_clear"). Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. [ajd@linux.ibm.com: rebase on arm64 changes] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-6-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26mm/page_table_check: reinstate address parameter in ↵Rohan McLure
[__]page_table_check_pmd[s]_set() This reverts commit a3b837130b58 ("mm/page_table_check: remove unused parameter in [__]page_table_check_pmd_set"). Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. Apply this to __page_table_check_pmds_set(), page_table_check_pmd_set(), and the page_table_check_pmd_set() wrapper macro. [ajd@linux.ibm.com: rebase on arm64 + riscv changes, update commit message] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-4-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26mm/page_table_check: reinstate address parameter in ↵Rohan McLure
[__]page_table_check_pud[s]_set() This reverts commit 6d144436d954 ("mm/page_table_check: remove unused parameter in [__]page_table_check_pud_set"). Reinstate previously unused parameters for the purpose of supporting powerpc platforms, as many do not encode user/kernel ownership of the page in the pte, but instead in the address of the access. Apply this to __page_table_check_puds_set(), page_table_check_puds_set() and the page_table_check_pud_set() wrapper macro. [ajd@linux.ibm.com: rebase on riscv + arm64 changes, update commit message] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-3-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-23KVM: x86: Advertise AVX10_VNNI_INT CPUID to userspaceZhao Liu
Define and advertise AVX10_VNNI_INT CPUID to userspace when it's supported by the host. AVX10_VNNI_INT (0x24.0x1.ECX[bit 2]) is a discrete feature bit introduced on Intel Diamond Rapids, which enumerates the support for EVEX VPDP* instructions for INT8/INT16 [*]. Since this feature has no actual kernel usages, define it as a KVM-only feature in reverse_cpuid.h. Advertise new CPUID subleaf 0x24.0x1 with AVX10_VNNI_INT bit to userspace for guest use. It's safe since no additional enabling work is needed in the host kernel. [*]: Intel Advanced Vector Extensions 10.2 Architecture Specification (rev 5.0). Tested-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251120050720.931449-5-zhao1.liu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
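As a quick illustration of how the newly advertised bit would be probed from a guest or from userspace, the following standalone C program queries CPUID leaf 0x24, subleaf 1 and tests ECX[2]; the leaf and bit positions are taken from the text above, everything else is just a sketch:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned int eax, ebx, ecx, edx;

            /* CPUID.0x24.0x1:ECX[bit 2] enumerates AVX10_VNNI_INT. */
            if (!__get_cpuid_count(0x24, 0x1, &eax, &ebx, &ecx, &edx))
                    return 1;

            printf("AVX10_VNNI_INT: %s\n", (ecx & (1u << 2)) ? "yes" : "no");
            return 0;
    }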
2026-01-23KVM: x86: Advertise AMX CPUIDs in subleaf 0x1E.0x1 to userspaceZhao Liu
Define and advertise AMX CPUIDs (0x1E.0x1) to userspace when the leaf is supported by the host. Intel Diamond Rapids adds new AMX instructions to support new formats and memory operations [*], and introduces the CPUID subleaf 0x1E.0x1 to centralize the discrete AMX feature bits within EAX. Since these AMX features have no actual kernel usages, define them as KVM-only features in reverse_cpuid.h. In addition to the new features, CPUID 0x1E.0x1.EAX[bits 0-3] are aliased positions of existing AMX feature bits distributed across the 0x7 leaves. To avoid duplicate feature names, name these aliases with an *_ALIAS suffix, and define them in reverse_cpuid.h as KVM-only features as well. Advertise the new CPUID subleaf 0x1E.0x1 with its AMX CPUID feature bits to userspace for guest use. It's safe since no additional enabling work is needed in the host kernel. [*]: Intel Architecture Instruction Set Extensions and Future Features (rev.059). Tested-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251120050720.931449-3-zhao1.liu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23KVM: x86: Advertise MOVRS CPUID to userspaceZhao Liu
Define the feature flag for MOVRS and advertise support to userspace when the feature is supported by the host. MOVRS is a new set of instructions introduced on the Intel Diamond Rapids platform, providing load instructions that carry a read-shared hint. Functionally, the MOVRS family is equivalent to existing load instructions, but its read-shared hint indicates that the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written (assuming it is ever written in the future). This hint could optimize the behavior of the caches, especially shared caches, for this data for future reads by multiple processors. The MOVRS family also includes a software prefetch instruction, PREFETCHRST2, that carries the same read-shared hint. [*] The MOVRS family is enumerated by a single CPUID bit (0x7.0x1.EAX[bit 31]). Since it's on a densely-populated CPUID leaf and some other bits on this leaf have kernel usages, define this new feature in cpufeatures.h, but hide it in /proc/cpuinfo due to lack of current kernel usage. Advertise the MOVRS bit to userspace directly. It's safe, since there are no new VMX controls or additional host enabling required for guests to use this feature. [*]: Intel Architecture Instruction Set Extensions and Future Features (rev.059). Tested-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20251120050720.931449-2-zhao1.liu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
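A sketch of the cpufeatures.h side described above; the bit position follows 0x7.0x1.EAX[31], the word number assumes CPUID.7.1.EAX occupies word 12 as in current kernels, and omitting a quoted display name from the comment is the convention that keeps a flag out of /proc/cpuinfo:

    /* arch/x86/include/asm/cpufeatures.h (illustrative) */
    #define X86_FEATURE_MOVRS        (12*32+31) /* MOVRS/PREFETCHRST2 read-shared-hint loads, hidden from /proc/cpuinfo */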
2026-01-20x86/clear_page: introduce clear_pages()Ankur Arora
Performance when clearing with string instructions (x86-64-stosq and similar) can vary significantly based on the chunk-size used. $ perf bench mem memset -k 4KB -s 4GB -f x86-64-stosq # Running 'mem/memset' benchmark: # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S) # Copying 4GB bytes ... 13.748208 GB/sec $ perf bench mem memset -k 2MB -s 4GB -f x86-64-stosq # Running 'mem/memset' benchmark: # function 'x86-64-stosq' (movsq-based memset() in # arch/x86/lib/memset_64.S) # Copying 4GB bytes ... 15.067900 GB/sec $ perf bench mem memset -k 1GB -s 4GB -f x86-64-stosq # Running 'mem/memset' benchmark: # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S) # Copying 4GB bytes ... 38.104311 GB/sec (Both on AMD Milan.) With a change in chunk-size from 4KB to 1GB, we see the performance go from 13.7 GB/sec to 38.1 GB/sec. For the chunk-size of 2MB the change isn't quite as drastic but it is worth adding a clear_page() variant that can handle contiguous page-extents. Link: https://lkml.kernel.org/r/20260107072009.1615991-6-ankur.a.arora@oracle.com Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
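The point of a multi-page primitive is to hand the CPU one long string operation instead of a loop of 4KB ones. A minimal sketch of that idea using rep stosq (not the kernel's actual clear_pages() implementation):

    #define PAGE_SIZE_SKETCH 4096UL

    /* Zero 'npages' contiguous pages with a single REP STOSQ. */
    static void clear_pages_sketch(void *addr, unsigned long npages)
    {
            void *dst = addr;
            unsigned long qwords = npages * PAGE_SIZE_SKETCH / 8;

            asm volatile("rep stosq"
                         : "+D" (dst), "+c" (qwords)
                         : "a" (0UL)
                         : "memory");
    }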
2026-01-20x86/mm: simplify clear_page_*Ankur Arora
clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS" variations. Inlining gets rid of an unnecessary CALL/RET (which isn't free when using RETHUNK speculative execution mitigations.) Fixup and rename clear_page_orig() to adapt to the changed calling convention. Also add a comment from Dave Hansen detailing various clearing mechanisms used in clear_page(). Link: https://lkml.kernel.org/r/20260107072009.1615991-5-ankur.a.arora@oracle.com Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20treewide: provide a generic clear_user_page() variantDavid Hildenbrand
Patch series "mm: folio_zero_user: clear page ranges", v11. This series adds clearing of contiguous page ranges for hugepages. The series improves on the current discontiguous clearing approach in two ways: - clear pages in a contiguous fashion. - use batched clearing via clear_pages() wherever exposed. The first is useful because it allows us to make much better use of hardware prefetchers. The second enables advertising the real extent to the processor. Where specific instructions support it (ex. string instructions on x86; "mops" on arm64 etc), a processor can optimize based on this because, instead of seeing a sequence of 8-byte stores, or a sequence of 4KB pages, it sees a larger unit being operated on. For instance, AMD Zen uarchs (for extents larger than LLC-size) switch to a mode where they start eliding cacheline allocation. This is helpful not just because it results in higher bandwidth, but also because now the cache is not evicting useful cachelines and replacing them with zeroes. Demand faulting a 64GB region shows performance improvement: $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5 baseline +series (GBps +- %stdev) (GBps +- %stdev) pg-sz=2MB 11.76 +- 1.10% 25.34 +- 1.18% [*] +115.47% preempt=* pg-sz=1GB 24.85 +- 2.41% 39.22 +- 2.32% + 57.82% preempt=none|voluntary pg-sz=1GB (similar) 52.73 +- 0.20% [#] +112.19% preempt=full|lazy [*] This improvement is because switching to sequential clearing allows the hardware prefetchers to do a much better job. [#] For pg-sz=1GB a large part of the improvement is because of the cacheline elision mentioned above. preempt=full|lazy improves upon that because, not needing explicit invocations of cond_resched() to ensure reasonable preemption latency, it can clear the full extent as a single unit. In comparison, the maximum extent used for preempt=none|voluntary is PROCESS_PAGES_NON_PREEMPT_BATCH (32MB). When provided the full extent the processor forgoes allocating cachelines on this path almost entirely. (The hope is that eventually, in the fullness of time, the lazy preemption model will be able to do the same job that none or voluntary models are used for, allowing us to do away with cond_resched().) Raghavendra also tested a previous version of the series on AMD Genoa and sees similar improvement [1] with preempt=lazy. $ perf bench mem map -p $page-size -f populate -s 64GB -l 10 base patched change pg-sz=2MB 12.731939 GB/sec 26.304263 GB/sec 106.6% pg-sz=1GB 26.232423 GB/sec 61.174836 GB/sec 133.2% This patch (of 8): Let's drop all variants that effectively map to clear_page() and provide it in a generic variant instead. We'll use the macro clear_user_page to indicate whether an architecture provides its own variant. Also, clear_user_page() is only called from the generic variant of clear_user_highpage(), so define it only if the architecture does not provide a clear_user_highpage(). And, for simplicity, define it in linux/highmem.h. Note that for parisc, clear_page() and clear_user_page() map to clear_page_asm(), so we can just get rid of the custom clear_user_page() implementation. There is a clear_user_page_asm() function on parisc that seems to be unused. Not sure what's up with that.
Link: https://lkml.kernel.org/r/20260107072009.1615991-1-ankur.a.arora@oracle.com Link: https://lkml.kernel.org/r/20260107072009.1615991-2-ankur.a.arora@oracle.com Signed-off-by: David Hildenbrand <david@redhat.com> Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Hildenbrand <david@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
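The generic variant this patch introduces amounts to the usual header fallback pattern, roughly as below; it only applies when an architecture defines neither clear_user_page() nor clear_user_highpage() (sketch, not the exact hunk):

    /* include/linux/highmem.h (illustrative) */
    #ifndef clear_user_page
    static inline void clear_user_page(void *addr, unsigned long vaddr,
                                       struct page *page)
    {
            /* No cache flushing or page colouring needed: just clear it. */
            clear_page(addr);
    }
    #endif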
2026-01-20x86/xen: use lazy_mmu_state when context-switchingKevin Brodsky
We currently set a TIF flag when scheduling out a task that is in lazy MMU mode, in order to restore it when the task is scheduled again. The generic lazy_mmu layer now tracks whether a task is in lazy MMU mode in task_struct::lazy_mmu_state. We can therefore check that state when switching to the new task, instead of using a separate TIF flag. Link: https://lkml.kernel.org/r/20251215150323.2218608-14-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20mm: introduce generic lazy_mmu helpersKevin Brodsky
The implementation of the lazy MMU mode is currently entirely arch-specific; core code directly calls arch helpers: arch_{enter,leave}_lazy_mmu_mode(). We are about to introduce support for nested lazy MMU sections. As things stand we'd have to duplicate that logic in every arch implementing lazy_mmu - adding to a fair amount of logic already duplicated across lazy_mmu implementations. This patch therefore introduces a new generic layer that calls the existing arch_* helpers. Two pairs of calls are introduced: * lazy_mmu_mode_enable() ... lazy_mmu_mode_disable() This is the standard case where the mode is enabled for a given block of code by surrounding it with enable() and disable() calls. * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume() This is for situations where the mode is temporarily disabled by first calling pause() and then resume() (e.g. to prevent any batching from occurring in a critical section). The documentation in <linux/pgtable.h> will be updated in a subsequent patch. No functional change should be introduced at this stage. The implementation of enable()/resume() and disable()/pause() is currently identical, but nesting support will change that. Most of the call sites have been updated using the following Coccinelle script: @@ @@ { ... - arch_enter_lazy_mmu_mode(); + lazy_mmu_mode_enable(); ... - arch_leave_lazy_mmu_mode(); + lazy_mmu_mode_disable(); ... } @@ @@ { ... - arch_leave_lazy_mmu_mode(); + lazy_mmu_mode_pause(); ... - arch_enter_lazy_mmu_mode(); + lazy_mmu_mode_resume(); ... } A couple of notes regarding x86: * Xen is currently the only case where explicit handling is required for lazy MMU when context-switching. This is purely an implementation detail and using the generic lazy_mmu_mode_* functions would cause trouble when nesting support is introduced, because the generic functions must be called from the current task. For that reason we still use arch_leave() and arch_enter() there. * x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few places, but only defines it if PARAVIRT_XXL is selected, and we are removing the fallback in <linux/pgtable.h>. Add a new fallback definition to <asm/pgtable.h> to keep things building. Link: https://lkml.kernel.org/r/20251215150323.2218608-8-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "H.
Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODEKevin Brodsky
Architectures currently opt in for implementing lazy_mmu helpers by defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE. In preparation for introducing a generic lazy_mmu layer that will require storage in task_struct, let's switch to a cleaner approach: instead of defining a macro, select a CONFIG option. This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each arch select it when it implements lazy_mmu helpers. __HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h> relies on the new CONFIG instead. On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is selected. This creates some complications in arch/x86/boot/, because a few files manually undefine PARAVIRT* options. As a result <asm/paravirt.h> does not define the lazy_mmu helpers, but this breaks the build as <linux/pgtable.h> only defines them if !CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to be a clean way out of this - let's just undefine that new CONFIG too. Link: https://lkml.kernel.org/r/20251215150323.2218608-7-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
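The corresponding header plumbing is the standard stub pattern: when an architecture does not select the new option, <linux/pgtable.h> supplies empty inline fallbacks along these lines (a sketch; the exact fallback set may differ):

    /* include/linux/pgtable.h (illustrative) */
    #ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE
    static inline void arch_enter_lazy_mmu_mode(void) { }
    static inline void arch_leave_lazy_mmu_mode(void) { }
    #endif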
2026-01-20x86/segment: Use MOVL when reading segment registersUros Bizjak
Use MOVL when reading segment registers to avoid 0x66 operand-size override insn prefix. The segment value is always 16-bit and gets zero-extended to the full 32-bit size. Example: 4e4: 66 8c c0 mov %es,%ax 4e7: 66 89 83 80 0b 00 00 mov %ax,0xb80(%rbx) 4e4: 8c c0 mov %es,%eax 4e6: 66 89 83 80 0b 00 00 mov %ax,0xb80(%rbx) Also, use the %k0 modifier which generates the SImode (signed integer) register name for the target register. [ bp: Extend and clarify commit message. ] Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Link: https://patch.msgid.link/20260105090422.6243-1-ubizjak@gmail.com
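The trick is the %k operand modifier, which makes GCC print the 32-bit register name regardless of the operand's C type. A standalone illustration (not the kernel's savesegment() macro verbatim):

    static inline unsigned short read_es_sketch(void)
    {
            unsigned int sel;

            /*
             * %k0 prints the 32-bit name (e.g. %eax), so this assembles to
             * "mov %es,%eax" without the 0x66 operand-size prefix; the
             * 16-bit selector is zero-extended into the full register.
             */
            asm volatile("movl %%es, %k0" : "=r" (sel));
            return (unsigned short)sel;
    }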
2026-01-19x86/kfence: avoid writing L1TF-vulnerable PTEsAndrew Cooper
For native, the choice of PTE is fine. There's real memory backing the non-present PTE. However, for XenPV, Xen complains: (XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing To explain, some background on XenPV pagetables: Xen PV guests control their own pagetables; they choose the new PTE value, and use hypercalls to make changes so Xen can audit for safety. In addition to a regular reference count, Xen also maintains a type reference count. e.g. SegDesc (referenced by vGDT/vLDT), Writable (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a lower pagetable level). This is in order to prevent e.g. a page being inserted into the pagetables for which the guest has a writable mapping. For non-present mappings, all other bits become software accessible, and typically contain metadata rather than a real frame address. There is nothing that a reference count could sensibly be tied to. As such, even if Xen could recognise the address as currently safe, nothing would prevent that frame from changing owner to another VM in the future. When Xen detects a PV guest writing an L1TF-PTE, it responds by activating shadow paging. This is normally only used for the live phase of migration, and comes with a reasonable overhead. KFENCE only cares about getting #PF to catch wild accesses; it doesn't care about the value for non-present mappings. Use a fully inverted PTE, to avoid hitting the slow path when running under Xen. While adjusting the logic, take the opportunity to skip all actions if the PTE is already in the right state, halve the number of PVOps callouts, and skip TLB maintenance on a !P -> P transition which benefits non-Xen cases too. Link: https://lkml.kernel.org/r/20260106180426.710013-1-andrew.cooper3@citrix.com Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
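In rough terms the fix stores a fully inverted value in the non-present PTE, so it never resembles a real frame address that Xen would have to treat as L1TF-vulnerable, and it skips redundant work. A simplified sketch of the flow, with helper usage assumed rather than copied from the patch:

    static bool kfence_protect_page_sketch(unsigned long addr, bool protect)
    {
            unsigned int level;
            pte_t *pte = lookup_address(addr, &level);

            if (!pte || level != PG_LEVEL_4K)
                    return false;

            /* Already in the requested state: no PVOps call, no TLB work. */
            if (protect == !(pte_val(*pte) & _PAGE_PRESENT))
                    return true;

            /*
             * Invert every bit, including Present, so the non-present
             * encoding never looks like a usable frame address.
             */
            set_pte(pte, __pte(~pte_val(*pte)));

            /* Only a P -> !P transition needs TLB maintenance. */
            if (protect)
                    flush_tlb_one_kernel(addr);
            return true;
    }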
2026-01-18x86/percpu: Make CONFIG_USE_X86_SEG_SUPPORT work with sparseThomas Gleixner
Now that sparse builds enforce the usage of typeof_unqual(), the casts in __raw_cpu_read/write() cause sparse to emit tons of false positive warnings: warning: cast removes address space '__percpu' of expression Address this by annotating the casts with __force. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/87v7gz0yjv.ffs@tglx Closes: https://lore.kernel.org/oe-kbuild-all/202601181733.YZOf9XU3-lkp@intel.com/
2026-01-16x86/mm: Hide mm_free_global_asid() definition under CONFIG_BROADCAST_TLB_FLUSHHou Wenlong
When CONFIG_BROADCAST_TLB_FLUSH is not enabled, mm_free_global_asid() remains a globally visible symbol and generates a useless function call to it in destroy_context(). Therefore, hide the mm_free_global_asid() definition under CONFIG_BROADCAST_TLB_FLUSH and provide a static inline empty version when it is not enabled to remove the function call. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Rik van Riel <riel@surriel.com> Link: https://patch.msgid.link/b262a8ec8076fb26bb692aaf113848b1e6f40e40.1768448079.git.houwenlong.hwl@antgroup.com
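The shape of the change is the familiar conditional-stub pattern (sketch; the real declaration lives in the x86 TLB headers):

    #ifdef CONFIG_BROADCAST_TLB_FLUSH
    void mm_free_global_asid(struct mm_struct *mm);
    #else
    static inline void mm_free_global_asid(struct mm_struct *mm) { }
    #endif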
2026-01-15x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted()Uros Bizjak
x86_64 zero extends 32bit operations, so for 64bit operands, XOR r32,r32 is functionally equal to XOR r64,r64, but avoids a REX prefix byte when legacy registers are used. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Acked-by: Alexey Makhalov <alexey.makhalov@broadcom.com> Link: https://patch.msgid.link/20260114211948.74774-2-ubizjak@gmail.com
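For reference, the two encodings; both clear all 64 bits thanks to implicit zero extension, the 32-bit form is simply one byte shorter (shown here for illustration):

    31 c0          xor    %eax,%eax     # 32-bit form, upper half zeroed implicitly
    48 31 c0       xor    %rax,%rax     # same effect, but pays for a REX.W prefix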
2026-01-15x86/paravirt: Remove trailing semicolons from alternative asm templatesUros Bizjak
GCC inline asm treats semicolons as instruction separators, so a semicolon after the last instruction is not required. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Alexey Makhalov <alexey.makhalov@broadcom.com> Link: https://patch.msgid.link/20260114211948.74774-1-ubizjak@gmail.com
2026-01-15perf/x86/intel: Add support for rdpmc user disable featureDapeng Mi
Starting with Panther Cove, the rdpmc user disable feature is supported. This feature allows the perf system to disable user space rdpmc reads at the counter level. Currently, when a global counter is active, any user with rdpmc rights can read it, even if perf access permissions forbid it (e.g., disallow reading ring 0 counters). The rdpmc user disable feature mitigates this security concern. Details: - A new RDPMC_USR_DISABLE bit (bit 37) in each EVNTSELx MSR indicates that the GP counter cannot be read by RDPMC in ring 3. - New RDPMC_USR_DISABLE bits in IA32_FIXED_CTR_CTRL MSR (bits 33, 37, 41, 45, etc.) for fixed counters 0, 1, 2, 3, etc. - When calling rdpmc instruction for counter x, the following pseudo code demonstrates how the counter value is obtained: If (!CPL0 && RDPMC_USR_DISABLE[x] == 1) ? 0 : counter_value; - RDPMC_USR_DISABLE is enumerated by CPUID.0x23.0.EBX[2]. This patch extends the current global user space rdpmc control logic via the sysfs interface (/sys/devices/cpu/rdpmc) as follows: - rdpmc = 0: Global user space rdpmc and counter-level user space rdpmc for all counters are both disabled. - rdpmc = 1: Global user space rdpmc is enabled during the mmap-enabled time window, and counter-level user space rdpmc is enabled only for non-system-wide events. This prevents counter data leaks as count data is cleared during context switches. - rdpmc = 2: Global user space rdpmc and counter-level user space rdpmc for all counters are enabled unconditionally. The new rdpmc settings only affect newly activated perf events; currently active perf events remain unaffected. This simplifies and cleans up the code. The default value of rdpmc remains unchanged at 1. For more details about rdpmc user disable, please refer to chapter 15 "RDPMC USER DISABLE" in ISE documentation. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114011750.350569-8-dapeng1.mi@linux.intel.com
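The bit positions spelled out above can be summarised as constants; the names below are placeholders, only the bit numbers come from the text:

    /* EVNTSELx bit 37: RDPMC of this GP counter reads 0 in ring 3. */
    #define EVNTSEL_RDPMC_USR_DISABLE_SKETCH        (1ULL << 37)

    /* IA32_FIXED_CTR_CTRL bits 33, 37, 41, 45, ...: one per fixed counter. */
    static inline unsigned long long fixed_rdpmc_usr_disable_bit(unsigned int idx)
    {
            return 1ULL << (33 + 4 * idx);
    }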
2026-01-15perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVLDapeng Mi
Diamond Rapids (DMR) and Nova Lake (NVL) introduce an enhanced Off-Module Response (OMR) facility, replacing the Off-Core Response (OCR) Performance Monitoring of previous processors. Legacy microarchitectures used the OCR facility to evaluate off-core and multi-core off-module transactions. The newly named OMR facility improves OCR capabilities for scalable coverage of new memory systems in multi-core module systems. Similar to OCR, 4 additional off-module configuration MSRs (OFFMODULE_RSP_0 to OFFMODULE_RSP_3) are introduced to specify attributes of off-module transactions. When multiple identical OMR events are created, they need to occupy the same OFFMODULE_RSP_x MSR. To ensure these multiple identical OMR events can work simultaneously, the intel_alt_er() and intel_fixup_er() helpers are enhanced to rotate these OMR events across different OFFMODULE_RSP_* MSRs, similar to previous OCR events. For more details about OMR, please refer to section 16.1 "OFF-MODULE RESPONSE (OMR) FACILITY" in ISE documentation. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114011750.350569-2-dapeng1.mi@linux.intel.com
2026-01-14vdso: Remove struct getcpu_cacheThomas Weißschuh
The cache parameter of getcpu() is useless nowadays for various reasons. * It is never passed by userspace for either the vDSO or syscalls. * It is never used by the kernel. * It could not be made to work on the current vDSO architecture. * The structure definition is not part of the UAPI headers. * vdso_getcpu() is superseded by restartable sequences in any case. Remove the struct and its header. As a side-effect this gets rid of an unwanted inclusion of the linux/ header namespace from vDSO code. [ tglx: Adapt to s390 upstream changes ] Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Link: https://patch.msgid.link/20251230-getcpu_cache-v3-1-fb9c5f880ebe@linutronix.de
2026-01-13KVM: SVM: Treat exit_code as an unsigned 64-bit value through all of KVMSean Christopherson
Fix KVM's long-standing buggy handling of SVM's exit_code as a 32-bit value. Per the APM and Xen commit d1bd157fbc ("Big merge the HVM full-virtualisation abstractions.") (which is arguably more trustworthy than KVM), offset 0x70 is a single 64-bit value: 070h 63:0 EXITCODE Track exit_code as a single u64 to prevent reintroducing bugs where KVM neglects to correctly set bits 63:32. Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface") Cc: Jim Mattson <jmattson@google.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230211347.4099600-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
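In struct terms the change collapses the two 32-bit halves historically used for VMCB offset 0x70 into one 64-bit field, roughly as below (field names follow KVM's vmcb_control_area convention but are quoted from memory, not from the patch):

    typedef unsigned long long u64;

    struct vmcb_control_area_sketch {
            /* ... fields up to offset 0x70 elided ... */
            u64 exit_code;          /* offset 070h, bits 63:0 EXITCODE per the APM */
            u64 exit_info_1;
            u64 exit_info_2;
            /* ... */
    };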
2026-01-13x86/entry/vdso32: When using int $0x80, use it directlyH. Peter Anvin
When neither sysenter32 nor syscall32 is available (on either FRED-capable 64-bit hardware or old 32-bit hardware), there is no reason to do a bunch of stack shuffling in __kernel_vsyscall. Unfortunately, just overwriting the initial "push" instructions will mess up the CFI annotations, so suffer the 3-byte NOP if not applicable. Similarly, inline the int $0x80 when doing inline system calls in the vdso instead of calling __kernel_vsyscall. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-11-hpa@zytor.com
2026-01-13x86/cpufeature: Replace X86_FEATURE_SYSENTER32 with X86_FEATURE_SYSFAST32H. Peter Anvin
In most cases, the use of "fast 32-bit system call" depends either on X86_FEATURE_SEP or X86_FEATURE_SYSENTER32 || X86_FEATURE_SYSCALL32. However, nearly all the logic for both is identical. Define X86_FEATURE_SYSFAST32 which indicates that *either* SYSENTER32 or SYSCALL32 should be used, for either 32- or 64-bit kernels. This defaults to SYSENTER; use SYSCALL if the SYSCALL32 bit is also set. As this removes ALL existing uses of X86_FEATURE_SYSENTER32, which is a kernel-only synthetic feature bit, simply remove it and replace it with X86_FEATURE_SYSFAST32. This leaves an unused alternative for a true 32-bit kernel, but that should really not matter in any way. The clearing of X86_FEATURE_SYSCALL32 can be removed once the patches for automatically clearing disabled features has been merged. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-10-hpa@zytor.com
2026-01-13x86/vdso: Abstract out vdso system call internalsH. Peter Anvin
Abstract out the calling of true system calls from the vdso into macros. It has been a very long time since gcc disallowed %ebx or %ebp in inline asm in 32-bit PIC mode; remove the corresponding hacks. Remove the use of memory output constraints in gettimeofday.h in favor of "memory" clobbers. The resulting code is identical for the current use cases, as the system call is usually a terminal fallback anyway, and it merely complicates the macroization. This patch adds only a handful more lines of code than it removes, and in fact could be made substantially smaller by removing the macros for the argument counts that aren't currently used, however, it seems better to be general from the start. [ v3: remove stray comment from prototyping; remove VDSO_SYSCALL6() since it would require special handling on 32 bits and is currently unused. (Uros Bizjak) Indent nested preprocessor directives. ] Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Uros Bizjak <ubizjak@gmail.com> Link: https://patch.msgid.link/20251216212606.1325678-9-hpa@zytor.com
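The kind of helper being abstracted out looks roughly like this 32-bit flavoured sketch; the name and argument-count coverage are assumptions, and the real macros also choose between int $0x80 and the fast-syscall paths:

    static inline long vdso_syscall2_sketch(long nr, long a1, long a2)
    {
            long ret;

            /*
             * Modern GCC allows %ebx/%ebp in 32-bit PIC inline asm, so the
             * old register-shuffling hacks are unnecessary; a "memory"
             * clobber stands in for per-buffer output constraints.
             */
            asm volatile("int $0x80"
                         : "=a" (ret)
                         : "a" (nr), "b" (a1), "c" (a2)
                         : "memory");
            return ret;
    }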
2026-01-13x86/entry/vdso32: Remove open-coded DWARF in sigreturn.SH. Peter Anvin
The vdso32 sigreturn.S contains open-coded DWARF bytecode, which includes a hack for gdb to not try to step back to a previous call instruction when backtracing from a signal handler. Neither of those is necessary anymore: the backtracing issue is handled by ".cfi_startproc simple" and ".cfi_signal_frame", both of which have been supported for a very long time now, which allows the remaining frame to be built using regular .cfi annotations. Add a few more register offsets to the signal frame just for good measure. Replace the nop on fallthrough of the system call (which should never, ever happen) with a ud2a trap. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20251216212606.1325678-7-hpa@zytor.com