| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Dave Hansen:
"The usual smattering of x86/misc changes.
The IPv6 patch in here surprised me in a couple of ways. First, the
function it inlines is able to eat a lot more CPU time than I would
have expected. Second, the inlining does not seem to bloat the kernel,
at least in the configs folks have tested.
- Inline x86-specific IPv6 checksum helper
- Update IOMMU docs to use stable identifiers
- Print unhashed pointers on fatal stack overflows"
* tag 'x86_misc_for_7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/traps: Print unhashed pointers on stack overflow
Documentation/x86: Update IOMMU spec references to use stable identifiers
x86/lib: Inline csum_ipv6_magic()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry code updates from Dave Hansen:
"This is entirely composed of a set of long overdue VDSO cleanups. They
makes the VDSO build much more logical and zap quite a bit of old
cruft.
It also results in a coveted net-code-removal diffstat"
* tag 'x86_entry_for_7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/vdso: Add vdso2c to .gitignore
x86/entry/vdso32: Omit '.cfi_offset eflags' for LLVM < 16
MAINTAINERS: Adjust vdso file entry in INTEL SGX
x86/entry/vdso/selftest: Update location of vgetrandom-chacha.S
x86/entry/vdso: Fix filtering of vdso compiler flags
x86/entry/vdso: Update the object paths for "make vdso_install"
x86/entry/vdso32: When using int $0x80, use it directly
x86/cpufeature: Replace X86_FEATURE_SYSENTER32 with X86_FEATURE_SYSFAST32
x86/vdso: Abstract out vdso system call internals
x86/entry/vdso: Include GNU_PROPERTY and GNU_STACK PHDRs
x86/entry/vdso32: Remove open-coded DWARF in sigreturn.S
x86/entry/vdso32: Remove SYSCALL_ENTER_KERNEL macro in sigreturn.S
x86/entry/vdso32: Don't rely on int80_landing_pad for adjusting ip
x86/entry/vdso: Refactor the vdso build
x86/entry/vdso: Move vdso2c to arch/x86/tools
x86/entry/vdso: Rename vdso_image_* to vdso*_image
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Make the SEV internal header really internal and carve out the
SVSM-specific code into a separate compilation unit, along with other
cleanups and fixups
[ TLA translation service: 'SEV' is AMD's 'Secure Encrypted
Virtualization' and SVSM is an ETLA ('Enhanced TLA') for 'Secure
VM Service Module'.
Some of us have trouble keeping track of this all and need all the
help we can get ]
* tag 'x86_sev_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Don't emit BSS_DECRYPTED section unless it is in use
x86/sev: Use kfree_sensitive() when freeing a SNP message descriptor
x86/sev: Rename sev_es_ghcb_handle_msr() to __vc_handle_msr()
x86/sev: Carve out the SVSM code into a separate compilation unit
x86/sev: Add internal header guards
x86/sev: Move the internal header
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt updates from Borislav Petkov:
- A nice cleanup to the paravirt code containing a unification of the
paravirt clock interface, taming the include hell by splitting the
pv_ops structure and removing of a bunch of obsolete code (Juergen
Gross)
* tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted()
x86/paravirt: Remove trailing semicolons from alternative asm templates
x86/pvlocks: Move paravirt spinlock functions into own header
x86/paravirt: Specify pv_ops array in paravirt macros
x86/paravirt: Allow pv-calls outside paravirt.h
objtool: Allow multiple pv_ops arrays
x86/xen: Drop xen_mmu_ops
x86/xen: Drop xen_cpu_ops
x86/xen: Drop xen_irq_ops
x86/paravirt: Move pv_native_*() prototypes to paravirt.c
x86/paravirt: Introduce new paravirt-base.h header
x86/paravirt: Move paravirt_sched_clock() related code into tsc.c
x86/paravirt: Use common code for paravirt_steal_clock()
riscv/paravirt: Use common code for paravirt_steal_clock()
loongarch/paravirt: Use common code for paravirt_steal_clock()
arm64/paravirt: Use common code for paravirt_steal_clock()
arm/paravirt: Use common code for paravirt_steal_clock()
sched: Move clock related paravirt code to kernel/sched
paravirt: Remove asm/paravirt_api_clock.h
x86/paravirt: Move thunk macros to paravirt_types.h
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loader update from Borislav Petkov:
- Since debugging the microcode loader makes sense on baremetal too (it
was used in a guest only until now), extend it to be able to do that
too
* tag 'x86_microcode_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Allow loader debugging to be enabled on baremetal too
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- The usual set of cleanups and simplifications all over the tree
* tag 'x86_cleanups_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/segment: Use MOVL when reading segment registers
selftests/x86: Clean up sysret_rip coding style
x86/mm: Hide mm_free_global_asid() definition under CONFIG_BROADCAST_TLB_FLUSH
x86/crash: Use set_memory_p() instead of __set_memory_prot()
x86/CPU/AMD: Simplify the spectral chicken fix
x86/platform/olpc: Replace strcpy() with strscpy() in xo15_sci_add()
x86/split_lock: Remove dead string when split_lock_detect=fatal
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Extend the resctrl machinery to support telemetry monitoring on
Intel (Tony Luck)
The practical usage of this is being able to tell how much energy or
how much work can be attributed to a group of tasks tracked under a
single idenitifier. Prepend this work with proper refactoring of
resctrl domains handling code.
* tag 'x86_cache_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86,fs/resctrl: Update documentation for telemetry events
x86/resctrl: Enable RDT_RESOURCE_PERF_PKG
fs/resctrl: Move RMID initialization to first mount
x86,fs/resctrl: Compute number of RMIDs as minimum across resources
fs/resctrl: Move allocation/free of closid_num_dirty_rmid[]
x86/resctrl: Handle number of RMIDs supported by RDT_RESOURCE_PERF_PKG
x86/resctrl: Add energy/perf choices to rdt boot option
x86,fs/resctrl: Handle domain creation/deletion for RDT_RESOURCE_PERF_PKG
fs/resctrl: Refactor rmdir_mondata_subdir_allrdtgrp()
fs/resctrl: Refactor mkdir_mondata_subdir()
x86/resctrl: Read telemetry events
x86/resctrl: Find and enable usable telemetry events
x86,fs/resctrl: Add architectural event pointer
x86,fs/resctrl: Fill in details of events for performance and energy GUIDs
x86/resctrl: Discover hardware telemetry events
fs/resctrl: Emphasize that L3 monitoring resource is required for summing domains
x86,fs/resctrl: Add and initialize a resource for package scope monitoring
x86,fs/resctrl: Add an architectural hook called for first mount
x86,fs/resctrl: Support binary fixed point event counters
x86,fs/resctrl: Handle events that can be read from any CPU
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 alternatives update from Borislav Petkov:
- Reorganize the alternatives patching mechanism to patch a single
location only once instead of multiple times as it was the case with
the two or three alternative options macros
* tag 'x86_alternatives_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternative: Patch a single alternative location only once
x86/alternative: Use helper functions for patching alternatives
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 irq updates from Thomas Gleixner:
"Trivial cleanups for the posted MSI interrupt handling"
* tag 'x86-irq-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq_remapping: Sanitize posted_msi_supported()
x86/irq: Cleanup posted MSI code
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Ingo Molnar:
- CPU model updates (Andrew Cooper):
- amd: Correct the microcode table for Zenbleed
- amd: Use ZEN_MODEL_STEP_UCODE() for erratum_1386_microcode[]
- Drop vestigial PBE logic in AMD/Hygon/Centaur/Cyrix
- tsx: Set default TSX mode to auto (Nikolay Borisov)
- Drop unused Kconfig symbol X86_P6_NOP (Randy Dunlap)
* tag 'x86-cpu-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsx: Set default TSX mode to auto
x86/cpu: Drop unused Kconfig symbol X86_P6_NOP
x86/cpu: Drop vestigial PBE logic in AMD/Hygon/Centaur/Cyrix
x86/cpu/amd: Use ZEN_MODEL_STEP_UCODE() for erratum_1386_microcode[]
x86/cpu/amd: Correct the microcode table for Zenbleed
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC update from Ingo Molnar:
- Inline __x2apic_send_IPI_dest() (Eric Dumazet)
* tag 'x86-apic-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Inline __x2apic_send_IPI_dest()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/boot updates from Ingo Molnar:
- x86/acpi: Add acpi=spcr to use SPCR-provided default console
(Shenghao Yang)
- x86/acpi/boot: Correct the acpi_is_processor_usable() check again
(Yazen Ghannam)
- Refresh the x86 memory map (e820 table) handling code, and make the
printouts a bit more informative (Ingo Molnar)
* tag 'x86-boot-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86/acpi: Add acpi=spcr to use SPCR-provided default console
x86/boot/e820: Use <linux/sizes.h> symbols for literals
x86/boot/e820: Make sure e820_search_gap() finds all gaps
x86/boot/e820: Simplify the e820__range_remove() API
x86/boot/e820: Remove e820__range_remove()'s unused return parameter
x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables
x86/boot/e820: Standardize __init/__initdata tag placement
x86/boot/e820: Simplify & clarify __e820__range_add() a bit
x86/boot/e820: Rename gap_start/gap_size to max_gap_start/max_gap_start in e820_search_gap() et al
x86/boot/e820: Change e820_search_gap() to search for the highest-address PCI gap
x86/boot/e820: Clean up e820__setup_pci_gap()/e820_search_gap() a bit
x86/boot/e820: Change struct e820_table::nr_entries type from __u32 to u32
x86/boot/e820: Standardize e820 table index variable types under 'u32'
x86/boot/e820: Standardize e820 table index variable names under 'idx'
x86/boot/e820: Remove unnecessary header inclusions
x86/boot/e820: Clean up __refdata use a bit
x86/boot/e820: Clean up __e820__range_add() a bit
x86/boot/e820: Improve e820_print_type() messages
x86/boot/e820: Clean up confusing and self-contradictory verbiage around E820 related resource allocations
x86/boot/e820: Remove pointless early_panic() indirection
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Scheduler Kconfig space updates:
- Further consolidate configurable preemption modes (Peter Zijlstra)
Reduce the number of architectures that are allowed to offer
PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of
preemption models from four to just two: 'full' and 'lazy' on
up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86).
None and voluntary are only available as legacy features on
platforms that don't implement lazy preemption yet, or which don't
even support preemption.
The goal is to eventually remove cond_resched() and voluntary
preemption altogether.
RSEQ based 'scheduler time slice extension' support (Thomas Gleixner
and Peter Zijlstra):
This allows a thread to request a time slice extension when it enters
a critical section to avoid contention on a resource when the thread
is scheduled out inside of the critical section.
- Add fields and constants for time slice extension
- Provide static branch for time slice extensions
- Add statistics for time slice extensions
- Add prctl() to enable time slice extensions
- Implement sys_rseq_slice_yield()
- Implement syscall entry work for time slice extensions
- Implement time slice extension enforcement timer
- Reset slice extension when scheduled
- Implement rseq_grant_slice_extension()
- entry: Hook up rseq time slice extension
- selftests: Implement time slice extension test
- Allow registering RSEQ with slice extension
- Move slice_ext_nsec to debugfs
- Lower default slice extension
- selftests/rseq: Add rseq slice histogram script
Scheduler performance/scalability improvements:
- Update rq->avg_idle when a task is moved to an idle CPU, which
improves the scalability of various workloads (Shubhang Kaushik)
- Reorder fields in 'struct rq' for better caching (Blake Jones)
- Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde):
- Move checking for nohz cpus after time check
- Change likelyhood of nohz.nr_cpus
- Remove nohz.nr_cpus and use weight of cpumask instead
- Avoid false sharing for sched_clock_irqtime (Wangyang Guo)
- Cleanups (Yury Norov):
- Drop useless cpumask_empty() in find_energy_efficient_cpu()
- Simplify task_numa_find_cpu()
- Use cpumask_weight_and() in sched_balance_find_dst_group()
DL scheduler updates:
- Add a deadline server for sched_ext tasks (by Andrea Righi and Joel
Fernandes, with fixes by Peter Zijlstra)
RT scheduler updates:
- Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)
Entry code updates and performance improvements (Jinjie Ruan)
This is part of the scheduler tree in this cycle due to inter-
dependencies with the RSEQ based time slice extension work:
- Remove unused syscall argument from syscall_trace_enter()
- Rework syscall_exit_to_user_mode_work() for architecture reuse
- Add arch_ptrace_report_syscall_entry/exit()
- Inline syscall_exit_work() and syscall_trace_enter()
Scheduler core updates (Peter Zijlstra):
- Rework sched_class::wakeup_preempt() and rq_modified_*()
- Avoid rq->lock bouncing in sched_balance_newidle()
- Rename rcu_dereference_check_sched_domain() =>
rcu_dereference_sched_domain()
- <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar):
- Fold the sched_avg update
- Change rcu_dereference_check_sched_domain() to rcu-sched
- Switch to rcu_dereference_all()
- Remove superfluous rcu_read_lock()
- Limit hrtick work
- Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
- Clean up comments in 'struct cfs_rq'
- Separate se->vlag from se->vprot
- Rename cfs_rq::avg_load to cfs_rq::sum_weight
- Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
- Introduce and use the vruntime_cmp() and vruntime_op() wrappers for
wrapped-signed aritmetics
- Sort out 'blocked_load*' namespace noise
Scheduler debugging code updates:
- Export hidden tracepoints to modules (Gabriele Monaco)
- Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
(Fushuai Wang)
- Add assertions to QUEUE_CLASS (Peter Zijlstra)
- hrtimer: Fix tracing oddity (Thomas Gleixner)
Misc fixes and cleanups:
- Re-evaluate scheduling when migrating queued tasks out of throttled
cgroups (Zicheng Qu)
- Remove task_struct->faults_disabled_mapping (Christoph Hellwig)
- Fix math notation errors in avg_vruntime comment (Zhan Xusheng)
- sched/cpufreq: Use %pe format for PTR_ERR() printing
(zenghongling)"
* tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
sched/cpufreq: Use %pe format for PTR_ERR() printing
sched/rt: Skip currently executing CPU in rto_next_cpu()
sched/clock: Avoid false sharing for sched_clock_irqtime
selftests/sched_ext: Add test for DL server total_bw consistency
selftests/sched_ext: Add test for sched_ext dl_server
sched/debug: Fix dl_server (re)start conditions
sched/debug: Add support to change sched_ext server params
sched_ext: Add a DL server for sched_ext tasks
sched/debug: Stop and start server based on if it was active
sched/debug: Fix updating of ppos on server write ops
sched/deadline: Clear the defer params
entry: Inline syscall_exit_work() and syscall_trace_enter()
entry: Add arch_ptrace_report_syscall_entry/exit()
entry: Rework syscall_exit_to_user_mode_work() for architecture reuse
entry: Remove unused syscall argument from syscall_trace_enter()
sched: remove task_struct->faults_disabled_mapping
sched: Update rq->avg_idle when a task is moved to an idle CPU
selftests/rseq: Add rseq slice histogram script
hrtimer: Fix trace oddity
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance event updates from Ingo Molnar:
"x86 PMU driver updates:
- Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs
(Dapeng Mi)
Compared to previous iterations of the Intel PMU code, there's been
a lot of changes, which center around three main areas:
- Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the
Off-Core Response (OCR) facility
- New PEBS data source encoding layout
- Support the new "RDPMC user disable" feature
- Likewise, a large series adds uncore PMU support for Intel Diamond
Rapids (DMR) CPUs (Zide Chen)
This centers around these four main areas:
- DMR may have two Integrated I/O and Memory Hub (IMH) dies,
separate from the compute tile (CBB) dies. Each CBB and each IMH
die has its own discovery domain.
- Unlike prior CPUs that retrieve the global discovery table
portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON
discovery and MSR for CBB PMON discovery.
- DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR,
PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.
- IIO free-running counters in DMR are MMIO-based, unlike SPR.
- Also add support for Add missing PMON units for Intel Panther Lake,
and support Nova Lake (NVL), which largely maps to Panther Lake.
(Zide Chen)
- KVM integration: Add support for mediated vPMUs (by Kan Liang and
Sean Christopherson, with fixes and cleanups by Peter Zijlstra,
Sandipan Das and Mingwei Zhang)
- Add Intel cstate driver to support for Wildcat Lake (WCL) CPUs,
which are a low-power variant of Panther Lake (Zide Chen)
- Add core, cstate and MSR PMU support for the Airmont NP Intel CPU
(aka MaxLinear Lightning Mountain), which maps to the existing
Airmont code (Martin Schiller)
Performance enhancements:
- Speed up kexec shutdown by avoiding unnecessary cross CPU calls
(Jan H. Schönherr)
- Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim)
User-space stack unwinding support:
- Various cleanups and refactorings in preparation to generalize the
unwinding code for other architectures (Jens Remus)
Uprobes updates:
- Transition from kmap_atomic to kmap_local_page (Keke Ming)
- Fix incorrect lockdep condition in filter_chain() (Breno Leitao)
- Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov)
Misc fixes and cleanups:
- s390: Remove kvm_types.h from Kbuild (Randy Dunlap)
- x86/intel/uncore: Convert comma to semicolon (Chen Ni)
- x86/uncore: Clean up const mismatch (Greg Kroah-Hartman)
- x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)"
* tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
s390: remove kvm_types.h from Kbuild
uprobes: Fix incorrect lockdep condition in filter_chain()
x86/ibs: Fix typo in dc_l2tlb_miss comment
x86/uprobes: Fix XOL allocation failure for 32-bit tasks
perf/x86/intel/uncore: Convert comma to semicolon
perf/x86/intel: Add support for rdpmc user disable feature
perf/x86: Use macros to replace magic numbers in attr_rdpmc
perf/x86/intel: Add core PMU support for Novalake
perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL
perf/x86/intel: Add core PMU support for DMR
perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR
perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL
perf/core: Fix slow perf_event_task_exit() with LBR callstacks
perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls
uprobes: use kmap_local_page() for temporary page mappings
arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
perf/x86/intel/uncore: Add Nova Lake support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Support associating BPF program with struct_ops (Amery Hung)
- Switch BPF local storage to rqspinlock and remove recursion detection
counters which were causing false positives (Amery Hung)
- Fix live registers marking for indirect jumps (Anton Protopopov)
- Introduce execution context detection BPF helpers (Changwoo Min)
- Improve verifier precision for 32bit sign extension pattern
(Cupertino Miranda)
- Optimize BTF type lookup by sorting vmlinux BTF and doing binary
search (Donglin Peng)
- Allow states pruning for misc/invalid slots in iterator loops (Eduard
Zingerman)
- In preparation for ASAN support in BPF arenas teach libbpf to move
global BPF variables to the end of the region and enable arena kfuncs
while holding locks (Emil Tsalapatis)
- Introduce support for implicit arguments in kfuncs and migrate a
number of them to new API. This is a prerequisite for cgroup
sub-schedulers in sched-ext (Ihor Solodrai)
- Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)
- Fix ORC stack unwind from kprobe_multi (Jiri Olsa)
- Speed up fentry attach by using single ftrace direct ops in BPF
trampolines (Jiri Olsa)
- Require frozen map for calculating map hash (KP Singh)
- Fix lock entry creation in TAS fallback in rqspinlock (Kumar
Kartikeya Dwivedi)
- Allow user space to select cpu in lookup/update operations on per-cpu
array and hash maps (Leon Hwang)
- Make kfuncs return trusted pointers by default (Matt Bobrowski)
- Introduce "fsession" support where single BPF program is executed
upon entry and exit from traced kernel function (Menglong Dong)
- Allow bpf_timer and bpf_wq use in all programs types (Mykyta
Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
Starovoitov)
- Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
definition across the tree (Puranjay Mohan)
- Allow BPF arena calls from non-sleepable context (Puranjay Mohan)
- Improve register id comparison logic in the verifier and extend
linked registers with negative offsets (Puranjay Mohan)
- In preparation for BPF-OOM introduce kfuncs to access memcg events
(Roman Gushchin)
- Use CFI compatible destructor kfunc type (Sami Tolvanen)
- Add bitwise tracking for BPF_END in the verifier (Tianci Cao)
- Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
Tang)
- Make BPF selftests work with 64k page size (Yonghong Song)
* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
selftests/bpf: Fix outdated test on storage->smap
selftests/bpf: Choose another percpu variable in bpf for btf_dump test
selftests/bpf: Remove test_task_storage_map_stress_lookup
selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
selftests/bpf: Update task_local_storage/recursion test
selftests/bpf: Update sk_storage_omem_uncharge test
bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
bpf: Support lockless unlink when freeing map or local storage
bpf: Prepare for bpf_selem_unlink_nofail()
bpf: Remove unused percpu counter from bpf_local_storage_map_free
bpf: Remove cgroup local storage percpu counter
bpf: Remove task local storage percpu counter
bpf: Change local_storage->lock and b->lock to rqspinlock
bpf: Convert bpf_selem_unlink to failable
bpf: Convert bpf_selem_link_map to failable
bpf: Convert bpf_selem_unlink_map to failable
bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
selftests/xsk: fix number of Tx frags in invalid packet
selftests/xsk: properly handle batch ending in the middle of a packet
bpf: Prevent reentrance into call_rcu_tasks_trace()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
- Quirk the broken EFI framebuffer geometry on the Valve Steam Deck
- Capture the EDID information of the primary display also on non-x86
EFI systems when booting via the EFI stub.
* tag 'efi-next-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: Support EDID information
sysfb: Move edid_info into sysfb_primary_display
sysfb: Pass sysfb_primary_display to devices
sysfb: Replace screen_info with sysfb_primary_display
sysfb: Add struct sysfb_display_info
efi: sysfb_efi: Reduce number of references to global screen_info
efi: earlycon: Reduce number of references to global screen_info
efi: sysfb_efi: Fix efidrmfb and simpledrmfb on Valve Steam Deck
efi: sysfb_efi: Convert swap width and height quirk to a callback
efi: sysfb_efi: Fix lfb_linelength calculation when applying quirks
efi: sysfb_efi: Replace open coded swap with the macro
|
|
Read-mostly sched_clock_irqtime may share the same cacheline with
frequently updated nohz struct. Make it as static_key to avoid
false sharing issue.
The only user of disable_sched_clock_irqtime()
is tsc_.*mark_unstable() which may be invoked under atomic context
and require a workqueue to disable static_key. But both of them
calls clear_sched_clock_stable() just before doing
disable_sched_clock_irqtime(). We can reuse
"sched_clock_work" to also disable sched_clock_irqtime().
One additional case need to handle is if the tsc is marked unstable
before late_initcall() phase, sched_clock_work will not be invoked
and sched_clock_irqtime will stay enabled although clock is unstable:
tsc_init()
enable_sched_clock_irqtime() # irqtime accounting is enabled here
...
if (unsynchronized_tsc()) # true
mark_tsc_unstable()
clear_sched_clock_stable()
__sched_clock_stable_early = 0;
...
if (static_key_count(&sched_clock_running.key) == 2)
# Only happens at sched_clock_init_late()
__clear_sched_clock_stable(); # Never executed
...
# late_initcall() phase
sched_clock_init_late()
if (__sched_clock_stable_early) # Already false
__set_sched_clock_stable(); # sched_clock is never marked stable
# TSC unstable, but sched_clock_work won't run to disable irqtime
So we need to disable_sched_clock_irqtime() in sched_clock_init_late()
if clock is unstable.
Reported-by: Benjamin Lei <benjamin.lei@intel.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260127072509.2627346-1-wangyang.guo@intel.com
|
|
The BSS_DECRYPTED section that gets emitted into .bss will be empty if
CONFIG_AMD_MEM_ENCRYPT is not defined. However, due to the fact that it
is injected into .bss rather than emitted as a separate section, the
2 MiB alignment that it specifies is still taken into account
unconditionally, pushing .bss out to the next 2 MiB boundary, leaving
a gap that is never freed.
So only emit a non-empty BSS_DECRYPTED section if it is going to be
used. In that case, it would still be nice to free the padding, but
that is left for later.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260108092526.28586-23-ardb@kernel.org
|
|
The previous change (Fixes commit) messed up the rsp register value,
which is wrong because it's already adjusted with FRAME_SIZE, we need
the original rsp value.
This change does not affect fprobe current kernel unwind, the !perf_hw_regs
path perf_callchain_kernel:
if (perf_hw_regs(regs)) {
if (perf_callchain_store(entry, regs->ip))
return;
unwind_start(&state, current, regs, NULL);
} else {
unwind_start(&state, current, NULL, (void *)regs->sp);
}
which uses pt_regs.sp as first_frame boundary (FRAME_SIZE shift makes
no difference, unwind stil stops at the right frame).
This change fixes the other path when we want to unwind directly from
pt_regs sp/fp/ip state, which is coming in following change.
Fixes: 20a0bc10272f ("x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20260126211837.472802-2-jolsa@kernel.org
|
|
Avoid one call/ret in networking RPS/RFS fast path, with no space cost.
scripts/bloat-o-meter -t vmlinux.before vmlinux.after
add/remove: 0/2 grow/shrink: 2/0 up/down: 96/-97 (-1)
Function old new delta
x2apic_send_IPI 146 198 +52
__x2apic_send_IPI_mask 326 370 +44
__pfx___x2apic_send_IPI_dest 16 - -16
__x2apic_send_IPI_dest 81 - -81
Total: Before=20861234, After=20861233, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20251223052026.4141498-1-edumazet@google.com
|
|
Pick up upstream work and
d9b40d7262a2 ("selftests/x86: Add selftests include path for kselftest.h after centralization")
especially which is a build fix needed for a selftests cleanup coming
ontop of this.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix resctrl initialization on Hygon CPUs
- Fix resctrl memory bandwidth counters on Hygon CPUs
- Fix x86 self-tests build bug
* tag 'x86-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/x86: Add selftests include path for kselftest.h after centralization
x86/resctrl: Fix memory bandwidth counter width for Hygon
x86/resctrl: Add missing resctrl initialization for Hygon
|
|
The SPCR provided console on x86 is only available as a boot console when
earlycon is provided on the kernel command line, and will not be present in
/proc/consoles.
While it's possible to retain the boot console with the keep_bootcon
parameter, that leaves the console using the less efficient 8250_early driver.
Users wanting to use the firmware suggested console (to avoid maintaining
unique serial console parameters for different server models in large fleets)
with the conventional driver have to parse the kernel log for the console
parameters and reinsert them.
[ 0.005091] ACPI: SPCR 0x000000007FFB5000 000059 (v04 ALASKA A M I 01072009 INTL 20250404)
[ 0.073387] ACPI: SPCR: console: uart,io,0x3f8,115200
In commit
0231d00082f6 ("ACPI: SPCR: Make SPCR available to x86")¹
the SPCR console was only added as an option for earlycon but not as an
ordinary console so users don't see console output changes.
So users can opt in to an automatic SPCR console, make ACPI init add it if
acpi=spcr is set.
¹https://lore.kernel.org/lkml/20180118150951.28964-1-prarit@redhat.com/
[ bp: Touchups. ]
Signed-off-by: Shenghao Yang <me@shenghaoyang.info>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260117072827.355360-1-me@shenghaoyang.info
|
|
This script
#!/usr/bin/bash
echo 0 > /proc/sys/kernel/randomize_va_space
echo 'void main(void) {}' > TEST.c
# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
gcc -m32 -fcf-protection=branch TEST.c -o test
bpftrace -e 'uprobe:./test:main {}' -c ./test
"hangs", the probed ./test task enters an endless loop.
The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.
arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.
handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.
I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.
But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.
Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
|
|
When a stack overflow occurs, the kernel prints hashed fault address and the
stack range using %p. The actual addresses are required for debugging and
hashed pointers provide no useful information in this context.
Use %px to print the unhashed, raw addresses.
Signed-off-by: Ryosuke Yasuoka <ryasuoka@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251224070735.454816-1-ryasuoka@redhat.com
|
|
Debugging the loader on baremetal does make sense, so enable it there
too.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260108165028.27417-1-bp@kernel.org
|
|
In most cases, the use of "fast 32-bit system call" depends either on
X86_FEATURE_SEP or X86_FEATURE_SYSENTER32 || X86_FEATURE_SYSCALL32.
However, nearly all the logic for both is identical.
Define X86_FEATURE_SYSFAST32 which indicates that *either* SYSENTER32 or
SYSCALL32 should be used, for either 32- or 64-bit kernels. This
defaults to SYSENTER; use SYSCALL if the SYSCALL32 bit is also set.
As this removes ALL existing uses of X86_FEATURE_SYSENTER32, which is
a kernel-only synthetic feature bit, simply remove it and replace it
with X86_FEATURE_SYSFAST32.
This leaves an unused alternative for a true 32-bit kernel, but that
should really not matter in any way.
The clearing of X86_FEATURE_SYSCALL32 can be removed once the patches
for automatically clearing disabled features has been merged.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251216212606.1325678-10-hpa@zytor.com
|
|
The vdso32 sigreturn.S contains open-coded DWARF bytecode, which
includes a hack for gdb to not try to step back to a previous call
instruction when backtracing from a signal handler.
Neither of those are necessary anymore: the backtracing issue is
handled by ".cfi_entry simple" and ".cfi_signal_frame", both of which
have been supported for a very long time now, which allows the
remaining frame to be built using regular .cfi annotations.
Add a few more register offsets to the signal frame just for good
measure.
Replace the nop on fallthrough of the system call (which should never,
ever happen) with a ud2a trap.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251216212606.1325678-7-hpa@zytor.com
|
|
The vdso .so files are named vdso*.so. These structures are binary
images and descriptions of these files, so it is more consistent for
them to have a naming that more directly mirrors the filenames.
It is also very slightly more compact (by one character...) and
simplifies the Makefile just a little bit.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251216212606.1325678-2-hpa@zytor.com
|
|
Pull x86 kvm fixes from Paolo Bonzini:
- Avoid freeing stack-allocated node in kvm_async_pf_queue_task
- Clear XSTATE_BV[i] in guest XSAVE state whenever XFD[i]=1
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
selftests: kvm: Verify TILELOADD actually #NM faults when XFD[18]=1
selftests: kvm: try getting XFD and XSAVE state out of sync
selftests: kvm: replace numbered sync points with actions
x86/fpu: Clear XSTATE_BV[i] in guest XSAVE state whenever XFD[i]=1
x86/kvm: Avoid freeing stack-allocated node in kvm_async_pf_queue_task
|
|
The memory bandwidth calculation relies on reading the hardware counter
and measuring the delta between samples. To ensure accurate measurement,
the software reads the counter frequently enough to prevent it from
rolling over twice between reads.
The default Memory Bandwidth Monitoring (MBM) counter width is 24 bits.
Hygon CPUs provide a 32-bit width counter, but they do not support the
MBM capability CPUID leaf (0xF.[ECX=1]:EAX) to report the width offset
(from 24 bits).
Consequently, the kernel falls back to the 24-bit default counter width,
which causes incorrect overflow handling on Hygon CPUs.
Fix this by explicitly setting the counter width offset to 8 bits (resulting
in a 32-bit total counter width) for Hygon CPUs.
Fixes: d8df126349da ("x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper")
Signed-off-by: Xiaochen Shen <shenxiaochen@open-hieco.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251209062650.1536952-3-shenxiaochen@open-hieco.net
|
|
Hygon CPUs supporting Platform QoS features currently undergo partial resctrl
initialization through resctrl_cpu_detect() in the Hygon BSP init helper and
AMD/Hygon common initialization code. However, several critical data
structures remain uninitialized for Hygon CPUs in the following paths:
- get_mem_config()-> __rdt_get_mem_config_amd():
rdt_resource::membw,alloc_capable
hw_res::num_closid
- rdt_init_res_defs()->rdt_init_res_defs_amd():
rdt_resource::cache
hw_res::msr_base,msr_update
Add the missing AMD/Hygon common initialization to ensure proper Platform QoS
functionality on Hygon CPUs.
Fixes: d8df126349da ("x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper")
Signed-off-by: Xiaochen Shen <shenxiaochen@open-hieco.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251209062650.1536952-2-shenxiaochen@open-hieco.net
|
|
set_memory_p() is available to use outside of its compilation unit since:
030ad7af9437 ("x86/mm: Regularize set_memory_p() parameters and make non-static").
There is no use for __set_memory_prot() anymore so drop it too.
[ bp: Massage commit message. ]
Signed-off-by: Coiby Xu <coxu@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260106095100.656292-1-coxu@redhat.com
|
|
Instead of having the pv spinlock function definitions in paravirt.h,
move them into the new header paravirt-spinlock.h.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260105110520.21356-22-jgross@suse.com
|
|
The only reason the pv_native_*() prototypes are needed is the complete
definition of those functions via an asm() statement, which makes it
impossible to have those functions as static ones.
Move the prototypes from paravirt_types.h into paravirt.c, which is the
only source referencing the functions.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20260105110520.21356-15-jgross@suse.com
|
|
The only user of paravirt_sched_clock() is in tsc.c, so move the code
from paravirt.c and paravirt.h to tsc.c.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-13-jgross@suse.com
|
|
Remove the arch-specific variant of paravirt_steal_clock() and use
the common one instead.
With all archs supporting Xen now having been switched to the common
variant, including paravirt.h can be dropped from drivers/xen/time.c.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-12-jgross@suse.com
|
|
Paravirt clock related functions are available in multiple archs.
In order to share the common parts, move the common static keys
to kernel/sched/ and remove them from the arch specific files.
Make a common paravirt_steal_clock() implementation available in
kernel/sched/cputime.c, guarding it with a new config option
CONFIG_HAVE_PV_STEAL_CLOCK_GEN, which can be selected by an arch
in case it wants to use that common variant.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-7-jgross@suse.com
|
|
In some places asm/paravirt.h is included without really being needed.
Remove the related #include statements.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-2-jgross@suse.com
|
|
In a vain attempt to consolidate the email zoo switch everything to the
kernel.org account.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since telemetry events are enumerated on resctrl mount the RDT_RESOURCE_PERF_PKG
resource is not considered "monitoring capable" during early resctrl initialization.
This means that the domain list for RDT_RESOURCE_PERF_PKG is not built when the CPU
hotplug notifiers are registered and run for the first time right after resctrl
initialization.
Mark the RDT_RESOURCE_PERF_PKG as "monitoring capable" upon successful telemetry
event enumeration to ensure future CPU hotplug events include this resource and
initialize its domain list for CPUs that are already online.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
resctrl assumes that only the L3 resource supports monitor events, so it
simply takes the rdt_resource::num_rmid from RDT_RESOURCE_L3 as the system's
number of RMIDs.
The addition of telemetry events in a different resource breaks that
assumption.
Compute the number of available RMIDs as the minimum value across all
mon_capable resources (analogous to how the number of CLOSIDs is computed
across alloc_capable resources).
Note that mount time enumeration of the telemetry resource means that
this number can be reduced. If this happens, then some memory will
be wasted as the allocations for rdt_l3_mon_domain::mbm_states[] and
rdt_l3_mon_domain::rmid_busy_llc created during resctrl initialization will
be larger than needed.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
There are now three meanings for "number of RMIDs":
1) The number for legacy features enumerated by CPUID leaf 0xF. This is the
maximum number of distinct values that can be loaded into MSR_IA32_PQR_ASSOC.
Note that systems with Sub-NUMA Cluster mode enabled will force scaling down
the CPUID enumerated value by the number of SNC nodes per L3-cache.
2) The number of registers in MMIO space for each event. This is enumerated in
the XML files and is the value initialized into event_group::num_rmid.
3) The number of "hardware counters" (this isn't a strictly accurate
description of how things work, but serves as a useful analogy that does
describe the limitations) feeding to those MMIO registers. This is enumerated
in telemetry_region::num_rmids returned by intel_pmt_get_regions_by_feature().
Event groups with insufficient "hardware counters" to track all RMIDs are
difficult for users to use, since the system may reassign "hardware counters"
at any time. This means that users cannot reliably collect two consecutive
event counts to compute the rate at which events are occurring.
Disable such event groups by default. The user may override this with
a command line "rdt=" option. In this case limit an under-resourced event
group's number of possible monitor resource groups to the lowest number of
"hardware counters".
Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG resource
"num_rmid" value to the smallest of these values as this value will be used
later to compare against the number of RMIDs supported by other resources to
determine how many monitoring resource groups are supported.
N.B. Change type of resctrl_mon::num_rmid to u32 to match its usage and the
type of event_group::num_rmid so that min(r->num_rmid, e->num_rmid) won't
complain about mixing signed and unsigned types.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
When loading guest XSAVE state via KVM_SET_XSAVE, and when updating XFD in
response to a guest WRMSR, clear XFD-disabled features in the saved (or to
be restored) XSTATE_BV to ensure KVM doesn't attempt to load state for
features that are disabled via the guest's XFD. Because the kernel
executes XRSTOR with the guest's XFD, saving XSTATE_BV[i]=1 with XFD[i]=1
will cause XRSTOR to #NM and panic the kernel.
E.g. if fpu_update_guest_xfd() sets XFD without clearing XSTATE_BV:
------------[ cut here ]------------
WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#29: amx_test/848
Modules linked in: kvm_intel kvm irqbypass
CPU: 29 UID: 1000 PID: 848 Comm: amx_test Not tainted 6.19.0-rc2-ffa07f7fd437-x86_amx_nm_xfd_non_init-vm #171 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:exc_device_not_available+0x101/0x110
Call Trace:
<TASK>
asm_exc_device_not_available+0x1a/0x20
RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
switch_fpu_return+0x4a/0xb0
kvm_arch_vcpu_ioctl_run+0x1245/0x1e40 [kvm]
kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
__x64_sys_ioctl+0x8f/0xd0
do_syscall_64+0x62/0x940
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
---[ end trace 0000000000000000 ]---
This can happen if the guest executes WRMSR(MSR_IA32_XFD) to set XFD[18] = 1,
and a host IRQ triggers kernel_fpu_begin() prior to the vmexit handler's
call to fpu_update_guest_xfd().
and if userspace stuffs XSTATE_BV[i]=1 via KVM_SET_XSAVE:
------------[ cut here ]------------
WARNING: arch/x86/kernel/traps.c:1524 at exc_device_not_available+0x101/0x110, CPU#14: amx_test/867
Modules linked in: kvm_intel kvm irqbypass
CPU: 14 UID: 1000 PID: 867 Comm: amx_test Not tainted 6.19.0-rc2-2dace9faccd6-x86_amx_nm_xfd_non_init-vm #168 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:exc_device_not_available+0x101/0x110
Call Trace:
<TASK>
asm_exc_device_not_available+0x1a/0x20
RIP: 0010:restore_fpregs_from_fpstate+0x36/0x90
fpu_swap_kvm_fpstate+0x6b/0x120
kvm_load_guest_fpu+0x30/0x80 [kvm]
kvm_arch_vcpu_ioctl_run+0x85/0x1e40 [kvm]
kvm_vcpu_ioctl+0x2c3/0x8f0 [kvm]
__x64_sys_ioctl+0x8f/0xd0
do_syscall_64+0x62/0x940
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
---[ end trace 0000000000000000 ]---
The new behavior is consistent with the AMX architecture. Per Intel's SDM,
XSAVE saves XSTATE_BV as '0' for components that are disabled via XFD
(and non-compacted XSAVE saves the initial configuration of the state
component):
If XSAVE, XSAVEC, XSAVEOPT, or XSAVES is saving the state component i,
the instruction does not generate #NM when XCR0[i] = IA32_XFD[i] = 1;
instead, it operates as if XINUSE[i] = 0 (and the state component was
in its initial state): it saves bit i of XSTATE_BV field of the XSAVE
header as 0; in addition, XSAVE saves the initial configuration of the
state component (the other instructions do not save state component i).
Alternatively, KVM could always do XRSTOR with XFD=0, e.g. by using
a constant XFD based on the set of enabled features when XSAVEing for
a struct fpu_guest. However, having XSTATE_BV[i]=1 for XFD-disabled
features can only happen in the above interrupt case, or in similar
scenarios involving preemption on preemptible kernels, because
fpu_swap_kvm_fpstate()'s call to save_fpregs_to_fpstate() saves the
outgoing FPU state with the current XFD; and that is (on all but the
first WRMSR to XFD) the guest XFD.
Therefore, XFD can only go out of sync with XSTATE_BV in the above
interrupt case, or in similar scenarios involving preemption on
preemptible kernels, and it we can consider it (de facto) part of KVM
ABI that KVM_GET_XSAVE returns XSTATE_BV[i]=0 for XFD-disabled features.
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 820a6ee944e7 ("kvm: x86: Add emulation for IA32_XFD", 2022-01-14)
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Move clearing of XSTATE_BV from fpu_copy_uabi_to_guest_fpstate
to kvm_vcpu_ioctl_x86_set_xsave. - Paolo]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Legacy resctrl features are enumerated by X86_FEATURE_* flags. These may be
overridden by quirks to disable features in the case of errata. Users can use
kernel command line options to either disable a feature, or to force enable
a feature that was disabled by a quirk.
A different approach is needed for hardware features that do not have an
X86_FEATURE_* flag.
Update parsing of the "rdt=" boot parameter to call the telemetry driver
directly to handle new "perf" and "energy" options that controls activation of
telemetry monitoring of the named type. By itself a "perf" or "energy" option
controls the forced enabling or disabling (with ! prefix) of all event groups
of the named type. A ":guid" suffix allows for fine grained control per event
group.
[ bp: s/intel_aet_option/intel_handle_aet_option/g ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
The L3 resource has several requirements for domains. There are per-domain
structures that hold the 64-bit values of counters, and elements to keep
track of the overflow and limbo threads.
None of these are needed for the PERF_PKG resource. The hardware counters
are wide enough that they do not wrap around for decades.
Define a new rdt_perf_pkg_mon_domain structure which just consists of the
standard rdt_domain_hdr to keep track of domain id and CPU mask.
Update resctrl_online_mon_domain() for RDT_RESOURCE_PERF_PKG. The only action
needed for this resource is to create and populate domain directories if a
domain is added while resctrl is mounted.
Similarly resctrl_offline_mon_domain() only needs to remove domain directories.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Introduce intel_aet_read_event() to read telemetry events for resource
RDT_RESOURCE_PERF_PKG. There may be multiple aggregators tracking each
package, so scan all of them and add up all counters. Aggregators may return
an invalid data indication if they have received no records for a given RMID.
The user will see "Unavailable" if none of the aggregators on a package
provide valid counts.
Resctrl now uses readq() so depends on X86_64. Update Kconfig.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
Every event group has a private copy of the data of all telemetry event
aggregators (aka "telemetry regions") tracking its feature type. Included
may be regions that have the same feature type but tracking different GUID
from the event group's.
Traverse the event group's telemetry region data and mark all regions that
are not usable by the event group as unusable by clearing those regions'
MMIO addresses. A region is considered unusable if:
1) GUID does not match the GUID of the event group.
2) Package ID is invalid.
3) The enumerated size of the MMIO region does not match the expected
value from the XML description file.
Hereafter any telemetry region with an MMIO address is considered valid for
the event group it is associated with.
Enable all the event group's events as long as there is at least one usable
region from where data for its events can be read. Enabling of an event can
fail if the same event has already been enabled as part of another event
group. It should never happen that the same event is described by different
GUID supported by the same system so just WARN (via resctrl_enable_mon_event())
and skip the event.
Note that it is architecturally possible that some telemetry events are only
supported by a subset of the packages in the system. It is not expected that
systems will ever do this. If they do the user will see event files in resctrl
that always return "Unavailable".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
The resctrl file system layer passes the domain, RMID, and event id to the
architecture to fetch an event counter.
Fetching a telemetry event counter requires additional information that is
private to the architecture, for example, the offset into MMIO space from
where the counter should be read.
Add mon_evt::arch_priv that architecture can use for any private data related
to the event. The resctrl filesystem initializes mon_evt::arch_priv when the
architecture enables the event and passes it back to architecture when needing
to fetch an event counter.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|
|
The telemetry event aggregators of the Intel Clearwater Forest CPU support two
RMID-based feature types: "energy" with GUID 0x26696143¹, and "perf" with
GUID 0x26557651².
The event counter offsets in an aggregator's MMIO space are arranged in groups
for each RMID.
E.g., the "energy" counters for GUID 0x26696143 are arranged like this:
MMIO offset:0x0000 Counter for RMID 0 PMT_EVENT_ENERGY
MMIO offset:0x0008 Counter for RMID 0 PMT_EVENT_ACTIVITY
MMIO offset:0x0010 Counter for RMID 1 PMT_EVENT_ENERGY
MMIO offset:0x0018 Counter for RMID 1 PMT_EVENT_ACTIVITY
...
MMIO offset:0x23F0 Counter for RMID 575 PMT_EVENT_ENERGY
MMIO offset:0x23F8 Counter for RMID 575 PMT_EVENT_ACTIVITY
After all counters there are three status registers that provide indications
of how many times an aggregator was unable to process event counts, the time
stamp for the most recent loss of data, and the time stamp of the most recent
successful update.
MMIO offset:0x2400 AGG_DATA_LOSS_COUNT
MMIO offset:0x2408 AGG_DATA_LOSS_TIMESTAMP
MMIO offset:0x2410 LAST_UPDATE_TIMESTAMP
Define event_group structures for both of these aggregator types and define
the events tracked by the aggregators in the file system code.
PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed point format.
File system code must output as floating point values.
¹https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
²https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
|