summaryrefslogtreecommitdiff
path: root/tools/include
AgeCommit message (Collapse)Author
2025-12-03Merge tag 'nolibc-20251130-for-6.19-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nolibc/linux-nolibc Pull nolibc updates from Thomas Weißschuh: - Preparations to the use of nolibc in UML: - Cleanup of sparse warnings - Library mode without _start() - More consistency when disabling errno - Unconditional installation of all architecture support files - Always 64-bit wide ino_t and off_t - Various cleanups and bug fixes * tag 'nolibc-20251130-for-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/nolibc/linux-nolibc: (25 commits) selftests/nolibc: error out on linker warnings selftests/nolibc: use lld to link loongarch binaries tools/nolibc: remove more __nolibc_enosys() fallbacks tools/nolibc: remove now superfluous overflow check in llseek tools/nolibc: use 64-bit off_t tools/nolibc: prefer the llseek syscall tools/nolibc: handle 64-bit off_t for llseek tools/nolibc: use 64-bit ino_t tools/nolibc: avoid using plain integer as NULL pointer tools/nolibc: add support for fchdir() tools/nolibc: clean up outdated comments in generic arch.h tools/nolibc: make the "headers" target install all supported archs tools/nolibc: add the more portable inttypes.h tools/nolibc: provide the portable sys/select.h tools/nolibc: add missing memchr() to string.h tools/nolibc: fix misleading help message regarding installation path tools/nolibc: add uio.h with readv and writev tools/nolibc: add option to disable runtime tools/nolibc: use __fallthrough__ rather than fallthrough tools/nolibc: implement %m if errno is not defined ...
2025-12-02Merge tag 's390-6.19-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Heiko Carstens: - Provide a new interface for dynamic configuration and deconfiguration of hotplug memory, allowing with and without memmap_on_memory support. This makes the way memory hotplug is handled on s390 much more similar to other architectures - Remove compat support. There shouldn't be any compat user space around anymore, therefore get rid of a lot of code which also doesn't need to be tested anymore - Add stackprotector support. GCC 16 will get new compiler options, which allow to generate code required for kernel stackprotector support - Merge pai_crypto and pai_ext PMU drivers into a new driver. This removes a lot of duplicated code. The new driver is also extendable and allows to support new PMUs - Add driver override support for AP queues - Rework and extend zcrypt and AP trace events to allow for tracing of crypto requests - Support block sizes larger than 65535 bytes for CCW tape devices - Since the rework of the virtual kernel address space the module area and the kernel image are within the same 4GB area. This eliminates the need of weak per cpu variables. Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU - Various other small improvements and fixes * tag 's390-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (92 commits) watchdog: diag288_wdt: Remove KMSG_COMPONENT macro s390/entry: Use lay instead of aghik s390/vdso: Get rid of -m64 flag handling s390/vdso: Rename vdso64 to vdso s390: Rename head64.S to head.S s390/vdso: Use common STABS_DEBUG and DWARF_DEBUG macros s390: Add stackprotector support s390/modules: Simplify module_finalize() slightly s390: Remove KMSG_COMPONENT macro s390/percpu: Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU s390/ap: Restrict driver_override versus apmask and aqmask use s390/ap: Rename mutex ap_perms_mutex to ap_attr_mutex s390/ap: Support driver_override for AP queue devices s390/ap: Use all-bits-one apmask/aqmask for vfio in_use() checks s390/debug: Update description of resize operation s390/syscalls: Switch to generic system call table generation s390/syscalls: Remove system call table pointer from thread_struct s390/uapi: Remove 31 bit support from uapi header files s390: Remove compat support tools: Remove s390 compat support ...
2025-12-02tools headers UAPI: Sync linux/perf_event.h for deferred callchainsNamhyung Kim
It needs to sync with the kernel to support user space changes for the deferred callchains. Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-12-01Merge tag 'perf-core-2025-12-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Callchain support: - Add support for deferred user-space stack unwinding for perf, enabled on x86. (Peter Zijlstra, Steven Rostedt) - unwind_user/x86: Enable frame pointer unwinding on x86 (Josh Poimboeuf) x86 PMU support and infrastructure: - x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra) - x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra) Intel PMU driver: - Large series to prepare for and implement architectural PEBS support for Intel platforms such as Clearwater Forest (CWF) and Panther Lake (PTL). (Dapeng Mi, Kan Liang) - Check dynamic constraints (Kan Liang) - Optimize PEBS extended config (Peter Zijlstra) - cstates: - Remove PC3 support from LunarLake (Zhang Rui) - Add Pantherlake support (Zhang Rui) - Clearwater Forest support (Zide Chen) AMD PMU driver: - x86/amd: Check event before enable to avoid GPF (George Kennedy) Fixes and cleanups: - task_work: Fix NMI race condition (Peter Zijlstra) - perf/x86: Fix NULL event access and potential PEBS record loss (Dapeng Mi) - Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter Zijlstra)" * tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use perf/x86/intel: Optimize PEBS extended config perf/x86/intel: Check PEBS dyn_constraints perf/x86/intel: Add a check for dynamic constraints perf/x86/intel: Add counter group support for arch-PEBS perf/x86/intel: Setup PEBS data configuration and enable legacy groups perf/x86/intel: Update dyn_constraint base on PEBS event precise level perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR perf/x86/intel: Process arch-PEBS records or record fragments perf/x86/intel/ds: Factor out PEBS group processing code to functions perf/x86/intel/ds: Factor out PEBS record processing code to functions perf/x86/intel: Initialize architectural PEBS perf/x86/intel: Correct large PEBS flag check perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call perf/x86: Fix NULL event access and potential PEBS record loss perf/x86: Remove redundant is_x86_event() prototype entry,unwind/deferred: Fix unwind_reset_info() placement unwind_user/x86: Fix arch=um build perf: Support deferred user unwind unwind_user/x86: Teach FP unwind about start of function ...
2025-12-01Merge tag 'objtool-core-2025-12-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - klp-build livepatch module generation (Josh Poimboeuf) Introduce new objtool features and a klp-build script to generate livepatch modules using a source .patch as input. This builds on concepts from the longstanding out-of-tree kpatch project which began in 2012 and has been used for many years to generate livepatch modules for production kernels. However, this is a complete rewrite which incorporates hard-earned lessons from 12+ years of maintaining kpatch. Key improvements compared to kpatch-build: - Integrated with objtool: Leverages objtool's existing control-flow graph analysis to help detect changed functions. - Works on vmlinux.o: Supports late-linked objects, making it compatible with LTO, IBT, and similar. - Simplified code base: ~3k fewer lines of code. - Upstream: No more out-of-tree #ifdef hacks, far less cruft. - Cleaner internals: Vastly simplified logic for symbol/section/reloc inclusion and special section extraction. - Robust __LINE__ macro handling: Avoids false positive binary diffs caused by the __LINE__ macro by introducing a fix-patch-lines script which injects #line directives into the source .patch to preserve the original line numbers at compile time. - Disassemble code with libopcodes instead of running objdump (Alexandre Chartre) - Disassemble support (-d option to objtool) by Alexandre Chartre, which supports the decoding of various Linux kernel code generation specials such as alternatives: 17ef: sched_balance_find_dst_group+0x62f mov 0x34(%r9),%edx 17f3: sched_balance_find_dst_group+0x633 | <alternative.17f3> | X86_FEATURE_POPCNT 17f3: sched_balance_find_dst_group+0x633 | call 0x17f8 <__sw_hweight64> | popcnt %rdi,%rax 17f8: sched_balance_find_dst_group+0x638 cmp %eax,%edx ... jump table alternatives: 1895: sched_use_asym_prio+0x5 test $0x8,%ch 1898: sched_use_asym_prio+0x8 je 0x18a9 <sched_use_asym_prio+0x19> 189a: sched_use_asym_prio+0xa | <jump_table.189a> | JUMP 189a: sched_use_asym_prio+0xa | jmp 0x18ae <sched_use_asym_prio+0x1e> | nop2 189c: sched_use_asym_prio+0xc mov $0x1,%eax 18a1: sched_use_asym_prio+0x11 and $0x80,%ecx ... exception table alternatives: native_read_msr: 5b80: native_read_msr+0x0 mov %edi,%ecx 5b82: native_read_msr+0x2 | <ex_table.5b82> | EXCEPTION 5b82: native_read_msr+0x2 | rdmsr | resume at 0x5b84 <native_read_msr+0x4> 5b84: native_read_msr+0x4 shl $0x20,%rdx .... x86 feature flag decoding (also see the X86_FEATURE_POPCNT example in sched_balance_find_dst_group() above): 2faaf: start_thread_common.constprop.0+0x1f jne 0x2fba4 <start_thread_common.constprop.0+0x114> 2fab5: start_thread_common.constprop.0+0x25 | <alternative.2fab5> | X86_FEATURE_ALWAYS | X86_BUG_NULL_SEG 2fab5: start_thread_common.constprop.0+0x25 | jmp 0x2faba <.altinstr_aux+0x2f4> | jmp 0x4b0 <start_thread_common.constprop.0+0x3f> | nop5 2faba: start_thread_common.constprop.0+0x2a mov $0x2b,%eax ... NOP sequence shortening: 1048e2: snapshot_write_finalize+0xc2 je 0x104917 <snapshot_write_finalize+0xf7> 1048e4: snapshot_write_finalize+0xc4 nop6 1048ea: snapshot_write_finalize+0xca nop11 1048f5: snapshot_write_finalize+0xd5 nop11 104900: snapshot_write_finalize+0xe0 mov %rax,%rcx 104903: snapshot_write_finalize+0xe3 mov 0x10(%rdx),%rax ... and much more. - Function validation tracing support (Alexandre Chartre) - Various -ffunction-sections fixes (Josh Poimboeuf) - Clang AutoFDO (Automated Feedback-Directed Optimizations) support (Josh Poimboeuf) - Misc fixes and cleanups (Borislav Petkov, Chen Ni, Dylan Hatch, Ingo Molnar, John Wang, Josh Poimboeuf, Pankaj Raghav, Peter Zijlstra, Thorsten Blum) * tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits) objtool: Fix segfault on unknown alternatives objtool: Build with disassembly can fail when including bdf.h objtool: Trim trailing NOPs in alternative objtool: Add wide output for disassembly objtool: Compact output for alternatives with one instruction objtool: Improve naming of group alternatives objtool: Add Function to get the name of a CPU feature objtool: Provide access to feature and flags of group alternatives objtool: Fix address references in alternatives objtool: Disassemble jump table alternatives objtool: Disassemble exception table alternatives objtool: Print addresses with alternative instructions objtool: Disassemble group alternatives objtool: Print headers for alternatives objtool: Preserve alternatives order objtool: Add the --disas=<function-pattern> action objtool: Do not validate IBT for .return_sites and .call_sites objtool: Improve tracing of alternative instructions objtool: Add functions to better name alternatives objtool: Identify the different types of alternatives ...
2025-12-01Merge tag 'namespace-6.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains substantial namespace infrastructure changes including a new system call, active reference counting, and extensive header cleanups. The branch depends on the shared kbuild branch for -fms-extensions support. Features: - listns() system call Add a new listns() system call that allows userspace to iterate through namespaces in the system. This provides a programmatic interface to discover and inspect namespaces, addressing longstanding limitations: Currently, there is no direct way for userspace to enumerate namespaces. Applications must resort to scanning /proc/*/ns/ across all processes, which is: - Inefficient - requires iterating over all processes - Incomplete - misses namespaces not attached to any running process but kept alive by file descriptors, bind mounts, or parent references - Permission-heavy - requires access to /proc for many processes - No ordering or ownership information - No filtering per namespace type The listns() system call solves these problems: ssize_t listns(const struct ns_id_req *req, u64 *ns_ids, size_t nr_ns_ids, unsigned int flags); struct ns_id_req { __u32 size; __u32 spare; __u64 ns_id; struct /* listns */ { __u32 ns_type; __u32 spare2; __u64 user_ns_id; }; }; Features include: - Pagination support for large namespace sets - Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.) - Filtering by owning user namespace - Permission checks respecting namespace isolation - Active Reference Counting Introduce an active reference count that tracks namespace visibility to userspace. A namespace is visible in the following cases: - The namespace is in use by a task - The namespace is persisted through a VFS object (namespace file descriptor or bind-mount) - The namespace is a hierarchical type and is the parent of child namespaces The active reference count does not regulate lifetime (that's still done by the normal reference count) - it only regulates visibility to namespace file handles and listns(). This prevents resurrection of namespaces that are pinned only for internal kernel reasons (e.g., user namespaces held by file->f_cred, lazy TLB references on idle CPUs, etc.) which should not be accessible via (1)-(3). - Unified Namespace Tree Introduce a unified tree structure for all namespaces with: - Fixed IDs assigned to initial namespaces - Lookup based solely on inode number - Maintained list of owned namespaces per user namespace - Simplified rbtree comparison helpers Cleanups - Header Reorganization: - Move namespace types into separate header (ns_common_types.h) - Decouple nstree from ns_common header - Move nstree types into separate header - Switch to new ns_tree_{node,root} structures with helper functions - Use guards for ns_tree_lock - Initial Namespace Reference Count Optimization - Make all reference counts on initial namespaces a nop to avoid pointless cacheline ping-pong for namespaces that can never go away - Drop custom reference count initialization for initial namespaces - Add NS_COMMON_INIT() macro and use it for all namespaces - pid: rely on common reference count behavior - Miscellaneous Cleanups - Rename exit_task_namespaces() to exit_nsproxy_namespaces() - Rename is_initial_namespace() and make argument const - Use boolean to indicate anonymous mount namespace - Simplify owner list iteration in nstree - nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly - nsfs: use inode_just_drop() - pidfs: raise DCACHE_DONTCACHE explicitly - pidfs: simplify PIDFD_GET__NAMESPACE ioctls - libfs: allow to specify s_d_flags - cgroup: add cgroup namespace to tree after owner is set - nsproxy: fix free_nsproxy() and simplify create_new_namespaces() Fixes: - setns(pidfd, ...) race condition Fix a subtle race when using pidfds with setns(). When the target task exits after prepare_nsset() but before commit_nsset(), the namespace's active reference count might have been dropped. If setns() then installs the namespaces, it would bump the active reference count from zero without taking the required reference on the owner namespace, leading to underflow when later decremented. The fix resurrects the ownership chain if necessary - if the caller succeeded in grabbing passive references, the setns() should succeed even if the target task exits or gets reaped. - Return EFAULT on put_user() error instead of success - Make sure references are dropped outside of RCU lock (some namespaces like mount namespace sleep when putting the last reference) - Don't skip active reference count initialization for network namespace - Add asserts for active refcount underflow - Add asserts for initial namespace reference counts (both passive and active) - ipc: enable is_ns_init_id() assertions - Fix kernel-doc comments for internal nstree functions - Selftests - 15 active reference count tests - 9 listns() functionality tests - 7 listns() permission tests - 12 inactive namespace resurrection tests - 3 threaded active reference count tests - commit_creds() active reference tests - Pagination and stress tests - EFAULT handling test - nsid tests fixes" * tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits) pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls nstree: fix kernel-doc comments for internal functions nsproxy: fix free_nsproxy() and simplify create_new_namespaces() selftests/namespaces: fix nsid tests ns: drop custom reference count initialization for initial namespaces pid: rely on common reference count behavior ns: add asserts for initial namespace active reference counts ns: add asserts for initial namespace reference counts ns: make all reference counts on initial namespace a nop ipc: enable is_ns_init_id() assertions fs: use boolean to indicate anonymous mount namespace ns: rename is_initial_namespace() ns: make is_initial_namespace() argument const nstree: use guards for ns_tree_lock nstree: simplify owner list iteration nstree: switch to new structures nstree: add helper to operate on struct ns_tree_{node,root} nstree: move nstree types into separate header nstree: decouple from ns_common header ns: move namespace types into separate header ...
2025-11-25tools: ynl-gen: add regeneration commentAsbjørn Sloth Tønnesen
Add a comment on regeneration to the generated files. The comment is placed after the YNL-GEN line[1], as to not interfere with ynl-regen.sh's detection logic. [1] and after the optional YNL-ARG line. Link: https://lore.kernel.org/r/aR5m174O7pklKrMR@zx2c4.com/ Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251120174429.390574-3-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24tools headers UAPI: Sync linux/perf_event.h with the kernel sourcesJames Clark
To pickup config4 changes. Tested-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: James Clark <james.clark@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-11-20tools/nolibc: remove more __nolibc_enosys() fallbacksThomas Weißschuh
Commit e6366101ce1f ("tools/nolibc: remove __nolibc_enosys() fallback from time64-related functions") removed many of these fallbacks but forgot a few. Finish the job. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu>
2025-11-20tools/nolibc: remove now superfluous overflow check in llseekThomas Weißschuh
As off_t is now always 64-bit wide this overflow can not happen anymore, remove the check. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu> Reviewed-by: Arnd Bergmann <arnd@arndb.de>
2025-11-20tools/nolibc: use 64-bit off_tThomas Weißschuh
The kernel uses 64-bit values for file offsets. Currently these might be truncated to 32-bit when assigned to nolibc's off_t values. Switch to 64-bit off_t consistently. Suggested-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/lkml/cec27d94-c99d-4c57-9a12-275ea663dda8@app.fastmail.com/ Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu>
2025-11-20tools/nolibc: prefer the llseek syscallThomas Weißschuh
Make sure to always use the 64-bit safe system call in preparation for 64-bit off_t on 32 bit architectures. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu>
2025-11-20tools/nolibc: handle 64-bit off_t for llseekThomas Weißschuh
Correctly handle 64-bit off_t values in preparation for 64-bit off_t on 32-bit architectures. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Willy Tarreau <w@1wt.eu>
2025-11-20tools/nolibc: use 64-bit ino_tThomas Weißschuh
The kernel uses 64-bit values for inode numbers. Currently these might be truncated to 32-bit when assigned to nolibc's ino_t values. Switch to 64-bit ino_t consistently. As ino_t is never used directly in kernel ABIs, no systemcall wrappers need to be adapted. Suggested-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/lkml/cec27d94-c99d-4c57-9a12-275ea663dda8@app.fastmail.com/ Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu>
2025-11-17tools: Remove s390 compat supportHeiko Carstens
Remove s390 compat support from everything within tools, since s390 compat support will be removed from the kernel. Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Weißschuh <linux@weissschuh.net> # tools/nolibc selftests/nolibc Reviewed-by: Thomas Weißschuh <linux@weissschuh.net> # selftests/vDSO Acked-by: Alexei Starovoitov <ast@kernel.org> # bpf bits Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-15static_call: allow using STATIC_CALL_TRAMP_STR() from assemblyNaman Jain
STATIC_CALL_TRAMP_STR() could not be used from .S files because static_call_types.h was not safe to include in assembly as it pulled in C types/constructs that are unavailable under __ASSEMBLY__. Make the header assembly-friendly by adding __ASSEMBLY__ checks and providing only the minimal definitions needed for assembly, so that it can be safely included by .S code. This enables emitting the static call trampoline symbol name via STATIC_CALL_TRAMP_STR() directly in assembly sources, to be used with 'call' instruction. Also, move a certain definitions out of __ASSEMBLY__ checks in compiler_types.h to meet the dependencies. No functional change for C compilation. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Naman Jain <namjain@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+Alexei Starovoitov
Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/helpers.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.18-rc6). No conflicts, adjacent changes in: drivers/net/phy/micrel.c 96a9178a29a6 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface") 61b7ade9ba8c ("net: phy: micrel: Add support for non PTP SKUs for lan8814") and a trivial one in tools/testing/selftests/drivers/net/Makefile. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-13Merge tag 'v6.18-rc5' into objtool/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-11-09tools/nolibc: avoid using plain integer as NULL pointerThomas Weißschuh
While an integer zero is a valid NULL pointer as per the C standard, sparse will complain about it. Use explicit NULL pointers instead. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202509261452.g5peaXCc-lkp@intel.com/ Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu>
2025-11-08tools/nolibc: add support for fchdir()Thomas Weißschuh
Add support for the file descriptor based variant of chdir(). Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-11-05bpf, x86: add new map type: instructions arrayAnton Protopopov
On bpf(BPF_PROG_LOAD) syscall user-supplied BPF programs are translated by the verifier into "xlated" BPF programs. During this process the original instructions offsets might be adjusted and/or individual instructions might be replaced by new sets of instructions, or deleted. Add a new BPF map type which is aimed to keep track of how, for a given program, the original instructions were relocated during the verification. Also, besides keeping track of the original -> xlated mapping, make x86 JIT to build the xlated -> jitted mapping for every instruction listed in an instruction array. This is required for every future application of instruction arrays: static keys, indirect jumps and indirect calls. A map of the BPF_MAP_TYPE_INSN_ARRAY type must be created with a u32 keys and value of size 8. The values have different semantics for userspace and for BPF space. For userspace a value consists of two u32 values – xlated and jitted offsets. For BPF side the value is a real pointer to a jitted instruction. On map creation/initialization, before loading the program, each element of the map should be initialized to point to an instruction offset within the program. Before the program load such maps should be made frozen. After the program verification xlated and jitted offsets can be read via the bpf(2) syscall. If a tracked instruction is removed by the verifier, then the xlated offset is set to (u32)-1 which is considered to be too big for a valid BPF program offset. One such a map can, obviously, be used to track one and only one BPF program. If the verification process was unsuccessful, then the same map can be re-used to verify the program with a different log level. However, if the program was loaded fine, then such a map, being frozen in any case, can't be reused by other programs even after the program release. Example. Consider the following original and xlated programs: Original prog: Xlated prog: 0: r1 = 0x0 0: r1 = 0 1: *(u32 *)(r10 - 0x4) = r1 1: *(u32 *)(r10 -4) = r1 2: r2 = r10 2: r2 = r10 3: r2 += -0x4 3: r2 += -4 4: r1 = 0x0 ll 4: r1 = map[id:88] 6: call 0x1 6: r1 += 272 7: r0 = *(u32 *)(r2 +0) 8: if r0 >= 0x1 goto pc+3 9: r0 <<= 3 10: r0 += r1 11: goto pc+1 12: r0 = 0 7: r6 = r0 13: r6 = r0 8: if r6 == 0x0 goto +0x2 14: if r6 == 0x0 goto pc+4 9: call 0x76 15: r0 = 0xffffffff8d2079c0 17: r0 = *(u64 *)(r0 +0) 10: *(u64 *)(r6 + 0x0) = r0 18: *(u64 *)(r6 +0) = r0 11: r0 = 0x0 19: r0 = 0x0 12: exit 20: exit An instruction array map, containing, e.g., instructions [0,4,7,12] will be translated by the verifier to [0,4,13,20]. A map with index 5 (the middle of 16-byte instruction) or indexes greater than 12 (outside the program boundaries) would be rejected. The functionality provided by this patch will be extended in consequent patches to implement BPF Static Keys, indirect jumps, and indirect calls. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251105090410.1250500-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-03net: Extend NAPI threaded polling to allow kthread based busy pollingSamiullah Khawaja
Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to enable and disable threaded busy polling. When threaded busy polling is enabled for a NAPI, enable NAPI_STATE_THREADED also. When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to signal napi_complete_done not to rearm interrupts. Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, the NAPI_STATE_IN_BUSY_POLL will be unset, napi_complete_done unsets the NAPI_STATE_SCHED_THREADED bit also, which in turn will make the kthread go to sleep. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Martin Karsten <mkarsten@uwaterloo.ca> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03nsfs: update tools headerChristian Brauner
Ensure all the new uapi bits are visible for the selftests. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-21-2e6f823ebdc0@kernel.org Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03tools headers asm: Sync fls headers header with the kernel sourcesArnaldo Carvalho de Melo
To pick the changes in: 6606c8c7e8188656 ("bitops: Add __attribute_const__ to generic ffs()-family implementations") This addresses these tools build warnings: Warning: Kernel ABI header differences: diff -u tools/include/asm-generic/bitops/__fls.h include/asm-generic/bitops/__fls.h diff -u tools/include/asm-generic/bitops/fls.h include/asm-generic/bitops/fls.h diff -u tools/include/asm-generic/bitops/fls64.h include/asm-generic/bitops/fls64.h Please see tools/include/uapi/README for further details. Cc: Kees Cook <kees@kernel.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-11-03tools headers UAPI: Sync linux/kvm.h with the kernel sourcesArnaldo Carvalho de Melo
To pick the changes in: fe2bf6234e947bf5 ("KVM: guest_memfd: Add INIT_SHARED flag, reject user page faults if not set") d2042d8f96ddefde ("KVM: Rework KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS") 3d3a04fad25a6621 ("KVM: Allow and advertise support for host mmap() on guest_memfd files") That just rebuilds perf, as these patches don't add any new KVM ioctl to be harvested for the 'perf trace' ioctl syscall argument beautifiers. This addresses this perf build warning: Warning: Kernel ABI header differences: diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h Please see tools/include/uapi/README for further details. Cc: Sean Christopherson <seanjc@google.com> Cc: Fuad Tabba <tabba@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-11-03tools headers UAPI: Update tools's copy of drm.h to pick ↵Arnaldo Carvalho de Melo
DRM_IOCTL_GEM_CHANGE_HANDLE Picking the changes from: 0864197382fa7c8c ("drm: Move drm_gem ioctl kerneldoc to uapi file") 53096728b8910c69 ("drm: Add DRM prime interface to reassign GEM handle") Addressing these perf build warnings: Warning: Kernel ABI header differences: Now 'perf trace' and other code that might use the tools/perf/trace/beauty autogenerated tables will be able to translate this new ioctl command into a string: $ tools/perf/trace/beauty/drm_ioctl.sh > before $ cp include/uapi/drm/drm.h tools/include/uapi/drm/drm.h $ tools/perf/trace/beauty/drm_ioctl.sh > after $ diff -u before after --- before 2025-11-03 09:57:34.832553174 -0300 +++ after 2025-11-03 09:57:47.969409428 -0300 @@ -111,6 +111,7 @@ [0xCF] = "SYNCOBJ_EVENTFD", [0xD0] = "MODE_CLOSEFB", [0xD1] = "SET_CLIENT_NAME", + [0xD2] = "GEM_CHANGE_HANDLE", [DRM_COMMAND_BASE + 0x00] = "I915_INIT", [DRM_COMMAND_BASE + 0x01] = "I915_FLUSH", [DRM_COMMAND_BASE + 0x02] = "I915_FLIP", $ Please see tools/include/uapi/README for further details. Cc: Christian König <christian.koenig@amd.com> Cc: David Francis <David.Francis@amd.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-11-02tools/nolibc: clean up outdated comments in generic arch.hWilly Tarreau
Along the code reorganizations, the file has been keeping the original comments about argv and envp which are no longer relevant to this file anymore. Let's just drop them. Acked-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Willy Tarreau <w@1wt.eu>
2025-11-02tools/nolibc: make the "headers" target install all supported archsWilly Tarreau
The efforts we go through by installing a single arch are counter productive when the base directory already supports them all, and the arch-specific files are really small. Let's make the "headers" target simply install headers for all supported archs and stop trying to build a hybrid "arch.h" file on the fly, to instead keep the generic one. Now the same nolibc headers installation will be usable with any arch-specific uapi installation. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-11-02tools/nolibc: add the more portable inttypes.hWilly Tarreau
It's often recommended to only use inttypes.h instead of stdint.h for portability reasons since the former is always present when the latter is present, but not conversely, and the former includes the latter. Due to this some simple programs fail to build when including inttypes.h. Let's add one that simply includes nolibc.h to better support these programs. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-11-02tools/nolibc: provide the portable sys/select.hWilly Tarreau
Modern programs tend to include sys/select.h to get FD_SET() and FD_CLR() definitions as well as struct fd_set, but in our case it didn't exist. The definitions were moved from types.h to sys/select.h, which is now included from nolibc.h, and the sys_select() definition moved there as well from sys.h. Signed-off-by: Willy Tarreau <w@1wt.eu> [Thomas: adapt to current -next branch] Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-11-02tools/nolibc: add missing memchr() to string.hWilly Tarreau
Surprisingly we forgot to add this common one. It was added with a per-arch guard allowing to later implement it in arch-specific asm code like was done for a few other ones. The test verifies that we don't search past the indicated length. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-11-02tools/nolibc: fix misleading help message regarding installation pathWilly Tarreau
The help message says the headers are going to be installed into tools/include/nolibc but this is only the default if $OUTPUT is not set, so better clarify this (the current value of $OUTPUT is already shown). Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-10-29tools/nolibc: add uio.h with readv and writevBenjamin Berg
This is generally useful and struct iovec is also needed for other purposes such as ptrace. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-10-29tools/nolibc: add option to disable runtimeBenjamin Berg
In principle, it is possible to use nolibc for only some object files in a program. In that case, the startup code in _start and _start_c is not going to be used. Add the NOLIBC_NO_RUNTIME compile time option to disable it entirely and also remove anything that depends on it. Doing this avoids warnings from modpost for UML as the _start_c code references the main function from the .init.text section while it is not inside .init itself. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-10-29tools/nolibc: use __fallthrough__ rather than fallthroughBenjamin Berg
Use the version of the attribute with underscores to avoid issues if fallthrough has been defined by another header file already. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-10-29tools/nolibc: implement %m if errno is not definedBenjamin Berg
For improved compatibility, print %m as "unknown error" when nolibc is compiled using NOLIBC_IGNORE_ERRNO. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-10-29tools/nolibc/dirent: avoid errno in readdir_rBenjamin Berg
Using errno is not possible when NOLIBC_IGNORE_ERRNO is set. Use sys_lseek instead of lseek as that avoids using errno. Fixes: 665fa8dea90d ("tools/nolibc: add support for directory access") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-10-29tools/nolibc/stdio: let perror work when NOLIBC_IGNORE_ERRNO is setBenjamin Berg
There is no errno variable when NOLIBC_IGNORE_ERRNO is defined. As such, simply print the message with "unknown error" rather than the integer value of errno. Fixes: acab7bcdb1bc ("tools/nolibc/stdio: add perror() to report the errno value") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-10-29tools/nolibc: remove outdated comment about __sysret() in mmap()Thomas Weißschuh
Since commit fb01ff635efd ("tools/nolibc: keep brk(), sbrk(), mmap() away from __sysret()") the implementation of mmap() does not use the __sysret() macro anymore. Remove the outdated comment. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-10-29perf: Support deferred user unwindPeter Zijlstra
Add support for deferred userspace unwind to perf. Where perf currently relies on in-place stack unwinding; from NMI context and all that. This moves the userspace part of the unwind to right before the return-to-userspace. This has two distinct benefits, the biggest is that it moves the unwind to a faultable context. It becomes possible to fault in debug info (.eh_frame, SFrame etc.) that might not otherwise be readily available. And secondly, it de-duplicates the user callchain where multiple samples happen during the same kernel entry. To facilitate this the perf interface is extended with a new record type: PERF_RECORD_CALLCHAIN_DEFERRED and two new attribute flags: perf_event_attr::defer_callchain - to request the user unwind be deferred perf_event_attr::defer_output - to request PERF_RECORD_CALLCHAIN_DEFERRED records The existing PERF_RECORD_SAMPLE callchain section gets a new context type: PERF_CONTEXT_USER_DEFERRED After which will come a single entry, denoting the 'cookie' of the deferred callchain that should be attached here, matching the 'cookie' field of the above mentioned PERF_RECORD_CALLCHAIN_DEFERRED. The 'defer_callchain' flag is expected on all events with PERF_SAMPLE_CALLCHAIN. The 'defer_output' flag is expect on the event responsible for collecting side-band events (like mmap, comm etc.). Setting 'defer_output' on multiple events will get you duplicated PERF_RECORD_CALLCHAIN_DEFERRED records. Based on earlier patches by Josh and Steven. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251023150002.GR4067720@noisy.programming.kicks-ass.net
2025-10-27bpf: Add overwrite mode for BPF ring bufferXu Kuohai
When the BPF ring buffer is full, a new event cannot be recorded until one or more old events are consumed to make enough space for it. In cases such as fault diagnostics, where recent events are more useful than older ones, this mechanism may lead to critical events being lost. So add overwrite mode for BPF ring buffer to address it. In this mode, the new event overwrites the oldest event when the buffer is full. The basic idea is as follows: 1. producer_pos tracks the next position to record new event. When there is enough free space, producer_pos is simply advanced by producer to make space for the new event. 2. To avoid waiting for consumer when the buffer is full, a new variable, overwrite_pos, is introduced for producer. It points to the oldest event committed in the buffer. It is advanced by producer to discard one or more oldest events to make space for the new event when the buffer is full. 3. pending_pos tracks the oldest event to be committed. pending_pos is never passed by producer_pos, so multiple producers never write to the same position at the same time. The following example diagrams show how it works in a 4096-byte ring buffer. 1. At first, {producer,overwrite,pending,consumer}_pos are all set to 0. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | +-----------------------------------------------------------------------+ ^ | | producer_pos = 0 overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 2. Now reserve a 512-byte event A. There is enough free space, so A is allocated at offset 0. And producer_pos is advanced to 512, the end of A. Since A is not submitted, the BUSY bit is set. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | A | | | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 512 | overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 3. Reserve event B, size 1024. B is allocated at offset 512 with BUSY bit set, and producer_pos is advanced to the end of B. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | A | B | | | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 1536 | overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 4. Reserve event C, size 2048. C is allocated at offset 1536, and producer_pos is advanced to 3584. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | [BUSY] | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ | | | | | producer_pos = 3584 | overwrite_pos = 0 pending_pos = 0 consumer_pos = 0 5. Submit event A. The BUSY bit of A is cleared. B becomes the oldest event to be committed, so pending_pos is advanced to 512, the start of B. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | | [BUSY] | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | | | pending_pos = 512 producer_pos = 3584 | overwrite_pos = 0 consumer_pos = 0 6. Submit event B. The BUSY bit of B is cleared, and pending_pos is advanced to the start of C, which is now the oldest event to be committed. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | A | B | C | | | | | [BUSY] | | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | | | pending_pos = 1536 producer_pos = 3584 | overwrite_pos = 0 consumer_pos = 0 7. Reserve event D, size 1536 (3 * 512). There are 2048 bytes not being written between producer_pos (currently 3584) and pending_pos, so D is allocated at offset 3584, and producer_pos is advanced by 1536 (from 3584 to 5120). Since event D will overwrite all bytes of event A and the first 512 bytes of event B, overwrite_pos is advanced to the start of event C, the oldest event that is not overwritten. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | D End | | C | D Begin| | [BUSY] | | [BUSY] | [BUSY] | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | pending_pos = 1536 | | overwrite_pos = 1536 | | | producer_pos=5120 | consumer_pos = 0 8. Reserve event E, size 1024. Although there are 512 bytes not being written between producer_pos and pending_pos, E cannot be reserved, as it would overwrite the first 512 bytes of event C, which is still being written. 9. Submit event C and D. pending_pos is advanced to the end of D. 0 512 1024 1536 2048 2560 3072 3584 4096 +-----------------------------------------------------------------------+ | | | | | | D End | | C | D Begin| | | | | | +-----------------------------------------------------------------------+ ^ ^ ^ | | | | | overwrite_pos = 1536 | | | producer_pos=5120 | pending_pos=5120 | consumer_pos = 0 The performance data for overwrite mode will be provided in a follow-up patch that adds overwrite-mode benchmarks. A sample of performance data for non-overwrite mode, collected on an x86_64 CPU and an arm64 CPU, before and after this patch, is shown below. As we can see, no obvious performance regression occurs. - x86_64 (AMD EPYC 9654) Before: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.623 ± 0.027M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.812 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 7.871 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.703 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.896 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.054 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.864 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.580 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.484 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.369 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.316 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.272 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.239 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.226 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.213 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.193 ± 0.001M/s (drops 0.000 ± 0.000M/s) After: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.845 ± 0.036M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 15.889 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 8.155 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 6.708 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 2.918 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 2.065 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 1.870 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 1.582 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 1.482 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 1.372 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 1.323 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 1.264 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 1.236 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 1.209 ± 0.002M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 1.189 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 1.165 ± 0.002M/s (drops 0.000 ± 0.000M/s) - arm64 (HiSilicon Kunpeng 920) Before: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.310 ± 0.623M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.947 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.634 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 4.502 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.888 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.372 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.189 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.998 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 3.086 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.845 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.815 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.771 ± 0.009M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.814 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.752 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.695 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.710 ± 0.006M/s (drops 0.000 ± 0.000M/s) After: Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 11.283 ± 0.550M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 9.993 ± 0.003M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 6.898 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 5.257 ± 0.001M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 3.830 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 3.528 ± 0.013M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 3.265 ± 0.018M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 2.990 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 2.929 ± 0.014M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 2.898 ± 0.010M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 32 2.818 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 36 2.789 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 40 2.770 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 44 2.651 ± 0.007M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 48 2.669 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 52 2.695 ± 0.009M/s (drops 0.000 ± 0.000M/s) Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251018035738.4039621-2-xukuohai@huaweicloud.com
2025-10-27bpf: widen dynptr size/offset to 64 bitMykyta Yatsenko
Dynptr currently caps size and offset at 24 bits, which isn’t sufficient for file-backed use cases; even 32 bits can be limiting. Refactor dynptr helpers/kfuncs to use 64-bit size and offset, ensuring consistency across the APIs. This change does not affect internals of xdp, skb or other dynptrs, which continue to behave as before. Also it does not break binary compatibility. The widening enables large-file access support via dynptr, implemented in the next patches. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-27tools/nolibc: x86: fix section mismatch caused by asm "mem*" functionsWilly Tarreau
I recently got occasional build failures at -Os or -Oz that would always involve waitpid(), where the assembler would complain about this: init.s: Error: .size expression for waitpid.constprop.0 does not evaluate to a constant And without -fno-asynchronous-unwind-tables it could also spit such errors: init.s:836: Error: CFI instruction used without previous .cfi_startproc init.s:838: Error: .cfi_endproc without corresponding .cfi_startproc init.s: Error: open CFI at the end of file; missing .cfi_endproc directive A trimmed down reproducer is as simple as this: int main(int argc, char **argv) { int ret, status; if (argc == 0) ret = waitpid(-1, &status, 0); else ret = waitpid(-1, &status, 0); return status; } It produces the following asm code on x86_64: .text .section .text.nolibc_memmove_memcpy .weak memmove .weak memcpy memmove: memcpy: movq %rdx, %rcx (...) retq .section .text.nolibc_memset .weak memset memset: xchgl %eax, %esi movq %rdx, %rcx pushq %rdi rep stosb popq %rax retq .type waitpid.constprop.0.isra.0, @function waitpid.constprop.0.isra.0: subq $8, %rsp (...) jmp *.L5(,%rax,8) .section .rodata .align 8 .align 4 .L5: .quad .L10 (...) .quad .L4 .text .L10: (...) .cfi_def_cfa_offset 8 ret .cfi_endproc .LFE273: .size waitpid.constprop.0.isra.0, .-waitpid.constprop.0.isra.0 It's a bit dense, but here's the explanation: the compiler has emitted a ".text" statement because it knows it's working in the .text section. Then, our hand-written asm code for the mem* functions forced the section to .text.something without the compiler knowing about it, so it thinks the code is still being emitted for .text. As such, without any .section statement, the waitpid.constprop.0.isra.0 label is in fact placed in the previously created section, here .text.nolibc_memset. The waitpid() function involves a switch/case statement that can be turned to a jump table, which is what the compiler does with the .rodata section, and after that it restores .text, which is no longer the previous .text.nolibc_memset section. Then the CFI statements cross a section, so does the .size calculation, which explains the error. While a first approach consisting in placing an explicit ".text" at the end of these functions was verified to work, it's still unreliable as it depends on what the compiler remembers having emitted previously. A better approach is to replace the ".section" with ".pushsection", and place a ".popsection" at the end, so that these code blocks are agnostic to where they're placed relative to other blocks. Fixes: 553845eebd60 ("tools/nolibc: x86-64: Use `rep movsb` for `memcpy()` and `memmove()`") Fixes: 12108aa8c1a1 ("tools/nolibc: x86-64: Use `rep stosb` for `memset()`") Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-10-16bpf: Introduce SK_BPF_BYPASS_PROT_MEM.Kuniyuki Iwashima
If a socket has sk->sk_bypass_prot_mem flagged, the socket opts out of the global protocol memory accounting. This is easily controlled by net.core.bypass_prot_mem sysctl, but it lacks flexibility. Let's support flagging (and clearing) sk->sk_bypass_prot_mem via bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook. int val = 1; bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_BYPASS_PROT_MEM, &val, sizeof(val)); As with net.core.bypass_prot_mem, this is inherited to child sockets, and BPF always takes precedence over sysctl at socket(2) and accept(2). SK_BPF_BYPASS_PROT_MEM is only supported at BPF_CGROUP_INET_SOCK_CREATE and not supported on other hooks for some reasons: 1. UDP charges memory under sk->sk_receive_queue.lock instead of lock_sock() 2. Modifying the flag after skb is charged to sk requires such adjustment during bpf_setsockopt() and complicates the logic unnecessarily We can support other hooks later if a real use case justifies that. Most changes are inline and hard to trace, but a microbenchmark on __sk_mem_raise_allocated() during neper/tcp_stream showed that more samples completed faster with sk->sk_bypass_prot_mem == 1. This will be more visible under tcp_mem pressure (but it's not a fair comparison). # bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; } kretprobe:__sk_mem_raise_allocated /@start[tid]/ { @end[tid] = nsecs - @start[tid]; @times = hist(@end[tid]); delete(@start[tid]); }' # tcp_stream -6 -F 1000 -N -T 256 Without bpf prog: [128, 256) 3846 | | [256, 512) 1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [512, 1K) 1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [1K, 2K) 198207 |@@@@@@ | [2K, 4K) 31199 |@ | With bpf prog in the next patch: (must be attached before tcp_stream) # bpftool prog load sk_bypass_prot_mem.bpf.o /sys/fs/bpf/test type cgroup/sock_create # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/test [128, 256) 6413 | | [256, 512) 1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [512, 1K) 1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [1K, 2K) 117031 |@@@@ | [2K, 4K) 11773 | | Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://patch.msgid.link/20251014235604.3057003-6-kuniyu@google.com
2025-10-14objtool/klp: Introduce klp diff subcommand for diffing object filesJosh Poimboeuf
Add a new klp diff subcommand which performs a binary diff between two object files and extracts changed functions into a new object which can then be linked into a livepatch module. This builds on concepts from the longstanding out-of-tree kpatch [1] project which began in 2012 and has been used for many years to generate livepatch modules for production kernels. However, this is a complete rewrite which incorporates hard-earned lessons from 12+ years of maintaining kpatch. Key improvements compared to kpatch-build: - Integrated with objtool: Leverages objtool's existing control-flow graph analysis to help detect changed functions. - Works on vmlinux.o: Supports late-linked objects, making it compatible with LTO, IBT, and similar. - Simplified code base: ~3k fewer lines of code. - Upstream: No more out-of-tree #ifdef hacks, far less cruft. - Cleaner internals: Vastly simplified logic for symbol/section/reloc inclusion and special section extraction. - Robust __LINE__ macro handling: Avoids false positive binary diffs caused by the __LINE__ macro by introducing a fix-patch-lines script (coming in a later patch) which injects #line directives into the source .patch to preserve the original line numbers at compile time. Note the end result of this subcommand is not yet functionally complete. Livepatch needs some ELF magic which linkers don't like: - Two relocation sections (.rela*, .klp.rela*) for the same text section. - Use of SHN_LIVEPATCH to mark livepatch symbols. Unfortunately linkers tend to mangle such things. To work around that, klp diff generates a linker-compliant intermediate binary which encodes the relevant KLP section/reloc/symbol metadata. After module linking, a klp post-link step (coming soon) will clean up the mess and convert the linked .ko into a fully compliant livepatch module. Note this subcommand requires the diffed binaries to have been compiled with -ffunction-sections and -fdata-sections, and processed with 'objtool --checksum'. Those constraints will be handled by a klp-build script introduced in a later patch. Without '-ffunction-sections -fdata-sections', reliable object diffing would be infeasible due to toolchain limitations: - For intra-file+intra-section references, the compiler might occasionally generated hard-coded instruction offsets instead of relocations. - Section-symbol-based references can be ambiguous: - Overlapping or zero-length symbols create ambiguity as to which symbol is being referenced. - A reference to the end of a symbol (e.g., checking array bounds) can be misinterpreted as a reference to the next symbol, or vice versa. A potential future alternative to '-ffunction-sections -fdata-sections' would be to introduce a toolchain option that forces symbol-based (non-section) relocations. Acked-by: Petr Mladek <pmladek@suse.com> Tested-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14objtool: Add ANNOTATE_DATA_SPECIALJosh Poimboeuf
In preparation for the objtool klp diff subcommand, add an ANNOTATE_DATA_SPECIAL macro which annotates special section entries so that objtool can determine their size and location and extract them when needed. Acked-by: Petr Mladek <pmladek@suse.com> Tested-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14interval_tree: Fix ITSTATIC usage for *_subtree_search()Josh Poimboeuf
For consistency with the other function templates, change _subtree_search_*() to use the user-supplied ITSTATIC rather than the hard-coded 'static'. Acked-by: Petr Mladek <pmladek@suse.com> Tested-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14interval_tree: Sync interval_tree_generic.h with toolsJosh Poimboeuf
The following commit made an improvement to interval_tree_generic.h, but didn't sync it to the tools copy: 19811285784f ("lib/interval_tree: skip the check before go to the right subtree") Sync it, and add it to objtool's sync-check.sh so they are more likely to stay in sync going forward. Acked-by: Petr Mladek <pmladek@suse.com> Tested-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-12tools/nolibc: handle NULL wstatus argument to waitpid()Thomas Weißschuh
wstatus is allowed to be NULL. Avoid a segmentation fault in this case. Fixes: 0c89abf5ab3f ("tools/nolibc: implement waitpid() in terms of waitid()") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>