summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
33 hoursConvert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
46 hoursConvert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 daysConvert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 daysConvert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 daystreewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
4 daysMerge tag 'net-7.0-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Netfilter. Current release - new code bugs: - net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT - eth: mlx5e: XSK, Fix unintended ICOSQ change - phy_port: correctly recompute the port's linkmodes - vsock: prevent child netns mode switch from local to global - couple of kconfig fixes for new symbols Previous releases - regressions: - nfc: nci: fix false-positive parameter validation for packet data - net: do not delay zero-copy skbs in skb_attempt_defer_free() Previous releases - always broken: - mctp: ensure our nlmsg responses to user space are zero-initialised - ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data() - fixes for ICMP rate limiting Misc: - intel: fix PCI device ID conflict between i40e and ipw2200" * tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits) net: nfc: nci: Fix parameter validation for packet data net/mlx5e: Use unsigned for mlx5e_get_max_num_channels net/mlx5e: Fix deadlocks between devlink and netdev instance locks net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event net/mlx5: Fix misidentification of write combining CQE during poll loop net/mlx5e: Fix misidentification of ASO CQE during poll loop net/mlx5: Fix multiport device check over light SFs bonding: alb: fix UAF in rlb_arp_recv during bond up/down bnge: fix reserving resources from FW eth: fbnic: Advertise supported XDP features. rds: tcp: fix uninit-value in __inet_bind net/rds: Fix NULL pointer dereference in rds_tcp_accept_one octeontx2-af: Fix default entries mcam entry action net/mlx5e: XSK, Fix unintended ICOSQ change ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow() inet: move icmp_global_{credit,stamp} to a separate cache line icmp: prevent possible overflow in icmp_global_allow() selftests/net: packetdrill: add ipv4-mapped-ipv6 tests ...
5 daysMerge tag 'nf-26-02-17' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter: updates for net The following patchset contains Netfilter fixes for *net*: 1) Add missing __rcu annotations to NAT helper hook pointers in Amanda, FTP, IRC, SNMP and TFTP helpers. From Sun Jian. 2-4): - Add global spinlock to serialize nft_counter fetch+reset operations. - Use atomic64_xchg() for nft_quota reset instead of read+subtract pattern. Note AI review detects a race in this change but it isn't new. The 'racing' bit only exists to prevent constant stream of 'quota expired' notifications. - Revert commit_mutex usage in nf_tables reset path, it caused circular lock dependency. All from Brian Witte. 5) Fix uninitialized l3num value in nf_conntrack_h323 helper. 6) Fix musl libc compatibility in netfilter_bridge.h UAPI header. This change isn't nice (UAPI headers should not include libc headers), but as-is musl builds may fail due to redefinition of struct ethhdr. 7) Fix protocol checksum validation in IPVS for IPv6 with extension headers, from Julian Anastasov. 8) Fix device reference leak in IPVS when netdev goes down. Also from Julian. 9) Remove WARN_ON_ONCE when accessing forward path array, this can trigger with sufficiently long forward paths. From Pablo Neira Ayuso. 10) Fix use-after-free in nf_tables_addchain() error path, from Inseo An. * tag 'nf-26-02-17' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: fix use-after-free in nf_tables_addchain() net: remove WARN_ON_ONCE when accessing forward path array ipvs: do not keep dest_dst if dev is going down ipvs: skip ipv6 extension headers for csum checks include: uapi: netfilter_bridge.h: Cover for musl libc netfilter: nf_conntrack_h323: don't pass uninitialised l3num value netfilter: nf_tables: revert commit_mutex usage in reset path netfilter: nft_quota: use atomic64_xchg for reset netfilter: nft_counter: serialize reset with spinlock netfilter: annotate NAT helper hook pointers with __rcu ==================== Link: https://patch.msgid.link/20260217163233.31455-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet: do not delay zero-copy skbs in skb_attempt_defer_free()Eric Dumazet
After the blamed commit, TCP tx zero copy notifications could be arbitrarily delayed and cause regressions in applications waiting for them. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs") Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260216193653.627617-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet: remove WARN_ON_ONCE when accessing forward path arrayPablo Neira Ayuso
Although unlikely, recent support for IPIP tunnels increases chances of reaching this WARN_ON_ONCE if userspace manages to build a sufficiently long forward path. Remove it. Fixes: ddb94eafab8b ("net: resolve forwarding path from virtual netdevice and HW destination address") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
7 daysnet: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RTEric Dumazet
CONFIG_PREEMPT_RT is special, make this clear in backlog_lock_irq_save() and backlog_unlock_irq_restore(). The issue shows up with CONFIG_DEBUG_IRQFLAGS=y raw_local_irq_restore() called with IRQs enabled WARNING: kernel/locking/irqflag-debug.c:10 at warn_bogus_irq_restore+0xc/0x20 kernel/locking/irqflag-debug.c:10, CPU#1: aoe_tx0/1321 Modules linked in: CPU: 1 UID: 0 PID: 1321 Comm: aoe_tx0 Not tainted syzkaller #0 PREEMPT_{RT,(full)} Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026 RIP: 0010:warn_bogus_irq_restore+0xc/0x20 kernel/locking/irqflag-debug.c:10 Call Trace: <TASK> backlog_unlock_irq_restore net/core/dev.c:253 [inline] enqueue_to_backlog+0x525/0xcf0 net/core/dev.c:5347 netif_rx_internal+0x120/0x550 net/core/dev.c:5659 __netif_rx+0xa9/0x110 net/core/dev.c:5679 loopback_xmit+0x43a/0x660 drivers/net/loopback.c:90 __netdev_start_xmit include/linux/netdevice.h:5275 [inline] netdev_start_xmit include/linux/netdevice.h:5284 [inline] xmit_one net/core/dev.c:3864 [inline] dev_hard_start_xmit+0x2df/0x830 net/core/dev.c:3880 __dev_queue_xmit+0x16f4/0x3990 net/core/dev.c:4829 dev_queue_xmit include/linux/netdevice.h:3384 [inline] Fixes: 27a01c1969a5 ("net: fully inline backlog_unlock_irq_restore()") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260213120427.2914544-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
11 daysMerge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
12 daysMerge tag 'net-next-7.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core & protocols: - A significant effort all around the stack to guide the compiler to make the right choice when inlining code, to avoid unneeded calls for small helper and stack canary overhead in the fast-path. This generates better and faster code with very small or no text size increases, as in many cases the call generated more code than the actual inlined helper. - Extend AccECN implementation so that is now functionally complete, also allow the user-space enabling it on a per network namespace basis. - Add support for memory providers with large (above 4K) rx buffer. Paired with hw-gro, larger rx buffer sizes reduce the number of buffers traversing the stack, dincreasing single stream CPU usage by up to ~30%. - Do not add HBH header to Big TCP GSO packets. This simplifies the RX path, the TX path and the NIC drivers, and is possible because user-space taps can now interpret correctly such packets without the HBH hint. - Allow IPv6 routes to be configured with a gateway address that is resolved out of a different interface than the one specified, aligning IPv6 to IPv4 behavior. - Multi-queue aware sch_cake. This makes it possible to scale the rate shaper of sch_cake across multiple CPUs, while still enforcing a single global rate on the interface. - Add support for the nbcon (new buffer console) infrastructure to netconsole, enabling lock-free, priority-based console operations that are safer in crash scenarios. - Improve the TCP ipv6 output path to cache the flow information, saving cpu cycles, reducing cache line misses and stack use. - Improve netfilter packet tracker to resolve clashes for most protocols, avoiding unneeded drops on rare occasions. - Add IP6IP6 tunneling acceleration to the flowtable infrastructure. - Reduce tcp socket size by one cache line. - Notify neighbour changes atomically, avoiding inconsistencies between the notification sequence and the actual states sequence. - Add vsock namespace support, allowing complete isolation of vsocks across different network namespaces. - Improve xsk generic performances with cache-alignment-oriented optimizations. - Support netconsole automatic target recovery, allowing netconsole to reestablish targets when underlying low-level interface comes back online. Driver API: - Support for switching the working mode (automatic vs manual) of a DPLL device via netlink. - Introduce PHY ports representation to expose multiple front-facing media ports over a single MAC. - Introduce "rx-polarity" and "tx-polarity" device tree properties, to generalize polarity inversion requirements for differential signaling. - Add helper to create, prepare and enable managed clocks. Device drivers: - Add Huawei hinic3 PF etherner driver. - Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet controller. - Add ethernet driver for MaxLinear MxL862xx switches - Remove parallel-port Ethernet driver. - Convert existing driver timestamp configuration reporting to hwtstamp_get and remove legacy ioctl(). - Convert existing drivers to .get_rx_ring_count(), simplifing the RX ring count retrieval. Also remove the legacy fallback path. - Ethernet high-speed NICs: - Broadcom (bnxt, bng): - bnxt: add FW interface update to support FEC stats histogram and NVRAM defragmentation - bng: add TSO and H/W GRO support - nVidia/Mellanox (mlx5): - improve latency of channel restart operations, reducing the used H/W resources - add TSO support for UDP over GRE over VLAN - add flow counters support for hardware steering (HWS) rules - use a static memory area to store headers for H/W GRO, leading to 12% RX tput improvement - Intel (100G, ice, idpf): - ice: reorganizes layout of Tx and Rx rings for cacheline locality and utilizes __cacheline_group* macros on the new layouts - ice: introduces Synchronous Ethernet (SyncE) support - Meta (fbnic): - adds debugfs for firmware mailbox and tx/rx rings vectors - Ethernet virtual: - geneve: introduce GRO/GSO support for double UDP encapsulation - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - some code refactoring and cleanups - RealTek (r8169): - add support for RTL8127ATF (10G Fiber SFP) - add dash and LTR support - Airoha: - AN8811HB 2.5 Gbps phy support - Freescale (fec): - add XDP zero-copy support - Thunderbolt: - add get link setting support to allow bonding - Renesas: - add support for RZ/G3L GBETH SoC - Ethernet switches: - Maxlinear: - support R(G)MII slow rate configuration - add support for Intel GSW150 - Motorcomm (yt921x): - add DCB/QoS support - TI: - icssm-prueth: support bridging (STP/RSTP) via the switchdev framework - Ethernet PHYs: - Realtek: - enable SGMII and 2500Base-X in-band auto-negotiation - simplify and reunify C22/C45 drivers - Micrel: convert bindings to DT schema - CAN: - move skb headroom content into skb extensions, making CAN metadata access more robust - CAN drivers: - rcar_canfd: - add support for FD-only mode - add support for the RZ/T2H SoC - sja1000: cleanup the CAN state handling - WiFi: - implement EPPKE/802.1X over auth frames support - split up drop reasons better, removing generic RX_DROP - additional FTM capabilities: 6 GHz support, supported number of spatial streams and supported number of LTF repetitions - better mac80211 iterators to enumerate resources - initial UHR (Wi-Fi 8) support for cfg80211/mac80211 - WiFi drivers: - Qualcomm/Atheros: - ath11k: support for Channel Frequency Response measurement - ath12k: a significant driver refactor to support multi-wiphy devices and and pave the way for future device support in the same driver (rather than splitting to ath13k) - ath12k: support for the QCC2072 chipset - Intel: - iwlwifi: partial Neighbor Awareness Networking (NAN) support - iwlwifi: initial support for U-NII-9 and IEEE 802.11bn - RealTek (rtw89): - preparations for RTL8922DE support - Bluetooth: - implement setsockopt(BT_PHY) to set the connection packet type/PHY - set link_policy on incoming ACL connections - Bluetooth drivers: - btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE - btqca: add WCN6855 firmware priority selection feature" * tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits) bnge/bng_re: Add a new HSI net: macb: Fix tx/rx malfunction after phy link down and up af_unix: Fix memleak of newsk in unix_stream_connect(). net: ti: icssg-prueth: Add optional dependency on HSR net: dsa: add basic initial driver for MxL862xx switches net: mdio: add unlocked mdiodev C45 bus accessors net: dsa: add tag format for MxL862xx switches dt-bindings: net: dsa: add MaxLinear MxL862xx selftests: drivers: net: hw: Modify toeplitz.c to poll for packets octeontx2-pf: Unregister devlink on probe failure net: renesas: rswitch: fix forwarding offload statemachine ionic: Rate limit unknown xcvr type messages tcp: inet6_csk_xmit() optimization tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock() tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect() ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6 ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update() ipv6: use np->final in inet6_sk_rebuild_header() ipv6: add daddr/final storage in struct ipv6_pinfo net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup() ...
13 daysMerge tag 'bpf-next-7.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Support associating BPF program with struct_ops (Amery Hung) - Switch BPF local storage to rqspinlock and remove recursion detection counters which were causing false positives (Amery Hung) - Fix live registers marking for indirect jumps (Anton Protopopov) - Introduce execution context detection BPF helpers (Changwoo Min) - Improve verifier precision for 32bit sign extension pattern (Cupertino Miranda) - Optimize BTF type lookup by sorting vmlinux BTF and doing binary search (Donglin Peng) - Allow states pruning for misc/invalid slots in iterator loops (Eduard Zingerman) - In preparation for ASAN support in BPF arenas teach libbpf to move global BPF variables to the end of the region and enable arena kfuncs while holding locks (Emil Tsalapatis) - Introduce support for implicit arguments in kfuncs and migrate a number of them to new API. This is a prerequisite for cgroup sub-schedulers in sched-ext (Ihor Solodrai) - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen) - Fix ORC stack unwind from kprobe_multi (Jiri Olsa) - Speed up fentry attach by using single ftrace direct ops in BPF trampolines (Jiri Olsa) - Require frozen map for calculating map hash (KP Singh) - Fix lock entry creation in TAS fallback in rqspinlock (Kumar Kartikeya Dwivedi) - Allow user space to select cpu in lookup/update operations on per-cpu array and hash maps (Leon Hwang) - Make kfuncs return trusted pointers by default (Matt Bobrowski) - Introduce "fsession" support where single BPF program is executed upon entry and exit from traced kernel function (Menglong Dong) - Allow bpf_timer and bpf_wq use in all programs types (Mykyta Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei Starovoitov) - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their definition across the tree (Puranjay Mohan) - Allow BPF arena calls from non-sleepable context (Puranjay Mohan) - Improve register id comparison logic in the verifier and extend linked registers with negative offsets (Puranjay Mohan) - In preparation for BPF-OOM introduce kfuncs to access memcg events (Roman Gushchin) - Use CFI compatible destructor kfunc type (Sami Tolvanen) - Add bitwise tracking for BPF_END in the verifier (Tianci Cao) - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou Tang) - Make BPF selftests work with 64k page size (Yonghong Song) * tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits) selftests/bpf: Fix outdated test on storage->smap selftests/bpf: Choose another percpu variable in bpf for btf_dump test selftests/bpf: Remove test_task_storage_map_stress_lookup selftests/bpf: Update task_local_storage/task_storage_nodeadlock test selftests/bpf: Update task_local_storage/recursion test selftests/bpf: Update sk_storage_omem_uncharge test bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} bpf: Support lockless unlink when freeing map or local storage bpf: Prepare for bpf_selem_unlink_nofail() bpf: Remove unused percpu counter from bpf_local_storage_map_free bpf: Remove cgroup local storage percpu counter bpf: Remove task local storage percpu counter bpf: Change local_storage->lock and b->lock to rqspinlock bpf: Convert bpf_selem_unlink to failable bpf: Convert bpf_selem_link_map to failable bpf: Convert bpf_selem_unlink_map to failable bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage selftests/xsk: fix number of Tx frags in invalid packet selftests/xsk: properly handle batch ending in the middle of a packet bpf: Prevent reentrance into call_rcu_tasks_trace() ...
14 daysMerge tag 'kthread-for-7.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "The kthread code provides an infrastructure which manages the preferred affinity of unbound kthreads (node or custom cpumask) against housekeeping (CPU isolation) constraints and CPU hotplug events. One crucial missing piece is the handling of cpuset: when an isolated partition is created, deleted, or its CPUs updated, all the unbound kthreads in the top cpuset become indifferently affine to _all_ the non-isolated CPUs, possibly breaking their preferred affinity along the way. Solve this with performing the kthreads affinity update from cpuset to the kthreads consolidated relevant code instead so that preferred affinities are honoured and applied against the updated cpuset isolated partitions. The dispatch of the new isolated cpumasks to timers, workqueues and kthreads is performed by housekeeping, as per the nice Tejun's suggestion. As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set from boot defined domain isolation (through isolcpus=) and cpuset isolated partitions. Housekeeping cpumasks are now modifiable with a specific RCU based synchronization. A big step toward making nohz_full= also mutable through cpuset in the future" * tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits) doc: Add housekeeping documentation kthread: Document kthread_affine_preferred() kthread: Comment on the purpose and placement of kthread_affine_node() call kthread: Honour kthreads preferred affinity after cpuset changes sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management kthread: Include kthreadd to the managed affinity list kthread: Include unbound kthreads in the managed affinity list kthread: Refine naming of affinity related fields PCI: Remove superfluous HK_TYPE_WQ check sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() cpuset: Remove cpuset_cpu_is_isolated() timers/migration: Remove superfluous cpuset isolation test cpuset: Propagate cpuset isolation update to timers through housekeeping cpuset: Propagate cpuset isolation update to workqueue through housekeeping PCI: Flush PCI probe workqueue on cpuset isolated partition change sched/isolation: Flush vmstat workqueues on cpuset isolated partition change sched/isolation: Flush memcg workqueues on cpuset isolated partition change cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset ...
2026-02-06net/ipv6: Remove jumbo_remove step from TX pathAlice Mikityanska
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove unnecessary steps from the GSO TX path, that used to check and remove HBH. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-5-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06net/ipv6: Drop HBH for BIG TCP on RX sideAlice Mikityanska
Complementary to the previous commit, stop inserting HBH when building BIG TCP GRO SKBs. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-4-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06net: skb: allow up to 8 skb extension idsOliver Hartkopp
The skb extension ids range from 0 .. 7 to fit their bits as flags into a single byte. The ids are automatically enumnerated in enum skb_ext_id in skbuff.h, where SKB_EXT_NUM is defined as the last value. When having 8 skb extension ids (0 .. 7), SKB_EXT_NUM becomes 8 which is a valid value for SKB_EXT_NUM. Fixes: 96ea3a1e2d31 ("can: add CAN skb extension infrastructure") Link: https://lore.kernel.org/netdev/aXoMqaA7b2CqJZNA@strlen.de/ Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260205-skb_ext-v1-1-9ba992ccee8b@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06netns: optimize netns cleaning by batching unhash_nsid callsQiliang Yuan
Currently, unhash_nsid() scans the entire system for each netns being killed, leading to O(L_dying_net * M_alive_net * N_id) complexity, as __peernet2id() also performs a linear search in the IDR. Optimize this to O(M_alive_net * N_id) by batching unhash operations. Move unhash_nsid() out of the per-netns loop in cleanup_net() to perform a single-pass traversal over survivor namespaces. Identify dying peers by an 'is_dying' flag, which is set under net_rwsem write lock after the netns is removed from the global list. This batches the unhashing work and eliminates the O(L_dying_net) multiplier. To minimize the impact on struct net size, 'is_dying' is placed in an existing hole after 'hash_mix' in struct net. Use a restartable idr_get_next() loop for iteration. This avoids the unsafe modification issue inherent to idr_for_each() callbacks and allows dropping the nsid_lock to safely call sleepy rtnl_net_notifyid(). Clean up redundant nsid_lock and simplify the destruction loop now that unhashing is centralized. Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204074854.3506916-1-realwujing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}Amery Hung
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}() properly by switching to bpf_selem_unlink_nofail(). Both functions iterate their own RCU-protected list of selems and call bpf_selem_unlink_nofail(). In map_free(), to prevent infinite loop when both map_free() and destroy() fail to remove a selem from b->list (extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(), also switch to hlist_for_each_entry_rcu() since we no longer iterate local_storage->list under local_storage->lock. bpf_selem_unlink() now becomes dedicated to helpers and syscalls paths so reuse_now should always be false. Remove it from the argument and hardcode it. Acked-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com
2026-02-06bpf: Remove unused percpu counter from bpf_local_storage_map_freeAmery Hung
Percpu locks have been removed from cgroup and task local storage. Now that all local storage no longer use percpu variables as locks preventing recursion, there is no need to pass them to bpf_local_storage_map_free(). Remove the argument from the function. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-9-ameryhung@gmail.com
2026-02-06bpf: Convert bpf_selem_unlink to failableAmery Hung
To prepare changing both bpf_local_storage_map_bucket::lock and bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to failable. It still always succeeds and returns 0 until the change happens. No functional change. Open code bpf_selem_unlink_storage() in the only caller, bpf_selem_unlink(), since unlink_map and unlink_storage must be done together after all the necessary locks are acquired. For bpf_local_storage_map_free(), ignore the return from bpf_selem_unlink() for now. A later patch will allow it to unlink selems even when failing to acquire locks. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-5-ameryhung@gmail.com
2026-02-06bpf: Convert bpf_selem_link_map to failableAmery Hung
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock, convert bpf_selem_link_map() to failable. It still always succeeds and returns 0 until the change happens. No functional change. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-4-ameryhung@gmail.com
2026-02-06bpf: Select bpf_local_storage_map_bucket based on bpf_local_storageAmery Hung
A later bpf_local_storage refactor will acquire all locks before performing any update. To simplified the number of locks needed to take in bpf_local_storage_map_update(), determine the bucket based on the local_storage an selem belongs to instead of the selem pointer. Currently, when a new selem needs to be created to replace the old selem in bpf_local_storage_map_update(), locks of both buckets need to be acquired to prevent racing. This can be simplified if the two selem belongs to the same bucket so that only one bucket needs to be locked. Therefore, instead of hashing selem, hashing the local_storage pointer the selem belongs. Performance wise, this is slightly better as update now requires locking one bucket. It should not change the level of contention on one bucket as the pointers to local storages of selems in a map are just as unique as pointers to selems. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-2-ameryhung@gmail.com
2026-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.19-rc9). No adjacent changes, conflicts: drivers/net/ethernet/spacemit/k1_emac.c 3125fc1701694 ("net: spacemit: k1-emac: fix jumbo frame support") f66086798f91f ("net: spacemit: Remove broken flow control support") https://lore.kernel.org/aYIysFIE9ooavWia@sirena.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05net: get rid of net/core/request_sock.cEric Dumazet
After DCCP removal, this file was not needed any more. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05tcp: move reqsk_fastopen_remove to net/ipv4/tcp_fastopen.cEric Dumazet
This function belongs to TCP stack, not to net/core/request_sock.c We get rid of the now empty request_sock.c n the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05inet: move reqsk_queue_alloc() to net/ipv4/inet_connection_sock.cEric Dumazet
Only called once from inet_csk_listen_start(), it can be static. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05net: add vlan_get_protocol_offset_inline() helperEric Dumazet
skb_protocol() is bloated, and forces slow stack canaries in many fast paths. Add vlan_get_protocol_offset_inline() which deals with the non-vlan common cases. __vlan_get_protocol_offset() is now out of line. It returns a vlan_type_depth struct to avoid stack canaries in callers. struct vlan_type_depth { __be16 type; u16 depth; }; $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 0/22 up/down: 0/-6320 (-6320) Function old new delta vlan_get_protocol_dgram 61 59 -2 __pfx_skb_protocol 16 - -16 __vlan_get_protocol_offset 307 273 -34 tap_get_user 1374 1207 -167 ip_md_tunnel_xmit 1625 1452 -173 tap_sendmsg 940 753 -187 netif_skb_features 1079 866 -213 netem_enqueue 3017 2800 -217 vlan_parse_protocol 271 50 -221 tso_start 567 344 -223 fq_dequeue 1908 1685 -223 skb_network_protocol 434 205 -229 ip6_tnl_xmit 2639 2409 -230 br_dev_queue_push_xmit 474 236 -238 skb_protocol 258 - -258 packet_parse_headers 621 357 -264 __ip6_tnl_rcv 1306 1039 -267 skb_csum_hwoffload_help 515 224 -291 ip_tunnel_xmit 2635 2339 -296 sch_frag_xmit_hook 1582 1233 -349 bpf_skb_ecn_set_ce 868 457 -411 IP6_ECN_decapsulate 1297 768 -529 ip_tunnel_rcv 2121 1489 -632 ipip6_rcv 2572 1922 -650 Total: Before=24892803, After=24886483, chg -0.03% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260204053023.1622775-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-05can: add CAN skb extension infrastructureOliver Hartkopp
To remove the private CAN bus skb headroom infrastructure 8 bytes need to be stored in the skb. The skb extensions are a common pattern and an easy and efficient way to hold private data travelling along with the skb. We only need the skb_ext_add() and skb_ext_find() functions to allocate and access CAN specific content as the skb helpers to copy/clone/free skbs automatically take care of skb extensions and their final removal. This patch introduces the complete CAN skb extensions infrastructure: - add struct can_skb_ext in new file include/net/can.h - add include/net/can.h in MAINTAINERS - add SKB_EXT_CAN to skbuff.c and skbuff.h - select SKB_EXTENSIONS in Kconfig when CONFIG_CAN is enabled - check for existing CAN skb extensions in can_rcv() in af_can.c - add CAN skb extensions allocation at every skb_alloc() location - duplicate the skb extensions if cloning outgoing skbs (framelen/gw_hops) - introduce can_skb_ext_add() and can_skb_ext_find() helpers The patch also corrects an indention issue in the original code from 2018: Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602010426.PnGrYAk3-lkp@intel.com/ Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260201-can_skb_ext-v8-2-3635d790fe8b@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-04bpf: Don't check sk_fullsock() in bpf_skc_to_unix_sock().Kuniyuki Iwashima
AF_UNIX does not use TCP_NEW_SYN_RECV nor TCP_TIME_WAIT and checking sk->sk_family is sufficient. Let's remove sk_fullsock() and use sk_is_unix() in bpf_skc_to_unix_sock(). Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260203213442.682838-3-kuniyu@google.com
2026-02-03net: gro: fix outer network offsetPaolo Abeni
The udp GRO complete stage assumes that all the packets inserted the RX have the `encapsulation` flag zeroed. Such assumption is not true, as a few H/W NICs can set such flag when H/W offloading the checksum for an UDP encapsulated traffic, the tun driver can inject GSO packets with UDP encapsulation and the problematic layout can also be created via a veth based setup. Due to the above, in the problematic scenarios, udp4_gro_complete() uses the wrong network offset (inner instead of outer) to compute the outer UDP header pseudo checksum, leading to csum validation errors later on in packet processing. Address the issue always clearing the encapsulation flag at GRO completion time. Such flag will be set again as needed for encapsulated packets by udp_gro_complete(). Fixes: 5ef31ea5d053 ("net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/562638dbebb3b15424220e26a180274b387e2a88.1770032084.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-03net: add proper RCU protection to /proc/net/ptypeEric Dumazet
Yin Fengwei reported an RCU stall in ptype_seq_show() and provided a patch. Real issue is that ptype_seq_next() and ptype_seq_show() violate RCU rules. ptype_seq_show() runs under rcu_read_lock(), and reads pt->dev to get device name without any barrier. At the same time, concurrent writers can remove a packet_type structure (which is correctly freed after an RCU grace period) and clear pt->dev without an RCU grace period. Define ptype_iter_state to carry a dev pointer along seq_net_private: struct ptype_iter_state { struct seq_net_private p; struct net_device *dev; // added in this patch }; We need to record the device pointer in ptype_get_idx() and ptype_seq_next() so that ptype_seq_show() is safe against concurrent pt->dev changes. We also need to add full RCU protection in ptype_seq_next(). (Missing READ_ONCE() when reading list.next values) Many thanks to Dong Chenchen for providing a repro. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Fixes: 1d10f8a1f40b ("net-procfs: show net devices bound packet types") Fixes: c353e8983e0d ("net: introduce per netns packet chains") Reported-by: Yin Fengwei <fengwei_yin@linux.alibaba.com> Reported-by: Dong Chenchen <dongchenchen2@huawei.com> Closes: https://lore.kernel.org/netdev/CANn89iKRRKPnWjJmb-_3a=sq+9h6DvTQM4DBZHT5ZRGPMzQaiA@mail.gmail.com/T/#m7b80b9fc9b9267f90e0b7aad557595f686f9c50d Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Tested-by: Yin Fengwei <fengwei_yin@linux.alibaba.com> Link: https://patch.msgid.link/20260202205217.2881198-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-03netclassid: use thread_group_leader(p) in update_classid_task()Oleg Nesterov
Cleanup and preparation to simplify planned future changes. Link: https://lkml.kernel.org/r/aXY_4NSP094-Cf-2@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Boris Brezillon <boris.brezillon@collabora.com> Cc: Christan König <christian.koenig@amd.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Leon Romanovsky <leon@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-03net: Keep ignoring isolated cpuset changeFrederic Weisbecker
RPS cpumask can be overriden through sysfs/syctl. The boot defined isolated CPUs are then excluded from that cpumask. However HK_TYPE_DOMAIN will soon integrate cpuset isolated CPUs updates and the RPS infrastructure needs more thoughts to be able to propagate such changes and synchronize against them. Keep handling only what was passed through "isolcpus=" for now. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: netdev@vger.kernel.org
2026-02-02linkwatch: use __dev_put() in callers to prevent UAFJiayuan Chen
After linkwatch_do_dev() calls __dev_put() to release the linkwatch reference, the device refcount may drop to 1. At this point, netdev_run_todo() can proceed (since linkwatch_sync_dev() sees an empty list and returns without blocking), wait for the refcount to become 1 via netdev_wait_allrefs_any(), and then free the device via kobject_put(). This creates a use-after-free when __linkwatch_run_queue() tries to call netdev_unlock_ops() on the already-freed device. Note that adding netdev_lock_ops()/netdev_unlock_ops() pair in netdev_run_todo() before kobject_put() would not work, because netdev_lock_ops() is conditional - it only locks when netdev_need_ops_lock() returns true. If the device doesn't require ops_lock, linkwatch won't hold any lock, and netdev_run_todo() acquiring the lock won't provide synchronization. Fix this by moving __dev_put() from linkwatch_do_dev() to its callers. The device reference logically pairs with de-listing the device, so it's reasonable for the caller that did the de-listing to release it. This allows placing __dev_put() after all device accesses are complete, preventing UAF. The bug can be reproduced by adding mdelay(2000) after linkwatch_do_dev() in __linkwatch_run_queue(), then running: ip tuntap add mode tun name tun_test ip link set tun_test up ip link set tun_test carrier off ip link set tun_test carrier on sleep 0.5 ip tuntap del mode tun name tun_test KASAN report: ================================================================== BUG: KASAN: use-after-free in netdev_need_ops_lock include/net/netdev_lock.h:33 [inline] BUG: KASAN: use-after-free in netdev_unlock_ops include/net/netdev_lock.h:47 [inline] BUG: KASAN: use-after-free in __linkwatch_run_queue+0x865/0x8a0 net/core/link_watch.c:245 Read of size 8 at addr ffff88804de5c008 by task kworker/u32:10/8123 CPU: 0 UID: 0 PID: 8123 Comm: kworker/u32:10 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Workqueue: events_unbound linkwatch_event Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0x156/0x4c9 mm/kasan/report.c:482 kasan_report+0xdf/0x1a0 mm/kasan/report.c:595 netdev_need_ops_lock include/net/netdev_lock.h:33 [inline] netdev_unlock_ops include/net/netdev_lock.h:47 [inline] __linkwatch_run_queue+0x865/0x8a0 net/core/link_watch.c:245 linkwatch_event+0x8f/0xc0 net/core/link_watch.c:304 process_one_work+0x9c2/0x1840 kernel/workqueue.c:3257 process_scheduled_works kernel/workqueue.c:3340 [inline] worker_thread+0x5da/0xe40 kernel/workqueue.c:3421 kthread+0x3b3/0x730 kernel/kthread.c:463 ret_from_fork+0x754/0xaf0 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 </TASK> ================================================================== Fixes: 04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE") Reported-by: syzbot+1ec2f6a450f0b54af8c8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6824d064.a70a0220.3e9d8.001a.GAE@google.com/T/ Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260201135915.393451-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-31bpf: Fix bpf_xdp_store_bytes proto for read-only argPaul Chaignon
While making some maps in Cilium read-only from the BPF side, we noticed that the bpf_xdp_store_bytes proto is incorrect. In particular, the verifier was throwing the following error: ; ret = ctx_store_bytes(ctx, l3_off + offsetof(struct iphdr, saddr), &nat->address, 4, 0); 635: (79) r1 = *(u64 *)(r10 -144) ; R1=ctx() R10=fp0 fp-144=ctx() 636: (b4) w2 = 26 ; R2=26 637: (b4) w4 = 4 ; R4=4 638: (b4) w5 = 0 ; R5=0 639: (85) call bpf_xdp_store_bytes#190 write into map forbidden, value_size=6 off=0 size=4 nat comes from a BPF_F_RDONLY_PROG map, so R3 is a PTR_TO_MAP_VALUE. The verifier checks the helper's memory access to R3 in check_mem_size_reg, as it reaches ARG_CONST_SIZE argument. The third argument has expected type ARG_PTR_TO_UNINIT_MEM, which includes the MEM_WRITE flag. The verifier thus checks for a BPF_WRITE access on R3. Given R3 points to a read-only map, the check fails. Conversely, ARG_PTR_TO_UNINIT_MEM can also lead to the helper reading from uninitialized memory. This patch simply fixes the expected argument type to match that of bpf_skb_store_bytes. Fixes: 3f364222d032 ("net: xdp: introduce bpf_xdp_pointer utility routine") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/9fa3c9f72d806e82541071c4df88b8cba28ad6a9.1769875479.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-31net: don't touch dev->stats in BPF redirect pathsJakub Kicinski
Gal reports that BPF redirect increments dev->stats.tx_errors on failure. This is not correct, most modern drivers completely ignore dev->stats so these drops will be invisible to the user. Core code should use the dedicated core stats which are folded into device stats in dev_get_stats(). Note that we're switching from tx_errors to tx_dropped. Core only has tx_dropped, hence presumably users already expect that counter to increment for "stack" Tx issues. Reported-by: Gal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/c5df3b60-246a-4030-9c9a-0a35cd1ca924@nvidia.com Fixes: b4ab31414970 ("bpf: Add redirect_neigh helper as redirect drop-in") Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260130033827.698841-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-30tcp: reduce tcp sockets size by one cache lineEric Dumazet
By default, when a kmem_cache is created with SLAB_TYPESAFE_BY_RCU, slub has to use extra storage for the freelist pointer after each object, because slub assumes that any bit in the object can be used by RCU readers. Because proto_register() is also using SLAB_HWCACHE_ALIGN, this forces slub to use one extra cache line per object. We can instead put the slub freelist anywhere in the object, granted the concurrent RCU readers are not supposed to use the pointer value. Add a new (struct sock)sk_freeptr field, in an union with sk_rcu: No RCU readers would need to look at sk_rcu, which is only used at free phase. Tested: grep . /sys/kernel/slab/TCP/{object_size,slab_size,objs_per_slab} grep . /sys/kernel/slab/TCPv6/{object_size,slab_size,objs_per_slab} Before: /sys/kernel/slab/TCP/object_size:2368 /sys/kernel/slab/TCP/slab_size:2432 /sys/kernel/slab/TCP/objs_per_slab:13 /sys/kernel/slab/TCPv6/object_size:2496 /sys/kernel/slab/TCPv6/slab_size:2560 /sys/kernel/slab/TCPv6/objs_per_slab:12 After this patch, we can pack one more TCPv6 object per slab, and object_size == slab_size. /sys/kernel/slab/TCP/object_size:2368 /sys/kernel/slab/TCP/slab_size:2368 /sys/kernel/slab/TCP/objs_per_slab:13 /sys/kernel/slab/TCPv6/object_size:2496 /sys/kernel/slab/TCPv6/slab_size:2496 /sys/kernel/slab/TCPv6/objs_per_slab:13 Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260129153458.4163797-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.19-rc8). No adjacent changes, conflicts: drivers/net/ethernet/spacemit/k1_emac.c 2c84959167d64 ("net: spacemit: Check for netif_carrier_ok() in emac_stats_update()") f66086798f91f ("net: spacemit: Remove broken flow control support") https://lore.kernel.org/aXjAqZA3iEWD_DGM@sirena.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-29net: fix segmentation of forwarding fraglist GROJibin Zhang
This patch enhances GSO segment handling by properly checking the SKB_GSO_DODGY flag for frag_list GSO packets, addressing low throughput issues observed when a station accesses IPv4 servers via hotspots with an IPv6-only upstream interface. Specifically, it fixes a bug in GSO segmentation when forwarding GRO packets containing a frag_list. The function skb_segment_list cannot correctly process GRO skbs that have been converted by XLAT, since XLAT only translates the header of the head skb. Consequently, skbs in the frag_list may remain untranslated, resulting in protocol inconsistencies and reduced throughput. To address this, the patch explicitly sets the SKB_GSO_DODGY flag for GSO packets in XLAT's IPv4/IPv6 protocol translation helpers (bpf_skb_proto_4_to_6 and bpf_skb_proto_6_to_4). This marks GSO packets as potentially modified after protocol translation. As a result, GSO segmentation will avoid using skb_segment_list and instead falls back to skb_segment for packets with the SKB_GSO_DODGY flag. This ensures that only safe and fully translated frag_list packets are processed by skb_segment_list, resolving protocol inconsistencies and improving throughput when forwarding GRO packets converted by XLAT. Signed-off-by: Jibin Zhang <jibin.zhang@mediatek.com> Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260126152114.1211-1-jibin.zhang@mediatek.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-27bpf, sockmap: Fix FIONREAD for sockmapJiayuan Chen
A socket using sockmap has its own independent receive queue: ingress_msg. This queue may contain data from its own protocol stack or from other sockets. Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to calculate FIONREAD is not enough. This patch adds a new msg_tot_len field in the psock structure to record the data length in ingress_msg. Additionally, we implement new ioctl interfaces for TCP and UDP to intercept FIONREAD operations. Note that we intentionally do not include sk_receive_queue data in the FIONREAD result. Data in sk_receive_queue has not yet been processed by the BPF verdict program, and may be redirected to other sockets or dropped. Including it would create semantic ambiguity since this data may never be readable by the user. Unix and VSOCK sockets have similar issues, but fixing them is outside the scope of this patch as it would require more intrusive changes. Previous work by John Fastabend made some efforts towards FIONREAD support: commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq") Although the current patch is based on the previous work by John Fastabend, it is acceptable for our Fixes tag to point to the same commit. FD1:read() -- FD1->copied_seq++ | [read data] | [enqueue data] v [sockmap] -> ingress to self -> ingress_msg queue FD1 native stack ------> ^ -- FD1->rcv_nxt++ -> redirect to other | [enqueue data] | | | ingress to FD1 v ^ ... | [sockmap] FD2 native stack Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20260124113314.113584-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-27bpf, sockmap: Fix incorrect copied_seq calculationJiayuan Chen
A socket using sockmap has its own independent receive queue: ingress_msg. This queue may contain data from its own protocol stack or from other sockets. The issue is that when reading from ingress_msg, we update tp->copied_seq by default. However, if the data is not from its own protocol stack, tcp->rcv_nxt is not increased. Later, if we convert this socket to a native socket, reading from this socket may fail because copied_seq might be significantly larger than rcv_nxt. This fix also addresses the syzkaller-reported bug referenced in the Closes tag. This patch marks the skmsg objects in ingress_msg. When reading, we update copied_seq only if the data is from its own protocol stack. FD1:read() -- FD1->copied_seq++ | [read data] | [enqueue data] v [sockmap] -> ingress to self -> ingress_msg queue FD1 native stack ------> ^ -- FD1->rcv_nxt++ -> redirect to other | [enqueue data] | | | ingress to FD1 v ^ ... | [sockmap] FD2 native stack Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983 Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260124113314.113584-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-26net: include <linux/hex.h> from sysctl_net_core.cEric Dumazet
Needed for hex_byte_pack(). x86_64 was already including it, but some arches were not. Fixes: 37b0ea8fef56 ("net: expand NETDEV_RSS_KEY_LEN to 256 bytes") Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/netdev/aXeka0KYBnrkwUcF@sirena.org.uk/ Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260126174731.2767372-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25net: core: neighbour: Make another netlink notification atomicallyPetr Machata
Similarly to the issue from the previous patch, neigh_timer_handler() also updates the neighbor separately from formatting and sending the netlink notification message. We have not seen reports to the effect of this causing trouble, but in theory, the same sort of issues could have come up: neigh_timer_handler() would make changes as necessary, but before formatting and sending a notification, is interrupted before sending by another thread, which makes a parallel change and sends its own message. The message send that is prompted by an earlier change thus contains information that does not reflect the change having been made. To solve this, the netlink notification needs to be in the same critical section that updates the neighbor. The critical section is ended by the neigh_probe() call which drops the lock before calling solicit. Stretching the critical section over the solicit call is problematic, because that can then involved all sorts of forwarding callbacks. Therefore, like in the previous patch, split the netlink notification away from the internal one and move it ahead of the probe call. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/e440118511cbdbe1d88eb0d71c9047116feb96e0.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25net: core: neighbour: Make one netlink notification atomicallyPetr Machata
As noted in a previous patch, one race remains in the current code. A kernel thread might interrupt a userspace thread after the change is done, but before formatting and sending the message. Then what we would see is two messages with the same contents: userspace thread kernel thread ================ ============= neigh_update write_lock_bh(n->lock) n->nud_state = STALE write_unlock_bh(n->lock) --------------------------> neigh:update write_lock_bh(n->lock) n->nud_state = REACHABLE write_unlock_bh(n->lock) neigh_notify read_lock_bh(n->lock) __neigh_fill_info ndm->nud_state = REACHABLE rtnl_notify read_unlock_bh(n->lock) RTNL REACHABLE sent <-------- neigh_notify read_lock_bh(n->lock) __neigh_fill_info ndm->nud_state = REACHABLE rtnl_notify read_unlock_bh(n->lock) RTNL REACHABLE sent again The solution is to send the netlink message inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state. To that end, in __neigh_update(), move the current neigh_notify() call up to said critical section, and convert it to __neigh_notify(), because the lock is held. This motion crosses calls to neigh_update_managed_list(), neigh_update_gc_list() and neigh_update_process_arp_queue(), all of which potentially unlock and give an opportunity for the above race. This also crosses a call to neigh_update_process_arp_queue() which calls neigh->output(), which might be neigh_resolve_output() calls neigh_event_send() calls neigh_event_send_probe() calls __neigh_event_send() calls neigh_probe(), which touches neigh->probes, an update which will now not be visible in the notification. However, there is indication that there is no promise that these changes will be accurately projected to notifications: fib6_table_lookup() indirectly calls route.c's find_match() calls rt6_probe(), which looks up a neighbor and call __neigh_set_probe_once(), which sets neigh->probes to 0, but neither this nor the caller seems to send a notification. Additionally, the neighbor object that the neigh_probe() mentioned above is called on, might be the alternative neighbor looked up for the ARP queue packet destination. If that is the case, the changed value of n1->probes is not notified anywhere. So at least in some circumstances, the reported number of probes needs to be assumed to change without notification. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/ceb44995498eb52375cb2d46c3245bdb9e74b355.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25net: core: neighbour: Reorder netlink & internal notificationPetr Machata
The netlink message needs to be send inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state. On the other hand, there is no such need in case of notifier chain: the listeners do not assume lock, and often in fact just schedule a delayed work to act on the neighbor later. At least one in fact also takes the neighbor lock. This requires that the netlink notification be done before the internal notifier chain message is sent. That is safe to do, because the current listeners, as well as __neigh_notify(), only read the updated neighbor fields, and never modify them. (Apart from locking.) Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/f3ef74d5460f14c4d102b8a5857d4a6624da9a5a.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25net: core: neighbour: Inline neigh_update_notify() callsPetr Machata
The obvious idea behind the helper is to keep together the two bits that should be done either both or neither: the internal notifier chain message, and the netlink notification. To make sure that the notification sent reflects the change being made, the netlink message needs to be send inside the critical section where the neighbor is changed. But for the notifier chain, there is no such need: the listeners do not assume lock, and often in fact just schedule a delayed work to act on the neighbor later. At least one in fact also takes the neighbor lock. Therefore these two items have each different locking needs. Now we could unlock inside the helper, but I find that error prone, and the fact that the notification is conditional in the first place does not help to make the call site obvious. So in this patch, the helper is instead removed and the body, which is just these two calls, inlined. That way we can use each notifier independently. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/e65dce5882bc6f4aa2530b8a4877d0e003071a1a.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25net: core: neighbour: Process ARP queue laterPetr Machata
ARP queue processing unlocks the neighbor lock, which can allow another thread to asynchronously perform a neighbor update and send an out of order notification. Therefore this needs to be done after the notification is sent. Move it just before the end of the critical section. Since neigh_update_process_arp_queue() unlocks, it does not form a part of the critical section anymore but it can benefit from the lock being taken. The intention is to eventually do the RTNL notification before this call. This motion crosses a call to neigh_update_is_router(), which should not influence processing of the ARP queue. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/9ea7159e71430ebdc837ebcc880a76b7e82e52a4.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25net: core: neighbour: Extract ARP queue processing to a helper functionPetr Machata
In order to make manipulation with this bit of code clearer, extract it to a helper function, neigh_update_process_arp_queue(). Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/8b0fa0abe2cf0e24484903f5436fe0ac64163057.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25net: core: neighbour: Call __neigh_notify() under a lockPetr Machata
Andy Roulin has described an issue with the current neighbor notification scheme as follows. This was also presented publicly at the link below. neigh_update sends a rtnl notification if an update, e.g., nud_state change, was done but there is no guarantee of ordering of the rtnl notifications. Consider the following scenario: userspace thread kernel thread ================ ============= neigh_update write_lock_bh(n->lock) n->nud_state = STALE write_unlock_bh(n->lock) neigh_notify neigh_fill_info read_lock_bh(n->lock) ndm->nud_state = STALE read_unlock_bh(n->lock) --------------------------> neigh:update write_lock_bh(n->lock) n->nud_state = REACHABLE write_unlock_bh(n->lock) neigh_notify neigh_fill_info read_lock_bh(n->lock) ndm->nud_state = REACHABLE read_unlock_bh(n->lock) rtnl_nofify RTNL REACHABLE sent <-------- rtnl_notify RTNL STALE sent In this scenario, the kernel neigh is updated first to STALE and then REACHABLE but the netlink notifications are sent out of order, first REACHABLE and then STALE. The solution is to send the netlink message inside the same critical section that formats the message. That way both the contents and ordering of the message reflect the same state, and we cannot see the abovementioned out-of-order delivery. Even with this patch, a remaining issue that the contents of the message may not reflect the changes made to the neighbor. A kernel thread might still interrupt a userspace thread after the change is done, but before formatting and sending the message. Then what we would see is two messages with the same contents. The following patches will attempt to address that issue. To support those future patches, convert __neigh_notify() to a helper that assumes that the neighbor lock is already taken by having it call __neigh_fill_info() instead of neigh_fill_info(). Add a new helper, neigh_notify(), which takes the lock before calling __neigh_notify(). Migrate all callers to use the latter. Link: https://lore.kernel.org/netdev/ed6768c1-80b8-aee2-e545-b51661d49336@nvidia.com/ Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/4b4368dcc5f5a7e407009cb6c36b69cfb5282864.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>