summaryrefslogtreecommitdiff
path: root/net/sched
AgeCommit message (Collapse)Author
6 daysnet/sched: sch_sfq: annotate data-races from sfq_dump_class_stats()Eric Dumazet
sfq_dump_class_stats() runs locklessly, add needed READ_ONCE() and WRITE_ONCE() annotations. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260505091133.2452510-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnet/sched: sch_fq_codel: annotate data-races from fq_codel_dump_class_stats()Eric Dumazet
fq_codel_dump_class_stats() acquires qdisc spinlock only when requested to follow flow->head chain. As we did in sch_cake recently, add the missing READ_ONCE()/WRITE_ONCE() annotations. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260504163842.1162001-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnet/sched: sch_cake: annotate data-races in cake_dump_class_stats (II)Eric Dumazet
cake_dump_class_stats() runs without qdisc spinlock being held. In this second patch, I add READ_ONCE()/WRITE_ONCE() annotations for: - flow->deficit - flow->cvars.dropping - flow->cvars.count - flow->cvars.p_drop - flow->cvars.blue_timer - flow->cvars.drop_next Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260430061610.3503483-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnet/sched: sch_cake: annotate data-races in cake_dump_class_stats (I)Eric Dumazet
cake_dump_class_stats() runs without qdisc spinlock being held. In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations for: - flow->head - flow->dropped - b->backlogs[] Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260430061610.3503483-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnet/sched: sch_sfb: Replace direct dequeue call with peek and ↵Victor Nogueria
qdisc_dequeue_peeked When sfb has children (eg qfq qdisc) whose peek() callback is qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from its child (sfb in this case), it will do the following: 1a. do a peek() - and when sensing there's an skb the child can offer, then - the child in this case(sfb) calls its child's (qfq) peek. qfq does the right thing and will return the gso_skb queue packet. Note: if there wasnt a gso_skb entry then qfq will store it there. 1b. invoke a dequeue() on the child (sfb). And herein lies the problem. - sfb will call the child's dequeue() which will essentially just try to grab something of qfq's queue. [ 127.594489][ T453] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f] [ 127.594741][ T453] CPU: 2 UID: 0 PID: 453 Comm: ping Not tainted 7.1.0-rc1-00035-gac961974495b-dirty #793 PREEMPT(full) [ 127.595059][ T453] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 127.595254][ T453] RIP: 0010:qfq_dequeue+0x35c/0x1650 [sch_qfq] [ 127.595461][ T453] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b [ 127.596081][ T453] RSP: 0018:ffff88810e5af440 EFLAGS: 00010216 [ 127.596337][ T453] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000 [ 127.596623][ T453] RDX: 0000000000000009 RSI: 0000001880000000 RDI: ffff888104fd82b0 [ 127.596917][ T453] RBP: ffff888104fd8000 R08: ffff888104fd8280 R09: 1ffff110211893a3 [ 127.597165][ T453] R10: 1ffff110211893a6 R11: 1ffff110211893a7 R12: 0000001880000000 [ 127.597404][ T453] R13: ffff888104fd82b8 R14: 0000000000000048 R15: 0000000040000000 [ 127.597644][ T453] FS: 00007fc380cbfc40(0000) GS:ffff88816f2a8000(0000) knlGS:0000000000000000 [ 127.597956][ T453] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 127.598160][ T453] CR2: 00005610aa9890a8 CR3: 000000010369e000 CR4: 0000000000750ef0 [ 127.598390][ T453] PKRU: 55555554 [ 127.598509][ T453] Call Trace: [ 127.598629][ T453] <TASK> [ 127.598718][ T453] ? mark_held_locks+0x40/0x70 [ 127.598890][ T453] ? srso_alias_return_thunk+0x5/0xfbef5 [ 127.599053][ T453] sfb_dequeue+0x88/0x4d0 [ 127.599174][ T453] ? ktime_get+0x137/0x230 [ 127.599328][ T453] ? srso_alias_return_thunk+0x5/0xfbef5 [ 127.599480][ T453] ? qdisc_peek_dequeued+0x7b/0x350 [sch_qfq] [ 127.599670][ T453] ? srso_alias_return_thunk+0x5/0xfbef5 [ 127.599831][ T453] tbf_dequeue+0x6b1/0x1098 [sch_tbf] [ 127.599988][ T453] __qdisc_run+0x169/0x1900 The right thing to do in #1b is to grab the skb off gso_skb queue. This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked() method instead. Fixes: e13e02a3c68d ("net_sched: SFB flow scheduler") Signed-off-by: Victor Nogueria <victor@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260430152957.194015-3-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnet/sched: sch_red: Replace direct dequeue call with peek and ↵Jamal Hadi Salim
qdisc_dequeue_peeked When red qdisc has children (eg qfq qdisc) whose peek() callback is qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from its child (red in this case), it will do the following: 1a. do a peek() - and when sensing there's an skb the child can offer, then - the child in this case(red) calls its child's (qfq) peek. qfq does the right thing and will return the gso_skb queue packet. Note: if there wasnt a gso_skb entry then qfq will store it there. 1b. invoke a dequeue() on the child (red). And herein lies the problem. - red will call the child's dequeue() which will essentially just try to grab something of qfq's queue. [ 78.667668][ T363] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f] [ 78.667927][ T363] CPU: 1 UID: 0 PID: 363 Comm: ping Not tainted 7.1.0-rc1-00033-g46f74a3f7d57-dirty #790 PREEMPT(full) [ 78.668263][ T363] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 78.668486][ T363] RIP: 0010:qfq_dequeue+0x446/0xc90 [sch_qfq] [ 78.668718][ T363] Code: 54 c0 e8 dd 90 00 f1 48 c7 c7 e0 03 54 c0 48 89 de e8 ce 90 00 f1 48 8d 7b 48 b8 ff ff 37 00 48 89 fa 48 c1 e0 2a 48 c1 ea 03 <80> 3c 02 00 74 05 e8 ef a1 e1 f1 48 8b 7b 48 48 8d 54 24 58 48 8d [ 78.669312][ T363] RSP: 0018:ffff88810de573e0 EFLAGS: 00010216 [ 78.669533][ T363] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 78.669790][ T363] RDX: 0000000000000009 RSI: 0000000000000004 RDI: 0000000000000048 [ 78.670044][ T363] RBP: ffff888110dc4000 R08: ffffffffb1b0885a R09: fffffbfff6ba9078 [ 78.670297][ T363] R10: 0000000000000003 R11: ffff888110e31c80 R12: 0000001880000000 [ 78.670560][ T363] R13: ffff888110dc4150 R14: ffff888110dc42b8 R15: 0000000000000200 [ 78.670814][ T363] FS: 00007f66a8f09c40(0000) GS:ffff888163428000(0000) knlGS:0000000000000000 [ 78.671110][ T363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 78.671324][ T363] CR2: 000055db4c6a30a8 CR3: 000000010da67000 CR4: 0000000000750ef0 [ 78.671585][ T363] PKRU: 55555554 [ 78.671713][ T363] Call Trace: [ 78.671843][ T363] <TASK> [ 78.671936][ T363] ? __pfx_qfq_dequeue+0x10/0x10 [sch_qfq] [ 78.672148][ T363] ? __pfx__printk+0x10/0x10 [ 78.672322][ T363] ? srso_alias_return_thunk+0x5/0xfbef5 [ 78.672496][ T363] ? lockdep_hardirqs_on_prepare+0xa8/0x1a0 [ 78.672706][ T363] ? srso_alias_return_thunk+0x5/0xfbef5 [ 78.672875][ T363] ? trace_hardirqs_on+0x19/0x1a0 [ 78.673047][ T363] red_dequeue+0x65/0x270 [sch_red] [ 78.673217][ T363] ? srso_alias_return_thunk+0x5/0xfbef5 [ 78.673385][ T363] tbf_dequeue.cold+0xb0/0x70c [sch_tbf] [ 78.673566][ T363] __qdisc_run+0x169/0x1900 The right thing to do in #1b is to grab the skb off gso_skb queue. This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked() method instead. Fixes: 77be155cba4e ("pkt_sched: Add peek emulation for non-work-conserving qdiscs.") Reported-by: Manas <ghandatmanas@gmail.com> Reported-by: Rakshit Awasthi <rakshitawasthi17@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260430152957.194015-2-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysnet/sched: sch_pie: annotate more data-races in pie_dump_stats()Eric Dumazet
My prior patch missed few READ_ONCE()/WRITE_ONCE() annotations. Fixes: 5154561d9b11 ("net/sched: sch_pie: annotate data-races in pie_dump_stats()") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260430080056.35104-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysnet/sched: cls_flower: revert unintended changesPaolo Abeni
While applying the blamed commit 4ca07b9239bd ("net: mctp i2c: check length before marking flow active"), I unintentionally included unrelated and unacceptable changes. Revert them. Fixes: 4ca07b9239bd ("net: mctp i2c: check length before marking flow active") Reported-by: Jeremy Kerr <jk@codeconstruct.com.au> Closes: https://lore.kernel.org/netdev/bd8704fe0bd53e278add5cde4873256656623e2e.camel@codeconstruct.com.au/ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/043026a53ff84da88b17648c4b0d17f0331749cb.1777447863.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
14 daysnet/sched: sch_cake: annotate data-races in cake_dump_stats() (V)Eric Dumazet
cake_dump_stats() runs without qdisc spinlock being held. In this final patch, I add READ_ONCE()/WRITE_ONCE() annotations for cparams.target and cparams.interval. Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk> Link: https://patch.msgid.link/20260427083606.459355-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
14 daysnet/sched: sch_cake: annotate data-races in cake_dump_stats() (IV)Eric Dumazet
cake_dump_stats() runs without qdisc spinlock being held. In this fourth patch, I add READ_ONCE()/WRITE_ONCE() annotations for the following fields: - avg_peak_bandwidth - buffer_limit - buffer_max_used - avg_netoff - max_netlen - max_adjlen - min_netlen - min_adjlen - active_queues - tin_rate_bps - bytes - tin_backlog Other annotations are added in following patch, to ease code review. Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260427083606.459355-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
14 daysnet/sched: sch_cake: annotate data-races in cake_dump_stats() (III)Eric Dumazet
cake_dump_stats() runs without qdisc spinlock being held. In this third patch, I add READ_ONCE()/WRITE_ONCE() annotations for the following fields: - packets - tin_dropped - tin_ecn_mark - ack_drops - peak_delay - avge_delay - base_delay Other annotations are added in following patches, to ease code review. Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk> Link: https://patch.msgid.link/20260427083606.459355-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
14 daysnet/sched: sch_cake: annotate data-races in cake_dump_stats() (II)Eric Dumazet
cake_dump_stats() runs without qdisc spinlock being held. In this second patch, I add READ_ONCE()/WRITE_ONCE() annotations for the following fields: - bulk_flow_count - unresponsive_flow_count - max_skblen - flow_quantum Other annotations are added in following patches, to ease code review. Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk> Link: https://patch.msgid.link/20260427083606.459355-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
14 daysnet/sched: sch_cake: annotate data-races in cake_dump_stats() (I)Eric Dumazet
cake_dump_stats() runs without qdisc spinlock being held. In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations for the following fields: - way_hits - way_misses - way_collisions - sparse_flow_count - decaying_flow_count Other annotations are added in following patches, to ease code review. Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk> Link: https://patch.msgid.link/20260427083606.459355-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-28net: mctp i2c: check length before marking flow activeWilliam A. Kennington III
Currently, mctp_i2c_get_tx_flow_state() is called before the packet length sanity check. This function marks a new flow as active in the MCTP core. If the sanity check fails, mctp_i2c_xmit() returns early without calling mctp_i2c_lock_nest(). This results in a mismatched locking state: the flow is active, but the I2C bus lock was never acquired for it. When the flow is later released, mctp_i2c_release_flow() will see the active state and queue an unlock marker. The TX thread will then decrement midev->i2c_lock_count from 0, causing it to underflow to -1. This underflow permanently breaks the driver's locking logic, allowing future transmissions to occur without holding the I2C bus lock, leading to bus collisions and potential hardware hangs. Move the mctp_i2c_get_tx_flow_state() call to after the length sanity check to ensure we only transition the flow state if we are actually going to proceed with the transmission and locking. Fixes: f5b8abf9fc3d ("mctp i2c: MCTP I2C binding driver") Signed-off-by: William A. Kennington III <william@wkennington.com> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20260423074741.201460-1-william@wkennington.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-27net/sched: taprio: fix NULL pointer dereference in class dumpWeiming Shi
When a TAPRIO child qdisc is deleted via RTM_DELQDISC, taprio_graft() is called with new == NULL and stores NULL into q->qdiscs[cl - 1]. Subsequent RTM_GETTCLASS dump operations walk all classes via taprio_walk() and call taprio_dump_class(), which calls taprio_leaf() returning the NULL pointer, then dereferences it to read child->handle, causing a kernel NULL pointer dereference. The bug is reachable with namespace-scoped CAP_NET_ADMIN on any kernel with CONFIG_NET_SCH_TAPRIO enabled. On systems with unprivileged user namespaces enabled, an unprivileged local user can trigger a kernel panic by creating a taprio qdisc inside a new network namespace, grafting an explicit child qdisc, deleting it, and requesting a class dump. The RTM_GETTCLASS dump itself requires no capability. Oops: general protection fault, probably for non-canonical address 0xdffffc0000000007: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f] RIP: 0010:taprio_dump_class (net/sched/sch_taprio.c:2478) Call Trace: <TASK> tc_fill_tclass (net/sched/sch_api.c:1966) qdisc_class_dump (net/sched/sch_api.c:2326) taprio_walk (net/sched/sch_taprio.c:2514) tc_dump_tclass_qdisc (net/sched/sch_api.c:2352) tc_dump_tclass_root (net/sched/sch_api.c:2370) tc_dump_tclass (net/sched/sch_api.c:2431) rtnl_dumpit (net/core/rtnetlink.c:6864) netlink_dump (net/netlink/af_netlink.c:2325) rtnetlink_rcv_msg (net/core/rtnetlink.c:6959) netlink_rcv_skb (net/netlink/af_netlink.c:2550) </TASK> Fix this by substituting &noop_qdisc when new is NULL in taprio_graft(), a common pattern used by other qdiscs (e.g., multiq_graft()) to ensure the q->qdiscs[] slots are never NULL. This makes control-plane dump paths safe without requiring individual NULL checks. Since the data-plane paths (taprio_enqueue and taprio_dequeue_from_txq) previously had explicit NULL guards that would drop/skip the packet cleanly, update those checks to test for &noop_qdisc instead. Without this, packets would reach taprio_enqueue_one() which increments the root qdisc's qlen and backlog before calling the child's enqueue; noop_qdisc drops the packet but those counters are never rolled back, permanently inflating the root qdisc's statistics. After this change *old can be a valid qdisc, NULL, or &noop_qdisc. Only call qdisc_put(*old) in the first case to avoid decreasing noop_qdisc's refcount, which was never increased. Fixes: 665338b2a7a0 ("net/sched: taprio: dump class stats for the actual q->qdiscs[]") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260422161958.2517539-3-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27net/sched: sch_fq_pie: annotate data-races in fq_pie_dump_stats()Eric Dumazet
fq_codel_dump_stats() acquires the qdisc spinlock a bit too late. Move this acquisition before we fill tc_fq_pie_xstats with live data. Alternative would be to add READ_ONCE() and WRITE_ONCE() annotations, but the spinlock is needed anyway to scan q->new_flows and q->old_flows. Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260423063527.2568262-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27net/sched: sch_choke: annotate data-races in choke_dump_stats()Eric Dumazet
choke_dump_stats() only runs with RTNL held. It reads fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260423062839.2524324-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27net/sched: netem: check for negative latency and jitterStephen Hemminger
Reject requests with negative latency or jitter. A negative value added to current timestamp (u64) wraps to an enormous time_to_send, disabling dequeue. The original UAPI used u32 for these values; the conversion to 64-bit time values via TCA_NETEM_LATENCY64 and TCA_NETEM_JITTER64 allowed signed values to reach the kernel without validation. Jitter is already silently clamped by an abs() in netem_change(); that abs() can be removed in a follow-up once this rejection is in place. Fixes: 99803171ef04 ("netem: add uapi to express delay and jitter in nanoseconds") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-7-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27net/sched: netem: fix slot delay calculation overflowStephen Hemminger
get_slot_next() computes a random delay between min_delay and max_delay using: get_random_u32() * (max_delay - min_delay) >> 32 This overflows signed 64-bit arithmetic when the delay range exceeds approximately 2.1 seconds (2^31 nanoseconds), producing a negative result that effectively disables slot-based pacing. This is a realistic configuration for WAN emulation (e.g., slot 1s 5s). Use mul_u64_u32_shr() which handles the widening multiply without overflow. Fixes: 0a9fe5c375b5 ("netem: slotting with non-uniform distribution") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-6-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27net/sched: netem: validate slot configurationStephen Hemminger
Reject slot configurations that have no defensible meaning: - negative min_delay or max_delay - min_delay greater than max_delay - negative dist_delay or dist_jitter - negative max_packets or max_bytes Negative or out-of-order delays underflow in get_slot_next(), producing garbage intervals. Negative limits trip the per-slot accounting (packets_left/bytes_left <= 0) on the first packet of every slot, defeating the rate-limiting half of the slot feature. Note that dist_jitter has been silently coerced to its absolute value by get_slot() since the feature was introduced; rejecting negatives here converts that silent coercion into -EINVAL. The abs() can be removed in a follow-up. Fixes: 836af83b54e3 ("netem: support delivering packets in delayed time slots") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-5-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27net/sched: netem: only reseed PRNG when seed is explicitly providedStephen Hemminger
netem_change() unconditionally reseeds the PRNG on every tc change command. If TCA_NETEM_PRNG_SEED is not specified, a new random seed is generated, destroying reproducibility for users who set a deterministic seed on a previous change. Move the initial random seed generation to netem_init() and only reseed in netem_change() when TCA_NETEM_PRNG_SEED is explicitly provided by the user. Fixes: 4072d97ddc44 ("netem: add prng attribute to netem_sched_data") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-4-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27net/sched: netem: fix queue limit check to include reordered packetsStephen Hemminger
The queue limit check in netem_enqueue() uses q->t_len which only counts packets in the internal tfifo. Packets placed in sch->q by the reorder path (__qdisc_enqueue_head) are not counted, allowing the total queue occupancy to exceed sch->limit under reordering. Include sch->q.qlen in the limit check. Fixes: f8d4bc455047 ("net/sched: netem: account for backlog updates from child qdisc") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-3-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-27net/sched: netem: fix probability gaps in 4-state loss modelStephen Hemminger
The 4-state Markov chain in loss_4state() has gaps at the boundaries between transition probability ranges. The comparisons use: if (rnd < a4) else if (a4 < rnd && rnd < a1 + a4) When rnd equals a boundary value exactly, neither branch matches and no state transition occurs. The redundant lower-bound check (a4 < rnd) is already implied by being in the else branch. Remove the unnecessary lower-bound comparisons so the ranges are contiguous and every random value produces a transition, matching the GI (General and Intuitive) loss model specification. This bug goes back to original implementation of this model. Fixes: 661b79725fea ("netem: revised correlated loss generator") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-2-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-22net/sched: sch_sfb: annotate data-races in sfb_dump_stats()Eric Dumazet
sfb_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_sfb_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260421141655.3953721-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-22net/sched: sch_red: annotate data-races in red_dump_stats()Eric Dumazet
red_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_red_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260421142309.3964322-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-22net/sched: sch_fq_codel: remove data-races from fq_codel_dump_stats()Eric Dumazet
fq_codel_dump_stats() acquires the qdisc spinlock a bit too late. Move this acquisition before we fill st.qdisc_stats with live data. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260421142509.3967231-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-22net/sched: sch_pie: annotate data-races in pie_dump_stats()Eric Dumazet
pie_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_pie_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260421142944.4009941-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-22net_sched: sch_hhf: annotate data-races in hhf_dump_stats()Eric Dumazet
hhf_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260421143349.4052215-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-21net/sched: sch_dualpi2: drain both C-queue and L-queue in dualpi2_change()Chia-Yu Chang
Fix dualpi2_change() to correctly enforce updated limit and memlimit values after a configuration change of the dualpi2 qdisc. Before this patch, dualpi2_change() always attempted to dequeue packets via the root qdisc (C-queue) when reducing backlog or memory usage, and unconditionally assumed that a valid skb will be returned. When traffic classification results in packets being queued in the L-queue while the C-queue is empty, this leads to a NULL skb dereference during limit or memlimit enforcement. This is fixed by first dequeuing from the C-queue path if it is non-empty. Once the C-queue is empty, packets are dequeued directly from the L-queue. Return values from qdisc_dequeue_internal() are checked for both queues. When dequeuing from the L-queue, the parent qdisc qlen and backlog counters are updated explicitly to keep overall qdisc statistics consistent. Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc") Reported-by: "Kito Xu (veritas501)" <hxzene@gmail.com> Closes: https://lore.kernel.org/netdev/20260413075740.2234828-1-hxzene@gmail.com/ Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Link: https://patch.msgid.link/20260417152551.71648-1-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-16net/sched: taprio: fix use-after-free in advance_sched() on schedule switchVinicius Costa Gomes
In advance_sched(), when should_change_schedules() returns true, switch_schedules() is called to promote the admin schedule to oper. switch_schedules() queues the old oper schedule for RCU freeing via call_rcu(), but 'next' still points into an entry of the old oper schedule. The subsequent 'next->end_time = end_time' and rcu_assign_pointer(q->current_entry, next) are use-after-free. Fix this by selecting 'next' from the new oper schedule immediately after switch_schedules(), and using its pre-calculated end_time. setup_first_end_time() sets the first entry's end_time to base_time + interval when the schedule is installed, so the value is already correct. The deleted 'end_time = sched_base_time(admin)' assignment was also harmful independently: it would overwrite the new first entry's pre-calculated end_time with just base_time. Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule") Reported-by: Junxi Qian <qjx1298677004@gmail.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-16net/sched: sch_cake: fix NAT destination port not being updated in ↵Dudu Lu
cake_update_flowkeys cake_update_flowkeys() is supposed to update the flow dissector keys with the NAT-translated addresses and ports from conntrack, so that CAKE's per-flow fairness correctly identifies post-NAT flows as belonging to the same connection. For the source port, this works correctly: keys->ports.src = port; But for the destination port, the assignment is reversed: port = keys->ports.dst; This means the NAT destination port is never updated in the flow keys. As a result, when multiple connections are NATed to the same destination, CAKE treats them as separate flows because the original (pre-NAT) destination ports differ. This breaks CAKE's NAT-aware flow isolation when using the "nat" mode. The bug was introduced in commit b0c19ed6088a ("sch_cake: Take advantage of skb->hash where appropriate") which refactored the original direct assignment into a compare-and-conditionally-update pattern, but wrote the destination port update backwards. Fix by reversing the assignment direction to match the source port pattern. Fixes: b0c19ed6088a ("sch_cake: Take advantage of skb->hash where appropriate") Signed-off-by: Dudu Lu <phx0fer@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260413110041.44704-1-phx0fer@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-16net/sched: act_mirred: fix wrong device for mac_header_xmit check in ↵Dudu Lu
tcf_blockcast_redir In tcf_blockcast_redir(), when iterating block ports to redirect packets to multiple devices, the mac_header_xmit flag is queried from the wrong device. The loop sends to dev_prev but queries dev_is_mac_header_xmit(dev) — which is the NEXT device in the iteration, not the one being sent to. This causes tcf_mirred_to_dev() to make incorrect decisions about whether to push or pull the MAC header. When the block contains mixed device types (e.g., an ethernet veth and a tunnel device), intermediate devices get the wrong mac_header_xmit flag, leading to skb header corruption. In the worst case, skb_push_rcsum with an incorrect mac_len can exhaust headroom and panic. The last device in the loop is handled correctly (line 365-366 uses dev_is_mac_header_xmit(dev_prev)), confirming this is a copy-paste oversight for the intermediate devices. Fix by using dev_prev instead of dev for the mac_header_xmit query, consistent with the device actually being sent to. Fixes: 42f39036cda8 ("net/sched: act_mirred: Allow mirred to block") Signed-off-by: Dudu Lu <phx0fer@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260413084927.71353-1-phx0fer@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in late fixes in preparation for the net-next PR. Conflicts: include/net/sch_generic.h a6bd339dbb351 ("net_sched: fix skb memory leak in deferred qdisc drops") ff2998f29f390 ("net: sched: introduce qdisc-specific drop reason tracing") https://lore.kernel.org/adz0iX85FHMz0HdO@sirena.org.uk drivers/net/ethernet/airoha/airoha_eth.c 1acdfbdb516b ("net: airoha: Fix VIP configuration for AN7583 SoC") bf3471e6e6c0 ("net: airoha: Make flow control source port mapping dependent on nbq parameter") Adjacent changes: drivers/net/ethernet/airoha/airoha_ppe.c f44218cd5e6a ("net: airoha: Reset PPE cpu port configuration in airoha_ppe_hw_init()") 7da62262ec96 ("inet: add ip_local_port_step_width sysctl to improve port usage distribution") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12net/sched: act_ct: Only release RCU read lock after ct_ftJamal Hadi Salim
When looking up a flow table in act_ct in tcf_ct_flow_table_get(), rhashtable_lookup_fast() internally opens and closes an RCU read critical section before returning ct_ft. The tcf_ct_flow_table_cleanup_work() can complete before refcount_inc_not_zero() is invoked on the returned ct_ft resulting in a UAF on the already freed ct_ft object. This vulnerability can lead to privilege escalation. Analysis from zdi-disclosures@trendmicro.com: When initializing act_ct, tcf_ct_init() is called, which internally triggers tcf_ct_flow_table_get(). static int tcf_ct_flow_table_get(struct net *net, struct tcf_ct_params *params) { struct zones_ht_key key = { .net = net, .zone = params->zone }; struct tcf_ct_flow_table *ct_ft; int err = -ENOMEM; mutex_lock(&zones_mutex); ct_ft = rhashtable_lookup_fast(&zones_ht, &key, zones_params); // [1] if (ct_ft && refcount_inc_not_zero(&ct_ft->ref)) // [2] goto out_unlock; ... } static __always_inline void *rhashtable_lookup_fast( struct rhashtable *ht, const void *key, const struct rhashtable_params params) { void *obj; rcu_read_lock(); obj = rhashtable_lookup(ht, key, params); rcu_read_unlock(); return obj; } At [1], rhashtable_lookup_fast() looks up and returns the corresponding ct_ft from zones_ht . The lookup is performed within an RCU read critical section through rcu_read_lock() / rcu_read_unlock(), which prevents the object from being freed. However, at the point of function return, rcu_read_unlock() has already been called, and there is nothing preventing ct_ft from being freed before reaching refcount_inc_not_zero(&ct_ft->ref) at [2]. This interval becomes the race window, during which ct_ft can be freed. Free Process: tcf_ct_flow_table_put() is executed through the path tcf_ct_cleanup() call_rcu() tcf_ct_params_free_rcu() tcf_ct_params_free() tcf_ct_flow_table_put(). static void tcf_ct_flow_table_put(struct tcf_ct_flow_table *ct_ft) { if (refcount_dec_and_test(&ct_ft->ref)) { rhashtable_remove_fast(&zones_ht, &ct_ft->node, zones_params); INIT_RCU_WORK(&ct_ft->rwork, tcf_ct_flow_table_cleanup_work); // [3] queue_rcu_work(act_ct_wq, &ct_ft->rwork); } } At [3], tcf_ct_flow_table_cleanup_work() is scheduled as RCU work static void tcf_ct_flow_table_cleanup_work(struct work_struct *work) { struct tcf_ct_flow_table *ct_ft; struct flow_block *block; ct_ft = container_of(to_rcu_work(work), struct tcf_ct_flow_table, rwork); nf_flow_table_free(&ct_ft->nf_ft); block = &ct_ft->nf_ft.flow_block; down_write(&ct_ft->nf_ft.flow_block_lock); WARN_ON(!list_empty(&block->cb_list)); up_write(&ct_ft->nf_ft.flow_block_lock); kfree(ct_ft); // [4] module_put(THIS_MODULE); } tcf_ct_flow_table_cleanup_work() frees ct_ft at [4]. When this function executes between [1] and [2], UAF occurs. This race condition has a very short race window, making it generally difficult to trigger. Therefore, to trigger the vulnerability an msleep(100) was inserted after[1] Fixes: 138470a9b2cc2 ("net/sched: act_ct: fix lockdep splat in tcf_ct_flow_table_get") Reported-by: zdi-disclosures@trendmicro.com Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260410111627.46611-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12net/sched: cls_fw: fix NULL dereference of "old" filters before change()Davide Caratti
Like pointed out by Sashiko [1], since commit ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") TC filters are added to a shared block and published to datapath before their ->change() function is called. This is a problem for cls_fw: an invalid filter created with the "old" method can still classify some packets before it is destroyed by the validation logic added by Xiang. Therefore, insisting with repeated runs of the following script: # ip link add dev crash0 type dummy # ip link set dev crash0 up # mausezahn crash0 -c 100000 -P 10 \ > -A 4.3.2.1 -B 1.2.3.4 -t udp "dp=1234" -q & # sleep 1 # tc qdisc add dev crash0 egress_block 1 clsact # tc filter add block 1 protocol ip prio 1 matchall \ > action skbedit mark 65536 continue # tc filter add block 1 protocol ip prio 2 fw # ip link del dev crash0 can still make fw_classify() hit the WARN_ON() in [2]: WARNING: ./include/net/pkt_cls.h:88 at fw_classify+0x244/0x250 [cls_fw], CPU#18: mausezahn/1399 Modules linked in: cls_fw(E) act_skbedit(E) CPU: 18 UID: 0 PID: 1399 Comm: mausezahn Tainted: G E 7.0.0-rc6-virtme #17 PREEMPT(full) Tainted: [E]=UNSIGNED_MODULE Hardware name: Red Hat KVM, BIOS 1.16.3-2.el9 04/01/2014 RIP: 0010:fw_classify+0x244/0x250 [cls_fw] Code: 5c 49 c7 45 00 00 00 00 00 41 5d 41 5e 41 5f 5d c3 cc cc cc cc 5b b8 ff ff ff ff 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc 90 <0f> 0b 90 eb a0 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffd1b7026bf8a8 EFLAGS: 00010202 RAX: ffff8c5ac9c60800 RBX: ffff8c5ac99322c0 RCX: 0000000000000004 RDX: 0000000000000001 RSI: ffff8c5b74d7a000 RDI: ffff8c5ac8284f40 RBP: ffffd1b7026bf8d0 R08: 0000000000000000 R09: ffffd1b7026bf9b0 R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000010000 R13: ffffd1b7026bf930 R14: ffff8c5ac8284f40 R15: 0000000000000000 FS: 00007fca40c37740(0000) GS:ffff8c5b74d7a000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fca40e822a0 CR3: 0000000005ca0001 CR4: 0000000000172ef0 Call Trace: <TASK> tcf_classify+0x17d/0x5c0 tc_run+0x9d/0x150 __dev_queue_xmit+0x2ab/0x14d0 ip_finish_output2+0x340/0x8f0 ip_output+0xa4/0x250 raw_sendmsg+0x147d/0x14b0 __sys_sendto+0x1cc/0x1f0 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x126/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fca40e822ba Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89 RSP: 002b:00007ffc248a42c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000055ef233289d0 RCX: 00007fca40e822ba RDX: 000000000000001e RSI: 000055ef23328c30 RDI: 0000000000000003 RBP: 000055ef233289d0 R08: 00007ffc248a42d0 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000001e R13: 00000000000186a0 R14: 0000000000000000 R15: 00007fca41043000 </TASK> irq event stamp: 1045778 hardirqs last enabled at (1045784): [<ffffffff864ec042>] __up_console_sem+0x52/0x60 hardirqs last disabled at (1045789): [<ffffffff864ec027>] __up_console_sem+0x37/0x60 softirqs last enabled at (1045426): [<ffffffff874d48c7>] __alloc_skb+0x207/0x260 softirqs last disabled at (1045434): [<ffffffff874fe8f8>] __dev_queue_xmit+0x78/0x14d0 Then, because of the value in the packet's mark, dereference on 'q->handle' with NULL 'q' occurs: BUG: kernel NULL pointer dereference, address: 0000000000000038 [...] RIP: 0010:fw_classify+0x1fe/0x250 [cls_fw] [...] Skip "old-style" classification on shared blocks, so that the NULL dereference is fixed and WARN_ON() is not hit anymore in the short lifetime of invalid cls_fw "old-style" filters. [1] https://sashiko.dev/#/patchset/20260331050217.504278-1-xmei5%40asu.edu [2] https://elixir.bootlin.com/linux/v7.0-rc6/source/include/net/pkt_cls.h#L86 Fixes: faeea8bbf6e9 ("net/sched: cls_fw: fix NULL pointer dereference on shared blocks") Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/e39cbd3103a337f1e515d186fe697b4459d24757.1775661704.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-7.0-rc8). Conflicts: net/ipv6/seg6_iptunnel.c c3812651b522f ("seg6: separate dst_cache for input and output paths in seg6 lwtunnel") 78723a62b969a ("seg6: add per-route tunnel source address") https://lore.kernel.org/adZhwtOYfo-0ImSa@sirena.org.uk net/ipv4/icmp.c fde29fd934932 ("ipv4: icmp: fix null-ptr-deref in icmp_build_probe()") d98adfbdd5c01 ("ipv4: drop ipv6_stub usage and use direct function calls") https://lore.kernel.org/adO3dccqnr6j-BL9@sirena.org.uk Adjacent changes: drivers/net/ethernet/stmicro/stmmac/chain_mode.c 51f4e090b9f8 ("net: stmmac: fix integer underflow in chain mode") 6b4286e05508 ("net: stmmac: rename STMMAC_GET_ENTRY() -> STMMAC_NEXT_ENTRY()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-08codel: annotate data-races in codel_dump_stats()Eric Dumazet
codel_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_codel_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. No change in kernel size: $ scripts/bloat-o-meter -t vmlinux.0 vmlinux add/remove: 0/0 grow/shrink: 1/1 up/down: 3/-1 (2) Function old new delta codel_qdisc_dequeue 2462 2465 +3 codel_dump_stats 250 249 -1 Total: Before=29739919, After=29739921, chg +0.00% Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260407143053.1570620-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-03net: sched: act_csum: validate nested VLAN headersRuide Cao
tcf_csum_act() walks nested VLAN headers directly from skb->data when an skb still carries in-payload VLAN tags. The current code reads vlan->h_vlan_encapsulated_proto and then pulls VLAN_HLEN bytes without first ensuring that the full VLAN header is present in the linear area. If only part of an inner VLAN header is linearized, accessing h_vlan_encapsulated_proto reads past the linear area, and the following skb_pull(VLAN_HLEN) may violate skb invariants. Fix this by requiring pskb_may_pull(skb, VLAN_HLEN) before accessing and pulling each nested VLAN header. If the header still is not fully available, drop the packet through the existing error path. Fixes: 2ecba2d1e45b ("net: sched: act_csum: Fix csum calc for tagged packets") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Ren Wei <enjou1224z@gmail.com> Signed-off-by: Ruide Cao <caoruide123@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/22df2fcb49f410203eafa5d97963dd36089f4ecf.1774892775.git.caoruide123@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-7.0-rc7). Conflicts: net/vmw_vsock/af_vsock.c b18c83388874 ("vsock: initialize child_ns_mode_locked in vsock_net_init()") 0de607dc4fd8 ("vsock: add G2H fallback for CIDs not owned by H2G transport") Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c ceee35e5674a ("bnxt_en: Refactor some basic ring setup and adjustment logic") 57cdfe0dc70b ("bnxt_en: Resize RSS contexts on channel count change") drivers/net/wireless/intel/iwlwifi/mld/mac80211.c 4d56037a02bd ("wifi: iwlwifi: mld: block EMLSR during TDLS connections") 687a95d204e7 ("wifi: iwlwifi: mld: correctly set wifi generation data") drivers/net/wireless/intel/iwlwifi/mld/scan.h b6045c899e37 ("wifi: iwlwifi: mld: Refactor scan command handling") ec66ec6a5a8f ("wifi: iwlwifi: mld: Fix MLO scan timing") drivers/net/wireless/intel/iwlwifi/mvm/fw.c 078df640ef05 ("wifi: iwlwifi: mld: add support for iwl_mcc_allowed_ap_type_cmd v 2") 323156c3541e ("wifi: iwlwifi: mvm: don't send a 6E related command when not supported") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-02net/sched: cls_flow: fix NULL pointer dereference on shared blocksXiang Mei
flow_change() calls tcf_block_q() and dereferences q->handle to derive a default baseclass. Shared blocks leave block->q NULL, causing a NULL deref when a flow filter without a fully qualified baseclass is created on a shared block. Check tcf_block_shared() before accessing block->q and return -EINVAL for shared blocks. This avoids the null-deref shown below: ======================================================================= KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f] RIP: 0010:flow_change (net/sched/cls_flow.c:508) Call Trace: tc_new_tfilter (net/sched/cls_api.c:2432) rtnetlink_rcv_msg (net/core/rtnetlink.c:6980) [...] ======================================================================= Fixes: 1abf272022cf ("net: sched: tcindex, fw, flow: use tcf_block_q helper to get struct Qdisc") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260331050217.504278-2-xmei5@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-02net/sched: cls_fw: fix NULL pointer dereference on shared blocksXiang Mei
The old-method path in fw_classify() calls tcf_block_q() and dereferences q->handle. Shared blocks leave block->q NULL, causing a NULL deref when an empty cls_fw filter is attached to a shared block and a packet with a nonzero major skb mark is classified. Reject the configuration in fw_change() when the old method (no TCA_OPTIONS) is used on a shared block, since fw_classify()'s old-method path needs block->q which is NULL for shared blocks. The fixed null-ptr-deref calling stack: KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f] RIP: 0010:fw_classify (net/sched/cls_fw.c:81) Call Trace: tcf_classify (./include/net/tc_wrapper.h:197 net/sched/cls_api.c:1764 net/sched/cls_api.c:1860) tc_run (net/core/dev.c:4401) __dev_queue_xmit (net/core/dev.c:4535 net/core/dev.c:4790) Fixes: 1abf272022cf ("net: sched: tcindex, fw, flow: use tcf_block_q helper to get struct Qdisc") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260331050217.504278-1-xmei5@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-01net/sched: sch_netem: fix out-of-bounds access in packet corruptionYucheng Lu
In netem_enqueue(), the packet corruption logic uses get_random_u32_below(skb_headlen(skb)) to select an index for modifying skb->data. When an AF_PACKET TX_RING sends fully non-linear packets over an IPIP tunnel, skb_headlen(skb) evaluates to 0. Passing 0 to get_random_u32_below() takes the variable-ceil slow path which returns an unconstrained 32-bit random integer. Using this unconstrained value as an offset into skb->data results in an out-of-bounds memory access. Fix this by verifying skb_headlen(skb) is non-zero before attempting to corrupt the linear data area. Fully non-linear packets will silently bypass the corruption logic. Fixes: c865e5d99e25 ("[PKT_SCHED] netem: packet corruption option") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Signed-off-by: Yuan Tan <tanyuan98@outlook.com> Signed-off-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Yuhang Zheng <z1652074432@gmail.com> Signed-off-by: Yucheng Lu <kanolyc@gmail.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://patch.msgid.link/45435c0935df877853a81e6d06205ac738ec65fa.1774941614.git.kanolyc@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-30net: sched: cls_api: fix tc_chain_fill_node to initialize tcm_info to zero ↵Yochai Eisenrich
to prevent an info-leak When building netlink messages, tc_chain_fill_node() never initializes the tcm_info field of struct tcmsg. Since the allocation is not zeroed, kernel heap memory is leaked to userspace through this 4-byte field. The fix simply zeroes tcm_info alongside the other fields that are already initialized. Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi") Signed-off-by: Yochai Eisenrich <echelonh@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260328211436.1010152-1-echelonh@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-29net: convert remaining ipv6_stub users to direct function callsFernando Fernandez Mancera
As IPv6 is built-in only, the ipv6_stub infrastructure is no longer necessary. Convert remaining ipv6_stub users to make direct function calls. The fallback functions introduced previously will prevent linkage errors when CONFIG_IPV6 is disabled. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Tested-by: Ricardo B. Marlière <rbm@suse.com> Link: https://patch.msgid.link/20260325120928.15848-9-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-27net/sched: sch_hfsc: fix divide-by-zero in rtsc_min()Xiang Mei
m2sm() converts a u32 slope to a u64 scaled value. For large inputs (e.g. m1=4000000000), the result can reach 2^32. rtsc_min() stores the difference of two such u64 values in a u32 variable `dsm` and uses it as a divisor. When the difference is exactly 2^32 the truncation yields zero, causing a divide-by-zero oops in the concave-curve intersection path: Oops: divide error: 0000 RIP: 0010:rtsc_min (net/sched/sch_hfsc.c:601) Call Trace: init_ed (net/sched/sch_hfsc.c:629) hfsc_enqueue (net/sched/sch_hfsc.c:1569) [...] Widen `dsm` to u64 and replace do_div() with div64_u64() so the full difference is preserved. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260326204310.1549327-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-7.0-rc5). net/netfilter/nft_set_rbtree.c 598adea720b97 ("netfilter: revert nft_set_rbtree: validate open interval overlap") 3aea466a43998 ("netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock") https://lore.kernel.org/abgaQBpeGstdN4oq@sirena.org.uk No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-17clsact: Fix use-after-free in init/destroy rollback asymmetryDaniel Borkmann
Fix a use-after-free in the clsact qdisc upon init/destroy rollback asymmetry. The latter is achieved by first fully initializing a clsact instance, and then in a second step having a replacement failure for the new clsact qdisc instance. clsact_init() initializes ingress first and then takes care of the egress part. This can fail midway, for example, via tcf_block_get_ext(). Upon failure, the kernel will trigger the clsact_destroy() callback. Commit 1cb6f0bae504 ("bpf: Fix too early release of tcx_entry") details the way how the transition is happening. If tcf_block_get_ext on the q->ingress_block ends up failing, we took the tcx_miniq_inc reference count on the ingress side, but not yet on the egress side. clsact_destroy() tests whether the {ingress,egress}_entry was non-NULL. However, even in midway failure on the replacement, both are in fact non-NULL with a valid egress_entry from the previous clsact instance. What we really need to test for is whether the qdisc instance-specific ingress or egress side previously got initialized. This adds a small helper for checking the miniq initialization called mini_qdisc_pair_inited, and utilizes that upon clsact_destroy() in order to fix the use-after-free scenario. Convert the ingress_destroy() side as well so both are consistent to each other. Fixes: 1cb6f0bae504 ("bpf: Fix too early release of tcx_entry") Reported-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260313065531.98639-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-17net/sched: cls_flower: remove unions from fl_flow_keyCedric Jehasse
When creating a flower classifier with an ipv4 address the flow_dissector has both FLOW_DISSECTOR_KEY_IPV4_ADDRS and FLOW_DISSECTOR_KEY_IPV6_ADDRS bits set in used_keys. This happens because ipv4/ipv6 fields are a union and FL_KEY_SET_IF_MASKED() will interpret either being set as both. Removing the unions fixes this behavior without needing special handling for union fields. Example of a command that caused FLOW_DISSECTOR_KEY_IPV4_ADDRS and FLOW_DISSECTOR_KEY_IPV6_ADDRS to be set: tc filter add dev p1 ingress protocol ip flower skip_sw \ dst_ip 224.0.1.129 action trap Signed-off-by: Cedric Jehasse <cedric.jehasse@luminex.be> Link: https://patch.msgid.link/20260311-net-next-mv88e6xxx-tcam-v8-1-32dd5ba30002@luminex.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-16net/sched: teql: Fix double-free in teql_master_xmitJamal Hadi Salim
Whenever a TEQL devices has a lockless Qdisc as root, qdisc_reset should be called using the seq_lock to avoid racing with the datapath. Failure to do so may cause crashes like the following: [ 238.028993][ T318] BUG: KASAN: double-free in skb_release_data (net/core/skbuff.c:1139) [ 238.029328][ T318] Free of addr ffff88810c67ec00 by task poc_teql_uaf_ke/318 [ 238.029749][ T318] [ 238.029900][ T318] CPU: 3 UID: 0 PID: 318 Comm: poc_teql_ke Not tainted 7.0.0-rc3-00149-ge5b31d988a41 #704 PREEMPT(full) [ 238.029906][ T318] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 238.029910][ T318] Call Trace: [ 238.029913][ T318] <TASK> [ 238.029916][ T318] dump_stack_lvl (lib/dump_stack.c:122) [ 238.029928][ T318] print_report (mm/kasan/report.c:379 mm/kasan/report.c:482) [ 238.029940][ T318] ? skb_release_data (net/core/skbuff.c:1139) [ 238.029944][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) ... [ 238.029957][ T318] ? skb_release_data (net/core/skbuff.c:1139) [ 238.029969][ T318] kasan_report_invalid_free (mm/kasan/report.c:221 mm/kasan/report.c:563) [ 238.029979][ T318] ? skb_release_data (net/core/skbuff.c:1139) [ 238.029989][ T318] check_slab_allocation (mm/kasan/common.c:231) [ 238.029995][ T318] kmem_cache_free (mm/slub.c:2637 (discriminator 1) mm/slub.c:6168 (discriminator 1) mm/slub.c:6298 (discriminator 1)) [ 238.030004][ T318] skb_release_data (net/core/skbuff.c:1139) ... [ 238.030025][ T318] sk_skb_reason_drop (net/core/skbuff.c:1256) [ 238.030032][ T318] pfifo_fast_reset (./include/linux/ptr_ring.h:171 ./include/linux/ptr_ring.h:309 ./include/linux/skb_array.h:98 net/sched/sch_generic.c:827) [ 238.030039][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) ... [ 238.030054][ T318] qdisc_reset (net/sched/sch_generic.c:1034) [ 238.030062][ T318] teql_destroy (./include/linux/spinlock.h:395 net/sched/sch_teql.c:157) [ 238.030071][ T318] __qdisc_destroy (./include/net/pkt_sched.h:328 net/sched/sch_generic.c:1077) [ 238.030077][ T318] qdisc_graft (net/sched/sch_api.c:1062 net/sched/sch_api.c:1053 net/sched/sch_api.c:1159) [ 238.030089][ T318] ? __pfx_qdisc_graft (net/sched/sch_api.c:1091) [ 238.030095][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 238.030102][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 238.030106][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 238.030114][ T318] tc_get_qdisc (net/sched/sch_api.c:1529 net/sched/sch_api.c:1556) ... [ 238.072958][ T318] Allocated by task 303 on cpu 5 at 238.026275s: [ 238.073392][ T318] kasan_save_stack (mm/kasan/common.c:58) [ 238.073884][ T318] kasan_save_track (mm/kasan/common.c:64 (discriminator 5) mm/kasan/common.c:79 (discriminator 5)) [ 238.074230][ T318] __kasan_slab_alloc (mm/kasan/common.c:369) [ 238.074578][ T318] kmem_cache_alloc_node_noprof (./include/linux/kasan.h:253 mm/slub.c:4542 mm/slub.c:4869 mm/slub.c:4921) [ 238.076091][ T318] kmalloc_reserve (net/core/skbuff.c:616 (discriminator 107)) [ 238.076450][ T318] __alloc_skb (net/core/skbuff.c:713) [ 238.076834][ T318] alloc_skb_with_frags (./include/linux/skbuff.h:1383 net/core/skbuff.c:6763) [ 238.077178][ T318] sock_alloc_send_pskb (net/core/sock.c:2997) [ 238.077520][ T318] packet_sendmsg (net/packet/af_packet.c:2926 net/packet/af_packet.c:3019 net/packet/af_packet.c:3108) [ 238.081469][ T318] [ 238.081870][ T318] Freed by task 299 on cpu 1 at 238.028496s: [ 238.082761][ T318] kasan_save_stack (mm/kasan/common.c:58) [ 238.083481][ T318] kasan_save_track (mm/kasan/common.c:64 (discriminator 5) mm/kasan/common.c:79 (discriminator 5)) [ 238.085348][ T318] kasan_save_free_info (mm/kasan/generic.c:587 (discriminator 1)) [ 238.085900][ T318] __kasan_slab_free (mm/kasan/common.c:287) [ 238.086439][ T318] kmem_cache_free (mm/slub.c:6168 (discriminator 3) mm/slub.c:6298 (discriminator 3)) [ 238.087007][ T318] skb_release_data (net/core/skbuff.c:1139) [ 238.087491][ T318] consume_skb (net/core/skbuff.c:1451) [ 238.087757][ T318] teql_master_xmit (net/sched/sch_teql.c:358) [ 238.088116][ T318] dev_hard_start_xmit (./include/linux/netdevice.h:5324 ./include/linux/netdevice.h:5333 net/core/dev.c:3871 net/core/dev.c:3887) [ 238.088468][ T318] sch_direct_xmit (net/sched/sch_generic.c:347) [ 238.088820][ T318] __qdisc_run (net/sched/sch_generic.c:420 (discriminator 1)) [ 238.089166][ T318] __dev_queue_xmit (./include/net/sch_generic.h:229 ./include/net/pkt_sched.h:121 ./include/net/pkt_sched.h:117 net/core/dev.c:4196 net/core/dev.c:4802) Workflow to reproduce: 1. Initialize a TEQL topology (dummy0 and ifb0 as slaves, teql0 up). 2. Start multiple sender workers continuously transmitting packets through teql0 to drive teql_master_xmit(). 3. In parallel, repeatedly delete and re-add the root qdisc on dummy0 and ifb0 via RTNETLINK, forcing frequent teardown and reset activity (teql_destroy() / qdisc_reset()). 4. After running both workloads concurrently for several iterations, KASAN reports slab-use-after-free or double-free in the skb free path. Fix this by moving dev_reset_queue to sch_generic.h and calling it, instead of qdisc_reset, in teql_destroy since it handles both the lock and lockless cases correctly for root qdiscs. Fixes: 96009c7d500e ("sched: replace __QDISC_STATE_RUNNING bit with a spin lock") Reported-by: Xianrui Dong <keenanat2000@gmail.com> Tested-by: Xianrui Dong <keenanat2000@gmail.com> Co-developed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260315155422.147256-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-7.0-rc4). drivers/net/ethernet/mellanox/mlx5/core/en_rx.c db25c42c2e1f9 ("net/mlx5e: RX, Fix XDP multi-buf frag counting for striding RQ") dff1c3164a692 ("net/mlx5e: SHAMPO, Always calculate page size") https://lore.kernel.org/aa7ORohmf67EKihj@sirena.org.uk drivers/net/ethernet/ti/am65-cpsw-nuss.c 840c9d13cb1ca ("net: ethernet: ti: am65-cpsw-nuss: Fix rx_filter value for PTP support") a23c657e332f2 ("net: ethernet: ti: am65-cpsw: Use also port number to identify timestamps") https://lore.kernel.org/abK3EkIXuVgMyGI7@sirena.org.uk No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>