| Age | Commit message (Collapse) | Author |
|
While applying the blamed commit 4ca07b9239bd ("net: mctp i2c: check
length before marking flow active"), I unintentionally included
unrelated and unacceptable changes.
Revert them.
Fixes: 4ca07b9239bd ("net: mctp i2c: check length before marking flow active")
Reported-by: Jeremy Kerr <jk@codeconstruct.com.au>
Closes: https://lore.kernel.org/netdev/bd8704fe0bd53e278add5cde4873256656623e2e.camel@codeconstruct.com.au/
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/043026a53ff84da88b17648c4b0d17f0331749cb.1777447863.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When tls_set_device_offload_rx() fails at tls_dev_add(), the error path
calls tls_sw_free_resources_rx() to clean up the SW context that was
initialized by tls_set_sw_offload(). This function calls
tls_sw_release_resources_rx() (which stops the strparser via
tls_strp_stop()) and tls_sw_free_ctx_rx() (which kfrees the context),
but never frees the anchor skb that was allocated by alloc_skb(0) in
tls_strp_init().
Note that tls_sw_free_resources_rx() is exclusively used for this
"failed to start offload" code path, there's no other caller.
The leak did not exist before commit 84c61fe1a75b ("tls: rx: do not use
the standard strparser"), because the standard strparser doesn't try
to pre-allocate an skb.
The normal close path in tls_sk_proto_close() handles cleanup by calling
tls_sw_strparser_done() (which calls tls_strp_done()) after dropping
the socket lock, because tls_strp_done() does cancel_work_sync() and
the strparser work handler takes the socket lock.
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260428231559.1358502-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When page_pool_create_percpu() fails on page_pool_list(), it falls
through to its err_uninit: label, which calls page_pool_uninit().
At that point page_pool_init() has already taken two references
when the user requested PP_FLAG_ALLOW_UNREADABLE_NETMEM:
pool->mp_ops->init(pool)
static_branch_inc(&page_pool_mem_providers);
Neither is undone by page_pool_uninit(); both are only undone by
__page_pool_destroy() (success-side teardown). The error path
therefore leaks the per-provider reference taken by mp_ops->init
(io_zcrx_ifq->refs in the io_uring zcrx provider, the dmabuf
binding refcount in the devmem provider) plus one increment of
the page_pool_mem_providers static branch on every failure of
xa_alloc_cyclic() inside page_pool_list().
The leaked io_zcrx_ifq->refs in turn pins everything
io_zcrx_ifq_free() would release on cleanup: ifq->user (uid),
ifq->mm_account (mmdrop), ifq->dev (device refcount),
ifq->netdev_tracker (netdev refcount), and the rbuf region.
The leaked static branch increment forces all subsequent
page_pool_alloc_netmems() and page_pool_return_page() callers to
take the slow mp_ops branch for the lifetime of the kernel.
Reachable via the io_uring zcrx path:
io_uring_register(IORING_REGISTER_ZCRX_IFQ) /* CAP_NET_ADMIN */
-> __io_uring_register
-> io_register_zcrx
-> zcrx_register_netdev
-> netif_mp_open_rxq
-> driver ndo_queue_mem_alloc
-> page_pool_create_percpu
-> page_pool_init succeeds (mp_ops->init runs, branch++)
-> page_pool_list fails (xa_alloc_cyclic -ENOMEM)
-> goto err_uninit <-- leak
The same shape applies to the devmem dmabuf provider via
mp_dmabuf_devmem_init()/mp_dmabuf_devmem_destroy().
Restore the cleanup symmetry by moving the mp_ops->destroy() and
static_branch_dec() calls out of __page_pool_destroy() and into
page_pool_uninit(), so page_pool_uninit() is again the strict
inverse of page_pool_init(). page_pool_uninit() has only two
callers (the err_uninit: path and __page_pool_destroy()), so this
preserves the single-call invariant on the success path while
fixing the err path. The error path of page_pool_init() itself
still skips the mp_ops cleanup correctly: mp_ops->init is the
last action that takes a reference before page_pool_init() returns
0, so when it returns an error neither the refcount nor the static
branch has been touched.
Triggering the bug requires xa_alloc_cyclic() to fail with -ENOMEM,
which under normal GFP_KERNEL retry behaviour is rare. It is
deterministic under CONFIG_FAULT_INJECTION with fail_page_alloc /
xa fault injection, or under sustained memory pressure. The leak
is silent: there is no warning, and the released kernel build
continues running with a permanently-incremented static branch.
Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
Signed-off-by: Hasan Basbunar <basbunarhasan@gmail.com>
Link: https://patch.msgid.link/20260428170739.34881-1-basbunarhasan@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
VMBUS ring buffers must be page aligned. Therefore, the current value of
24K presents a challenge on ARM64 kernels (with 64K pages). So, use
VMBUS_RING_SIZE() to ensure they are always aligned and large enough to
hold all of the relevant data.
Cc: stable@vger.kernel.org
Fixes: 77ffe33363c0 ("hv_sock: use HV_HYP_PAGE_SIZE for Hyper-V communication")
Tested-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260428125339.13963-1-hamzamahfooz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit db359fccf212 ("mm: introduce a new page type for page pool in
page type") added a page_type field to struct net_iov at the same
offset as struct page::page_type, so that page_pool_set_pp_info() can
call __SetPageNetpp() uniformly on both pages and net_iovs.
The page-type API requires the field to hold the UINT_MAX "no type"
sentinel before a type can be set; for real struct page that invariant
is established by the page allocator on free. struct net_iov is not
allocated through the page allocator, so the field is left as zero
(io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem,
which uses kvmalloc_objs() without zeroing). When the page pool then
calls page_pool_set_pp_info() on a freshly-bound niov,
__SetPageNetpp()'s VM_BUG_ON_PAGE(page->page_type != UINT_MAX) fires
and the kernel BUGs. Triggered in selftests by io_uring zcrx setup
through the fbnic queue restart path:
kernel BUG at ./include/linux/page-flags.h:1062!
RIP: 0010:page_pool_set_pp_info (./include/linux/page-flags.h:1062
net/core/page_pool.c:716)
Call Trace:
<TASK>
net_mp_niov_set_page_pool (net/core/page_pool.c:1360)
io_pp_zc_alloc_netmems (io_uring/zcrx.c:1089 io_uring/zcrx.c:1110)
fbnic_fill_bdq (./include/net/page_pool/helpers.h:160
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:906)
__fbnic_nv_restart (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2470
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2874)
fbnic_queue_start (drivers/net/ethernet/meta/fbnic/fbnic_txrx.c:2903)
netdev_rx_queue_reconfig (net/core/netdev_rx_queue.c:137)
__netif_mp_open_rxq (net/core/netdev_rx_queue.c:234)
io_register_zcrx (io_uring/zcrx.c:818 io_uring/zcrx.c:903)
__io_uring_register (io_uring/register.c:931)
__do_sys_io_uring_register (io_uring/register.c:1029)
do_syscall_64 (arch/x86/entry/syscall_64.c:63
arch/x86/entry/syscall_64.c:94)
</TASK>
The same path is reachable through devmem dmabuf binding via
netdev_nl_bind_rx_doit() -> net_devmem_bind_dmabuf_to_queue().
Add a net_iov_init() helper that stamps ->owner, ->type and the
->page_type sentinel, and use it from both the devmem and io_uring
zcrx niov init loops.
Fixes: db359fccf212 ("mm: introduce a new page type for page pool in page type")
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This variable counts how many MPTCP endpoints have a 'fullmesh' flag
set. After having flushed all MPTCP endpoints, it is then needed to
reset this counter.
Without this reset, this counter exposed to the userspace is wrong, but
also non-fullmesh endpoints added after the flush will not be taken into
account to create subflows in reaction to ADD_ADDRs.
Fixes: f88191c7f361 ("mptcp: pm: in-kernel: record fullmesh endp nb")
Cc: stable@vger.kernel.org
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260422-mptcp-inc-limits-v6-0-903181771530%40kernel.org?part=15
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-4-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The SO_LINGER socket option has been supported for a while with MPTCP
sockets [1], but it didn't cause the equivalent of a TCP reset as
expected when enabled and its time was set to 0. This was causing some
behavioural differences with TCP where some connections were not
promptly stopped as expected.
To fix that, an extra condition is checked at close() time before
sending an MP_FASTCLOSE, the MPTCP equivalent of a TCP reset.
Note that backporting up to [1] will be difficult as more changes are
needed to be able to send MP_FASTCLOSE. It seems better to stop at [2],
which was supposed to already imitate TCP.
Validated with MPTCP packetdrill tests [3].
Fixes: 268b12387460 ("mptcp: setsockopt: support SO_LINGER") [1]
Fixes: d21f83485518 ("mptcp: use fastclose on more edge scenarios") [2]
Cc: stable@vger.kernel.org
Reported-by: Lance Tuller <lance@lance0.com>
Closes: https://github.com/lance0/xfr/pull/67
Link: https://github.com/multipath-tcp/packetdrill/pull/196 [3]
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-3-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Using lock_sock_fast() (atomic context) around sock_set_timestamp()
and sock_set_timestamping() is unsafe, as both helpers can sleep.
Replace lock_sock_fast() with sleepable lock_sock()/release_sock()
to avoid scheduling while atomic panic.
Fixes: 9061f24bf82e ("mptcp: sockopt: propagate timestamp request to subflows")
Cc: stable@vger.kernel.org
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260420093343.16443-1-gang.yan@linux.dev
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-2-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Both mptcp_setsockopt_sol_socket_tstamp() and
mptcp_setsockopt_sol_socket_timestamping() iterate over subflows,
acquire the subflow socket lock, but then erroneously pass the MPTCP
msk socket to sock_set_timestamp() / sock_set_timestamping() instead
of the subflow ssk. As a result, the timestamp flags are set on the
wrong socket and have no effect on the actual subflows.
Pass ssk instead of sk to both helpers.
Fixes: 9061f24bf82e ("mptcp: sockopt: propagate timestamp request to subflows")
Cc: stable@vger.kernel.org
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-1-7432b7f279fa@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
cake_dump_stats() runs without qdisc spinlock being held.
In this final patch, I add READ_ONCE()/WRITE_ONCE() annotations
for cparams.target and cparams.interval.
Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
cake_dump_stats() runs without qdisc spinlock being held.
In this fourth patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- avg_peak_bandwidth
- buffer_limit
- buffer_max_used
- avg_netoff
- max_netlen
- max_adjlen
- min_netlen
- min_adjlen
- active_queues
- tin_rate_bps
- bytes
- tin_backlog
Other annotations are added in following patch, to ease code review.
Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
cake_dump_stats() runs without qdisc spinlock being held.
In this third patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- packets
- tin_dropped
- tin_ecn_mark
- ack_drops
- peak_delay
- avge_delay
- base_delay
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
cake_dump_stats() runs without qdisc spinlock being held.
In this second patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- bulk_flow_count
- unresponsive_flow_count
- max_skblen
- flow_quantum
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
cake_dump_stats() runs without qdisc spinlock being held.
In this first patch, I add READ_ONCE()/WRITE_ONCE() annotations
for the following fields:
- way_hits
- way_misses
- way_collisions
- sparse_flow_count
- decaying_flow_count
Other annotations are added in following patches, to ease code review.
Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://patch.msgid.link/20260427083606.459355-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After an association reaches ESTABLISHED, the peer’s init_tag is already
known from the handshake. Any subsequent INIT with the same init_tag is
not a valid restart, but a delayed or duplicate INIT.
Drop such INIT chunks in sctp_sf_do_unexpected_init() instead of
processing them as new association attempts.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/5788c76c1ee122a3ed00189e88dcf9df1fba226c.1777214801.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
An INIT whose init_tag matches the peer's vtag does not provide new state
information. It indicates either:
- a stale INIT (after INIT-ACK has already been seen on the same side), or
- a retransmitted INIT (after INIT has already been recorded on the same
side).
In both cases, the INIT must not update ct->proto.sctp.init[] state, since
it does not advance the handshake tracking and may otherwise corrupt
INIT/INIT-ACK validation logic.
Allow INIT processing only when the conntrack entry is newly created
(SCTP_CONNTRACK_NONE), or when the init_tag differs from the stored peer
vtag.
Note it skips the check for the ct with old_state SCTP_CONNTRACK_NONE in
nf_conntrack_sctp_packet(), as it is just created in sctp_new() where it
set ct->proto.sctp.vtag[IP_CT_DIR_REPLY] = ih->init_tag.
Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/ee56c3e416452b2a40589a2a85245ac2ad5e9f4b.1777214801.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The dev-set and key-rotate netlink operations modify shared device
state (PSP version configuration and cryptographic key material,
respectively) but do not require CAP_NET_ADMIN. The only access
control is psp_dev_check_access() which merely verifies netns
membership.
Fixes: 00c94ca2b99e ("psp: base PSP device support")
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260427195856.401223-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
psp_assoc_device_get_locked() obtains a psp_dev reference via
psp_dev_get_for_sock() (which uses psp_dev_tryget() under RCU);
it then acquires psd->lock and drops the reference. Before
the lock is taken, psp_dev_unregister() can run to completion:
take psd->lock, clear out state, unlock, drop the registration
reference.
The expectation is that the lock prevents device unregistration,
but much like with netdevs special care has to be taken when
"upgrading" a reference to a locked device. Add the missing
check if device is still alive. psp_dev_is_registered() exists
already but had no callers, which makes me wonder if I either
forgot to add this or lost the check during refactoring...
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Fixes: 6b46ca260e22 ("net: psp: add socket security association code")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260427190606.366101-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) IEEE1394 ARP payload contains no target hardware address in the
ARP packet. Apparently, arp_tables was never updated to deal with
IEEE1394 ARP properly. To deal with this, return no match in case
the target hardware address selector is used, either for inverse or
normal match. Moreover, arpt_mangle disallows mangling of the target
hardware and IP address because, it is not worth to adjust the
offset calculation to fix this, we suspect no users of arp_tables
for this family.
2) Use list_del_rcu() to delete device hooks in nf_tables, this hook
list is RCU protected, concurrent netlink dump readers can be
walking on this list, fix it by adding a helper function and use it
for consistency. From Florian Westphal.
3) Add list_splice_rcu(), this is useful for joining the local list of
new device hooks to the RCU protected hook list in chain and
flowtable. Reviewed by Paul E. McKenney.
4) Use list_splice_rcu() to publish the new device hooks in chain and
flowtable to fix concurrent netlink dump traversal.
5) Add a new hook transaction object to track device hook deletions.
The current approach moves device hooks to be deleted around during
the preparation phase, this breaks concurrent RCU reader via netlink
dump. This new hook transaction is combined with NFT_HOOK_REMOVE
flag to annotate hooks for removal in the preparation phase.
6) xt_policy inbound policy check in strict mode can lead to
out-of-bound access of the secpath array due to incorrect.
The iteration over the secpath needs to be reversed in the inbound
to check for the human readable policy, expecting inner in first
position and outer in second position, the secpath from inbound
actually stores outer in first position then in second position.
From Jiexun Wang.
7) Fix possible zero shift in nft_bitwise triggering UBSAN splat,
reject zero shift from control plane, from Kai Ma.
8) Replace simple_strtoul() in the conntrack SIP helper since it relies
on nul-terminated strings. From Florian Westphal.
* tag 'nf-26-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack_sip: don't use simple_strtoul
netfilter: reject zero shift in nft_bitwise
netfilter: xt_policy: fix strict mode inbound policy matching
netfilter: nf_tables: add hook transactions for device deletions
netfilter: nf_tables: join hook list via splice_list_rcu() in commit phase
rculist: add list_splice_rcu() for private lists
netfilter: nf_tables: use list_del_rcu for netlink hooks
netfilter: arp_tables: fix IEEE1394 ARP payload parsing
====================
Link: https://patch.msgid.link/20260428095840.51961-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, mctp_i2c_get_tx_flow_state() is called before the packet length
sanity check. This function marks a new flow as active in the MCTP core.
If the sanity check fails, mctp_i2c_xmit() returns early without calling
mctp_i2c_lock_nest(). This results in a mismatched locking state: the
flow is active, but the I2C bus lock was never acquired for it.
When the flow is later released, mctp_i2c_release_flow() will see the
active state and queue an unlock marker. The TX thread will then
decrement midev->i2c_lock_count from 0, causing it to underflow to -1.
This underflow permanently breaks the driver's locking logic, allowing
future transmissions to occur without holding the I2C bus lock, leading
to bus collisions and potential hardware hangs.
Move the mctp_i2c_get_tx_flow_state() call to after the length sanity
check to ensure we only transition the flow state if we are actually
going to proceed with the transmission and locking.
Fixes: f5b8abf9fc3d ("mctp i2c: MCTP I2C binding driver")
Signed-off-by: William A. Kennington III <william@wkennington.com>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260423074741.201460-1-william@wkennington.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
seg6_input_core() and rpl_input() call ip6_route_input() which sets a
NOREF dst on the skb, then pass it to dst_cache_set_ip6() invoking
dst_hold() unconditionally.
On PREEMPT_RT, ksoftirqd is preemptible and a higher-priority task can
release the underlying pcpu_rt between the lookup and the caching
through a concurrent FIB lookup on a shared nexthop.
Simplified race sequence:
ksoftirqd/X higher-prio task (same CPU X)
----------- --------------------------------
seg6_input_core(,skb)/rpl_input(skb)
dst_cache_get()
-> miss
ip6_route_input(skb)
-> ip6_pol_route(,skb,flags)
[RT6_LOOKUP_F_DST_NOREF in flags]
-> FIB lookup resolves fib6_nh
[nhid=N route]
-> rt6_make_pcpu_route()
[creates pcpu_rt, refcount=1]
pcpu_rt->sernum = fib6_sernum
[fib6_sernum=W]
-> cmpxchg(fib6_nh.rt6i_pcpu,
NULL, pcpu_rt)
[slot was empty, store succeeds]
-> skb_dst_set_noref(skb, dst)
[dst is pcpu_rt, refcount still 1]
rt_genid_bump_ipv6()
-> bumps fib6_sernum
[fib6_sernum from W to Z]
ip6_route_output()
-> ip6_pol_route()
-> FIB lookup resolves fib6_nh
[nhid=N]
-> rt6_get_pcpu_route()
pcpu_rt->sernum != fib6_sernum
[W <> Z, stale]
-> prev = xchg(rt6i_pcpu, NULL)
-> dst_release(prev)
[prev is pcpu_rt,
refcount 1->0, dead]
dst = skb_dst(skb)
[dst is the dead pcpu_rt]
dst_cache_set_ip6(dst)
-> dst_hold() on dead dst
-> WARN / use-after-free
For the race to occur, ksoftirqd must be preemptible (PREEMPT_RT without
PREEMPT_RT_NEEDS_BH_LOCK) and a concurrent task must be able to release
the pcpu_rt. Shared nexthop objects provide such a path, as two routes
pointing to the same nhid share the same fib6_nh and its rt6i_pcpu
entry.
Fix seg6_input_core() and rpl_input() by calling skb_dst_force() after
ip6_route_input() to force the NOREF dst into a refcounted one before
caching.
The output path is not affected as ip6_route_output() already returns a
refcounted dst.
Fixes: af4a2209b134 ("ipv6: sr: use dst_cache in seg6_input")
Fixes: a7a29f9c361f ("net: ipv6: add rpl sr tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260421094735.20997-1-andrea.mayer@uniroma2.it
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
netpoll_setup() decides whether to auto-populate the local source
address by testing np->local_ip.ip, which only inspects the first 4
bytes of the union inet_addr storage.
For an IPv6 netpoll whose caller-supplied local address has a zero
high-32 bits (::1, ::<suffix>, IPv4-mapped ::ffff:a.b.c.d, etc.), this
misdetects the address as unset (which they are not, but the first
4 bytes are empty), calls netpoll_take_ipv6() and overwrites it with
whatever matching link-local/global address the device happens to expose
first.
Introduce a helper netpoll_local_ip_unset() that picks the correct
family-aware test (ipv6_addr_any() for IPv6, !.ip for IPv4) and use it
from netpoll_setup().
Reproducer is something like:
echo "::2" > local_ip
echo 1 > enabled
cat local_ip
# before this fix: 2001:db8::1 (caller-supplied ::2 was clobbered)
# after this fix: ::2
Fixes: b7394d2429c1 ("netpoll: prepare for ipv6")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260424-netpoll_fix-v1-1-3a55348c625f@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
using subtraction with an unsigned lvalue. If elapsed probing time
exceeds the configured TCP_USER_TIMEOUT, the underflow yields a large
value.
This ends up re-arming the probe timer for a full backoff interval
instead of expiring immediately, delaying connection teardown beyond
the configured timeout.
Fix this by preventing underflow so user-set timeout expiration is
handled correctly without extending the probe timer.
Fixes: 344db93ae3ee ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes")
Link: https://lore.kernel.org/r/20260414013634.43997-1-ahacigu.linux@gmail.com
Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260424014639.54110-1-ahacigu.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
neigh_xmit always releases the skb, except when no neighbour table is
found. But even the first added user of neigh_xmit (mpls) relied on
neigh_xmit to release the skb (or queue it for tx).
sashiko reported:
If neigh_xmit() is called with an uninitialized neighbor table (for
example, NEIGH_ND_TABLE when IPv6 is disabled), it returns -EAFNOSUPPORT
and bypasses its internal out_kfree_skb error path. Because the return
value of neigh_xmit() is ignored here, does this leak the SKB?
Assume full ownership and remove the last code path that doesn't
xmit or free skb.
Fixes: 4fd3d7d9e868 ("neigh: Add helper function neigh_xmit")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260424145843.74055-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup()
does not check if net->ipv4.mrt is NULL.
Since default_device_exit_batch() is called after ->exit_rtnl(),
a device could receive IGMP packets and access net->ipv4.mrt
during/after ipmr_rules_exit_rtnl().
If ipmr_rules_exit_rtnl() had already cleared it and freed the
memory, the access would trigger null-ptr-deref or use-after-free.
Let's fix it by using RCU helper and free mrt after RCU grace
period.
In addition, check_net(net) is added to mroute_clean_tables()
and ipmr_cache_unresolved() to synchronise via mfc_unres_lock.
This prevents ipmr_cache_unresolved() from putting skb into
c->_c.mfc_un.unres.unresolved after mroute_clean_tables()
purges it.
For the same reason, timer_shutdown_sync() is moved after
mroute_clean_tables().
Since rhltable_destroy() holds mutex internally, rcu_work is
used, and it is placed as the first member because rcu_head
must be placed within <4K offset. mr_table is alraedy 3864
bytes without rcu_work.
Note that IP6MR is not yet converted to ->exit_rtnl(), so this
change is not needed for now but will be.
Fixes: b22b01867406 ("ipmr: Convert ipmr_net_exit_batch() to ->exit_rtnl().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260423053456.4097409-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot reported a kernel BUG triggered from pn_socket_sendmsg() via
pn_socket_autobind():
kernel BUG at net/phonet/socket.c:213!
RIP: 0010:pn_socket_autobind net/phonet/socket.c:213 [inline]
RIP: 0010:pn_socket_sendmsg+0x240/0x250 net/phonet/socket.c:421
Call Trace:
sock_sendmsg_nosec+0x112/0x150 net/socket.c:797
__sock_sendmsg net/socket.c:812 [inline]
__sys_sendto+0x402/0x590 net/socket.c:2280
...
pn_socket_autobind() calls pn_socket_bind() with port 0 and, on
-EINVAL, assumes the socket was already bound and asserts that the
port is non-zero:
err = pn_socket_bind(sock, ..., sizeof(struct sockaddr_pn));
if (err != -EINVAL)
return err;
BUG_ON(!pn_port(pn_sk(sock->sk)->sobject));
return 0; /* socket was already bound */
However pn_socket_bind() also returns -EINVAL when sk->sk_state is not
TCP_CLOSE, even when the socket has never been bound and pn_port() is
still 0. In that case the BUG_ON() fires and panics the kernel from a
user-triggerable path.
Treat the "bind returned -EINVAL but pn_port() is still 0" case as a
regular error and propagate -EINVAL to the caller instead of crashing.
Existing callers already translate a non-zero return from
pn_socket_autobind() into -ENOBUFS/-EAGAIN, so returning -EINVAL here
only changes behaviour from panic to a normal errno.
Fixes: ba113a94b750 ("Phonet: common socket glue")
Reported-by: syzbot+706f5eb79044e686c794@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=706f5eb79044e686c794
Suggested-by: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: Morduan Zang <zhangdandan@uniontech.com>
Signed-off-by: zhanjun <zhanjun@uniontech.com>
Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Link: https://patch.msgid.link/87A8960A2045AF3C+20260423010557.138124-1-zhangdandan@uniontech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a TAPRIO child qdisc is deleted via RTM_DELQDISC, taprio_graft()
is called with new == NULL and stores NULL into q->qdiscs[cl - 1].
Subsequent RTM_GETTCLASS dump operations walk all classes via
taprio_walk() and call taprio_dump_class(), which calls taprio_leaf()
returning the NULL pointer, then dereferences it to read child->handle,
causing a kernel NULL pointer dereference.
The bug is reachable with namespace-scoped CAP_NET_ADMIN on any kernel
with CONFIG_NET_SCH_TAPRIO enabled. On systems with unprivileged user
namespaces enabled, an unprivileged local user can trigger a kernel
panic by creating a taprio qdisc inside a new network namespace,
grafting an explicit child qdisc, deleting it, and requesting a class
dump. The RTM_GETTCLASS dump itself requires no capability.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000007: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f]
RIP: 0010:taprio_dump_class (net/sched/sch_taprio.c:2478)
Call Trace:
<TASK>
tc_fill_tclass (net/sched/sch_api.c:1966)
qdisc_class_dump (net/sched/sch_api.c:2326)
taprio_walk (net/sched/sch_taprio.c:2514)
tc_dump_tclass_qdisc (net/sched/sch_api.c:2352)
tc_dump_tclass_root (net/sched/sch_api.c:2370)
tc_dump_tclass (net/sched/sch_api.c:2431)
rtnl_dumpit (net/core/rtnetlink.c:6864)
netlink_dump (net/netlink/af_netlink.c:2325)
rtnetlink_rcv_msg (net/core/rtnetlink.c:6959)
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
</TASK>
Fix this by substituting &noop_qdisc when new is NULL in
taprio_graft(), a common pattern used by other qdiscs (e.g.,
multiq_graft()) to ensure the q->qdiscs[] slots are never NULL.
This makes control-plane dump paths safe without requiring individual
NULL checks.
Since the data-plane paths (taprio_enqueue and taprio_dequeue_from_txq)
previously had explicit NULL guards that would drop/skip the packet
cleanly, update those checks to test for &noop_qdisc instead. Without
this, packets would reach taprio_enqueue_one() which increments the root
qdisc's qlen and backlog before calling the child's enqueue; noop_qdisc
drops the packet but those counters are never rolled back, permanently
inflating the root qdisc's statistics.
After this change *old can be a valid qdisc, NULL, or &noop_qdisc.
Only call qdisc_put(*old) in the first case to avoid decreasing
noop_qdisc's refcount, which was never increased.
Fixes: 665338b2a7a0 ("net/sched: taprio: dump class stats for the actual q->qdiscs[]")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260422161958.2517539-3-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ipv6_rpl_srh_rcv() decompresses an RFC 6554 Source Routing Header, swaps
the next segment into ipv6_hdr->daddr, recompresses, then pulls the old
header and pushes the new one plus the IPv6 header back. The
recompressed header can be larger than the received one when the swap
reduces the common-prefix length the segments share with daddr (CmprI=0,
CmprE>0, seg[0][0] != daddr[0] gives the maximum +8 bytes).
pskb_expand_head() was gated on segments_left == 0, so on earlier
segments the push consumed unchecked headroom. Once skb_push() leaves
fewer than skb->mac_len bytes in front of data,
skb_mac_header_rebuild()'s call to:
skb_set_mac_header(skb, -skb->mac_len);
will store (data - head) - mac_len into the u16 mac_header field, which
wraps to ~65530, and the following memmove() writes mac_len bytes ~64KiB
past skb->head.
A single AF_INET6/SOCK_RAW/IPV6_HDRINCL packet over lo with a two
segment type-3 SRH (CmprI=0, CmprE=15) reaches headroom 8 after one
pass; KASAN reports a 14-byte OOB write in ipv6_rthdr_rcv.
Fix this by expanding the head whenever the remaining room is less than
the push size plus mac_len, and request that much extra so the rebuilt
MAC header fits afterwards.
Fixes: 8610c7c6e3bd ("net: ipv6: add support for rpl sr exthdr")
Cc: stable <stable@kernel.org>
Reported-by: Anthropic
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2026042133-gout-unvented-1bd9@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
fq_codel_dump_stats() acquires the qdisc spinlock a bit too late.
Move this acquisition before we fill tc_fq_pie_xstats with live data.
Alternative would be to add READ_ONCE() and WRITE_ONCE() annotations,
but the spinlock is needed anyway to scan q->new_flows and q->old_flows.
Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260423063527.2568262-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
choke_dump_stats() only runs with RTNL held.
It reads fields that can be changed in qdisc fast path.
Add READ_ONCE()/WRITE_ONCE() annotations.
Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260423062839.2524324-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Reject requests with negative latency or jitter.
A negative value added to current timestamp (u64) wraps
to an enormous time_to_send, disabling dequeue.
The original UAPI used u32 for these values; the conversion to 64-bit
time values via TCA_NETEM_LATENCY64 and TCA_NETEM_JITTER64
allowed signed values to reach the kernel without validation.
Jitter is already silently clamped by an abs() in netem_change();
that abs() can be removed in a follow-up once this rejection is in
place.
Fixes: 99803171ef04 ("netem: add uapi to express delay and jitter in nanoseconds")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-7-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
get_slot_next() computes a random delay between min_delay and
max_delay using:
get_random_u32() * (max_delay - min_delay) >> 32
This overflows signed 64-bit arithmetic when the delay range exceeds
approximately 2.1 seconds (2^31 nanoseconds), producing a negative
result that effectively disables slot-based pacing. This is a
realistic configuration for WAN emulation (e.g., slot 1s 5s).
Use mul_u64_u32_shr() which handles the widening multiply without
overflow.
Fixes: 0a9fe5c375b5 ("netem: slotting with non-uniform distribution")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-6-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Reject slot configurations that have no defensible meaning:
- negative min_delay or max_delay
- min_delay greater than max_delay
- negative dist_delay or dist_jitter
- negative max_packets or max_bytes
Negative or out-of-order delays underflow in get_slot_next(),
producing garbage intervals. Negative limits trip the per-slot
accounting (packets_left/bytes_left <= 0) on the first packet of
every slot, defeating the rate-limiting half of the slot feature.
Note that dist_jitter has been silently coerced to its absolute
value by get_slot() since the feature was introduced; rejecting
negatives here converts that silent coercion into -EINVAL. The
abs() can be removed in a follow-up.
Fixes: 836af83b54e3 ("netem: support delivering packets in delayed time slots")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-5-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
netem_change() unconditionally reseeds the PRNG on every tc change
command. If TCA_NETEM_PRNG_SEED is not specified, a new random seed
is generated, destroying reproducibility for users who set a
deterministic seed on a previous change.
Move the initial random seed generation to netem_init() and only
reseed in netem_change() when TCA_NETEM_PRNG_SEED is explicitly
provided by the user.
Fixes: 4072d97ddc44 ("netem: add prng attribute to netem_sched_data")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-4-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The queue limit check in netem_enqueue() uses q->t_len which only
counts packets in the internal tfifo. Packets placed in sch->q by
the reorder path (__qdisc_enqueue_head) are not counted, allowing
the total queue occupancy to exceed sch->limit under reordering.
Include sch->q.qlen in the limit check.
Fixes: f8d4bc455047 ("net/sched: netem: account for backlog updates from child qdisc")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-3-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The 4-state Markov chain in loss_4state() has gaps at the boundaries
between transition probability ranges. The comparisons use:
if (rnd < a4)
else if (a4 < rnd && rnd < a1 + a4)
When rnd equals a boundary value exactly, neither branch matches and
no state transition occurs. The redundant lower-bound check (a4 < rnd)
is already implied by being in the else branch.
Remove the unnecessary lower-bound comparisons so the ranges are
contiguous and every random value produces a transition, matching
the GI (General and Intuitive) loss model specification.
This bug goes back to original implementation of this model.
Fixes: 661b79725fea ("netem: revised correlated loss generator")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-2-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull NFS client updates from Trond Myklebust:
"Bugfixes:
- Fix handling of ENOSPC so that if we have to resend writes, they
are written synchronously
- SUNRPC RDMA transport fixes from Chuck
- Several fixes for delegated timestamps in NFSv4.2
- Failure to obtain a directory delegation should not cause stat() to
fail with NFSv4
- Rename was failing to update timestamps when a directory delegation
is held on NFSv4
- Ensure we check rsize/wsize after crossing a NFSv4 filesystem
boundary
- NFSv4/pnfs:
- If the server is down, retry the layout returns on reboot
- Fallback to MDS could result in a short write being incorrectly
logged
Cleanups:
- Use memcpy_and_pad in decode_fh"
* tag 'nfs-for-7.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (21 commits)
NFS: Fix RCU dereference of cl_xprt in nfs_compare_super_address
NFS: remove redundant __private attribute from nfs_page_class
NFSv4.2: fix CLONE/COPY attrs in presence of delegated attributes
NFS: fix writeback in presence of errors
nfs: use memcpy_and_pad in decode_fh
NFSv4.1: Apply session size limits on clone path
NFSv4: retry GETATTR if GET_DIR_DELEGATION failed
NFS: fix RENAME attr in presence of directory delegations
pnfs/flexfiles: validate ds_versions_cnt is non-zero
NFS/blocklayout: print each device used for SCSI layouts
xprtrdma: Post receive buffers after RPC completion
xprtrdma: Scale receive batch size with credit window
xprtrdma: Replace rpcrdma_mr_seg with xdr_buf cursor
xprtrdma: Decouple frwr_wp_create from frwr_map
xprtrdma: Close lost-wakeup race in xprt_rdma_alloc_slot
xprtrdma: Avoid 250 ms delay on backlog wakeup
xprtrdma: Close sendctx get/put race that can block a transport
nfs: update inode ctime after removexattr operation
nfs: fix utimensat() for atime with delegated timestamps
NFS: improve "Server wrote zero bytes" error
...
|
|
Pull ceph updates from Ilya Dryomov:
"We have a series from Alex which extends CephFS client metrics with
support for per-subvolume data I/O performance and latency tracking
(metadata operations aren't included) and a good variety of fixes and
cleanups across RBD and CephFS"
* tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-client:
ceph: add subvolume metrics collection and reporting
ceph: parse subvolume_id from InodeStat v9 and store in inode
ceph: handle InodeStat v8 versioned field in reply parsing
libceph: Fix slab-out-of-bounds access in auth message processing
rbd: fix null-ptr-deref when device_add_disk() fails
crush: cleanup in crush_do_rule() method
ceph: clear s_cap_reconnect when ceph_pagelist_encode_32() fails
ceph: only d_add() negative dentries when they are unhashed
libceph: update outdated comment in ceph_sock_write_space()
libceph: Remove obsolete session key alignment logic
ceph: fix num_ops off-by-one when crypto allocation fails
libceph: Prevent potential null-ptr-deref in ceph_handle_auth_reply()
|
|
Pull 9p updates from Dominique Martinet:
- 9p access flag fix (cannot change access flag since new mount API implem)
- some minor cleanup
* tag '9p-for-7.1-rc1' of https://github.com/martinetd/linux:
9p/trans_xen: replace simple_strto* with kstrtouint
9p/trans_xen: make cleanup idempotent after dataring alloc errors
9p: document missing enum values in kernel-doc comments
9p: fix access mode flags being ORed instead of replaced
9p: fix memory leak in v9fs_init_fs_context error path
|
|
Replace unsafe port parsing in epaddr_len(), ct_sip_parse_header_uri(),
and ct_sip_parse_request() with a new sip_parse_port() helper that
validates each digit against the buffer limit, eliminating the use of
simple_strtoul() which assumes NUL-terminated strings.
The previous code dereferenced pointers without bounds checks after
sip_parse_addr() and relied on simple_strtoul() on non-NUL-terminated
skb data. A port that reaches the buffer limit without a trailing
character is also rejected as malformed.
Also get rid of all simple_strtoul() usage in conntrack, prefer a
stricter version instead. There are intentional changes:
- Bail out if number is > UINT_MAX and indicate a failure, same for
too long sequences.
While we do accept 05535 as port 5535, we will not accept e.g.
'sip:10.0.0.1:005060'. While its syntactically valid under RFC 3261,
we should restrict this to not waste cycles when presented with
malformed packets with 64k '0' characters.
- Force base 10 in ct_sip_parse_numerical_param(). This is used to fetch
'expire=' and 'rports='; both are expected to use base-10.
- In nf_nat_sip.c, only accept the parsed value if its within the 1k-64k
range.
- epaddr_len now returns 0 if the port is invalid, as it already does
for invalid ip addresses. This is intentional. nf_conntrack_sip
performs lots of guesswork to find the right parts of the message
to parse. Being stricter could break existing setups.
Connection tracking helpers are designed to allow traffic to
pass, not to block it.
Based on an earlier patch from Jenny Guanni Qu <qguanni@gmail.com>.
Fixes: 05e3ced297fe ("[NETFILTER]: nf_conntrack_sip: introduce SIP-URI parsing helper")
Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com>
Reported-by: Dawid Moczadło <dawid@vidocsecurity.com>
Reported-by: Jenny Guanni Qu <qguanni@gmail.com>.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Reject zero shift operands for nft_bitwise left and right shift
expressions during initialization.
The carry propagation logic computes the carry from the adjacent 32-bit
word using BITS_PER_TYPE(u32) - shift. A zero shift operand turns this
into a 32-bit shift, which is undefined behaviour.
Reject zero shift operands in the control plane, alongside the existing
check for values greater than or equal to 32, so malformed rules never
reach the packet path.
Fixes: 567d746b55bc ("netfilter: bitwise: add support for shifts.")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Kai Ma <k4729.23098@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
match_policy_in() walks sec_path entries from the last transform to the
first one, but strict policy matching needs to consume info->pol[] in
the same forward order as the rule layout.
Derive the strict-match policy position from the number of transforms
already consumed so that multi-element inbound rules are matched
consistently.
Fixes: c4b885139203 ("[NETFILTER]: x_tables: replace IPv4/IPv6 policy match by address family independant version")
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking deletions from Jakub Kicinski:
"Delete some obsolete networking code
Old code like amateur radio and NFC have long been a burden to core
networking developers. syzbot loves to find bugs in BKL-era code, and
noobs try to fix them.
If we want to have a fighting chance of surviving the LLM-pocalypse
this code needs to find a dedicated owner or get deleted. We've talked
about these deletions multiple times in the past and every time
someone wanted the code to stay. It is never very clear to me how many
of those people actually use the code vs are just nostalgic to see it
go. Amateur radio did have occasional users (or so I think) but most
users switched to user space implementations since its all super slow
stuff. Nobody stepped up to maintain the kernel code.
We were lucky enough to find someone who wants to help with NFC so
we're giving that a chance. Let's try to put the rest of this code
behind us"
* tag 'net-deletions' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next:
drivers: net: 8390: wd80x3: Remove this driver
drivers: net: 8390: ultra: Remove this driver
drivers: net: 8390: AX88190: Remove this driver
drivers: net: fujitsu: fmvj18x: Remove this driver
drivers: net: smsc: smc91c92: Remove this driver
drivers: net: smsc: smc9194: Remove this driver
drivers: net: amd: nmclan: Remove this driver
drivers: net: amd: lance: Remove this driver
drivers: net: 3com: 3c589: Remove this driver
drivers: net: 3com: 3c574: Remove this driver
drivers: net: 3com: 3c515: Remove this driver
drivers: net: 3com: 3c509: Remove this driver
net: packetengines: remove obsolete yellowfin driver and vendor dir
net: packetengines: remove obsolete hamachi driver
net: remove unused ATM protocols and legacy ATM device drivers
net: remove ax25 and amateur radio (hamradio) subsystem
net: remove ISDN subsystem and Bluetooth CMTP
caif: remove CAIF NETWORK LAYER
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Netfilter.
Steady stream of fixes. Last two weeks feel comparable to the two
weeks before the merge window. Lots of AI-aided bug discovery. A newer
big source is Sashiko/Gemini (Roman Gushchin's system), which points
out issues in existing code during patch review (maybe 25% of fixes
here likely originating from Sashiko). Nice thing is these are often
fixed by the respective maintainers, not drive-bys.
Current release - new code bugs:
- kconfig: MDIO_PIC64HPSC should depend on ARCH_MICROCHIP
Previous releases - regressions:
- add async ndo_set_rx_mode and switch drivers which we promised to
be called under the per-netdev mutex to it
- dsa: remove duplicate netdev_lock_ops() for conduit ethtool ops
- hv_sock: report EOF instead of -EIO for FIN
- vsock/virtio: fix MSG_PEEK calculation on bytes to copy
Previous releases - always broken:
- ipv6: fix possible UAF in icmpv6_rcv()
- icmp: validate reply type before using icmp_pointers
- af_unix: drop all SCM attributes for SOCKMAP
- netfilter: fix a number of bugs in the osf (OS fingerprinting)
- eth: intel: fix timestamp interrupt configuration for E825C
Misc:
- bunch of data-race annotations"
* tag 'net-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (148 commits)
rxrpc: Fix error handling in rxgk_extract_token()
rxrpc: Fix re-decryption of RESPONSE packets
rxrpc: Fix rxrpc_input_call_event() to only unshare DATA packets
rxrpc: Fix missing validation of ticket length in non-XDR key preparsing
rxgk: Fix potential integer overflow in length check
rxrpc: Fix conn-level packet handling to unshare RESPONSE packets
rxrpc: Fix potential UAF after skb_unshare() failure
rxrpc: Fix rxkad crypto unalignment handling
rxrpc: Fix memory leaks in rxkad_verify_response()
net: rds: fix MR cleanup on copy error
m68k: mvme147: Make me the maintainer
net: txgbe: fix firmware version check
selftests/bpf: check epoll readiness during reuseport migration
tcp: call sk_data_ready() after listener migration
vhost_net: fix sleeping with preempt-disabled in vhost_net_busy_poll()
ipv6: Cap TLV scan in ip6_tnl_parse_tlv_enc_lim
tipc: fix double-free in tipc_buf_append()
llc: Return -EINPROGRESS from llc_ui_connect()
ipv4: icmp: validate reply type before using icmp_pointers
selftests/net: packetdrill: cover RFC 5961 5.2 challenge ACK on both edges
...
|
|
Fix a missing bit of error handling in rxgk_extract_token(): in the event
that rxgk_decrypt_skb() returns -ENOMEM, it should just return that rather
than continuing on (for anything else, it generates an abort).
Fixes: 64863f4ca494 ("rxrpc: Fix unhandled errors in rxgk_verify_packet_integrity()")
Closes: https://sashiko.dev/#/patchset/20260422161438.2593376-4-dhowells@redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260423200909.3049438-4-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If a RESPONSE packet gets a temporary failure during processing, it may end
up in a partially decrypted state - and then get requeued for a retry.
Fix this by just discarding the packet; we will send another CHALLENGE
packet and thereby elicit a further response. Similarly, discard an
incoming CHALLENGE packet if we get an error whilst generating a RESPONSE;
the server will send another CHALLENGE.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Closes: https://sashiko.dev/#/patchset/20260422161438.2593376-4-dhowells@redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260423200909.3049438-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix rxrpc_input_call_event() to only unshare DATA packets and not ACK,
ABORT, etc..
And with that, rxrpc_input_packet() doesn't need to take a pointer to the
pointer to the packet, so change that to just a pointer.
Fixes: 1f2740150f90 ("rxrpc: Fix potential UAF after skb_unshare() failure")
Closes: https://sashiko.dev/#/patchset/20260422161438.2593376-4-dhowells@redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260423200909.3049438-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In rxrpc_preparse(), there are two paths for parsing key payloads: the
XDR path (for large payloads) and the non-XDR path (for payloads <= 28
bytes). While the XDR path (rxrpc_preparse_xdr_rxkad()) correctly
validates the ticket length against AFSTOKEN_RK_TIX_MAX, the non-XDR
path fails to do so.
This allows an unprivileged user to provide a very large ticket length.
When this key is later read via rxrpc_read(), the total
token size (toksize) calculation results in a value that exceeds
AFSTOKEN_LENGTH_MAX, triggering a WARN_ON().
[ 2001.302904] WARNING: CPU: 2 PID: 2108 at net/rxrpc/key.c:778 rxrpc_read+0x109/0x5c0 [rxrpc]
Fix this by adding a check in the non-XDR parsing path of rxrpc_preparse()
to ensure the ticket length does not exceed AFSTOKEN_RK_TIX_MAX,
bringing it into parity with the XDR parsing logic.
Fixes: 8a7a3eb4ddbe ("KEYS: RxRPC: Use key preparsing")
Fixes: 84924aac08a4 ("rxrpc: Fix checker warning")
Reported-by: Anderson Nascimento <anderson@allelesecurity.com>
Signed-off-by: Anderson Nascimento <anderson@allelesecurity.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260422161438.2593376-7-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix potential integer overflow in rxgk_extract_token() when checking the
length of the ticket. Rather than rounding up the value to be tested
(which might overflow), round down the size of the available data.
Fixes: 2429a1976481 ("rxrpc: Fix untrusted unsigned subtract")
Closes: https://sashiko.dev/#/patchset/20260408121252.2249051-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260422161438.2593376-6-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The security operations that verify the RESPONSE packets decrypt bits of it
in place - however, the sk_buff may be shared with a packet sniffer, which
would lead to the sniffer seeing an apparently corrupt packet (actually
decrypted).
Fix this by handing a copy of the packet off to the specific security
handler if the packet was cloned.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Closes: https://sashiko.dev/#/patchset/20260408121252.2249051-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260422161438.2593376-5-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|