linux-toradex.git/net/xfrm, branch master

xfrm: policy: preallocate inexact bins before xfrm_hash_rebuild reinsert

2026-07-06T06:30:02+00:00

xfrm_hash_rebuild()'s first loop preallocates the bins/chains the reinsert
loop needs, so the reinsert (after hlist_del_rcu()) cannot allocate or
fail. But its guard is inverted: it skips policies with prefixlen <
threshold and preallocates for the rest.

prefixlen < threshold is exactly when policy_hash_bysel() returns NULL and
the reinsert takes the allocating xfrm_policy_inexact_insert() path. So the
loop preallocates for the exact policies (which never allocate) and skips
the inexact ones, whose bin/node is then allocated GFP_ATOMIC during
reinsert. On failure the error path only WARN_ONCE()s and continues,
leaving a poisoned bydst node; the next rebuild's hlist_del_rcu()
dereferences LIST_POISON2 and takes a GPF. Reachable under memory pressure,
deterministic via failslab.

Invert the guard so preallocation covers exactly the reinserted policies;
the reinsert then allocates nothing and cannot fail.
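
The corrected guard can be pictured with a small userspace model. This is
only a sketch: toy_policy, needs_inexact_bin() and count_preallocs() are
hypothetical names, and a single prefixlen/threshold pair stands in for
whatever per-family thresholds the kernel actually compares.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a policy selector; not the kernel's struct xfrm_policy. */
struct toy_policy {
	unsigned int prefixlen;
};

/*
 * A policy whose prefix is shorter than the hash threshold cannot be hashed
 * exactly, so its reinsert would go through the allocating inexact path.
 */
bool needs_inexact_bin(const struct toy_policy *p, unsigned int threshold)
{
	return p->prefixlen < threshold;
}

/* First pass of the rebuild: count policies that need a preallocated bin. */
size_t count_preallocs(const struct toy_policy *pols, size_t n,
		       unsigned int threshold)
{
	size_t i, count = 0;

	assert(pols != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Fixed guard: skip exact policies, keep the inexact ones. */
		if (!needs_inexact_bin(&pols[i], threshold))
			continue;
		count++;
	}
	return count;
}
```

With the inverted guard the loop would count the exact policies instead,
which is the mismatch described above.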

Crash:
  Oops: general protection fault, probably for non-canonical address
  0xfbd59c0000000024: 0000 [#1] SMP KASAN NOPTI
  KASAN: maybe wild-memory-access in range [0xdead...]
  ...
  Workqueue: events xfrm_hash_rebuild
  RIP: 0010:xfrm_hash_rebuild+0x5b3/0x1190
  RAX: dead000000000122   (LIST_POISON2 + offset)
  ...
  Call Trace:
   hlist_del_rcu (include/linux/rculist.h:599)
   xfrm_hash_rebuild (net/xfrm/xfrm_policy.c:1365)
   process_one_work (kernel/workqueue.c:3322)
   worker_thread (kernel/workqueue.c:3486)
   kthread (kernel/kthread.c:436)
   ret_from_fork (arch/x86/kernel/process.c:158)
   ret_from_fork_asm (arch/x86/entry/entry_64.S:245)
   ...
  Kernel panic - not syncing: Fatal exception in interrupt

Fixes: 24969facd704 ("xfrm: policy: store inexact policies in an rhashtable")
Reported-by: AutonomousCodeSecurity@microsoft.com
Signed-off-by: Xiang Mei (Microsoft) 
Reviewed-by: Florian Westphal 
Signed-off-by: Steffen Klassert

xfrm: iptfs: propagate SKBFL_SHARED_FRAG in iptfs_skb_add_frags()

2026-07-06T06:29:07+00:00

When iptfs_skb_add_frags() copies frag references from the source
frag walk into a new SKB, it increments the page reference count via
__skb_frag_ref() but does not propagate SKBFL_SHARED_FRAG to the
destination SKB's skb_shinfo->flags.

If the source SKB carries shared frags (e.g. from a page-pool backed
receive path), the new inner SKB will appear to ESP as having privately
owned frags.  A subsequent esp_input() call for a nested transport-mode
SA then takes the no-COW fast path and decrypts in place, writing over
pages that are still referenced by the outer IPTFS SKB.  This causes
kernel-visible memory corruption and can trigger a panic.

All other frag-transfer helpers in the kernel (skb_try_coalesce,
skb_gro_receive, __pskb_copy_fclone, skb_shift, skb_segment) correctly
propagate SKBFL_SHARED_FRAG; align iptfs_skb_add_frags() with this
convention by setting the flag inside the loop immediately after
__skb_frag_ref() and nr_frags++, so every exit path that attaches a frag
unconditionally propagates SKBFL_SHARED_FRAG.
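
A minimal userspace sketch of the "flag follows every attached frag" rule.
toy_shinfo, toy_add_frags() and TOY_SHARED_FRAG are hypothetical stand-ins
for skb_shinfo, iptfs_skb_add_frags() and SKBFL_SHARED_FRAG. The sketch
copies the flag from the source; whether the real patch copies it or sets it
outright is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FRAGS 4
#define TOY_SHARED_FRAG 0x1u	/* plays the role of SKBFL_SHARED_FRAG */

/* Toy model of an skb's shared info; page refs are plain counters. */
struct toy_shinfo {
	unsigned int flags;
	size_t nr_frags;
	int *frags[TOY_MAX_FRAGS];	/* each frag points at a page refcount */
};

/*
 * Attach up to `want` frags from src to dst. The flag is copied right after
 * each attach, so an early exit (dst full) still leaves dst correctly marked.
 * Returns the number of frags attached.
 */
size_t toy_add_frags(struct toy_shinfo *dst, const struct toy_shinfo *src,
		     size_t want)
{
	size_t i, added = 0;

	assert(dst && src);
	for (i = 0; i < src->nr_frags && added < want; i++) {
		if (dst->nr_frags == TOY_MAX_FRAGS)
			break;
		(*src->frags[i])++;		/* like __skb_frag_ref() */
		dst->frags[dst->nr_frags++] = src->frags[i];
		dst->flags |= src->flags & TOY_SHARED_FRAG;
		added++;
	}
	return added;
}
```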

Fixes: 5f2b6a909574 ("xfrm: iptfs: add skb-fragment sharing code")
Signed-off-by: Chen YanJun 
Signed-off-by: Steffen Klassert

xfrm: clear mode callbacks after failed mode setup

2026-07-06T06:29:06+00:00

xfrm_state_gc_task can run long after a failed IPTFS state setup. In the
reproduced case, __xfrm_init_state() cached x->mode_cbs, IPTFS setup
returned -ENOMEM before publishing mode_data, and the temporary module
reference from xfrm_get_mode_cbs() was dropped immediately. The dead state
then kept x->mode_cbs until deferred GC ran after xfrm_iptfs had been
unloaded.

Clear x->mode_cbs when mode init or clone fails before publishing
mode_data. Those states never installed mode-specific state or the
long-term IPTFS module pin, so deferred GC has nothing mode-specific to
destroy and must not retain a callback table pointer past the temporary
lookup reference.

The buggy scenario involves two paths, with each column showing the order
within that path:

failed setup path:                      GC/unload path:
1. cache x->mode_cbs                    1. xfrm_state_put() queues GC work
2. mode setup fails before mode_data    2. xfrm_iptfs unloads later
3. drop the temporary module ref        3. xfrm_state_gc_task runs
4. dead state keeps x->mode_cbs cached  4. GC dereferences stale x->mode_cbs

This also covers the failed clone path where clone_state() returns before
publishing mode_data.
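
The rule "forget the callback table if setup never published mode_data" can
be sketched in userspace as follows. All names (toy_state, toy_mode_cbs,
toy_init_state(), toy_gc(), toy_failing_init()) are hypothetical; the real
structures and the module reference handling are more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_state;

/* Toy callback table; not the kernel's struct xfrm_mode_cbs. */
struct toy_mode_cbs {
	int (*init_state)(struct toy_state *x);
	void (*destroy_state)(struct toy_state *x);
};

struct toy_state {
	const struct toy_mode_cbs *mode_cbs;
	void *mode_data;
};

/* Cache the callbacks, run mode init, and forget them again on failure. */
int toy_init_state(struct toy_state *x, const struct toy_mode_cbs *cbs)
{
	int err;

	assert(x && cbs && cbs->init_state);
	x->mode_cbs = cbs;
	err = cbs->init_state(x);
	if (err) {
		/* mode_data was never published: nothing for GC to destroy. */
		x->mode_cbs = NULL;
		return err;
	}
	return 0;
}

/* Deferred GC: touch the table only if setup completed. Returns 1 if called. */
int toy_gc(struct toy_state *x)
{
	if (!x->mode_cbs || !x->mode_cbs->destroy_state)
		return 0;
	x->mode_cbs->destroy_state(x);
	return 1;
}

/* Hypothetical init hook that always fails, as under failslab. */
int toy_failing_init(struct toy_state *x)
{
	(void)x;
	return -ENOMEM;
}
```

After a failed toy_init_state() the state holds no pointer into the (possibly
unloaded) module, so a late toy_gc() has nothing to dereference.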

Validation reproduced this kernel report:
Kernel panic - not syncing: Fatal exception
CONFIG_FAULT_INJECTION_STACKTRACE_FILTER=y
failslab_stacktrace_filter matched xfrm_iptfs frames
ack_error=-12
FAULT_INJECTION: forcing a failure
BUG: unable to handle page fault
Workqueue: events xfrm_state_gc_task
RIP: xfrm_state_gc_task+0x142/0x650
Modules linked in: esp4_offload xfrm_user [last unloaded: xfrm_iptfs]
Kernel panic - not syncing: Fatal exception

Fixes: 4b3faf610cc6 ("xfrm: iptfs: add new iptfs xfrm mode impl")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Cen Zhang 
Signed-off-by: Steffen Klassert

xfrm: reject optional IPTFS templates in outbound policies

2026-07-02T07:22:54+00:00

syzbot reported a stack-out-of-bounds read in xfrm_state_find()
which flows from xfrm_tmpl_resolve_one().

Commit 3d776e31c841 ("xfrm: Reject optional tunnel/BEET mode
templates in outbound policies") disallowed optional tunnel and
BEET templates in outbound policies to prevent this. When IPTFS was
added later, it was not covered by that check and can still trigger
the out-of-bounds read.

Extend the check to disallow optional IPTFS in outbound policies
as well. IPTFS should be treated the same as tunnel mode here.
IN and FWD policies are not affected: xfrm_tmpl_resolve_one()
is only reachable via the outbound path.
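
The extended check can be modelled roughly as below. This is a sketch under
assumptions: the enum values are illustrative rather than the uapi
constants, toy_validate_tmpl() is a hypothetical name, and the -EINVAL
return value is assumed, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy enums; the values are illustrative, not the uapi constants. */
enum toy_mode { TOY_MODE_TRANSPORT, TOY_MODE_TUNNEL, TOY_MODE_BEET, TOY_MODE_IPTFS };
enum toy_dir { TOY_DIR_IN, TOY_DIR_OUT, TOY_DIR_FWD };

struct toy_tmpl {
	enum toy_mode mode;
	bool optional;
};

/* Reject optional tunnel-like templates on outbound policies. */
int toy_validate_tmpl(const struct toy_tmpl *t, enum toy_dir dir)
{
	assert(t);
	switch (t->mode) {
	case TOY_MODE_TUNNEL:
	case TOY_MODE_BEET:
	case TOY_MODE_IPTFS:	/* the newly covered case */
		if (t->optional && dir == TOY_DIR_OUT)
			return -EINVAL;
		break;
	default:
		break;
	}
	return 0;
}
```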

Reproducer, before:

ip link add dummy0 type dummy
ip link set dummy0 up
ip addr add 10.1.1.1/24 dev dummy0
ip xfrm policy add src 10.1.1.1/32 dst 10.1.1.2/32 dir out tmpl \
  src fc00::dead:1 dst fc00::dead:2 proto esp reqid 1 mode iptfs \
  level use tmpl src fc00::dead:1 dst fc00::dead:2 proto esp reqid \
  2 mode transport
ping -W 1 -c 1 10.1.1.2
PING 10.1.1.2 (10.1.1.2) 56(84) bytes of data.

[   64.168420] ==================================================================
[   64.169977] BUG: KASAN: stack-out-of-bounds in __xfrm6_addr_hash+0x11e/0x170
[   64.169977] Read of size 4 at addr ffff88800e1ffd20 by task ping/2844

[   64.169977] CPU: 2 UID: 0 PID: 2844 Comm: ping Not tainted 7.1.0-rc7-00180-geb23b588430a #98 PREEMPT(full)
[   64.169977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   64.169977] Call Trace:
[   64.169977]  
[   64.169977]  dump_stack_lvl+0x47/0x70
[   64.169977]  ? __xfrm6_addr_hash+0x11e/0x170
[   64.169977]  print_report+0x152/0x4b0
[   64.169977]  ? ksys_mmap_pgoff+0x6d/0xa0
[   64.169977]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   64.169977]  ? rcu_read_unlock_sched+0xa/0x20
[   64.169977]  ? __virt_addr_valid+0x21b/0x230
[   64.169977]  ? __xfrm6_addr_hash+0x11e/0x170
[   64.169977]  kasan_report+0xa8/0xd0
[   64.169977]  ? __xfrm6_addr_hash+0x11e/0x170
[   64.169977]  __xfrm6_addr_hash+0x11e/0x170
[   64.169977]  __xfrm_dst_hash+0x24/0xc0
[   64.169977]  xfrm_state_find+0xa2d/0x2f90
[   64.169977]  ? __pfx_xfrm_state_find+0x10/0x10
[   64.169977]  ? __pfx_ftrace_graph_ret_addr+0x10/0x10
[   64.169977]  ? __pfx_ftrace_graph_ret_addr+0x10/0x10
[   64.169977]  xfrm_tmpl_resolve_one+0x210/0x570
[   64.169977]  ? __pfx_xfrm_tmpl_resolve_one+0x10/0x10
[   64.169977]  ? __pfx_stack_trace_consume_entry+0x10/0x10
[   64.169977]  ? kernel_text_address+0x5b/0x80
[   64.169977]  ? __kernel_text_address+0xe/0x30
[   64.169977]  ? unwind_get_return_address+0x5e/0x90
[   64.169977]  ? arch_stack_walk+0x8c/0xe0
[   64.169977]  xfrm_tmpl_resolve+0x130/0x200
[   64.169977]  ? __pfx_xfrm_tmpl_resolve+0x10/0x10
[   64.169977]  ? __pfx_xfrm_policy_inexact_lookup_rcu+0x10/0x10
[   64.169977]  ? __refcount_add_not_zero.constprop.0+0xb2/0x110
[   64.169977]  ? __pfx___refcount_add_not_zero.constprop.0+0x10/0x10
[   64.169977]  xfrm_resolve_and_create_bundle+0xd5/0x310
[   64.169977]  ? __pfx_xfrm_resolve_and_create_bundle+0x10/0x10
[   64.169977]  ? __pfx_xfrm_policy_lookup_bytype+0x10/0x10
[   64.169977]  ? __pfx_xfrm_policy_lookup_bytype+0x10/0x10
[   64.169977]  xfrm_lookup_with_ifid+0x3d8/0xb80
[   64.169977]  ? __pfx_xfrm_lookup_with_ifid+0x10/0x10
[   64.169977]  ? ip_route_output_key_hash+0xc6/0x110
[   64.169977]  ? kasan_save_track+0x10/0x30
[   64.169977]  xfrm_lookup_route+0x18/0xe0
[   64.169977]  ip4_datagram_release_cb+0x4c9/0x530
[   64.169977]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[   64.169977]  ? do_raw_spin_lock+0x71/0xc0
[   64.169977]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   64.169977]  release_sock+0xb0/0x170
[   64.169977]  udp_connect+0x43/0x50
[   64.169977]  __sys_connect+0xa6/0x100
[   64.169977]  ? alloc_fd+0x2e9/0x300
[   64.169977]  ? __pfx___sys_connect+0x10/0x10
[   64.169977]  ? preempt_latency_start+0x1f/0x70
[   64.169977]  ? fd_install+0x7e/0x150
[   64.169977]  ? rcu_read_unlock_sched+0xa/0x20
[   64.169977]  ? __sys_socket+0xdf/0x130
[   64.169977]  ? __pfx___sys_socket+0x10/0x10
[   64.169977]  ? vma_refcount_put+0x43/0xa0
[   64.169977]  __x64_sys_connect+0x7e/0x90
[   64.169977]  do_syscall_64+0x11b/0x2b0
[   64.169977]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   64.169977] RIP: 0033:0x7f4851ecb570
[   64.169977] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d f9 ca 0d 00 00 74 17 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 54
[   64.169977] RSP: 002b:00007ffc830e3498 EFLAGS: 00000202 ORIG_RAX: 000000000000002a
[   64.169977] RAX: ffffffffffffffda RBX: 00007ffc830e34d0 RCX: 00007f4851ecb570
[   64.169977] RDX: 0000000000000010 RSI: 00007ffc830e34d0 RDI: 0000000000000005
[   64.169977] RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000000
[   64.169977] R10: 0000000000000006 R11: 0000000000000202 R12: 0000000000000005
[   64.169977] R13: 0000000000000000 R14: 00005619a863f340 R15: 0000000000000000
[   64.169977]  

[   64.169977] The buggy address belongs to stack of task ping/2844
[   64.169977]  and is located at offset 88 in frame:
[   64.169977]  ip4_datagram_release_cb+0x0/0x530

[   64.169977] This frame has 1 object:
[   64.169977]  [32, 88) 'fl4'

[   64.169977] The buggy address belongs to the physical page:
[   64.169977] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xe1ff
[   64.169977] flags: 0x4000000000000000(zone=1)
[   64.169977] raw: 4000000000000000 0000000000000000 ffffea0000387fc8 0000000000000000
[   64.169977] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   64.169977] page dumped because: kasan: bad access detected

[   64.169977] Memory state around the buggy address:
[   64.169977]  ffff88800e1ffc00: f2 f2 00 00 f3 f3 00 00 00 00 00 00 00 00 00 00
[   64.169977]  ffff88800e1ffc80: 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00
[   64.169977] >ffff88800e1ffd00: 00 00 00 00 f3 f3 f3 f3 f3 00 00 00 00 00 00 00
[   64.169977]                                ^
[   64.169977]  ffff88800e1ffd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
[   64.169977]  ffff88800e1ffe00: f1 f1 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   64.169977] ==================================================================
[   64.245153] Disabling lock debugging due to kernel taint

After the fix:

ip xfrm policy add src 10.1.1.1/32 dst 10.1.1.2/32 dir out tmpl \
 src fc00::dead:1 dst fc00::dead:2 proto esp reqid 1 mode iptfs \
 level use tmpl src fc00::dead:1 dst fc00::dead:2 proto esp reqid 2 \
 mode transport

Error: Mode in optional template not allowed in outbound policy.

Fixes: d1716d5a44c3 ("xfrm: add generic iptfs defines and functionality")
Reported-by: syzbot+0ac4d84afe1066a1f3e9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a3ceb94.43b4ff68.30a095.0004.GAE@google.com/T/
Signed-off-by: Antony Antony 
Signed-off-by: Steffen Klassert

xfrm: cache the offload ifindex for netlink dumps

2026-07-02T07:12:58+00:00

copy_to_user_state_extra() only holds a reference to the outer xfrm_state.
That does not pin x->xso.dev. NETDEV_DOWN and NETDEV_UNREGISTER can race
through xfrm_dev_state_flush(), xfrm_state_delete(), and
xfrm_dev_state_free(), which clears xso->dev and drops the netdev
reference before the GETSA dump reaches xso_to_xuo() and reads
xso->dev->ifindex.

The buggy scenario involves two paths, with each column showing the order
within that path:

XFRM_MSG_GETSA dump path:           NETDEV teardown path:
1. xfrm_get_sa() gets xfrm_state    1. xfrm_dev_state_flush() finds x
2. copy_to_user_state_extra() sees  2. xfrm_state_delete() removes x
   x->xso.dev                          from the SAD
3. copy_user_offload() calls        3. xfrm_dev_state_free() clears
   xso_to_xuo()                        xso->dev
4. xso->dev->ifindex dereferences   4. netdev_put() drops the device
   a detached net_device               reference

Avoid following the live net_device from the dump paths. Cache the
attached ifindex in xfrm_dev_offload when state or policy offload is bound
to a device, and serialize that snapshot instead. This preserves the
user-visible XFRMA_OFFLOAD_DEV value without depending on the embedded
net_device lifetime.
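
A userspace sketch of the snapshot idea, assuming hypothetical names
(toy_netdev, toy_offload and the toy_* helpers). It also assumes the cached
value is left untouched when the device is detached, which this message does
not spell out.

```c
#include <assert.h>
#include <stddef.h>

struct toy_netdev {
	int ifindex;
};

/* Toy xfrm_dev_offload: `ifindex` is the cached snapshot. */
struct toy_offload {
	struct toy_netdev *dev;	/* may be cleared by device teardown */
	int ifindex;		/* set once at bind time, read by dumps */
};

/* Bind path: remember the ifindex while the device is known to be valid. */
void toy_offload_bind(struct toy_offload *xso, struct toy_netdev *dev)
{
	assert(xso && dev);
	xso->dev = dev;
	xso->ifindex = dev->ifindex;
}

/* Teardown path: detach from the device but leave the snapshot alone. */
void toy_offload_free(struct toy_offload *xso)
{
	xso->dev = NULL;
}

/* Dump path: never follows xso->dev. */
int toy_dump_ifindex(const struct toy_offload *xso)
{
	return xso->ifindex;
}
```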

Validation reproduced this kernel report:
Oops: general protection fault

Call Trace:
 
 copy_to_user_state_extra+0xb8d/0x1370 [xfrm_user]
 ? __pfx_copy_to_user_state_extra+0x10/0x10 [xfrm_user]
 ? __asan_memset+0x23/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __alloc_skb+0x342/0x960
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __asan_memset+0x23/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __nlmsg_put+0x147/0x1b0
 dump_one_state+0x1c7/0x3e0 [xfrm_user]
 xfrm_state_netlink+0xcb/0x130 [xfrm_user]
 ? __pfx_xfrm_state_netlink+0x10/0x10 [xfrm_user]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? xfrm_user_state_lookup.constprop.0+0x230/0x310 [xfrm_user]
 xfrm_get_sa+0x102/0x250 [xfrm_user]
 ? __pfx_xfrm_get_sa+0x10/0x10 [xfrm_user]
 xfrm_user_rcv_msg+0x504/0xaa0 [xfrm_user]
 ? __pfx_xfrm_user_rcv_msg+0x10/0x10 [xfrm_user]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? stack_trace_save+0x8e/0xc0
 ? __pfx_stack_trace_save+0x10/0x10
 netlink_rcv_skb+0x11f/0x350
 ? __pfx_xfrm_user_rcv_msg+0x10/0x10 [xfrm_user]
 ? __pfx_netlink_rcv_skb+0x10/0x10
 ? __pfx_mutex_lock+0x10/0x10
 ? srso_alias_return_thunk+0x5/0xfbef5
 xfrm_netlink_rcv+0x65/0x80 [xfrm_user]
 netlink_unicast+0x600/0x870
 ? __pfx_netlink_unicast+0x10/0x10
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __pfx_stack_trace_save+0x10/0x10
 netlink_sendmsg+0x75d/0xc10
 ? __pfx_netlink_sendmsg+0x10/0x10
 ? srso_alias_return_thunk+0x5/0xfbef5
 ____sys_sendmsg+0x77a/0x900
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __pfx_____sys_sendmsg+0x10/0x10
 ? __pfx_copy_msghdr_from_user+0x10/0x10
 ? release_sock+0x1a/0x1d0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? netlink_insert+0x143/0xec0
 ___sys_sendmsg+0xff/0x180
 ? __pfx____sys_sendmsg+0x10/0x10
 ? _raw_spin_lock_irqsave+0x85/0xe0
 ? do_getsockname+0xf9/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? fdget+0x53/0x3b0
 __sys_sendmsg+0x111/0x1a0
 ? __pfx___sys_sendmsg+0x10/0x10
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __sys_getsockname+0x8c/0x100
 do_syscall_64+0x102/0x5a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 07b87f9eea0c ("xfrm: Fix unregister netdevice hang on hardware offload.")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Cen Zhang 
Signed-off-by: Steffen Klassert

xfrm: fix sk_dst_cache double-free in xfrm_user_policy()

2026-07-02T07:02:59+00:00

xfrm_user_policy() clears the socket dst cache with __sk_dst_reset(),
i.e. the non-atomic __sk_dst_set(sk, NULL): it reads sk_dst_cache with
rcu_dereference_protected(), stores NULL and dst_release()s the old dst.
That is only safe if no other thread modifies sk_dst_cache concurrently.

For a connected UDP socket that does not hold: the transmit fast path
(udp_sendmsg -> sk_dst_check -> sk_dst_reset) resets the cache locklessly
with an atomic xchg(). A per-socket policy change racing a send can make
both sides observe the same old dst and each dst_release() it, dropping
the socket's single reference twice and freeing the xfrm_dst bundle while
it is still referenced:

  BUG: KASAN: slab-use-after-free in dst_release
  Write of size 4 at addr ffff88801897b6c0 by task exploit/155
  Call Trace:
   ...
   dst_release (... ./include/linux/rcuref.h:109)
   xfrm_user_policy (./include/net/sock.h:2239 ./include/net/sock.h:2256 net/xfrm/xfrm_state.c:3053)
   do_ip_setsockopt (net/ipv4/ip_sockglue.c:1347)
   ip_setsockopt (net/ipv4/ip_sockglue.c:1417)
   do_sock_setsockopt (net/socket.c:2368)
   __sys_setsockopt (net/socket.c:2393)
   __x64_sys_setsockopt (net/socket.c:2396)
   do_syscall_64 (arch/x86/entry/syscall_64.c:94)
   entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)

Reachable by an unprivileged user via a user+network namespace.

Use the atomic sk_dst_reset() so the cache is cleared and released with a
single xchg(): whichever side wins releases the dst once, the other sees
NULL and does nothing. Behaviour is otherwise unchanged.
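
The exchange-based reset can be illustrated with C11 atomics. This is a
userspace sketch with hypothetical names (toy_sock, toy_dst,
toy_sk_dst_reset()); it models the reference as a plain counter and is not
the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_dst {
	int refcnt;
};

struct toy_sock {
	_Atomic(struct toy_dst *) dst_cache;
};

/* Drop one reference; a NULL dst is a no-op, as for the loser of the race. */
void toy_dst_release(struct toy_dst *dst)
{
	if (dst)
		dst->refcnt--;
}

/* Swap the cache to NULL atomically; only the winner sees the old dst. */
void toy_sk_dst_reset(struct toy_sock *sk)
{
	struct toy_dst *old;

	assert(sk);
	old = atomic_exchange(&sk->dst_cache, (struct toy_dst *)NULL);
	toy_dst_release(old);
}
```

Calling toy_sk_dst_reset() twice on the same socket releases the dst once.
A read-then-store sequence gives no such guarantee when two callers
interleave.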

Fixes: 2b06cdf3e688 ("xfrm: Clear sk_dst_cache when applying per-socket policy.")
Fixes: be8f8284cd89 ("net: xfrm: allow clearing socket xfrm policies.")
Reported-by: AutonomousCodeSecurity@microsoft.com
Signed-off-by: Xiang Mei (Microsoft) 
Signed-off-by: Steffen Klassert

xfrm: nat_keepalive: avoid double free on send error

2026-06-30T13:59:54+00:00

nat_keepalive_send() frees the keepalive skb whenever the IPv4 or IPv6
send helper reports an error.

That cleanup is only correct before the skb is handed to the output
path. Once ip_build_and_send_pkt() or ip6_xmit() takes ownership, the
networking stack may already have consumed the skb before returning an
error, so freeing it again is unsafe.

Handle the pre-handoff failure cases inside nat_keepalive_send_ipv4()
and nat_keepalive_send_ipv6(), where the caller still owns the skb, and
keep nat_keepalive_send() responsible only for family dispatch and the
unsupported-family cleanup path.
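
The ownership split can be sketched with a free counter. Everything here is
hypothetical (toy_skb, toy_send_ipv4(), toy_keepalive_send()), and the
"route lookup" failure is only an assumed example of a pre-handoff error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_skb {
	int frees;	/* how many times the buffer was freed */
};

void toy_kfree_skb(struct toy_skb *skb)
{
	skb->frees++;
}

/*
 * Per-family helper. A failure before the handoff is cleaned up locally;
 * after the handoff the output path owns the skb and frees it itself, even
 * when it reports an error.
 */
int toy_send_ipv4(struct toy_skb *skb, bool route_ok, bool xmit_ok)
{
	if (!route_ok) {
		toy_kfree_skb(skb);	/* still our skb: free it here */
		return -EHOSTUNREACH;
	}
	toy_kfree_skb(skb);		/* stands for the stack consuming it */
	return xmit_ok ? 0 : -ENOBUFS;
}

/* Dispatcher: frees only when no helper ever took the skb. */
int toy_keepalive_send(struct toy_skb *skb, int family, bool route_ok,
		       bool xmit_ok)
{
	assert(skb);
	if (family == 4)
		return toy_send_ipv4(skb, route_ok, xmit_ok);
	toy_kfree_skb(skb);
	return -EAFNOSUPPORT;
}
```

On every path the buffer is freed exactly once; the old scheme would have
freed it a second time in the dispatcher after a post-handoff error.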

Fixes: f531d13bdfe3 ("xfrm: support sending NAT keepalives in ESP in UDP states")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan 
Reported-by: Xin Liu 
Signed-off-by: Qianyu Luo 
Signed-off-by: Ren Wei 
Reviewed-by: Eyal Birger 
Signed-off-by: Steffen Klassert

xfrm: fix stale skb->prev after async crypto steals a GSO segment

2026-06-26T06:13:55+00:00

skb_gso_segment() leaves the segment list head with ->prev pointing at
the last segment, an invariant validate_xmit_skb_list() relies on when
it sets its tail pointer (tail = skb->prev).

When validate_xmit_xfrm() walks a GSO list and some segments are stolen
by async crypto (->xmit() returns -EINPROGRESS), those segments are
unlinked from the list but the head ->prev is never updated.  If the
last segment is the one stolen, the returned head still has ->prev
pointing at it, even though it is now owned by the crypto engine and may
be freed.  validate_xmit_skb_list() later does tail->next = skb, writing
through that stale pointer -- a use-after-free.

Repoint skb->prev at the last retained segment before returning.
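
A userspace sketch of the repointing step, using a hypothetical toy_seg list
in place of the skb segment list and a `stolen` flag in place of the
-EINPROGRESS return from ->xmit().

```c
#include <assert.h>
#include <stddef.h>

struct toy_seg {
	struct toy_seg *next;
	struct toy_seg *prev;	/* on the head only: points at the tail */
	int stolen;		/* 1 if async crypto took this segment */
};

/*
 * Drop stolen segments from a NULL-terminated list and return the new head
 * (NULL if nothing is left). The head's ->prev is repointed at the last
 * retained segment so a later "tail = head->prev" stays valid.
 */
struct toy_seg *toy_filter_segs(struct toy_seg *head)
{
	struct toy_seg *new_head = NULL, *last = NULL, *seg, *next;

	for (seg = head; seg; seg = next) {
		next = seg->next;
		seg->next = NULL;
		if (seg->stolen)
			continue;	/* now owned by the crypto engine */
		if (last)
			last->next = seg;
		else
			new_head = seg;
		last = seg;
	}
	assert(!last || last->next == NULL);	/* list stays terminated */
	if (new_head)
		new_head->prev = last;
	return new_head;
}
```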

Fixes: f53c723902d1 ("net: Add asynchronous callbacks for xfrm on layer 2.")
Signed-off-by: Petr Wozniak 
Signed-off-by: Steffen Klassert

xfrm: propagate -EINPROGRESS from validate_xmit_xfrm()

2026-06-26T06:13:54+00:00

validate_xmit_xfrm() returns NULL both when a packet is dropped and
when it is stolen by async crypto (-EINPROGRESS from ->xmit()).
Callers cannot distinguish the two cases.

f53c723902d1 ("net: Add asynchronous callbacks for xfrm on layer 2.")
changed the semantics of a NULL return from "dropped" to "stolen or
dropped", but __dev_queue_xmit() was not updated.  On virtual/bridge
interfaces (noqueue qdisc) __dev_queue_xmit() initialises rc=-ENOMEM
and jumps to out: when skb is NULL, returning -ENOMEM to the caller
even though the packet will be delivered correctly via xfrm_dev_resume().

Return ERR_PTR(-EINPROGRESS) from validate_xmit_xfrm() for the async
case so callers can tell it apart from a real drop.  Update
__dev_queue_xmit() to handle ERR_PTR(-EINPROGRESS) from
validate_xmit_skb() correctly.  Update validate_xmit_skb_list() to
use IS_ERR_OR_NULL() so that ERR_PTR(-EINPROGRESS) is not mistakenly
added to the transmitted list.
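
The three-way return convention can be sketched in userspace with imitations
of the ERR_PTR helpers. All toy_* names are hypothetical, and mapping the
async case to a return value of 0 in the caller is an assumption about what
"handle correctly" means here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR helpers. */
static inline void *toy_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

enum toy_verdict { TOY_SEND, TOY_DROP, TOY_ASYNC };

/* Validation step: the packet, NULL for a drop, or an error pointer. */
void *toy_validate(void *pkt, enum toy_verdict v)
{
	if (v == TOY_ASYNC)
		return toy_err_ptr(-EINPROGRESS);
	return v == TOY_DROP ? NULL : pkt;
}

/* Caller: a stolen packet is not a failure, a dropped one is. */
int toy_queue_xmit(void *pkt, enum toy_verdict v)
{
	void *res;

	assert(pkt != NULL);
	res = toy_validate(pkt, v);
	if (toy_is_err(res))
		return toy_ptr_err(res) == -EINPROGRESS ? 0 : (int)toy_ptr_err(res);
	if (!res)
		return -ENOMEM;
	return 0;
}
```

A list walker would use the equivalent of IS_ERR_OR_NULL() on each result so
that neither NULL nor the error pointer is ever linked into the output list.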

Fixes: f53c723902d1 ("net: Add asynchronous callbacks for xfrm on layer 2.")
Suggested-by: Sabrina Dubroca 
Signed-off-by: Petr Wozniak 
Signed-off-by: Steffen Klassert

Merge tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

2026-06-23T23:22:24+00:00

Steffen Klassert says:

====================
pull request (net): ipsec 2026-06-22

1) xfrm: use compat translator only for u64 alignment mismatch
   Gate the XFRM_USER_COMPAT translator on COMPAT_FOR_U64_ALIGNMENT
   so 32-bit compat tasks on arches whose 32-bit ABI already matches
   the native 64-bit layout are no longer rejected with -EOPNOTSUPP.
   From Sanman Pradhan.

2) net: af_key: initialize alg_key_len for IPComp states
   Initialize the alg_key_len to 0 in the IPComp branch of
   pfkey_msg2xfrm_state() so an uninitialized value cannot drive
   xfrm_alg_len() into a slab-out-of-bounds kmemdup during
   XFRM_MSG_MIGRATE. From Zijing Yin.

3) xfrm: Fix dev use-after-free in xfrm async resumption
   Stash the original skb->dev and extend the RCU critical section
   across xfrm_rcv_cb() and transport_finish() to prevent a
   tunnel-device UAF and original-device refcount leak when a
   callback replaces skb->dev. From Dong Chenchen.

4) xfrm: Fix xfrm state cache insertion race
   Move the state-validity check inside xfrm_state_lock in the
   input state cache insertion path so a state cannot be killed
   between the check and the insert. From Herbert Xu.

5) xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
   Add READ_ONCE()/WRITE_ONCE() annotations on xfrm_policy_count
   and xfrm_policy_default to silence the KCSAN data race reported
   on net->xfrm.policy_count. From Eric Dumazet.

6) espintcp: use sk_msg_free_partial to fix partial send
   Replace the manual skmsg accounting in espintcp with
   sk_msg_free_partial() so the skmsg stays consistent on every
   iteration and the partial-send accounting bugs go away.
   From Sabrina Dubroca.

7) xfrm: validate selector family and prefixlen during match
   Reject mismatched address families in xfrm_selector_match() and
   bound prefixlen in addr4_match()/addr_match() to prevent the
   shift-out-of-bounds syzbot reported when an AF_UNSPEC selector
   with a large prefixlen is matched against an IPv4 flow.
   From Eric Dumazet.

* tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: validate selector family and prefixlen during match
  espintcp: use sk_msg_free_partial to fix partial send
  xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
  xfrm: Fix xfrm state cache insertion race
  xfrm: Fix dev use-after-free in xfrm async resumption
  net: af_key: initialize alg_key_len for IPComp states
  xfrm: use compat translator only for u64 alignment mismatch
====================
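
Item 7 of the pull request above concerns a shift whose count depends on
prefixlen. A userspace sketch of a bounded prefix match follows, using
host-order addresses and a hypothetical toy_addr4_match(). Treating an
out-of-range prefixlen as a non-match is an assumption; the summary only
says the value is bounded.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare the top `prefixlen` bits of two host-order IPv4 addresses.
 * An out-of-range prefixlen would make the shift count negative, which is
 * undefined behaviour in C, so it is rejected up front.
 */
bool toy_addr4_match(uint32_t a1, uint32_t a2, unsigned int prefixlen)
{
	uint32_t mask;

	if (prefixlen > 32)
		return false;
	if (prefixlen == 0)
		return true;	/* a 32-bit shift by 32 is undefined too */
	mask = ~UINT32_C(0) << (32 - prefixlen);
	return ((a1 ^ a2) & mask) == 0;
}
```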

Link: https://patch.msgid.link/20260622075726.29685-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski