summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)Author
5 daysnetfilter: nfnetlink_queue: make hash table per queueFlorian Westphal
Sharing a global hash table among all queues is tempting, but it can cause crash: BUG: KASAN: slab-use-after-free in nfqnl_recv_verdict+0x11ac/0x15e0 [nfnetlink_queue] [..] nfqnl_recv_verdict+0x11ac/0x15e0 [nfnetlink_queue] nfnetlink_rcv_msg+0x46a/0x930 kmem_cache_alloc_node_noprof+0x11e/0x450 struct nf_queue_entry is freed via kfree, but parallel cpu can still encounter such an nf_queue_entry when walking the list. Alternative fix is to free the nf_queue_entry via kfree_rcu() instead, but as we have to alloc/free for each skb this will cause more mem pressure. Cc: Scott Mitchell <scott.k.mitch1@gmail.com> Fixes: e19079adcd26 ("netfilter: nfnetlink_queue: optimize verdict lookup with hash table") Signed-off-by: Florian Westphal <fw@strlen.de>
5 daysnetfilter: nft_ct: fix use-after-free in timeout object destroyTuan Do
nft_ct_timeout_obj_destroy() frees the timeout object with kfree() immediately after nf_ct_untimeout(), without waiting for an RCU grace period. Concurrent packet processing on other CPUs may still hold RCU-protected references to the timeout object obtained via rcu_dereference() in nf_ct_timeout_data(). Add an rcu_head to struct nf_ct_timeout and use kfree_rcu() to defer freeing until after an RCU grace period, matching the approach already used in nfnetlink_cttimeout.c. KASAN report: BUG: KASAN: slab-use-after-free in nf_conntrack_tcp_packet+0x1381/0x29d0 Read of size 4 at addr ffff8881035fe19c by task exploit/80 Call Trace: nf_conntrack_tcp_packet+0x1381/0x29d0 nf_conntrack_in+0x612/0x8b0 nf_hook_slow+0x70/0x100 __ip_local_out+0x1b2/0x210 tcp_sendmsg_locked+0x722/0x1580 __sys_sendto+0x2d8/0x320 Allocated by task 75: nft_ct_timeout_obj_init+0xf6/0x290 nft_obj_init+0x107/0x1b0 nf_tables_newobj+0x680/0x9c0 nfnetlink_rcv_batch+0xc29/0xe00 Freed by task 26: nft_obj_destroy+0x3f/0xa0 nf_tables_trans_destroy_work+0x51c/0x5c0 process_one_work+0x2c4/0x5a0 Fixes: 7e0b2b57f01d ("netfilter: nft_ct: add ct timeout support") Cc: stable@vger.kernel.org Signed-off-by: Tuan Do <tuan@calif.io> Signed-off-by: Florian Westphal <fw@strlen.de>
5 daysnetfilter: xt_multiport: validate range encoding in checkentryRen Wei
ports_match_v1() treats any non-zero pflags entry as the start of a port range and unconditionally consumes the next ports[] element as the range end. The checkentry path currently validates protocol, flags and count, but it does not validate the range encoding itself. As a result, malformed rules can mark the last slot as a range start or place two range starts back to back, leaving ports_match_v1() to step past the last valid ports[] element while interpreting the rule. Reject malformed multiport v1 rules in checkentry by validating that each range start has a following element and that the following element is not itself marked as another range start. Fixes: a89ecb6a2ef7 ("[NETFILTER]: x_tables: unify IPv4/IPv6 multiport match") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Yuhang Zheng <z1652074432@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>
5 daysnetfilter: nfnetlink_log: initialize nfgenmsg in NLMSG_DONE terminatorXiang Mei
When batching multiple NFLOG messages (inst->qlen > 1), __nfulnl_send() appends an NLMSG_DONE terminator with sizeof(struct nfgenmsg) payload via nlmsg_put(), but never initializes the nfgenmsg bytes. The nlmsg_put() helper only zeroes alignment padding after the payload, not the payload itself, so four bytes of stale kernel heap data are leaked to userspace in the NLMSG_DONE message body. Use nfnl_msg_put() to build the NLMSG_DONE terminator, which initializes the nfgenmsg payload via nfnl_fill_hdr(), consistent with how __build_packet_message() already constructs NFULNL_MSG_PACKET headers. Fixes: 29c5d4afba51 ("[NETFILTER]: nfnetlink_log: fix sending of multipart messages") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Florian Westphal <fw@strlen.de>
5 daysipvs: fix NULL deref in ip_vs_add_service error pathWeiming Shi
When ip_vs_bind_scheduler() succeeds in ip_vs_add_service(), the local variable sched is set to NULL. If ip_vs_start_estimator() subsequently fails, the out_err cleanup calls ip_vs_unbind_scheduler(svc, sched) with sched == NULL. ip_vs_unbind_scheduler() passes the cur_sched NULL check (because svc->scheduler was set by the successful bind) but then dereferences the NULL sched parameter at sched->done_service, causing a kernel panic at offset 0x30 from NULL. Oops: general protection fault, [..] [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] RIP: 0010:ip_vs_unbind_scheduler (net/netfilter/ipvs/ip_vs_sched.c:69) Call Trace: <TASK> ip_vs_add_service.isra.0 (net/netfilter/ipvs/ip_vs_ctl.c:1500) do_ip_vs_set_ctl (net/netfilter/ipvs/ip_vs_ctl.c:2809) nf_setsockopt (net/netfilter/nf_sockopt.c:102) [..] Fix by simply not clearing the local sched variable after a successful bind. ip_vs_unbind_scheduler() already detects whether a scheduler is installed via svc->scheduler, and keeping sched non-NULL ensures the error path passes the correct pointer to both ip_vs_unbind_scheduler() and ip_vs_scheduler_put(). While the bug is older, the problem popups in more recent kernels (6.2), when the new error path is taken after the ip_vs_start_estimator() call. Fixes: 705dd3444081 ("ipvs: use kthreads for stats estimation") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Acked-by: Simon Horman <horms@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
12 daysnetfilter: nf_tables: reject immediate NF_QUEUE verdictPablo Neira Ayuso
nft_queue is always used from userspace nftables to deliver the NF_QUEUE verdict. Immediately emitting an NF_QUEUE verdict is never used by the userspace nft tools, so reject immediate NF_QUEUE verdicts. The arp family does not provide queue support, but such an immediate verdict is still reachable. Globally reject NF_QUEUE immediate verdicts to address this issue. Fixes: f342de4e2f33 ("netfilter: nf_tables: reject QUEUE/DROP verdict parameters") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
12 daysnetfilter: x_tables: restrict xt_check_match/xt_check_target extensions for ↵Pablo Neira Ayuso
NFPROTO_ARP Weiming Shi says: xt_match and xt_target structs registered with NFPROTO_UNSPEC can be loaded by any protocol family through nft_compat. When such a match/target sets .hooks to restrict which hooks it may run on, the bitmask uses NF_INET_* constants. This is only correct for families whose hook layout matches NF_INET_*: IPv4, IPv6, INET, and bridge all share the same five hooks (PRE_ROUTING ... POST_ROUTING). ARP only has three hooks (IN=0, OUT=1, FORWARD=2) with different semantics. Because NF_ARP_OUT == 1 == NF_INET_LOCAL_IN, the .hooks validation silently passes for the wrong reasons, allowing matches to run on ARP chains where the hook assumptions (e.g. state->in being set on input hooks) do not hold. This leads to NULL pointer dereferences; xt_devgroup is one concrete example: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000044: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000220-0x0000000000000227] RIP: 0010:devgroup_mt+0xff/0x350 Call Trace: <TASK> nft_match_eval (net/netfilter/nft_compat.c:407) nft_do_chain (net/netfilter/nf_tables_core.c:285) nft_do_chain_arp (net/netfilter/nft_chain_filter.c:61) nf_hook_slow (net/netfilter/core.c:623) arp_xmit (net/ipv4/arp.c:666) </TASK> Kernel panic - not syncing: Fatal exception in interrupt Fix it by restricting arptables to NFPROTO_ARP extensions only. Note that arptables-legacy only supports: - arpt_CLASSIFY - arpt_mangle - arpt_MARK that provide explicit NFPROTO_ARP match/target declarations. Fixes: 9291747f118d ("netfilter: xtables: add device group match") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
12 daysnetfilter: ipset: drop logically empty buckets in mtype_delYifan Wu
mtype_del() counts empty slots below n->pos in k, but it only drops the bucket when both n->pos and k are zero. This misses buckets whose live entries have all been removed while n->pos still points past deleted slots. Treat a bucket as empty when all positions below n->pos are unused and release it directly instead of shrinking it further. Fixes: 8af1c6fbd923 ("netfilter: ipset: Fix forceadd evaluation path") Cc: stable@vger.kernel.org Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <dstsmallbird@foxmail.com> Signed-off-by: Yifan Wu <yifanwucs@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Reviewed-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
12 daysnetfilter: ctnetlink: ignore explicit helper on new expectationsPablo Neira Ayuso
Use the existing master conntrack helper, anything else is not really supported and it just makes validation more complicated, so just ignore what helper userspace suggests for this expectation. This was uncovered when validating CTA_EXPECT_CLASS via different helper provided by userspace than the existing master conntrack helper: BUG: KASAN: slab-out-of-bounds in nf_ct_expect_related_report+0x2479/0x27c0 Read of size 4 at addr ffff8880043fe408 by task poc/102 Call Trace: nf_ct_expect_related_report+0x2479/0x27c0 ctnetlink_create_expect+0x22b/0x3b0 ctnetlink_new_expect+0x4bd/0x5c0 nfnetlink_rcv_msg+0x67a/0x950 netlink_rcv_skb+0x120/0x350 Allowing to read kernel memory bytes off the expectation boundary. CTA_EXPECT_HELP_NAME is still used to offer the helper name to userspace via netlink dump. Fixes: bd0779370588 ("netfilter: nfnetlink_queue: allow to attach expectations to conntracks") Reported-by: Qi Tang <tpluszz77@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
12 daysnetfilter: ctnetlink: zero expect NAT fields when CTA_EXPECT_NAT absentQi Tang
ctnetlink_alloc_expect() allocates expectations from a non-zeroing slab cache via nf_ct_expect_alloc(). When CTA_EXPECT_NAT is not present in the netlink message, saved_addr and saved_proto are never initialized. Stale data from a previous slab occupant can then be dumped to userspace by ctnetlink_exp_dump_expect(), which checks these fields to decide whether to emit CTA_EXPECT_NAT. The safe sibling nf_ct_expect_init(), used by the packet path, explicitly zeroes these fields. Zero saved_addr, saved_proto and dir in the else branch, guarded by IS_ENABLED(CONFIG_NF_NAT) since these fields only exist when NAT is enabled. Confirmed by priming the expect slab with NAT-bearing expectations, freeing them, creating a new expectation without CTA_EXPECT_NAT, and observing that the ctnetlink dump emits a spurious CTA_EXPECT_NAT containing stale data from the prior allocation. Fixes: 076a0ca02644 ("netfilter: ctnetlink: add NAT support for expectations") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Qi Tang <tpluszz77@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
12 daysnetfilter: nf_conntrack_helper: pass helper to expect cleanupQi Tang
nf_conntrack_helper_unregister() calls nf_ct_expect_iterate_destroy() to remove expectations belonging to the helper being unregistered. However, it passes NULL instead of the helper pointer as the data argument, so expect_iter_me() never matches any expectation and all of them survive the cleanup. After unregister returns, nfnl_cthelper_del() frees the helper object immediately. Subsequent expectation dumps or packet-driven init_conntrack() calls then dereference the freed exp->helper, causing a use-after-free. Pass the actual helper pointer so expectations referencing it are properly destroyed before the helper object is freed. BUG: KASAN: slab-use-after-free in string+0x38f/0x430 Read of size 1 at addr ffff888003b14d20 by task poc/103 Call Trace: string+0x38f/0x430 vsnprintf+0x3cc/0x1170 seq_printf+0x17a/0x240 exp_seq_show+0x2e5/0x560 seq_read_iter+0x419/0x1280 proc_reg_read+0x1ac/0x270 vfs_read+0x179/0x930 ksys_read+0xef/0x1c0 Freed by task 103: The buggy address is located 32 bytes inside of freed 192-byte region [ffff888003b14d00, ffff888003b14dc0) Fixes: ac7b84839003 ("netfilter: expect: add and use nf_ct_expect_iterate helpers") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Reviewed-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
12 daysnetfilter: ipset: use nla_strcmp for IPSET_ATTR_NAME attrFlorian Westphal
IPSET_ATTR_NAME and IPSET_ATTR_NAMEREF are of NLA_STRING type, they cannot be treated like a c-string. They either have to be switched to NLA_NUL_STRING, or the compare operations need to use the nla functions. Fixes: f830837f0eed ("netfilter: ipset: list:set set type support") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
12 daysnetfilter: x_tables: ensure names are nul-terminatedFlorian Westphal
Reject names that lack a \0 character before feeding them to functions that expect c-strings. Fixes tag is the most recent commit that needs this change. Fixes: c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
12 daysnetfilter: nfnetlink_log: account for netlink header sizeFlorian Westphal
This is a followup to an old bug fix: NLMSG_DONE needs to account for the netlink header size, not just the attribute size. This can result in a WARN splat + drop of the netlink message, but other than this there are no ill effects. Fixes: 9dfa1dfe4d5e ("netfilter: nf_log: account for size of NLMSG_DONE attribute") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
12 daysnetfilter: flowtable: strictly check for maximum number of actionsPablo Neira Ayuso
The maximum number of flowtable hardware offload actions in IPv6 is: * ethernet mangling (4 payload actions, 2 for each ethernet address) * SNAT (4 payload actions) * DNAT (4 payload actions) * Double VLAN (4 vlan actions, 2 for popping vlan, and 2 for pushing) for QinQ. * Redirect (1 action) Which makes 17, while the maximum is 16. But act_ct supports for tunnels actions too. Note that payload action operates at 32-bit word level, so mangling an IPv6 address takes 4 payload actions. Update flow_action_entry_next() calls to check for the maximum number of supported actions. While at it, rise the maximum number of actions per flow from 16 to 24 so this works fine with IPv6 setups. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Reported-by: Hyunwoo Kim <imv4bel@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26netfilter: ctnetlink: use netlink policy range checksDavid Carlier
Replace manual range and mask validations with netlink policy annotations in ctnetlink code paths, so that the netlink core rejects invalid values early and can generate extack errors. - CTA_PROTOINFO_TCP_STATE: reject values > TCP_CONNTRACK_SYN_SENT2 at policy level, removing the manual >= TCP_CONNTRACK_MAX check. - CTA_PROTOINFO_TCP_WSCALE_ORIGINAL/REPLY: reject values > TCP_MAX_WSCALE (14). The normal TCP option parsing path already clamps to this value, but the ctnetlink path accepted 0-255, causing undefined behavior when used as a u32 shift count. - CTA_FILTER_ORIG_FLAGS/REPLY_FLAGS: use NLA_POLICY_MASK with CTA_FILTER_F_ALL, removing the manual mask checks. - CTA_EXPECT_FLAGS: use NLA_POLICY_MASK with NF_CT_EXPECT_MASK, adding a new mask define grouping all valid expect flags. Extracted from a broader nf-next patch by Florian Westphal, scoped to ctnetlink for the fixes tree. Fixes: c8e2078cfe41 ("[NETFILTER]: ctnetlink: add support for internal tcp connection tracking flags handling") Signed-off-by: David Carlier <devnexen@gmail.com> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26netfilter: nf_conntrack_sip: fix use of uninitialized rtp_addr in process_sdpWeiming Shi
process_sdp() declares union nf_inet_addr rtp_addr on the stack and passes it to the nf_nat_sip sdp_session hook after walking the SDP media descriptions. However rtp_addr is only initialized inside the media loop when a recognized media type with a non-zero port is found. If the SDP body contains no m= lines, only inactive media sections (m=audio 0 ...) or only unrecognized media types, rtp_addr is never assigned. Despite that, the function still calls hooks->sdp_session() with &rtp_addr, causing nf_nat_sdp_session() to format the stale stack value as an IP address and rewrite the SDP session owner and connection lines with it. With CONFIG_INIT_STACK_ALL_ZERO (default on most distributions) this results in the session-level o= and c= addresses being rewritten to 0.0.0.0 for inactive SDP sessions. Without stack auto-init the rewritten address is whatever happened to be on the stack. Fix this by pre-initializing rtp_addr from the session-level connection address (caddr) when available, and tracking via a have_rtp_addr flag whether any valid address was established. Skip the sdp_session hook entirely when no valid address exists. Fixes: 4ab9e64e5e3c ("[NETFILTER]: nf_nat_sip: split up SDP mangling") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26netfilter: nf_conntrack_expect: skip expectations in other netns via procPablo Neira Ayuso
Skip expectations that do not reside in this netns. Similar to e77e6ff502ea ("netfilter: conntrack: do not dump other netns's conntrack entries via proc"). Fixes: 9b03f38d0487 ("netfilter: netns nf_conntrack: per-netns expectations") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26netfilter: nf_conntrack_expect: store netns and zone in expectationPablo Neira Ayuso
__nf_ct_expect_find() and nf_ct_expect_find_get() are called under rcu_read_lock() but they dereference the master conntrack via exp->master. Since the expectation does not hold a reference on the master conntrack, this could be dying conntrack or different recycled conntrack than the real master due to SLAB_TYPESAFE_RCU. Store the netns, the master_tuple and the zone in struct nf_conntrack_expect as a safety measure. This patch is required by the follow up fix not to dump expectations that do not belong to this netns. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26netfilter: ctnetlink: ensure safe access to master conntrackPablo Neira Ayuso
Holding reference on the expectation is not sufficient, the master conntrack object can just go away, making exp->master invalid. To access exp->master safely: - Grab the nf_conntrack_expect_lock, this gets serialized with clean_from_lists() which also holds this lock when the master conntrack goes away. - Hold reference on master conntrack via nf_conntrack_find_get(). Not so easy since the master tuple to look up for the master conntrack is not available in the existing problematic paths. This patch goes for extending the nf_conntrack_expect_lock section to address this issue for simplicity, in the cases that are described below this is just slightly extending the lock section. The add expectation command already holds a reference to the master conntrack from ctnetlink_create_expect(). However, the delete expectation command needs to grab the spinlock before looking up for the expectation. Expand the existing spinlock section to address this to cover the expectation lookup. Note that, the nf_ct_expect_iterate_net() calls already grabs the spinlock while iterating over the expectation table, which is correct. The get expectation command needs to grab the spinlock to ensure master conntrack does not go away. This also expands the existing spinlock section to cover the expectation lookup too. I needed to move the netlink skb allocation out of the spinlock to keep it GFP_KERNEL. For the expectation events, the IPEXP_DESTROY event is already delivered under the spinlock, just move the delivery of IPEXP_NEW under the spinlock too because the master conntrack event cache is reached through exp->master. While at it, add lockdep notations to help identify what codepaths need to grab the spinlock. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26netfilter: nf_conntrack_expect: use expect->helperPablo Neira Ayuso
Use expect->helper in ctnetlink and /proc to dump the helper name. Using nfct_help() without holding a reference to the master conntrack is unsafe. Use exp->master->helper in ctnetlink path if userspace does not provide an explicit helper when creating an expectation to retain the existing behaviour. The ctnetlink expectation path holds the reference on the master conntrack and nf_conntrack_expect lock and the nfnetlink glue path refers to the master ct that is attached to the skb. Reported-by: Hyunwoo Kim <imv4bel@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26netfilter: nf_conntrack_expect: honor expectation helper fieldPablo Neira Ayuso
The expectation helper field is mostly unused. As a result, the netfilter codebase relies on accessing the helper through exp->master. Always set on the expectation helper field so it can be used to reach the helper. nf_ct_expect_init() is called from packet path where the skb owns the ct object, therefore accessing exp->master for the newly created expectation is safe. This saves a lot of updates in all callsites to pass the ct object as parameter to nf_ct_expect_init(). This is a preparation patches for follow up fixes. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26netfilter: nft_set_rbtree: revisit array resize logicPablo Neira Ayuso
Chris Arges reports high memory consumption with thousands of containers, this patch revisits the array allocation logic. For anonymous sets, start by 16 slots (which takes 256 bytes on x86_64). Expand it by x2 until threshold of 512 slots is reached, over that threshold, expand it by x1.5. For non-anonymous set, start by 1024 slots in the array (which takes 16 Kbytes initially on x86_64). Expand it by x1.5. Use set->ndeact to subtract deactivated elements when calculating the number of the slots in the array, otherwise the array size array gets increased artifically. Add special case shrink logic to deal with flush set too. The shrink logic is skipped by anonymous sets. Use check_add_overflow() to calculate the new array size. Add a WARN_ON_ONCE check to make sure elements fit into the new array size. Reported-by: Chris Arges <carges@cloudflare.com> Fixes: 7e43e0a1141d ("netfilter: nft_set_rbtree: translate rbtree to array for binary search") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26netfilter: nfnetlink_log: fix uninitialized padding leak in NFULA_PAYLOADWeiming Shi
__build_packet_message() manually constructs the NFULA_PAYLOAD netlink attribute using skb_put() and skb_copy_bits(), bypassing the standard nla_reserve()/nla_put() helpers. While nla_total_size(data_len) bytes are allocated (including NLA alignment padding), only data_len bytes of actual packet data are copied. The trailing nla_padlen(data_len) bytes (1-3 when data_len is not 4-byte aligned) are never initialized, leaking stale heap contents to userspace via the NFLOG netlink socket. Replace the manual attribute construction with nla_reserve(), which handles the tailroom check, header setup, and padding zeroing via __nla_reserve(). The subsequent skb_copy_bits() fills in the payload data on top of the properly initialized attribute. Fixes: df6fb868d611 ("[NETFILTER]: nfnetlink: convert to generic netlink attribute functions") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-25netfilter: nft_set_pipapo_avx2: don't return non-matching entry on expiryFlorian Westphal
New test case fails unexpectedly when avx2 matching functions are used. The test first loads a ranomly generated pipapo set with 'ipv4 . port' key, i.e. nft -f foo. This works. Then, it reloads the set after a flush: (echo flush set t s; cat foo) | nft -f - This is expected to work, because its the same set after all and it was already loaded once. But with avx2, this fails: nft reports a clashing element. The reported clash is of following form: We successfully re-inserted a . b c . d Then we try to insert a . d avx2 finds the already existing a . d, which (due to 'flush set') is marked as invalid in the new generation. It skips the element and moves to next. Due to incorrect masking, the skip-step finds the next matching element *only considering the first field*, i.e. we return the already reinserted "a . b", even though the last field is different and the entry should not have been matched. No such error is reported for the generic c implementation (no avx2) or when the last field has to use the 'nft_pipapo_avx2_lookup_slow' fallback. Bisection points to 7711f4bb4b36 ("netfilter: nft_set_pipapo: fix range overlap detection") but that fix merely uncovers this bug. Before this commit, the wrong element is returned, but erronously reported as a full, identical duplicate. The root-cause is too early return in the avx2 match functions. When we process the last field, we should continue to process data until the entire input size has been consumed to make sure no stale bits remain in the map. Link: https://lore.kernel.org/netfilter-devel/20260321152506.037f68c0@elisabeth/ Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-19nfnetlink_osf: validate individual option lengths in fingerprintsWeiming Shi
nfnl_osf_add_callback() validates opt_num bounds and string NUL-termination but does not check individual option length fields. A zero-length option causes nf_osf_match_one() to enter the option matching loop even when foptsize sums to zero, which matches packets with no TCP options where ctx->optp is NULL: Oops: general protection fault KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98) Call Trace: nf_osf_match (net/netfilter/nfnetlink_osf.c:227) xt_osf_match_packet (net/netfilter/xt_osf.c:32) ipt_do_table (net/ipv4/netfilter/ip_tables.c:293) nf_hook_slow (net/netfilter/core.c:623) ip_local_deliver (net/ipv4/ip_input.c:262) ip_rcv (net/ipv4/ip_input.c:573) Additionally, an MSS option (kind=2) with length < 4 causes out-of-bounds reads when nf_osf_match_one() unconditionally accesses optp[2] and optp[3] for MSS value extraction. While RFC 9293 section 3.2 specifies that the MSS option is always exactly 4 bytes (Kind=2, Length=4), the check uses "< 4" rather than "!= 4" because lengths greater than 4 do not cause memory safety issues -- the buffer is guaranteed to be at least foptsize bytes by the ctx->optsize == foptsize check. Reject fingerprints where any option has zero length, or where an MSS option has length less than 4, at add time rather than trusting these values in the packet matching hot path. Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-19netfilter: nf_tables: release flowtable after rcu grace period on errorPablo Neira Ayuso
Call synchronize_rcu() after unregistering the hooks from error path, since a hook that already refers to this flowtable can be already registered, exposing this flowtable to packet path and nfnetlink_hook control plane. This error path is rare, it should only happen by reaching the maximum number hooks or by failing to set up to hardware offload, just call synchronize_rcu(). There is a check for already used device hooks by different flowtable that could result in EEXIST at this late stage. The hook parser can be updated to perform this check earlier to this error path really becomes rarely exercised. Uncovered by KASAN reported as use-after-free from nfnetlink_hook path when dumping hooks. Fixes: 3b49e2e94e6e ("netfilter: nf_tables: add flow table netlink frontend") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-19netfilter: bpf: defer hook memory release until rcu readers are doneFlorian Westphal
Yiming Qian reports UaF when concurrent process is dumping hooks via nfnetlink_hooks: BUG: KASAN: slab-use-after-free in nfnl_hook_dump_one.isra.0+0xe71/0x10f0 Read of size 8 at addr ffff888003edbf88 by task poc/79 Call Trace: <TASK> nfnl_hook_dump_one.isra.0+0xe71/0x10f0 netlink_dump+0x554/0x12b0 nfnl_hook_get+0x176/0x230 [..] Defer release until after concurrent readers have completed. Reported-by: Yiming Qian <yimingqian591@gmail.com> Fixes: 84601d6ee68a ("bpf: add bpf_link support for BPF_NETFILTER programs") Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: nf_conntrack_h323: check for zero length in DecodeQ931()Jenny Guanni Qu
In DecodeQ931(), the UserUserIE code path reads a 16-bit length from the packet, then decrements it by 1 to skip the protocol discriminator byte before passing it to DecodeH323_UserInformation(). If the encoded length is 0, the decrement wraps to -1, which is then passed as a large value to the decoder, leading to an out-of-bounds read. Add a check to ensure len is positive after the decrement. Fixes: 5e35941d9901 ("[NETFILTER]: Add H.323 conntrack/NAT helper") Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com> Reported-by: Dawid Moczadło <dawid@vidocsecurity.com> Tested-by: Jenny Guanni Qu <qguanni@gmail.com> Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: xt_time: use unsigned int for monthday bit shiftJenny Guanni Qu
The monthday field can be up to 31, and shifting a signed integer 1 by 31 positions (1 << 31) is undefined behavior in C, as the result overflows a 32-bit signed int. Use 1U to ensure well-defined behavior for all valid monthday values. Change the weekday shift to 1U as well for consistency. Fixes: ee4411a1b1e0 ("[NETFILTER]: x_tables: add xt_time match") Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com> Reported-by: Dawid Moczadło <dawid@vidocsecurity.com> Tested-by: Jenny Guanni Qu <qguanni@gmail.com> Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: xt_CT: drop pending enqueued packets on template removalPablo Neira Ayuso
Templates refer to objects that can go away while packets are sitting in nfqueue refer to: - helper, this can be an issue on module removal. - timeout policy, nfnetlink_cttimeout might remove it. The use of templates with zone and event cache filter are safe, since this just copies values. Flush these enqueued packets in case the template rule gets removed. Fixes: 24de58f46516 ("netfilter: xt_CT: allow to attach timeout policy + glue code") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: nft_ct: drop pending enqueued packets on removalPablo Neira Ayuso
Packets sitting in nfqueue might hold a reference to: - templates that specify the conntrack zone, because a percpu area is used and module removal is possible. - conntrack timeout policies and helper, where object removal leave a stale reference. Since these objects can just go away, drop enqueued packets to avoid stale reference to them. If there is a need for finer grain removal, this logic can be revisited to make selective packet drop upon dependencies. Fixes: 7e0b2b57f01d ("netfilter: nft_ct: add ct timeout support") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13nf_tables: nft_dynset: fix possible stateful expression memleak in error pathPablo Neira Ayuso
If cloning the second stateful expression in the element via GFP_ATOMIC fails, then the first stateful expression remains in place without being released.   unreferenced object (percpu) 0x607b97e9cab8 (size 16):     comm "softirq", pid 0, jiffies 4294931867     hex dump (first 16 bytes on cpu 3):       00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00     backtrace (crc 0):       pcpu_alloc_noprof+0x453/0xd80       nft_counter_clone+0x9c/0x190 [nf_tables]       nft_expr_clone+0x8f/0x1b0 [nf_tables]       nft_dynset_new+0x2cb/0x5f0 [nf_tables]       nft_rhash_update+0x236/0x11c0 [nf_tables]       nft_dynset_eval+0x11f/0x670 [nf_tables]       nft_do_chain+0x253/0x1700 [nf_tables]       nft_do_chain_ipv4+0x18d/0x270 [nf_tables]       nf_hook_slow+0xaa/0x1e0       ip_local_deliver+0x209/0x330 Fixes: 563125a73ac3 ("netfilter: nftables: generalize set extension to support for several expressions") Reported-by: Gurpreet Shergill <giki.shergill@proton.me> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: nf_conntrack_h323: fix OOB read in decode_int() CONS caseJenny Guanni Qu
In decode_int(), the CONS case calls get_bits(bs, 2) to read a length value, then calls get_uint(bs, len) without checking that len bytes remain in the buffer. The existing boundary check only validates the 2 bits for get_bits(), not the subsequent 1-4 bytes that get_uint() reads. This allows a malformed H.323/RAS packet to cause a 1-4 byte slab-out-of-bounds read. Add a boundary check for len bytes after get_bits() and before get_uint(). Fixes: 5e35941d9901 ("[NETFILTER]: Add H.323 conntrack/NAT helper") Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com> Reported-by: Dawid Moczadło <dawid@vidocsecurity.com> Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: nf_flow_table_ip: reset mac header before vlan pushEric Woudstra
With double vlan tagged packets in the fastpath, getting the error: skb_vlan_push got skb with skb->data not at mac header (offset 18) Call skb_reset_mac_header() before calling skb_vlan_push(). Fixes: c653d5a78f34 ("netfilter: flowtable: inline vlan encapsulation in xmit path") Signed-off-by: Eric Woudstra <ericwouds@gmail.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: revert nft_set_rbtree: validate open interval overlapFlorian Westphal
This reverts commit 648946966a08 ("netfilter: nft_set_rbtree: validate open interval overlap"). There have been reports of nft failing to laod valid rulesets after this patch was merged into -stable. I can reproduce several such problem with recent nft versions, including nft 1.1.6 which is widely shipped by distributions. We currently have little choice here. This commit can be resurrected at some point once the nftables fix that triggers the false overlap positive has appeared in common distros (see e83e32c8d1cd ("mnl: restore create element command with large batches" in nftables.git). Fixes: 648946966a08 ("netfilter: nft_set_rbtree: validate open interval overlap") Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: nf_conntrack_sip: fix Content-Length u32 truncation in sip_help_tcp()Lukas Johannes Möller
sip_help_tcp() parses the SIP Content-Length header with simple_strtoul(), which returns unsigned long, but stores the result in unsigned int clen. On 64-bit systems, values exceeding UINT_MAX are silently truncated before computing the SIP message boundary. For example, Content-Length 4294967328 (2^32 + 32) is truncated to 32, causing the parser to miscalculate where the current message ends. The loop then treats trailing data in the TCP segment as a second SIP message and processes it through the SDP parser. Fix this by changing clen to unsigned long to match the return type of simple_strtoul(), and reject Content-Length values that exceed the remaining TCP payload length. Fixes: f5b321bd37fb ("netfilter: nf_conntrack_sip: add TCP support") Signed-off-by: Lukas Johannes Möller <research@johannes-moeller.dev> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: conntrack: add missing netlink policy validationsFlorian Westphal
Hyunwoo Kim reports out-of-bounds access in sctp and ctnetlink. These attributes are used by the kernel without any validation. Extend the netlink policies accordingly. Quoting the reporter: nlattr_to_sctp() assigns the user-supplied CTA_PROTOINFO_SCTP_STATE value directly to ct->proto.sctp.state without checking that it is within the valid range. [..] and: ... with exp->dir = 100, the access at ct->master->tuplehash[100] reads 5600 bytes past the start of a 320-byte nf_conn object, causing a slab-out-of-bounds read confirmed by UBSAN. Fixes: 076a0ca02644 ("netfilter: ctnetlink: add NAT support for expectations") Fixes: a258860e01b8 ("netfilter: ctnetlink: add full support for SCTP to ctnetlink") Reported-by: Hyunwoo Kim <imv4bel@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-13netfilter: ctnetlink: fix use-after-free in ctnetlink_dump_exp_ct()Hyunwoo Kim
ctnetlink_dump_exp_ct() stores a conntrack pointer in cb->data for the netlink dump callback ctnetlink_exp_ct_dump_table(), but drops the conntrack reference immediately after netlink_dump_start(). When the dump spans multiple rounds, the second recvmsg() triggers the dump callback which dereferences the now-freed conntrack via nfct_help(ct), leading to a use-after-free on ct->ext. The bug is that the netlink_dump_control has no .start or .done callbacks to manage the conntrack reference across dump rounds. Other dump functions in the same file (e.g. ctnetlink_get_conntrack) properly use .start/.done callbacks for this purpose. Fix this by adding .start and .done callbacks that hold and release the conntrack reference for the duration of the dump, and move the nfct_help() call after the cb->args[0] early-return check in the dump callback to avoid dereferencing ct->ext unnecessarily. BUG: KASAN: slab-use-after-free in ctnetlink_exp_ct_dump_table+0x4f/0x2e0 Read of size 8 at addr ffff88810597ebf0 by task ctnetlink_poc/133 CPU: 1 UID: 0 PID: 133 Comm: ctnetlink_poc Not tainted 7.0.0-rc2+ #3 PREEMPTLAZY Call Trace: <TASK> ctnetlink_exp_ct_dump_table+0x4f/0x2e0 netlink_dump+0x333/0x880 netlink_recvmsg+0x3e2/0x4b0 ? aa_sk_perm+0x184/0x450 sock_recvmsg+0xde/0xf0 Allocated by task 133: kmem_cache_alloc_noprof+0x134/0x440 __nf_conntrack_alloc+0xa8/0x2b0 ctnetlink_create_conntrack+0xa1/0x900 ctnetlink_new_conntrack+0x3cf/0x7d0 nfnetlink_rcv_msg+0x48e/0x510 netlink_rcv_skb+0xc9/0x1f0 nfnetlink_rcv+0xdb/0x220 netlink_unicast+0x3ec/0x590 netlink_sendmsg+0x397/0x690 __sys_sendmsg+0xf4/0x180 Freed by task 0: slab_free_after_rcu_debug+0xad/0x1e0 rcu_core+0x5c3/0x9c0 Fixes: e844a928431f ("netfilter: ctnetlink: allow to dump expectation per master conntrack") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-10netfilter: xt_IDLETIMER: reject rev0 reuse of ALARM timer labelsYuan Tan
IDLETIMER revision 0 rules reuse existing timers by label and always call mod_timer() on timer->timer. If the label was created first by revision 1 with XT_IDLETIMER_ALARM, the object uses alarm timer semantics and timer->timer is never initialized. Reusing that object from revision 0 causes mod_timer() on an uninitialized timer_list, triggering debugobjects warnings and possible panic when panic_on_warn=1. Fix this by rejecting revision 0 rule insertion when an existing timer with the same label is of ALARM type. Fixes: 68983a354a65 ("netfilter: xtables: Add snapshot of hardidletimer target") Co-developed-by: Yifan Wu <yifanwucs@gmail.com> Signed-off-by: Yifan Wu <yifanwucs@gmail.com> Co-developed-by: Juefei Pu <tomapufckgml@gmail.com> Signed-off-by: Juefei Pu <tomapufckgml@gmail.com> Signed-off-by: Yuan Tan <tanyuan98@outlook.com> Signed-off-by: Xin Liu <dstsmallbird@foxmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-10netfilter: nfnetlink_cthelper: fix OOB read in nfnl_cthelper_dump_table()Hyunwoo Kim
nfnl_cthelper_dump_table() has a 'goto restart' that jumps to a label inside the for loop body. When the "last" helper saved in cb->args[1] is deleted between dump rounds, every entry fails the (cur != last) check, so cb->args[1] is never cleared. The for loop finishes with cb->args[0] == nf_ct_helper_hsize, and the 'goto restart' jumps back into the loop body bypassing the bounds check, causing an 8-byte out-of-bounds read on nf_ct_helper_hash[nf_ct_helper_hsize]. The 'goto restart' block was meant to re-traverse the current bucket when "last" is no longer found, but it was placed after the for loop instead of inside it. Move the block into the for loop body so that the restart only occurs while cb->args[0] is still within bounds. BUG: KASAN: slab-out-of-bounds in nfnl_cthelper_dump_table+0x9f/0x1b0 Read of size 8 at addr ffff888104ca3000 by task poc_cthelper/131 Call Trace: nfnl_cthelper_dump_table+0x9f/0x1b0 netlink_dump+0x333/0x880 netlink_recvmsg+0x3e2/0x4b0 sock_recvmsg+0xde/0xf0 __sys_recvfrom+0x150/0x200 __x64_sys_recvfrom+0x76/0x90 do_syscall_64+0xc3/0x6e0 Allocated by task 1: __kvmalloc_node_noprof+0x21b/0x700 nf_ct_alloc_hashtable+0x65/0xd0 nf_conntrack_helper_init+0x21/0x60 nf_conntrack_init_start+0x18d/0x300 nf_conntrack_standalone_init+0x12/0xc0 Fixes: 12f7a505331e ("netfilter: add user-space connection tracking helper infrastructure") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-10netfilter: nfnetlink_queue: fix entry leak in bridge verdict error pathHyunwoo Kim
nfqnl_recv_verdict() calls find_dequeue_entry() to remove the queue entry from the queue data structures, taking ownership of the entry. For PF_BRIDGE packets, it then calls nfqa_parse_bridge() to parse VLAN attributes. If nfqa_parse_bridge() returns an error (e.g. NFQA_VLAN present but NFQA_VLAN_TCI missing), the function returns immediately without freeing the dequeued entry or its sk_buff. This leaks the nf_queue_entry, its associated sk_buff, and all held references (net_device refcounts, struct net refcount). Repeated triggering exhausts kernel memory. Fix this by dropping the entry via nfqnl_reinject() with NF_DROP verdict on the error path, consistent with other error handling in this file. Fixes: 8d45ff22f1b4 ("netfilter: bridge: nf queue verdict to use NFQA_VLAN and NFQA_L2HDR") Reviewed-by: David Dull <monderasdor@gmail.com> Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-10netfilter: x_tables: guard option walkers against 1-byte tail readsDavid Dull
When the last byte of options is a non-single-byte option kind, walkers that advance with i += op[i + 1] ? : 1 can read op[i + 1] past the end of the option area. Add an explicit i == optlen - 1 check before dereferencing op[i + 1] in xt_tcpudp and xt_dccp option walkers. Fixes: 2e4e6a17af35 ("[NETFILTER] x_tables: Abstraction layer for {ip,ip6,arp}_tables") Signed-off-by: David Dull <monderasdor@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-10netfilter: nft_set_pipapo: fix stack out-of-bounds read in pipapo_drop()Jenny Guanni Qu
pipapo_drop() passes rulemap[i + 1].n to pipapo_unmap() as the to_offset argument on every iteration, including the last one where i == m->field_count - 1. This reads one element past the end of the stack-allocated rulemap array (declared as rulemap[NFT_PIPAPO_MAX_FIELDS] with NFT_PIPAPO_MAX_FIELDS == 16). Although pipapo_unmap() returns early when is_last is true without using the to_offset value, the argument is evaluated at the call site before the function body executes, making this a genuine out-of-bounds stack read confirmed by KASAN: BUG: KASAN: stack-out-of-bounds in pipapo_drop+0x50c/0x57c [nf_tables] Read of size 4 at addr ffff8000810e71a4 This frame has 1 object: [32, 160) 'rulemap' The buggy address is at offset 164 -- exactly 4 bytes past the end of the rulemap array. Pass 0 instead of rulemap[i + 1].n on the last iteration to avoid the out-of-bounds read. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-10netfilter: nf_tables: always walk all pending catchall elementsFlorian Westphal
During transaction processing we might have more than one catchall element: 1 live catchall element and 1 pending element that is coming as part of the new batch. If the map holding the catchall elements is also going away, its required to toggle all catchall elements and not just the first viable candidate. Otherwise, we get: WARNING: ./include/net/netfilter/nf_tables.h:1281 at nft_data_release+0xb7/0xe0 [nf_tables], CPU#2: nft/1404 RIP: 0010:nft_data_release+0xb7/0xe0 [nf_tables] [..] __nft_set_elem_destroy+0x106/0x380 [nf_tables] nf_tables_abort_release+0x348/0x8d0 [nf_tables] nf_tables_abort+0xcf2/0x3ac0 [nf_tables] nfnetlink_rcv_batch+0x9c9/0x20e0 [..] Fixes: 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-10netfilter: nf_tables: Fix for duplicate device in netdev hooksPhil Sutter
When handling NETDEV_REGISTER notification, duplicate device registration must be avoided since the device may have been added by nft_netdev_hook_alloc() already when creating the hook. Suggested-by: Florian Westphal <fw@strlen.de> Reported-by: syzbot+bb9127e278fa198e110c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bb9127e278fa198e110c Fixes: a331b78a5525 ("netfilter: nf_tables: Respect NETDEV_REGISTER events") Tested-by: Helen Koike <koike@igalia.com> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-05netfilter: nft_set_pipapo: split gc into unlink and reclaim phaseFlorian Westphal
Yiming Qian reports Use-after-free in the pipapo set type: Under a large number of expired elements, commit-time GC can run for a very long time in a non-preemptible context, triggering soft lockup warnings and RCU stall reports (local denial of service). We must split GC in an unlink and a reclaim phase. We cannot queue elements for freeing until pointers have been swapped. Expired elements are still exposed to both the packet path and userspace dumpers via the live copy of the data structure. call_rcu() does not protect us: dump operations or element lookups starting after call_rcu has fired can still observe the free'd element, unless the commit phase has made enough progress to swap the clone and live pointers before any new reader has picked up the old version. This a similar approach as done recently for the rbtree backend in commit 35f83a75529a ("netfilter: nft_set_rbtree: don't gc elements on insert"). Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-05netfilter: nf_tables: clone set on flush onlyPablo Neira Ayuso
Syzbot with fault injection triggered a failing memory allocation with GFP_KERNEL which results in a WARN splat: iter.err WARNING: net/netfilter/nf_tables_api.c:845 at nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845, CPU#0: syz.0.17/5992 Modules linked in: CPU: 0 UID: 0 PID: 5992 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 RIP: 0010:nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845 Code: 8b 05 86 5a 4e 09 48 3b 84 24 a0 00 00 00 75 62 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 63 6d fa f7 90 <0f> 0b 90 43 +80 7c 35 00 00 0f 85 23 fe ff ff e9 26 fe ff ff 89 d9 RSP: 0018:ffffc900045af780 EFLAGS: 00010293 RAX: ffffffff89ca45bd RBX: 00000000fffffff4 RCX: ffff888028111e40 RDX: 0000000000000000 RSI: 00000000fffffff4 RDI: 0000000000000000 RBP: ffffc900045af870 R08: 0000000000400dc0 R09: 00000000ffffffff R10: dffffc0000000000 R11: fffffbfff1d141db R12: ffffc900045af7e0 R13: 1ffff920008b5f24 R14: dffffc0000000000 R15: ffffc900045af920 FS: 000055557a6a5500(0000) GS:ffff888125496000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb5ea271fc0 CR3: 000000003269e000 CR4: 00000000003526f0 Call Trace: <TASK> __nft_release_table+0xceb/0x11f0 net/netfilter/nf_tables_api.c:12115 nft_rcv_nl_event+0xc25/0xdb0 net/netfilter/nf_tables_api.c:12187 notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85 blocking_notifier_call_chain+0x6a/0x90 kernel/notifier.c:380 netlink_release+0x123b/0x1ad0 net/netlink/af_netlink.c:761 __sock_release net/socket.c:662 [inline] sock_close+0xc3/0x240 net/socket.c:1455 Restrict set clone to the flush set command in the preparation phase. Add NFT_ITER_UPDATE_CLONE and use it for this purpose, update the rbtree and pipapo backends to only clone the set when this iteration type is used. As for the existing NFT_ITER_UPDATE type, update the pipapo backend to use the existing set clone if available, otherwise use the existing set representation. After this update, there is no need to clone a set that is being deleted, this includes bound anonymous set. An alternative approach to NFT_ITER_UPDATE_CLONE is to add a .clone interface and call it from the flush set path. Reported-by: syzbot+4924a0edc148e8b4b342@syzkaller.appspotmail.com Fixes: 3f1d886cc7c3 ("netfilter: nft_set_pipapo: move cloning of match info to insert/removal path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-05netfilter: nf_tables: unconditionally bump set->nelems before insertionPablo Neira Ayuso
In case that the set is full, a new element gets published then removed without waiting for the RCU grace period, while RCU reader can be walking over it already. To address this issue, add the element transaction even if set is full, but toggle the set_full flag to report -ENFILE so the abort path safely unwinds the set to its previous state. As for element updates, decrement set->nelems to restore it. A simpler fix is to call synchronize_rcu() in the error path. However, with a large batch adding elements to already maxed-out set, this could cause noticeable slowdown of such batches. Fixes: 35d0ac9070ef ("netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL") Reported-by: Inseo An <y0un9sa@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-26Merge tag 'net-7.0-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...