summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-11-20net/smc: fix dangling sock under state SMC_APPFINCLOSEWAITD. Wythe
[ Upstream commit 5211c9729484c923f8d2e06bd29f9322cc42bb8f ] Considering scenario: smc_cdc_rx_handler __smc_release sock_set_flag smc_close_active() sock_set_flag __set_bit(DEAD) __set_bit(DONE) Dues to __set_bit is not atomic, the DEAD or DONE might be lost. if the DEAD flag lost, the state SMC_CLOSED will be never be reached in smc_close_passive_work: if (sock_flag(sk, SOCK_DEAD) && smc_close_sent_any_close(conn)) { sk->sk_state = SMC_CLOSED; } else { /* just shutdown, but not yet closed locally */ sk->sk_state = SMC_APPFINCLOSEWAIT; } Replace sock_set_flags or __set_bit to set_bit will fix this problem. Since set_bit is atomic. Fixes: b38d732477e4 ("smc: socket closing and linkgroup cleanup") Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20dccp/tcp: Call security_inet_conn_request() after setting IPv6 addresses.Kuniyuki Iwashima
[ Upstream commit 23be1e0e2a83a8543214d2599a31d9a2185a796b ] Initially, commit 4237c75c0a35 ("[MLSXFRM]: Auto-labeling of child sockets") introduced security_inet_conn_request() in some functions where reqsk is allocated. The hook is added just after the allocation, so reqsk's IPv6 remote address was not initialised then. However, SELinux/Smack started to read it in netlbl_req_setattr() after commit e1adea927080 ("calipso: Allow request sockets to be relabelled by the lsm."). Commit 284904aa7946 ("lsm: Relocate the IPv4 security_inet_conn_request() hooks") fixed that kind of issue only in TCPv4 because IPv6 labeling was not supported at that time. Finally, the same issue was introduced again in IPv6. Let's apply the same fix on DCCPv6 and TCPv6. Fixes: e1adea927080 ("calipso: Allow request sockets to be relabelled by the lsm.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20dccp: Call security_inet_conn_request() after setting IPv4 addresses.Kuniyuki Iwashima
[ Upstream commit fa2df45af13091f76b89adb84a28f13818d5d631 ] Initially, commit 4237c75c0a35 ("[MLSXFRM]: Auto-labeling of child sockets") introduced security_inet_conn_request() in some functions where reqsk is allocated. The hook is added just after the allocation, so reqsk's IPv4 remote address was not initialised then. However, SELinux/Smack started to read it in netlbl_req_setattr() after the cited commits. This bug was partially fixed by commit 284904aa7946 ("lsm: Relocate the IPv4 security_inet_conn_request() hooks"). This patch fixes the last bug in DCCPv4. Fixes: 389fb800ac8b ("netlabel: Label incoming TCP connections correctly in SELinux") Fixes: 07feee8f812f ("netlabel: Cleanup the Smack/NetLabel code to fix incoming TCP connections") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20tipc: Change nla_policy for bearer-related names to NLA_NUL_STRINGShigeru Yoshida
[ Upstream commit 19b3f72a41a8751e26bffc093bb7e1cef29ad579 ] syzbot reported the following uninit-value access issue [1]: ===================================================== BUG: KMSAN: uninit-value in strlen lib/string.c:418 [inline] BUG: KMSAN: uninit-value in strstr+0xb8/0x2f0 lib/string.c:756 strlen lib/string.c:418 [inline] strstr+0xb8/0x2f0 lib/string.c:756 tipc_nl_node_reset_link_stats+0x3ea/0xb50 net/tipc/node.c:2595 genl_family_rcv_msg_doit net/netlink/genetlink.c:971 [inline] genl_family_rcv_msg net/netlink/genetlink.c:1051 [inline] genl_rcv_msg+0x11ec/0x1290 net/netlink/genetlink.c:1066 netlink_rcv_skb+0x371/0x650 net/netlink/af_netlink.c:2545 genl_rcv+0x40/0x60 net/netlink/genetlink.c:1075 netlink_unicast_kernel net/netlink/af_netlink.c:1342 [inline] netlink_unicast+0xf47/0x1250 net/netlink/af_netlink.c:1368 netlink_sendmsg+0x1238/0x13d0 net/netlink/af_netlink.c:1910 sock_sendmsg_nosec net/socket.c:730 [inline] sock_sendmsg net/socket.c:753 [inline] ____sys_sendmsg+0x9c2/0xd60 net/socket.c:2541 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2595 __sys_sendmsg net/socket.c:2624 [inline] __do_sys_sendmsg net/socket.c:2633 [inline] __se_sys_sendmsg net/socket.c:2631 [inline] __x64_sys_sendmsg+0x307/0x490 net/socket.c:2631 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Uninit was created at: slab_post_alloc_hook+0x12f/0xb70 mm/slab.h:767 slab_alloc_node mm/slub.c:3478 [inline] kmem_cache_alloc_node+0x577/0xa80 mm/slub.c:3523 kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:559 __alloc_skb+0x318/0x740 net/core/skbuff.c:650 alloc_skb include/linux/skbuff.h:1286 [inline] netlink_alloc_large_skb net/netlink/af_netlink.c:1214 [inline] netlink_sendmsg+0xb34/0x13d0 net/netlink/af_netlink.c:1885 sock_sendmsg_nosec net/socket.c:730 [inline] sock_sendmsg net/socket.c:753 [inline] ____sys_sendmsg+0x9c2/0xd60 net/socket.c:2541 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2595 __sys_sendmsg net/socket.c:2624 [inline] __do_sys_sendmsg net/socket.c:2633 [inline] __se_sys_sendmsg net/socket.c:2631 [inline] __x64_sys_sendmsg+0x307/0x490 net/socket.c:2631 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd TIPC bearer-related names including link names must be null-terminated strings. If a link name which is not null-terminated is passed through netlink, strstr() and similar functions can cause buffer overrun. This causes the above issue. This patch changes the nla_policy for bearer-related names from NLA_STRING to NLA_NUL_STRING. This resolves the issue by ensuring that only null-terminated strings are accepted as bearer-related names. syzbot reported similar uninit-value issue related to bearer names [2]. The root cause of this issue is that a non-null-terminated bearer name was passed. This patch also resolved this issue. Fixes: 7be57fc69184 ("tipc: add link get/dump to new netlink api") Fixes: 0655f6a8635b ("tipc: add bearer disable/enable to new netlink api") Reported-and-tested-by: syzbot+5138ca807af9d2b42574@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5138ca807af9d2b42574 [1] Reported-and-tested-by: syzbot+9425c47dccbcb4c17d51@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9425c47dccbcb4c17d51 [2] Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231030075540.3784537-1-syoshida@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20llc: verify mac len before reading mac headerWillem de Bruijn
[ Upstream commit 7b3ba18703a63f6fd487183b9262b08e5632da1b ] LLC reads the mac header with eth_hdr without verifying that the skb has an Ethernet header. Syzbot was able to enter llc_rcv on a tun device. Tun can insert packets without mac len and with user configurable skb->protocol (passing a tun_pi header when not configuring IFF_NO_PI). BUG: KMSAN: uninit-value in llc_station_ac_send_test_r net/llc/llc_station.c:81 [inline] BUG: KMSAN: uninit-value in llc_station_rcv+0x6fb/0x1290 net/llc/llc_station.c:111 llc_station_ac_send_test_r net/llc/llc_station.c:81 [inline] llc_station_rcv+0x6fb/0x1290 net/llc/llc_station.c:111 llc_rcv+0xc5d/0x14a0 net/llc/llc_input.c:218 __netif_receive_skb_one_core net/core/dev.c:5523 [inline] __netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5637 netif_receive_skb_internal net/core/dev.c:5723 [inline] netif_receive_skb+0x58/0x660 net/core/dev.c:5782 tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1555 tun_get_user+0x54c5/0x69c0 drivers/net/tun.c:2002 Add a mac_len test before all three eth_hdr(skb) calls under net/llc. There are further uses in include/net/llc_pdu.h. All these are protected by a test skb->protocol == ETH_P_802_2. Which does not protect against this tun scenario. But the mac_len test added in this patch in llc_fixup_skb will indirectly protect those too. That is called from llc_rcv before any other LLC code. It is tempting to just add a blanket mac_len check in llc_rcv, but not sure whether that could break valid LLC paths that do not assume an Ethernet header. 802.2 LLC may be used on top of non-802.3 protocols in principle. The below referenced commit shows that used to, on top of Token Ring. At least one of the three eth_hdr uses goes back to before the start of git history. But the one that syzbot exercises is introduced in this commit. That commit is old enough (2008), that effectively all stable kernels should receive this. Fixes: f83f1768f833 ("[LLC]: skb allocation size for responses") Reported-by: syzbot+a8c7be6dee0de1b669cc@syzkaller.appspotmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20231025234251.3796495-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20ipv6: avoid atomic fragment on GSO packetsYan Zhai
[ Upstream commit 03d6c848bfb406e9ef6d9846d759e97beaeea113 ] When the ipv6 stack output a GSO packet, if its gso_size is larger than dst MTU, then all segments would be fragmented. However, it is possible for a GSO packet to have a trailing segment with smaller actual size than both gso_size as well as the MTU, which leads to an "atomic fragment". Atomic fragments are considered harmful in RFC-8021. An Existing report from APNIC also shows that atomic fragments are more likely to be dropped even it is equivalent to a no-op [1]. Add an extra check in the GSO slow output path. For each segment from the original over-sized packet, if it fits with the path MTU, then avoid generating an atomic fragment. Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1] Fixes: b210de4f8c97 ("net: ipv6: Validate GSO SKB before finish IPv6 processing") Reported-by: David Wragg <dwragg@cloudflare.com> Signed-off-by: Yan Zhai <yan@cloudflare.com> Link: https://lore.kernel.org/r/90912e3503a242dca0bc36958b11ed03a2696e5e.1698156966.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20tcp: fix cookie_init_timestamp() overflowsEric Dumazet
[ Upstream commit 73ed8e03388d16c12fc577e5c700b58a29045a15 ] cookie_init_timestamp() is supposed to return a 64bit timestamp suitable for both TSval determination and setting of skb->tstamp. Unfortunately it uses 32bit fields and overflows after 2^32 * 10^6 nsec (~49 days) of uptime. Generated TSval are still correct, but skb->tstamp might be set far away in the past, potentially confusing other layers. tcp_ns_to_ts() is changed to return a full 64bit value, ts and ts_now variables are changed to u64 type, and TSMASK is removed in favor of shifts operations. While we are at it, change this sequence: ts >>= TSBITS; ts--; ts <<= TSBITS; ts |= options; to: ts -= (1UL << TSBITS); Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20tcp: Remove one extra ktime_get_ns() from cookie_init_timestampEric Dumazet
[ Upstream commit 200ecef67b8d09d16ec55f91c92751dcc7a38d40 ] tcp_make_synack() already uses tcp_clock_ns(), and can pass the value to cookie_init_timestamp() to avoid another call to ktime_get_ns() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 73ed8e03388d ("tcp: fix cookie_init_timestamp() overflows") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20tcp_metrics: do not create an entry from tcp_init_metrics()Eric Dumazet
[ Upstream commit a135798e6e200ecb2f864cecca6d257ba278370c ] tcp_init_metrics() only wants to get metrics if they were previously stored in the cache. Creating an entry is adding useless costs, especially when tcp_no_metrics_save is set. Fixes: 51c5d0c4b169 ("tcp: Maintain dynamic metrics in local cache.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20tcp_metrics: properly set tp->snd_ssthresh in tcp_init_metrics()Eric Dumazet
[ Upstream commit 081480014a64a69d901f8ef1ffdd56d6085cf87e ] We need to set tp->snd_ssthresh to TCP_INFINITE_SSTHRESH in the case tcp_get_metrics() fails for some reason. Fixes: 9ad7c049f0f7 ("tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20tcp_metrics: add missing barriers on deleteEric Dumazet
[ Upstream commit cbc3a153222805d65f821e10f4f78b6afce06f86 ] When removing an item from RCU protected list, we must prevent store-tearing, using rcu_assign_pointer() or WRITE_ONCE(). Fixes: 04f721c671656 ("tcp_metrics: Rewrite tcp_metrics_flush_all") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20tcp: call tcp_try_undo_recovery when an RTOd TFO SYNACK is ACKedAananth V
[ Upstream commit e326578a21414738de45f77badd332fb00bd0f58 ] For passive TCP Fast Open sockets that had SYN/ACK timeout and did not send more data in SYN_RECV, upon receiving the final ACK in 3WHS, the congestion state may awkwardly stay in CA_Loss mode unless the CA state was undone due to TCP timestamp checks. However, if tcp_rcv_synrecv_state_fastopen() decides not to undo, then we should enter CA_Open, because at that point we have received an ACK covering the retransmitted SYNACKs. Currently, the icsk_ca_state is only set to CA_Open after we receive an ACK for a data-packet. This is because tcp_ack does not call tcp_fastretrans_alert (and tcp_process_loss) if !prior_packets Note that tcp_process_loss() calls tcp_try_undo_recovery(), so having tcp_rcv_synrecv_state_fastopen() decide that if we're in CA_Loss we should call tcp_try_undo_recovery() is consistent with that, and low risk. Fixes: dad8cea7add9 ("tcp: fix TFO SYNACK undo to avoid double-timestamp-undo") Signed-off-by: Aananth V <aananthv@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-08netfilter: nfnetlink_log: silence bogus compiler warningFlorian Westphal
[ Upstream commit 2e1d175410972285333193837a4250a74cd472e6 ] net/netfilter/nfnetlink_log.c:800:18: warning: variable 'ctinfo' is uninitialized The warning is bogus, the variable is only used if ct is non-NULL and always initialised in that case. Init to 0 too to silence this. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202309100514.ndBFebXN-lkp@intel.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-08tcp: fix wrong RTO timeout when received SACK renegingFred Chen
[ Upstream commit d2a0fc372aca561556e765d0a9ec365c7c12f0ad ] This commit fix wrong RTO timeout when received SACK reneging. When an ACK arrived pointing to a SACK reneging, tcp_check_sack_reneging() will rearm the RTO timer for min(1/2*srtt, 10ms) into to the future. But since the commit 62d9f1a6945b ("tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN") merged, the tcp_set_xmit_timer() is moved after tcp_fastretrans_alert()(which do the SACK reneging check), so the RTO timeout will be overwrited by tcp_set_xmit_timer() with icsk_rto instead of 1/2*srtt. Here is a packetdrill script to check this bug: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 // simulate srtt to 100ms +0 < S 0:0(0) win 32792 <mss 1000, sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> +.1 < . 1:1(0) ack 1 win 1024 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 10000) = 10000 +0 > P. 1:10001(10000) ack 1 // inject sack +.1 < . 1:1(0) ack 1 win 257 <sack 1001:10001,nop,nop> +0 > . 1:1001(1000) ack 1 // inject sack reneging +.1 < . 1:1(0) ack 1001 win 257 <sack 9001:10001,nop,nop> // we expect rto fired in 1/2*srtt (50ms) +.05 > . 1001:2001(1000) ack 1 This fix remove the FLAG_SET_XMIT_TIMER from ack_flag when tcp_check_sack_reneging() set RTO timer with 1/2*srtt to avoid being overwrited later. Fixes: 62d9f1a6945b ("tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN") Signed-off-by: Fred Chen <fred.chenchen03@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-08neighbour: fix various data-racesEric Dumazet
[ Upstream commit a9beb7e81bcb876615e1fbb3c07f3f9dba69831f ] 1) tbl->gc_thresh1, tbl->gc_thresh2, tbl->gc_thresh3 and tbl->gc_interval can be written from sysfs. 2) tbl->last_flush is read locklessly from neigh_alloc() 3) tbl->proxy_queue.qlen is read locklessly from neightbl_fill_info() 4) neightbl_fill_info() reads cpu stats that can be changed concurrently. Fixes: c7fb64db001f ("[NETLINK]: Neighbour table configuration and statistics via rtnetlink") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231019122104.1448310-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25xfrm6: fix inet6_dev refcount underflow problemZhang Changzhong
[ Upstream commit cc9b364bb1d58d3dae270c7a931a8cc717dc2b3b ] There are race conditions that may lead to inet6_dev refcount underflow in xfrm6_dst_destroy() and rt6_uncached_list_flush_dev(). One of the refcount underflow bugs is shown below: (cpu 1) | (cpu 2) xfrm6_dst_destroy() | ... | in6_dev_put() | | rt6_uncached_list_flush_dev() ... | ... | in6_dev_put() rt6_uncached_list_del() | ... ... | xfrm6_dst_destroy() calls rt6_uncached_list_del() after in6_dev_put(), so rt6_uncached_list_flush_dev() has a chance to call in6_dev_put() again for the same inet6_dev. Fix it by moving in6_dev_put() after rt6_uncached_list_del() in xfrm6_dst_destroy(). Fixes: 510c321b5571 ("xfrm: reuse uncached_list to track xdsts") Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25Bluetooth: hci_sock: Correctly bounds check and pad HCI_MON_NEW_INDEX nameKees Cook
commit cb3871b1cd135a6662b732fbc6b3db4afcdb4a64 upstream. The code pattern of memcpy(dst, src, strlen(src)) is almost always wrong. In this case it is wrong because it leaves memory uninitialized if it is less than sizeof(ni->name), and overflows ni->name when longer. Normally strtomem_pad() could be used here, but since ni->name is a trailing array in struct hci_mon_new_index, compilers that don't support -fstrict-flex-arrays=3 can't tell how large this array is via __builtin_object_size(). Instead, open-code the helper and use sizeof() since it will work correctly. Additionally mark ni->name as __nonstring since it appears to not be a %NUL terminated C string. Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Cc: Edward AD <twuufnxlz@gmail.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: linux-bluetooth@vger.kernel.org Cc: netdev@vger.kernel.org Fixes: 18f547f3fc07 ("Bluetooth: hci_sock: fix slab oob read in create_monitor_event") Link: https://lore.kernel.org/lkml/202310110908.F2639D3276@keescook/ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25Bluetooth: hci_sock: fix slab oob read in create_monitor_eventEdward AD
commit 18f547f3fc074500ab5d419cf482240324e73a7e upstream. When accessing hdev->name, the actual string length should prevail Reported-by: syzbot+c90849c50ed209d77689@syzkaller.appspotmail.com Fixes: dcda165706b9 ("Bluetooth: hci_core: Fix build warnings") Signed-off-by: Edward AD <twuufnxlz@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25Bluetooth: hci_event: Fix using memcmp when comparing keysLuiz Augusto von Dentz
[ Upstream commit b541260615f601ae1b5d6d0cc54e790de706303b ] memcmp is not consider safe to use with cryptographic secrets: 'Do not use memcmp() to compare security critical data, such as cryptographic secrets, because the required CPU time depends on the number of equal bytes.' While usage of memcmp for ZERO_KEY may not be considered a security critical data, it can lead to more usage of memcmp with pairing keys which could introduce more security problems. Fixes: 455c2ff0a558 ("Bluetooth: Fix BR/EDR out-of-band pairing with only initiator data") Fixes: 33155c4aae52 ("Bluetooth: hci_event: Ignore NULL link key") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25wifi: cfg80211: avoid leaking stack data into traceBenjamin Berg
[ Upstream commit 334bf33eec5701a1e4e967bcb7cc8611a998334b ] If the structure is not initialized then boolean types might be copied into the tracing data without being initialised. This causes data from the stack to leak into the trace and also triggers a UBSAN failure which can easily be avoided here. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://lore.kernel.org/r/20230925171855.a9271ef53b05.I8180bae663984c91a3e036b87f36a640ba409817@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25wifi: mac80211: allow transmitting EAPOL frames with tainted keyWen Gong
[ Upstream commit 61304336c67358d49a989e5e0060d8c99bad6ca8 ] Lower layer device driver stop/wake TX by calling ieee80211_stop_queue()/ ieee80211_wake_queue() while hw scan. Sometimes hw scan and PTK rekey are running in parallel, when M4 sent from wpa_supplicant arrive while the TX queue is stopped, then the M4 will pending send, and then new key install from wpa_supplicant. After TX queue wake up by lower layer device driver, the M4 will be dropped by below call stack. When key install started, the current key flag is set KEY_FLAG_TAINTED in ieee80211_pairwise_rekey(), and then mac80211 wait key install complete by lower layer device driver. Meanwhile ieee80211_tx_h_select_key() will return TX_DROP for the M4 in step 12 below, and then ieee80211_free_txskb() called by ieee80211_tx_dequeue(), so the M4 will not send and free, then the rekey process failed becaue AP not receive M4. Please see details in steps below. There are a interval between KEY_FLAG_TAINTED set for current key flag and install key complete by lower layer device driver, the KEY_FLAG_TAINTED is set in this interval, all packet including M4 will be dropped in this interval, the interval is step 8~13 as below. issue steps: TX thread install key thread 1. stop_queue -idle- 2. sending M4 -idle- 3. M4 pending -idle- 4. -idle- starting install key from wpa_supplicant 5. -idle- =>ieee80211_key_replace() 6. -idle- =>ieee80211_pairwise_rekey() and set currently key->flags |= KEY_FLAG_TAINTED 7. -idle- =>ieee80211_key_enable_hw_accel() 8. -idle- =>drv_set_key() and waiting key install complete from lower layer device driver 9. wake_queue -waiting state- 10. re-sending M4 -waiting state- 11. =>ieee80211_tx_h_select_key() -waiting state- 12. drop M4 by KEY_FLAG_TAINTED -waiting state- 13. -idle- install key complete with success/fail success: clear flag KEY_FLAG_TAINTED fail: start disconnect Hence add check in step 11 above to allow the EAPOL send out in the interval. If lower layer device driver use the old key/cipher to encrypt the M4, then AP received/decrypt M4 correctly, after M4 send out, lower layer device driver install the new key/cipher to hardware and return success. If lower layer device driver use new key/cipher to send the M4, then AP will/should drop the M4, then it is same result with this issue, AP will/ should kick out station as well as this issue. issue log: kworker/u16:4-5238 [000] 6456.108926: stop_queue: phy1 queue:0, reason:0 wpa_supplicant-961 [003] 6456.119737: rdev_tx_control_port: wiphy_name=phy1 name=wlan0 ifindex=6 dest=ARRAY[9e, 05, 31, 20, 9b, d0] proto=36488 unencrypted=0 wpa_supplicant-961 [003] 6456.119839: rdev_return_int_cookie: phy1, returned 0, cookie: 504 wpa_supplicant-961 [003] 6456.120287: rdev_add_key: phy1, netdev:wlan0(6), key_index: 0, mode: 0, pairwise: true, mac addr: 9e:05:31:20:9b:d0 wpa_supplicant-961 [003] 6456.120453: drv_set_key: phy1 vif:wlan0(2) sta:9e:05:31:20:9b:d0 cipher:0xfac04, flags=0x9, keyidx=0, hw_key_idx=0 kworker/u16:9-3829 [001] 6456.168240: wake_queue: phy1 queue:0, reason:0 kworker/u16:9-3829 [001] 6456.168255: drv_wake_tx_queue: phy1 vif:wlan0(2) sta:9e:05:31:20:9b:d0 ac:0 tid:7 kworker/u16:9-3829 [001] 6456.168305: cfg80211_control_port_tx_status: wdev(1), cookie: 504, ack: false wpa_supplicant-961 [003] 6459.167982: drv_return_int: phy1 - -110 issue call stack: nl80211_frame_tx_status+0x230/0x340 [cfg80211] cfg80211_control_port_tx_status+0x1c/0x28 [cfg80211] ieee80211_report_used_skb+0x374/0x3e8 [mac80211] ieee80211_free_txskb+0x24/0x40 [mac80211] ieee80211_tx_dequeue+0x644/0x954 [mac80211] ath10k_mac_tx_push_txq+0xac/0x238 [ath10k_core] ath10k_mac_op_wake_tx_queue+0xac/0xe0 [ath10k_core] drv_wake_tx_queue+0x80/0x168 [mac80211] __ieee80211_wake_txqs+0xe8/0x1c8 [mac80211] _ieee80211_wake_txqs+0xb4/0x120 [mac80211] ieee80211_wake_txqs+0x48/0x80 [mac80211] tasklet_action_common+0xa8/0x254 tasklet_action+0x2c/0x38 __do_softirq+0xdc/0x384 Signed-off-by: Wen Gong <quic_wgong@quicinc.com> Link: https://lore.kernel.org/r/20230801064751.25803-1-quic_wgong@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25Bluetooth: hci_core: Fix build warningsLuiz Augusto von Dentz
[ Upstream commit dcda165706b9fbfd685898d46a6749d7d397e0c0 ] This fixes the following warnings: net/bluetooth/hci_core.c: In function ‘hci_register_dev’: net/bluetooth/hci_core.c:2620:54: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size 5 [-Wformat-truncation=] 2620 | snprintf(hdev->name, sizeof(hdev->name), "hci%d", id); | ^~ net/bluetooth/hci_core.c:2620:50: note: directive argument in the range [0, 2147483647] 2620 | snprintf(hdev->name, sizeof(hdev->name), "hci%d", id); | ^~~~~~~ net/bluetooth/hci_core.c:2620:9: note: ‘snprintf’ output between 5 and 14 bytes into a destination of size 8 2620 | snprintf(hdev->name, sizeof(hdev->name), "hci%d", id); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25Bluetooth: Avoid redundant authenticationYing Hsu
[ Upstream commit 1d8e801422d66e4b8c7b187c52196bef94eed887 ] While executing the Android 13 CTS Verifier Secure Server test on a ChromeOS device, it was observed that the Bluetooth host initiates authentication for an RFCOMM connection after SSP completes. When this happens, some Intel Bluetooth controllers, like AC9560, would disconnect with "Connection Rejected due to Security Reasons (0x0e)". Historically, BlueZ did not mandate this authentication while an authenticated combination key was already in use for the connection. This behavior was changed since commit 7b5a9241b780 ("Bluetooth: Introduce requirements for security level 4"). So, this patch addresses the aforementioned disconnection issue by restoring the previous behavior. Signed-off-by: Ying Hsu <yinghsu@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25net: pktgen: Fix interface flags printingGavrilov Ilia
commit 1d30162f35c7a73fc2f8cdcdcdbd690bedb99d1a upstream. Device flags are displayed incorrectly: 1) The comparison (i == F_FLOW_SEQ) is always false, because F_FLOW_SEQ is equal to (1 << FLOW_SEQ_SHIFT) == 2048, and the maximum value of the 'i' variable is (NR_PKT_FLAG - 1) == 17. It should be compared with FLOW_SEQ_SHIFT. 2) Similarly to the F_IPSEC flag. 3) Also add spaces to the print end of the string literal "spi:%u" to prevent the output from merging with the flag that follows. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 99c6d3d20d62 ("pktgen: Remove brute-force printing of flags") Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25netfilter: nft_set_rbtree: .deactivate fails if element has expiredPablo Neira Ayuso
commit d111692a59c1470ae530cbb39bcf0346c950ecc7 upstream. This allows to remove an expired element which is not possible in other existing set backends, this is more noticeable if gc-interval is high so expired elements remain in the tree. On-demand gc also does not help in this case, because this is delete element path. Return NULL if element has expired. Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curvePedro Tammela
commit a13b67c9a015c4e21601ef9aa4ec9c5d972df1b4 upstream. Christian Theune says: I upgraded from 6.1.38 to 6.1.55 this morning and it broke my traffic shaping script, leaving me with a non-functional uplink on a remote router. A 'rt' curve cannot be used as a inner curve (parent class), but we were allowing such configurations since the qdisc was introduced. Such configurations would trigger a UAF as Budimir explains: The parent will have vttree_insert() called on it in init_vf(), but will not have vttree_remove() called on it in update_vf() because it does not have the HFSC_FSC flag set. The qdisc always assumes that inner classes have the HFSC_FSC flag set. This is by design as it doesn't make sense 'qdisc wise' for an 'rt' curve to be an inner curve. Budimir's original patch disallows users to add classes with a 'rt' parent, but this is too strict as it breaks users that have been using 'rt' as a inner class. Another approach, taken by this patch, is to upgrade the inner 'rt' into a 'sc', warning the user in the process. It avoids the UAF reported by Budimir while also being more permissive to bad scripts/users/code using 'rt' as a inner class. Users checking the `tc class ls [...]` or `tc class get [...]` dumps would observe the curve change and are potentially breaking with this change. v1->v2: https://lore.kernel.org/all/20231013151057.2611860-1-pctammela@mojatatu.com/ - Correct 'Fixes' tag and merge with revert (Jakub) Cc: Christian Theune <ct@flyingcircus.io> Cc: Budimir Markovic <markovicbudimir@gmail.com> Fixes: b3d26c5702c7 ("net/sched: sch_hfsc: Ensure inner classes have fsc curve") Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20231017143602.3191556-1-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25ipv4: fib: annotate races around nh->nh_saddr_genid and nh->nh_saddrEric Dumazet
commit 195374d893681da43a39796e53b30ac4f20400c4 upstream. syzbot reported a data-race while accessing nh->nh_saddr_genid [1] Add annotations, but leave the code lazy as intended. [1] BUG: KCSAN: data-race in fib_select_path / fib_select_path write to 0xffff8881387166f0 of 4 bytes by task 6778 on cpu 1: fib_info_update_nhc_saddr net/ipv4/fib_semantics.c:1334 [inline] fib_result_prefsrc net/ipv4/fib_semantics.c:1354 [inline] fib_select_path+0x292/0x330 net/ipv4/fib_semantics.c:2269 ip_route_output_key_hash_rcu+0x659/0x12c0 net/ipv4/route.c:2810 ip_route_output_key_hash net/ipv4/route.c:2644 [inline] __ip_route_output_key include/net/route.h:134 [inline] ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2872 send4+0x1f5/0x520 drivers/net/wireguard/socket.c:61 wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175 wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline] wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703 worker_thread+0x525/0x730 kernel/workqueue.c:2784 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 read to 0xffff8881387166f0 of 4 bytes by task 6759 on cpu 0: fib_result_prefsrc net/ipv4/fib_semantics.c:1350 [inline] fib_select_path+0x1cb/0x330 net/ipv4/fib_semantics.c:2269 ip_route_output_key_hash_rcu+0x659/0x12c0 net/ipv4/route.c:2810 ip_route_output_key_hash net/ipv4/route.c:2644 [inline] __ip_route_output_key include/net/route.h:134 [inline] ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2872 send4+0x1f5/0x520 drivers/net/wireguard/socket.c:61 wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175 wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline] wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703 worker_thread+0x525/0x730 kernel/workqueue.c:2784 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 value changed: 0x959d3217 -> 0x959d3218 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 6759 Comm: kworker/u4:15 Not tainted 6.6.0-rc4-syzkaller-00029-gcbf3a2cb156a #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023 Workqueue: wg-kex-wg1 wg_packet_handshake_send_worker Fixes: 436c3b66ec98 ("ipv4: Invalidate nexthop cache nh_saddr more correctly.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231017192304.82626-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skbEric Dumazet
commit f921a4a5bffa8a0005b190fb9421a7fc1fd716b6 upstream. In commit 75eefc6c59fd ("tcp: tsq: add a shortcut in tcp_small_queue_check()") we allowed to send an skb regardless of TSQ limits being hit if rtx queue was empty or had a single skb, in order to better fill the pipe when/if TX completions were slow. Then later, commit 75c119afe14f ("tcp: implement rb-tree based retransmit queue") accidentally removed the special case for one skb in rtx queue. Stefan Wahren reported a regression in single TCP flow throughput using a 100Mbit fec link, starting from commit 65466904b015 ("tcp: adjust TSO packet sizes based on min_rtt"). This last commit only made the regression more visible, because it locked the TCP flow on a particular behavior where TSQ prevented two skbs being pushed downstream, adding silences on the wire between each TSO packet. Many thanks to Stefan for his invaluable help ! Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue") Link: https://lore.kernel.org/netdev/7f31ddc8-9971-495e-a1f6-819df542e0af@gmx.net/ Reported-by: Stefan Wahren <wahrenst@gmx.net> Tested-by: Stefan Wahren <wahrenst@gmx.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20231017124526.4060202-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25tcp: fix excessive TLP and RACK timeouts from HZ roundingNeal Cardwell
commit 1c2709cfff1dedbb9591e989e2f001484208d914 upstream. We discovered from packet traces of slow loss recovery on kernels with the default HZ=250 setting (and min_rtt < 1ms) that after reordering, when receiving a SACKed sequence range, the RACK reordering timer was firing after about 16ms rather than the desired value of roughly min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the exact same issue. This commit fixes the TLP transmit timer and RACK reordering timer floor calculation to more closely match the intended 2ms floor even on kernels with HZ=250. It does this by adding in a new TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies, instead of the current approach of converting to jiffies and then adding th TCP_TIMEOUT_MIN value of 2 jiffies. Our testing has verified that on kernels with HZ=1000, as expected, this does not produce significant changes in behavior, but on kernels with the default HZ=250 the latency improvement can be large. For example, our tests show that for HZ=250 kernels at low RTTs this fix roughly halves the latency for the RACK reorder timer: instead of mostly firing at 16ms it mostly fires at 8ms. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: bb4d991a28cc ("tcp: adjust tail loss probe timeout") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231015174700.2206872-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25net: rfkill: gpio: prevent value glitch during probeJosua Mayer
commit b2f750c3a80b285cd60c9346f8c96bd0a2a66cde upstream. When either reset- or shutdown-gpio have are initially deasserted, e.g. after a reboot - or when the hardware does not include pull-down, there will be a short toggle of both IOs to logical 0 and back to 1. It seems that the rfkill default is unblocked, so the driver should not glitch to output low during probe. It can lead e.g. to unexpected lte modem reconnect: [1] root@localhost:~# dmesg | grep "usb 2-1" [ 2.136124] usb 2-1: new SuperSpeed USB device number 2 using xhci-hcd [ 21.215278] usb 2-1: USB disconnect, device number 2 [ 28.833977] usb 2-1: new SuperSpeed USB device number 3 using xhci-hcd The glitch has been discovered on an arm64 board, now that device-tree support for the rfkill-gpio driver has finally appeared :). Change the flags for devm_gpiod_get_optional from GPIOD_OUT_LOW to GPIOD_ASIS to avoid any glitches. The rfkill driver will set the intended value during rfkill_sync_work. Fixes: 7176ba23f8b5 ("net: rfkill: add generic gpio rfkill driver") Signed-off-by: Josua Mayer <josua@solid-run.com> Link: https://lore.kernel.org/r/20231004163928.14609-1-josua@solid-run.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25net: ipv6: fix return value check in esp_remove_trailerMa Ke
commit dad4e491e30b20f4dc615c9da65d2142d703b5c2 upstream. In esp_remove_trailer(), to avoid an unexpected result returned by pskb_trim, we should check the return value of pskb_trim(). Signed-off-by: Ma Ke <make_ruc2021@163.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25net: ipv4: fix return value check in esp_remove_trailerMa Ke
commit 513f61e2193350c7a345da98559b80f61aec4fa6 upstream. In esp_remove_trailer(), to avoid an unexpected result returned by pskb_trim, we should check the return value of pskb_trim(). Signed-off-by: Ma Ke <make_ruc2021@163.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25xfrm: interface: use DEV_STATS_INC()Eric Dumazet
commit f7c4e3e5d4f6609b4725a97451948ca2e425379a upstream. syzbot/KCSAN reported data-races in xfrm whenever dev->stats fields are updated. It appears all of these updates can happen from multiple cpus. Adopt SMP safe DEV_STATS_INC() to update dev->stats fields. BUG: KCSAN: data-race in xfrmi_xmit / xfrmi_xmit read-write to 0xffff88813726b160 of 8 bytes by task 23986 on cpu 1: xfrmi_xmit+0x74e/0xb20 net/xfrm/xfrm_interface_core.c:583 __netdev_start_xmit include/linux/netdevice.h:4889 [inline] netdev_start_xmit include/linux/netdevice.h:4903 [inline] xmit_one net/core/dev.c:3544 [inline] dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560 __dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340 dev_queue_xmit include/linux/netdevice.h:3082 [inline] neigh_connected_output+0x231/0x2a0 net/core/neighbour.c:1581 neigh_output include/net/neighbour.h:542 [inline] ip_finish_output2+0x74a/0x850 net/ipv4/ip_output.c:230 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:318 NF_HOOK_COND include/linux/netfilter.h:293 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:432 dst_output include/net/dst.h:458 [inline] ip_local_out net/ipv4/ip_output.c:127 [inline] ip_send_skb+0x72/0xe0 net/ipv4/ip_output.c:1487 udp_send_skb+0x6a4/0x990 net/ipv4/udp.c:963 udp_sendmsg+0x1249/0x12d0 net/ipv4/udp.c:1246 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:840 sock_sendmsg_nosec net/socket.c:730 [inline] sock_sendmsg net/socket.c:753 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2540 ___sys_sendmsg net/socket.c:2594 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2680 __do_sys_sendmmsg net/socket.c:2709 [inline] __se_sys_sendmmsg net/socket.c:2706 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2706 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read-write to 0xffff88813726b160 of 8 bytes by task 23987 on cpu 0: xfrmi_xmit+0x74e/0xb20 net/xfrm/xfrm_interface_core.c:583 __netdev_start_xmit include/linux/netdevice.h:4889 [inline] netdev_start_xmit include/linux/netdevice.h:4903 [inline] xmit_one net/core/dev.c:3544 [inline] dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3560 __dev_queue_xmit+0xeee/0x1de0 net/core/dev.c:4340 dev_queue_xmit include/linux/netdevice.h:3082 [inline] neigh_connected_output+0x231/0x2a0 net/core/neighbour.c:1581 neigh_output include/net/neighbour.h:542 [inline] ip_finish_output2+0x74a/0x850 net/ipv4/ip_output.c:230 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:318 NF_HOOK_COND include/linux/netfilter.h:293 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:432 dst_output include/net/dst.h:458 [inline] ip_local_out net/ipv4/ip_output.c:127 [inline] ip_send_skb+0x72/0xe0 net/ipv4/ip_output.c:1487 udp_send_skb+0x6a4/0x990 net/ipv4/udp.c:963 udp_sendmsg+0x1249/0x12d0 net/ipv4/udp.c:1246 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:840 sock_sendmsg_nosec net/socket.c:730 [inline] sock_sendmsg net/socket.c:753 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2540 ___sys_sendmsg net/socket.c:2594 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2680 __do_sys_sendmmsg net/socket.c:2709 [inline] __se_sys_sendmmsg net/socket.c:2706 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2706 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x00000000000010d7 -> 0x00000000000010d8 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 23987 Comm: syz-executor.5 Not tainted 6.5.0-syzkaller-10885-g0468be89b3fa #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25xfrm: fix a data-race in xfrm_gen_index()Eric Dumazet
commit 3e4bc23926b83c3c67e5f61ae8571602754131a6 upstream. xfrm_gen_index() mutual exclusion uses net->xfrm.xfrm_policy_lock. This means we must use a per-netns idx_generator variable, instead of a static one. Alternative would be to use an atomic variable. syzbot reported: BUG: KCSAN: data-race in xfrm_sk_policy_insert / xfrm_sk_policy_insert write to 0xffffffff87005938 of 4 bytes by task 29466 on cpu 0: xfrm_gen_index net/xfrm/xfrm_policy.c:1385 [inline] xfrm_sk_policy_insert+0x262/0x640 net/xfrm/xfrm_policy.c:2347 xfrm_user_policy+0x413/0x540 net/xfrm/xfrm_state.c:2639 do_ipv6_setsockopt+0x1317/0x2ce0 net/ipv6/ipv6_sockglue.c:943 ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012 rawv6_setsockopt+0x21e/0x410 net/ipv6/raw.c:1054 sock_common_setsockopt+0x61/0x70 net/core/sock.c:3697 __sys_setsockopt+0x1c9/0x230 net/socket.c:2263 __do_sys_setsockopt net/socket.c:2274 [inline] __se_sys_setsockopt net/socket.c:2271 [inline] __x64_sys_setsockopt+0x66/0x80 net/socket.c:2271 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffffffff87005938 of 4 bytes by task 29460 on cpu 1: xfrm_sk_policy_insert+0x13e/0x640 xfrm_user_policy+0x413/0x540 net/xfrm/xfrm_state.c:2639 do_ipv6_setsockopt+0x1317/0x2ce0 net/ipv6/ipv6_sockglue.c:943 ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012 rawv6_setsockopt+0x21e/0x410 net/ipv6/raw.c:1054 sock_common_setsockopt+0x61/0x70 net/core/sock.c:3697 __sys_setsockopt+0x1c9/0x230 net/socket.c:2263 __do_sys_setsockopt net/socket.c:2274 [inline] __se_sys_setsockopt net/socket.c:2271 [inline] __x64_sys_setsockopt+0x66/0x80 net/socket.c:2271 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x00006ad8 -> 0x00006b18 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 29460 Comm: syz-executor.1 Not tainted 6.5.0-rc5-syzkaller-00243-g9106536c1aa3 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Fixes: 1121994c803f ("netns xfrm: policy insertion in netns") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25netfilter: nft_payload: fix wrong mac header matchingFlorian Westphal
commit d351c1ea2de3e36e608fc355d8ae7d0cc80e6cd6 upstream. mcast packets get looped back to the local machine. Such packets have a 0-length mac header, we should treat this like "mac header not set" and abort rule evaluation. As-is, we just copy data from the network header instead. Fixes: 96518518cc41 ("netfilter: add nftables") Reported-by: Blažej Krajňák <krajnak@levonet.sk> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25nfc: nci: fix possible NULL pointer dereference in send_acknowledge()Krzysztof Kozlowski
commit 7937609cd387246aed994e81aa4fa951358fba41 upstream. Handle memory allocation failure from nci_skb_alloc() (calling alloc_skb()) to avoid possible NULL pointer dereference. Reported-by: 黄思聪 <huangsicong@iie.ac.cn> Fixes: 391d8a2da787 ("NFC: Add NCI over SPI receive") Cc: <stable@vger.kernel.org> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231013184129.18738-1-krzysztof.kozlowski@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25Bluetooth: avoid memcmp() out of bounds warningArnd Bergmann
commit 9d1a3c74746428102d55371fbf74b484733937d9 upstream. bacmp() is a wrapper around memcpy(), which contain compile-time checks for buffer overflow. Since the hci_conn_request_evt() also calls bt_dev_dbg() with an implicit NULL pointer check, the compiler is now aware of a case where 'hdev' is NULL and treats this as meaning that zero bytes are available: In file included from net/bluetooth/hci_event.c:32: In function 'bacmp', inlined from 'hci_conn_request_evt' at net/bluetooth/hci_event.c:3276:7: include/net/bluetooth/bluetooth.h:364:16: error: 'memcmp' specified bound 6 exceeds source size 0 [-Werror=stringop-overread] 364 | return memcmp(ba1, ba2, sizeof(bdaddr_t)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Add another NULL pointer check before the bacmp() to ensure the compiler understands the code flow enough to not warn about it. Since the patch that introduced the warning is marked for stable backports, this one should also go that way to avoid introducing build regressions. Fixes: 1ffc6f8cc332 ("Bluetooth: Reject connection with the device which has same BD_ADDR") Cc: Kees Cook <keescook@chromium.org> Cc: "Lee, Chun-Yi" <jlee@suse.com> Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25Bluetooth: hci_event: Fix coding styleLuiz Augusto von Dentz
commit 35d91d95a0cd61ebb90e0246dc917fd25e519b8c upstream. This fixes the following code style problem: ERROR: that open brace { should be on the previous line + if (!bacmp(&hdev->bdaddr, &ev->bdaddr)) + { Fixes: 1ffc6f8cc332 ("Bluetooth: Reject connection with the device which has same BD_ADDR") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25Bluetooth: Fix a refcnt underflow problem for hci_connZiyang Xuan
commit c7f59461f5a78994613afc112cdd73688aef9076 upstream. Syzbot reports a warning as follows: WARNING: CPU: 1 PID: 26946 at net/bluetooth/hci_conn.c:619 hci_conn_timeout+0x122/0x210 net/bluetooth/hci_conn.c:619 ... Call Trace: <TASK> process_one_work+0x884/0x15c0 kernel/workqueue.c:2630 process_scheduled_works kernel/workqueue.c:2703 [inline] worker_thread+0x8b9/0x1290 kernel/workqueue.c:2784 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 </TASK> It is because the HCI_EV_SIMPLE_PAIR_COMPLETE event handler drops hci_conn directly without check Simple Pairing whether be enabled. But the Simple Pairing process can only be used if both sides have the support enabled in the host stack. Add hci_conn_ssp_enabled() for hci_conn in HCI_EV_IO_CAPA_REQUEST and HCI_EV_SIMPLE_PAIR_COMPLETE event handlers to fix the problem. Fixes: 0493684ed239 ("[Bluetooth] Disable disconnect timer during Simple Pairing") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25Bluetooth: Reject connection with the device which has same BD_ADDRLee, Chun-Yi
commit 1ffc6f8cc33268731fcf9629fc4438f6db1191fc upstream. This change is used to relieve CVE-2020-26555. The description of the CVE: Bluetooth legacy BR/EDR PIN code pairing in Bluetooth Core Specification 1.0B through 5.2 may permit an unauthenticated nearby device to spoof the BD_ADDR of the peer device to complete pairing without knowledge of the PIN. [1] The detail of this attack is in IEEE paper: BlueMirror: Reflections on Bluetooth Pairing and Provisioning Protocols [2] It's a reflection attack. The paper mentioned that attacker can induce the attacked target to generate null link key (zero key) without PIN code. In BR/EDR, the key generation is actually handled in the controller which is below HCI. A condition of this attack is that attacker should change the BR_ADDR of his hacking device (Host B) to equal to the BR_ADDR with the target device being attacked (Host A). Thus, we reject the connection with device which has same BD_ADDR both on HCI_Create_Connection and HCI_Connection_Request to prevent the attack. A similar implementation also shows in btstack project. [3][4] Cc: stable@vger.kernel.org Link: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26555 [1] Link: https://ieeexplore.ieee.org/abstract/document/9474325/authors#authors [2] Link: https://github.com/bluekitchen/btstack/blob/master/src/hci.c#L3523 [3] Link: https://github.com/bluekitchen/btstack/blob/master/src/hci.c#L7297 [4] Signed-off-by: Lee, Chun-Yi <jlee@suse.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25Bluetooth: hci_event: Ignore NULL link keyLee, Chun-Yi
commit 33155c4aae5260475def6f7438e4e35564f4f3ba upstream. This change is used to relieve CVE-2020-26555. The description of the CVE: Bluetooth legacy BR/EDR PIN code pairing in Bluetooth Core Specification 1.0B through 5.2 may permit an unauthenticated nearby device to spoof the BD_ADDR of the peer device to complete pairing without knowledge of the PIN. [1] The detail of this attack is in IEEE paper: BlueMirror: Reflections on Bluetooth Pairing and Provisioning Protocols [2] It's a reflection attack. The paper mentioned that attacker can induce the attacked target to generate null link key (zero key) without PIN code. In BR/EDR, the key generation is actually handled in the controller which is below HCI. Thus, we can ignore null link key in the handler of "Link Key Notification event" to relieve the attack. A similar implementation also shows in btstack project. [3] v3: Drop the connection when null link key be detected. v2: - Used Link: tag instead of Closes: - Used bt_dev_dbg instead of BT_DBG - Added Fixes: tag Cc: stable@vger.kernel.org Fixes: 55ed8ca10f35 ("Bluetooth: Implement link key handling for the management interface") Link: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26555 [1] Link: https://ieeexplore.ieee.org/abstract/document/9474325/authors#authors [2] Link: https://github.com/bluekitchen/btstack/blob/master/src/hci.c#L3722 [3] Signed-off-by: Lee, Chun-Yi <jlee@suse.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25libceph: use kernel_connect()Jordan Rife
commit 7563cf17dce0a875ba3d872acdc63a78ea344019 upstream. Direct calls to ops->connect() can overwrite the address parameter when used in conjunction with BPF SOCK_ADDR hooks. Recent changes to kernel_connect() ensure that callers are insulated from such side effects. This patch wraps the direct call to ops->connect() with kernel_connect() to prevent unexpected changes to the address passed to ceph_tcp_connect(). This change was originally part of a larger patch targeting the net tree addressing all instances of unprotected calls to ops->connect() throughout the kernel, but this change was split up into several patches targeting various trees. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/netdev/20230821100007.559638-1-jrife@google.com/ Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/ Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect") Signed-off-by: Jordan Rife <jrife@google.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25nfc: nci: assert requested protocol is validJeremy Cline
[ Upstream commit 354a6e707e29cb0c007176ee5b8db8be7bd2dee0 ] The protocol is used in a bit mask to determine if the protocol is supported. Assert the provided protocol is less than the maximum defined so it doesn't potentially perform a shift-out-of-bounds and provide a clearer error for undefined protocols vs unsupported ones. Fixes: 6a2968aaf50c ("NFC: basic NCI protocol implementation") Reported-and-tested-by: syzbot+0839b78e119aae1fec78@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0839b78e119aae1fec78 Signed-off-by: Jeremy Cline <jeremy@jcline.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231009200054.82557-1-jeremy@jcline.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()Eric Dumazet
[ Upstream commit 31c07dffafce914c1d1543c135382a11ff058d93 ] Sili Luo reported a race in nfc_llcp_sock_get(), leading to UAF. Getting a reference on the socket found in a lookup while holding a lock should happen before releasing the lock. nfc_llcp_sock_get_sn() has a similar problem. Finally nfc_llcp_recv_snl() needs to make sure the socket found by nfc_llcp_sock_from_sn() does not disappear. Fixes: 8f50020ed9b8 ("NFC: LLCP late binding") Reported-by: Sili Luo <rootlab@huawei.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20231009123110.3735515-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25net: prevent address rewrite in kernel_bind()Jordan Rife
commit c889a99a21bf124c3db08d09df919f0eccc5ea4c upstream. Similar to the change in commit 0bdf399342c5("net: Avoid address overwrite in kernel_connect"), BPF hooks run on bind may rewrite the address passed to kernel_bind(). This change 1) Makes a copy of the bind address in kernel_bind() to insulate callers. 2) Replaces direct calls to sock->ops->bind() in net with kernel_bind() Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/ Fixes: 4fbac77d2d09 ("bpf: Hooks for sys_bind") Cc: stable@vger.kernel.org Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jordan Rife <jrife@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10sctp: update hb timer immediately after users change hb_intervalXin Long
[ Upstream commit 1f4e803cd9c9166eb8b6c8b0b8e4124f7499fc07 ] Currently, when hb_interval is changed by users, it won't take effect until the next expiry of hb timer. As the default value is 30s, users have to wait up to 30s to wait its hb_interval update to work. This becomes pretty bad in containers where a much smaller value is usually set on hb_interval. This patch improves it by resetting the hb timer immediately once the value of hb_interval is updated by users. Note that we don't address the already existing 'problem' when sending a heartbeat 'on demand' if one hb has just been sent(from the timer) mentioned in: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg590224.html Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10sctp: update transport state when processing a dupcook packetXin Long
[ Upstream commit 2222a78075f0c19ca18db53fd6623afb4aff602d ] During the 4-way handshake, the transport's state is set to ACTIVE in sctp_process_init() when processing INIT_ACK chunk on client or COOKIE_ECHO chunk on server. In the collision scenario below: 192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885] 192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO] 192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state, sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it creates a new association and sets its transport to ACTIVE then updates to the old association in sctp_assoc_update(). However, in sctp_assoc_update(), it will skip the transport update if it finds a transport with the same ipaddr already existing in the old asoc, and this causes the old asoc's transport state not to move to ACTIVE after the handshake. This means if DATA retransmission happens at this moment, it won't be able to enter PF state because of the check 'transport->state == SCTP_ACTIVE' in sctp_do_8_2_transport_strike(). This patch fixes it by updating the transport in sctp_assoc_update() with sctp_assoc_add_peer() where it updates the transport state if there is already a transport with the same ipaddr exists in the old asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/r/fd17356abe49713ded425250cc1ae51e9f5846c6.1696172325.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10tcp: fix delayed ACKs for MSS boundary conditionNeal Cardwell
[ Upstream commit 4720852ed9afb1c5ab84e96135cb5b73d5afde6f ] This commit fixes poor delayed ACK behavior that can cause poor TCP latency in a particular boundary condition: when an application makes a TCP socket write that is an exact multiple of the MSS size. The problem is that there is painful boundary discontinuity in the current delayed ACK behavior. With the current delayed ACK behavior, we have: (1) If an app reads data when > 1*MSS is unacknowledged, then tcp_cleanup_rbuf() ACKs immediately because of: tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss || (2) If an app reads all received data, and the packets were < 1*MSS, and either (a) the app is not ping-pong or (b) we received two packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause of: ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) || ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) && !inet_csk_in_pingpong_mode(sk))) && (3) *However*: if an app reads exactly 1*MSS of data, tcp_cleanup_rbuf() does not send an immediate ACK. This is true even if the app is not ping-pong and the 1*MSS of data had the PSH bit set, suggesting the sending application completed an application write. Thus if the app is not ping-pong, we have this painful case where >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a write whose last skb is an exact multiple of 1*MSS can get a 40ms delayed ACK. This means that any app that transfers data in one direction and takes care to align write size or packet size with MSS can suffer this problem. With receive zero copy making 4KB MSS values more common, it is becoming more common to have application writes naturally align with MSS, and more applications are likely to encounter this delayed ACK problem. The fix in this commit is to refine the delayed ACK heuristics with a simple check: immediately ACK a received 1*MSS skb with PSH bit set if the app reads all data. Why? If an skb has a len of exactly 1*MSS and has the PSH bit set then it is likely the end of an application write. So more data may not be arriving soon, and yet the data sender may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send an ACK immediately if the app reads all of the data and is not ping-pong. Note that this logic is also executed for the case where len > MSS, but in that case this logic does not matter (and does not hurt) because tcp_cleanup_rbuf() will always ACK immediately if the app reads data and there is more than an MSS of unACKed data. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Xin Guo <guoxin0309@gmail.com> Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10tcp: fix quick-ack counting to count actual ACKs of new dataNeal Cardwell
[ Upstream commit 059217c18be6757b95bfd77ba53fb50b48b8a816 ] This commit fixes quick-ack counting so that it only considers that a quick-ack has been provided if we are sending an ACK that newly acknowledges data. The code was erroneously using the number of data segments in outgoing skbs when deciding how many quick-ack credits to remove. This logic does not make sense, and could cause poor performance in request-response workloads, like RPC traffic, where requests or responses can be multi-segment skbs. When a TCP connection decides to send N quick-acks, that is to accelerate the cwnd growth of the congestion control module controlling the remote endpoint of the TCP connection. That quick-ack decision is purely about the incoming data and outgoing ACKs. It has nothing to do with the outgoing data or the size of outgoing data. And in particular, an ACK only serves the intended purpose of allowing the remote congestion control to grow the congestion window quickly if the ACK is ACKing or SACKing new data. The fix is simple: only count packets as serving the goal of the quickack mechanism if they are ACKing/SACKing new data. We can tell whether this is the case by checking inet_csk_ack_scheduled(), since we schedule an ACK exactly when we are ACKing/SACKing new data. Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10netfilter: handle the connecting collision properly in nf_conntrack_proto_sctpXin Long
[ Upstream commit 8e56b063c86569e51eed1c5681ce6361fa97fc7a ] In Scenario A and B below, as the delayed INIT_ACK always changes the peer vtag, SCTP ct with the incorrect vtag may cause packet loss. Scenario A: INIT_ACK is delayed until the peer receives its own INIT_ACK 192.168.1.2 > 192.168.1.1: [INIT] [init tag: 1328086772] 192.168.1.1 > 192.168.1.2: [INIT] [init tag: 1414468151] 192.168.1.2 > 192.168.1.1: [INIT ACK] [init tag: 1328086772] 192.168.1.1 > 192.168.1.2: [INIT ACK] [init tag: 1650211246] * 192.168.1.2 > 192.168.1.1: [COOKIE ECHO] 192.168.1.1 > 192.168.1.2: [COOKIE ECHO] 192.168.1.2 > 192.168.1.1: [COOKIE ACK] Scenario B: INIT_ACK is delayed until the peer completes its own handshake 192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885] 192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO] 192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] * This patch fixes it as below: In SCTP_CID_INIT processing: - clear ct->proto.sctp.init[!dir] if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir]. (Scenario E) - set ct->proto.sctp.init[dir]. In SCTP_CID_INIT_ACK processing: - drop it if !ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] && ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario B, Scenario C) - drop it if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario A) In SCTP_CID_COOKIE_ACK processing: - clear ct->proto.sctp.init[dir] and ct->proto.sctp.init[!dir]. (Scenario D) Also, it's important to allow the ct state to move forward with cookie_echo and cookie_ack from the opposite dir for the collision scenarios. There are also other Scenarios where it should allow the packet through, addressed by the processing above: Scenario C: new CT is created by INIT_ACK. Scenario D: start INIT on the existing ESTABLISHED ct. Scenario E: start INIT after the old collision on the existing ESTABLISHED ct. 192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885] (both side are stopped, then start new connection again in hours) 192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 242308742] Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>