From a9352db24328d58ce44f17191eaef0654aae8962 Mon Sep 17 00:00:00 2001 From: Wei Wang Date: Wed, 16 Oct 2019 12:03:15 -0700 Subject: ipv4: fix race condition between route lookup and invalidation [ Upstream commit 5018c59607a511cdee743b629c76206d9c9e6d7b ] Jesse and Ido reported the following race condition: - Received packet A is forwarded and cached dst entry is taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set() - Given Jesse has busy routers ("ingesting full BGP routing tables from multiple ISPs"), route is added / deleted and rt_cache_flush() is called - Received packet B tries to use the same cached dst entry from t0, but rt_cache_valid() is no longer true and it is replaced in rt_cache_route() by the newer one. This calls dst_dev_put() on the original dst entry which assigns the blackhole netdev to 'dst->dev' - dst_input(skb) is called on packet A and it is dropped due to 'dst->dev' being the blackhole netdev There are 2 issues in the v4 routing code: 1. A per-netns counter is used to do the validation of the route. That means whenever a route is changed in the netns, users of all routes in the netns needs to redo lookup. v6 has an implementation of only updating fn_sernum for routes that are affected. 2. When rt_cache_valid() returns false, rt_cache_route() is called to throw away the current cache, and create a new one. This seems unnecessary because as long as this route does not change, the route cache does not need to be recreated. To fully solve the above 2 issues, it probably needs quite some code changes and requires careful testing, and does not suite for net branch. So this patch only tries to add the deleted cached rt into the uncached list, so user could still be able to use it to receive packets until it's done. Fixes: 95c47f9cf5e0 ("ipv4: call dst_dev_put() properly") Signed-off-by: Wei Wang Reported-by: Ido Schimmel Reported-by: Jesse Hathaway Tested-by: Jesse Hathaway Acked-by: Martin KaFai Lau Cc: David Ahern Reviewed-by: Ido Schimmel Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/route.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net/ipv4') diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 14654876127e..9e0c8dff2cd6 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt) prev = cmpxchg(p, orig, rt); if (prev == orig) { if (orig) { - dst_dev_put(&orig->dst); + rt_add_uncached_list(orig); dst_release(&orig->dst); } } else { -- cgit v1.2.3 From ac138f244c6341ddae123eb5d7b55edcd4b45f9b Mon Sep 17 00:00:00 2001 From: Stefano Brivio Date: Wed, 16 Oct 2019 20:52:09 +0200 Subject: ipv4: Return -ENETUNREACH if we can't create route but saddr is valid [ Upstream commit 595e0651d0296bad2491a4a29a7a43eae6328b02 ] ...instead of -EINVAL. An issue was found with older kernel versions while unplugging a NFS client with pending RPCs, and the wrong error code here prevented it from recovering once link is back up with a configured address. Incidentally, this is not an issue anymore since commit 4f8943f80883 ("SUNRPC: Replace direct task wakeups from softirq context"), included in 5.2-rc7, had the effect of decoupling the forwarding of this error by using SO_ERROR in xs_wake_error(), as pointed out by Benjamin Coddington. To the best of my knowledge, this isn't currently causing any further issue, but the error code doesn't look appropriate anyway, and we might hit this in other paths as well. In detail, as analysed by Gonzalo Siero, once the route is deleted because the interface is down, and can't be resolved and we return -EINVAL here, this ends up, courtesy of inet_sk_rebuild_header(), as the socket error seen by tcp_write_err(), called by tcp_retransmit_timer(). In turn, tcp_write_err() indirectly calls xs_error_report(), which wakes up the RPC pending tasks with a status of -EINVAL. This is then seen by call_status() in the SUN RPC implementation, which aborts the RPC call calling rpc_exit(), instead of handling this as a potentially temporary condition, i.e. as a timeout. Return -EINVAL only if the input parameters passed to ip_route_output_key_hash_rcu() are actually invalid (this is the case if the specified source address is multicast, limited broadcast or all zeroes), but return -ENETUNREACH in all cases where, at the given moment, the given source address doesn't allow resolving the route. While at it, drop the initialisation of err to -ENETUNREACH, which was added to __ip_route_output_key() back then by commit 0315e3827048 ("net: Fix behaviour of unreachable, blackhole and prohibit routes"), but actually had no effect, as it was, and is, overwritten by the fib_lookup() return code assignment, and anyway ignored in all other branches, including the if (fl4->saddr) one: I find this rather confusing, as it would look like -ENETUNREACH is the "default" error, while that statement has no effect. Also note that after commit fc75fc8339e7 ("ipv4: dont create routes on down devices"), we would get -ENETUNREACH if the device is down, but -EINVAL if the source address is specified and we can't resolve the route, and this appears to be rather inconsistent. Reported-by: Stefan Walter Analysed-by: Benjamin Coddington Analysed-by: Gonzalo Siero Signed-off-by: Stefano Brivio Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/route.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 9e0c8dff2cd6..621f83434b24 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2470,14 +2470,17 @@ struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4, int orig_oif = fl4->flowi4_oif; unsigned int flags = 0; struct rtable *rth; - int err = -ENETUNREACH; + int err; if (fl4->saddr) { - rth = ERR_PTR(-EINVAL); if (ipv4_is_multicast(fl4->saddr) || ipv4_is_lbcast(fl4->saddr) || - ipv4_is_zeronet(fl4->saddr)) + ipv4_is_zeronet(fl4->saddr)) { + rth = ERR_PTR(-EINVAL); goto out; + } + + rth = ERR_PTR(-ENETUNREACH); /* I removed check for oif == dev_out->oif here. It was wrong for two reasons: -- cgit v1.2.3 From d72b88fe4645c6bace1b8223573163aa046e5491 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Mon, 28 Oct 2019 23:19:35 +0800 Subject: erspan: fix the tun_info options_len check for erspan [ Upstream commit 2eb8d6d2910cfe3dc67dc056f26f3dd9c63d47cd ] The check for !md doens't really work for ip_tunnel_info_opts(info) which only does info + 1. Also to avoid out-of-bounds access on info, it should ensure options_len is not less than erspan_metadata in both erspan_xmit() and ip6erspan_tunnel_xmit(). Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel") Signed-off-by: Xin Long Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/ip_gre.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index 52690bb3e40f..10636fb6093e 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -509,9 +509,9 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev) key = &tun_info->key; if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT)) goto err_free_skb; - md = ip_tunnel_info_opts(tun_info); - if (!md) + if (tun_info->options_len < sizeof(*md)) goto err_free_skb; + md = ip_tunnel_info_opts(tun_info); /* ERSPAN has fixed 8 byte GRE header */ version = md->version; -- cgit v1.2.3 From 40e068f53a10a25013dc3f0396cee893ba72c068 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Fri, 1 Nov 2019 10:32:19 -0700 Subject: inet: stop leaking jiffies on the wire [ Upstream commit a904a0693c189691eeee64f6c6b188bd7dc244e9 ] Historically linux tried to stick to RFC 791, 1122, 2003 for IPv4 ID field generation. RFC 6864 made clear that no matter how hard we try, we can not ensure unicity of IP ID within maximum lifetime for all datagrams with a given source address/destination address/protocol tuple. Linux uses a per socket inet generator (inet_id), initialized at connection startup with a XOR of 'jiffies' and other fields that appear clear on the wire. Thiemo Nagel pointed that this strategy is a privacy concern as this provides 16 bits of entropy to fingerprint devices. Let's switch to a random starting point, this is just as good as far as RFC 6864 is concerned and does not leak anything critical. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet Reported-by: Thiemo Nagel Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/datagram.c | 2 +- net/ipv4/tcp_ipv4.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c index 9a0fe0c2fa02..4a8550c49202 100644 --- a/net/ipv4/datagram.c +++ b/net/ipv4/datagram.c @@ -73,7 +73,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len reuseport_has_conns(sk, true); sk->sk_state = TCP_ESTABLISHED; sk_set_txhash(sk); - inet->inet_id = jiffies; + inet->inet_id = prandom_u32(); sk_dst_set(sk, &rt->dst); err = 0; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index d57641cb3477..54320ef35405 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -300,7 +300,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len) inet->inet_daddr); } - inet->inet_id = tp->write_seq ^ jiffies; + inet->inet_id = prandom_u32(); if (tcp_fastopen_defer_connect(sk, &err)) return err; @@ -1443,7 +1443,7 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb, inet_csk(newsk)->icsk_ext_hdr_len = 0; if (inet_opt) inet_csk(newsk)->icsk_ext_hdr_len = inet_opt->opt.optlen; - newinet->inet_id = newtp->write_seq ^ jiffies; + newinet->inet_id = prandom_u32(); if (!dst) { dst = inet_csk_route_child_sock(sk, newsk, req); -- cgit v1.2.3 From 597f5ff2e37aa1487dd1150787aabe2186151770 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 30 Oct 2019 13:00:04 -0700 Subject: net: annotate accesses to sk->sk_incoming_cpu [ Upstream commit 7170a977743b72cf3eb46ef6ef89885dc7ad3621 ] This socket field can be read and written by concurrent cpus. Use READ_ONCE() and WRITE_ONCE() annotations to document this, and avoid some compiler 'optimizations'. KCSAN reported : BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0: sk_incoming_cpu_update include/net/sock.h:953 [inline] tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337 do_softirq kernel/softirq.c:329 [inline] __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189 read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1: sk_incoming_cpu_update include/net/sock.h:952 [inline] tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet Reported-by: syzbot Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/inet_hashtables.c | 2 +- net/ipv4/udp.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 97824864e40d..83fb00153018 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -240,7 +240,7 @@ static inline int compute_score(struct sock *sk, struct net *net, return -1; score = sk->sk_family == PF_INET ? 2 : 1; - if (sk->sk_incoming_cpu == raw_smp_processor_id()) + if (READ_ONCE(sk->sk_incoming_cpu) == raw_smp_processor_id()) score++; } return score; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 5e5d0575a43c..05ecf7f2e647 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -388,7 +388,7 @@ static int compute_score(struct sock *sk, struct net *net, return -1; score += 4; - if (sk->sk_incoming_cpu == raw_smp_processor_id()) + if (READ_ONCE(sk->sk_incoming_cpu) == raw_smp_processor_id()) score++; return score; } -- cgit v1.2.3 From cebdf3fb370d7a213e87f7138217681a13278735 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 24 Oct 2019 11:43:31 -0700 Subject: udp: fix data-race in udp_set_dev_scratch() [ Upstream commit a793183caa9afae907a0d7ddd2ffd57329369bf5 ] KCSAN reported a data-race in udp_set_dev_scratch() [1] The issue here is that we must not write over skb fields if skb is shared. A similar issue has been fixed in commit 89c22d8c3b27 ("net: Fix skb csum races when peeking") While we are at it, use a helper only dealing with udp_skb_scratch(skb)->csum_unnecessary, as this allows udp_set_dev_scratch() to be called once and thus inlined. [1] BUG: KCSAN: data-race in udp_set_dev_scratch / udpv6_recvmsg write to 0xffff888120278317 of 1 bytes by task 10411 on cpu 1: udp_set_dev_scratch+0xea/0x200 net/ipv4/udp.c:1308 __first_packet_length+0x147/0x420 net/ipv4/udp.c:1556 first_packet_length+0x68/0x2a0 net/ipv4/udp.c:1579 udp_poll+0xea/0x110 net/ipv4/udp.c:2720 sock_poll+0xed/0x250 net/socket.c:1256 vfs_poll include/linux/poll.h:90 [inline] do_select+0x7d0/0x1020 fs/select.c:534 core_sys_select+0x381/0x550 fs/select.c:677 do_pselect.constprop.0+0x11d/0x160 fs/select.c:759 __do_sys_pselect6 fs/select.c:784 [inline] __se_sys_pselect6 fs/select.c:769 [inline] __x64_sys_pselect6+0x12e/0x170 fs/select.c:769 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffff888120278317 of 1 bytes by task 10413 on cpu 0: udp_skb_csum_unnecessary include/net/udp.h:358 [inline] udpv6_recvmsg+0x43e/0xe90 net/ipv6/udp.c:310 inet6_recvmsg+0xbb/0x240 net/ipv6/af_inet6.c:592 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 10413 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception") Signed-off-by: Eric Dumazet Reported-by: syzbot Cc: Paolo Abeni Reviewed-by: Paolo Abeni Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/udp.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 05ecf7f2e647..ce0728c4b4aa 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1316,6 +1316,20 @@ static void udp_set_dev_scratch(struct sk_buff *skb) scratch->_tsize_state |= UDP_SKB_IS_STATELESS; } +static void udp_skb_csum_unnecessary_set(struct sk_buff *skb) +{ + /* We come here after udp_lib_checksum_complete() returned 0. + * This means that __skb_checksum_complete() might have + * set skb->csum_valid to 1. + * On 64bit platforms, we can set csum_unnecessary + * to true, but only if the skb is not shared. + */ +#if BITS_PER_LONG == 64 + if (!skb_shared(skb)) + udp_skb_scratch(skb)->csum_unnecessary = true; +#endif +} + static int udp_skb_truesize(struct sk_buff *skb) { return udp_skb_scratch(skb)->_tsize_state & ~UDP_SKB_IS_STATELESS; @@ -1550,10 +1564,7 @@ static struct sk_buff *__first_packet_length(struct sock *sk, *total += skb->truesize; kfree_skb(skb); } else { - /* the csum related bits could be changed, refresh - * the scratch area - */ - udp_set_dev_scratch(skb); + udp_skb_csum_unnecessary_set(skb); break; } } -- cgit v1.2.3 From cac7057f3642dd04889fc27ca0ba63f6d79bde75 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 23 Oct 2019 22:44:49 -0700 Subject: udp: use skb_queue_empty_lockless() [ Upstream commit 137a0dbe3426fd7bcfe3f8117b36a87b3590e4eb ] syzbot reported a data-race [1]. We should use skb_queue_empty_lockless() to document that we are not ensuring a mutual exclusion and silence KCSAN. [1] BUG: KCSAN: data-race in __skb_recv_udp / __udp_enqueue_schedule_skb write to 0xffff888122474b50 of 8 bytes by interrupt on cpu 0: __skb_insert include/linux/skbuff.h:1852 [inline] __skb_queue_before include/linux/skbuff.h:1958 [inline] __skb_queue_tail include/linux/skbuff.h:1991 [inline] __udp_enqueue_schedule_skb+0x2c1/0x410 net/ipv4/udp.c:1470 __udp_queue_rcv_skb net/ipv4/udp.c:1940 [inline] udp_queue_rcv_one_skb+0x7bd/0xc70 net/ipv4/udp.c:2057 udp_queue_rcv_skb+0xb5/0x400 net/ipv4/udp.c:2074 udp_unicast_rcv_skb.isra.0+0x7e/0x1c0 net/ipv4/udp.c:2233 __udp4_lib_rcv+0xa44/0x17c0 net/ipv4/udp.c:2300 udp_rcv+0x2b/0x40 net/ipv4/udp.c:2470 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 read to 0xffff888122474b50 of 8 bytes by task 8921 on cpu 1: skb_queue_empty include/linux/skbuff.h:1494 [inline] __skb_recv_udp+0x18d/0x500 net/ipv4/udp.c:1653 udp_recvmsg+0xe1/0xb10 net/ipv4/udp.c:1712 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 8921 Comm: syz-executor.4 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet Reported-by: syzbot Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/udp.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index ce0728c4b4aa..97310f7a25a8 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1588,7 +1588,7 @@ static int first_packet_length(struct sock *sk) spin_lock_bh(&rcvq->lock); skb = __first_packet_length(sk, rcvq, &total); - if (!skb && !skb_queue_empty(sk_queue)) { + if (!skb && !skb_queue_empty_lockless(sk_queue)) { spin_lock(&sk_queue->lock); skb_queue_splice_tail_init(sk_queue, rcvq); spin_unlock(&sk_queue->lock); @@ -1661,7 +1661,7 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, return skb; } - if (skb_queue_empty(sk_queue)) { + if (skb_queue_empty_lockless(sk_queue)) { spin_unlock_bh(&queue->lock); goto busy_check; } @@ -1687,7 +1687,7 @@ busy_check: break; sk_busy_loop(sk, flags & MSG_DONTWAIT); - } while (!skb_queue_empty(sk_queue)); + } while (!skb_queue_empty_lockless(sk_queue)); /* sk_queue is empty, reader_queue may contain peeked packets */ } while (timeo && -- cgit v1.2.3 From 9fcb07822f7df2d9dd0d461d1ddd571ce3119959 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 23 Oct 2019 22:44:50 -0700 Subject: net: use skb_queue_empty_lockless() in poll() handlers [ Upstream commit 3ef7cf57c72f32f61e97f8fa401bc39ea1f1a5d4 ] Many poll() handlers are lockless. Using skb_queue_empty_lockless() instead of skb_queue_empty() is more appropriate. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/tcp.c | 2 +- net/ipv4/udp.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 61082065b26a..0ba28b2ea84a 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -584,7 +584,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait) } /* This barrier is coupled with smp_wmb() in tcp_reset() */ smp_rmb(); - if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue)) + if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue)) mask |= EPOLLERR; return mask; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 97310f7a25a8..5487b43b8a56 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2723,7 +2723,7 @@ __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait) __poll_t mask = datagram_poll(file, sock, wait); struct sock *sk = sock->sk; - if (!skb_queue_empty(&udp_sk(sk)->reader_queue)) + if (!skb_queue_empty_lockless(&udp_sk(sk)->reader_queue)) mask |= EPOLLIN | EPOLLRDNORM; /* Check for false positives due to checksum errors */ -- cgit v1.2.3 From ddd19f01faaa00776f9ce7fd9ffc6dfae5db7442 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 23 Oct 2019 22:44:51 -0700 Subject: net: use skb_queue_empty_lockless() in busy poll contexts [ Upstream commit 3f926af3f4d688e2e11e7f8ed04e277a14d4d4a4 ] Busy polling usually runs without locks. Let's use skb_queue_empty_lockless() instead of skb_queue_empty() Also uses READ_ONCE() in __skb_try_recv_datagram() to address a similar potential problem. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net/ipv4') diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 0ba28b2ea84a..cf79ab96c2df 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1961,7 +1961,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, if (unlikely(flags & MSG_ERRQUEUE)) return inet_recv_error(sk, msg, len, addr_len); - if (sk_can_busy_loop(sk) && skb_queue_empty(&sk->sk_receive_queue) && + if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue) && (sk->sk_state == TCP_ESTABLISHED)) sk_busy_loop(sk, nonblock); -- cgit v1.2.3 From 5c39c22dcd0e7099629dc893f4b452392516a7d2 Mon Sep 17 00:00:00 2001 From: Paolo Abeni Date: Sat, 26 Oct 2019 11:53:39 +0200 Subject: ipv4: fix route update on metric change. [ Upstream commit 0b834ba00ab5337e938c727e216e1f5249794717 ] Since commit af4d768ad28c ("net/ipv4: Add support for specifying metric of connected routes"), when updating an IP address with a different metric, the associated connected route is updated, too. Still, the mentioned commit doesn't handle properly some corner cases: $ ip addr add dev eth0 192.168.1.0/24 $ ip addr add dev eth0 192.168.2.1/32 peer 192.168.2.2 $ ip addr add dev eth0 192.168.3.1/24 $ ip addr change dev eth0 192.168.1.0/24 metric 10 $ ip addr change dev eth0 192.168.2.1/32 peer 192.168.2.2 metric 10 $ ip addr change dev eth0 192.168.3.1/24 metric 10 $ ip -4 route 192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.0 192.168.2.2 dev eth0 proto kernel scope link src 192.168.2.1 192.168.3.0/24 dev eth0 proto kernel scope link src 192.168.2.1 metric 10 Only the last route is correctly updated. The problem is the current test in fib_modify_prefix_metric(): if (!(dev->flags & IFF_UP) || ifa->ifa_flags & (IFA_F_SECONDARY | IFA_F_NOPREFIXROUTE) || ipv4_is_zeronet(prefix) || prefix == ifa->ifa_local || ifa->ifa_prefixlen == 32) Which should be the logical 'not' of the pre-existing test in fib_add_ifaddr(): if (!ipv4_is_zeronet(prefix) && !(ifa->ifa_flags & IFA_F_SECONDARY) && (prefix != addr || ifa->ifa_prefixlen < 32)) To properly negate the original expression, we need to change the last logical 'or' to a logical 'and'. Fixes: af4d768ad28c ("net/ipv4: Add support for specifying metric of connected routes") Reported-and-suggested-by: Beniamino Galvani Signed-off-by: Paolo Abeni Reviewed-by: David Ahern Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/fib_frontend.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net/ipv4') diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index e8bc939b56dd..fb4162943fae 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -1147,7 +1147,7 @@ void fib_modify_prefix_metric(struct in_ifaddr *ifa, u32 new_metric) if (!(dev->flags & IFF_UP) || ifa->ifa_flags & (IFA_F_SECONDARY | IFA_F_NOPREFIXROUTE) || ipv4_is_zeronet(prefix) || - prefix == ifa->ifa_local || ifa->ifa_prefixlen == 32) + (prefix == ifa->ifa_local && ifa->ifa_prefixlen == 32)) return; /* add the new */ -- cgit v1.2.3 From dc9df3fc9d80347ba36f7a0fdaa7fca9636cbdb6 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Sat, 19 Oct 2019 09:26:37 -0700 Subject: ipv4: fix IPSKB_FRAG_PMTU handling with fragmentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [ Upstream commit e7a409c3f46cb0dbc7bfd4f6f9421d53e92614a5 ] This patch removes the iph field from the state structure, which is not properly initialized. Instead, add a new field to make the "do we want to set DF" be the state bit and move the code to set the DF flag from ip_frag_next(). Joint work with Pablo and Linus. Fixes: 19c3401a917b ("net: ipv4: place control buffer handling away from fragmentation iterators") Reported-by: Patrick Schönthaler Signed-off-by: Eric Dumazet Signed-off-by: Pablo Neira Ayuso Signed-off-by: Linus Torvalds Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/ip_output.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'net/ipv4') diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index da521790cd63..be64c4a210fa 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -645,11 +645,12 @@ void ip_fraglist_prepare(struct sk_buff *skb, struct ip_fraglist_iter *iter) EXPORT_SYMBOL(ip_fraglist_prepare); void ip_frag_init(struct sk_buff *skb, unsigned int hlen, - unsigned int ll_rs, unsigned int mtu, + unsigned int ll_rs, unsigned int mtu, bool DF, struct ip_frag_state *state) { struct iphdr *iph = ip_hdr(skb); + state->DF = DF; state->hlen = hlen; state->ll_rs = ll_rs; state->mtu = mtu; @@ -668,9 +669,6 @@ static void ip_frag_ipcb(struct sk_buff *from, struct sk_buff *to, /* Copy the flags to each fragment. */ IPCB(to)->flags = IPCB(from)->flags; - if (IPCB(from)->flags & IPSKB_FRAG_PMTU) - state->iph->frag_off |= htons(IP_DF); - /* ANK: dirty, but effective trick. Upgrade options only if * the segment to be fragmented was THE FIRST (otherwise, * options are already fixed) and make it ONCE @@ -738,6 +736,8 @@ struct sk_buff *ip_frag_next(struct sk_buff *skb, struct ip_frag_state *state) */ iph = ip_hdr(skb2); iph->frag_off = htons((state->offset >> 3)); + if (state->DF) + iph->frag_off |= htons(IP_DF); /* * Added AC : If we are fragmenting a fragment that's not the @@ -881,7 +881,8 @@ slow_path: * Fragment the datagram. */ - ip_frag_init(skb, hlen, ll_rs, mtu, &state); + ip_frag_init(skb, hlen, ll_rs, mtu, IPCB(skb)->flags & IPSKB_FRAG_PMTU, + &state); /* * Keep copying data until we run out. -- cgit v1.2.3 From 95753984d8751580180ab0f06e6d3c2d0512da15 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 16 Oct 2019 18:00:56 -0700 Subject: net: ensure correct skb->tstamp in various fragmenters [ Upstream commit 9669fffc1415bb0c30e5d2ec98a8e1c3a418cb9c ] Thomas found that some forwarded packets would be stuck in FQ packet scheduler because their skb->tstamp contained timestamps far in the future. We thought we addressed this point in commit 8203e2d844d3 ("net: clear skb->tstamp in forwarding paths") but there is still an issue when/if a packet needs to be fragmented. In order to meet EDT requirements, we have to make sure all fragments get the original skb->tstamp. Note that this original skb->tstamp should be zero in forwarding path, but might have a non zero value in output path if user decided so. Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC") Signed-off-by: Eric Dumazet Reported-by: Thomas Bartschies Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman --- net/ipv4/ip_output.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'net/ipv4') diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index be64c4a210fa..e780ceab16e1 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -771,6 +771,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, struct rtable *rt = skb_rtable(skb); unsigned int mtu, hlen, ll_rs; struct ip_fraglist_iter iter; + ktime_t tstamp = skb->tstamp; struct ip_frag_state state; int err = 0; @@ -846,6 +847,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, ip_fraglist_prepare(skb, &iter); } + skb->tstamp = tstamp; err = output(net, sk, skb); if (!err) @@ -901,6 +903,7 @@ slow_path: /* * Put this fragment into the sending queue. */ + skb2->tstamp = tstamp; err = output(net, sk, skb2); if (err) goto fail; -- cgit v1.2.3