linux-toradex.git/net, branch v3.18.18

sctp: Fix race between OOTB responce and route removal

2015-07-05T14:13:00+00:00

[ Upstream commit 29c4afc4e98f4dc0ea9df22c631841f9c220b944 ]

There is NULL pointer dereference possible during statistics update if the route
used for OOTB responce is removed at unfortunate time. If the route exists when
we receive OOTB packet and we finally jump into sctp_packet_transmit() to send
ABORT, but in the meantime route is removed under our feet, we take "no_route"
path and try to update stats with IP_INC_STATS(sock_net(asoc->base.sk), ...).

But sctp_ootb_pkt_new() used to prepare responce packet doesn't call
sctp_transport_set_owner() and therefore there is no asoc associated with this
packet. Probably temporary asoc just for OOTB responces is overkill, so just
introduce a check like in all other places in sctp_packet_transmit(), where
"asoc" is dereferenced.

To reproduce this, one needs to
0. ensure that sctp module is loaded (otherwise ABORT is not generated)
1. remove default route on the machine
2. while true; do
     ip route del [interface-specific route]
     ip route add [interface-specific route]
   done
3. send enough OOTB packets (i.e. HB REQs) from another host to trigger ABORT
   responce

On x86_64 the crash looks like this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [] sctp_packet_transmit+0x63c/0x730 [sctp]
PGD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: ...
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O    4.0.5-1-ARCH #1
Hardware name: ...
task: ffffffff818124c0 ti: ffffffff81800000 task.ti: ffffffff81800000
RIP: 0010:[]  [] sctp_packet_transmit+0x63c/0x730 [sctp]
RSP: 0018:ffff880127c037b8  EFLAGS: 00010296
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000015ff66b480
RDX: 00000015ff66b400 RSI: ffff880127c17200 RDI: ffff880123403700
RBP: ffff880127c03888 R08: 0000000000017200 R09: ffffffff814625af
R10: ffffea00047e4680 R11: 00000000ffffff80 R12: ffff8800b0d38a28
R13: ffff8800b0d38a28 R14: ffff8800b3e88000 R15: ffffffffa05f24e0
FS:  0000000000000000(0000) GS:ffff880127c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 00000000c855b000 CR4: 00000000000007f0
Stack:
 ffff880127c03910 ffff8800b0d38a28 ffffffff8189d240 ffff88011f91b400
 ffff880127c03828 ffffffffa05c94c5 0000000000000000 ffff8800baa1c520
 0000000000000000 0000000000000001 0000000000000000 0000000000000000
Call Trace:
 
 [] ? sctp_sf_tabort_8_4_8.isra.20+0x85/0x140 [sctp]
 [] ? sctp_transport_put+0x52/0x80 [sctp]
 [] sctp_do_sm+0xb8c/0x19a0 [sctp]
 [] ? trigger_load_balance+0x90/0x210
 [] ? update_process_times+0x59/0x60
 [] ? timerqueue_add+0x60/0xb0
 [] ? enqueue_hrtimer+0x29/0xa0
 [] ? read_tsc+0x9/0x10
 [] ? put_page+0x55/0x60
 [] ? clockevents_program_event+0x6d/0x100
 [] ? skb_free_head+0x58/0x80
 [] ? chksum_update+0x1b/0x27 [crc32c_generic]
 [] ? crypto_shash_update+0xce/0xf0
 [] sctp_endpoint_bh_rcv+0x113/0x280 [sctp]
 [] sctp_inq_push+0x46/0x60 [sctp]
 [] sctp_rcv+0x880/0x910 [sctp]
 [] ? sctp_packet_transmit_chunk+0xb0/0xb0 [sctp]
 [] ? sctp_csum_update+0x20/0x20 [sctp]
 [] ? ip_route_input_noref+0x235/0xd30
 [] ? ack_ioapic_level+0x7b/0x150
 [] ip_local_deliver_finish+0xae/0x210
 [] ip_local_deliver+0x35/0x90
 [] ip_rcv_finish+0xf5/0x370
 [] ip_rcv+0x2b8/0x3a0
 [] __netif_receive_skb_core+0x763/0xa50
 [] __netif_receive_skb+0x18/0x60
 [] netif_receive_skb_internal+0x40/0xd0
 [] napi_gro_receive+0xe8/0x120
 [] rtl8169_poll+0x2da/0x660 [r8169]
 [] net_rx_action+0x21a/0x360
 [] __do_softirq+0xe1/0x2d0
 [] irq_exit+0xad/0xb0
 [] do_IRQ+0x58/0xf0
 [] common_interrupt+0x6d/0x6d
 
 [] ? hrtimer_start+0x18/0x20
 [] ? sctp_transport_destroy_rcu+0x29/0x30 [sctp]
 [] ? mwait_idle+0x60/0xa0
 [] arch_cpu_idle+0xf/0x20
 [] cpu_startup_entry+0x3ec/0x480
 [] rest_init+0x85/0x90
 [] start_kernel+0x48b/0x4ac
 [] ? early_idt_handlers+0x120/0x120
 [] x86_64_start_reservations+0x2a/0x2c
 [] x86_64_start_kernel+0x161/0x184
Code: 90 48 8b 80 b8 00 00 00 48 89 85 70 ff ff ff 48 83 bd 70 ff ff ff 00 0f 85 cd fa ff ff 48 89 df 31 db e8 18 63 e7 e0 48 8b 45 80 <48> 8b 40 20 48 8b 40 30 48 8b 80 68 01 00 00 65 48 ff 40 78 e9
RIP  [] sctp_packet_transmit+0x63c/0x730 [sctp]
 RSP 
CR2: 0000000000000020
---[ end trace 5aec7fd2dc983574 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
drm_kms_helper: panic occurred, switching back to text console
---[ end Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Alexander Sverdlin 
Acked-by: Neil Horman 
Acked-by: Marcelo Ricardo Leitner 
Acked-by: Vlad Yasevich 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

tcp: Do not call tcp_fastopen_reset_cipher from interrupt context

2015-07-05T14:12:59+00:00

[ Upstream commit dfea2aa654243f70dc53b8648d0bbdeec55a7df1 ]

tcp_fastopen_reset_cipher really cannot be called from interrupt
context. It allocates the tcp_fastopen_context with GFP_KERNEL and
calls crypto_alloc_cipher, which allocates all kind of stuff with
GFP_KERNEL.

Thus, we might sleep when the key-generation is triggered by an
incoming TFO cookie-request which would then happen in interrupt-
context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:

[   36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
[   36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
[   36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
[   36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[   36.008250]  00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
[   36.009630]  ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
[   36.011076]  0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
[   36.012494] Call Trace:
[   36.012953]    [] dump_stack+0x4f/0x6d
[   36.014085]  [] ___might_sleep+0x103/0x170
[   36.015117]  [] __might_sleep+0x52/0x90
[   36.016117]  [] kmem_cache_alloc_trace+0x47/0x190
[   36.017266]  [] ? tcp_fastopen_reset_cipher+0x42/0x130
[   36.018485]  [] tcp_fastopen_reset_cipher+0x42/0x130
[   36.019679]  [] tcp_fastopen_init_key_once+0x61/0x70
[   36.020884]  [] __tcp_fastopen_cookie_gen+0x1c/0x60
[   36.022058]  [] tcp_try_fastopen+0x58f/0x730
[   36.023118]  [] tcp_conn_request+0x3e8/0x7b0
[   36.024185]  [] ? __module_text_address+0x12/0x60
[   36.025327]  [] tcp_v4_conn_request+0x51/0x60
[   36.026410]  [] tcp_rcv_state_process+0x190/0xda0
[   36.027556]  [] ? __inet_lookup_established+0x47/0x170
[   36.028784]  [] tcp_v4_do_rcv+0x16d/0x3d0
[   36.029832]  [] ? security_sock_rcv_skb+0x16/0x20
[   36.030936]  [] tcp_v4_rcv+0x77a/0x7b0
[   36.031875]  [] ? iptable_filter_hook+0x33/0x70
[   36.032953]  [] ip_local_deliver_finish+0x92/0x1f0
[   36.034065]  [] ip_local_deliver+0x9a/0xb0
[   36.035069]  [] ? ip_rcv+0x3d0/0x3d0
[   36.035963]  [] ip_rcv_finish+0x119/0x330
[   36.036950]  [] ip_rcv+0x2e7/0x3d0
[   36.037847]  [] __netif_receive_skb_core+0x552/0x930
[   36.038994]  [] __netif_receive_skb+0x27/0x70
[   36.040033]  [] process_backlog+0xd2/0x1f0
[   36.041025]  [] net_rx_action+0x122/0x310
[   36.042007]  [] __do_softirq+0x103/0x2f0
[   36.042978]  [] do_softirq_own_stack+0x1c/0x30

This patch moves the call to tcp_fastopen_init_key_once to the places
where a listener socket creates its TFO-state, which always happens in
user-context (either from the setsockopt, or implicitly during the
listen()-call)

Cc: Eric Dumazet 
Cc: Hannes Frederic Sowa 
Fixes: 222e83d2e0ae ("tcp: switch tcp_fastopen key generation to net_get_random_once")
Signed-off-by: Christoph Paasch 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

neigh: do not modify unlinked entries

2015-07-05T14:12:59+00:00

[ Upstream commit 2c51a97f76d20ebf1f50fef908b986cb051fdff9 ]

The lockless lookups can return entry that is unlinked.
Sometimes they get reference before last neigh_cleanup_and_release,
sometimes they do not need reference. Later, any
modification attempts may result in the following problems:

1. entry is not destroyed immediately because neigh_update
can start the timer for dead entry, eg. on change to NUD_REACHABLE
state. As result, entry lives for some time but is invisible
and out of control.

2. __neigh_event_send can run in parallel with neigh_destroy
while refcnt=0 but if timer is started and expired refcnt can
reach 0 for second time leading to second neigh_destroy and
possible crash.

Thanks to Eric Dumazet and Ying Xue for their work and analyze
on the __neigh_event_send change.

Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour")
Fixes: a263b3093641 ("ipv4: Make neigh lookups directly in output packet path.")
Fixes: 6fd6ce2056de ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
Cc: Eric Dumazet 
Cc: Ying Xue 
Signed-off-by: Julian Anastasov 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

packet: avoid out of bounds read in round robin fanout

2015-07-05T14:12:58+00:00

[ Upstream commit 468479e6043c84f5a65299cc07cb08a22a28c2b1 ]

PACKET_FANOUT_LB computes f->rr_cur such that it is modulo
f->num_members. It returns the old value unconditionally, but
f->num_members may have changed since the last store. Ensure
that the return value is always < num.

When modifying the logic, simplify it further by replacing the loop
with an unconditional atomic increment.

Fixes: dc99f600698d ("packet: Add fanout support.")
Suggested-by: Eric Dumazet 
Signed-off-by: Willem de Bruijn 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

packet: read num_members once in packet_rcv_fanout()

2015-07-05T14:12:58+00:00

[ Upstream commit f98f4514d07871da7a113dd9e3e330743fd70ae4 ]

We need to tell compiler it must not read f->num_members multiple
times. Otherwise testing if num is not zero is flaky, and we could
attempt an invalid divide by 0 in fanout_demux_cpu()

Note bug was present in packet_rcv_fanout_hash() and
packet_rcv_fanout_lb() but final 3.1 had a simple location
after commit 95ec3eb417115fb ("packet: Add 'cpu' fanout policy.")

Fixes: dc99f600698dc ("packet: Add fanout support.")
Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

bridge: fix br_stp_set_bridge_priority race conditions

2015-07-05T14:12:58+00:00

[ Upstream commit 2dab80a8b486f02222a69daca6859519e05781d9 ]

After the ->set() spinlocks were removed br_stp_set_bridge_priority
was left running without any protection when used via sysfs. It can
race with port add/del and could result in use-after-free cases and
corrupted lists. Tested by running port add/del in a loop with stp
enabled while setting priority in a loop, crashes are easily
reproducible.
The spinlocks around sysfs ->set() were removed in commit:
14f98f258f19 ("bridge: range check STP parameters")
There's also a race condition in the netlink priority support that is
fixed by this change, but it was introduced recently and the fixes tag
covers it, just in case it's needed the commit is:
af615762e972 ("bridge: add ageing_time, stp_state, priority over netlink")

Signed-off-by: Nikolay Aleksandrov 
Fixes: 14f98f258f19 ("bridge: range check STP parameters")
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

sctp: fix ASCONF list handling

2015-07-05T14:12:58+00:00

[ Upstream commit 2d45a02d0166caf2627fe91897c6ffc3b19514c4 ]

->auto_asconf_splist is per namespace and mangled by functions like
sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.

Also, the call to inet_sk_copy_descendant() was backuping
->auto_asconf_list through the copy but was not honoring
->do_auto_asconf, which could lead to list corruption if it was
different between both sockets.

This commit thus fixes the list handling by using ->addr_wq_lock
spinlock to protect the list. A special handling is done upon socket
creation and destruction for that. Error handlig on sctp_init_sock()
will never return an error after having initialized asconf, so
sctp_destroy_sock() can be called without addrq_wq_lock. The lock now
will be take on sctp_close_sock(), before locking the socket, so we
don't do it in inverse order compared to sctp_addr_wq_timeout_handler().

Instead of taking the lock on sctp_sock_migrate() for copying and
restoring the list values, it's preferred to avoid rewritting it by
implementing sctp_copy_descendant().

Issue was found with a test application that kept flipping sysctl
default_auto_asconf on and off, but one could trigger it by issuing
simultaneous setsockopt() calls on multiple sockets or by
creating/destroying sockets fast enough. This is only triggerable
locally.

Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).")
Reported-by: Ji Jianwen 
Suggested-by: Neil Horman 
Suggested-by: Hannes Frederic Sowa 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: Marcelo Ricardo Leitner 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

net: don't wait for order-3 page allocation

2015-07-05T14:12:57+00:00

[ Upstream commit fb05e7a89f500cfc06ae277bdc911b281928995d ]

We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and add latency. Commit 5640f7685831e0
introduces the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but to improve performance. But direct memory
compaction has high overhead. The benefit of order-3 allocation can't
compensate the overhead of direct memory compaction.

This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the alloction will still success, so we
don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, skb_page_frag_refill will
fallback to order-0 immediately, hence the direct memory compaction overhead is
avoided. In the allocation failure case, kswapd is waken up and doing
compaction, so chances are allocation could success next time.

alloc_skb_with_frags is the same.

The mellanox driver does similar thing, if this is accepted, we must fix
the driver too.

V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer

Cc: Eric Dumazet 
Cc: Chris Mason 
Cc: Debabrata Banerjee 
Signed-off-by: Shaohua Li 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

bridge: fix multicast router rlist endless loop

2015-07-05T14:12:57+00:00

[ Upstream commit 1a040eaca1a22f8da8285ceda6b5e4a2cb704867 ]

Since the addition of sysfs multicast router support if one set
multicast_router to "2" more than once, then the port would be added to
the hlist every time and could end up linking to itself and thus causing an
endless loop for rlist walkers.
So to reproduce just do:
echo 2 > multicast_router; echo 2 > multicast_router;
in a bridge port and let some igmp traffic flow, for me it hangs up
in br_multicast_flood().
Fix this by adding a check in br_multicast_add_router() if the port is
already linked.
The reason this didn't happen before the addition of multicast_router
sysfs entries is because there's a !hlist_unhashed check that prevents
it.

Signed-off-by: Nikolay Aleksandrov 
Fixes: 0909e11758bd ("bridge: Add multicast_router sysfs entries")
Acked-by: Herbert Xu 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

netfilter: nf_qeueue: Drop queue entries on nf_unregister_hook

2015-07-05T14:12:48+00:00

[ Upstream commit 8405a8fff3f8545c888a872d6e3c0c8eecd4d348 ]

Add code to nf_unregister_hook to flush the nf_queue when a hook is
unregistered.  This guarantees that the pointer that the nf_queue code
retains into the nf_hook list will remain valid while a packet is
queued.

I tested what would happen if we do not flush queued packets and was
trivially able to obtain the oops below.  All that was required was
to stop the nf_queue listening process, to delete all of the nf_tables,
and to awaken the nf_queue listening process.

> BUG: unable to handle kernel paging request at 0000000100000001
> IP: [<0000000100000001>] 0x100000001
> PGD b9c35067 PUD 0
> Oops: 0010 [#1] SMP
> Modules linked in:
> CPU: 0 PID: 519 Comm: lt-nfqnl_test Not tainted
> task: ffff8800b9c8c050 ti: ffff8800ba9d8000 task.ti: ffff8800ba9d8000
> RIP: 0010:[<0000000100000001>]  [<0000000100000001>] 0x100000001
> RSP: 0018:ffff8800ba9dba40  EFLAGS: 00010a16
> RAX: ffff8800bab48a00 RBX: ffff8800ba9dba90 RCX: ffff8800ba9dba90
> RDX: ffff8800b9c10128 RSI: ffff8800ba940900 RDI: ffff8800bab48a00
> RBP: ffff8800b9c10128 R08: ffffffff82976660 R09: ffff8800ba9dbb28
> R10: dead000000100100 R11: dead000000200200 R12: ffff8800ba940900
> R13: ffffffff8313fd50 R14: ffff8800b9c95200 R15: 0000000000000000
> FS:  00007fb91fc34700(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000100000001 CR3: 00000000babfb000 CR4: 00000000000007f0
> Stack:
>  ffffffff8206ab0f ffffffff82982240 ffff8800bab48a00 ffff8800b9c100a8
>  ffff8800b9c10100 0000000000000001 ffff8800ba940900 ffff8800b9c10128
>  ffffffff8206bd65 ffff8800bfb0d5e0 ffff8800bab48a00 0000000000014dc0
> Call Trace:
>  [] ? nf_iterate+0x4f/0xa0
>  [] ? nf_reinject+0x125/0x190
>  [] ? nfqnl_recv_verdict+0x255/0x360
>  [] ? nla_parse+0x80/0xf0
>  [] ? nfnetlink_rcv_msg+0x13c/0x240
>  [] ? __memcg_kmem_get_cache+0x4c/0x150
>  [] ? nfnl_lock+0x20/0x20
>  [] ? netlink_rcv_skb+0xa9/0xc0
>  [] ? netlink_unicast+0x12f/0x1c0
>  [] ? netlink_sendmsg+0x28e/0x650
>  [] ? sock_sendmsg+0x44/0x50
>  [] ? ___sys_sendmsg+0x2ab/0x2c0
>  [] ? __wake_up+0x43/0x70
>  [] ? tty_write+0x1c4/0x2a0
>  [] ? __sys_sendmsg+0x44/0x80
>  [] ? system_call_fastpath+0x12/0x6a
> Code:  Bad RIP value.
> RIP  [<0000000100000001>] 0x100000001
>  RSP 
> CR2: 0000000100000001
> ---[ end trace 08eb65d42362793f ]---

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin