linux-toradex.git/net/netlink, branch v4.2.7

netlink: fix locking around NETLINK_LIST_MEMBERSHIPS

2015-12-09T19:31:03+00:00

[ Upstream commit 47191d65b647af5eb5c82ede70ed4c24b1e93ef4 ]

Currently, NETLINK_LIST_MEMBERSHIPS grabs the netlink table while copying
the membership state to user-space. However, grabing the netlink table is
effectively a write_lock_irq(), and as such we should not be triggering
page-faults in the critical section.

This can be easily reproduced by the following snippet:
    int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    void *p = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
    int r = getsockopt(s, 0x10e, 9, p, (void*)((char*)p + 4092));

This should work just fine, but currently triggers EFAULT and a possible
WARN_ON below handle_mm_fault().

Fix this by reducing locking of NETLINK_LIST_MEMBERSHIPS to a read-side
lock. The write-lock was overkill in the first place, and the read-lock
allows page-faults just fine.

Reported-by: Dmitry Vyukov 
Signed-off-by: David Herrmann 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

netlink: Trim skb to alloc size to avoid MSG_TRUNC

2015-10-27T00:53:37+00:00

[ Upstream commit db65a3aaf29ecce2e34271d52e8d2336b97bd9fe ]

netlink_dump() allocates skb based on the calculated min_dump_alloc or
a per socket max_recvmsg_len.
min_alloc_size is maximum space required for any single netdev
attributes as calculated by rtnl_calcit().
max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
It is capped at 16KiB.
The intention is to avoid small allocations and to minimize the number
of calls required to obtain dump information for all net devices.

netlink_dump packs as many small messages as could fit within an skb
that was sized for the largest single netdev information. The actual
space available within an skb is larger than what is requested. It could
be much larger and up to near 2x with align to next power of 2 approach.

Allowing netlink_dump to use all the space available within the
allocated skb increases the buffer size a user has to provide to avoid
truncaion (i.e. MSG_TRUNG flag set).

It was observed that with many VLANs configured on at least one netdev,
a larger buffer of near 64KiB was necessary to avoid "Message truncated"
error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
min_alloc_size was only little over 32KiB.

This patch trims skb to allocated size in order to allow the user to
avoid truncation with more reasonable buffer size.

Signed-off-by: Ronen Arad 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

netlink: Replace rhash_portid with bound

2015-10-03T11:51:38+00:00

[ Upstream commit da314c9923fed553a007785a901fd395b7eb6c19 ]

On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
>
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo 
Reported-by: Linus Torvalds 
Signed-off-by: Herbert Xu 
Nacked-by: Tejun Heo 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

netlink: Fix autobind race condition that leads to zero port ID

2015-10-03T11:51:37+00:00

[ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ]

The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo 
Reported-by: Linus Torvalds 
Signed-off-by: Herbert Xu 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

netlink, mmap: transform mmap skb into full skb on taps

2015-10-03T11:51:36+00:00

[ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]

Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmaped.

I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
skb has it pointed to netlink_skb_destructor(), set in the handler
netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
that in such cases, __kfree_skb() doesn't perform a skb_release_data()
via skb_release_all(), where skb->head is possibly being freed through
kfree(head) into slab allocator, although netlink mmap skb->head points
to the mmap buffer. Similarly, the same has to be done also for large
netlink skbs where the data area is vmalloced. Therefore, as discussed,
make a copy for these rather rare cases for now. This fixes the issue
on my and Ken-ichirou's test-cases.

Reference: http://thread.gmane.org/gmane.linux.network/371129
Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
Reported-by: Ken-ichirou MATSUZAWA 
Signed-off-by: Daniel Borkmann 
Tested-by: Ken-ichirou MATSUZAWA 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

netlink: mmap: fix tx type check

2015-08-23T23:04:46+00:00

I can't send netlink message via mmaped netlink socket since

    commit: a8866ff6a5bce7d0ec465a63bc482a85c09b0d39
    netlink: make the check for "send from tx_ring" deterministic

msg->msg_iter.type is set to WRITE (1) at

    SYSCALL_DEFINE6(sendto, ...
        import_single_range(WRITE, ...
            iov_iter_init(1, WRITE, ...

call path, so that we need to check the type by iter_is_iovec()
to accept the WRITE.

Signed-off-by: Ken-ichirou MATSUZAWA 
Signed-off-by: David S. Miller

netlink: make sure -EBUSY won't escape from netlink_insert

2015-08-10T17:59:10+00:00

Linus reports the following deadlock on rtnl_mutex; triggered only
once so far (extract):

[12236.694209] NetworkManager  D 0000000000013b80     0  1047      1 0x00000000
[12236.694218]  ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018
[12236.694224]  ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff
[12236.694230]  ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80
[12236.694235] Call Trace:
[12236.694250]  [] ? schedule_preempt_disabled+0x9/0x10
[12236.694257]  [] ? schedule+0x2a/0x70
[12236.694263]  [] ? schedule_preempt_disabled+0x9/0x10
[12236.694271]  [] ? __mutex_lock_slowpath+0x7f/0xf0
[12236.694280]  [] ? mutex_lock+0x16/0x30
[12236.694291]  [] ? rtnetlink_rcv+0x10/0x30
[12236.694299]  [] ? netlink_unicast+0xfb/0x180
[12236.694309]  [] ? rtnl_getlink+0x113/0x190
[12236.694319]  [] ? rtnetlink_rcv_msg+0x7a/0x210
[12236.694331]  [] ? sock_has_perm+0x5c/0x70
[12236.694339]  [] ? rtnetlink_rcv+0x30/0x30
[12236.694346]  [] ? netlink_rcv_skb+0x9c/0xc0
[12236.694354]  [] ? rtnetlink_rcv+0x1f/0x30
[12236.694360]  [] ? netlink_unicast+0xfb/0x180
[12236.694367]  [] ? netlink_sendmsg+0x484/0x5d0
[12236.694376]  [] ? __wake_up+0x2f/0x50
[12236.694387]  [] ? sock_sendmsg+0x33/0x40
[12236.694396]  [] ? ___sys_sendmsg+0x22e/0x240
[12236.694405]  [] ? ___sys_recvmsg+0x135/0x1a0
[12236.694415]  [] ? eventfd_write+0x82/0x210
[12236.694423]  [] ? fsnotify+0x32e/0x4c0
[12236.694429]  [] ? wake_up_q+0x60/0x60
[12236.694434]  [] ? __sys_sendmsg+0x39/0x70
[12236.694440]  [] ? entry_SYSCALL_64_fastpath+0x12/0x6a

It seems so far plausible that the recursive call into rtnetlink_rcv()
looks suspicious. One way, where this could trigger is that the senders
NETLINK_CB(skb).portid was wrongly 0 (which is rtnetlink socket), so
the rtnl_getlink() request's answer would be sent to the kernel instead
to the actual user process, thus grabbing rtnl_mutex() twice.

One theory would be that netlink_autobind() triggered via netlink_sendmsg()
internally overwrites the -EBUSY error to 0, but where it is wrongly
originating from __netlink_insert() instead. That would reset the
socket's portid to 0, which is then filled into NETLINK_CB(skb).portid
later on. As commit d470e3b483dc ("[NETLINK]: Fix two socket hashing bugs.")
also puts it, -EBUSY should not be propagated from netlink_insert().

It looks like it's very unlikely to reproduce. We need to trigger the
rhashtable_insert_rehash() handler under a situation where rehashing
currently occurs (one /rare/ way would be to hit ht->elasticity limits
while not filled enough to expand the hashtable, but that would rather
require a specifically crafted bind() sequence with knowledge about
destination slots, seems unlikely). It probably makes sense to guard
__netlink_insert() in any case and remap that error. It was suggested
that EOVERFLOW might be better than an already overloaded ENOMEM.

Reference: http://thread.gmane.org/gmane.linux.network/372676
Reported-by: Linus Torvalds 
Signed-off-by: Daniel Borkmann 
Acked-by: Herbert Xu 
Acked-by: Thomas Graf 
Signed-off-by: David S. Miller

netlink: don't hold mutex in rcu callback when releasing mmapd ring

2015-07-22T05:22:56+00:00

Kirill A. Shutemov says:

This simple test-case trigers few locking asserts in kernel:

int main(int argc, char **argv)
{
        unsigned int block_size = 16 * 4096;
        struct nl_mmap_req req = {
                .nm_block_size          = block_size,
                .nm_block_nr            = 64,
                .nm_frame_size          = 16384,
                .nm_frame_nr            = 64 * block_size / 16384,
        };
        unsigned int ring_size;
	int fd;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
                exit(1);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
                exit(1);

	ring_size = req.nm_block_nr * req.nm_block_size;
	mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	return 0;
}

+++ exited with 0 +++
BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
 #0:  (reboot_mutex){+.+...}, at: [] SyS_reboot+0xa9/0x220
 #1:  ((reboot_notifier_list).rwsem){.+.+..}, at: [] __blocking_notifier_call_chain+0x39/0x70
 #2:  (rcu_callback){......}, at: [] rcu_do_batch.isra.49+0x160/0x10c0
Preemption disabled at:[] __delay+0xf/0x20

CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
 ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
 0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
 ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
Call Trace:
   [] dump_stack+0x4f/0x7b
 [] ___might_sleep+0x16d/0x270
 [] __might_sleep+0x4d/0x90
 [] mutex_lock_nested+0x2f/0x430
 [] ? _raw_spin_unlock_irqrestore+0x5d/0x80
 [] ? __this_cpu_preempt_check+0x13/0x20
 [] netlink_set_ring+0x1ed/0x350
 [] ? netlink_undo_bind+0x70/0x70
 [] netlink_sock_destruct+0x80/0x150
 [] __sk_free+0x1d/0x160
 [] sk_free+0x19/0x20
[..]

Cong Wang says:

We can't hold mutex lock in a rcu callback, [..]

Thomas Graf says:

The socket should be dead at this point. It might be simpler to
add a netlink_release_ring() function which doesn't require
locking at all.

Reported-by: "Kirill A. Shutemov" 
Diagnosed-by: Cong Wang 
Suggested-by: Thomas Graf 
Signed-off-by: Florian Westphal 
Signed-off-by: David S. Miller

netlink: Delete an unnecessary check before the function call "module_put"

2015-07-03T16:27:43+00:00

The module_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
Signed-off-by: David S. Miller

netlink: add API to retrieve all group memberships

2015-06-21T17:18:18+00:00

This patch adds getsockopt(SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS) to
retrieve all groups a socket is a member of. Currently, we have to use
getsockname() and look at the nl.nl_groups bitmask. However, this mask is
limited to 32 groups. Hence, similar to NETLINK_ADD_MEMBERSHIP and
NETLINK_DROP_MEMBERSHIP, this adds a separate sockopt to manager higher
groups IDs than 32.

This new NETLINK_LIST_MEMBERSHIPS option takes a pointer to __u32 and the
size of the array. The array is filled with the full membership-set of the
socket, and the required array size is returned in optlen. Hence,
user-space can retry with a properly sized array in case it was too small.

Signed-off-by: David Herrmann 
Signed-off-by: David S. Miller