<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/net/sock.h, branch v4.7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>dccp: limit sk_filter trim to payload</title>
<updated>2016-07-13T18:53:41+00:00</updated>
<author>
<name>Willem de Bruijn</name>
<email>willemb@google.com</email>
</author>
<published>2016-07-12T22:18:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4f0c40d94461cfd23893a17335b2ab78ecb333c8'/>
<id>4f0c40d94461cfd23893a17335b2ab78ecb333c8</id>
<content type='text'>
DCCP verifies packet integrity, including length, at initial receive
in dccp_invalid_packet, and later pulls headers in dccp_enqueue_skb.

A call to sk_filter in between can cause __skb_pull to wrap
skb-&gt;len. skb_copy_datagram_msg interprets this as a negative
value, so it (correctly) fails with EFAULT. The negative length is
also reported by ioctl SIOCINQ, or possibly in a DCCP_WARN in
dccp_close.

Introduce an sk_receive_skb variant that caps how short a filter
program may trim packets, and call this in dccp with the header
length. Excessively trimmed packets are now processed normally and
queued for reception as zero-byte payloads.
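
A minimal sketch of the resulting interface, assuming the helper
names introduced by this commit (sk_filter_trim_cap and the
__sk_receive_skb variant); treat the bodies as an approximation:

        /* cap is the minimum length a socket filter may trim to */
        int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb,
                               unsigned int cap);

        int __sk_receive_skb(struct sock *sk, struct sk_buff *skb,
                             const int nested, unsigned int trim_cap);

        /* existing callers keep the old behaviour (cap of 1) */
        static inline int sk_receive_skb(struct sock *sk,
                                         struct sk_buff *skb,
                                         const int nested)
        {
                return __sk_receive_skb(sk, skb, nested, 1);
        }

        /* dccp_v4_rcv(): never trim below the header length */
        return __sk_receive_skb(sk, skb, 1, dh-&gt;dccph_doff * 4);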

Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Signed-off-by: Willem de Bruijn &lt;willemb@google.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Dccp verifies packet integrity, including length, at initial rcv in
dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.

A call to sk_filter in-between can cause __skb_pull to wrap skb-&gt;len.
skb_copy_datagram_msg interprets this as a negative value, so
(correctly) fails with EFAULT. The negative length is reported in
ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.

Introduce an sk_receive_skb variant that caps how small a filter
program can trim packets, and call this in dccp with the header
length. Excessively trimmed packets are now processed normally and
queued for reception as 0B payloads.

Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
Signed-off-by: Willem de Bruijn &lt;willemb@google.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: sock: move -&gt;sk_shutdown out of bitfields.</title>
<updated>2016-05-20T22:05:32+00:00</updated>
<author>
<name>Andrey Ryabinin</name>
<email>aryabinin@virtuozzo.com</email>
</author>
<published>2016-05-18T16:19:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fc64869c48494a401b1fb627c9ecc4e6c1d74b0d'/>
<id>fc64869c48494a401b1fb627c9ecc4e6c1d74b0d</id>
<content type='text'>
-&gt;sk_shutdown shares one bitfield with several other bits in struct
sock, such as -&gt;sk_no_check_[r,t]x and -&gt;sk_userlocks.
sock_setsockopt() may write to these bits while holding the socket
lock.

In the case of AF_UNIX sockets, we change the -&gt;sk_shutdown bits
while holding only unix_state_lock(), so concurrent setsockopt() and
shutdown() calls may corrupt these bits.

Fix this by moving -&gt;sk_shutdown out of the bitfield into a
separate byte. This does not change the size of struct sock, since
-&gt;sk_shutdown moves into a previously unused 16-bit hole.
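
A rough sketch of the layout change; the field names follow the
struct sock bitfield named above, but the exact neighbours and hole
placement are an assumption:

        /* before: sk_shutdown shared one word with bits that
         * sock_setsockopt() rewrites under the socket lock, so an
         * unlocked read-modify-write could clobber its neighbours
         */
        unsigned int    sk_shutdown    : 2,
                        sk_no_check_tx : 1,
                        sk_no_check_rx : 1,
                        sk_userlocks   : 4,
                        sk_protocol    : 8,
                        sk_type        : 16;

        /* after: a plain byte in a previously unused hole, so a
         * store to it cannot touch adjacent fields
         */
        u8              sk_shutdown;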

Signed-off-by: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Suggested-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
-&gt;sk_shutdown bits share one bitfield with some other bits in sock struct,
such as -&gt;sk_no_check_[r,t]x, -&gt;sk_userlocks ...
sock_setsockopt() may write to these bits, while holding the socket lock.

In case of AF_UNIX sockets, we change -&gt;sk_shutdown bits while holding only
unix_state_lock(). So concurrent setsockopt() and shutdown() may lead
to corrupting these bits.

Fix this by moving -&gt;sk_shutdown bits out of bitfield into a separate byte.
This will not change the 'struct sock' size since -&gt;sk_shutdown moved into
previously unused 16-bit hole.

Signed-off-by: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Suggested-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: fix lockdep splat in tcp_snd_una_update()</title>
<updated>2016-05-04T20:55:11+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-05-03T23:56:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=46cc6e4976e3d9058490f20d93bc7805f7f2d81e'/>
<id>46cc6e4976e3d9058490f20d93bc7805f7f2d81e</id>
<content type='text'>
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or from a BH
handler.

This triggers a lockdep splat on 32-bit SMP builds.

We could add a u64_stats_update_begin_bh() variant, but this would
slow down 32-bit builds with useless local_bh_disable() and
local_bh_enable() pairs, since we already own the socket lock at
this point.

I add a sock_owned_by_me() helper to get proper lockdep support even
on 64-bit builds, and new u64_stats_update_begin_raw() and
u64_stats_update_end_raw() methods.
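
A sketch of the new helpers, assuming the names above; the bodies
follow the commit description rather than a verified copy:

        static inline void sock_owned_by_me(const struct sock *sk)
        {
        #ifdef CONFIG_LOCKDEP
                WARN_ON_ONCE(!lockdep_sock_is_held(sk));
        #endif
        }

        /* no-op on 64-bit; on 32-bit SMP, the seqcount is taken
         * without BH games, which is safe here because the socket
         * lock already serializes writers
         */
        static inline void
        u64_stats_update_begin_raw(struct u64_stats_sync *syncp)
        {
        #if BITS_PER_LONG == 32 &amp;&amp; defined(CONFIG_SMP)
                raw_write_seqcount_begin(&amp;syncp-&gt;seq);
        #endif
        }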

Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam &lt;festevam@gmail.com&gt;
Diagnosed-by: Francois Romieu &lt;romieu@fr.zoreil.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Tested-by: Fabio Estevam &lt;fabio.estevam@nxp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.

This triggers a lockdep splat on 32bit &amp; SMP builds.

We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.

I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.

Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam &lt;festevam@gmail.com&gt;
Diagnosed-by: Francois Romieu &lt;romieu@fr.zoreil.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Tested-by: Fabio Estevam &lt;fabio.estevam@nxp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: add __sock_wfree() helper</title>
<updated>2016-05-03T20:02:36+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-05-02T17:56:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1d2077ac0165c0d173a2255e37cf4dc5033d92c7'/>
<id>1d2077ac0165c0d173a2255e37cf4dc5033d92c7</id>
<content type='text'>
Hosts sending lots of ACK packets exhibit high sock_wfree() cost
because of a cache line miss when testing SOCK_USE_WRITE_QUEUE.

We could move this flag close to sk_wmem_alloc, but it is better to
perform the atomic_sub_and_test() on a clean cache line, as that
avoids one extra bus transaction.

skb_orphan_partial() can also take a fast path for packets that
either are TCP ACKs or already went through another
skb_orphan_partial().
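
A minimal sketch of the helper, assuming it is only installed on
skbs whose socket is known to set SOCK_USE_WRITE_QUEUE, so the flag
test can be skipped entirely:

        /* variant of sock_wfree() that only performs the
         * atomic_sub_and_test() on sk_wmem_alloc, on a clean
         * cache line
         */
        void __sock_wfree(struct sk_buff *skb)
        {
                struct sock *sk = skb-&gt;sk;

                if (atomic_sub_and_test(skb-&gt;truesize,
                                        &amp;sk-&gt;sk_wmem_alloc))
                        __sk_free(sk);
        }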

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE

We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.

skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: make tcp_sendmsg() aware of socket backlog</title>
<updated>2016-05-02T21:02:26+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-04-29T21:16:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d41a69f1d390fa3f2546498103cdcd78b30676ff'/>
<id>d41a69f1d390fa3f2546498103cdcd78b30676ff</id>
<content type='text'>
Large sendmsg()/write() calls hold the socket lock for the duration
of the call, unless the sk-&gt;sk_sndbuf limit is hit. This is bad
because incoming packets are parked in the socket backlog for a long
time. Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out-of-order queue, with additional
CPU overhead and possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt, since the backlog can
become quite big if the copy from user space triggers I/O (page
faults).

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

The kernel should know better, right?

Add a generic sk_flush_backlog() helper and use it right before a
new skb is allocated. Typically we put 64KB of payload per skb
(unless MSG_EOR is requested), and checking the socket backlog every
64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and a smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership, so it is
not equivalent to a {release_sock(sk); lock_sock(sk);} pair; this
preserves the implicit atomicity rules that sendmsg() was giving to
(possibly buggy) applications.

In this simple implementation, I chose not to call tcp_release_cb(),
but we might consider this later.
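
A sketch of the helper pair, assuming a cheap inline test in front
of an out-of-line slow path:

        /* slow path: drain the backlog while keeping sk owned */
        void __sk_flush_backlog(struct sock *sk)
        {
                spin_lock_bh(&amp;sk-&gt;sk_lock.slock);
                __release_sock(sk);
                spin_unlock_bh(&amp;sk-&gt;sk_lock.slock);
        }

        /* fast path: called from tcp_sendmsg(), roughly once per
         * 64KB skb
         */
        static inline void sk_flush_backlog(struct sock *sk)
        {
                if (unlikely(READ_ONCE(sk-&gt;sk_backlog.tail)))
                        __sk_flush_backlog(sk);
        }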

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Alexei Starovoitov &lt;ast@fb.com&gt;
Cc: Marcelo Ricardo Leitner &lt;marcelo.leitner@gmail.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk-&gt;sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Alexei Starovoitov &lt;ast@fb.com&gt;
Cc: Marcelo Ricardo Leitner &lt;marcelo.leitner@gmail.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: SOCKWQ_ASYNC_WAITDATA optimizations</title>
<updated>2016-04-28T03:08:40+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-04-25T17:39:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4be735225f7cd040ca81c18740e7b672021bafeb'/>
<id>4be735225f7cd040ca81c18740e7b672021bafeb</id>
<content type='text'>
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data() and
equivalent functions, so that sock_wake_async() can send a SIGIO
only when necessary.

Since these atomic operations are not needed unless the socket has
expressed interest in FASYNC, we can omit them in most cases.
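
A sketch of the idea, assuming the check lives in the sk_set_bit()
and sk_clear_bit() wrappers in front of sk-&gt;sk_wq_raw-&gt;flags:

        static inline void sk_set_bit(int nr, struct sock *sk)
        {
                /* skip the atomic op for sockets that never asked
                 * for FASYNC notification
                 */
                if ((nr == SOCKWQ_ASYNC_NOSPACE ||
                     nr == SOCKWQ_ASYNC_WAITDATA) &amp;&amp;
                    !sock_flag(sk, SOCK_FASYNC))
                        return;

                set_bit(nr, &amp;sk-&gt;sk_wq_raw-&gt;flags);
        }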

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data()
and equivalent functions, so that sock_wake_async() can send
a SIGIO only when necessary.

Since these atomic operations are really not needed unless
socket expressed interest in FASYNC, we can omit them in most
cases.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: SOCKWQ_ASYNC_NOSPACE optimizations</title>
<updated>2016-04-28T03:08:40+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-04-25T17:39:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9317bb69824ec8d078b0b786b6971aedb0af3d4f'/>
<id>9317bb69824ec8d078b0b786b6971aedb0af3d4f</id>
<content type='text'>
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async() so that a SIGIO
signal is sent when needed.

tcp_sendmsg() clears the bit.
tcp_poll() sets the bit when the stream is not writable.

We can avoid two atomic operations by first checking whether the
socket is actually interested in the FASYNC business at all (most
sockets in real applications do not use AIO, but
select()/poll()/epoll()).

This also removes one cache line miss when accessing
sk-&gt;sk_wq-&gt;flags in tcp_sendmsg().
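
A sketch of the fast path, assuming the test is added to the
sk_clear_bit() wrapper (sk_set_bit() gets the mirror check):

        static inline void sk_clear_bit(int nr, struct sock *sk)
        {
                /* no FASYNC consumer: skip the atomic op and the
                 * sk-&gt;sk_wq_raw-&gt;flags cache line entirely
                 */
                if (nr == SOCKWQ_ASYNC_NOSPACE &amp;&amp;
                    !sock_flag(sk, SOCK_FASYNC))
                        return;

                clear_bit(nr, &amp;sk-&gt;sk_wq_raw-&gt;flags);
        }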

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
so that a SIGIO signal is sent when needed.

tcp_sendmsg() clears the bit.
tcp_poll() sets the bit when stream is not writeable.

We can avoid two atomic operations by first checking if socket
is actually interested in the FASYNC business (most sockets in
real applications do not use AIO, but select()/poll()/epoll())

This also removes one cache line miss to access sk-&gt;sk_wq-&gt;flags
in tcp_sendmsg()

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>soreuseport: Resolve merge conflict for v4/v6 ordering fix</title>
<updated>2016-04-25T17:27:54+00:00</updated>
<author>
<name>Craig Gallek</name>
<email>kraig@google.com</email>
</author>
<published>2016-04-25T14:42:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d296ba60d8e2de23a350796a567a3aa90fe1cb6e'/>
<id>d296ba60d8e2de23a350796a567a3aa90fe1cb6e</id>
<content type='text'>
d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree.  Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These changes switched the data structure used for TCP and UDP
sockets from hlist_nulls to hlist. This patch applies the necessary
parts of the net tree fix to net-next that were not applied
automatically as part of the merge.
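
A sketch of the re-applied ordering rule on plain hlists; the
predicate is an assumption based on d894ba18d4e4, and only the
head-versus-tail split is the point:

        static inline void sk_add_node_rcu(struct sock *sk,
                                           struct hlist_head *list)
        {
                /* append AF_INET6 reuseport sockets at the tail so
                 * the v4/v6 lookup ordering stays stable
                 */
                if (IS_ENABLED(CONFIG_IPV6) &amp;&amp; sk-&gt;sk_reuseport &amp;&amp;
                    sk-&gt;sk_family == AF_INET6)
                        hlist_add_tail_rcu(&amp;sk-&gt;sk_node, list);
                else
                        hlist_add_head_rcu(&amp;sk-&gt;sk_node, list);
        }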

Fixes: 1602f49b58ab ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek &lt;kraig@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree.  Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist.  This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.

Fixes: 1602f49b58ab ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek &lt;kraig@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sock: relax WARN_ON() in sock_owned_by_user()</title>
<updated>2016-04-25T15:49:53+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2016-04-25T13:34:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5e91f6ce4c584d231763437a3ea3aded8e672363'/>
<id>5e91f6ce4c584d231763437a3ea3aded8e672363</id>
<content type='text'>
Valdis reported tons of stack dumps caused by the WARN_ON() in
sock_owned_by_user().

This test needs to be relaxed if/when lockdep disables itself.

Note that the other lockdep_sock_is_held() callers are all in
rcu_dereference_protected() sections, which are already disabled
if/when lockdep has been disabled.
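
A sketch of the relaxed check, assuming debug_locks is the flag
lockdep clears when it disables itself:

        static inline bool sock_owned_by_user(const struct sock *sk)
        {
        #ifdef CONFIG_LOCKDEP
                /* stay quiet once lockdep has turned itself off */
                WARN_ON_ONCE(!lockdep_sock_is_held(sk) &amp;&amp;
                             debug_locks);
        #endif
                return sk-&gt;sk_lock.owned;
        }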

Fixes: fafc4e1ea1a4 ("sock: tigthen lockdep checks for sock_owned_by_user")
Reported-by: Valdis Kletnieks &lt;Valdis.Kletnieks@vt.edu&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Valdis reported tons of stack dumps caused by WARN_ON() in
sock_owned_by_user()

This test needs to be relaxed if/when lockdep disables itself.

Note that other lockdep_sock_is_held() callers are all from
rcu_dereference_protected() sections which already are disabled
if/when lockdep has been disabled.

Fixes: fafc4e1ea1a4 ("sock: tigthen lockdep checks for sock_owned_by_user")
Reported-by: Valdis Kletnieks &lt;Valdis.Kletnieks@vt.edu&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2016-04-23T22:51:33+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2016-04-23T22:26:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1602f49b58abcb0d34a5f0a29d68e7c1769547aa'/>
<id>1602f49b58abcb0d34a5f0a29d68e7c1769547aa</id>
<content type='text'>
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add an hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
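
A sketch of the added helper, following the hlist_add_head_rcu()
pattern in linux/rculist.h; treat the body as an approximation:

        static inline void hlist_add_tail_rcu(struct hlist_node *n,
                                              struct hlist_head *h)
        {
                struct hlist_node *i, *last = NULL;

                /* walk to the current tail under RCU conventions */
                for (i = hlist_first_rcu(h); i; i = hlist_next_rcu(i))
                        last = i;

                if (last) {
                        n-&gt;next = last-&gt;next;
                        n-&gt;pprev = &amp;last-&gt;next;
                        rcu_assign_pointer(hlist_next_rcu(last), n);
                } else {
                        hlist_add_head_rcu(n, h);
                }
        }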

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
</feed>
