linux-toradex.git/net/ipv4, branch v4.1.26

tcp_cubic: do not set epoch_start in the future

2016-04-25T15:57:09+00:00

[ Upstream commit c2e7204d180f8efc80f27959ca9cf16fa17f67db ]

Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
is normally set at ACK processing time, not at send time.

Doing a proper fix would need to add an additional state variable,
and does not seem worth the trouble, given CUBIC bug has been there
forever before Jana noticed it.

Let's simply not set epoch_start in the future, otherwise
bictcp_update() could overflow and CUBIC would again
grow cwnd too fast.

This was detected thanks to a packetdrill test Neal wrote that was flaky
before applying this fix.

Fixes: 30927520dbae ("tcp_cubic: better follow cubic curve after idle period")
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Cc: Jana Iyengar 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

tcp_cubic: better follow cubic curve after idle period

2016-04-20T05:13:13+00:00

[ Upstream commit 30927520dbae297182990bb21d08762bcc35ce1d ]

Jana Iyengar found an interesting issue on CUBIC :

The epoch is only updated/reset initially and when experiencing losses.
The delta "t" of now - epoch_start can be arbitrary large after app idle
as well as the bic_target. Consequentially the slope (inverse of
ca->cnt) would be really large, and eventually ca->cnt would be
lower-bounded in the end to 2 to have delayed-ACK slow-start behavior.

This particularly shows up when slow_start_after_idle is disabled
as a dangerous cwnd inflation (1.5 x RTT) after few seconds of idle
time.

Jana initial fix was to reset epoch_start if app limited,
but Neal pointed out it would ask the CUBIC algorithm to recalculate the
curve so that we again start growing steeply upward from where cwnd is
now (as CUBIC does just after a loss). Ideally we'd want the cwnd growth
curve to be the same shape, just shifted later in time by the amount of
the idle period.

Reported-by: Jana Iyengar 
Signed-off-by: Eric Dumazet 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Cc: Stephen Hemminger 
Cc: Sangtae Ha 
Cc: Lawrence Brakmo 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

rtnl: RTM_GETNETCONF: fix wrong return value

2016-03-04T15:25:51+00:00

[ Upstream commit a97eb33ff225f34a8124774b3373fd244f0e83ce ]

An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.

Signed-off-by: Anton Protopopov 
Acked-by: Cong Wang 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

route: check and remove route cache when we get route

2016-03-04T15:25:51+00:00

[ Upstream commit deed49df7390d5239024199e249190328f1651e7 ]

Since the gc of ipv4 route was removed, the route cached would has
no chance to be removed, and even it has been timeout, it still could
be used, cause no code to check it's expires.

Fix this issue by checking  and removing route cache when we get route.

Signed-off-by: Xin Long 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

ipv4: fix memory leaks in ip_cmsg_send() callers

2016-03-04T15:25:50+00:00

[ Upstream commit 919483096bfe75dda338e98d56da91a263746a0a ]

Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.

Callers are responsible for the freeing.

Many thanks to Dmitry for the report and diagnostic.

Reported-by: Dmitry Vyukov 
Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

net:Add sysctl_max_skb_frags

2016-03-04T15:25:49+00:00

[ Upstream commit 5f74f82ea34c0da80ea0b49192bb5ea06e063593 ]

Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.

Signed-off-by: Hans Westgaard Ry 
Reviewed-by: Håkon Bugge 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

tcp: beware of alignments in tcp_get_info()

2016-03-04T15:25:47+00:00

[ Upstream commit ff5d749772018602c47509bdc0093ff72acd82ec ]

With some combinations of user provided flags in netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-bytes
aligned.

It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.

Current iproute2 package does not trigger this particular issue.

Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ecf82e ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

tcp: fix NULL deref in tcp_v4_send_ack()

2016-03-04T15:25:46+00:00

[ Upstream commit e62a123b8ef7c5dc4db2c16383d506860ad21b47 ]

Neal reported crashes with this stack trace :

 RIP: 0010:[] tcp_v4_send_ack+0x41/0x20f
...
 CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0
...
  [] tcp_v4_reqsk_send_ack+0xa5/0xb4
  [] tcp_check_req+0x2ea/0x3e0
  [] tcp_rcv_state_process+0x850/0x2500
  [] tcp_v4_do_rcv+0x141/0x330
  [] sk_backlog_rcv+0x21/0x30
  [] tcp_recvmsg+0x75d/0xf90
  [] inet_recvmsg+0x80/0xa0
  [] sock_aio_read+0xee/0x110
  [] do_sync_read+0x6f/0xa0
  [] SyS_read+0x1e1/0x290
  [] system_call_fastpath+0x16/0x1b

The problem here is the skb we provide to tcp_v4_send_ack() had to
be parked in the backlog of a new TCP fastopen child because this child
was owned by the user at the time an out of window packet arrived.

Before queuing a packet, TCP has to set skb->dev to NULL as the device
could disappear before packet is removed from the queue.

Fix this issue by using the net pointer provided by the socket (being a
timewait or a request socket).

IPv6 is immune to the bug : tcp_v6_send_response() already gets the net
pointer from the socket if provided.

Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Reported-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
Cc: Jerry Chu 
Cc: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

xfrm: dst_entries_init() per-net dst_ops

2016-01-31T19:23:38+00:00

[ Upstream commit a8a572a6b5f2a79280d6e302cb3c1cb1fbaeb3e8 ]

Remove the dst_entries_init/destroy calls for xfrm4 and xfrm6 dst_ops
templates; their dst_entries counters will never be used.  Move the
xfrm dst_ops initialization from the common xfrm/xfrm_policy.c to
xfrm4/xfrm4_policy.c and xfrm6/xfrm6_policy.c, and call dst_entries_init
and dst_entries_destroy for each net namespace.

The ipv4 and ipv6 xfrms each create dst_ops template, and perform
dst_entries_init on the templates.  The template values are copied to each
net namespace's xfrm.xfrm*_dst_ops.  The problem there is the dst_ops
pcpuc_entries field is a percpu counter and cannot be used correctly by
simply copying it to another object.

The result of this is a very subtle bug; changes to the dst entries
counter from one net namespace may sometimes get applied to a different
net namespace dst entries counter.  This is because of how the percpu
counter works; it has a main count field as well as a pointer to the
percpu variables.  Each net namespace maintains its own main count
variable, but all point to one set of percpu variables.  When any net
namespace happens to change one of the percpu variables to outside its
small batch range, its count is moved to the net namespace's main count
variable.  So with multiple net namespaces operating concurrently, the
dst_ops entries counter can stray from the actual value that it should
be; if counts are consistently moved from one net namespace to another
(which my testing showed is likely), then one net namespace winds up
with a negative dst_ops count while another winds up with a continually
increasing count, eventually reaching its gc_thresh limit, which causes
all new traffic on the net namespace to fail with -ENOBUFS.

Signed-off-by: Dan Streetman 
Signed-off-by: Dan Streetman 
Signed-off-by: Steffen Klassert 
Signed-off-by: Greg Kroah-Hartman

tcp/dccp: fix timewait races in timer handling

2016-01-31T19:23:37+00:00

[ Upstream commit ed2e923945892a8372ab70d2f61d364b0b6d9054 ]

When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.

As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.

This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.

Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.

To make things more readable I introduced inet_twsk_reschedule() helper.

When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.

Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.

A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.

Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet 
Reported-by: Ying Cai 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman