linux-toradex.git/net/ipv4, branch v3.4.8

tcp: perform DMA to userspace only if there is a task waiting for it

2012-08-09T15:31:51+00:00

[ Upstream commit 59ea33a68a9083ac98515e4861c00e71efdc49a1 ]

Back in 2006, commit 1a2449a87b ("[I/OAT]: TCP recv offload to I/OAT")
added support for receive offloading to IOAT dma engine if available.

The code in tcp_rcv_established() tries to perform early DMA copy if
applicable. It however does so without checking whether the userspace
task is actually expecting the data in the buffer.

This is not a problem under normal circumstances, but there is a corner
case where this doesn't work -- and that's when MSG_TRUNC flag to
recvmsg() is used.

If the IOAT dma engine is not used, the code properly checks whether
there is a valid ucopy.task and the socket is owned by userspace, but
misses the check in the dmaengine case.

This problem is trivial to observe in practice -- for example 'tbench' is a
good reproducer, as it makes heavy use of MSG_TRUNC. On systems utilizing
IOAT, you will soon find tbench waiting indefinitely in sk_wait_data() for
data that have already been early-copied in tcp_rcv_established() using the
dma engine.

This patch introduces the same check we are performing in the simple
iovec copy case to the IOAT case as well. It fixes the indefinite
recvmsg(MSG_TRUNC) hangs.
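
The condition described above can be modelled in userspace as a small
predicate. This is only a sketch: struct model_sock, its fields and
may_early_copy() are hypothetical stand-ins for the kernel's ucopy.task
and socket-ownership state, not the code touched by the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the receive-side state; not kernel code. */
struct model_sock {
	const void *ucopy_task;	/* task waiting for data, or NULL if none */
	bool owned_by_user;	/* socket is currently owned by userspace */
};

/*
 * An early copy to userspace (plain iovec copy or DMA) is allowed only
 * when the current task is the one waiting for the data and the socket
 * is owned by userspace.
 */
static bool may_early_copy(const struct model_sock *sk,
			   const void *current_task)
{
	return sk->ucopy_task != NULL &&
	       sk->ucopy_task == current_task &&
	       sk->owned_by_user;
}
```

In this model the same predicate guards both copy paths, which is the
symmetry the commit message asks for.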

Signed-off-by: Jiri Kosina 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: Add TCP_USER_TIMEOUT negative value check

2012-08-09T15:31:51+00:00

[ Upstream commit 42493570100b91ef663c4c6f0c0fdab238f9d3c2 ]

TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int. But
patch "tcp: Add TCP_USER_TIMEOUT socket option" (dca43c75) didn't check for
negative values. If a user assigns -1 to it, the option is set successfully
and the socket waits for 4294967295 milliseconds. This patch adds a negative
value check to avoid this issue.
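
The wrap-around and the kind of check that prevents it can be illustrated
with a userspace sketch. The helper names unchecked_timeout() and
set_user_timeout() are hypothetical, and the sketch assumes a 32-bit
unsigned int; it is not the kernel's setsockopt handler.

```c
#include <errno.h>

/* Without a check, a negative int silently becomes a huge unsigned value. */
static unsigned int unchecked_timeout(int val)
{
	return (unsigned int)val;
}

/* Hypothetical model of a checked setter: reject negative values. */
static int set_user_timeout(int val, unsigned int *timeout_ms)
{
	if (val < 0)
		return -EINVAL;	/* refuse instead of wrapping around */
	*timeout_ms = (unsigned int)val;
	return 0;
}
```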

Signed-off-by: Hangbin Liu 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

cipso: don't follow a NULL pointer when setsockopt() is called

2012-08-09T15:31:42+00:00

[ Upstream commit 89d7ae34cdda4195809a5a987f697a517a2a3177 ]

As reported by Alan Cox, and verified by Lin Ming, when a user
attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL
tag the kernel dies a terrible death when it attempts to follow a NULL
pointer (the skb argument to cipso_v4_validate() is NULL when called via
the setsockopt() syscall).

This patch fixes this by first checking to ensure that the skb is
non-NULL before using it to find the incoming network interface.  In
the unlikely case where the skb is NULL and the user attempts to add
a CIPSO option with the _TAG_LOCAL tag we return an error as this is
not something we want to allow.
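
The shape of such a check can be sketched in userspace as follows. The
types model_dev and model_skb and the function validate_local_tag() are
hypothetical, the loopback test is an assumption about what the incoming
interface is used for, and the real cipso_v4_validate() reports errors
differently; only the NULL test before the dereference is taken from the
description above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's net_device and sk_buff. */
struct model_dev { bool is_loopback; };
struct model_skb { const struct model_dev *dev; };

/*
 * Validate a local tag. skb is NULL when the option is supplied through
 * setsockopt() instead of arriving in a received packet.
 */
static int validate_local_tag(const struct model_skb *skb)
{
	if (skb == NULL)
		return -EINVAL;	/* never dereference a missing skb */
	if (!skb->dev->is_loopback)
		return -EINVAL;	/* assumed: tag only valid on loopback */
	return 0;
}
```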

A simple reproducer, kindly supplied by Lin Ming, although you must
have the CIPSO DOI #3 configured on the system first or you will be
caught early in cipso_v4_validate():

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <linux/ip.h>
	#include <linux/in.h>
	#include <string.h>

	struct local_tag {
		char type;
		char length;
		char info[4];
	};

	struct cipso {
		char type;
		char length;
		char doi[4];
		struct local_tag local;
	};

	int main(int argc, char **argv)
	{
		int sockfd;
		struct cipso cipso = {
			.type = IPOPT_CIPSO,
			.length = sizeof(struct cipso),
			.local = {
				.type = 128,
				.length = sizeof(struct local_tag),
			},
		};

		memset(cipso.doi, 0, 4);
		cipso.doi[3] = 3;

		sockfd = socket(AF_INET, SOCK_DGRAM, 0);
		#define SOL_IP 0
		setsockopt(sockfd, SOL_IP, IP_OPTIONS,
			&cipso, sizeof(struct cipso));

		return 0;
	}

CC: Lin Ming 
Reported-by: Alan Cox 
Signed-off-by: Paul Moore 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

inetpeer: fix a race in inetpeer_gc_worker()

2012-07-16T16:03:45+00:00

[ Upstream commit 55432d2b543a4b6dfae54f5c432a566877a85d90 ]

commit 5faa5df1fa2024 (inetpeer: Invalidate the inetpeer tree along with
the routing cache) added a race :

Before freeing an inetpeer, we must respect a RCU grace period, and make
sure no user will attempt to increase refcnt.

inetpeer_invalidate_tree() waits for an RCU grace period before inserting
the inetpeer tree into gc_list and waking the worker. At that time, no
concurrent lookup can find an inetpeer in this tree.

Signed-off-by: Eric Dumazet 
Cc: Steffen Klassert 
Acked-by: Steffen Klassert 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

xfrm: take net hdr len into account for esp payload size calculation

2012-06-09T15:36:15+00:00

[ Upstream commit 91657eafb64b4cb53ec3a2fbc4afc3497f735788 ]

Corrects the function that determines the esp payload size. The calculations
done in esp{4,6}_get_mtu() lead to overlength frames in transport mode for
certain mtu values and suboptimal frames for others.

According to what is done, mainly in esp{,6}_output() and tcp_mtu_to_mss(),
net_header_len must be taken into account before doing the alignment
calculation.

Signed-off-by: Benjamin Poirier 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

2012-06-09T15:36:14+00:00

[ Upstream commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1 ]

We hit a kernel OOPS.

<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209]  [] ? pgtable_bad+0x130/0x130
<4>[23900.000416]  [] __might_sleep+0x10a/0x110
<4>[23900.007357]  [] do_page_fault+0xd1/0x3c0
<4>[23900.013764]  [] ? restore_all+0xf/0xf
<4>[23900.024024]  [] ? napi_complete+0x8b/0x690
<4>[23900.029297]  [] ? pgtable_bad+0x130/0x130
<4>[23900.123739]  [] ? pgtable_bad+0x130/0x130
<4>[23900.128955]  [] error_code+0x5f/0x64
<4>[23900.133466]  [] ? pgtable_bad+0x130/0x130
<4>[23900.138450]  [] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312]  [] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730]  [] ip_route_output_flow+0x1f/0x60
<4>[23900.156261]  [] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960]  [] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834]  [] inet_dgram_connect+0x36/0x80
<4>[23900.173224]  [] ? _copy_from_user+0x48/0x140
<4>[23900.178817]  [] sys_connect+0x9a/0xd0
<4>[23900.183538]  [] ? alloc_file+0xdc/0x240
<4>[23900.189111]  [] ? sub_preempt_count+0x3d/0x50

Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing
fi. Another cpu might still be accessing fi. Fix it by delaying the release.

With the patch, we ran MTBF testing on Android mobile for 12 hours
and didn't trigger the issue.

Thanks to Eric for the very detailed review and checking of the issue.

Signed-off-by: Yanmin Zhang 
Signed-off-by: Kun Jiang 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: do_tcp_sendpages() must try to push data out on oom conditions

2012-05-17T22:31:43+00:00

Since recent changes on TCP splicing (starting with commits 2f533844
"tcp: allow splice() to build full TSO packets" and 35f9c09f "tcp:
tcp_sendpages() should call tcp_push() once"), I started seeing
massive stalls when forwarding traffic between two sockets using
splice() when pipe buffers were larger than socket buffers.

Latest changes (net: netdev_alloc_skb() use build_skb()) made the
problem even more apparent.

The reason seems to be that if do_tcp_sendpages() fails on out of memory
condition without being able to send at least one byte, tcp_push() is not
called and the buffers cannot be flushed.

After applying the attached patch, I cannot reproduce the stalls at all
and the data rate is perfectly stable and steady under any condition
which previously caused the problem to be permanent.

The issue seems to have been there since before the kernel migrated to
git, which makes me think that the stalls I occasionally experienced
with tux during stress-tests years ago were probably related to the
same issue.

This issue was first encountered on 3.0.31 and 3.2.17, so please backport
to -stable.

Signed-off-by: Willy Tarreau 
Acked-by: Eric Dumazet 
Cc:

ipv4: Do not use dead fib_info entries.

2012-05-11T02:16:32+00:00

Due to RCU lookups and RCU based release, fib_info objects can
be found during lookup which have fi->fib_dead set.

We must ignore these entries, otherwise we risk dereferencing
the parts of the entry which are being torn down.
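
A lookup that ignores dead entries can be sketched like this in
userspace. struct model_fib_info and lookup_live() are hypothetical; the
real lookup walks RCU-protected kernel structures, while this model uses
a plain array and only shows the skip.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an entry that may be marked dead before release. */
struct model_fib_info {
	int id;
	bool dead;	/* set once teardown of the entry has started */
};

/* Return the first live entry matching id, or NULL if there is none. */
static const struct model_fib_info *
lookup_live(const struct model_fib_info *tbl, size_t n, int id)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].dead)
			continue;	/* being torn down: do not use it */
		if (tbl[i].id == id)
			return &tbl[i];
	}
	return NULL;
}
```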

Reported-by: Yevgen Pronenko 
Signed-off-by: David S. Miller

tcp: change tcp_adv_win_scale and tcp_rmem[2]

2012-05-03T01:08:58+00:00

tcp_adv_win_scale default value is 2, meaning we expect a good citizen
skb to have skb->len / skb->truesize ratio of 75% (3/4)

In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
So these skbs were considered as not bloated.

With recent truesize fixes, a typical MSS=1460 frame truesize is now the
more precise :
2048 + 256 = 2304. But 2304 * 3/4 = 1728.
So these skbs are not good citizens anymore, because 1460 < 1728

(GRO can escape this problem because it builds skbs with a too low
truesize.)

This also means tcp advertises a too optimistic window for a given
allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
especially when application is slow to drain its receive queue or in
case of losses (netperf is fast, scp is slow). This is a major latency
source.

We should adjust the len/truesize ratio to 50% instead of 75%

This patch :

1) changes tcp_adv_win_scale default to 1 instead of 2

2) increases tcp_rmem[2] limit from 4MB to 6MB to take into account
better truesize tracking and to allow autotuning tcp receive window to
reach the same value as before. Note that the same amount of kernel memory
is consumed compared to 2.6 kernels.
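
The arithmetic above can be checked with a small userspace sketch. The
helper win_from_space() is modelled on how the kernel turns buffer space
into window for a positive or non-positive scale; is_good_citizen() is a
hypothetical name for the len versus scaled-truesize comparison used in
this message.

```c
#include <stdbool.h>

/* Window obtained from 'space' bytes of buffer for a given scale. */
static int win_from_space(int space, int scale)
{
	return scale <= 0 ? space >> (-scale)
			  : space - (space >> scale);
}

/* Hypothetical helper: an skb is "not bloated" if len covers the window. */
static bool is_good_citizen(int len, int truesize, int scale)
{
	return len >= win_from_space(truesize, scale);
}
```

With scale 2 this gives 1392 for a truesize of 1856 and 1728 for 2304;
with scale 1 a truesize of 2304 gives 1152, so an MSS=1460 frame passes
the comparison again.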

Signed-off-by: Eric Dumazet 
Cc: Neal Cardwell 
Cc: Tom Herbert 
Cc: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller

tcp: fix infinite cwnd in tcp_complete_cwr()

2012-04-30T17:44:39+00:00

When the cwnd reduction is done, ssthresh may be infinite
if TCP enters CWR via ECN or F-RTO. If cwnd is not undone, i.e.,
undo_marker is set, tcp_complete_cwr() wrongly sets cwnd to the
infinite ssthresh value. The correct operation is to keep cwnd
intact because it has already been updated in ECN or F-RTO.
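
One way to keep cwnd intact when ssthresh is infinite is to clamp cwnd
to ssthresh instead of assigning it. The sketch below is a userspace
model under that assumption; cwnd_after_cwr() and
MODEL_INFINITE_SSTHRESH are hypothetical names and the exact kernel
change may differ.

```c
#include <stdint.h>

/* Assumed to match the kernel's notion of an "infinite" ssthresh. */
#define MODEL_INFINITE_SSTHRESH 0x7fffffffU

/* Clamp cwnd to ssthresh; never raise it to an infinite ssthresh. */
static uint32_t cwnd_after_cwr(uint32_t cwnd, uint32_t ssthresh)
{
	return cwnd < ssthresh ? cwnd : ssthresh;
}
```

A plain assignment would turn a cwnd of 10 into 0x7fffffff here, whereas
the clamp leaves it at 10.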

Signed-off-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller