Age | Commit message (Collapse) | Author |
|
The default initial rwnd is hardcoded to 10.
Now we allow it to be controlled via
/proc/sys/net/ipv4/tcp_default_init_rwnd
which limits the values from 3 to 100
This is somewhat needed because ipv6 routes are
autoconfigured by the kernel.
See "An Argument for Increasing TCP's Initial Congestion Window"
in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf
Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54
Signed-off-by: JP Abgrall <jpa@google.com>
Conflicts:
net/ipv4/sysctl_net_ipv4.c
net/ipv4/tcp_input.c
|
|
A plain read() on a socket does set msg->msg_name to NULL. So check for
NULL pointer first.
[Backport of net-next cf970c002d270c36202bd5b9c2804d3097a52da0]
Bug: 12780426
Change-Id: I3df76aca2fa56478b9a33c404f7b1f0940475ef7
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
|
|
[net-next commit c26d6b46da3ee86fa8a864347331e5513ca84c2b]
If we don't need scope id, we should initialize it to zero.
Same for ->sin6_flowinfo.
Bug: 12800827
Change-Id: Ic19792cee3f5dc30237562cf48e6bdf49817c96e
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
[net-next commit 6d0bfe22611602f36617bc7aa2ffa1bbb2f54c67]
This adds the ability to send ICMPv6 echo requests without a
raw socket. The equivalent ability for ICMPv4 was added in
2011.
Instead of having separate code paths for IPv4 and IPv6, make
most of the code in net/ipv4/ping.c dual-stack and only add a
few IPv6-specific bits (like the protocol definition) to a new
net/ipv6/ping.c. Hopefully this will reduce divergence and/or
duplication of bugs in the future.
Caveats:
- Setting options via ancillary data (e.g., using IPV6_PKTINFO
to specify the outgoing interface) is not yet supported.
- There are no separate security settings for IPv4 and IPv6;
everything is controlled by /proc/net/ipv4/ping_group_range.
- The proc interface does not yet display IPv6 ping sockets
properly.
Tested with a patched copy of ping6 and using raw socket calls.
Compiles and works with all of CONFIG_IPV6={n,m,y}.
Bug: 12800827
Change-Id: I718cd9931823873ab44df22e8a66e12d6a0a6eb1
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow the REJECT --reject-with icmp*blabla to also set the matching error
locally on the socket affected by the reject.
This allows the process to see an error almost as if it received it
via ICMP.
It avoids the local process who's ingress packet is rejected to have to
wait for a pseudo-eternity until some timeout kicks in.
Ideally, this should be enabled with a new iptables flag similar to
--reject-with-sock-err
For now it is enabled with CONFIG_IP*_NF_TARGET_REJECT_SKERR option.
Change-Id: I649a4fd5940029ec0b3233e5abb205da6984891e
Signed-off-by: JP Abgrall <jpa@google.com>
|
|
Signed-off-by: Chia-chi Yeh <chiachi@android.com>
|
|
Signed-off-by: Mike Chan <mike@android.com>
|
|
Add a family of knobs to /sys/kernel/ipv4 for controlling the TCP window size:
tcp_wmem_min
tcp_wmem_def
tcp_wmem_max
tcp_rmem_min
tcp_rmem_def
tcp_rmem_max
This six values mirror the sysctl knobs in /proc/sys/net/ipv4/tcp_wmem and
/proc/sys/net/ipv4/tcp_rmem.
Sysfs, unlike sysctl, allows us to set and manage the files' permissions and
owners.
Signed-off-by: Robert Love <rlove@google.com>
|
|
Introduce a new socket ioctl, SIOCKILLADDR, that nukes all sockets
bound to the same local address. This is useful in situations with
dynamic IPs, to kill stuck connections.
Signed-off-by: Brian Swetland <swetland@google.com>
net: fix tcp_v4_nuke_addr
Signed-off-by: Dima Zavin <dima@android.com>
net: ipv4: Fix a spinlock recursion bug in tcp_v4_nuke.
We can't hold the lock while calling to tcp_done(), so we drop
it before calling. We then have to start at the top of the chain again.
Signed-off-by: Dima Zavin <dima@android.com>
net: ipv4: Fix race in tcp_v4_nuke_addr().
To fix a recursive deadlock in 2.6.29, we stopped holding the hash table lock
across tcp_done() calls. This fixed the deadlock, but introduced a race where
the socket could die or change state.
Fix: Before unlocking the hash table, we grab a reference to the socket. We
can then unlock the hash table without risk of the socket going away. We then
lock the socket, which is safe because it is pinned. We can then call
tcp_done() without recursive deadlock and without race. Upon return, we unlock
the socket and then unpin it, killing it.
Change-Id: Idcdae072b48238b01bdbc8823b60310f1976e045
Signed-off-by: Robert Love <rlove@google.com>
Acked-by: Dima Zavin <dima@android.com>
ipv4: disable bottom halves around call to tcp_done().
Signed-off-by: Robert Love <rlove@google.com>
Signed-off-by: Colin Cross <ccross@android.com>
ipv4: Move sk_error_report inside bh_lock_sock in tcp_v4_nuke_addr
When sk_error_report is called, it wakes up the user-space thread, which then
calls tcp_close. When the tcp_close is interrupted by the tcp_v4_nuke_addr
ioctl thread running tcp_done, it leaks 392 bytes and triggers a WARN_ON.
This patch moves the call to sk_error_report inside the bh_lock_sock, which
matches the locking used in tcp_v4_err.
Signed-off-by: Colin Cross <ccross@android.com>
|
|
With CONFIG_ANDROID_PARANOID_NETWORK, require specific uids/gids to instantiate
network sockets.
Signed-off-by: Robert Love <rlove@google.com>
paranoid networking: Use in_egroup_p() to check group membership
The previous group_search() caused trouble for partners with module builds.
in_egroup_p() is also cleaner.
Signed-off-by: Nick Pelly <npelly@google.com>
Fix 2.6.29 build.
Signed-off-by: Arve Hjønnevåg <arve@android.com>
net: Fix compilation of the IPv6 module
Fix compilation of the IPv6 module -- current->euid does not exist anymore,
current_euid() is what needs to be used.
Signed-off-by: Steinar H. Gunderson <sesse@google.com>
|
|
[ Upstream commit a452ce345d63ddf92cd101e4196569f8718ad319 ]
I see a memory leak when using a transparent HTTP proxy using TPROXY
together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):
unreferenced object 0xffff88008cba4a40 (size 1696):
comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
hex dump (first 32 bytes):
0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00 .. j@..7..2.....
02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
[<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
[<ffffffff812702cf>] sk_clone_lock+0x14/0x283
[<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
[<ffffffff8129a893>] netlink_broadcast+0x14/0x16
[<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
[<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
[<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
[<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
[<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
[<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
[<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
[<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
[<ffffffff8127aa77>] process_backlog+0xee/0x1c5
[<ffffffff8127c949>] net_rx_action+0xa7/0x200
[<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157
But there are many more, resulting in the machine going OOM after some
days.
From looking at the TPROXY code, and with help from Florian, I see
that the memory leak is introduced in tcp_v4_early_demux():
void tcp_v4_early_demux(struct sk_buff *skb)
{
/* ... */
iph = ip_hdr(skb);
th = tcp_hdr(skb);
if (th->doff < sizeof(struct tcphdr) / 4)
return;
sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
iph->saddr, th->source,
iph->daddr, ntohs(th->dest),
skb->skb_iif);
if (sk) {
skb->sk = sk;
where the socket is assigned unconditionally to skb->sk, also bumping
the refcnt on it. This is problematic, because in our case the skb
has already a socket assigned in the TPROXY target. This then results
in the leak I see.
The very same issue seems to be with IPv6, but haven't tested.
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit a0065f266a9b5d51575535a25c15ccbeed9a9966 ]
The two commits 0115e8e30d (net: remove delay at device dismantle) and
748e2d9396a (net: reinstate rtnl in call_netdevice_notifiers()) silently
removed a NULL pointer check for in_dev since Linux 3.7.
This patch re-introduces this check as it causes crashing the kernel when
setting small mtu values on non-ip capable netdevices.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 11c21a307d79ea5f6b6fc0d3dfdeda271e5e65f6 ]
commit a622260254ee48("ip_tunnel: fix kernel panic with icmp_dest_unreach")
clear IPCB in ip_tunnel_xmit() , or else skb->cb[] may contain garbage from
GSO segmentation layer.
But commit 0e6fbc5b6c621("ip_tunnels: extend iptunnel_xmit()") refactor codes,
and it clear IPCB behind the dst_link_failure().
So clear IPCB in ip_tunnel_xmit() just like commti a622260254ee48("ip_tunnel:
fix kernel panic with icmp_dest_unreach").
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 77f99ad16a07aa062c2d30fae57b1fee456f6ef6 ]
Because the tcp-metrics is an RCU-list, it may be that two
soft-interrupts are inside __tcp_get_metrics() for the same
destination-IP at the same time. If this destination-IP is not yet part of
the tcp-metrics, both soft-interrupts will end up in tcpm_new and create
a new entry for this IP.
So, we will have two tcp-metrics with the same destination-IP in the list.
This patch checks twice __tcp_get_metrics(). First without holding the
lock, then while holding the lock. The second one is there to confirm
that the entry has not been added by another soft-irq while waiting for
the spin-lock.
Fixes: 51c5d0c4b169b (tcp: Maintain dynamic metrics in local cache.)
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 95f4a45de1a0f172b35451fc52283290adb21f6e ]
Bob Falken reported that after 4G packets, multicast forwarding stopped
working. This was because of a rule reference counter overflow which
freed the rule as soon as the overflow happend.
This patch solves this by adding the FIB_LOOKUP_NOREF flag to
fib_rules_lookup calls. This is safe even from non-rcu locked sections
as in this case the flag only implies not taking a reference to the rule,
which we don't need at all.
Rules only hold references to the namespace, which are guaranteed to be
available during the call of the non-rcu protected function reg_vif_xmit
because of the interface reference which itself holds a reference to
the net namespace.
Fixes: f0ad0860d01e47 ("ipv4: ipmr: support multiple tables")
Fixes: d1db275dd3f6e4 ("ipv6: ip6mr: support multiple tables")
Reported-by: Bob Falken <NetFestivalHaveFun@gmx.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Based upon upstream commit 70315d22d3c7383f9a508d0aab21e2eb35b2303a ]
Fix inet_diag_dump_icsk() to reflect the fact that both TIME_WAIT and
FIN_WAIT2 connections are represented by inet_timewait_sock (not just
TIME_WAIT). Thus:
(a) We need to iterate through the time_wait buckets if the user wants
either TIME_WAIT or FIN_WAIT2. (Before fixing this, "ss -nemoi state
fin-wait-2" would not return any sockets, even if there were some in
FIN_WAIT2.)
(b) We need to check tw_substate to see if the user wants to dump
sockets in the particular substate (TIME_WAIT or FIN_WAIT2) that a
given connection is in. (Before fixing this, "ss -nemoi state
time-wait" would actually return sockets in state FIN_WAIT2.)
An analogous fix is in v3.13: 70315d22d3c7383f9a508d0aab21e2eb35b2303a
("inet_diag: fix inet_diag_dump_icsk() to use correct state for
timewait sockets") but that patch is quite different because 3.13 code
is very different in this area due to the unification of TCP hash
tables in 05dbc7b ("tcp/dccp: remove twchain") in v3.13-rc1.
I tested that this applies cleanly between v3.3 and v3.12, and tested
that it works in both 3.3 and 3.12. It does not apply cleanly to 3.2
and earlier (though it makes semantic sense), and semantically is not
the right fix for 3.13 and beyond (as mentioned above).
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b1aac815c0891fe4a55a6b0b715910142227700f ]
Jakub reported while working with nlmon netlink sniffer that parts of
the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6.
That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3].
In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab]
memory through this. At least, in udp_dump_one(), we allocate a skb in ...
rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL);
... and then pass that to inet_sk_diag_fill() that puts the whole struct
inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0],
r->id.idiag_dst[0] and leave the rest untouched:
r->id.idiag_src[0] = inet->inet_rcv_saddr;
r->id.idiag_dst[0] = inet->inet_daddr;
struct inet_diag_msg embeds struct inet_diag_sockid that is correctly /
fully filled out in IPv6 case, but for IPv4 not.
So just zero them out by using plain memset (for this little amount of
bytes it's probably not worth the extra check for idiag_family == AF_INET).
Similarly, fix also other places where we fill that out.
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0e3da5bb8da45890b1dc413404e0f978ab71173e ]
ipgre_header_parse() needs to parse the tunnel's ip header and it
uses mac_header to locate the iphdr. This got broken when gre tunneling
was refactored as mac_header is no longer updated to point to iphdr.
Introduce skb_pop_mac_header() helper to do the mac_header assignment
and use it in ipgre_rcv() to fix msg_name parsing.
Bug introduced in commit c54419321455 (GRE: Refactor GRE tunneling code.)
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
It turns out that commit: d3f7d56a7a4671d395e8af87071068a195257bf6 was
applied to the tree twice, which didn't hurt anything, but it's good to
fix this up.
Reported-by: Veaceslav Falico <veaceslav@falico.eu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Shawn Landden <shawnlandden@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d3f7d56a7a4671d395e8af87071068a195257bf6 upstream.
Commit 35f9c09fe (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag MSG_SENDPAGE_NOTLAST, similar to
MSG_MORE.
algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages()
and need to see the new flag as identical to MSG_MORE.
This fixes sendfile() on AF_ALG.
v3: also fix udp
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Reported-and-tested-by: Shawn Landden <shawnlandden@gmail.com>
Original-patch: Richard Weinberger <richard@nod.at>
Signed-off-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0d08c42cf9a71530fef5ebcfe368f38f2dd0476f ]
commit 6ff50cd55545 ("tcp: gso: do not generate out of order packets")
had an heuristic that can trigger a warning in skb_try_coalesce(),
because skb->truesize of the gso segments were exactly set to mss.
This breaks the requirement that
skb->truesize >= skb->len + truesizeof(struct sk_buff);
It can trivially be reproduced by :
ifconfig lo mtu 1500
ethtool -K lo tso off
netperf
As the skbs are looped into the TCP networking stack, skb_try_coalesce()
warns us of these skb under-estimating their truesize.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f1d8cba61c3c4b1eb88e507249c4cb8d635d9a76 ]
In commit c9e9042994d3 ("ipv4: fix possible seqlock deadlock") I left
another places where IP_INC_STATS_BH() were improperly used.
udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from
process context, not from softirq context.
This was detected by lockdep seqlock support.
Reported-by: jongman heo <jongman.heo@samsung.com>
Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d3f7d56a7a4671d395e8af87071068a195257bf6 ]
Commit 35f9c09fe (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag MSG_SENDPAGE_NOTLAST, similar to
MSG_MORE.
algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages()
and need to see the new flag as identical to MSG_MORE.
This fixes sendfile() on AF_ALG.
v3: also fix udp
Reported-and-tested-by: Shawn Landden <shawnlandden@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Original-patch: Richard Weinberger <richard@nod.at>
Signed-off-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
CPUs can ask for local route via ip_route_input_noref() concurrently.
if nh_rth_input is not cached yet, CPUs will proceed to allocate
equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
via rt_cache_route()
Most of the time they succeed, but on occasion the following two lines:
orig = *p;
prev = cmpxchg(p, orig, rt);
in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
so dst is leaking. dst_destroy() is never called and 'lo' device
refcnt doesn't go to zero, which can be seen in the logs as:
unregister_netdevice: waiting for lo to become free. Usage count = 1
Adding mdelay() between above two lines makes it easily reproducible.
Fix it similar to nh_pcpu_rth_output case.
Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit dbde497966804e63a38fdedc1e3815e77097efc2 ]
snd_nxt must be updated synchronously with sk_send_head. Otherwise
tp->packets_out may be updated incorrectly, what may bring a kernel panic.
Here is a kernel panic from my host.
[ 103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[ 103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150
...
[ 146.301158] Call Trace:
[ 146.301158] [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0
Before this panic a tcp socket was restored. This socket had sent and
unsent data in the write queue. Sent data was restored in repair mode,
then the socket was switched from reapair mode and unsent data was
restored. After that the socket was switched back into repair mode.
In that moment we had a socket where write queue looks like this:
snd_una snd_nxt write_seq
|_________|________|
|
sk_send_head
After a second switching from repair mode the state of socket was
changed:
snd_una snd_nxt, write_seq
|_________ ________|
|
sk_send_head
This state is inconsistent, because snd_nxt and sk_send_head are not
synchronized.
Bellow you can find a call trace, how packets_out can be incremented
twice for one skb, if snd_nxt and sk_send_head are not synchronized.
In this case packets_out will be always positive, even when
sk_write_queue is empty.
tcp_write_wakeup
skb = tcp_send_head(sk);
tcp_fragment
if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq))
tcp_adjust_pcount(sk, skb, diff);
tcp_event_new_data_sent
tp->packets_out += tcp_skb_pcount(skb);
I think update of snd_nxt isn't required, when a socket is switched from
repair mode. Because it's initialized in tcp_connect_init. Then when a
write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent,
so it's always is in consistent state.
I have checked, that the bug is not reproduced with this patch and
all tests about restoring tcp connections work fine.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 236c9f84868534c718b6889aa624de64763281f9 ]
After searching rt by the vti tunnel dst/src parameter,
if this rt has neither attached to any transformation
nor the transformation is not tunnel oriented, this rt
should be released back to ip layer.
otherwise causing dst memory leakage.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
functions
[ Upstream commit 85fbaa75037d0b6b786ff18658ddf0b4014ce2a4 ]
Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.
As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding sockaddr
length.
This broke traceroute and such.
Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
Reported-by: Brad Spengler <spender@grsecurity.net>
Reported-by: Tom Labanowski
Cc: mpb <mpb.mail@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit bceaa90240b6019ed73b49965eac7d167610be69 ]
Only update *addr_len when we actually fill in sockaddr, otherwise we
can return uninitialized memory from the stack to the caller in the
recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL)
checks because we only get called with a valid addr_len pointer either
from sock_common_recvmsg or inet_recvmsg.
If a blocking read waits on a socket which is concurrently shut down we
now return zero and set msg_msgnamelen to 0.
Reported-by: mpb <mpb.mail@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c9e9042994d37cbc1ee538c500e9da1bb9d1bcdf ]
ip4_datagram_connect() being called from process context,
it should use IP_INC_STATS() instead of IP_INC_STATS_BH()
otherwise we can deadlock on 32bit arches, or get corruptions of
SNMP counters.
Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit dccf76ca6b626c0c4a4e09bb221adee3270ab0ef ]
We had some reports of crashes using TCP fastopen, and Dave Jones
gave a nice stack trace pointing to the error.
Issue is that tcp_get_metrics() should not be called with a NULL dst
Fixes: 1fe4c481ba637 ("net-tcp: Fast Open client - cookie cache")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 98e09386c0ef4dfd48af7ba60ff908f0d525cdee ]
After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several
users reported throughput regressions, notably on mvneta and wifi
adapters.
802.11 AMPDU requires a fair amount of queueing to be effective.
This patch partially reverts the change done in tcp_write_xmit()
so that the minimal amount is sysctl_tcp_limit_output_bytes.
It also remove the use of this sysctl while building skb stored
in write queue, as TSO autosizing does the right thing anyway.
Users with well behaving NICS and correct qdisc (like sch_fq),
can then lower the default sysctl_tcp_limit_output_bytes value from
128KB to 8KB.
This new usage of sysctl_tcp_limit_output_bytes permits each driver
authors to check how their driver performs when/if the value is set
to a minimum of 4KB.
Normally, line rate for a single TCP flow should be possible,
but some drivers rely on timers to perform TX completion and
too long TX completion delays prevent reaching full throughput.
Fixes: c9eeec26e32e ("tcp: TSQ can use a dynamic limit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sujith Manoharan <sujith@msujith.org>
Reported-by: Arnaud Ebalard <arno@natisbad.org>
Tested-by: Sujith Manoharan <sujith@msujith.org>
Cc: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ No applicable upstream commit, the upstream implementation is
now completely different and doesn't have this bug. ]
In case of WCCPv2 GRE header has extra four bytes. Following
patch pull those extra four bytes so that skb offsets are set
correctly.
CC: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Peter Schmitt <peter.schmitt82@yahoo.de>
Tested-by: Peter Schmitt <peter.schmitt82@yahoo.de>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ This is a simplified -stable version of a set of upstream commits. ]
This is a replacement patch only for stable which does fix the problems
handled by the following two commits in -net:
"ip_output: do skb ufo init for peeked non ufo skb as well" (e93b7d748be887cd7639b113ba7d7ef792a7efb9)
"ip6_output: do skb ufo init for peeked non ufo skb as well" (c547dbf55d5f8cf615ccc0e7265e98db27d3fb8b)
Three frames are written on a corked udp socket for which the output
netdevice has UFO enabled. If the first and third frame are smaller than
the mtu and the second one is bigger, we enqueue the second frame with
skb_append_datato_frags without initializing the gso fields. This leads
to the third frame appended regulary and thus constructing an invalid skb.
This fixes the problem by always using skb_append_datato_frags as soon
as the first frag got enqueued to the skb without marking the packet
as SKB_GSO_UDP.
The problem with only two frames for ipv6 was fixed by "ipv6: udp
packets following an UFO enqueued packet need also be handled by UFO"
(2811ebac2521ceac84f2bdae402455baa6a7fb47).
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7263a5187f9e9de45fcb51349cf0e031142c19a1 ]
This patch fixes and improves the use of vti interfaces (while
lightly changing the way of configuring them).
Currently:
- it is necessary to identify and mark inbound IPsec
packets destined to each vti interface, via netfilter rules in
the mangle table at prerouting hook.
- the vti module cannot retrieve the right tunnel in input since
commit b9959fd3: vti tunnels all have an i_key, but the tunnel lookup
is done with flag TUNNEL_NO_KEY, so there no chance to retrieve them.
- the i_key is used by the outbound processing as a mark to lookup
for the right SP and SA bundle.
This patch uses the o_key to store the vti mark (instead of i_key) and
enables:
- to avoid the need for previously marking the inbound skbuffs via a
netfilter rule.
- to properly retrieve the right tunnel in input, only based on the IPsec
packet outer addresses.
- to properly perform an inbound policy check (using the tunnel o_key
as a mark).
- to properly perform an outbound SPD and SAD lookup (using the tunnel
o_key as a mark).
- to keep the current mark of the skbuff. The skbuff mark is neither
used nor changed by the vti interface. Only the vti interface o_key
is used.
SAs have a wildcard mark.
SPs have a mark equal to the vti interface o_key.
The vti interface must be created as follows (i_key = 0, o_key = mark):
ip link add vti1 mode vti local 1.1.1.1 remote 2.2.2.2 okey 1
The SPs attached to vti1 must be created as follows (mark = vti1 o_key):
ip xfrm policy add dir out mark 1 tmpl src 1.1.1.1 dst 2.2.2.2 \
proto esp mode tunnel
ip xfrm policy add dir in mark 1 tmpl src 2.2.2.2 dst 1.1.1.1 \
proto esp mode tunnel
The SAs are created with the default wildcard mark. There is no
distinction between global vs. vti SAs. Just their addresses will
possibly link them to a vti interface:
ip xfrm state add src 1.1.1.1 dst 2.2.2.2 proto esp spi 1000 mode tunnel \
enc "cbc(aes)" "azertyuiopqsdfgh"
ip xfrm state add src 2.2.2.2 dst 1.1.1.1 proto esp spi 2000 mode tunnel \
enc "cbc(aes)" "sqbdhgqsdjqjsdfh"
To avoid matching "global" (not vti) SPs in vti interfaces, global SPs
should no use the default wildcard mark, but explicitly match mark 0.
To avoid a double SPD lookup in input and output (in global and vti SPDs),
the NOPOLICY and NOXFRM options should be set on the vti interfaces:
echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_policy
echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_xfrm
The outgoing traffic is steered to vti1 by a route via the vti interface:
ip route add 192.168.0.0/16 dev vti1
The incoming IPsec traffic is steered to vti1 because its outer addresses
match the vti1 tunnel configuration.
Signed-off-by: Christophe Gouault <christophe.gouault@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0a7e22609067ff524fc7bbd45c6951dd08561667 ]
When sending out multicast messages, the source address in inet->mc_addr is
ignored and rewritten by an autoselected one. This is caused by a typo in
commit 813b3b5db831 ("ipv4: Use caller's on-stack flowi as-is in output
route lookups").
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 80ad1d61e72d626e30ebe8529a0455e660ca4693 ]
commit 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU /
hlist_nulls") incorrectly used sock_put() on TIMEWAIT sockets.
We should instead use inet_twsk_put()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 031afe4990a7c9dbff41a3a742c44d3e740ea0a1 ]
On receiving an ACK that covers the loss probe sequence, TLP
immediately sets the congestion state to Open, even though some packets
are not recovered and retransmisssion are on the way. The later ACks
may trigger a WARN_ON check in step D of tcp_fastretrans_alert(), e.g.,
https://bugzilla.redhat.com/show_bug.cgi?id=989251
The fix is to follow the similar procedure in recovery by calling
tcp_try_keep_open(). The sender switches to Open state if no packets
are retransmissted. Otherwise it goes to Disorder and let subsequent
ACKs move the state to Recovery or Open.
Reported-By: Michael Sterrett <michael@sterretts.net>
Tested-By: Dormando <dormando@rydia.net>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5e8a402f831dbe7ee831340a91439e46f0d38acd ]
Yuchung found following problem :
There are bugs in the SACK processing code, merging part in
tcp_shift_skb_data(), that incorrectly resets or ignores the sacked
skbs FIN flag. When a receiver first SACK the FIN sequence, and later
throw away ofo queue (e.g., sack-reneging), the sender will stop
retransmitting the FIN flag, and hangs forever.
Following packetdrill test can be used to reproduce the bug.
$ cat sack-merge-bug.pkt
`sysctl -q net.ipv4.tcp_fack=0`
// Establish a connection and send 10 MSS.
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+.000 bind(3, ..., ...) = 0
+.000 listen(3, 1) = 0
+.050 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.001 < . 1:1(0) ack 1 win 1024
+.000 accept(3, ..., ...) = 4
+.100 write(4, ..., 12000) = 12000
+.000 shutdown(4, SHUT_WR) = 0
+.000 > . 1:10001(10000) ack 1
+.050 < . 1:1(0) ack 2001 win 257
+.000 > FP. 10001:12001(2000) ack 1
+.050 < . 1:1(0) ack 2001 win 257 <sack 10001:11001,nop,nop>
+.050 < . 1:1(0) ack 2001 win 257 <sack 10001:12002,nop,nop>
// SACK reneg
+.050 < . 1:1(0) ack 12001 win 257
+0 %{ print "unacked: ",tcpi_unacked }%
+5 %{ print "" }%
First, a typo inverted left/right of one OR operation, then
code forgot to advance end_seq if the merged skb carried FIN.
Bug was added in 2.6.29 by commit 832d11c5cd076ab
("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c52e2421f7368fd36cbe330d2cf41b10452e39a9 ]
TCP stack should make sure it owns skbs before mangling them.
We had various crashes using bnx2x, and it turned out gso_size
was cleared right before bnx2x driver was populating TC descriptor
of the _previous_ packet send. TCP stack can sometime retransmit
packets that are still in Qdisc.
Of course we could make bnx2x driver more robust (using
ACCESS_ONCE(shinfo->gso_size) for example), but the bug is TCP stack.
We have identified two points where skb_unclone() was needed.
This patch adds a WARN_ON_ONCE() to warn us if we missed another
fix of this kind.
Kudos to Neal for finding the root cause of this bug. Its visible
using small MSS.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c9eeec26e32e087359160406f96e0949b3cc6f10 ]
When TCP Small Queues was added, we used a sysctl to limit amount of
packets queues on Qdisc/device queues for a given TCP flow.
Problem is this limit is either too big for low rates, or too small
for high rates.
Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
auto sizing, it can better control number of packets in Qdisc/device
queues.
New limit is two packets or at least 1 to 2 ms worth of packets.
Low rates flows benefit from this patch by having even smaller
number of packets in queues, allowing for faster recovery,
better RTT estimations.
High rates flows benefit from this patch by allowing more than 2 packets
in flight as we had reports this was a limiting factor to reach line
rate. [ In particular if TX completion is delayed because of coalescing
parameters ]
Example for a single flow on 10Gbp link controlled by FQ/pacing
14 packets in flight instead of 2
$ tc -s -d qd
qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
buckets 1024 quantum 3028 initial_quantum 15140
Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0
requeues 6822476)
rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
Note that sk_pacing_rate is currently set to twice the actual rate, but
this might be refined in the future when a flow is in congestion
avoidance.
Additional change : skb->destructor should be set to tcp_wfree().
A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commits 6d36824e730f247b602c90e8715a792003e3c5a7,
02cf4ebd82ff0ac7254b88e466820a290ed8289a, and parts of
7eec4174ff29cd42f2acfae8112f51c228545d40 ]
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3e08f4a72f689c6296d336c2aab4bddd60c93ae2 ]
We might extend the used aera of a skb beyond the total
headroom when we install the ipip header. Fix this by
calling skb_cow_head() unconditionally.
Bug was introduced with commit c544193214
("GRE: Refactor GRE tunneling code.")
Cc: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e2401654dd0f5f3fb7a8d80dad9554d73d7ca394 ]
It is possible for the timer handlers to run after the call to
ip_mc_down so use in_dev_put instead of __in_dev_put in the handler
function in order to do proper cleanup when the refcnt reaches 0.
Otherwise, the refcnt can reach zero without the in_device being
destroyed and we end up leaking a reference to the net_device and
see messages like the following,
unregister_netdevice: waiting for eth0 to become free. Usage count = 1
Tested on linux-3.4.43.
Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9a3bab6b05383f1e4c3716b3615500c51285959e ]
A host might need net_secret[] and never open a single socket.
Problem added in commit aebda156a570782
("net: defer net_secret[] initialization")
Based on prior patch from Hannes Frederic Sowa.
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@strressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 703133de331a7a7df47f31fb9de51dc6f68a9de8 ]
If local fragmentation is allowed, then ip_select_ident() and
ip_select_ident_more() need to generate unique IDs to ensure
correct defragmentation on the peer.
For example, if IPsec (tunnel mode) has to encrypt large skbs
that have local_df bit set, then all IP fragments that belonged
to different ESP datagrams would have used the same identificator.
If one of these IP fragments would get lost or reordered, then
peer could possibly stitch together wrong IP fragments that did
not belong to the same datagram. This would lead to a packet loss
or data corruption.
Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 749154aa56b57652a282cbde57a57abc278d1205 ]
skb->data already points to IP header, but for the sake of
consistency we can also use ip_hdr() to retrieve it.
Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e2e5c4c07caf810d7849658dca42f598b3938e21 ]
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit eb8895debe1baba41fcb62c78a16f0c63c21662a ]
In commit 90ba9b19 (tcp: tcp_make_synack() can use alloc_skb()), Eric changed
the call to sock_wmalloc in tcp_make_synack to alloc_skb. In doing so,
the netfilter owner match lost its ability to block the SYNACK packet on
outbound listening sockets. Revert the change, restoring the owner match
functionality.
This closes netfilter bugzilla #847.
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c27c9322d015dc1d9dfdf31724fca71c0476c4d1 ]
ipv4: raw_sendmsg: don't use header's destination address
A sendto() regression was bisected and found to start with commit
f8126f1d5136be1 (ipv4: Adjust semantics of rt->rt_gateway.)
The problem is that it tries to ARP-lookup the constructed packet's
destination address rather than the explicitly provided address.
Fix this using FLOWI_FLAG_KNOWN_NH so that given nexthop is used.
cf. commit 2ad5b9e4bd314fc685086b99e90e5de3bc59e26b
Reported-by: Chris Clark <chris.clark@alcatel-lucent.com>
Bisected-by: Chris Clark <chris.clark@alcatel-lucent.com>
Tested-by: Chris Clark <chris.clark@alcatel-lucent.com>
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Chris Clark <chris.clark@alcatel-lucent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e3e12028315749b7fa2edbc37328e5847be9ede9 ]
The zero value means that tsecr is not valid, so it's a special case.
tsoffset is used to customize tcp_time_stamp for one socket.
tsoffset is usually zero, it's used when a socket was moved from one
host to another host.
Currently this issue affects logic of tcp_rcv_rtt_measure_ts. Due to
incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
TCP_RTO_MAX.
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|