<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/net/ipv4/tcp_timer.c, branch v4.1</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>tcp: RFC7413 option support for Fast Open client</title>
<updated>2015-04-07T22:36:39+00:00</updated>
<author>
<name>Daniel Lee</name>
<email>Longinus00@gmail.com</email>
</author>
<published>2015-04-06T21:37:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2646c831c00c5d22aa72b79d24069c1b412cda7c'/>
<id>2646c831c00c5d22aa72b79d24069c1b412cda7c</id>
<content type='text'>
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies.  This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts.  If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.

The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping the interoperability
meanwhile.

Signed-off-by: Daniel Lee &lt;Longinus00@gmail.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies.  This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts.  If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.

The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping the interoperability
meanwhile.

Signed-off-by: Daniel Lee &lt;Longinus00@gmail.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>inet: remove sk_listener parameter from syn_ack_timeout()</title>
<updated>2015-03-23T20:52:25+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-03-22T17:22:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=42cb80a2353f42913ae78074ffa1f1b4a49e5436'/>
<id>42cb80a2353f42913ae78074ffa1f1b4a49e5436</id>
<content type='text'>
It is not needed, and req-&gt;sk_listener points to the listener anyway.
request_sock argument can be const.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
It is not needed, and req-&gt;sk_listener points to the listener anyway.
request_sock argument can be const.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>inet: get rid of central tcp/dccp listener timer</title>
<updated>2015-03-20T16:40:25+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-03-20T02:04:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fa76ce7328b289b6edd476e24eb52fd634261720'/>
<id>fa76ce7328b289b6edd476e24eb52fd634261720</id>
<content type='text'>
One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipv4: Create probe timer for tcp PMTU as per RFC4821</title>
<updated>2015-03-06T19:57:42+00:00</updated>
<author>
<name>Fan Du</name>
<email>fan.du@intel.com</email>
</author>
<published>2015-03-06T03:18:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=05cbc0db03e82128f2e7e353d4194dd24a1627fe'/>
<id>05cbc0db03e82128f2e7e353d4194dd24a1627fe</id>
<content type='text'>
As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.

Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.

Signed-off-by: Fan Du &lt;fan.du@intel.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.

Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.

Signed-off-by: Fan Du &lt;fan.du@intel.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipv4: Namespecify TCP PMTU mechanism</title>
<updated>2015-02-10T02:45:00+00:00</updated>
<author>
<name>Fan Du</name>
<email>fan.du@intel.com</email>
</author>
<published>2015-02-10T01:53:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b0f9ca53cbb103e9240a29a974e0b6085e58f9f7'/>
<id>b0f9ca53cbb103e9240a29a974e0b6085e58f9f7</id>
<content type='text'>
Packetization Layer Path MTU Discovery works separately beside
Path MTU Discovery at IP level, different net namespace has
various requirements on which one to chose, e.g., a virutalized
container instance would require TCP PMTU to probe an usable
effective mtu for underlying tunnel, while the host would
employ classical ICMP based PMTU to function.

Hence making TCP PMTU mechanism per net namespace to decouple
two functionality. Furthermore the probe base MSS should also
be configured separately for each namespace.

Signed-off-by: Fan Du &lt;fan.du@intel.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Packetization Layer Path MTU Discovery works separately beside
Path MTU Discovery at IP level, different net namespace has
various requirements on which one to chose, e.g., a virutalized
container instance would require TCP PMTU to probe an usable
effective mtu for underlying tunnel, while the host would
employ classical ICMP based PMTU to function.

Hence making TCP PMTU mechanism per net namespace to decouple
two functionality. Furthermore the probe base MSS should also
be configured separately for each namespace.

Signed-off-by: Fan Du &lt;fan.du@intel.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited</title>
<updated>2014-11-11T19:10:31+00:00</updated>
<author>
<name>Joe Perches</name>
<email>joe@perches.com</email>
</author>
<published>2014-11-11T18:59:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=ba7a46f16dd29f93303daeb1fee8af316c5a07f4'/>
<id>ba7a46f16dd29f93303daeb1fee8af316c5a07f4</id>
<content type='text'>
Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.

All messages are still ratelimited.

Some KERN_&lt;LEVEL&gt; uses are changed to KERN_DEBUG.

This may have some negative impact on messages that were
emitted at KERN_INFO that are not not enabled at all unless
DEBUG is defined or dynamic_debug is enabled.  Even so,
these messages are now _not_ emitted by default.

This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings".  For backward compatibility,
the sysctl is not removed, but it has no function.  The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.c

Miscellanea:

o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign arguments

Signed-off-by: Joe Perches &lt;joe@perches.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.

All messages are still ratelimited.

Some KERN_&lt;LEVEL&gt; uses are changed to KERN_DEBUG.

This may have some negative impact on messages that were
emitted at KERN_INFO that are not not enabled at all unless
DEBUG is defined or dynamic_debug is enabled.  Even so,
these messages are now _not_ emitted by default.

This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings".  For backward compatibility,
the sysctl is not removed, but it has no function.  The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.c

Miscellanea:

o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign arguments

Signed-off-by: Joe Perches &lt;joe@perches.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: abort orphan sockets stalling on zero window probes</title>
<updated>2014-10-01T20:27:52+00:00</updated>
<author>
<name>Yuchung Cheng</name>
<email>ycheng@google.com</email>
</author>
<published>2014-09-29T20:20:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b248230c34970a6c1c17c591d63b464e8d2cfc33'/>
<id>b248230c34970a6c1c17c591d63b464e8d2cfc33</id>
<content type='text'>
Currently we have two different policies for orphan sockets
that repeatedly stall on zero window ACKs. If a socket gets
a zero window ACK when it is transmitting data, the RTO is
used to probe the window. The socket is aborted after roughly
tcp_orphan_retries() retries (as in tcp_write_timeout()).

But if the socket was idle when it received the zero window ACK,
and later wants to send more data, we use the probe timer to
probe the window. If the receiver always returns zero window ACKs,
icsk_probes keeps getting reset in tcp_ack() and the orphan socket
can stall forever until the system reaches the orphan limit (as
commented in tcp_probe_timer()). This opens up a simple attack
to create lots of hanging orphan sockets to burn the memory
and the CPU, as demonstrated in the recent netdev post "TCP
connection will hang in FIN_WAIT1 after closing if zero window is
advertised." http://www.spinics.net/lists/netdev/msg296539.html

This patch follows the design in RTO-based probe: we abort an orphan
socket stalling on zero window when the probe timer reaches both
the maximum backoff and the maximum RTO. For example, an 100ms RTT
connection will timeout after roughly 153 seconds (0.3 + 0.6 +
.... + 76.8) if the receiver keeps the window shut. If the orphan
socket passes this check, but the system already has too many orphans
(as in tcp_out_of_resources()), we still abort it but we'll also
send an RST packet as the connection may still be active.

In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
sockets stalled on zero-window probes. This changes the semantics
of TCP_USER_TIMEOUT slightly because it previously only applies
when the socket has pending transmission.

Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reported-by: Andrey Dmitrov &lt;andrey.dmitrov@oktetlabs.ru&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently we have two different policies for orphan sockets
that repeatedly stall on zero window ACKs. If a socket gets
a zero window ACK when it is transmitting data, the RTO is
used to probe the window. The socket is aborted after roughly
tcp_orphan_retries() retries (as in tcp_write_timeout()).

But if the socket was idle when it received the zero window ACK,
and later wants to send more data, we use the probe timer to
probe the window. If the receiver always returns zero window ACKs,
icsk_probes keeps getting reset in tcp_ack() and the orphan socket
can stall forever until the system reaches the orphan limit (as
commented in tcp_probe_timer()). This opens up a simple attack
to create lots of hanging orphan sockets to burn the memory
and the CPU, as demonstrated in the recent netdev post "TCP
connection will hang in FIN_WAIT1 after closing if zero window is
advertised." http://www.spinics.net/lists/netdev/msg296539.html

This patch follows the design in RTO-based probe: we abort an orphan
socket stalling on zero window when the probe timer reaches both
the maximum backoff and the maximum RTO. For example, an 100ms RTT
connection will timeout after roughly 153 seconds (0.3 + 0.6 +
.... + 76.8) if the receiver keeps the window shut. If the orphan
socket passes this check, but the system already has too many orphans
(as in tcp_out_of_resources()), we still abort it but we'll also
send an RST packet as the connection may still be active.

In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
sockets stalled on zero-window probes. This changes the semantics
of TCP_USER_TIMEOUT slightly because it previously only applies
when the socket has pending transmission.

Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reported-by: Andrey Dmitrov &lt;andrey.dmitrov@oktetlabs.ru&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: avoid possible arithmetic overflows</title>
<updated>2014-09-22T20:27:10+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2014-09-22T20:19:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fcdd1cf4dd63aecf86c987d7f4ec7187be5c2fbc'/>
<id>fcdd1cf4dd63aecf86c987d7f4ec7187be5c2fbc</id>
<content type='text'>
icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default,
or more if some sysctl (eg tcp_retries2) are changed.

Better use 64bit to perform icsk_rto &lt;&lt; icsk_backoff operations

As Joe Perches suggested, add a helper for this.

Yuchung spotted the tcp_v4_err() case.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default,
or more if some sysctl (eg tcp_retries2) are changed.

Better use 64bit to perform icsk_rto &lt;&lt; icsk_backoff operations

As Joe Perches suggested, add a helper for this.

Yuchung spotted the tcp_v4_err() case.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: remove TCP_SKB_CB(skb)-&gt;when</title>
<updated>2014-09-06T00:49:33+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2014-09-05T22:33:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7faee5c0d514162853a343d93e4a0b6bb8bfec21'/>
<id>7faee5c0d514162853a343d93e4a0b6bb8bfec21</id>
<content type='text'>
After commit 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution"),
we no longer need to maintain timestamps in two different fields.

TCP_SKB_CB(skb)-&gt;when can be removed, as same information sits in skb_mstamp.stamp_jiffies

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
After commit 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution"),
we no longer need to maintain timestamps in two different fields.

TCP_SKB_CB(skb)-&gt;when can be removed, as same information sits in skb_mstamp.stamp_jiffies

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: reduce spurious retransmits due to transient SACK reneging</title>
<updated>2014-08-05T23:29:33+00:00</updated>
<author>
<name>Neal Cardwell</name>
<email>ncardwell@google.com</email>
</author>
<published>2014-08-04T23:12:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5ae344c949e79b8545a11db149f0a85a6e59e1f3'/>
<id>5ae344c949e79b8545a11db149f0a85a6e59e1f3</id>
<content type='text'>
This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.

When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutiveyly
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.

Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.

To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.

A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.

We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:

 (1) 90% of inter-ACK delays were less than 10ms
 (2) 99% of inter-ACK delays were less than RTT/2

In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.

Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.

When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutiveyly
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.

Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.

To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.

A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.

We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:

 (1) 90% of inter-ACK delays were less than 10ms
 (2) 99% of inter-ACK delays were less than RTT/2

In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.

Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
</feed>
