linux-toradex.git/net/ipv4/tcp_timer.c, branch v3.12.43

tcp: refactor F-RTO

2013-03-21T15:47:50+00:00

This patch series refactors the F-RTO feature (RFC4138/5682).

This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features.  It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).

The new code implements newer F-RTO RFC5682 using CA_Loss processing
path.  F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets the congestion control module determine how many packets to
send independently.  F-RTO only chooses what to send in order to detect
spurious retransmission. If the timeout is found to be spurious, it
invokes existing Eifel undo algorithms like DSACK or TCP timestamp
based detection.

The first patch removes all F-RTO code except sysctl_tcp_frto, which is
left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
Westwood now computes ssthresh on the regular timeout CA_EVENT_LOSS event.

Signed-off-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller

tcp: TLP loss detection.

2013-03-12T12:30:34+00:00

This is the second of the TLP patch series; it augments the basic TLP
algorithm with a loss detection scheme.

This patch implements a mechanism for loss detection when a Tail
loss probe retransmission plugs a hole thereby masking packet loss
from the sender. The loss detection algorithm relies on counting
TLP dupacks as outlined in Sec. 3 of:
http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01

The basic idea is: Sender keeps track of TLP "episode" upon
retransmission of a TLP packet. An episode ends when the sender receives
an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
episode. We want to make sure that before the episode ends the sender
receives a "TLP dupack", indicating that the TLP retransmission was
unnecessary, so there was no loss/hole that needed plugging. If the
sender gets no TLP dupack before the end of the episode, then it reduces
ssthresh and the congestion window, because the TLP packet arriving at
the receiver probably plugged a hole.
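
The episode tracking described above can be sketched in userspace C as
follows. This is a minimal illustration and not the kernel code: the
struct tlp_sketch type, its fields other than tlp_high_seq, the helper
names, and the Reno-style halving used for the reduction are all
assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified sender state; not the kernel's struct tcp_sock. */
struct tlp_sketch {
    bool     in_episode;     /* a TLP retransmission is outstanding */
    bool     saw_tlp_dupack; /* a TLP dupack arrived during this episode */
    uint32_t tlp_high_seq;   /* SND.NXT when the probe was retransmitted */
    uint32_t cwnd;           /* congestion window, in packets */
    uint32_t ssthresh;       /* slow start threshold, in packets */
};

/* True if sequence number a is after b, tolerating 32-bit wraparound. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Called when a TLP packet is retransmitted. */
static void tlp_start_episode(struct tlp_sketch *s, uint32_t snd_nxt)
{
    s->in_episode = true;
    s->saw_tlp_dupack = false;
    s->tlp_high_seq = snd_nxt;
}

/* Called for every incoming ACK; the caller decides whether the ACK
 * qualifies as a TLP dupack. */
static void tlp_process_ack(struct tlp_sketch *s, uint32_t ack_seq,
                            bool is_tlp_dupack)
{
    if (!s->in_episode)
        return;
    if (is_tlp_dupack)
        s->saw_tlp_dupack = true;
    if (!seq_after(ack_seq, s->tlp_high_seq))
        return; /* episode is still open */

    /* ACK above tlp_high_seq: the episode ends here. */
    s->in_episode = false;
    if (!s->saw_tlp_dupack) {
        /* No dupack seen, so the probe probably plugged a hole.
         * Assumed reduction: halve the window, with a floor of 2. */
        s->ssthresh = s->cwnd / 2 > 2 ? s->cwnd / 2 : 2;
        s->cwnd = s->ssthresh;
    }
}
```

In this sketch an episode that sees a TLP dupack ends without touching
cwnd, while an episode that ends without one reduces both ssthresh and
cwnd.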

Signed-off-by: Nandita Dukkipati 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller

tcp: Tail loss probe (TLP)

2013-03-12T12:30:34+00:00

This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.

TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.

PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for the delayed
ACK timer when there is only one outstanding packet.

TLP Algorithm

On transmission of new data in Open state:
  -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
  -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
  -> PTO = min(PTO, RTO)

Conditions for scheduling PTO:
  -> Connection is in Open state.
  -> Connection is either cwnd limited or no new data to send.
  -> Number of probes per tail loss episode is limited to one.
  -> Connection is SACK enabled.

When PTO fires:
  new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

  no_new_packet:
    -> retransmit the last segment.
       Its ACK triggers FACK or early retransmit based recovery.

ACK path:
  -> rearm RTO at start of ACK processing.
  -> reschedule PTO if need be.
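
The PTO scheduling rules listed above can be written out as a small
helper. This is only a sketch under stated assumptions: the function
name and millisecond units are invented for the example, and a single
smoothed RTT value is used in both branches even though the text writes
SRTT in one case and RTT in the other.

```c
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Hypothetical helper: probe timeout in milliseconds.
 * The caller is expected to have checked the scheduling conditions
 * (Open state, SACK enabled, and so on) and that packets_out >= 1. */
static uint32_t tlp_probe_timeout_ms(uint32_t srtt_ms, uint32_t rto_ms,
                                     uint32_t packets_out)
{
    uint32_t pto;

    if (packets_out == 1)
        /* One packet outstanding: leave room for a delayed ACK,
         * max(2*RTT, 1.5*RTT + 200ms). */
        pto = max_u32(2 * srtt_ms, (3 * srtt_ms) / 2 + 200);
    else
        /* More than one packet outstanding: max(2*SRTT, 10ms). */
        pto = max_u32(2 * srtt_ms, 10);

    /* The probe must never fire later than the RTO would. */
    return min_u32(pto, rto_ms);
}
```

For instance, with a smoothed RTT of 100 ms and an RTO of 1000 ms, the
helper yields 200 ms when several packets are outstanding and 350 ms
when only one is.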

In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
		 ==1; enables RFC5827 ER.
		 ==2; delayed ER.
		 ==3; TLP and delayed ER. [DEFAULT]
		 ==4; TLP only.
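
As a reading aid, the sysctl values above can be decoded into separate
feature flags. The struct and function below are hypothetical names for
illustration and do not exist in the kernel; treating values outside
0..4 as invalid is also an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical decoded view of the tcp_early_retrans sysctl. */
struct er_tlp_mode {
    bool early_retrans; /* some form of ER is enabled */
    bool delayed;       /* ER uses the delayed variant */
    bool tlp;           /* tail loss probe is enabled */
    bool valid;         /* value was in the documented range */
};

static struct er_tlp_mode decode_tcp_early_retrans(int value)
{
    struct er_tlp_mode m = { false, false, false, true };

    switch (value) {
    case 0: /* TLP and ER both disabled */
        break;
    case 1: /* RFC5827 ER */
        m.early_retrans = true;
        break;
    case 2: /* delayed ER */
        m.early_retrans = true;
        m.delayed = true;
        break;
    case 3: /* TLP and delayed ER (default) */
        m.early_retrans = true;
        m.delayed = true;
        m.tlp = true;
        break;
    case 4: /* TLP only */
        m.tlp = true;
        break;
    default:
        m.valid = false;
        break;
    }
    return m;
}
```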

The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.

Signed-off-by: Nandita Dukkipati 
Acked-by: Neal Cardwell 
Acked-by: Yuchung Cheng 
Signed-off-by: David S. Miller

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

2012-11-10T23:32:51+00:00

Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c

Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net.  Based upon a conflict resolution
patch posted by Stephen Rothwell.

Signed-off-by: David S. Miller

tcp: better retrans tracking for defer-accept

2012-11-03T18:45:00+00:00

For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

SYNACKs are not sent for TCP_DEFER_ACCEPT requests that were established
(for which we received the ACK from the client). Only the last SYNACK is
sent so that we can again receive an ACK from the client, to move the req
into the accept queue. We plan to change this later to avoid the useless
retransmit (and potential problem, as this SYNACK could be lost).

TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.

Split the req->retrans field into two independent fields:

num_retrans : number of retransmits
num_timeout : number of timeouts

num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.

Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.
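
The split between the two counters can be illustrated with a small
userspace model. Everything here except the num_retrans and num_timeout
field names is an assumption for the example: the struct, the callback
type, the stub senders, and the doubling backoff with a cap are
simplified stand-ins and not the kernel's request_sock code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced request state. */
struct req_sketch {
    uint8_t num_retrans; /* SYNACKs actually retransmitted */
    uint8_t num_timeout; /* timer expirations, sent or not */
};

/* Stand-in for ->rtx_syn_ack(): returns 0 on success. */
typedef int (*rtx_fn)(struct req_sketch *req);

static int send_ok(struct req_sketch *req)   { (void)req; return 0; }
static int send_fail(struct req_sketch *req) { (void)req; return -1; }

/* Count a retransmit only if the send succeeded. */
static int rtx_syn_ack_sketch(struct req_sketch *req, rtx_fn send)
{
    int err = send(req);

    if (!err)
        req->num_retrans++;
    return err;
}

/* Timer expiry: the timeout counter always advances, even when the
 * SYNACK is suppressed (for example by TCP_DEFER_ACCEPT). */
static void on_synack_timeout(struct req_sketch *req, rtx_fn send,
                              bool suppress_synack)
{
    if (!suppress_synack)
        rtx_syn_ack_sketch(req, send);
    req->num_timeout++;
}

/* Exponential backoff driven by num_timeout, capped at max_ms.
 * Assumes init_ms <= max_ms. */
static uint32_t next_timeout_ms(const struct req_sketch *req,
                                uint32_t init_ms, uint32_t max_ms)
{
    uint32_t t = init_ms;

    for (uint8_t i = 0; i < req->num_timeout; i++) {
        if (t > max_ms / 2)
            return max_ms;
        t *= 2;
    }
    return t;
}
```

With this separation, a suppressed or failed SYNACK still lengthens the
next timeout but no longer shows up as a retransmit.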

Reported-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Cc: Julian Anastasov 
Cc: Vijay Subramanian 
Cc: Elliott Hughes 
Cc: Neal Cardwell 
Signed-off-by: David S. Miller

tcp: Reject invalid ack_seq to Fast Open sockets

2012-10-23T06:42:56+00:00

A packet with an invalid ack_seq may cause a TCP Fast Open socket to switch
to the unexpected TCP_CLOSING state, triggering a BUG_ON kernel panic.

When a FIN packet with an invalid ack_seq# arrives at a socket in
the TCP_FIN_WAIT1 state, rather than discarding the packet, the current
code will accept the FIN, causing state transition to TCP_CLOSING.

This may be a small deviation from RFC793, which seems to say that the
packet should be dropped. Unfortunately I did not expect this case for
Fast Open hence it will trigger a BUG_ON panic.

It turns out there is really nothing bad about a TFO socket going into
TCP_CLOSING state so I could just remove the BUG_ON statements. But after
some thought I think it's better to treat this case like TCP_SYN_RECV
and return a RST to the confused peer who caused the unacceptable ack_seq
to be generated in the first place.

Signed-off-by: H.K. Jerry Chu 
Cc: Neal Cardwell 
Cc: Yuchung Cheng 
Acked-by: Yuchung Cheng 
Acked-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller

tcp: TCP Fast Open Server - support TFO listeners

2012-09-01T00:02:19+00:00

This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() so it can be used to check a TFO socket
as well as a request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Cc: Eric Dumazet 
Cc: Tom Herbert 
Signed-off-by: David S. Miller

tcp: fix possible socket refcount problem

2012-08-21T21:42:23+00:00

Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added a bug leading to the following trace:

[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [] process_one_work+0x1de/0x47f
[ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [] process_one_work+0x1de/0x47f
[ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [] tcp_sendmsg+0x29/0xcc6
[ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281]    [] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281]  [] ? __sk_free+0xfd/0x114
[ 2866.132281]  [] kmem_cache_free+0x6b/0x13a
[ 2866.132281]  [] __sk_free+0xfd/0x114
[ 2866.132281]  [] sk_free+0x1c/0x1e
[ 2866.132281]  [] tcp_write_timer+0x51/0x56
[ 2866.132281]  [] run_timer_softirq+0x218/0x35f
[ 2866.132281]  [] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281]  [] ? rb_commit+0x58/0x85
[ 2866.132281]  [] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281]  [] __do_softirq+0xcb/0x1f9
[ 2866.132281]  [] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281]  [] call_softirq+0x1c/0x30
[ 2866.132281]  [] do_softirq+0x4a/0xa6
[ 2866.132281]  [] irq_exit+0x51/0xad
[ 2866.132281]  [] do_IRQ+0x9d/0xb4
[ 2866.132281]  [] common_interrupt+0x6f/0x6f
[ 2866.132281]    [] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281]  [] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281]  [] mod_timer+0x178/0x1a9
[ 2866.132281]  [] sk_reset_timer+0x19/0x26
[ 2866.132281]  [] tcp_rearm_rto+0x99/0xa4
[ 2866.132281]  [] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281]  [] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281]  [] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281]  [] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281]  [] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281]  [] inet_sendmsg+0xaa/0xd5
[ 2866.132281]  [] ? inet_autobind+0x5f/0x5f
[ 2866.132281]  [] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [] sock_sendmsg+0xa3/0xc4
[ 2866.132281]  [] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281]  [] ? native_sched_clock+0x29/0x6f
[ 2866.132281]  [] ? sched_clock+0x9/0xd
[ 2866.132281]  [] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [] kernel_sendmsg+0x37/0x43
[ 2866.132281]  [] xs_send_kvec+0x77/0x80
[ 2866.132281]  [] xs_sendpages+0x6f/0x1a0
[ 2866.132281]  [] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281]  [] xs_tcp_send_request+0x55/0xf1
[ 2866.132281]  [] xprt_transmit+0x89/0x1db
[ 2866.132281]  [] ? call_connect+0x3c/0x3c
[ 2866.132281]  [] call_transmit+0x1c5/0x20e
[ 2866.132281]  [] __rpc_execute+0x6f/0x225
[ 2866.132281]  [] ? call_connect+0x3c/0x3c
[ 2866.132281]  [] rpc_async_schedule+0x28/0x34
[ 2866.132281]  [] process_one_work+0x24d/0x47f
[ 2866.132281]  [] ? process_one_work+0x1de/0x47f
[ 2866.132281]  [] ? __rpc_execute+0x225/0x225
[ 2866.132281]  [] worker_thread+0x236/0x317
[ 2866.132281]  [] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281]  [] kthread+0x9a/0xa2
[ 2866.132281]  [] kernel_thread_helper+0x4/0x10
[ 2866.132281]  [] ? retint_restore_args+0x13/0x13
[ 2866.132281]  [] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281]  [] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]

The bug comes from the fact that the timer set in sk_reset_timer() can run
before we actually do the sock_hold(). The socket refcount reaches zero and
we free the socket too soon.

The timer handler is not allowed to reduce the socket refcnt if the socket
is owned by the user, or we need to change the sk_reset_timer() implementation.

We should take a reference on the socket in case the TCP_WRITE_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bits are set in tsq_flags.

Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.

For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even if not fired from a timer.
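
The reference rule described above can be modelled with C11 atomics.
This is a userspace sketch under assumptions, not the kernel fix: the
struct, flag names, and functions are invented for illustration, and the
model assumes each armed timer owns one reference that the handler drops
on exit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for the *_DEFERRED bits. */
enum {
    WRITE_TIMER_DEFERRED_SKETCH  = 1u << 0,
    DELACK_TIMER_DEFERRED_SKETCH = 1u << 1,
};

/* Hypothetical, reduced socket model. */
struct sock_sketch {
    atomic_int  refcnt;         /* socket reference count */
    atomic_uint deferred_flags; /* work postponed to release time */
    bool        owned_by_user;  /* user context holds the socket lock */
    int         handled;        /* timer events actually processed */
};

/* Timer handler: runs with the reference the armed timer owns. */
static void timer_fired(struct sock_sketch *sk, unsigned int flag)
{
    if (sk->owned_by_user) {
        /* Defer the work. Take an extra reference only when this call
         * is the one that sets the bit, so the deferred work keeps the
         * socket alive after the timer's own reference is dropped. */
        unsigned int old = atomic_fetch_or(&sk->deferred_flags, flag);

        if (!(old & flag))
            atomic_fetch_add(&sk->refcnt, 1);
    } else {
        sk->handled++;
    }
    atomic_fetch_sub(&sk->refcnt, 1); /* drop the timer's reference */
}

/* Called by the lock owner right before it releases the socket. */
static void release_cb_sketch(struct sock_sketch *sk)
{
    unsigned int flags = atomic_exchange(&sk->deferred_flags, 0);

    if (flags & WRITE_TIMER_DEFERRED_SKETCH) {
        sk->handled++;
        atomic_fetch_sub(&sk->refcnt, 1); /* drop the deferred reference */
    }
    if (flags & DELACK_TIMER_DEFERRED_SKETCH) {
        sk->handled++;
        atomic_fetch_sub(&sk->refcnt, 1);
    }
}
```

In this model the count never drops while the user owns the socket: the
deferred path adds one reference before the timer's reference is
released, and the release callback gives it back after running the
handler.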

Reported-by: Fengguang Wu 
Tested-by: Fengguang Wu 
Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

tcp: improve latencies of timer triggered events

2012-07-20T17:59:41+00:00

The modern TCP stack depends heavily on tcp_write_timer() having a small
latency, but the current implementation doesn't exactly meet the
expectations.

When a timer fires but finds the socket is owned by the user, it rearms
itself for an additional delay, hoping the next run will be more
successful.

tcp_write_timer(), for example, uses a 50ms delay for the next try, and it
defeats many attempts to get predictable TCP behavior in terms of
latency.

Use the recently introduced tcp_release_cb(), so that the user owning
the socket will call various handlers right before socket release.

This will permit us to post a followup patch to address the
tcp_tso_should_defer() syndrome (some deferred packets have to wait for
the RTO timer to be transmitted, while cwnd should allow us to send them
sooner).

Signed-off-by: Eric Dumazet 
Cc: Tom Herbert 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Cc: Nandita Dukkipati 
Cc: H.K. Jerry Chu 
Cc: John Heffner 
Signed-off-by: David S. Miller

tcp: early retransmit: delayed fast retransmit

2012-05-03T00:56:10+00:00

Implement the advanced early retransmit (sysctl_tcp_early_retrans==2),
which delays the fast retransmit by an interval of RTT/4. We borrow the
RTO timer to implement the delay. If we receive another ACK or send
a new packet, the timer is cancelled and restored to original RTO
value offset by time elapsed.  When the delayed-ER timer fires,
we enter fast recovery and perform fast retransmit.
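
The timer borrowing described above can be sketched as follows. This is
an illustrative model only: the struct and function names are invented,
times are plain milliseconds, and details such as minimum delays are
left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the shared retransmission timer. */
struct rto_timer_sketch {
    uint32_t rto_ms;       /* full RTO value */
    uint32_t rto_start_ms; /* when the RTO was originally armed */
    uint32_t expires_ms;   /* absolute time at which the timer fires */
    bool     delayed_er;   /* timer currently serves delayed ER */
};

/* Borrow the RTO timer to delay the fast retransmit by RTT/4. */
static void arm_delayed_er(struct rto_timer_sketch *t, uint32_t now_ms,
                           uint32_t srtt_ms)
{
    t->expires_ms = now_ms + srtt_ms / 4;
    t->delayed_er = true;
}

/* Another ACK arrived or new data was sent: cancel delayed ER and put
 * the timer back to the original RTO, offset by the time elapsed. */
static void cancel_delayed_er(struct rto_timer_sketch *t, uint32_t now_ms)
{
    uint32_t elapsed = now_ms - t->rto_start_ms;
    uint32_t remaining = elapsed < t->rto_ms ? t->rto_ms - elapsed : 0;

    t->expires_ms = now_ms + remaining;
    t->delayed_er = false;
}
```

With an RTO of 300 ms armed at t=1000 and an SRTT of 80 ms, arming
delayed ER at t=1100 moves the expiry to 1120; cancelling at t=1110
restores the original expiry of 1300.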

Signed-off-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller