<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/net/sctp, branch v3.0.44</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>sctp: Fix list corruption resulting from freeing an association on a list</title>
<updated>2012-08-09T15:27:51+00:00</updated>
<author>
<name>Neil Horman</name>
<email>nhorman@tuxdriver.com</email>
</author>
<published>2012-07-16T09:13:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2f890d2777247beb207be1c99835a0c5e09d340c'/>
<id>2f890d2777247beb207be1c99835a0c5e09d340c</id>
<content type='text'>
[ Upstream commit 2eebc1e188e9e45886ee00662519849339884d6d ]

A few days ago Dave Jones reported this oops:

[22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
[22766.295376] CPU 0
[22766.295384] Modules linked in:
[22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
ffff880147c03a74
[22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
[22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
[22766.387137] Stack:
[22766.387140]  ffff880147c03a10
[22766.387140]  ffffffffa169f2b6
[22766.387140]  ffff88013ed95728
[22766.387143]  0000000000000002
[22766.387143]  0000000000000000
[22766.387143]  ffff880003fad062
[22766.387144]  ffff88013c120000
[22766.387144]
[22766.387145] Call Trace:
[22766.387145]  &lt;IRQ&gt;
[22766.387150]  [&lt;ffffffffa169f292&gt;] ? __sctp_lookup_association+0x62/0xd0
[sctp]
[22766.387154]  [&lt;ffffffffa169f2b6&gt;] __sctp_lookup_association+0x86/0xd0 [sctp]
[22766.387157]  [&lt;ffffffffa169f597&gt;] sctp_rcv+0x207/0xbb0 [sctp]
[22766.387161]  [&lt;ffffffff810d4da8&gt;] ? trace_hardirqs_off_caller+0x28/0xd0
[22766.387163]  [&lt;ffffffff815827e3&gt;] ? nf_hook_slow+0x133/0x210
[22766.387166]  [&lt;ffffffff815902fc&gt;] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387168]  [&lt;ffffffff8159043d&gt;] ip_local_deliver_finish+0x18d/0x4c0
[22766.387169]  [&lt;ffffffff815902fc&gt;] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387171]  [&lt;ffffffff81590a07&gt;] ip_local_deliver+0x47/0x80
[22766.387172]  [&lt;ffffffff8158fd80&gt;] ip_rcv_finish+0x150/0x680
[22766.387174]  [&lt;ffffffff81590c54&gt;] ip_rcv+0x214/0x320
[22766.387176]  [&lt;ffffffff81558c07&gt;] __netif_receive_skb+0x7b7/0x910
[22766.387178]  [&lt;ffffffff8155856c&gt;] ? __netif_receive_skb+0x11c/0x910
[22766.387180]  [&lt;ffffffff810d423e&gt;] ? put_lock_stats.isra.25+0xe/0x40
[22766.387182]  [&lt;ffffffff81558f83&gt;] netif_receive_skb+0x23/0x1f0
[22766.387183]  [&lt;ffffffff815596a9&gt;] ? dev_gro_receive+0x139/0x440
[22766.387185]  [&lt;ffffffff81559280&gt;] napi_skb_finish+0x70/0xa0
[22766.387187]  [&lt;ffffffff81559cb5&gt;] napi_gro_receive+0xf5/0x130
[22766.387218]  [&lt;ffffffffa01c4679&gt;] e1000_receive_skb+0x59/0x70 [e1000e]
[22766.387242]  [&lt;ffffffffa01c5aab&gt;] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
[22766.387266]  [&lt;ffffffffa01c9c18&gt;] e1000e_poll+0x78/0x430 [e1000e]
[22766.387268]  [&lt;ffffffff81559fea&gt;] net_rx_action+0x1aa/0x3d0
[22766.387270]  [&lt;ffffffff810a495f&gt;] ? account_system_vtime+0x10f/0x130
[22766.387273]  [&lt;ffffffff810734d0&gt;] __do_softirq+0xe0/0x420
[22766.387275]  [&lt;ffffffff8169826c&gt;] call_softirq+0x1c/0x30
[22766.387278]  [&lt;ffffffff8101db15&gt;] do_softirq+0xd5/0x110
[22766.387279]  [&lt;ffffffff81073bc5&gt;] irq_exit+0xd5/0xe0
[22766.387281]  [&lt;ffffffff81698b03&gt;] do_IRQ+0x63/0xd0
[22766.387283]  [&lt;ffffffff8168ee2f&gt;] common_interrupt+0x6f/0x6f
[22766.387283]  &lt;EOI&gt;
[22766.387284]
[22766.387285]  [&lt;ffffffff8168eed9&gt;] ? retint_swapgs+0x13/0x1b
[22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
89 e5 48 83
ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 &lt;0f&gt; b7 87 98 00 00 00
48 89 fb
49 89 f5 66 c1 c0 08 66 39 46 02
[22766.387307]
[22766.387307] RIP
[22766.387311]  [&lt;ffffffffa168a2c9&gt;] sctp_assoc_is_match+0x19/0x90 [sctp]
[22766.387311]  RSP &lt;ffff880147c039b0&gt;
[22766.387142]  ffffffffa16ab120
[22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
[22766.601221] Kernel panic - not syncing: Fatal exception in interrupt

It appears from his analysis and some staring at the code that this is likely
occuring because an association is getting freed while still on the
sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
while a freed node corrupts part of the list.

Nominally I would think that an mibalanced refcount was responsible for this,
but I can't seem to find any obvious imbalance.  What I did note however was
that the two places where we create an association using
sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
which free a newly created association after calling sctp_primitive_ASSOCIATE.
sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
the aforementioned hash table.  the sctp command interpreter that process side
effects has not way to unwind previously processed commands, so freeing the
association from the __sctp_connect or sctp_sendmsg error path would lead to a
freed association remaining on this hash table.

I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
which allows us to proerly use hlist_unhashed to check if the node is on a
hashlist safely during a delete.  That in turn alows us to safely call
sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
before freeing them, regardles of what the associations state is on the hash
list.

I noted, while I was doing this, that the __sctp_unhash_endpoint was using
hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
pointers to make that function work properly, so I fixed that up in a simmilar
fashion.

I attempted to test this using a virtual guest running the SCTP_RR test from
netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
able to recreate the problem prior to this fix, nor was I able to trigger the
failure after (neither of which I suppose is suprising).  Given the trace above
however, I think its likely that this is what we hit.

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;
Reported-by: davej@redhat.com
CC: davej@redhat.com
CC: "David S. Miller" &lt;davem@davemloft.net&gt;
CC: Vlad Yasevich &lt;vyasevich@gmail.com&gt;
CC: Sridhar Samudrala &lt;sri@us.ibm.com&gt;
CC: linux-sctp@vger.kernel.org
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 2eebc1e188e9e45886ee00662519849339884d6d ]

A few days ago Dave Jones reported this oops:

[22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
[22766.295376] CPU 0
[22766.295384] Modules linked in:
[22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
ffff880147c03a74
[22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
[22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
[22766.387137] Stack:
[22766.387140]  ffff880147c03a10
[22766.387140]  ffffffffa169f2b6
[22766.387140]  ffff88013ed95728
[22766.387143]  0000000000000002
[22766.387143]  0000000000000000
[22766.387143]  ffff880003fad062
[22766.387144]  ffff88013c120000
[22766.387144]
[22766.387145] Call Trace:
[22766.387145]  &lt;IRQ&gt;
[22766.387150]  [&lt;ffffffffa169f292&gt;] ? __sctp_lookup_association+0x62/0xd0
[sctp]
[22766.387154]  [&lt;ffffffffa169f2b6&gt;] __sctp_lookup_association+0x86/0xd0 [sctp]
[22766.387157]  [&lt;ffffffffa169f597&gt;] sctp_rcv+0x207/0xbb0 [sctp]
[22766.387161]  [&lt;ffffffff810d4da8&gt;] ? trace_hardirqs_off_caller+0x28/0xd0
[22766.387163]  [&lt;ffffffff815827e3&gt;] ? nf_hook_slow+0x133/0x210
[22766.387166]  [&lt;ffffffff815902fc&gt;] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387168]  [&lt;ffffffff8159043d&gt;] ip_local_deliver_finish+0x18d/0x4c0
[22766.387169]  [&lt;ffffffff815902fc&gt;] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387171]  [&lt;ffffffff81590a07&gt;] ip_local_deliver+0x47/0x80
[22766.387172]  [&lt;ffffffff8158fd80&gt;] ip_rcv_finish+0x150/0x680
[22766.387174]  [&lt;ffffffff81590c54&gt;] ip_rcv+0x214/0x320
[22766.387176]  [&lt;ffffffff81558c07&gt;] __netif_receive_skb+0x7b7/0x910
[22766.387178]  [&lt;ffffffff8155856c&gt;] ? __netif_receive_skb+0x11c/0x910
[22766.387180]  [&lt;ffffffff810d423e&gt;] ? put_lock_stats.isra.25+0xe/0x40
[22766.387182]  [&lt;ffffffff81558f83&gt;] netif_receive_skb+0x23/0x1f0
[22766.387183]  [&lt;ffffffff815596a9&gt;] ? dev_gro_receive+0x139/0x440
[22766.387185]  [&lt;ffffffff81559280&gt;] napi_skb_finish+0x70/0xa0
[22766.387187]  [&lt;ffffffff81559cb5&gt;] napi_gro_receive+0xf5/0x130
[22766.387218]  [&lt;ffffffffa01c4679&gt;] e1000_receive_skb+0x59/0x70 [e1000e]
[22766.387242]  [&lt;ffffffffa01c5aab&gt;] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
[22766.387266]  [&lt;ffffffffa01c9c18&gt;] e1000e_poll+0x78/0x430 [e1000e]
[22766.387268]  [&lt;ffffffff81559fea&gt;] net_rx_action+0x1aa/0x3d0
[22766.387270]  [&lt;ffffffff810a495f&gt;] ? account_system_vtime+0x10f/0x130
[22766.387273]  [&lt;ffffffff810734d0&gt;] __do_softirq+0xe0/0x420
[22766.387275]  [&lt;ffffffff8169826c&gt;] call_softirq+0x1c/0x30
[22766.387278]  [&lt;ffffffff8101db15&gt;] do_softirq+0xd5/0x110
[22766.387279]  [&lt;ffffffff81073bc5&gt;] irq_exit+0xd5/0xe0
[22766.387281]  [&lt;ffffffff81698b03&gt;] do_IRQ+0x63/0xd0
[22766.387283]  [&lt;ffffffff8168ee2f&gt;] common_interrupt+0x6f/0x6f
[22766.387283]  &lt;EOI&gt;
[22766.387284]
[22766.387285]  [&lt;ffffffff8168eed9&gt;] ? retint_swapgs+0x13/0x1b
[22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
89 e5 48 83
ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 &lt;0f&gt; b7 87 98 00 00 00
48 89 fb
49 89 f5 66 c1 c0 08 66 39 46 02
[22766.387307]
[22766.387307] RIP
[22766.387311]  [&lt;ffffffffa168a2c9&gt;] sctp_assoc_is_match+0x19/0x90 [sctp]
[22766.387311]  RSP &lt;ffff880147c039b0&gt;
[22766.387142]  ffffffffa16ab120
[22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
[22766.601221] Kernel panic - not syncing: Fatal exception in interrupt

It appears from his analysis and some staring at the code that this is likely
occuring because an association is getting freed while still on the
sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
while a freed node corrupts part of the list.

Nominally I would think that an mibalanced refcount was responsible for this,
but I can't seem to find any obvious imbalance.  What I did note however was
that the two places where we create an association using
sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
which free a newly created association after calling sctp_primitive_ASSOCIATE.
sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
the aforementioned hash table.  the sctp command interpreter that process side
effects has not way to unwind previously processed commands, so freeing the
association from the __sctp_connect or sctp_sendmsg error path would lead to a
freed association remaining on this hash table.

I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
which allows us to proerly use hlist_unhashed to check if the node is on a
hashlist safely during a delete.  That in turn alows us to safely call
sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
before freeing them, regardles of what the associations state is on the hash
list.

I noted, while I was doing this, that the __sctp_unhash_endpoint was using
hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
pointers to make that function work properly, so I fixed that up in a simmilar
fashion.

I attempted to test this using a virtual guest running the SCTP_RR test from
netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
able to recreate the problem prior to this fix, nor was I able to trigger the
failure after (neither of which I suppose is suprising).  Given the trace above
however, I think its likely that this is what we hit.

Signed-off-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;
Reported-by: davej@redhat.com
CC: davej@redhat.com
CC: "David S. Miller" &lt;davem@davemloft.net&gt;
CC: Vlad Yasevich &lt;vyasevich@gmail.com&gt;
CC: Sridhar Samudrala &lt;sri@us.ibm.com&gt;
CC: linux-sctp@vger.kernel.org
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sctp: check cached dst before using it</title>
<updated>2012-06-09T15:33:03+00:00</updated>
<author>
<name>Nicolas Dichtel</name>
<email>nicolas.dichtel@6wind.com</email>
</author>
<published>2012-05-04T05:24:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=337c934a55a7956bcd349b6f56c6add58d2550f3'/>
<id>337c934a55a7956bcd349b6f56c6add58d2550f3</id>
<content type='text'>
[ Upstream commit e0268868ba064980488fc8c194db3d8e9fb2959c ]

dst_check() will take care of SA (and obsolete field), hence
IPsec rekeying scenario is taken into account.

Signed-off-by: Nicolas Dichtel &lt;nicolas.dichtel@6wind.com&gt;
Acked-by: Vlad Yaseivch &lt;vyasevich@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit e0268868ba064980488fc8c194db3d8e9fb2959c ]

dst_check() will take care of SA (and obsolete field), hence
IPsec rekeying scenario is taken into account.

Signed-off-by: Nicolas Dichtel &lt;nicolas.dichtel@6wind.com&gt;
Acked-by: Vlad Yaseivch &lt;vyasevich@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sctp: Allow struct sctp_event_subscribe to grow without breaking binaries</title>
<updated>2012-04-27T16:51:18+00:00</updated>
<author>
<name>Thomas Graf</name>
<email>tgraf@infradead.org</email>
</author>
<published>2012-04-03T22:17:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3109ea06da8538ee3ded3489b26065a3be32f360'/>
<id>3109ea06da8538ee3ded3489b26065a3be32f360</id>
<content type='text'>
[ Upstream commit acdd5985364f8dc511a0762fab2e683f29d9d692 ]

getsockopt(..., SCTP_EVENTS, ...) performs a length check and returns
an error if the user provides less bytes than the size of struct
sctp_event_subscribe.

Struct sctp_event_subscribe needs to be extended by an u8 for every
new event or notification type that is added.

This obviously makes getsockopt fail for binaries that are compiled
against an older versions of &lt;net/sctp/user.h&gt; which do not contain
all event types.

This patch changes getsockopt behaviour to no longer return an error
if not enough bytes are being provided by the user. Instead, it
returns as much of sctp_event_subscribe as fits into the provided buffer.

This leads to the new behavior that users see what they have been aware
of at compile time.

The setsockopt(..., SCTP_EVENTS, ...) API is already behaving like this.

Signed-off-by: Thomas Graf &lt;tgraf@suug.ch&gt;
Acked-by: Vlad Yasevich &lt;vladislav.yasevich@hp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit acdd5985364f8dc511a0762fab2e683f29d9d692 ]

getsockopt(..., SCTP_EVENTS, ...) performs a length check and returns
an error if the user provides less bytes than the size of struct
sctp_event_subscribe.

Struct sctp_event_subscribe needs to be extended by an u8 for every
new event or notification type that is added.

This obviously makes getsockopt fail for binaries that are compiled
against an older versions of &lt;net/sctp/user.h&gt; which do not contain
all event types.

This patch changes getsockopt behaviour to no longer return an error
if not enough bytes are being provided by the user. Instead, it
returns as much of sctp_event_subscribe as fits into the provided buffer.

This leads to the new behavior that users see what they have been aware
of at compile time.

The setsockopt(..., SCTP_EVENTS, ...) API is already behaving like this.

Signed-off-by: Thomas Graf &lt;tgraf@suug.ch&gt;
Acked-by: Vlad Yasevich &lt;vladislav.yasevich@hp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sctp: Do not account for sizeof(struct sk_buff) in estimated rwnd</title>
<updated>2012-01-06T22:14:09+00:00</updated>
<author>
<name>Thomas Graf</name>
<email>tgraf@redhat.com</email>
</author>
<published>2011-12-19T04:11:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0e5fe3ed8d751c7be333fa193882e91dcc289158'/>
<id>0e5fe3ed8d751c7be333fa193882e91dcc289158</id>
<content type='text'>
[ Upstream commit a76c0adf60f6ca5ff3481992e4ea0383776b24d2 ]

When checking whether a DATA chunk fits into the estimated rwnd a
full sizeof(struct sk_buff) is added to the needed chunk size. This
quickly exhausts the available rwnd space and leads to packets being
sent which are much below the PMTU limit. This can lead to much worse
performance.

The reason for this behaviour was to avoid putting too much memory
pressure on the receiver. The concept is not completely irational
because a Linux receiver does in fact clone an skb for each DATA chunk
delivered. However, Linux also reserves half the available socket
buffer space for data structures therefore usage of it is already
accounted for.

When proposing to change this the last time it was noted that this
behaviour was introduced to solve a performance issue caused by rwnd
overusage in combination with small DATA chunks.

Trying to reproduce this I found that with the sk_buff overhead removed,
the performance would improve significantly unless socket buffer limits
are increased.

The following numbers have been gathered using a patched iperf
supporting SCTP over a live 1 Gbit ethernet network. The -l option
was used to limit DATA chunk sizes. The numbers listed are based on
the average of 3 test runs each. Default values have been used for
sk_(r|w)mem.

Chunk
Size    Unpatched     No Overhead
-------------------------------------
   4    15.2 Kbit [!]   12.2 Mbit [!]
   8    35.8 Kbit [!]   26.0 Mbit [!]
  16    95.5 Kbit [!]   54.4 Mbit [!]
  32   106.7 Mbit      102.3 Mbit
  64   189.2 Mbit      188.3 Mbit
 128   331.2 Mbit      334.8 Mbit
 256   537.7 Mbit      536.0 Mbit
 512   766.9 Mbit      766.6 Mbit
1024   810.1 Mbit      808.6 Mbit

Signed-off-by: Thomas Graf &lt;tgraf@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit a76c0adf60f6ca5ff3481992e4ea0383776b24d2 ]

When checking whether a DATA chunk fits into the estimated rwnd a
full sizeof(struct sk_buff) is added to the needed chunk size. This
quickly exhausts the available rwnd space and leads to packets being
sent which are much below the PMTU limit. This can lead to much worse
performance.

The reason for this behaviour was to avoid putting too much memory
pressure on the receiver. The concept is not completely irational
because a Linux receiver does in fact clone an skb for each DATA chunk
delivered. However, Linux also reserves half the available socket
buffer space for data structures therefore usage of it is already
accounted for.

When proposing to change this the last time it was noted that this
behaviour was introduced to solve a performance issue caused by rwnd
overusage in combination with small DATA chunks.

Trying to reproduce this I found that with the sk_buff overhead removed,
the performance would improve significantly unless socket buffer limits
are increased.

The following numbers have been gathered using a patched iperf
supporting SCTP over a live 1 Gbit ethernet network. The -l option
was used to limit DATA chunk sizes. The numbers listed are based on
the average of 3 test runs each. Default values have been used for
sk_(r|w)mem.

Chunk
Size    Unpatched     No Overhead
-------------------------------------
   4    15.2 Kbit [!]   12.2 Mbit [!]
   8    35.8 Kbit [!]   26.0 Mbit [!]
  16    95.5 Kbit [!]   54.4 Mbit [!]
  32   106.7 Mbit      102.3 Mbit
  64   189.2 Mbit      188.3 Mbit
 128   331.2 Mbit      334.8 Mbit
 256   537.7 Mbit      536.0 Mbit
 512   766.9 Mbit      766.6 Mbit
1024   810.1 Mbit      808.6 Mbit

Signed-off-by: Thomas Graf &lt;tgraf@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sctp: fix incorrect overflow check on autoclose</title>
<updated>2012-01-06T22:14:08+00:00</updated>
<author>
<name>Xi Wang</name>
<email>xi.wang@gmail.com</email>
</author>
<published>2011-12-16T12:44:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f6e4c89e089ae671a677242edb9e8b08c369c415'/>
<id>f6e4c89e089ae671a677242edb9e8b08c369c415</id>
<content type='text'>
[ Upstream commit 2692ba61a82203404abd7dd2a027bda962861f74 ]

Commit 8ffd3208 voids the previous patches f6778aab and 810c0719 for
limiting the autoclose value.  If userspace passes in -1 on 32-bit
platform, the overflow check didn't work and autoclose would be set
to 0xffffffff.

This patch defines a max_autoclose (in seconds) for limiting the value
and exposes it through sysctl, with the following intentions.

1) Avoid overflowing autoclose * HZ.

2) Keep the default autoclose bound consistent across 32- and 64-bit
   platforms (INT_MAX / HZ in this patch).

3) Keep the autoclose value consistent between setsockopt() and
   getsockopt() calls.

Suggested-by: Vlad Yasevich &lt;vladislav.yasevich@hp.com&gt;
Signed-off-by: Xi Wang &lt;xi.wang@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 2692ba61a82203404abd7dd2a027bda962861f74 ]

Commit 8ffd3208 voids the previous patches f6778aab and 810c0719 for
limiting the autoclose value.  If userspace passes in -1 on 32-bit
platform, the overflow check didn't work and autoclose would be set
to 0xffffffff.

This patch defines a max_autoclose (in seconds) for limiting the value
and exposes it through sysctl, with the following intentions.

1) Avoid overflowing autoclose * HZ.

2) Keep the default autoclose bound consistent across 32- and 64-bit
   platforms (INT_MAX / HZ in this patch).

3) Keep the autoclose value consistent between setsockopt() and
   getsockopt() calls.

Suggested-by: Vlad Yasevich &lt;vladislav.yasevich@hp.com&gt;
Signed-off-by: Xi Wang &lt;xi.wang@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: sctp: fix checksum marking for outgoing packets</title>
<updated>2011-07-14T22:16:31+00:00</updated>
<author>
<name>Michał Mirosław</name>
<email>mirq-linux@rere.qmqm.pl</email>
</author>
<published>2011-07-13T14:10:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b73c43f884b1b26ef8e824a33f3924f92e493c11'/>
<id>b73c43f884b1b26ef8e824a33f3924f92e493c11</id>
<content type='text'>
Packets to devices without NETIF_F_SCTP_CSUM (including NETIF_F_NO_CSUM)
should be properly checksummed because the packets can be diverted or
rerouted after construction. This still leaves packets diverted from
NETIF_F_SCTP_CSUM-enabled devices with broken checksums. Fixing this
needs implementing software offload fallback in networking core.

For users of sctp_checksum_disable, skb-&gt;ip_summed should be left as
CHECKSUM_NONE and not CHECKSUM_UNNECESSARY as per include/linux/skbuff.h.

Signed-off-by: Michał Mirosław &lt;mirq-linux@rere.qmqm.pl&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Packets to devices without NETIF_F_SCTP_CSUM (including NETIF_F_NO_CSUM)
should be properly checksummed because the packets can be diverted or
rerouted after construction. This still leaves packets diverted from
NETIF_F_SCTP_CSUM-enabled devices with broken checksums. Fixing this
needs implementing software offload fallback in networking core.

For users of sctp_checksum_disable, skb-&gt;ip_summed should be left as
CHECKSUM_NONE and not CHECKSUM_UNNECESSARY as per include/linux/skbuff.h.

Signed-off-by: Michał Mirosław &lt;mirq-linux@rere.qmqm.pl&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sctp: ABORT if receive, reassmbly, or reodering queue is not empty while closing socket</title>
<updated>2011-07-08T16:53:08+00:00</updated>
<author>
<name>Thomas Graf</name>
<email>tgraf@infradead.org</email>
</author>
<published>2011-07-08T04:37:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=cd4fcc704f30f2064ab30b5300d44d431e46db50'/>
<id>cd4fcc704f30f2064ab30b5300d44d431e46db50</id>
<content type='text'>
Trigger user ABORT if application closes a socket which has data
queued on the socket receive queue or chunks waiting on the
reassembly or ordering queue as this would imply data being lost
which defeats the point of a graceful shutdown.

This behavior is already practiced in TCP.

We do not check the input queue because that would mean to parse
all chunks on it to look for unacknowledged data which seems too
much of an effort. Control chunks or duplicated chunks may also
be in the input queue and should not be stopping a graceful
shutdown.

Signed-off-by: Thomas Graf &lt;tgraf@infradead.org&gt;
Acked-by: Vlad Yasevich &lt;vladislav.yasevich@hp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Trigger user ABORT if application closes a socket which has data
queued on the socket receive queue or chunks waiting on the
reassembly or ordering queue as this would imply data being lost
which defeats the point of a graceful shutdown.

This behavior is already practiced in TCP.

We do not check the input queue because that would mean to parse
all chunks on it to look for unacknowledged data which seems too
much of an effort. Control chunks or duplicated chunks may also
be in the input queue and should not be stopping a graceful
shutdown.

Signed-off-by: Thomas Graf &lt;tgraf@infradead.org&gt;
Acked-by: Vlad Yasevich &lt;vladislav.yasevich@hp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sctp: Enforce retransmission limit during shutdown</title>
<updated>2011-07-07T21:08:44+00:00</updated>
<author>
<name>Thomas Graf</name>
<email>tgraf@infradead.org</email>
</author>
<published>2011-07-07T00:28:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f8d9605243280f1870dd2c6c37a735b925c15f3c'/>
<id>f8d9605243280f1870dd2c6c37a735b925c15f3c</id>
<content type='text'>
When initiating a graceful shutdown while having data chunks
on the retransmission queue with a peer which is in zero
window mode the shutdown is never completed because the
retransmission error count is reset periodically by the
following two rules:

 - Do not timeout association while doing zero window probe.
 - Reset overall error count when a heartbeat request has
   been acknowledged.

The graceful shutdown will wait for all outstanding TSN to
be acknowledged before sending the SHUTDOWN request. This
never happens due to the peer's zero window not acknowledging
the continuously retransmitted data chunks. Although the
error counter is incremented for each failed retransmission,
the receiving of the SACK announcing the zero window clears
the error count again immediately. Also heartbeat requests
continue to be sent periodically. The peer acknowledges these
requests causing the error counter to be reset as well.

This patch changes behaviour to only reset the overall error
counter for the above rules while not in shutdown. After
reaching the maximum number of retransmission attempts, the
T5 shutdown guard timer is scheduled to give the receiver
some additional time to recover. The timer is stopped as soon
as the receiver acknowledges any data.

The issue can be easily reproduced by establishing a sctp
association over the loopback device, constantly queueing
data at the sender while not reading any at the receiver.
Wait for the window to reach zero, then initiate a shutdown
by killing both processes simultaneously. The association
will never be freed and the chunks on the retransmission
queue will be retransmitted indefinitely.

Signed-off-by: Thomas Graf &lt;tgraf@infradead.org&gt;
Acked-by: Vlad Yasevich &lt;vladislav.yasevich@hp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When initiating a graceful shutdown while having data chunks
on the retransmission queue with a peer which is in zero
window mode the shutdown is never completed because the
retransmission error count is reset periodically by the
following two rules:

 - Do not timeout association while doing zero window probe.
 - Reset overall error count when a heartbeat request has
   been acknowledged.

The graceful shutdown will wait for all outstanding TSN to
be acknowledged before sending the SHUTDOWN request. This
never happens due to the peer's zero window not acknowledging
the continuously retransmitted data chunks. Although the
error counter is incremented for each failed retransmission,
the receiving of the SACK announcing the zero window clears
the error count again immediately. Also heartbeat requests
continue to be sent periodically. The peer acknowledges these
requests causing the error counter to be reset as well.

This patch changes behaviour to only reset the overall error
counter for the above rules while not in shutdown. After
reaching the maximum number of retransmission attempts, the
T5 shutdown guard timer is scheduled to give the receiver
some additional time to recover. The timer is stopped as soon
as the receiver acknowledges any data.

The issue can be easily reproduced by establishing a sctp
association over the loopback device, constantly queueing
data at the sender while not reading any at the receiver.
Wait for the window to reach zero, then initiate a shutdown
by killing both processes simultaneously. The association
will never be freed and the chunks on the retransmission
queue will be retransmitted indefinitely.

Signed-off-by: Thomas Graf &lt;tgraf@infradead.org&gt;
Acked-by: Vlad Yasevich &lt;vladislav.yasevich@hp.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sctp: fix missing send up SCTP_SENDER_DRY_EVENT when subscribe it</title>
<updated>2011-07-07T11:10:26+00:00</updated>
<author>
<name>Wei Yongjun</name>
<email>yjwei@cn.fujitsu.com</email>
</author>
<published>2011-07-02T09:28:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=949123016a2ef578009b6aa3e98d45d1a154ebfb'/>
<id>949123016a2ef578009b6aa3e98d45d1a154ebfb</id>
<content type='text'>
We forgot to send up SCTP_SENDER_DRY_EVENT notification when
user app subscribes to this event, and there is no data to be
sent or retransmit.

This is required by the Socket API and used by the DTLS/SCTP
implementation.

Reported-by: Michael Tüxen &lt;Michael.Tuexen@lurchi.franken.de&gt;
Signed-off-by: Wei Yongjun &lt;yjwei@cn.fujitsu.com&gt;
Tested-by: Robin Seggelmann &lt;seggelmann@fh-muenster.de&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We forgot to send up SCTP_SENDER_DRY_EVENT notification when
user app subscribes to this event, and there is no data to be
sent or retransmit.

This is required by the Socket API and used by the DTLS/SCTP
implementation.

Reported-by: Michael Tüxen &lt;Michael.Tuexen@lurchi.franken.de&gt;
Signed-off-by: Wei Yongjun &lt;yjwei@cn.fujitsu.com&gt;
Tested-by: Robin Seggelmann &lt;seggelmann@fh-muenster.de&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: refine {udp|tcp|sctp}_mem limits</title>
<updated>2011-07-07T07:27:05+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>eric.dumazet@gmail.com</email>
</author>
<published>2011-07-07T07:27:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f03d78db65085609938fdb686238867e65003181'/>
<id>f03d78db65085609938fdb686238867e65003181</id>
<content type='text'>
Current tcp/udp/sctp global memory limits are not taking into account
hugepages allocations, and allow 50% of ram to be used by buffers of a
single protocol [ not counting space used by sockets / inodes ...]

Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
per protocol, and a minimum of 128 pages.
Heavy duty machines sysadmins probably need to tweak limits anyway.


References: https://bugzilla.stlinux.com/show_bug.cgi?id=38032
Reported-by: starlight &lt;starlight@binnacle.cx&gt;
Suggested-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Current tcp/udp/sctp global memory limits are not taking into account
hugepages allocations, and allow 50% of ram to be used by buffers of a
single protocol [ not counting space used by sockets / inodes ...]

Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
per protocol, and a minimum of 128 pages.
Heavy duty machines sysadmins probably need to tweak limits anyway.


References: https://bugzilla.stlinux.com/show_bug.cgi?id=38032
Reported-by: starlight &lt;starlight@binnacle.cx&gt;
Suggested-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
</feed>
