<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/net/core/sock.c, branch v3.13.7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>net: use __GFP_NORETRY for high order allocations</title>
<updated>2014-03-07T06:06:15+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2014-02-06T18:42:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7f35646d50ad3052697dd29e22ba5be74f4dbcd6'/>
<id>7f35646d50ad3052697dd29e22ba5be74f4dbcd6</id>
<content type='text'>
[ Upstream commit ed98df3361f059db42786c830ea96e2d18b8d4db ]

sock_alloc_send_pskb() &amp; sk_page_frag_refill()
have a loop trying high order allocations to prepare
skb with low number of fragments as this increases performance.

Problem is that under memory pressure/fragmentation, this can
trigger OOM while the intent was only to try the high order
allocations, then fallback to order-0 allocations.

We had various reports from unexpected regressions.

According to David, setting __GFP_NORETRY should be fine,
as the asynchronous compaction is still enabled, and this
will prevent OOM from kicking as in :

CFSClientEventm invoked oom-killer: gfp_mask=0x42d0, order=3, oom_adj=0,
oom_score_adj=0, oom_score_badness=2 (enabled),memcg_scoring=disabled
CFSClientEventm

Call Trace:
 [&lt;ffffffff8043766c&gt;] dump_header+0xe1/0x23e
 [&lt;ffffffff80437a02&gt;] oom_kill_process+0x6a/0x323
 [&lt;ffffffff80438443&gt;] out_of_memory+0x4b3/0x50d
 [&lt;ffffffff8043a4a6&gt;] __alloc_pages_may_oom+0xa2/0xc7
 [&lt;ffffffff80236f42&gt;] __alloc_pages_nodemask+0x1002/0x17f0
 [&lt;ffffffff8024bd23&gt;] alloc_pages_current+0x103/0x2b0
 [&lt;ffffffff8028567f&gt;] sk_page_frag_refill+0x8f/0x160
 [&lt;ffffffff80295fa0&gt;] tcp_sendmsg+0x560/0xee0
 [&lt;ffffffff802a5037&gt;] inet_sendmsg+0x67/0x100
 [&lt;ffffffff80283c9c&gt;] __sock_sendmsg_nosec+0x6c/0x90
 [&lt;ffffffff80283e85&gt;] sock_sendmsg+0xc5/0xf0
 [&lt;ffffffff802847b6&gt;] __sys_sendmsg+0x136/0x430
 [&lt;ffffffff80284ec8&gt;] sys_sendmsg+0x88/0x110
 [&lt;ffffffff80711472&gt;] system_call_fastpath+0x16/0x1b
Out of Memory: Kill process 2856 (bash) score 9999 or sacrifice child

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit ed98df3361f059db42786c830ea96e2d18b8d4db ]

sock_alloc_send_pskb() &amp; sk_page_frag_refill()
have a loop trying high order allocations to prepare
skb with low number of fragments as this increases performance.

Problem is that under memory pressure/fragmentation, this can
trigger OOM while the intent was only to try the high order
allocations, then fallback to order-0 allocations.

We had various reports from unexpected regressions.

According to David, setting __GFP_NORETRY should be fine,
as the asynchronous compaction is still enabled, and this
will prevent OOM from kicking as in :

CFSClientEventm invoked oom-killer: gfp_mask=0x42d0, order=3, oom_adj=0,
oom_score_adj=0, oom_score_badness=2 (enabled),memcg_scoring=disabled
CFSClientEventm

Call Trace:
 [&lt;ffffffff8043766c&gt;] dump_header+0xe1/0x23e
 [&lt;ffffffff80437a02&gt;] oom_kill_process+0x6a/0x323
 [&lt;ffffffff80438443&gt;] out_of_memory+0x4b3/0x50d
 [&lt;ffffffff8043a4a6&gt;] __alloc_pages_may_oom+0xa2/0xc7
 [&lt;ffffffff80236f42&gt;] __alloc_pages_nodemask+0x1002/0x17f0
 [&lt;ffffffff8024bd23&gt;] alloc_pages_current+0x103/0x2b0
 [&lt;ffffffff8028567f&gt;] sk_page_frag_refill+0x8f/0x160
 [&lt;ffffffff80295fa0&gt;] tcp_sendmsg+0x560/0xee0
 [&lt;ffffffff802a5037&gt;] inet_sendmsg+0x67/0x100
 [&lt;ffffffff80283c9c&gt;] __sock_sendmsg_nosec+0x6c/0x90
 [&lt;ffffffff80283e85&gt;] sock_sendmsg+0xc5/0xf0
 [&lt;ffffffff802847b6&gt;] __sys_sendmsg+0x136/0x430
 [&lt;ffffffff80284ec8&gt;] sys_sendmsg+0x88/0x110
 [&lt;ffffffff80711472&gt;] system_call_fastpath+0x16/0x1b
Out of Memory: Kill process 2856 (bash) score 9999 or sacrifice child

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: unix: allow set_peek_off to fail</title>
<updated>2013-12-11T02:45:15+00:00</updated>
<author>
<name>Sasha Levin</name>
<email>sasha.levin@oracle.com</email>
</author>
<published>2013-12-07T22:26:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=12663bfc97c8b3fdb292428105dd92d563164050'/>
<id>12663bfc97c8b3fdb292428105dd92d563164050</id>
<content type='text'>
unix_dgram_recvmsg() will hold the readlock of the socket until recv
is complete.

In the same time, we may try to setsockopt(SO_PEEK_OFF) which will hang until
unix_dgram_recvmsg() will complete (which can take a while) without allowing
us to break out of it, triggering a hung task spew.

Instead, allow set_peek_off to fail, this way userspace will not hang.

Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Acked-by: Pavel Emelyanov &lt;xemul@parallels.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
unix_dgram_recvmsg() will hold the readlock of the socket until recv
is complete.

In the same time, we may try to setsockopt(SO_PEEK_OFF) which will hang until
unix_dgram_recvmsg() will complete (which can take a while) without allowing
us to break out of it, triggering a hung task spew.

Instead, allow set_peek_off to fail, this way userspace will not hang.

Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Acked-by: Pavel Emelyanov &lt;xemul@parallels.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: remove function sk_reset_txq()</title>
<updated>2013-10-22T18:00:21+00:00</updated>
<author>
<name>ZHAO Gang</name>
<email>gamerh2o@gmail.com</email>
</author>
<published>2013-10-22T08:23:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0a6957e7d47096bbeedda4e1d926359eb487dcfc'/>
<id>0a6957e7d47096bbeedda4e1d926359eb487dcfc</id>
<content type='text'>
What sk_reset_txq() does is just calls function sk_tx_queue_reset(),
and sk_reset_txq() is used only in sock.h, by dst_negative_advice().
Let dst_negative_advice() calls sk_tx_queue_reset() directly so we
can remove unneeded sk_reset_txq().

Signed-off-by: ZHAO Gang &lt;gamerh2o@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
What sk_reset_txq() does is just calls function sk_tx_queue_reset(),
and sk_reset_txq() is used only in sock.h, by dst_negative_advice().
Let dst_negative_advice() calls sk_tx_queue_reset() directly so we
can remove unneeded sk_reset_txq().

Signed-off-by: ZHAO Gang &lt;gamerh2o@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: refactor sk_page_frag_refill()</title>
<updated>2013-10-18T04:08:51+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-10-17T23:27:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=400dfd3ae899849b27d398ca7894e1b44430887f'/>
<id>400dfd3ae899849b27d398ca7894e1b44430887f</id>
<content type='text'>
While working on virtio_net new allocation strategy to increase
payload/truesize ratio, we found that refactoring sk_page_frag_refill()
was needed.

This patch splits sk_page_frag_refill() into two parts, adding
skb_page_frag_refill() which can be used without a socket.

While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill()

Michael will either use netdev_alloc_frag() from softirq context,
or skb_page_frag_refill() from process context in refill_work()
 (GFP_KERNEL allocations)

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Michael Dalton &lt;mwdalton@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
While working on virtio_net new allocation strategy to increase
payload/truesize ratio, we found that refactoring sk_page_frag_refill()
was needed.

This patch splits sk_page_frag_refill() into two parts, adding
skb_page_frag_refill() which can be used without a socket.

While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill()

Michael will either use netdev_alloc_frag() from softirq context,
or skb_page_frag_refill() from process context in refill_work()
 (GFP_KERNEL allocations)

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Michael Dalton &lt;mwdalton@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2013-10-09T03:07:53+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2013-10-09T03:07:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=53af53ae83fe960ceb9ef74cac7915e9088f4266'/>
<id>53af53ae83fe960ceb9ef74cac7915e9088f4266</id>
<content type='text'>
Conflicts:
	include/linux/netdevice.h
	net/core/sock.c

Trivial merge issues.

Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.

Two parallel line additions in net/core/sock.c

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Conflicts:
	include/linux/netdevice.h
	net/core/sock.c

Trivial merge issues.

Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.

Two parallel line additions in net/core/sock.c

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>pkt_sched: fq: fix non TCP flows pacing</title>
<updated>2013-10-09T01:54:01+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-10-08T22:16:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7eec4174ff29cd42f2acfae8112f51c228545d40'/>
<id>7eec4174ff29cd42f2acfae8112f51c228545d40</id>
<content type='text'>
Steinar reported FQ pacing was not working for UDP flows.

It looks like the initial sk-&gt;sk_pacing_rate value of 0 was
a wrong choice. We should init it to ~0U (unlimited)

Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes
no real sense. The default rate is really unlimited, and we
need to avoid a zero divide.

Reported-by: Steinar H. Gunderson &lt;sesse@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Steinar reported FQ pacing was not working for UDP flows.

It looks like the initial sk-&gt;sk_pacing_rate value of 0 was
a wrong choice. We should init it to ~0U (unlimited)

Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes
no real sense. The default rate is really unlimited, and we
need to avoid a zero divide.

Reported-by: Steinar H. Gunderson &lt;sesse@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: introduce SO_MAX_PACING_RATE</title>
<updated>2013-09-28T22:35:41+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-09-24T15:20:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=62748f32d501f5d3712a7c372bbb92abc7c62bc7'/>
<id>62748f32d501f5d3712a7c372bbb92abc7c62bc7</id>
<content type='text'>
As mentioned in commit afe4fd062416b ("pkt_sched: fq: Fair Queue packet
scheduler"), this patch adds a new socket option.

SO_MAX_PACING_RATE offers the application the ability to cap the
rate computed by transport layer. Value is in bytes per second.

u32 val = 1000000;
setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &amp;val, sizeof(val));

To be effectively paced, a flow must use FQ packet scheduler.

Note that a packet scheduler takes into account the headers for its
computations. The effective payload rate depends on MSS and retransmits
if any.

I chose to make this pacing rate a SOL_SOCKET option instead of a
TCP one because this can be used by other protocols.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Steinar H. Gunderson &lt;sesse@google.com&gt;
Cc: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
As mentioned in commit afe4fd062416b ("pkt_sched: fq: Fair Queue packet
scheduler"), this patch adds a new socket option.

SO_MAX_PACING_RATE offers the application the ability to cap the
rate computed by transport layer. Value is in bytes per second.

u32 val = 1000000;
setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &amp;val, sizeof(val));

To be effectively paced, a flow must use FQ packet scheduler.

Note that a packet scheduler takes into account the headers for its
computations. The effective payload rate depends on MSS and retransmits
if any.

I chose to make this pacing rate a SOL_SOCKET option instead of a
TCP one because this can be used by other protocols.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Steinar H. Gunderson &lt;sesse@google.com&gt;
Cc: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: attempt high order allocations in sock_alloc_send_pskb()</title>
<updated>2013-08-10T08:16:44+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-08-08T21:38:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=28d6427109d13b0f447cba5761f88d3548e83605'/>
<id>28d6427109d13b0f447cba5761f88d3548e83605</id>
<content type='text'>
Adding paged frags skbs to af_unix sockets introduced a performance
regression on large sends because of additional page allocations, even
if each skb could carry at least 100% more payload than before.

We can instruct sock_alloc_send_pskb() to attempt high order
allocations.

Most of the time, it does a single page allocation instead of 8.

I added an additional parameter to sock_alloc_send_pskb() to
let other users to opt-in for this new feature on followup patches.

Tested:

Before patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    46861.15

After patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    57981.11

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Adding paged frags skbs to af_unix sockets introduced a performance
regression on large sends because of additional page allocations, even
if each skb could carry at least 100% more payload than before.

We can instruct sock_alloc_send_pskb() to attempt high order
allocations.

Most of the time, it does a single page allocation instead of 8.

I added an additional parameter to sock_alloc_send_pskb() to
let other users to opt-in for this new feature on followup patches.

Tested:

Before patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    46861.15

After patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    57981.11

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2013-08-04T04:36:46+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2013-08-04T04:36:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0e76a3a587fc7abda2badf249053b427baad255e'/>
<id>0e76a3a587fc7abda2badf249053b427baad255e</id>
<content type='text'>
Merge net into net-next to setup some infrastructure Eric
Dumazet needs for usbnet changes.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Merge net into net-next to setup some infrastructure Eric
Dumazet needs for usbnet changes.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL</title>
<updated>2013-08-01T22:11:17+00:00</updated>
<author>
<name>Cong Wang</name>
<email>amwang@redhat.com</email>
</author>
<published>2013-08-01T03:10:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e0d1095ae3405404d247afb00233ef837d58da83'/>
<id>e0d1095ae3405404d247afb00233ef837d58da83</id>
<content type='text'>
Eliezer renames several *ll_poll to *busy_poll, but forgets
CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too.

Cc: Eliezer Tamir &lt;eliezer.tamir@linux.intel.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Cong Wang &lt;amwang@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Eliezer renames several *ll_poll to *busy_poll, but forgets
CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too.

Cc: Eliezer Tamir &lt;eliezer.tamir@linux.intel.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Cong Wang &lt;amwang@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
</feed>
