<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/net/unix, branch v4.9.91</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>net/unix: don't show information about sockets from other namespaces</title>
<updated>2017-11-18T10:22:22+00:00</updated>
<author>
<name>Andrei Vagin</name>
<email>avagin@openvz.org</email>
</author>
<published>2017-10-25T17:16:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=62de3fe46c6b11ad3db086615dd3d46a35e7b06d'/>
<id>62de3fe46c6b11ad3db086615dd3d46a35e7b06d</id>
<content type='text'>
[ Upstream commit 0f5da659d8f1810f44de14acf2c80cd6499623a0 ]

socket_diag shows information only about sockets from a namespace where
a diag socket lives.

But if we request information about one unix socket, the kernel don't
check that its netns is matched with a diag socket namespace, so any
user can get information about any unix socket in a system. This looks
like a bug.

v2: add a Fixes tag

Fixes: 51d7cccf0723 ("net: make sock diag per-namespace")
Signed-off-by: Andrei Vagin &lt;avagin@openvz.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 0f5da659d8f1810f44de14acf2c80cd6499623a0 ]

socket_diag shows information only about sockets from a namespace where
a diag socket lives.

But if we request information about one unix socket, the kernel don't
check that its netns is matched with a diag socket namespace, so any
user can get information about any unix socket in a system. This looks
like a bug.

v2: add a Fixes tag

Fixes: 51d7cccf0723 ("net: make sock diag per-namespace")
Signed-off-by: Andrei Vagin &lt;avagin@openvz.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: Add sockaddr length checks before accessing sa_family in bind and connect handlers</title>
<updated>2017-07-05T12:40:14+00:00</updated>
<author>
<name>Mateusz Jurczyk</name>
<email>mjurczyk@google.com</email>
</author>
<published>2017-06-08T09:13:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=bb84290cd2967a5774a97fa44381713e20a7924c'/>
<id>bb84290cd2967a5774a97fa44381713e20a7924c</id>
<content type='text'>
[ Upstream commit defbcf2decc903a28d8398aa477b6881e711e3ea ]

Verify that the caller-provided sockaddr structure is large enough to
contain the sa_family field, before accessing it in bind() and connect()
handlers of the AF_UNIX socket. Since neither syscall enforces a minimum
size of the corresponding memory region, very short sockaddrs (zero or
one byte long) result in operating on uninitialized memory while
referencing .sa_family.

Signed-off-by: Mateusz Jurczyk &lt;mjurczyk@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit defbcf2decc903a28d8398aa477b6881e711e3ea ]

Verify that the caller-provided sockaddr structure is large enough to
contain the sa_family field, before accessing it in bind() and connect()
handlers of the AF_UNIX socket. Since neither syscall enforces a minimum
size of the corresponding memory region, very short sockaddrs (zero or
one byte long) result in operating on uninitialized memory while
referencing .sa_family.

Signed-off-by: Mateusz Jurczyk &lt;mjurczyk@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: unix: properly re-increment inflight counter of GC discarded candidates</title>
<updated>2017-03-30T07:41:21+00:00</updated>
<author>
<name>Andrey Ulanov</name>
<email>andreyu@google.com</email>
</author>
<published>2017-03-15T03:16:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=91ad0c0885c85fbea1900de6870b7416597d6dfa'/>
<id>91ad0c0885c85fbea1900de6870b7416597d6dfa</id>
<content type='text'>
[ Upstream commit 7df9c24625b9981779afb8fcdbe2bb4765e61147 ]

Dmitry has reported that a BUG_ON() condition in unix_notinflight()
may be triggered by a simple code that forwards unix socket in an
SCM_RIGHTS message.
That is caused by incorrect unix socket GC implementation in unix_gc().

The GC first collects list of candidates, then (a) decrements their
"children's" inflight counter, (b) checks which inflight counters are
now 0, and then (c) increments all inflight counters back.
(a) and (c) are done by calling scan_children() with inc_inflight or
dec_inflight as the second argument.

Commit 6209344f5a37 ("net: unix: fix inflight counting bug in garbage
collector") changed scan_children() such that it no longer considers
sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
of code that that unsets this flag _before_ invoking
scan_children(, dec_iflight, ). This may lead to incorrect inflight
counters for some sockets.

This change fixes this bug by changing order of operations:
UNIX_GC_CANDIDATE is now unset only after all inflight counters are
restored to the original state.

  kernel BUG at net/unix/garbage.c:149!
  RIP: 0010:[&lt;ffffffff8717ebf4&gt;]  [&lt;ffffffff8717ebf4&gt;]
  unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
  Call Trace:
   [&lt;ffffffff8716cfbf&gt;] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
   [&lt;ffffffff8716f6a9&gt;] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
   [&lt;ffffffff86a90a01&gt;] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
   [&lt;ffffffff86a9808a&gt;] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
   [&lt;ffffffff86a980ea&gt;] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
   [&lt;ffffffff86a98284&gt;] kfree_skb+0x184/0x570 net/core/skbuff.c:705
   [&lt;ffffffff871789d5&gt;] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
   [&lt;ffffffff87179039&gt;] unix_release+0x49/0x90 net/unix/af_unix.c:836
   [&lt;ffffffff86a694b2&gt;] sock_release+0x92/0x1f0 net/socket.c:570
   [&lt;ffffffff86a6962b&gt;] sock_close+0x1b/0x20 net/socket.c:1017
   [&lt;ffffffff81a76b8e&gt;] __fput+0x34e/0x910 fs/file_table.c:208
   [&lt;ffffffff81a771da&gt;] ____fput+0x1a/0x20 fs/file_table.c:244
   [&lt;ffffffff81483ab0&gt;] task_work_run+0x1a0/0x280 kernel/task_work.c:116
   [&lt;     inline     &gt;] exit_task_work include/linux/task_work.h:21
   [&lt;ffffffff8141287a&gt;] do_exit+0x183a/0x2640 kernel/exit.c:828
   [&lt;ffffffff8141383e&gt;] do_group_exit+0x14e/0x420 kernel/exit.c:931
   [&lt;ffffffff814429d3&gt;] get_signal+0x663/0x1880 kernel/signal.c:2307
   [&lt;ffffffff81239b45&gt;] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
   [&lt;ffffffff8100666a&gt;] exit_to_usermode_loop+0x1ea/0x2d0
  arch/x86/entry/common.c:156
   [&lt;     inline     &gt;] prepare_exit_to_usermode arch/x86/entry/common.c:190
   [&lt;ffffffff81009693&gt;] syscall_return_slowpath+0x4d3/0x570
  arch/x86/entry/common.c:259
   [&lt;ffffffff881478e6&gt;] entry_SYSCALL_64_fastpath+0xc4/0xc6

Link: https://lkml.org/lkml/2017/3/6/252
Signed-off-by: Andrey Ulanov &lt;andreyu@google.com&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 7df9c24625b9981779afb8fcdbe2bb4765e61147 ]

Dmitry has reported that a BUG_ON() condition in unix_notinflight()
may be triggered by a simple code that forwards unix socket in an
SCM_RIGHTS message.
That is caused by incorrect unix socket GC implementation in unix_gc().

The GC first collects list of candidates, then (a) decrements their
"children's" inflight counter, (b) checks which inflight counters are
now 0, and then (c) increments all inflight counters back.
(a) and (c) are done by calling scan_children() with inc_inflight or
dec_inflight as the second argument.

Commit 6209344f5a37 ("net: unix: fix inflight counting bug in garbage
collector") changed scan_children() such that it no longer considers
sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
of code that that unsets this flag _before_ invoking
scan_children(, dec_iflight, ). This may lead to incorrect inflight
counters for some sockets.

This change fixes this bug by changing order of operations:
UNIX_GC_CANDIDATE is now unset only after all inflight counters are
restored to the original state.

  kernel BUG at net/unix/garbage.c:149!
  RIP: 0010:[&lt;ffffffff8717ebf4&gt;]  [&lt;ffffffff8717ebf4&gt;]
  unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
  Call Trace:
   [&lt;ffffffff8716cfbf&gt;] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
   [&lt;ffffffff8716f6a9&gt;] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
   [&lt;ffffffff86a90a01&gt;] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
   [&lt;ffffffff86a9808a&gt;] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
   [&lt;ffffffff86a980ea&gt;] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
   [&lt;ffffffff86a98284&gt;] kfree_skb+0x184/0x570 net/core/skbuff.c:705
   [&lt;ffffffff871789d5&gt;] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
   [&lt;ffffffff87179039&gt;] unix_release+0x49/0x90 net/unix/af_unix.c:836
   [&lt;ffffffff86a694b2&gt;] sock_release+0x92/0x1f0 net/socket.c:570
   [&lt;ffffffff86a6962b&gt;] sock_close+0x1b/0x20 net/socket.c:1017
   [&lt;ffffffff81a76b8e&gt;] __fput+0x34e/0x910 fs/file_table.c:208
   [&lt;ffffffff81a771da&gt;] ____fput+0x1a/0x20 fs/file_table.c:244
   [&lt;ffffffff81483ab0&gt;] task_work_run+0x1a0/0x280 kernel/task_work.c:116
   [&lt;     inline     &gt;] exit_task_work include/linux/task_work.h:21
   [&lt;ffffffff8141287a&gt;] do_exit+0x183a/0x2640 kernel/exit.c:828
   [&lt;ffffffff8141383e&gt;] do_group_exit+0x14e/0x420 kernel/exit.c:931
   [&lt;ffffffff814429d3&gt;] get_signal+0x663/0x1880 kernel/signal.c:2307
   [&lt;ffffffff81239b45&gt;] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
   [&lt;ffffffff8100666a&gt;] exit_to_usermode_loop+0x1ea/0x2d0
  arch/x86/entry/common.c:156
   [&lt;     inline     &gt;] prepare_exit_to_usermode arch/x86/entry/common.c:190
   [&lt;ffffffff81009693&gt;] syscall_return_slowpath+0x4d3/0x570
  arch/x86/entry/common.c:259
   [&lt;ffffffff881478e6&gt;] entry_SYSCALL_64_fastpath+0xc4/0xc6

Link: https://lkml.org/lkml/2017/3/6/252
Signed-off-by: Andrey Ulanov &lt;andreyu@google.com&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: move unix_mknod() out of bindlock</title>
<updated>2017-02-04T08:47:10+00:00</updated>
<author>
<name>WANG Cong</name>
<email>xiyou.wangcong@gmail.com</email>
</author>
<published>2017-01-23T19:17:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=93ff5e03bcba0761055491dc6bf52b1e0e33bbe6'/>
<id>93ff5e03bcba0761055491dc6bf52b1e0e33bbe6</id>
<content type='text'>
[ Upstream commit 0fb44559ffd67de8517098b81f675fa0210f13f0 ]

Dmitry reported a deadlock scenario:

unix_bind() path:
u-&gt;bindlock ==&gt; sb_writer

do_splice() path:
sb_writer ==&gt; pipe-&gt;mutex ==&gt; u-&gt;bindlock

In the unix_bind() code path, unix_mknod() does not have to
be done with u-&gt;bindlock held, since it is a pure fs operation,
so we can just move unix_mknod() out.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Tested-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Rainer Weikusat &lt;rweikusat@mobileactivedefense.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Cong Wang &lt;xiyou.wangcong@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 0fb44559ffd67de8517098b81f675fa0210f13f0 ]

Dmitry reported a deadlock scenario:

unix_bind() path:
u-&gt;bindlock ==&gt; sb_writer

do_splice() path:
sb_writer ==&gt; pipe-&gt;mutex ==&gt; u-&gt;bindlock

In the unix_bind() code path, unix_mknod() does not have to
be done with u-&gt;bindlock held, since it is a pure fs operation,
so we can just move unix_mknod() out.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Tested-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Rainer Weikusat &lt;rweikusat@mobileactivedefense.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Cong Wang &lt;xiyou.wangcong@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: conditionally use freezable blocking calls in read</title>
<updated>2016-11-18T18:58:39+00:00</updated>
<author>
<name>WANG Cong</name>
<email>xiyou.wangcong@gmail.com</email>
</author>
<published>2016-11-17T23:55:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=06a77b07e3b44aea2b3c0e64de420ea2cfdcbaa9'/>
<id>06a77b07e3b44aea2b3c0e64de420ea2cfdcbaa9</id>
<content type='text'>
Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably
correct at that time, but later, commit 2b514574f7e8
("net: af_unix: implement splice for stream af_unix sockets") breaks
the strong requirement for a freezable sleep, according to
commit 0f9548ca1091:

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

The pipe_lock is still held at that point.

So use freezable version only for the recvmsg call path, avoid impact for
Android.

Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets")
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Colin Cross &lt;ccross@android.com&gt;
Cc: Rafael J. Wysocki &lt;rafael.j.wysocki@intel.com&gt;
Cc: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: Cong Wang &lt;xiyou.wangcong@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably
correct at that time, but later, commit 2b514574f7e8
("net: af_unix: implement splice for stream af_unix sockets") breaks
the strong requirement for a freezable sleep, according to
commit 0f9548ca1091:

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

The pipe_lock is still held at that point.

So use freezable version only for the recvmsg call path, avoid impact for
Android.

Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets")
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Colin Cross &lt;ccross@android.com&gt;
Cc: Rafael J. Wysocki &lt;rafael.j.wysocki@intel.com&gt;
Cc: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: Cong Wang &lt;xiyou.wangcong@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>unix: escape all null bytes in abstract unix domain socket</title>
<updated>2016-11-01T16:15:13+00:00</updated>
<author>
<name>Isaac Boukris</name>
<email>iboukris@gmail.com</email>
</author>
<published>2016-11-01T00:41:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e7947ea770d0de434d38a0f823e660d3fd4bebb5'/>
<id>e7947ea770d0de434d38a0f823e660d3fd4bebb5</id>
<content type='text'>
Abstract unix domain socket may embed null characters,
these should be translated to '@' when printed out to
proc the same way the null prefix is currently being
translated.

This helps for tools such as netstat, lsof and the proc
based implementation in ss to show all the significant
bytes of the name (instead of getting cut at the first
null occurrence).

Signed-off-by: Isaac Boukris &lt;iboukris@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Abstract unix domain socket may embed null characters,
these should be translated to '@' when printed out to
proc the same way the null prefix is currently being
translated.

This helps for tools such as netstat, lsof and the proc
based implementation in ss to show all the significant
bytes of the name (instead of getting cut at the first
null occurrence).

Signed-off-by: Isaac Boukris &lt;iboukris@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>skb_splice_bits(): get rid of callback</title>
<updated>2016-10-04T00:40:56+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2016-09-18T01:02:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=25869262ef7af24ccde988867ac3eb1c3d4b88d4'/>
<id>25869262ef7af24ccde988867ac3eb1c3d4b88d4</id>
<content type='text'>
since pipe_lock is the outermost now, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need to have a socket-specific callback; we can just
call splice_to_pipe() and be done with that.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
since pipe_lock is the outermost now, we don't need to drop/regain
socket locks around the call of splice_to_pipe() from skb_splice_bits(),
which kills the need to have a socket-specific callback; we can just
call splice_to_pipe() and be done with that.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: split 'u-&gt;readlock' into two: 'iolock' and 'bindlock'</title>
<updated>2016-09-04T20:29:29+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-09-01T21:43:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6e1ce3c3451291142a57c4f3f6f999a29fb5b3bc'/>
<id>6e1ce3c3451291142a57c4f3f6f999a29fb5b3bc</id>
<content type='text'>
Right now we use the 'readlock' both for protecting some of the af_unix
IO path and for making the bind be single-threaded.

The two are independent, but using the same lock makes for a nasty
deadlock due to ordering with regards to filesystem locking.  The bind
locking would want to nest outside the VSF pathname locking, but the IO
locking wants to nest inside some of those same locks.

We tried to fix this earlier with commit c845acb324aa ("af_unix: Fix
splice-bind deadlock") which moved the readlock inside the vfs locks,
but that caused problems with overlayfs that will then call back into
filesystem routines that take the lock in the wrong order anyway.

Splitting the locks means that we can go back to having the bind lock be
the outermost lock, and we don't have any deadlocks with lock ordering.

Acked-by: Rainer Weikusat &lt;rweikusat@cyberadapt.com&gt;
Acked-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Acked-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Right now we use the 'readlock' both for protecting some of the af_unix
IO path and for making the bind be single-threaded.

The two are independent, but using the same lock makes for a nasty
deadlock due to ordering with regards to filesystem locking.  The bind
locking would want to nest outside the VSF pathname locking, but the IO
locking wants to nest inside some of those same locks.

We tried to fix this earlier with commit c845acb324aa ("af_unix: Fix
splice-bind deadlock") which moved the readlock inside the vfs locks,
but that caused problems with overlayfs that will then call back into
filesystem routines that take the lock in the wrong order anyway.

Splitting the locks means that we can go back to having the bind lock be
the outermost lock, and we don't have any deadlocks with lock ordering.

Acked-by: Rainer Weikusat &lt;rweikusat@cyberadapt.com&gt;
Acked-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Acked-by: Hannes Frederic Sowa &lt;hannes@stressinduktion.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Revert "af_unix: Fix splice-bind deadlock"</title>
<updated>2016-09-04T20:29:29+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-09-01T21:56:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=38f7bd94a97b542de86a2be9229289717e33a7a4'/>
<id>38f7bd94a97b542de86a2be9229289717e33a7a4</id>
<content type='text'>
This reverts commit c845acb324aa85a39650a14e7696982ceea75dc1.

It turns out that it just replaces one deadlock with another one: we can
still get the wrong lock ordering with the readlock due to overlayfs
calling back into the filesystem layer and still taking the vfs locks
after the readlock.

The proper solution ends up being to just split the readlock into two
pieces: the bind lock (taken *outside* the vfs locks) and the IO lock
(taken *inside* the filesystem locks).  The two locks are independent
anyway.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Reviewed-by: Shmulik Ladkani &lt;shmulik.ladkani@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This reverts commit c845acb324aa85a39650a14e7696982ceea75dc1.

It turns out that it just replaces one deadlock with another one: we can
still get the wrong lock ordering with the readlock due to overlayfs
calling back into the filesystem layer and still taking the vfs locks
after the readlock.

The proper solution ends up being to just split the readlock into two
pieces: the bind lock (taken *outside* the vfs locks) and the IO lock
(taken *inside* the filesystem locks).  The two locks are independent
anyway.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Reviewed-by: Shmulik Ladkani &lt;shmulik.ladkani@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: charge buffers to kmemcg</title>
<updated>2016-07-26T23:19:19+00:00</updated>
<author>
<name>Vladimir Davydov</name>
<email>vdavydov@virtuozzo.com</email>
</author>
<published>2016-07-26T22:24:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3aa9799e13645fda605e1c68831f2d4256a38537'/>
<id>3aa9799e13645fda605e1c68831f2d4256a38537</id>
<content type='text'>
Unix sockets can consume a significant amount of system memory, hence
they should be accounted to kmemcg.

Since unix socket buffers are always allocated from process context, all
we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
sock-&gt;sk_allocation mask.

Eric asked:

&gt; 1) What happens when a buffer, allocated from socket &lt;A&gt; lands in a
&gt; different socket &lt;B&gt;, maybe owned by another user/process.
&gt;
&gt; Who owns it now, in term of kmemcg accounting ?

We never move memcg charges.  E.g.  if two processes from different
cgroups are sharing a memory region, each page will be charged to the
process which touched it first.  Or if two processes are working with
the same directory tree, inodes and dentries will be charged to the
first user.  The same is fair for unix socket buffers - they will be
charged to the sender.

&gt; 2) Has performance impact been evaluated ?

I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
x2 HT box.  The results are below:

 # clients            bandwidth (10^6bits/sec)
                    base              patched
         1      67643 +-  725      64874 +-  353    - 4.0 %
         4     193585 +- 2516     186715 +- 1460    - 3.5 %
         8     194820 +-  377     187443 +- 1229    - 3.7 %

So the accounting doesn't come for free - it takes ~4% of performance.
I believe we could optimize it by using per cpu batching not only on
charge, but also on uncharge in memcg core, but that's beyond the scope
of this patch set - I'll take a look at this later.

Anyway, if performance impact is found to be unacceptable, it is always
possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
or not use memory cgroups at runtime at all (thanks to jump labels
there'll be no overhead even if they are compiled in).

Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov &lt;vdavydov@virtuozzo.com&gt;
Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Unix sockets can consume a significant amount of system memory, hence
they should be accounted to kmemcg.

Since unix socket buffers are always allocated from process context, all
we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
sock-&gt;sk_allocation mask.

Eric asked:

&gt; 1) What happens when a buffer, allocated from socket &lt;A&gt; lands in a
&gt; different socket &lt;B&gt;, maybe owned by another user/process.
&gt;
&gt; Who owns it now, in term of kmemcg accounting ?

We never move memcg charges.  E.g.  if two processes from different
cgroups are sharing a memory region, each page will be charged to the
process which touched it first.  Or if two processes are working with
the same directory tree, inodes and dentries will be charged to the
first user.  The same is fair for unix socket buffers - they will be
charged to the sender.

&gt; 2) Has performance impact been evaluated ?

I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
x2 HT box.  The results are below:

 # clients            bandwidth (10^6bits/sec)
                    base              patched
         1      67643 +-  725      64874 +-  353    - 4.0 %
         4     193585 +- 2516     186715 +- 1460    - 3.5 %
         8     194820 +-  377     187443 +- 1229    - 3.7 %

So the accounting doesn't come for free - it takes ~4% of performance.
I believe we could optimize it by using per cpu batching not only on
charge, but also on uncharge in memcg core, but that's beyond the scope
of this patch set - I'll take a look at this later.

Anyway, if performance impact is found to be unacceptable, it is always
possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
or not use memory cgroups at runtime at all (thanks to jump labels
there'll be no overhead even if they are compiled in).

Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov &lt;vdavydov@virtuozzo.com&gt;
Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
