<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/fs/proc/inode.c, branch v3.3.1</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>procfs: add hidepid= and gid= mount options</title>
<updated>2012-01-11T00:30:54+00:00</updated>
<author>
<name>Vasiliy Kulikov</name>
<email>segooon@gmail.com</email>
</author>
<published>2012-01-10T23:11:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0499680a42141d86417a8fbaa8c8db806bea1201'/>
<id>0499680a42141d86417a8fbaa8c8db806bea1201</id>
<content type='text'>
Add support for mount options to restrict access to /proc/PID/
directories.  The default backward-compatible "relaxed" behaviour is left
untouched.

The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:

hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.

hidepid=1 means users may not access any /proc/&lt;pid&gt;/ directories, but
their own.  Sensitive files like cmdline, sched*, status are now protected
against other users.  As permission checking done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.

hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users.  It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g.  by kill -0 $PID), but it hides process' euid
and egid.  It compicates intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.

gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode).  This group should be used instead of putting
nonroot user in sudoers file or something.  However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.

hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystrokes
timings:

http://www.openwall.com/lists/oss-security/2011/11/05/3

hidepid=1/2 doesn't break monitoring userspace tools.  ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes.  pstree shows the process subtree which
contains "pstree" process.

Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368).  We rely on that the leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user started the setuid program may read the
counters.

Signed-off-by: Vasiliy Kulikov &lt;segoon@openwall.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Randy Dunlap &lt;rdunlap@xenotime.net&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: Theodore Tso &lt;tytso@MIT.EDU&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: James Morris &lt;jmorris@namei.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add support for mount options to restrict access to /proc/PID/
directories.  The default backward-compatible "relaxed" behaviour is left
untouched.

The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:

hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.

hidepid=1 means users may not access any /proc/&lt;pid&gt;/ directories, but
their own.  Sensitive files like cmdline, sched*, status are now protected
against other users.  As permission checking done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.

hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users.  It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g.  by kill -0 $PID), but it hides process' euid
and egid.  It compicates intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.

gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode).  This group should be used instead of putting
nonroot user in sudoers file or something.  However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.

hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystrokes
timings:

http://www.openwall.com/lists/oss-security/2011/11/05/3

hidepid=1/2 doesn't break monitoring userspace tools.  ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes.  pstree shows the process subtree which
contains "pstree" process.

Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368).  We rely on that the leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user started the setuid program may read the
counters.

Signed-off-by: Vasiliy Kulikov &lt;segoon@openwall.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Randy Dunlap &lt;rdunlap@xenotime.net&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: Theodore Tso &lt;tytso@MIT.EDU&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: James Morris &lt;jmorris@namei.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>procfs: parse mount options</title>
<updated>2012-01-11T00:30:54+00:00</updated>
<author>
<name>Vasiliy Kulikov</name>
<email>segooon@gmail.com</email>
</author>
<published>2012-01-10T23:11:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=97412950b10e64f347aec4a9b759395c2465adf6'/>
<id>97412950b10e64f347aec4a9b759395c2465adf6</id>
<content type='text'>
Add support for procfs mount options.  Actual mount options are coming in
the next patches.

Signed-off-by: Vasiliy Kulikov &lt;segoon@openwall.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Randy Dunlap &lt;rdunlap@xenotime.net&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: Theodore Tso &lt;tytso@MIT.EDU&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: James Morris &lt;jmorris@namei.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add support for procfs mount options.  Actual mount options are coming in
the next patches.

Signed-off-by: Vasiliy Kulikov &lt;segoon@openwall.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Randy Dunlap &lt;rdunlap@xenotime.net&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: Theodore Tso &lt;tytso@MIT.EDU&gt;
Cc: Alan Cox &lt;alan@lxorguk.ukuu.org.uk&gt;
Cc: James Morris &lt;jmorris@namei.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>vfs: fix the stupidity with i_dentry in inode destructors</title>
<updated>2012-01-04T03:52:40+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2011-12-12T20:51:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6b520e0565422966cdf1c3759bd73df77b0f248c'/>
<id>6b520e0565422966cdf1c3759bd73df77b0f248c</id>
<content type='text'>
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else.  Not to mention the removal of
boilerplate code from -&gt;destroy_inode() instances...

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else.  Not to mention the removal of
boilerplate code from -&gt;destroy_inode() instances...

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>filesystems: add set_nlink()</title>
<updated>2011-11-02T11:53:43+00:00</updated>
<author>
<name>Miklos Szeredi</name>
<email>mszeredi@suse.cz</email>
</author>
<published>2011-10-28T12:13:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=bfe8684869601dacfcb2cd69ef8cfd9045f62170'/>
<id>bfe8684869601dacfcb2cd69ef8cfd9045f62170</id>
<content type='text'>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi &lt;mszeredi@suse.cz&gt;
Tested-by: Toshiyuki Okajima &lt;toshi.okajima@jp.fujitsu.com&gt;
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi &lt;mszeredi@suse.cz&gt;
Tested-by: Toshiyuki Okajima &lt;toshi.okajima@jp.fujitsu.com&gt;
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>procfs: return ENOENT on opening a being-removed proc entry</title>
<updated>2011-07-26T23:49:43+00:00</updated>
<author>
<name>Daisuke Ogino</name>
<email>ogino.daisuke@jp.fujitsu.com</email>
</author>
<published>2011-07-26T23:08:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d2857e79a2ba7c155eaa1a7d3581c8d26b31e54e'/>
<id>d2857e79a2ba7c155eaa1a7d3581c8d26b31e54e</id>
<content type='text'>
Change the return value to ENOENT.  This return value is then returned
when opening the proc entry that have been removed.  For example,
open("/proc/bus/pci/XX/YY") when the corresponding device is being
hot-removed.

Signed-off-by: Daisuke Ogino &lt;ogino.daisuke@jp.fujitsu.com&gt;
Cc: Jesse Barnes &lt;jbarnes@virtuousgeek.org&gt;
Acked-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Change the return value to ENOENT.  This return value is then returned
when opening the proc entry that have been removed.  For example,
open("/proc/bus/pci/XX/YY") when the corresponding device is being
hot-removed.

Signed-off-by: Daisuke Ogino &lt;ogino.daisuke@jp.fujitsu.com&gt;
Cc: Jesse Barnes &lt;jbarnes@virtuousgeek.org&gt;
Acked-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ns: proc files for namespace naming policy.</title>
<updated>2011-05-10T21:31:44+00:00</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2010-03-08T00:41:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6b4e306aa3dc94a0545eb9279475b1ab6209a31f'/>
<id>6b4e306aa3dc94a0545eb9279475b1ab6209a31f</id>
<content type='text'>
Create files under /proc/&lt;pid&gt;/ns/ to allow controlling the
namespaces of a process.

This addresses three specific problems that can make namespaces hard to
work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the child
  of the original creator.
- Namespaces don't have names that userspace can use to talk about
  them.

The namespace files under /proc/&lt;pid&gt;/ns/ can be opened and the
file descriptor can be used to talk about a specific namespace, and
to keep the specified namespace alive.

A namespace can be kept alive by either holding the file descriptor
open or bind mounting the file someplace else.  aka:
mount --bind /proc/self/ns/net /some/filesystem/path
mount --bind /proc/self/fd/&lt;N&gt; /some/filesystem/path

This allows namespaces to be named with userspace policy.

It requires additional support to make use of these filedescriptors
and that will be comming in the following patches.

Acked-by: Daniel Lezcano &lt;daniel.lezcano@free.fr&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Create files under /proc/&lt;pid&gt;/ns/ to allow controlling the
namespaces of a process.

This addresses three specific problems that can make namespaces hard to
work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the child
  of the original creator.
- Namespaces don't have names that userspace can use to talk about
  them.

The namespace files under /proc/&lt;pid&gt;/ns/ can be opened and the
file descriptor can be used to talk about a specific namespace, and
to keep the specified namespace alive.

A namespace can be kept alive by either holding the file descriptor
open or bind mounting the file someplace else.  aka:
mount --bind /proc/self/ns/net /some/filesystem/path
mount --bind /proc/self/fd/&lt;N&gt; /some/filesystem/path

This allows namespaces to be named with userspace policy.

It requires additional support to make use of these filedescriptors
and that will be comming in the following patches.

Acked-by: Daniel Lezcano &lt;daniel.lezcano@free.fr&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>procfs: kill the global proc_mnt variable</title>
<updated>2011-03-24T02:46:58+00:00</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2011-03-23T23:43:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=52e9fc76d0d4b1e8adeee736172c6c23180059b2'/>
<id>52e9fc76d0d4b1e8adeee736172c6c23180059b2</id>
<content type='text'>
After the previous cleanup in proc_get_sb() the global proc_mnt has no
reasons to exists, kill it.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Daniel Lezcano &lt;daniel.lezcano@free.fr&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Acked-by: Serge E. Hallyn &lt;serge@hallyn.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
After the previous cleanup in proc_get_sb() the global proc_mnt has no
reasons to exists, kill it.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Daniel Lezcano &lt;daniel.lezcano@free.fr&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Acked-by: Serge E. Hallyn &lt;serge@hallyn.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>unfuck proc_sysctl -&gt;d_compare()</title>
<updated>2011-03-08T07:22:27+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2011-03-08T06:25:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=dfef6dcd35cb4a251f6322ca9b2c06f0bb1aa1f4'/>
<id>dfef6dcd35cb4a251f6322ca9b2c06f0bb1aa1f4</id>
<content type='text'>
a) struct inode is not going to be freed under -&gt;d_compare();
however, the thing PROC_I(inode)-&gt;sysctl points to just might.
Fortunately, it's enough to make freeing that sucker delayed,
provided that we don't step on its -&gt;unregistering, clear
the pointer to it in PROC_I(inode) before dropping the reference
and check if it's NULL in -&gt;d_compare().

b) I'm not sure that we *can* walk into NULL inode here (we recheck
dentry-&gt;seq between verifying that it's still hashed / fetching
dentry-&gt;d_inode and passing it to -&gt;d_compare() and there's no
negative hashed dentries in /proc/sys/*), but if we can walk into
that, we really should not have -&gt;d_compare() return 0 on it!
Said that, I really suspect that this check can be simply killed.
Nick?

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
a) struct inode is not going to be freed under -&gt;d_compare();
however, the thing PROC_I(inode)-&gt;sysctl points to just might.
Fortunately, it's enough to make freeing that sucker delayed,
provided that we don't step on its -&gt;unregistering, clear
the pointer to it in PROC_I(inode) before dropping the reference
and check if it's NULL in -&gt;d_compare().

b) I'm not sure that we *can* walk into NULL inode here (we recheck
dentry-&gt;seq between verifying that it's still hashed / fetching
dentry-&gt;d_inode and passing it to -&gt;d_compare() and there's no
negative hashed dentries in /proc/sys/*), but if we can walk into
that, we really should not have -&gt;d_compare() return 0 on it!
Said that, I really suspect that this check can be simply killed.
Nick?

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>proc: -&gt;low_ino cleanup</title>
<updated>2011-01-13T16:03:16+00:00</updated>
<author>
<name>Alexey Dobriyan</name>
<email>adobriyan@gmail.com</email>
</author>
<published>2011-01-13T01:00:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6d1b6e4eff89475785f60fa00f65da780f869f36'/>
<id>6d1b6e4eff89475785f60fa00f65da780f869f36</id>
<content type='text'>
- -&gt;low_ino is write-once field -- reading it under locks is unnecessary.

- /proc/$PID stuff never reaches pde_put()/free_proc_entry() --
   PROC_DYNAMIC_FIRST check never triggers.

- in proc_get_inode(), inode number always matches proc dir entry, so
  save one parameter.

Signed-off-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
- -&gt;low_ino is write-once field -- reading it under locks is unnecessary.

- /proc/$PID stuff never reaches pde_put()/free_proc_entry() --
   PROC_DYNAMIC_FIRST check never triggers.

- in proc_get_inode(), inode number always matches proc dir entry, so
  save one parameter.

Signed-off-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: icache RCU free inodes</title>
<updated>2011-01-07T06:50:26+00:00</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@kernel.dk</email>
</author>
<published>2011-01-07T06:49:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fa0d7e3de6d6fc5004ad9dea0dd6b286af8f03e9'/>
<id>fa0d7e3de6d6fc5004ad9dea0dd6b286af8f03e9</id>
<content type='text'>
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page-&gt;mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin &lt;npiggin@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page-&gt;mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin &lt;npiggin@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
</feed>
