<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/linux/cred.h, branch v6.4-rc1</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>x86: Mark __invalid_creds() __noreturn</title>
<updated>2022-03-15T09:32:44+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2022-03-14T16:58:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=105cd68596392cfe15056a891b0723609dcad247'/>
<id>105cd68596392cfe15056a891b0723609dcad247</id>
<content type='text'>
vmlinux.o: warning: objtool: ksys_unshare()+0x36c: unreachable instruction

0000 0000000000067040 &lt;ksys_unshare&gt;:
...
0364    673a4:	4c 89 ef             	mov    %r13,%rdi
0367    673a7:	e8 00 00 00 00       	call   673ac &lt;ksys_unshare+0x36c&gt;	673a8: R_X86_64_PLT32	__invalid_creds-0x4
036c    673ac:	e9 28 ff ff ff       	jmp    672d9 &lt;ksys_unshare+0x299&gt;
0371    673b1:	41 bc f4 ff ff ff    	mov    $0xfffffff4,%r12d
0377    673b7:	e9 80 fd ff ff       	jmp    6713c &lt;ksys_unshare+0xfc&gt;

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/Yi9gOW9f1GGwwUD6@hirez.programming.kicks-ass.net
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
vmlinux.o: warning: objtool: ksys_unshare()+0x36c: unreachable instruction

0000 0000000000067040 &lt;ksys_unshare&gt;:
...
0364    673a4:	4c 89 ef             	mov    %r13,%rdi
0367    673a7:	e8 00 00 00 00       	call   673ac &lt;ksys_unshare+0x36c&gt;	673a8: R_X86_64_PLT32	__invalid_creds-0x4
036c    673ac:	e9 28 ff ff ff       	jmp    672d9 &lt;ksys_unshare+0x299&gt;
0371    673b1:	41 bc f4 ff ff ff    	mov    $0xfffffff4,%r12d
0377    673b7:	e9 80 fd ff ff       	jmp    6713c &lt;ksys_unshare+0xfc&gt;

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/Yi9gOW9f1GGwwUD6@hirez.programming.kicks-ass.net
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace</title>
<updated>2021-06-29T03:39:26+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-06-29T03:39:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c54b245d011855ea91c5beff07f1db74143ce614'/>
<id>c54b245d011855ea91c5beff07f1db74143ce614</id>
<content type='text'>
Pull user namespace rlimit handling update from Eric Biederman:
 "This is the work mainly by Alexey Gladkov to limit rlimits to the
  rlimits of the user that created a user namespace, and to allow users
  to have stricter limits on the resources created within a user
  namespace."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  cred: add missing return error code when set_cred_ucounts() failed
  ucounts: Silence warning in dec_rlimit_ucounts
  ucounts: Set ucount_max to the largest positive value the type can hold
  kselftests: Add test to check for rlimit changes in different user namespaces
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_NPROC on top of ucounts
  Use atomic_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Increase size of ucounts to atomic_long_t
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull user namespace rlimit handling update from Eric Biederman:
 "This is the work mainly by Alexey Gladkov to limit rlimits to the
  rlimits of the user that created a user namespace, and to allow users
  to have stricter limits on the resources created within a user
  namespace."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  cred: add missing return error code when set_cred_ucounts() failed
  ucounts: Silence warning in dec_rlimit_ucounts
  ucounts: Set ucount_max to the largest positive value the type can hold
  kselftests: Add test to check for rlimit changes in different user namespaces
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_NPROC on top of ucounts
  Use atomic_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Increase size of ucounts to atomic_long_t
</pre>
</div>
</content>
</entry>
<entry>
<title>kernel/cred.c: make init_groups static</title>
<updated>2021-05-07T02:24:11+00:00</updated>
<author>
<name>Rasmus Villemoes</name>
<email>linux@rasmusvillemoes.dk</email>
</author>
<published>2021-05-07T01:02:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=32c93976ac2ee7ecb4b09cc032efe1445d37bd7e'/>
<id>32c93976ac2ee7ecb4b09cc032efe1445d37bd7e</id>
<content type='text'>
init_groups is declared in both cred.h and init_task.h, but it is not
actually referenced anywhere outside of cred.c where it is defined.  So
make it static and remove the declarations.

Link: https://lkml.kernel.org/r/20210310220102.2484201-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
init_groups is declared in both cred.h and init_task.h, but it is not
actually referenced anywhere outside of cred.c where it is defined.  So
make it static and remove the declarations.

Link: https://lkml.kernel.org/r/20210310220102.2484201-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Reimplement RLIMIT_NPROC on top of ucounts</title>
<updated>2021-04-30T19:14:01+00:00</updated>
<author>
<name>Alexey Gladkov</name>
<email>legion@kernel.org</email>
</author>
<published>2021-04-22T12:27:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=21d1c5e386bc751f1953b371d72cd5b7d9c9e270'/>
<id>21d1c5e386bc751f1953b371d72cd5b7d9c9e270</id>
<content type='text'>
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never fork the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
overlimited by root or if the user has the appropriate capability.

Changelog

v11:
* Change inc_rlimit_ucounts() which now returns top value of ucounts.
* Drop inc_rlimit_ucounts_and_test() because the return code of
  inc_rlimit_ucounts() can be checked.

Signed-off-by: Alexey Gladkov &lt;legion@kernel.org&gt;
Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never fork the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
overlimited by root or if the user has the appropriate capability.

Changelog

v11:
* Change inc_rlimit_ucounts() which now returns top value of ucounts.
* Drop inc_rlimit_ucounts_and_test() because the return code of
  inc_rlimit_ucounts() can be checked.

Signed-off-by: Alexey Gladkov &lt;legion@kernel.org&gt;
Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Add a reference to ucounts for each cred</title>
<updated>2021-04-30T19:14:00+00:00</updated>
<author>
<name>Alexey Gladkov</name>
<email>legion@kernel.org</email>
</author>
<published>2021-04-22T12:27:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=905ae01c4ae2ae3df05bb141801b1db4b7d83c61'/>
<id>905ae01c4ae2ae3df05bb141801b1db4b7d83c61</id>
<content type='text'>
For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in the struct cred will
allow to track RLIMIT_NPROC not only for user in the system, but for
user in the user_namespace.

Updating ucounts may require memory allocation which may fail. So, we
cannot change cred.ucounts in the commit_creds() because this function
cannot fail and it should always return 0. For this reason, we modify
cred.ucounts before calling the commit_creds().

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
  error was caused by the fact that cred_alloc_blank() left the ucounts
  pointer empty.

Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Signed-off-by: Alexey Gladkov &lt;legion@kernel.org&gt;
Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in the struct cred will
allow to track RLIMIT_NPROC not only for user in the system, but for
user in the user_namespace.

Updating ucounts may require memory allocation which may fail. So, we
cannot change cred.ucounts in the commit_creds() because this function
cannot fail and it should always return 0. For this reason, we modify
cred.ucounts before calling the commit_creds().

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
  error was caused by the fact that cred_alloc_blank() left the ucounts
  pointer empty.

Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Signed-off-by: Alexey Gladkov &lt;legion@kernel.org&gt;
Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>lsm: separate security_task_getsecid() into subjective and objective variants</title>
<updated>2021-03-22T19:23:32+00:00</updated>
<author>
<name>Paul Moore</name>
<email>paul@paul-moore.com</email>
</author>
<published>2021-02-19T19:26:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4ebd7651bfc8992ba05b355a8036cb7fd0e8d7de'/>
<id>4ebd7651bfc8992ba05b355a8036cb7fd0e8d7de</id>
<content type='text'>
Of the three LSMs that implement the security_task_getsecid() LSM
hook, all three LSMs provide the task's objective security
credentials.  This turns out to be unfortunate as most of the hook's
callers seem to expect the task's subjective credentials, although
a small handful of callers do correctly expect the objective
credentials.

This patch is the first step towards fixing the problem: it splits
the existing security_task_getsecid() hook into two variants, one
for the subjective creds, one for the objective creds.

  void security_task_getsecid_subj(struct task_struct *p,
				   u32 *secid);
  void security_task_getsecid_obj(struct task_struct *p,
				  u32 *secid);

While this patch does fix all of the callers to use the correct
variant, in order to keep this patch focused on the callers and to
ease review, the LSMs continue to use the same implementation for
both hooks.  The net effect is that this patch should not change
the behavior of the kernel in any way, it will be up to the latter
LSM specific patches in this series to change the hook
implementations and return the correct credentials.

Acked-by: Mimi Zohar &lt;zohar@linux.ibm.com&gt; (IMA)
Acked-by: Casey Schaufler &lt;casey@schaufler-ca.com&gt;
Reviewed-by: Richard Guy Briggs &lt;rgb@redhat.com&gt;
Signed-off-by: Paul Moore &lt;paul@paul-moore.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Of the three LSMs that implement the security_task_getsecid() LSM
hook, all three LSMs provide the task's objective security
credentials.  This turns out to be unfortunate as most of the hook's
callers seem to expect the task's subjective credentials, although
a small handful of callers do correctly expect the objective
credentials.

This patch is the first step towards fixing the problem: it splits
the existing security_task_getsecid() hook into two variants, one
for the subjective creds, one for the objective creds.

  void security_task_getsecid_subj(struct task_struct *p,
				   u32 *secid);
  void security_task_getsecid_obj(struct task_struct *p,
				  u32 *secid);

While this patch does fix all of the callers to use the correct
variant, in order to keep this patch focused on the callers and to
ease review, the LSMs continue to use the same implementation for
both hooks.  The net effect is that this patch should not change
the behavior of the kernel in any way, it will be up to the latter
LSM specific patches in this series to change the hook
implementations and return the correct credentials.

Acked-by: Mimi Zohar &lt;zohar@linux.ibm.com&gt; (IMA)
Acked-by: Casey Schaufler &lt;casey@schaufler-ca.com&gt;
Reviewed-by: Richard Guy Briggs &lt;rgb@redhat.com&gt;
Signed-off-by: Paul Moore &lt;paul@paul-moore.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>groups: use flexible-array member in struct group_info</title>
<updated>2021-02-26T17:41:03+00:00</updated>
<author>
<name>Hubert Jasudowicz</name>
<email>hubert.jasudowicz@gmail.com</email>
</author>
<published>2021-02-26T01:21:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c1f26493ed7f363c63e0e9d91e50d4db26df6603'/>
<id>c1f26493ed7f363c63e0e9d91e50d4db26df6603</id>
<content type='text'>
Replace zero-size array with flexible array member, as recommended by
the docs.

Link: https://lkml.kernel.org/r/155995eed35c3c1bdcc56e69d8997c8e4c46740a.1611620846.git.hubert.jasudowicz@gmail.com
Signed-off-by: Hubert Jasudowicz &lt;hubert.jasudowicz@gmail.com&gt;
Cc: "Peter Zijlstra (Intel)" &lt;peterz@infradead.org&gt;
Cc: Micah Morton &lt;mortonm@chromium.org&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: Michael Kelley &lt;mikelley@microsoft.com&gt;
Cc: Thomas Cedeno &lt;thomascedeno@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Replace zero-size array with flexible array member, as recommended by
the docs.

Link: https://lkml.kernel.org/r/155995eed35c3c1bdcc56e69d8997c8e4c46740a.1611620846.git.hubert.jasudowicz@gmail.com
Signed-off-by: Hubert Jasudowicz &lt;hubert.jasudowicz@gmail.com&gt;
Cc: "Peter Zijlstra (Intel)" &lt;peterz@infradead.org&gt;
Cc: Micah Morton &lt;mortonm@chromium.org&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: Michael Kelley &lt;mikelley@microsoft.com&gt;
Cc: Thomas Cedeno &lt;thomascedeno@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'selinux-pr-20190917' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux</title>
<updated>2019-09-23T18:21:04+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-09-23T18:21:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5825a95fe92566ada2292a65de030850b5cff1da'/>
<id>5825a95fe92566ada2292a65de030850b5cff1da</id>
<content type='text'>
Pull selinux updates from Paul Moore:

 - Add LSM hooks, and SELinux access control hooks, for dnotify,
   fanotify, and inotify watches. This has been discussed with both the
   LSM and fs/notify folks and everybody is good with these new hooks.

 - The LSM stacking changes missed a few calls to current_security() in
   the SELinux code; we fix those and remove current_security() for
   good.

 - Improve our network object labeling cache so that we always return
   the object's label, even when under memory pressure. Previously we
   would return an error if we couldn't allocate a new cache entry, now
   we always return the label even if we can't create a new cache entry
   for it.

 - Convert the sidtab atomic_t counter to a normal u32 with
   READ/WRITE_ONCE() and memory barrier protection.

 - A few patches to policydb.c to clean things up (remove forward
   declarations, long lines, bad variable names, etc)

* tag 'selinux-pr-20190917' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  lsm: remove current_security()
  selinux: fix residual uses of current_security() for the SELinux blob
  selinux: avoid atomic_t usage in sidtab
  fanotify, inotify, dnotify, security: add security hook for fs notifications
  selinux: always return a secid from the network caches if we find one
  selinux: policydb - rename type_val_to_struct_array
  selinux: policydb - fix some checkpatch.pl warnings
  selinux: shuffle around policydb.c to get rid of forward declarations
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull selinux updates from Paul Moore:

 - Add LSM hooks, and SELinux access control hooks, for dnotify,
   fanotify, and inotify watches. This has been discussed with both the
   LSM and fs/notify folks and everybody is good with these new hooks.

 - The LSM stacking changes missed a few calls to current_security() in
   the SELinux code; we fix those and remove current_security() for
   good.

 - Improve our network object labeling cache so that we always return
   the object's label, even when under memory pressure. Previously we
   would return an error if we couldn't allocate a new cache entry, now
   we always return the label even if we can't create a new cache entry
   for it.

 - Convert the sidtab atomic_t counter to a normal u32 with
   READ/WRITE_ONCE() and memory barrier protection.

 - A few patches to policydb.c to clean things up (remove forward
   declarations, long lines, bad variable names, etc)

* tag 'selinux-pr-20190917' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  lsm: remove current_security()
  selinux: fix residual uses of current_security() for the SELinux blob
  selinux: avoid atomic_t usage in sidtab
  fanotify, inotify, dnotify, security: add security hook for fs notifications
  selinux: always return a secid from the network caches if we find one
  selinux: policydb - rename type_val_to_struct_array
  selinux: policydb - fix some checkpatch.pl warnings
  selinux: shuffle around policydb.c to get rid of forward declarations
</pre>
</div>
</content>
</entry>
<entry>
<title>lsm: remove current_security()</title>
<updated>2019-09-04T22:53:39+00:00</updated>
<author>
<name>Paul Moore</name>
<email>paul@paul-moore.com</email>
</author>
<published>2019-09-04T22:53:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=15322a0d90b6fd62ae8f22e5b87f735c3fdfeff7'/>
<id>15322a0d90b6fd62ae8f22e5b87f735c3fdfeff7</id>
<content type='text'>
There are no remaining callers and it really is unsafe in the brave
new world of LSM stacking.

Acked-by: James Morris &lt;jamorris@linux.microsoft.com&gt;
Signed-off-by: Paul Moore &lt;paul@paul-moore.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There are no remaining callers and it really is unsafe in the brave
new world of LSM stacking.

Acked-by: James Morris &lt;jamorris@linux.microsoft.com&gt;
Signed-off-by: Paul Moore &lt;paul@paul-moore.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>access: avoid the RCU grace period for the temporary subjective credentials</title>
<updated>2019-07-24T17:12:09+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-07-11T16:54:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d7852fbd0f0423937fa287a598bfde188bb68c22'/>
<id>d7852fbd0f0423937fa287a598bfde188bb68c22</id>
<content type='text'>
It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
work because it installs a temporary credential that gets allocated and
freed for each system call.

The allocation and freeing overhead is mostly benign, but because
credentials can be accessed under the RCU read lock, the freeing
involves a RCU grace period.

Which is not a huge deal normally, but if you have a lot of access()
calls, this causes a fair amount of seconday damage: instead of having a
nice alloc/free patterns that hits in hot per-CPU slab caches, you have
all those delayed free's, and on big machines with hundreds of cores,
the RCU overhead can end up being enormous.

But it turns out that all of this is entirely unnecessary.  Exactly
because access() only installs the credential as the thread-local
subjective credential, the temporary cred pointer doesn't actually need
to be RCU free'd at all.  Once we're done using it, we can just free it
synchronously and avoid all the RCU overhead.

So add a 'non_rcu' flag to 'struct cred', which can be set by users that
know they only use it in non-RCU context (there are other potential
users for this).  We can make it a union with the rcu freeing list head
that we need for the RCU case, so this doesn't need any extra storage.

Note that this also makes 'get_current_cred()' clear the new non_rcu
flag, in case we have filesystems that take a long-term reference to the
cred and then expect the RCU delayed freeing afterwards.  It's not
entirely clear that this is required, but it makes for clear semantics:
the subjective cred remains non-RCU as long as you only access it
synchronously using the thread-local accessors, but you _can_ use it as
a generic cred if you want to.

It is possible that we should just remove the whole RCU markings for
-&gt;cred entirely.  Only -&gt;real_cred is really supposed to be accessed
through RCU, and the long-term cred copies that nfs uses might want to
explicitly re-enable RCU freeing if required, rather than have
get_current_cred() do it implicitly.

But this is a "minimal semantic changes" change for the immediate
problem.

Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Paul E. McKenney &lt;paulmck@linux.ibm.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Jan Glauber &lt;jglauber@marvell.com&gt;
Cc: Jiri Kosina &lt;jikos@kernel.org&gt;
Cc: Jayachandran Chandrasekharan Nair &lt;jnair@marvell.com&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Miklos Szeredi &lt;miklos@szeredi.hu&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
work because it installs a temporary credential that gets allocated and
freed for each system call.

The allocation and freeing overhead is mostly benign, but because
credentials can be accessed under the RCU read lock, the freeing
involves a RCU grace period.

Which is not a huge deal normally, but if you have a lot of access()
calls, this causes a fair amount of seconday damage: instead of having a
nice alloc/free patterns that hits in hot per-CPU slab caches, you have
all those delayed free's, and on big machines with hundreds of cores,
the RCU overhead can end up being enormous.

But it turns out that all of this is entirely unnecessary.  Exactly
because access() only installs the credential as the thread-local
subjective credential, the temporary cred pointer doesn't actually need
to be RCU free'd at all.  Once we're done using it, we can just free it
synchronously and avoid all the RCU overhead.

So add a 'non_rcu' flag to 'struct cred', which can be set by users that
know they only use it in non-RCU context (there are other potential
users for this).  We can make it a union with the rcu freeing list head
that we need for the RCU case, so this doesn't need any extra storage.

Note that this also makes 'get_current_cred()' clear the new non_rcu
flag, in case we have filesystems that take a long-term reference to the
cred and then expect the RCU delayed freeing afterwards.  It's not
entirely clear that this is required, but it makes for clear semantics:
the subjective cred remains non-RCU as long as you only access it
synchronously using the thread-local accessors, but you _can_ use it as
a generic cred if you want to.

It is possible that we should just remove the whole RCU markings for
-&gt;cred entirely.  Only -&gt;real_cred is really supposed to be accessed
through RCU, and the long-term cred copies that nfs uses might want to
explicitly re-enable RCU freeing if required, rather than have
get_current_cred() do it implicitly.

But this is a "minimal semantic changes" change for the immediate
problem.

Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Paul E. McKenney &lt;paulmck@linux.ibm.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Jan Glauber &lt;jglauber@marvell.com&gt;
Cc: Jiri Kosina &lt;jikos@kernel.org&gt;
Cc: Jayachandran Chandrasekharan Nair &lt;jnair@marvell.com&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Miklos Szeredi &lt;miklos@szeredi.hu&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
