linux-toradex.git/kernel/fork.c, branch v4.13-rc2

fork,random: use get_random_canary() to set tsk->stack_canary

2017-07-12T23:26:03+00:00

Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.

Inspired by execshield ascii-armor and Daniel Micay's linux-hardened
tree.

Link: http://lkml.kernel.org/r/20170524155751.424-3-riel@redhat.com
Signed-off-by: Rik van Riel 
Acked-by: Kees Cook 
Cc: Daniel Micay 
Cc: "Theodore Ts'o" 
Cc: H. Peter Anvin 
Cc: Andy Lutomirski 
Cc: Ingo Molnar 
Cc: Catalin Marinas 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fault-inject: support systematic fault injection

2017-07-12T23:26:01+00:00

Add /proc/self/task//fail-nth file that allows failing
0-th, 1-st, 2-nd and so on calls systematically.
Excerpt from the added documentation:

 "Write to this file of integer N makes N-th call in the current task
  fail (N is 0-based). Read from this file returns a single char 'Y' or
  'N' that says if the fault setup with a previous write to this file
  was injected or not, and disables the fault if it wasn't yet injected.
  Note that this file enables all types of faults (slab, futex, etc).
  This setting takes precedence over all other generic settings like
  probability, interval, times, etc. But per-capability settings (e.g.
  fail_futex/ignore-private) take precedence over it. This feature is
  intended for systematic testing of faults in a single system call. See
  an example below"

Why add a new setting:
1. Existing settings are global rather than per-task.
   So parallel testing is not possible.
2. attr->interval is close but it depends on attr->count
   which is non reset to 0, so interval does not work as expected.
3. Trying to model this with existing settings requires manipulations
   of all of probability, interval, times, space, task-filter and
   unexposed count and per-task make-it-fail files.
4. Existing settings are per-failure-type, and the set of failure
   types is potentially expanding.
5. make-it-fail can't be changed by unprivileged user and aggressive
   stress testing better be done from an unprivileged user.
   Similarly, this would require opening the debugfs files to the
   unprivileged user, as he would need to reopen at least times file
   (not possible to pre-open before dropping privs).

The proposed interface solves all of the above (see the example).

We want to integrate this into syzkaller fuzzer.  A prototype has found
10 bugs in kernel in first day of usage:

  https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance

I've made the current interface work with all types of our sandboxes.
For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
make /proc entries non-root owned.  So I am fine with the current
version of the code.

[akpm@linux-foundation.org: fix build]
Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov 
Cc: Akinobu Mita 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kernel/fork.c: virtually mapped stacks: do not disable interrupts

2017-07-12T23:25:59+00:00

The reason to disable interrupts seems to be to avoid switching to a
different processor while handling per cpu data using individual loads and
stores.  If we use per cpu RMV primitives we will not have to disable
interrupts.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705171055130.5898@east.gentwo.org
Signed-off-by: Christoph Lameter 
Cc: Andy Lutomirski 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2017-07-09T17:52:16+00:00

Pull scheduler fixes from Thomas Gleixner:
 "This scheduler update provides:

   - The (hopefully) final fix for the vtime accounting issues which
     were around for quite some time

   - Use types known to user space in UAPI headers to unbreak user space
     builds

   - Make load balancing respect the current scheduling domain again
     instead of evaluating unrelated CPUs"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/headers/uapi: Fix linux/sched/types.h userspace compilation errors
  sched/fair: Fix load_balance() affinity redo path
  sched/cputime: Accumulate vtime on top of nsec clocksource
  sched/cputime: Move the vtime task fields to their own struct
  sched/cputime: Rename vtime fields
  sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime
  vtime, sched/cputime: Remove vtime_account_user()
  Revert "sched/cputime: Refactor the cputime_adjust() code"

mm: memcontrol: use generic mod_memcg_page_state for kmem pages

2017-07-06T23:24:35+00:00

The kmem-specific functions do the same thing.  Switch and drop.

Link: http://lkml.kernel.org/r/20170530181724.27197-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner 
Acked-by: Vladimir Davydov 
Cc: Josef Bacik 
Cc: Michal Hocko 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

sched/cputime: Move the vtime task fields to their own struct

2017-07-05T07:54:15+00:00

We are about to add vtime accumulation fields to the task struct. Let's
avoid more bloatification and gather vtime information to their own
struct.

Tested-by: Luiz Capitulino 
Signed-off-by: Frederic Weisbecker 
Reviewed-by: Thomas Gleixner 
Acked-by: Rik van Riel 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1498756511-11714-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

sched/cputime: Rename vtime fields

2017-07-05T07:54:14+00:00

The current "snapshot" based naming on vtime fields suggests we record
some past event but that's a low level picture of their actual purpose
which comes out blurry. The real point of these fields is to run a basic
state machine that tracks down cputime entry while switching between
contexts.

So lets reflect that with more meaningful names.

Tested-by: Luiz Capitulino 
Signed-off-by: Frederic Weisbecker 
Reviewed-by: Thomas Gleixner 
Acked-by: Rik van Riel 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Wanpeng Li 
Link: http://lkml.kernel.org/r/1498756511-11714-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2017-05-27T15:52:27+00:00

Pull kthread fix from Thomas Gleixner:
 "A single fix which prevents a use after free when kthread fork fails"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kthread: Fix use-after-free if kthread fork fails

kthread: Fix use-after-free if kthread fork fails

2017-05-22T20:21:16+00:00

If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
fails in copy_process() between calling dup_task_struct() and setting
p->set_child_tid, then the value of p->set_child_tid will be inherited
from the parent and get prematurely freed by free_kthread_struct().

    kthread()
     - worker_thread()
        - process_one_work()
        |  - call_usermodehelper_exec_work()
        |     - kernel_thread()
        |        - _do_fork()
        |           - copy_process()
        |              - dup_task_struct()
        |                 - arch_dup_task_struct()
        |                    - tsk->set_child_tid = current->set_child_tid // implied
        |              - ...
        |              - goto bad_fork_*
        |              - ...
        |              - free_task(tsk)
        |                 - free_kthread_struct(tsk)
        |                    - kfree(tsk->set_child_tid)
        - ...
        - schedule()
           - __schedule()
              - wq_worker_sleeping()
                 - kthread_data(task)->flags // UAF

The problem started showing up with commit 1da5c46fa965 since it reused
->set_child_tid for the kthread worker data.

A better long-term solution might be to get rid of the ->set_child_tid
abuse. The comment in set_kthread_struct() also looks slightly wrong.

Debugged-by: Jamie Iles 
Fixes: 1da5c46fa965 ("kthread: Make struct kthread kmalloc'ed")
Signed-off-by: Vegard Nossum 
Acked-by: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Greg Kroah-Hartman 
Cc: Andy Lutomirski 
Cc: Frederic Weisbecker 
Cc: Jamie Iles 
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com
Signed-off-by: Thomas Gleixner

pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()

2017-05-13T22:26:02+00:00

Imagine we have a pid namespace and a task from its parent's pid_ns,
which made setns() to the pid namespace. The task is doing fork(),
while the pid namespace's child reaper is dying. We have the race
between them:

Task from parent pid_ns             Child reaper
copy_process()                      ..
  alloc_pid()                       ..
  ..                                zap_pid_ns_processes()
  ..                                  disable_pid_allocation()
  ..                                  read_lock(&tasklist_lock)
  ..                                  iterate over pids in pid_ns
  ..                                    kill tasks linked to pids
  ..                                  read_unlock(&tasklist_lock)
  write_lock_irq(&tasklist_lock);   ..
  attach_pid(p, PIDTYPE_PID);       ..
  ..                                ..

So, just created task p won't receive SIGKILL signal,
and the pid namespace will be in contradictory state.
Only manual kill will help there, but does the userspace
care about this? I suppose, the most users just inject
a task into a pid namespace and wait a SIGCHLD from it.

The patch fixes the problem. It simply checks for
(pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process().
We do it under the tasklist_lock, and can't skip
PIDNS_HASH_ADDING as noted by Oleg:

"zap_pid_ns_processes() does disable_pid_allocation()
and then takes tasklist_lock to kill the whole namespace.
Given that copy_process() checks PIDNS_HASH_ADDING
under write_lock(tasklist) they can't race;
if copy_process() takes this lock first, the new child will
be killed, otherwise copy_process() can't miss
the change in ->nr_hashed."

If allocation is disabled, we just return -ENOMEM
like it's made for such cases in alloc_pid().

v2: Do not move disable_pid_allocation(), do not
introduce a new variable in copy_process() and simplify
the patch as suggested by Oleg Nesterov.
Account the problem with double irq enabling
found by Eric W. Biederman.

Fixes: c876ad768215 ("pidns: Stop pid allocation when init dies")
Signed-off-by: Kirill Tkhai 
CC: Andrew Morton 
CC: Ingo Molnar 
CC: Peter Zijlstra 
CC: Oleg Nesterov 
CC: Mike Rapoport 
CC: Michal Hocko 
CC: Andy Lutomirski 
CC: "Eric W. Biederman" 
CC: Andrei Vagin 
CC: Cyrill Gorcunov 
CC: Serge Hallyn 
Cc: stable@vger.kernel.org
Acked-by: Oleg Nesterov 
Signed-off-by: Eric W. Biederman