linux-toradex.git/kernel/fork.c, branch v4.7-rc3

mm: oom_reaper: remove some bloat

2016-05-26T22:35:44+00:00

mmput_async is currently used only from the oom_reaper which is defined
only for CONFIG_MMU.  We can save work_struct in mm_struct for
!CONFIG_MMU.

[akpm@linux-foundation.org: fix typo, per Minchan]
Link: http://lkml.kernel.org/r/20160520061658.GB19172@dhcp22.suse.cz
Reported-by: Minchan Kim 
Signed-off-by: Michal Hocko 
Acked-by: Minchan Kim 
Cc: Tetsuo Handa 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, fork: make dup_mmap wait for mmap_sem for write killable

2016-05-24T00:04:14+00:00

dup_mmap needs to lock current's mm mmap_sem for write.  If the waiting
task gets killed by the oom killer, it would block the oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolution.  Wait for the lock in killable mode and return with EINTR
if the task is killed while waiting.

Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Oleg Nesterov 
Cc: Konstantin Khlebnikov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kernel/fork.c: allocate idle task for a CPU always on its local node

2016-05-24T00:04:14+00:00

Linux preallocates the task structs of the idle tasks for all possible
CPUs.  This currently means they all end up on node 0.  This also
implies that the cache lines used for MWAIT, which are around the flags
field in the task struct, are all located in node 0.

We see a noticeable performance improvement on Knights Landing CPUs when
the cache lines used for MWAIT are located in the local nodes of the
CPUs using them.  I would expect this to give a (likely slight)
improvement on other systems too.

The patch places each idle task on the node of its CPU by passing the
right target node to copy_process().
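
The placement rule can be sketched in user space roughly as follows.
This is only an illustration: cpu_to_node_table, alloc_task_on_node(),
fork_task(), fork_idle_task() and the 4-CPU, 2-node topology are
hypothetical stand-ins, not the kernel's helpers.  Only NUMA_NO_NODE
(meaning "no preference", value -1) comes from the note below.

```c
#include <stddef.h>
#include <stdlib.h>

#define NUMA_NO_NODE	(-1)	/* "no node preference" */
#define NR_CPUS		4	/* hypothetical topology: 4 CPUs on 2 nodes */

/* Hypothetical CPU-to-node map standing in for real topology data. */
const int cpu_to_node_table[NR_CPUS] = { 0, 0, 1, 1 };

struct task {
	int node;	/* node the structure was (notionally) allocated on */
	int cpu;	/* CPU an idle task belongs to, or -1 */
};

/*
 * Stand-in allocator: it only records the requested node instead of
 * really allocating node-local memory.  NUMA_NO_NODE falls back to
 * the node of the caller.
 */
struct task *alloc_task_on_node(int node, int current_node)
{
	struct task *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	t->node = (node == NUMA_NO_NODE) ? current_node : node;
	t->cpu = -1;
	return t;
}

/* Ordinary fork: no preference, so the caller's node is used. */
struct task *fork_task(int current_node)
{
	return alloc_task_on_node(NUMA_NO_NODE, current_node);
}

/* Idle task: ask for the node of the CPU that will run it. */
struct task *fork_idle_task(int cpu, int current_node)
{
	struct task *t;

	if (cpu < 0 || cpu >= NR_CPUS)
		return NULL;
	t = alloc_task_on_node(cpu_to_node_table[cpu], current_node);
	if (t)
		t->cpu = cpu;
	return t;
}
```

With this table, an idle task created for CPU 3 by a caller running on
node 0 lands on node 1, while an ordinary fork stays on node 0.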

[akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1]
Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org
Signed-off-by: Andi Kleen 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fork: free thread in copy_process on failure

2016-05-21T00:58:30+00:00

When using this program (as root):

	#include <err.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#include <sys/io.h>
	#include <sys/types.h>
	#include <sys/wait.h>

	#define ITER 1000
	#define FORKERS 15
	#define THREADS (6000/FORKERS) // 1850 is proc max

	static void fork_100_wait()
	{
		unsigned a, to_wait = 0;

		printf("\t%d forking %d\n", THREADS, getpid());

		for (a = 0; a < THREADS; a++) {
			switch (fork()) {
			case 0:
				usleep(1000);
				exit(0);
				break;
			case -1:
				break;
			default:
				to_wait++;
				break;
			}
		}

		printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(),
				to_wait);

		for (a = 0; a < to_wait; a++)
			wait(NULL);

		printf("\t%d waited from %d\n", THREADS, getpid());
	}

	static void run_forkers()
	{
		pid_t forkers[FORKERS];
		unsigned a;

		for (a = 0; a < FORKERS; a++) {
			switch ((forkers[a] = fork())) {
			case 0:
				fork_100_wait();
				exit(0);
				break;
			case -1:
				err(1, "DIE fork of %d'th forker", a);
				break;
			default:
				break;
			}
		}

		for (a = 0; a < FORKERS; a++)
			waitpid(forkers[a], NULL, 0);
	}

	int main()
	{
		unsigned a;
		int ret;

		ret = ioperm(10, 20, 0);
		if (ret < 0)
			err(1, "ioperm");

		for (a = 0; a < ITER; a++)
			run_forkers();

		return 0;
	}

kmemleak reports many occurrences of this leak:
unreferenced object 0xffff8805917c8000 (size 8192):
  comm "fork-leak", pid 2932, jiffies 4295354292 (age 1871.028s)
  hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  backtrace:
    [] kmemdup+0x25/0x50
    [] copy_thread_tls+0x6c3/0x9a0
    [] copy_process+0x1a84/0x5790
    [] wake_up_new_task+0x2d5/0x6f0
    [] _do_fork+0x12d/0x820
...

The leaked objects are memory that should have been freed in
arch/x86/kernel/process.c:exit_thread().

Make sure the memory is freed when fork fails later in copy_process.
This is done by calling exit_thread with the thread to kill.
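
The shape of the fix, unwinding per-thread state when a later step of
process creation fails, can be sketched in user space like this.  All
names here (copy_thread_state(), exit_thread_state(),
copy_process_sketch(), the fail_late flag and the allocation counter)
are hypothetical stand-ins for illustration; this is not the kernel
code.

```c
#include <stdlib.h>
#include <string.h>

int live_thread_allocs;		/* outstanding per-thread buffers */

struct task {
	unsigned char *io_bitmap;	/* per-thread state copied at fork */
};

/* Stand-in for the arch copy step: duplicates the parent's buffer. */
int copy_thread_state(struct task *child, const unsigned char *src,
		      size_t len)
{
	child->io_bitmap = malloc(len);
	if (!child->io_bitmap)
		return -1;
	memcpy(child->io_bitmap, src, len);
	live_thread_allocs++;
	return 0;
}

/* Stand-in for exit_thread(tsk): frees the state of the given task. */
void exit_thread_state(struct task *tsk)
{
	if (tsk->io_bitmap) {
		free(tsk->io_bitmap);
		tsk->io_bitmap = NULL;
		live_thread_allocs--;
	}
}

/* fail_late simulates a step after the thread copy failing. */
int copy_process_sketch(struct task *child, const unsigned char *src,
			size_t len, int fail_late)
{
	int ret;

	child->io_bitmap = NULL;
	ret = copy_thread_state(child, src, len);
	if (ret)
		goto out;

	if (fail_late) {
		ret = -1;
		goto cleanup_thread;
	}
	return 0;

cleanup_thread:
	/* Without this call the copied buffer would be leaked. */
	exit_thread_state(child);
out:
	return ret;
}
```

The cleanup takes the task to clean up as an argument because the
child never ran, so there is no "current" thread to tear down.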

Signed-off-by: Jiri Slaby 
Cc: "David S. Miller" 
Cc: "H. Peter Anvin" 
Cc: "James E.J. Bottomley" 
Cc: Aurelien Jacquiot 
Cc: Benjamin Herrenschmidt 
Cc: Catalin Marinas 
Cc: Chen Liqin 
Cc: Chris Metcalf 
Cc: Chris Zankel 
Cc: David Howells 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: Guan Xuetao 
Cc: Haavard Skinnemoen 
Cc: Hans-Christian Egtvedt 
Cc: Heiko Carstens 
Cc: Helge Deller 
Cc: Ingo Molnar 
Cc: Ivan Kokshaysky 
Cc: James Hogan 
Cc: Jeff Dike 
Cc: Jesper Nilsson 
Cc: Jiri Slaby 
Cc: Jonas Bonn 
Cc: Koichi Yasutake 
Cc: Lennox Wu 
Cc: Ley Foon Tan 
Cc: Mark Salter 
Cc: Martin Schwidefsky 
Cc: Matt Turner 
Cc: Max Filippov 
Cc: Michael Ellerman 
Cc: Michal Simek 
Cc: Mikael Starvik 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Ralf Baechle 
Cc: Rich Felker 
Cc: Richard Henderson 
Cc: Richard Kuo 
Cc: Richard Weinberger 
Cc: Russell King 
Cc: Steven Miao 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vineet Gupta 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom_reaper: do not mmput synchronously from the oom reaper context

2016-05-21T00:58:30+00:00

Tetsuo has correctly noted that the mmput slow path might block waiting
for another party (e.g.  exit_aio waits for an IO).  If that happens, the
oom_reaper would be stalled and unable to process the next oom victim.
We should strive to make this context as reliable and as independent of
other subsystems as possible.

Introduce mmput_async, which performs the slow path from an async
(WQ) context.  This delays the operation, but that shouldn't be a
problem because in most cases the oom_reaper has already reclaimed as
much of the victim's address space as possible, and the remaining
context shouldn't pin much memory anymore.  The only exception is when
the mmap_sem trylock has failed, which shouldn't happen too often.
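
The idea of dropping the last reference without running the slow path
in the caller's context can be sketched as below.  This is a
single-threaded user-space model with hypothetical names (struct mm,
work_list, mmput_slow(), mmput_async_sketch(), run_deferred_work());
the kernel uses an atomic counter and a real workqueue.

```c
#include <stddef.h>

struct mm {
	int users;		/* reference count */
	int torn_down;		/* set once the slow path has run */
	struct mm *next_work;	/* link in the deferred-work list */
};

struct mm *work_list;		/* hypothetical stand-in for a workqueue */

/* Slow path: in the kernel this may block, e.g. waiting for IO. */
void mmput_slow(struct mm *mm)
{
	mm->torn_down = 1;
}

/*
 * Drop a reference.  If it was the last one, queue the slow path
 * instead of running it here, so the caller can never block in it.
 */
void mmput_async_sketch(struct mm *mm)
{
	if (--mm->users == 0) {
		mm->next_work = work_list;
		work_list = mm;
	}
}

/* Worker context: runs all queued slow paths, returns how many. */
int run_deferred_work(void)
{
	int n = 0;

	while (work_list) {
		struct mm *mm = work_list;

		work_list = mm->next_work;
		mm->next_work = NULL;
		mmput_slow(mm);
		n++;
	}
	return n;
}
```

The caller returns immediately after the final put; the teardown only
happens once the worker runs, which is the delay described above.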

The issue is only theoretical but not impossible.

Signed-off-by: Michal Hocko 
Reported-by: Tetsuo Handa 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

signals/sigaltstack: Implement SS_AUTODISARM flag

2016-05-03T06:37:59+00:00

This patch implements the SS_AUTODISARM flag that can be OR-ed with
SS_ONSTACK when forming ss_flags.

When this flag is set, sigaltstack will be disabled when entering
the signal handler; more precisely, after saving sas to uc_stack.
When leaving the signal handler, the sigaltstack is restored by
uc_stack.

When this flag is used, it is safe to switch from sighandler with
swapcontext(). Without this flag, the subsequent signal will corrupt
the state of the switched-away sighandler.

To detect the support of this functionality, one can do:

  err = sigaltstack(SS_DISABLE | SS_AUTODISARM);
  if (err && errno == EINVAL)
	unsupported();

Signed-off-by: Stas Sergeev 
Cc: Al Viro 
Cc: Aleksa Sarai 
Cc: Amanieu d'Antras 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: Eric W. Biederman 
Cc: Frederic Weisbecker 
Cc: H. Peter Anvin 
Cc: Heinrich Schuchardt 
Cc: Jason Low 
Cc: Josh Triplett 
Cc: Konstantin Khlebnikov 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Palmer Dabbelt 
Cc: Paul Moore 
Cc: Pavel Emelyanov 
Cc: Peter Zijlstra 
Cc: Richard Weinberger 
Cc: Sasha Levin 
Cc: Shuah Khan 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: Vladimir Davydov 
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1460665206-13646-4-git-send-email-stsp@list.ru
Signed-off-by: Ingo Molnar

kernel: add kcov code coverage

2016-03-22T22:36:02+00:00

kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing).  Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system.  A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/).  However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.

kcov does not aim to collect as much coverage as possible.  It aims to
collect more or less stable coverage that is a function of syscall
inputs.  To achieve this goal it does not collect coverage in soft/hard
interrupts, and instrumentation of some inherently non-deterministic or
non-interesting parts of the kernel is disabled (e.g.  scheduler, locking).

Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes.  Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch).  I've
dropped the second mode for simplicity.

This patch adds the necessary support on the kernel side.  The
complementary compiler support was added in gcc revision 231296.

We've used this support to build the syzkaller system call fuzzer, which
has found 90 kernel bugs in just 2 months:

  https://github.com/google/syzkaller/wiki/Found-Bugs

We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation".  For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.

Why not gcov?  A typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
typical coverage can be just a dozen basic blocks (e.g.  an invalid
input).  In such a context gcov becomes prohibitively expensive, as the
reset/collect coverage steps depend on the total number of basic
blocks/edges in the program (for the kernel it is about 2M).  The cost of
kcov depends only on the number of executed basic blocks/edges.  On top
of that, the kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.
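
One iteration of that reset/execute/collect loop, as seen from user
space, might look like the sketch below.  It assumes the tracing
interface is exposed as /sys/kernel/debug/kcov with the ioctl numbers
shown, and that the first word of the shared buffer holds the number of
recorded PCs; treat those details as assumptions and check them against
the kcov documentation of your kernel.  The function name and buffer
size are arbitrary.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed interface constants; verify against your kernel's docs. */
#define KCOV_INIT_TRACE	_IOR('c', 1, unsigned long)
#define KCOV_ENABLE	_IO('c', 100)
#define KCOV_DISABLE	_IO('c', 101)
#define COVER_SIZE	(64 << 10)	/* buffer size in unsigned longs */

/*
 * Traces a single (deliberately invalid) read() call and returns the
 * number of PCs recorded for it, or -1 if kcov is not available.
 */
long kcov_trace_one_syscall(void)
{
	const size_t bytes = COVER_SIZE * sizeof(unsigned long);
	unsigned long *cover;
	ssize_t ignored;
	long n = -1;
	int fd;

	fd = open("/sys/kernel/debug/kcov", O_RDWR);
	if (fd == -1)
		return -1;
	if (ioctl(fd, KCOV_INIT_TRACE, (unsigned long)COVER_SIZE))
		goto out_close;
	cover = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (cover == MAP_FAILED)
		goto out_close;
	if (ioctl(fd, KCOV_ENABLE, 0))
		goto out_unmap;

	/* (1) reset: cost is independent of kernel size */
	__atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);
	/* (2) execute a bit of code */
	ignored = read(-1, NULL, 0);
	(void)ignored;
	/* (3) collect: only PCs executed by this thread are recorded */
	n = (long)__atomic_load_n(&cover[0], __ATOMIC_RELAXED);

	ioctl(fd, KCOV_DISABLE, 0);
out_unmap:
	munmap(cover, bytes);
out_close:
	close(fd);
	return n;
}
```

Both the reset and the collect step touch a single word, so their cost
does not grow with the number of basic blocks in the kernel.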

kcov exposes kernel PCs and control flow to user-space which is
insecure.  But debugfs should not be mapped as user accessible.

Based on a patch by Quentin Casasnovas.

[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov 
Reviewed-by: Kees Cook 
Cc: syzkaller 
Cc: Vegard Nossum 
Cc: Catalin Marinas 
Cc: Tavis Ormandy 
Cc: Will Deacon 
Cc: Quentin Casasnovas 
Cc: Kostya Serebryany 
Cc: Eric Dumazet 
Cc: Alexander Potapenko 
Cc: Kees Cook 
Cc: Bjorn Helgaas 
Cc: Sasha Levin 
Cc: David Drysdale 
Cc: Ard Biesheuvel 
Cc: Andrey Ryabinin 
Cc: Kirill A. Shutemov 
Cc: Jiri Slaby 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2016-03-21T17:05:13+00:00

Pull cgroup namespace support from Tejun Heo:
 "These are changes to implement namespace support for cgroup which has
  been pending for quite some time now.  It is very straight-forward and
  only affects what part of cgroup hierarchies are visible.

  After unsharing, mounting a cgroup fs will be scoped to the cgroups
  the task belonged to at the time of unsharing and the cgroup paths
  exposed to userland would be adjusted accordingly"

* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix and restructure error handling in copy_cgroup_ns()
  cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
  Add FS_USERNS_FLAG to cgroup fs
  cgroup: Add documentation for cgroup namespaces
  cgroup: mount cgroupns-root when inside non-init cgroupns
  kernfs: define kernfs_node_dentry
  cgroup: cgroup namespace setns support
  cgroup: introduce cgroup namespaces
  sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
  kernfs: Add API to generate relative kernfs path

mm: memcontrol: report kernel stack usage in cgroup2 memory.stat

2016-03-17T22:09:34+00:00

Show how much memory is allocated to kernel stacks.

Signed-off-by: Vladimir Davydov 
Acked-by: Johannes Weiner 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

cgroup: introduce cgroup namespaces

2016-02-16T18:04:58+00:00

Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.
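
From user space, the 'unshare' path can be exercised with a sketch like
the one below.  The helper names are made up for illustration.  The raw
syscall is used only to keep the example independent of feature-test
macros; glibc's unshare() wrapper should be equivalent.  The fallback
value of CLONE_NEWCGROUP is an assumption for older headers, and the
call is expected to fail without sufficient privileges.

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
/* Assumed fallback; verify against the installed kernel headers. */
#define CLONE_NEWCGROUP	0x02000000
#endif

/* Copies the first line of /proc/self/cgroup into buf; 0 or -1. */
int read_first_cgroup_line(char *buf, size_t len)
{
	FILE *f = fopen("/proc/self/cgroup", "r");
	int ok;

	if (!f)
		return -1;
	ok = fgets(buf, (int)len, f) != NULL;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Moves the caller into a new cgroup namespace whose root is the
 * caller's current cgroup.  Returns 0 on success, -1 with errno set
 * on failure (e.g. EPERM, or EINVAL on kernels without support).
 */
int unshare_cgroup_ns(void)
{
	return (int)syscall(SYS_unshare, CLONE_NEWCGROUP);
}
```

Reading /proc/self/cgroup before and after a successful
unshare_cgroup_ns() should show the path change from the full system
path to one relative to the cgroupns-root.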

Signed-off-by: Aditya Kali 
Signed-off-by: Serge Hallyn 
Signed-off-by: Tejun Heo