linux-toradex.git/kernel, branch v3.18.23

genirq: Fix race in register_irq_proc()

2015-10-28T02:14:21+00:00

[ Upstream commit 95c2b17534654829db428f11bcf4297c059a2a7e ]

Per-IRQ directories in procfs are created only when a handler is first
added to the irqdesc, not when the irqdesc is created.  In the case of
a shared IRQ, multiple tasks can race to create a directory.  This
race condition seems to have been present forever, but is easier to
hit with async probing.

Signed-off-by: Ben Hutchings 
Link: http://lkml.kernel.org/r/1443266636.2004.2.camel@decadent.org.uk
Signed-off-by: Thomas Gleixner 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

sched/fair: Prevent throttling in early pick_next_task_fair()

2015-10-28T02:13:51+00:00

[ Upstream commit 54d27365cae88fbcc853b391dcd561e71acb81fa ]

The optimized task selection logic optimistically selects a new task
to run without first doing a full put_prev_task(). This is so that we
can avoid a put/set on the common ancestors of the old and new task.

Similarly, we should only call check_cfs_rq_runtime() to throttle
eligible groups if they're part of the common ancestry, otherwise it
is possible to end up with no eligible task in the simple task
selection.

Imagine:
		/root
	/prev		/next
	/A		/B

If our optimistic selection ends up throttling /next, we goto simple
and our put_prev_task() ends up throttling /prev, after which we're
going to bug out in set_next_entity() because there aren't any tasks
left.

Avoid this scenario by only throttling common ancestors.

Reported-by: Mohammed Naser 
Reported-by: Konstantin Khlebnikov 
Signed-off-by: Ben Segall 
[ munged Changelog ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Roman Gushchin 
Cc: Thomas Gleixner 
Cc: pjt@google.com
Fixes: 678d5718d8d0 ("sched/fair: Optimize cgroup pick_next_task_fair()")
Link: http://lkml.kernel.org/r/xm26wq1oswoq.fsf@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar 

Signed-off-by: Sasha Levin

sched/core: Fix TASK_DEAD race in finish_task_switch()

2015-10-28T02:13:14+00:00

[ Upstream commit 95913d97914f44db2b81271c2e2ebd4d2ac2df83 ]

So the problem this patch is trying to address is as follows:

        CPU0                            CPU1

        context_switch(A, B)
                                        ttwu(A)
                                          LOCK A->pi_lock
                                          A->on_cpu == 0
        finish_task_switch(A)
          prev_state = A->state  <-.
          WMB                      |
          A->on_cpu = 0;           |
          UNLOCK rq0->lock         |
                                   |    context_switch(C, A)
                                   `--  A->state = TASK_DEAD
          prev_state == TASK_DEAD
            put_task_struct(A)
                                        context_switch(A, C)
                                        finish_task_switch(A)
                                          A->state == TASK_DEAD
                                            put_task_struct(A)

The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.

Reported-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Linus Torvalds 
Cc:  # v3.1+
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

sched: access local runqueue directly in single_task_running

2015-10-28T02:12:59+00:00

[ Upstream commit 00cc1633816de8c95f337608a1ea64e228faf771 ]

Commit 2ee507c47293 ("sched: Add function single_task_running to let a task
check if it is the only task running on a cpu") referenced the current
runqueue with the smp_processor_id.  When CONFIG_DEBUG_PREEMPT is enabled,
that is only allowed if preemption is disabled or the currrent task is
bound to the local cpu (e.g. kernel worker).

With commit f78195129963 ("kvm: add halt_poll_ns module parameter") KVM
calls single_task_running. If CONFIG_DEBUG_PREEMPT is enabled that
generates a lot of kernel messages.

To avoid adding preemption in that cases, as it would limit the usefulness,
we change single_task_running to access directly the cpu local runqueue.

Cc: Tim Chen 
Suggested-by: Peter Zijlstra 
Acked-by: Peter Zijlstra (Intel) 
Cc: 
Fixes: 2ee507c472939db4b146d545352b8a7c79ef47f8
Signed-off-by: Dominik Dingel 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Sasha Levin

time: Fix timekeeping_freqadjust()'s incorrect use of abs() instead of abs64()

2015-10-28T02:12:51+00:00

[ Upstream commit 2619d7e9c92d524cb155ec89fd72875321512e5b ]

The internal clocksteering done for fine-grained error
correction uses a logarithmic approximation, so any time
adjtimex() adjusts the clock steering, timekeeping_freqadjust()
quickly approximates the correct clock frequency over a series
of ticks.

Unfortunately, the logic in timekeeping_freqadjust(), introduced
in commit:

  dc491596f639 ("timekeeping: Rework frequency adjustments to work better w/ nohz")

used the abs() function with a s64 error value to calculate the
size of the approximated adjustment to be made.

Per include/linux/kernel.h:

  "abs() should not be used for 64-bit types (s64, u64, long long) - use abs64()".

Thus on 32-bit platforms, this resulted in the clocksteering to
take a quite dampended random walk trying to converge on the
proper frequency, which caused the adjustments to be made much
slower then intended (most easily observed when large
adjustments are made).

This patch fixes the issue by using abs64() instead.

Reported-by: Nuno Gonçalves 
Tested-by: Nuno Goncalves 
Signed-off-by: John Stultz 
Cc:  # v3.17+
Cc: Linus Torvalds 
Cc: Miroslav Lichvar 
Cc: Peter Zijlstra 
Cc: Prarit Bhargava 
Cc: Richard Cochran 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1441840051-20244-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

unshare: Unsharing a thread does not require unsharing a vm

2015-10-07T14:03:47+00:00

[ Upstream commit 12c641ab8270f787dfcce08b5f20ce8b65008096 ]

In the logic in the initial commit of unshare made creating a new
thread group for a process, contingent upon creating a new memory
address space for that process.  That is wrong.  Two separate
processes in different thread groups can share a memory address space
and clone allows creation of such proceses.

This is significant because it was observed that mm_users > 1 does not
mean that a process is multi-threaded, as reading /proc/PID/maps
temporarily increments mm_users, which allows other processes to
(accidentally) interfere with unshare() calls.

Correct the check in check_unshare_flags() to test for
!thread_group_empty() for CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM.
For sighand->count > 1 for CLONE_SIGHAND and CLONE_VM.
For !current_is_single_threaded instead of mm_users > 1 for CLONE_VM.

By using the correct checks in unshare this removes the possibility of
an accidental denial of service attack.

Additionally using the correct checks in unshare ensures that only an
explicit unshare(CLONE_VM) can possibly trigger the slow path of
current_is_single_threaded().  As an explict unshare(CLONE_VM) is
pointless it is not expected there are many applications that make
that call.

Cc: stable@vger.kernel.org
Fixes: b2e0d98705e60e45bbb3c0032c48824ad7ae0704 userns: Implement unshare of the user namespace
Reported-by: Ricky Zhou 
Reported-by: Kees Cook 
Reviewed-by: Kees Cook 
Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Sasha Levin

cpuset: use trialcs->mems_allowed as a temp variable

2015-09-17T05:32:04+00:00

[ Upstream commit 24ee3cf89bef04e8bc23788aca4e029a3f0f06d9 ]

The comment says it's using trialcs->mems_allowed as a temp variable but
it didn't match the code. Change the code to match the comment.

This fixes an issue when writing in cpuset.mems when a sub-directory
exists: we need to write several times for the information to persist:

| root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
| root@alban:/sys/fs/cgroup/cpuset# cd footest9
| root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9#

This should help to fix the following issue in Docker:
https://github.com/opencontainers/runc/issues/133
In some conditions, a Docker container needs to be started twice in
order to work.

Signed-off-by: Alban Crequy 
Tested-by: Iago López Galeiras 
Cc:  # 3.17+
Acked-by: Li Zefan 
Signed-off-by: Tejun Heo 
Signed-off-by: Sasha Levin

perf: Fix PERF_EVENT_IOC_PERIOD migration race

2015-09-15T17:58:01+00:00

[ Upstream commit c7999c6f3fed9e383d3131474588f282ae6d56b9 ]

I ran the perf fuzzer, which triggered some WARN()s which are due to
trying to stop/restart an event on the wrong CPU.

Use the normal IPI pattern to ensure we run the code on the correct CPU.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Vince Weaver 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Fixes: bad7192b842c ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

perf: Fix fasync handling on inherited events

2015-09-15T17:57:10+00:00

[ Upstream commit fed66e2cdd4f127a43fd11b8d92a99bdd429528c ]

Vince reported that the fasync signal stuff doesn't work proper for
inherited events. So fix that.

Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
which upon the demise of the file descriptor ensures the allocation is
freed and state is updated.

Now for perf, we can have the events stick around for a while after the
original FD is dead because of references from child events. So we
cannot copy the fasync pointer around. We can however consistently use
the parent's fasync, as that will be updated.

Reported-and-Tested-by: Vince Weaver 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: 
Cc: Arnaldo Carvalho deMelo 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

kexec: allocate the kexec control page with KEXEC_CONTROL_MEMORY_GFP

2015-08-27T17:26:05+00:00

[ Upstream commit 7e01b5acd88b3f3108d8c4ce44e3205d67437202 ]

Introduce KEXEC_CONTROL_MEMORY_GFP to allow the architecture code
to override the gfp flags of the allocation for the kexec control
page. The loop in kimage_alloc_normal_control_pages allocates pages
with GFP_KERNEL until a page is found that happens to have an
address smaller than the KEXEC_CONTROL_MEMORY_LIMIT. On systems
with a large memory size but a small KEXEC_CONTROL_MEMORY_LIMIT
the loop will keep allocating memory until the oom killer steps in.

Signed-off-by: Martin Schwidefsky 
Signed-off-by: Sasha Levin