linux-toradex.git/kernel/sched, branch v4.4.77

sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting

2017-07-05T12:37:21+00:00

commit 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 upstream.

If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.

This situation on exiting NO_HZ is described by:

  this_rq->calc_load_update < jiffies < calc_load_update

In this scenario, what we should be doing is:

  this_rq->calc_load_update = calc_load_update		     [ next window ]

But what we actually do is:

  this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]

This has the effect of delaying load average updates for potentially
up to ~9seconds.

This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.

It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().

This issue is easy to reproduce before,

  commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.

Signed-off-by: Matt Fleming 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Mike Galbraith 
Cc: Morten Rasmussen 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vincent Guittot 
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/fair: Initialize throttle_count for new task-groups lazily

2017-05-25T12:30:12+00:00

commit 094f469172e00d6ab0a3130b0e01c83b3cf3a98d upstream.

Cgroup created inside throttled group must inherit current throttle_count.
Broken throttle_count allows to nominate throttled entries as a next buddy,
later this leads to null pointer dereference in pick_next_task_fair().

This patch initialize cfs_rq->throttle_count at first enqueue: laziness
allows to skip locking all rq at group creation. Lazy approach also allows
to skip full sub-tree scan at throttling hierarchy (not in this patch).

Signed-off-by: Konstantin Khlebnikov 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: bsegall@google.com
Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
Signed-off-by: Ingo Molnar 
Cc: Ben Pineau 
Signed-off-by: Greg Kroah-Hartman

sched/fair: Do not announce throttled next buddy in dequeue_task_fair()

2017-05-25T12:30:12+00:00

commit 754bd598be9bbc953bc709a9e8ed7f3188bfb9d7 upstream.

Hierarchy could be already throttled at this point. Throttled next
buddy could trigger a NULL pointer dereference in pick_next_task_fair().

Signed-off-by: Konstantin Khlebnikov 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
Signed-off-by: Ingo Molnar 
Cc: Ben Pineau 
Signed-off-by: Greg Kroah-Hartman

sched/rt: Add a missing rescheduling point

2017-03-31T07:49:54+00:00

commit 619bd4a71874a8fd78eb6ccf9f272c5e98bcc7b7 upstream.

Since the change in commit:

  fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks")

... we don't reschedule a task under certain circumstances:

Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.

The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.

The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.

Suggested-by: Peter Zijlstra 
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Thomas Gleixner 
Fixes: fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks")
Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/core: Fix a race between try_to_wake_up() and a woken up task

2016-09-24T08:07:42+00:00

commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf upstream.

The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule()
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker, actually gets stuck doing the following in
try_to_wake_up():

	while (p->on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p->pi_lock)
   ..
   if (p->on_rq && ttwu_wakeup())
   ..
   while (p->on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh 
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Benjamin Herrenschmidt 
Cc: Alexey Kardashevskiy 
Cc: Linus Torvalds 
Cc: Nicholas Piggin 
Cc: Nicholas Piggin 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/numa: Fix use-after-free bug in the task_numa_compare

2016-09-15T06:27:45+00:00

[ Upstream commit 1dff76b92f69051e579bdc131e01500da9fa2a91 ]

The following message can be observed on the Ubuntu v3.13.0-65 with KASan
backported:

  ==================================================================
  BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
  Read of size 8 by task qemu-system-x86/3998900
  =============================================================================
  BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
  -----------------------------------------------------------------------------

  INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
	__slab_alloc+0x4f8/0x560
	__kmalloc+0x1eb/0x280
	task_numa_fault+0xc1b/0xed0
	do_numa_page+0x192/0x200
	handle_mm_fault+0x808/0x1160
	__do_page_fault+0x218/0x750
	do_page_fault+0x1a/0x70
	page_fault+0x28/0x30
	SyS_poll+0x66/0x1a0
	system_call_fastpath+0x1a/0x1f
  INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
	__slab_free+0x2ab/0x3f0
	kfree+0x161/0x170
	task_numa_free+0x1d2/0x200
	finish_task_switch+0x1d2/0x210
	__schedule+0x5d4/0xc60
	schedule_preempt_disabled+0x40/0xc0
	cpu_startup_entry+0x2da/0x340
	start_secondary+0x28f/0x360
  Call Trace:
   [] dump_stack+0x45/0x56
   [] print_trailer+0xfd/0x170
   [] object_err+0x36/0x40
   [] kasan_report_error+0x1e9/0x3a0
   [] kasan_report+0x40/0x50
   [] ? task_numa_find_cpu+0x64c/0x890
   [] __asan_load8+0x69/0xa0
   [] ? find_next_bit+0xd8/0x120
   [] task_numa_find_cpu+0x64c/0x890
   [] task_numa_migrate+0x4ac/0x7b0
   [] numa_migrate_preferred+0xb3/0xc0
   [] task_numa_fault+0xb88/0xed0
   [] do_numa_page+0x192/0x200
   [] handle_mm_fault+0x808/0x1160
   [] ? sched_clock_cpu+0x10d/0x160
   [] ? native_load_tls+0x82/0xa0
   [] __do_page_fault+0x218/0x750
   [] ? hrtimer_try_to_cancel+0x76/0x160
   [] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
   [] do_page_fault+0x1a/0x70
   [] page_fault+0x28/0x30
   [] ? do_sys_poll+0x1c4/0x6d0
   [] ? enqueue_task_fair+0x4b6/0xaa0
   [] ? sched_clock+0x9/0x10
   [] ? resched_task+0x7a/0xc0
   [] ? check_preempt_curr+0xb3/0x130
   [] ? poll_select_copy_remaining+0x170/0x170
   [] ? wake_up_state+0x10/0x20
   [] ? drop_futex_key_refs.isra.14+0x1f/0x90
   [] ? futex_requeue+0x3de/0xba0
   [] ? do_futex+0xbe/0x8f0
   [] ? read_tsc+0x9/0x20
   [] ? ktime_get_ts+0x12d/0x170
   [] ? timespec_add_safe+0x59/0xe0
   [] SyS_poll+0x66/0x1a0
   [] system_call_fastpath+0x1a/0x1f

As commit 1effd9f19324 ("sched/numa: Fix unsafe get_task_struct() in
task_numa_assign()") points out, the rcu_read_lock() cannot protect the
task_struct from being freed in the finish_task_switch(). And the bug
happens in the process of calculation of imp which requires the access of
p->numa_faults being freed in the following path:

do_exit()
        current->flags |= PF_EXITING;
    release_task()
        ~~delayed_put_task_struct()~~
    schedule()
    ...
    ...
rq->curr = next;
    context_switch()
        finish_task_switch()
            put_task_struct()
                __put_task_struct()
		    task_numa_free()

The fix here to get_task_struct() early before end of dst_rq->lock to
protect the calculation process and also put_task_struct() in the
corresponding point if finally the dst_rq->curr somehow cannot be
assigned.

Additional credit to Liang Chen who helped fix the error logic and add the
put_task_struct() to the place it missed.

Signed-off-by: Gavin Guo 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Hugh Dickins 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: jay.vosburgh@canonical.com
Cc: liang.chen@canonical.com
Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

sched/nohz: Fix affine unpinned timers mess

2016-09-07T06:32:41+00:00

commit 444969223c81c7d0a95136b7b4cfdcfbc96ac5bd upstream.

The following commit:

  9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")'

intended to affine unpinned timers to housekeepers:

  unpinned timers(full dynaticks, idle)   =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(full dynaticks, busy)   =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(houserkeepers, idle)    =>   nearest busy housekeepers(otherwise, fallback to itself)

However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the
intention to:

  unpinned timers(full dynaticks, idle)   =>   any housekeepers(no mattter cpu topology)
  unpinned timers(full dynaticks, busy)   =>   any housekeepers(no mattter cpu topology)
  unpinned timers(housekeepers, idle)     =>   any busy cpus(otherwise, fallback to any housekeepers)

This patch fixes it by checking if there are busy housekeepers nearby,
otherwise falls to any housekeepers/itself. After the patch:

  unpinned timers(full dynaticks, idle)   =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(full dynaticks, busy)   =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
  unpinned timers(housekeepers, idle)     =>   nearest busy housekeepers(otherwise, fallback to itself)

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
[ Fixed the changelog. ]
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Fixes: 'commit 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")'
Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression

2016-09-07T06:32:41+00:00

commit 173be9a14f7b2e901cf77c18b1aafd4d672e9d9e upstream.

Mike reports:

 Roughly 10% of the time, ltp testcase getrusage04 fails:
 getrusage04    0  TINFO  :  Expected timers granularity is 4000 us
 getrusage04    0  TINFO  :  Using 1 as multiply factor for max [us]time increment (1000+4000us)!
 getrusage04    0  TINFO  :  utime:           0us; stime:         179us
 getrusage04    0  TINFO  :  utime:        3751us; stime:           0us
 getrusage04    1  TFAIL  :  getrusage04.c:133: stime increased > 5000us:

And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.

Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.

Reported-by: Mike Galbraith 
Tested-by: Mike Galbraith 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Frederic Weisbecker 
Cc: Fredrik Markstrom 
Cc: Linus Torvalds 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Radim 
Cc: Rik van Riel 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: Wanpeng Li 
Fixes: 9d7fb0427648 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/fair: Fix effective_load() to consistently use smoothed load

2016-08-10T09:49:28+00:00

commit 7dd4912594daf769a46744848b05bd5bc6d62469 upstream.

Starting with the following commit:

  fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

calc_tg_weight() doesn't compute the right value as expected by effective_load().

The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.

Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.

So stop using calc_tg_weight() and do it explicitly.

The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w

2016-08-10T09:49:25+00:00

commit 57675cb976eff977aefb428e68e4e0236d48a9ff upstream.

Lengthy output of sysrq-w may take a lot of time on slow serial console.

Currently we reset NMI-watchdog on the current CPU to avoid spurious
lockup messages. Sometimes this doesn't work since softlockup watchdog
might trigger on another CPU which is waiting for an IPI to proceed.
We reset softlockup watchdogs on all CPUs, but we do this only after
listing all tasks, and this may be too late on a busy system.

So, reset watchdogs CPUs earlier, in for_each_process_thread() loop.

Signed-off-by: Andrey Ryabinin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman