linux-toradex.git/kernel/sched, branch v3.19.4

sched/autogroup: Fix failure to set cpu.rt_runtime_us

2015-03-18T13:11:12+00:00

commit 1fe89e1b6d270aa0d3452c60d38461ea589594e3 upstream.

Because task_group() uses a cache of autogroup_task_group(), whose
output depends on sched_class, switching classes can generate
problems.

In particular, when started as fair, the cache points to the
autogroup, so when switching to RT the tg_rt_schedulable() test fails
for every cpu.rt_{runtime,period}_us change because now the autogroup
has tasks and no runtime.

Furthermore, going back to the previous semantics of varying
task_group() with sched_class has the down-side that the sched_debug
output varies as well, even though the task really is in the
autogroup.

Therefore add an autogroup exception to tg_has_rt_tasks() -- such that
both (all) task_group() usages in sched/core now have one. And remove
all the remnants of the variable task_group() output.

Reported-by: Zefan Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Stefan Bader 
Fixes: 8323f26ce342 ("sched: Fix race in task_group()")
Link: http://lkml.kernel.org/r/20150209112237.GR5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched: Fix hrtick_start() on UP

2015-03-18T13:11:11+00:00

commit 868933359a3bdda25b562e9d41bce7071edc1b08 upstream.

The commit 177ef2a6315e ("sched/deadline: Fix a precision problem in
the microseconds range") forgot to change the UP version of
hrtick_start(), do so now.

Signed-off-by: Wanpeng Li 
Fixes: 177ef2a6315e ("sched/deadline: Fix a precision problem in the microseconds range")
[ Fixed the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1416962647-76792-7-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2015-02-06T21:34:26+00:00

Pull scheduler fixes from Ingo Molnar:
 "Misc fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix deadline parameter modification handling
  sched/wait: Remove might_sleep() from wait_event_cmd()
  sched: Fix crash if cpuset_cpumask_can_shrink() is passed an empty cpumask
  sched/fair: Avoid using uninitialized variable in preferred_group_nid()

sched/deadline: Fix deadline parameter modification handling

2015-02-04T06:42:48+00:00

Commit 67dfa1b756f2 ("sched/deadline: Implement cancel_dl_timer() to
use in switched_from_dl()") removed the hrtimer_try_cancel() function
call out from init_dl_task_timer(), which gets called from
__setparam_dl().

The result is that we can now re-init the timer while its active --
this is bad and corrupts timer state.

Furthermore; changing the parameters of an active deadline task is
tricky in that you want to maintain guarantees, while immediately
effective change would allow one to circumvent the CBS guarantees --
this too is bad, as one (bad) task should not be able to affect the
others.

Rework things to avoid both problems. We only need to initialize the
timer once, so move that to __sched_fork() for new tasks.

Then make sure __setparam_dl() doesn't affect the current running
state but only updates the parameters used to calculate the next
scheduling period -- this guarantees the CBS functions as expected
(albeit slightly pessimistic).

This however means we need to make sure __dl_clear_params() needs to
reset the active state otherwise new (and tasks flipping between
classes) will not properly (re)compute their first instance.

Todo: close class flipping CBS hole.
Todo: implement delayed BW release.

Reported-by: Luca Abeni 
Acked-by: Juri Lelli 
Tested-by: Luca Abeni 
Fixes: 67dfa1b756f2 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()")
Signed-off-by: Peter Zijlstra (Intel) 
Cc: 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar

sched: don't cause task state changes in nested sleep debugging

2015-02-01T20:23:32+00:00

Commit 8eb23b9f35aa ("sched: Debug nested sleeps") added code to report
on nested sleep conditions, which we generally want to avoid because the
inner sleeping operation can re-set the thread state to TASK_RUNNING,
but that will then cause the outer sleep loop not actually sleep when it
calls schedule.

However, that's actually valid traditional behavior, with the inner
sleep being some fairly rare case (like taking a sleeping lock that
normally doesn't actually need to sleep).

And the debug code would actually change the state of the task to
TASK_RUNNING internally, which makes that kind of traditional and
working code not work at all, because now the nested sleep doesn't just
sometimes cause the outer one to not block, but will cause it to happen
every time.

In particular, it will cause the cardbus kernel daemon (pccardd) to
basically busy-loop doing scheduling, converting a laptop into a heater,
as reported by Bruno Prémont.  But there may be other legacy uses of
that nested sleep model in other drivers that are also likely to never
get converted to the new model.

This fixes both cases:

 - don't set TASK_RUNNING when the nested condition happens (note: even
   if WARN_ONCE() only _warns_ once, the return value isn't whether the
   warning happened, but whether the condition for the warning was true.
   So despite the warning only happening once, the "if (WARN_ON(..))"
   would trigger for every nested sleep.

 - in the cases where we knowingly disable the warning by using
   "sched_annotate_sleep()", don't change the task state (that is used
   for all core scheduling decisions), instead use '->task_state_change'
   that is used for the debugging decision itself.

(Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
suggested change to use 'task_state_change' as part of the test)

Reported-and-bisected-by: Bruno Prémont 
Tested-by: Rafael J Wysocki 
Acked-by: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner ,
Cc: Ilya Dryomov ,
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Peter Hurley ,
Cc: Davidlohr Bueso ,
Signed-off-by: Linus Torvalds

sched: Fix crash if cpuset_cpumask_can_shrink() is passed an empty cpumask

2015-01-28T14:28:15+00:00

While creating an exclusive cpuset, we passed cpuset_cpumask_can_shrink()
an empty cpumask (cur), and dl_bw_of(cpumask_any(cur)) made boom with it:

 CPU: 0 PID: 6942 Comm: shield.sh Not tainted 3.19.0-master #19
 Hardware name: MEDIONPC MS-7502/MS-7502, BIOS 6.00 PG 12/26/2007
 task: ffff880224552450 ti: ffff8800caab8000 task.ti: ffff8800caab8000
 RIP: 0010:[]  [] cpuset_cpumask_can_shrink+0x56/0xb0
 [...]
 Call Trace:
  [] validate_change+0x18a/0x200
  [] cpuset_write_resmask+0x3b7/0x720
  [] cgroup_file_write+0x38/0x100
  [] kernfs_fop_write+0x12a/0x180
  [] vfs_write+0xb3/0x1d0
  [] SyS_write+0x46/0xb0
  [] system_call_fastpath+0x16/0x1b

Signed-off-by: Mike Galbraith 
Acked-by: Zefan Li 
Fixes: f82f80426f7a ("sched/deadline: Ensure that updates to exclusive cpusets don't break AC")
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1422417235.5716.5.camel@marge.simpson.net
Signed-off-by: Ingo Molnar

sched/fair: Avoid using uninitialized variable in preferred_group_nid()

2015-01-28T12:14:12+00:00

At least some gcc versions - validly afaict - warn about potentially
using max_group uninitialized: There's no way the compiler can prove
that the body of the conditional where it and max_faults get set/
updated gets executed; in fact, without knowing all the details of
other scheduler code, I can't prove this either.

Generally the necessary change would appear to be to clear max_group
prior to entering the inner loop, and break out of the outer loop when
it ends up being all clear after the inner one. This, however, seems
inefficient, and afaict the same effect can be achieved by exiting the
outer loop when max_faults is still zero after the inner loop.

[ mingo: changed the solution to zero initialization: uninitialized_var()
  needs to die, as it's an actively dangerous construct: if in the future
  a known-proven-good piece of code is changed to have a true, buggy
  uninitialized variable, the compiler warning is then supressed...

  The better long term solution is to clean up the code flow, so that
  even simple minded compilers (and humans!) are able to read it without
  getting a headache.  ]

Signed-off-by: Jan Beulich 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/54C2139202000078000588F7@mail.emea.novell.com
Signed-off-by: Ingo Molnar

sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()

2015-01-09T10:19:00+00:00

When alloc_fair_sched_group() in sched_create_group() fails,
free_sched_group() is called, and free_fair_sched_group() is called by
free_sched_group(). Since destroy_cfs_bandwidth() is called by
free_fair_sched_group() without calling init_cfs_bandwidth(),
RCU stall occurs at hrtimer_cancel():

  INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies g=13074 c=13073 q=0)
  Task dump for CPU 1:
  (fprintd)       R  running task        0  6249      1 0x00000088
  ...
  Call Trace:
     [] sched_show_task+0xa8/0x110
   [] dump_cpu_task+0x3d/0x50
   [] rcu_dump_cpu_stacks+0x90/0xd0
   [] rcu_check_callbacks+0x491/0x700
   [] update_process_times+0x4b/0x80
   [] tick_sched_handle.isra.20+0x36/0x50
   [] tick_sched_timer+0x42/0x70
   [] __run_hrtimer+0x69/0x1a0
   [] ? tick_sched_handle.isra.20+0x50/0x50
   [] hrtimer_interrupt+0xef/0x230
   [] local_apic_timer_interrupt+0x3b/0x70
   [] smp_apic_timer_interrupt+0x45/0x60
   [] apic_timer_interrupt+0x6d/0x80
     [] ? lock_hrtimer_base.isra.23+0x18/0x50
   [] ? __kmalloc+0x211/0x230
   [] hrtimer_try_to_cancel+0x22/0xd0
   [] ? __kmalloc+0x211/0x230
   [] hrtimer_cancel+0x22/0x30
   [] free_fair_sched_group+0x25/0xd0
   [] free_sched_group+0x16/0x40
   [] sched_create_group+0x4b/0x80
   [] sched_autogroup_create_attach+0x43/0x1c0
   [] sys_setsid+0x7c/0x110
   [] system_call_fastpath+0x12/0x17

Check whether init_cfs_bandwidth() was called before calling
destroy_cfs_bandwidth().

Signed-off-by: Tetsuo Handa 
[ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Paul Turner 
Cc: Ben Segall 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar

sched/deadline: Avoid double-accounting in case of missed deadlines

2015-01-09T10:18:57+00:00

The dl_runtime_exceeded() function is supposed to ckeck if
a SCHED_DEADLINE task must be throttled, by checking if its
current runtime is <= 0. However, it also checks if the
scheduling deadline has been missed (the current time is
larger than the current scheduling deadline), further
decreasing the runtime if this happens.
This "double accounting" is wrong:

- In case of partitioned scheduling (or single CPU), this
  happens if task_tick_dl() has been called later than expected
  (due to small HZ values). In this case, the current runtime is
  also negative, and replenish_dl_entity() can take care of the
  deadline miss by recharging the current runtime to a value smaller
  than dl_runtime

- In case of global scheduling on multiple CPUs, scheduling
  deadlines can be missed even if the task did not consume more
  runtime than expected, hence penalizing the task is wrong

This patch fix this problem by throttling a SCHED_DEADLINE task
only when its runtime becomes negative, and not modifying the runtime

Signed-off-by: Luca Abeni 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Juri Lelli 
Cc: 
Cc: Dario Faggioli 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar

sched/deadline: Fix migration of SCHED_DEADLINE tasks

2015-01-09T10:18:56+00:00

According to global EDF, tasks should be migrated between runqueues
without checking if their scheduling deadlines and runtimes are valid.
However, SCHED_DEADLINE currently performs such a check:
a migration happens doing:

	deactivate_task(rq, next_task, 0);
	set_task_cpu(next_task, later_rq->cpu);
	activate_task(later_rq, next_task, 0);

which ends up calling dequeue_task_dl(), setting the new CPU, and then
calling enqueue_task_dl().

enqueue_task_dl() then calls enqueue_dl_entity(), which calls
update_dl_entity(), which can modify scheduling deadline and runtime,
breaking global EDF scheduling.

As a result, some of the properties of global EDF are not respected:
for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
two cores can have unbounded response times for the third task even
if 30/80+40/80+120/170 = 1.5809 < 2

This can be fixed by invoking update_dl_entity() only in case of
wakeup, or if this is a new SCHED_DEADLINE task.

Signed-off-by: Luca Abeni 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Juri Lelli 
Cc: 
Cc: Dario Faggioli 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar