linux-toradex.git/kernel/sched, branch v3.19-rc5

sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()

2015-01-09T10:19:00+00:00

When alloc_fair_sched_group() in sched_create_group() fails,
free_sched_group() is called, and free_fair_sched_group() is called by
free_sched_group(). Since destroy_cfs_bandwidth() is called by
free_fair_sched_group() without calling init_cfs_bandwidth(),
RCU stall occurs at hrtimer_cancel():

  INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies g=13074 c=13073 q=0)
  Task dump for CPU 1:
  (fprintd)       R  running task        0  6249      1 0x00000088
  ...
  Call Trace:
     [] sched_show_task+0xa8/0x110
   [] dump_cpu_task+0x3d/0x50
   [] rcu_dump_cpu_stacks+0x90/0xd0
   [] rcu_check_callbacks+0x491/0x700
   [] update_process_times+0x4b/0x80
   [] tick_sched_handle.isra.20+0x36/0x50
   [] tick_sched_timer+0x42/0x70
   [] __run_hrtimer+0x69/0x1a0
   [] ? tick_sched_handle.isra.20+0x50/0x50
   [] hrtimer_interrupt+0xef/0x230
   [] local_apic_timer_interrupt+0x3b/0x70
   [] smp_apic_timer_interrupt+0x45/0x60
   [] apic_timer_interrupt+0x6d/0x80
     [] ? lock_hrtimer_base.isra.23+0x18/0x50
   [] ? __kmalloc+0x211/0x230
   [] hrtimer_try_to_cancel+0x22/0xd0
   [] ? __kmalloc+0x211/0x230
   [] hrtimer_cancel+0x22/0x30
   [] free_fair_sched_group+0x25/0xd0
   [] free_sched_group+0x16/0x40
   [] sched_create_group+0x4b/0x80
   [] sched_autogroup_create_attach+0x43/0x1c0
   [] sys_setsid+0x7c/0x110
   [] system_call_fastpath+0x12/0x17

Check whether init_cfs_bandwidth() was called before calling
destroy_cfs_bandwidth().

Signed-off-by: Tetsuo Handa 
[ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Paul Turner 
Cc: Ben Segall 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar

sched/deadline: Avoid double-accounting in case of missed deadlines

2015-01-09T10:18:57+00:00

The dl_runtime_exceeded() function is supposed to ckeck if
a SCHED_DEADLINE task must be throttled, by checking if its
current runtime is <= 0. However, it also checks if the
scheduling deadline has been missed (the current time is
larger than the current scheduling deadline), further
decreasing the runtime if this happens.
This "double accounting" is wrong:

- In case of partitioned scheduling (or single CPU), this
  happens if task_tick_dl() has been called later than expected
  (due to small HZ values). In this case, the current runtime is
  also negative, and replenish_dl_entity() can take care of the
  deadline miss by recharging the current runtime to a value smaller
  than dl_runtime

- In case of global scheduling on multiple CPUs, scheduling
  deadlines can be missed even if the task did not consume more
  runtime than expected, hence penalizing the task is wrong

This patch fix this problem by throttling a SCHED_DEADLINE task
only when its runtime becomes negative, and not modifying the runtime

Signed-off-by: Luca Abeni 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Juri Lelli 
Cc: 
Cc: Dario Faggioli 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar

sched/deadline: Fix migration of SCHED_DEADLINE tasks

2015-01-09T10:18:56+00:00

According to global EDF, tasks should be migrated between runqueues
without checking if their scheduling deadlines and runtimes are valid.
However, SCHED_DEADLINE currently performs such a check:
a migration happens doing:

	deactivate_task(rq, next_task, 0);
	set_task_cpu(next_task, later_rq->cpu);
	activate_task(later_rq, next_task, 0);

which ends up calling dequeue_task_dl(), setting the new CPU, and then
calling enqueue_task_dl().

enqueue_task_dl() then calls enqueue_dl_entity(), which calls
update_dl_entity(), which can modify scheduling deadline and runtime,
breaking global EDF scheduling.

As a result, some of the properties of global EDF are not respected:
for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
two cores can have unbounded response times for the third task even
if 30/80+40/80+120/170 = 1.5809 < 2

This can be fixed by invoking update_dl_entity() only in case of
wakeup, or if this is a new SCHED_DEADLINE task.

Signed-off-by: Luca Abeni 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Juri Lelli 
Cc: 
Cc: Dario Faggioli 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar

sched: Fix odd values in effective_load() calculations

2015-01-09T10:18:54+00:00

In effective_load, we have (long w * unsigned long tg->shares) / long W,
when w is negative, it is cast to unsigned long and hence the product is
insanely large. Fix this by casting tg->shares to long.

Reported-by: Sasha Levin 
Signed-off-by: Yuyang Du 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Dave Jones 
Cc: Andrey Ryabinin 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
Signed-off-by: Ingo Molnar

sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation

2014-12-23T10:43:48+00:00

When allocating space for load_balance_mask, in sched_init, when
CPUMASK_OFFSTACK is set, we've managed to spill over
KMALLOC_MAX_SIZE on our 6144 core machine.  The patch below
breaks up the allocations so that they don't overflow the max
alloc size.  It also allocates the masks on the the node from
which they'll most commonly be accessed, to minimize remote
accesses on NUMA machines.

Suggested-by: George Beshers 
Signed-off-by: Alex Thorlton 
Cc: George Beshers 
Cc: Russ Anderson 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.com
Signed-off-by: Ingo Molnar

sched_show_task: fix unsafe usage of ->real_parent

2014-12-11T01:41:09+00:00

rcu_read_lock() can not protect p->real_parent if release_task(p) was
already called, change sched_show_task() to check pis_alive() like other
users do.

Note: we need some helpers to cleanup the code like this.  And it seems
that that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe too.

Signed-off-by: Oleg Nesterov 
Cc: Aaron Tomlin 
Cc: Alexey Dobriyan 
Cc: "Eric W. Biederman" ,
Cc: Sterling Alexander 
Acked-by: Peter Zijlstra (Intel) 
Cc: Roland McGrath 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-12-10T05:21:34+00:00

Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle are:

   - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.

     This instruments might_sleep() checks to catch places that nest
     blocking primitives - such as mutex usage in a wait loop.  Such
     bugs can result in hard to debug races/hangs.

     Another category of invalid nesting that this facility will detect
     is the calling of blocking functions from within schedule() ->
     sched_submit_work() -> blk_schedule_flush_plug().

     There's some potential for false positives (if secondary blocking
     primitives themselves are not ready yet for this facility), but the
     kernel will warn once about such bugs per bootup, so the warning
     isn't much of a nuisance.

     This feature comes with a number of fixes, for problems uncovered
     with it, so no messages are expected normally.

   - Another round of sched/numa optimizations and refinements, for
     CONFIG_NUMA_BALANCING=y.

   - Another round of sched/dl fixes and refinements.

  Plus various smaller fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched: Add missing rcu protection to wake_up_all_idle_cpus
  sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
  sched/numa: Init numa balancing fields of init_task
  sched/deadline: Remove unnecessary definitions in cpudeadline.h
  sched/cpupri: Remove unnecessary definitions in cpupri.h
  sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
  sched/fair: Fix stale overloaded status in the busiest group finding logic
  sched: Move p->nr_cpus_allowed check to select_task_rq()
  sched/completion: Document when to use wait_for_completion_io_*()
  sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
  sched/fair: Kill task_struct::numa_entry and numa_group::task_list
  sched: Refactor task_struct to use numa_faults instead of numa_* pointers
  sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
  sched/deadline: Reschedule from switched_from_dl() after a successful pull
  sched/deadline: Push task away if the deadline is equal to curr during wakeup
  sched/deadline: Add deadline rq status print
  sched/deadline: Fix artificial overrun introduced by yield_task_dl()
  sched/rt: Clean up check_preempt_equal_prio()
  sched/core: Use dl_bw_of() under rcu_read_lock_sched()
  sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
  ...

Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-12-10T04:23:19+00:00

Pull RCU updates from Ingo Molnar:
 "These are the main changes in this cycle:

    - Streamline RCU's use of per-CPU variables, shifting from "cpu"
      arguments to functions to "this_"-style per-CPU variable
      accessors.

    - signal-handling RCU updates.

    - real-time updates.

    - torture-test updates.

    - miscellaneous fixes.

    - documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  rcu: Fix FIXME in rcu_tasks_kthread()
  rcu: More info about potential deadlocks with rcu_read_unlock()
  rcu: Optimize cond_resched_rcu_qs()
  rcu: Add sparse check for RCU_INIT_POINTER()
  documentation: memory-barriers.txt: Correct example for reorderings
  documentation: Add atomic_long_t to atomic_ops.txt
  documentation: Additional restriction for control dependencies
  documentation: Document RCU self test boot params
  rcutorture: Fix rcu_torture_cbflood() memory leak
  rcutorture: Remove obsolete kversion param in kvm.sh
  rcutorture: Remove stale test configurations
  rcutorture: Enable RCU self test in configs
  rcutorture: Add early boot self tests
  torture: Run Linux-kernel binary out of results directory
  cpu: Avoid puts_pending overflow
  rcu: Remove "cpu" argument to rcu_cleanup_after_idle()
  rcu: Remove "cpu" argument to rcu_prepare_for_idle()
  rcu: Remove "cpu" argument to rcu_needs_cpu()
  rcu: Remove "cpu" argument to rcu_note_context_switch()
  rcu: Remove "cpu" argument to rcu_preempt_check_callbacks()
  ...

sched: Add missing rcu protection to wake_up_all_idle_cpus

2014-12-08T10:44:19+00:00

Locklessly doing is_idle_task(rq->curr) is only okay because of
RCU protection.  The older variant of the broken code checked
rq->curr == rq->idle instead and therefore didn't need RCU.

Fixes: f6be8af1c95d ("sched: Add new API wake_up_if_idle() to wake up the idle cpu")
Signed-off-by: Andy Lutomirski 
Reviewed-by: Chuansheng Liu 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.net
Signed-off-by: Ingo Molnar

context_tracking: Restore previous state in schedule_user

2014-12-04T04:55:58+00:00

It appears that some SCHEDULE_USER (asm for schedule_user) callers
in arch/x86/kernel/entry_64.S are called from RCU kernel context,
and schedule_user will return in RCU user context.  This causes RCU
warnings and possible failures.

This is intended to be a minimal fix suitable for 3.18.

Reported-and-tested-by: Dave Jones 
Cc: Oleg Nesterov 
Cc: Frédéric Weisbecker 
Acked-by: Paul E. McKenney 
Signed-off-by: Andy Lutomirski 
Signed-off-by: Linus Torvalds