linux-toradex.git/kernel/sched.c, branch v2.6.36.4

Sched: fix skip_clock_update optimization

2011-01-07T21:58:51+00:00

commit f26f9aff6aaf67e9a430d16c266f91b13a5bff64 upstream.

idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched().  Clear it after we return
from balancing instead, and in setup_thread_stack() as well, so
no successfully descheduled or never scheduled task has it set.

Need resched confused the skip_clock_update logic, which assumes
that the next call to update_rq_clock() will come nearly immediately
after being set.  Make the optimization robust against the waking
a sleeper before it sucessfully deschedules case by checking that
the current task has not been dequeued before setting the flag,
since it is that useless clock update we're trying to save, and
clear unconditionally in schedule() proper instead of conditionally
in put_prev_task().

Signed-off-by: Mike Galbraith 
Reported-by: Bjoern B. Brandenburg 
Tested-by: Yong Zhang 
Signed-off-by: Peter Zijlstra 
LKML-Reference: <1291802742.1417.9.camel@marge.simson.net>
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched: Cure more NO_HZ load average woes

2011-01-07T21:58:31+00:00

commit 0f004f5a696a9434b7214d0d3cbd0525ee77d428 upstream.

There's a long-running regression that proved difficult to fix and
which is hitting certain people and is rather annoying in its effects.

Damien reported that after 74f5187ac8 (sched: Cure load average vs
NO_HZ woes) his load average is unnaturally high, he also noted that
even with that patch reverted the load avgerage numbers are not
correct.

The problem is that the previous patch only solved half the NO_HZ
problem, it addressed the part of going into NO_HZ mode, not of
comming out of NO_HZ mode. This patch implements that missing half.

When comming out of NO_HZ mode there are two important things to take
care of:

 - Folding the pending idle delta into the global active count.
 - Correctly aging the averages for the idle-duration.

So with this patch the NO_HZ interaction should be complete and
behaviour between CONFIG_NO_HZ=[yn] should be equivalent.

Furthermore, this patch slightly changes the load average computation
by adding a rounding term to the fixed point multiplication.

Reported-by: Damien Wyart 
Reported-by: Tim McGrath 
Tested-by: Damien Wyart 
Tested-by: Orion Poplawski 
Tested-by: Kyle McMartin 
Signed-off-by: Peter Zijlstra 
Cc: Chase Douglas 
LKML-Reference: <1291129145.32004.874.camel@laptop>
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched: fix RCU lockdep splat from task_group()

2010-12-09T21:32:59+00:00

commit 6506cf6ce68d78a5470a8360c965dafe8e4b78e3 upstream.

This addresses the following RCU lockdep splat:

[0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03
[0.052999] lockdep: fixing up alternatives.
[0.054105]
[0.054106] ===================================================
[0.054999] [ INFO: suspicious rcu_dereference_check() usage. ]
[0.054999] ---------------------------------------------------
[0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection!
[0.054999]
[0.054999] other info that might help us debug this:
[0.054999]
[0.054999]
[0.054999] rcu_scheduler_active = 1, debug_locks = 1
[0.054999] 3 locks held by swapper/1:
[0.054999]  #0:  (cpu_add_remove_lock){+.+.+.}, at: [] cpu_up+0x42/0x6a
[0.054999]  #1:  (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2a/0x51
[0.054999]  #2:  (&rq->lock){-.-...}, at: [] init_idle+0x2f/0x113
[0.054999]
[0.054999] stack backtrace:
[0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1
[0.054999] Call Trace:
[0.054999]  [] lockdep_rcu_dereference+0x9b/0xa3
[0.054999]  [] task_group+0x7b/0x8a
[0.054999]  [] set_task_rq+0x13/0x40
[0.054999]  [] init_idle+0xd2/0x113
[0.054999]  [] fork_idle+0xb8/0xc7
[0.054999]  [] ? mark_held_locks+0x4d/0x6b
[0.054999]  [] do_fork_idle+0x17/0x2b
[0.054999]  [] native_cpu_up+0x1c1/0x724
[0.054999]  [] ? do_fork_idle+0x0/0x2b
[0.054999]  [] _cpu_up+0xac/0x127
[0.054999]  [] cpu_up+0x55/0x6a
[0.054999]  [] kernel_init+0xe1/0x1ff
[0.054999]  [] kernel_thread_helper+0x4/0x10
[0.054999]  [] ? restore_args+0x0/0x30
[0.054999]  [] ? kernel_init+0x0/0x1ff
[0.054999]  [] ? kernel_thread_helper+0x0/0x10
[0.056074] Booting Node   0, Processors  #1lockdep: fixing up alternatives.
[0.130045]  #2lockdep: fixing up alternatives.
[0.203089]  #3 Ok.
[0.275286] Brought up 4 CPUs
[0.276005] Total of 4 processors activated (16017.17 BogoMIPS).

The cgroup_subsys_state structures referenced by idle tasks are never
freed, because the idle tasks should be part of the root cgroup,
which is not removable.

The problem is that while we do in-fact hold rq->lock, the newly spawned
idle thread's cpu is not yet set to the correct cpu so the lockdep check
in task_group():

  lockdep_is_held(&task_rq(p)->lock)

will fail.

But this is a chicken and egg problem.  Setting the CPU's runqueue requires
that the CPU's runqueue already be set.  ;-)

So insert an RCU read-side critical section to avoid the complaint.

Signed-off-by: Peter Zijlstra 
Signed-off-by: Paul E. McKenney 
Signed-off-by: Greg Kroah-Hartman

sched: Fix string comparison in /proc/sched_features

2010-11-22T19:03:01+00:00

commit 7740191cd909b75d75685fb08a5d1f54b8a9d28b upstream.

Fix incorrect handling of the following case:

 INTERACTIVE
 INTERACTIVE_SOMETHING_ELSE

The comparison only checks up to each element's length.

Changelog since v1:
 - Embellish using some Rostedtisms.
  [ mingo:                 ^^ == smaller and cleaner ]

Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Steven Rostedt 
Cc: Peter Zijlstra 
Cc: Tony Lindgren 
LKML-Reference: <20100913214700.GB16118@Krystal>
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched: Drop all load weight manipulation for RT tasks

2010-11-22T19:03:01+00:00

commit 17bdcf949d03306b308c5fb694849cd35f119807 upstream.

Load weights are for the CFS, they do not belong in the RT task. This makes all
RT scheduling classes leave the CFS weights alone.

This fixes a real bug as well: I noticed the following phonomena: a process
elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is
indeed SCHED_OTHER, and the niceval is indeed reset to 0. However the weight
inserted by set_load_weight() remains at 0, giving the task insignificat
priority.

With this fix, the weight is reset to what the task had before being elevated
to SCHED_RR/SCHED_FIFO.

Cc: Lennart Poettering 
Signed-off-by: Linus Walleij 
Signed-off-by: Peter Zijlstra 
LKML-Reference: <1286807811-10568-1-git-send-email-linus.walleij@stericsson.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched: Fix user time incorrectly accounted as system time on 32-bit

2010-09-15T08:41:36+00:00

We have 32-bit variable overflow possibility when multiply in
task_times() and thread_group_times() functions. When the
overflow happens then the scaled utime value becomes erroneously
small and the scaled stime becomes i erroneously big.

Reported here:

 https://bugzilla.redhat.com/show_bug.cgi?id=633037
 https://bugzilla.kernel.org/show_bug.cgi?id=16559

Reported-by: Michael Chapman 
Reported-by: Ciriaco Garcia de Celis 
Signed-off-by: Stanislaw Gruszka 
Signed-off-by: Peter Zijlstra 
Cc: Hidetoshi Seto 
Cc:   # 2.6.32.19+ (partially) and 2.6.33+
LKML-Reference: <20100914143513.GB8415@redhat.com>
Signed-off-by: Ingo Molnar

sched: Move sched_avg_update() to update_cpu_load()

2010-09-09T18:39:33+00:00

Currently sched_avg_update() (which updates rt_avg stats in the rq)
is getting called from scale_rt_power() (in the load balance context)
which doesn't take rq->lock.

Fix it by moving the sched_avg_update() to more appropriate
update_cpu_load() where the CFS load gets updated as well.

Signed-off-by: Suresh Siddha 
Signed-off-by: Peter Zijlstra 
LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3>
Signed-off-by: Ingo Molnar

mutex: Improve the scalability of optimistic spinning

2010-08-23T08:56:27+00:00

There is a scalability issue for current implementation of optimistic
mutex spin in the kernel.  It is found on a 8 node 64 core Nehalem-EX
system (HT mode).

The intention of the optimistic mutex spin is to busy wait and spin on a
mutex if the owner of the mutex is running, in the hope that the mutex
will be released soon and be acquired, without the thread trying to
acquire mutex going to sleep. However, when we have a large number of
threads, contending for the mutex, we could have the mutex grabbed by
other thread, and then another ……, and we will keep spinning, wasting cpu
cycles and adding to the contention.  One possible fix is to quit
spinning and put the current thread on wait-list if mutex lock switch to
a new owner while we spin, indicating heavy contention (see the patch
included).

I did some testing on a 8 socket Nehalem-EX system with a total of 64
cores. Using Ingo's test-mutex program that creates/delete files with 256
threads (http://lkml.org/lkml/2006/1/8/50) , I see the following speed up
after putting in the mutex spin fix:

 ./mutex-test V 256 10
                 Ops/sec
 2.6.34          62864
 With fix        197200

Repeating the test with Aim7 fserver workload, again there is a speed up
with the fix:

                 Jobs/min
 2.6.34          91657
 With fix        149325

To look at the impact on the distribution of mutex acquisition time, I
collected the mutex acquisition time on Aim7 fserver workload with some
instrumentation.  The average acquisition time is reduced by 48% and
number of contentions reduced by 32%.

                 #contentions    Time to acquire mutex (cycles)
 2.6.34          72973           44765791
 With fix        49210           23067129

The histogram of mutex acquisition time is listed below.  The acquisition
time is in 2^bin cycles.  We see that without the fix, the acquisition
time is mostly around 2^26 cycles.  With the fix, we the distribution get
spread out a lot more towards the lower cycles, starting from 2^13.
However, there is an increase of the tail distribution with the fix at
2^28 and 2^29 cycles.  It seems a small price to pay for the reduced
average acquisition time and also getting the cpu to do useful work.

 Mutex acquisition time distribution (acq time = 2^bin cycles):
         2.6.34                  With Fix
 bin     #occurrence     %       #occurrence     %
 11      2               0.00%   120             0.24%
 12      10              0.01%   790             1.61%
 13      14              0.02%   2058            4.18%
 14      86              0.12%   3378            6.86%
 15      393             0.54%   4831            9.82%
 16      710             0.97%   4893            9.94%
 17      815             1.12%   4667            9.48%
 18      790             1.08%   5147            10.46%
 19      580             0.80%   6250            12.70%
 20      429             0.59%   6870            13.96%
 21      311             0.43%   1809            3.68%
 22      255             0.35%   2305            4.68%
 23      317             0.44%   916             1.86%
 24      610             0.84%   233             0.47%
 25      3128            4.29%   95              0.19%
 26      63902           87.69%  122             0.25%
 27      619             0.85%   286             0.58%
 28      0               0.00%   3536            7.19%
 29      0               0.00%   903             1.83%
 30      0               0.00%   0               0.00%

I've done similar experiments with 2.6.35 kernel on smaller boxes as
well.  One is on a dual-socket Westmere box (12 cores total, with HT).
Another experiment is on an old dual-socket Core 2 box (4 cores total, no
HT)

On the 12-core Westmere box, I see a 250% increase for Ingo's mutex-test
program with my mutex patch but no significant difference in aim7's
fserver workload.

On the 4-core Core 2 box, I see the difference with the patch for both
mutex-test and aim7 fserver are negligible.

So far, it seems like the patch has not caused regression on smaller
systems.

Signed-off-by: Tim Chen 
Acked-by: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Frederic Weisbecker 
Cc:  # .35.x
LKML-Reference: <1282168827.9542.72.camel@schen9-DESK>
Signed-off-by: Ingo Molnar

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

2010-08-06T16:39:22+00:00

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
  sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug
  sched: No need for bootmem special cases
  sched: Revert nohz_ratelimit() for now
  sched: Reduce update_group_power() calls
  sched: Update rq->clock for nohz balanced cpus
  sched: Fix spelling of sibling
  sched, cpuset: Drop __cpuexit from cpu hotplug callbacks
  sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check()
  sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()
  sched: thread_group_cputime: Simplify, document the "alive" check
  sched: Remove the obsolete exit_state/signal hacks
  sched: task_tick_rt: Remove the obsolete ->signal != NULL check
  sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless
  sched: Fix comments to make them DocBook happy
  sched: Fix fix_small_capacity
  powerpc: Exclude arch_sd_sibiling_asym_packing() on UP
  powerpc: Enable asymmetric SMT scheduling on POWER7
  sched: Add asymmetric group packing option for sibling domain
  sched: Fix capacity calculations for SMT4
  sched: Change nohz idle load balancing logic to push model
  ...

Merge branch 'sched/urgent' into sched/core

2010-08-05T07:46:29+00:00

Conflicts:
	include/linux/sched.h

Merge reason: Add the leftover .35 urgent bits, fix the conflict.

Signed-off-by: Ingo Molnar