linux-toradex.git/kernel/sched.c, branch v2.6.28.9

wait: prevent exclusive waiter starvation

2009-02-12T17:50:23+00:00

commit 777c6c5f1f6e757ae49ecca2ed72d6b1f523c007 upstream.

With exclusive waiters, every process woken up through the wait queue must
ensure that the next waiter down the line is woken when it has finished.

Interruptible waiters don't do that when aborting due to a signal.  And if
an aborting waiter is concurrently woken up through the waitqueue, no one
will ever wake up the next waiter.

This has been observed with __wait_on_bit_lock() used by
lock_page_killable(): the first contender on the queue was aborting when
the actual lock holder woke it up concurrently.  The aborted contender
didn't acquire the lock and therefore never did an unlock followed by
waking up the next waiter.

Add abort_exclusive_wait() which removes the process' wait descriptor from
the waitqueue, iff still queued, or wakes up the next waiter otherwise.
It does so under the waitqueue lock.  Racing with a wake up means the
aborting process is either already woken (removed from the queue) and will
wake up the next waiter, or it will remove itself from the queue and the
concurrent wake up will apply to the next waiter after it.

Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
__wait_on_bit_lock() when they were interrupted by other means than a wake
up through the queue.
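
The two race outcomes described above can be sketched with a small
single-threaded userspace model.  This is only an illustration, not the
kernel code: every toy_* name is hypothetical, and the waitqueue lock that
the real function holds is reduced to a comment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of an exclusive wait queue; all names are hypothetical. */
struct toy_waiter {
	struct toy_waiter *next;
	bool queued;	/* still linked on the queue */
	bool woken;	/* received a wake-up through the queue */
};

struct toy_waitqueue {
	struct toy_waiter *head;
	struct toy_waiter *tail;
};

static void toy_enqueue(struct toy_waitqueue *q, struct toy_waiter *w)
{
	w->next = NULL;
	w->queued = true;
	w->woken = false;
	if (q->tail)
		q->tail->next = w;
	else
		q->head = w;
	q->tail = w;
}

static void toy_remove(struct toy_waitqueue *q, struct toy_waiter *w)
{
	struct toy_waiter **pp = &q->head;
	struct toy_waiter *prev = NULL;

	while (*pp && *pp != w) {
		prev = *pp;
		pp = &(*pp)->next;
	}
	if (!*pp)
		return;		/* not on this queue */
	*pp = w->next;
	if (q->tail == w)
		q->tail = prev;
	w->next = NULL;
	w->queued = false;
}

/* Exclusive wake-up: dequeue the first waiter and mark it woken. */
static void toy_wake_one(struct toy_waitqueue *q)
{
	struct toy_waiter *w = q->head;

	if (!w)
		return;
	toy_remove(q, w);
	w->woken = true;
}

/*
 * Abort a wait.  The kernel does this under the waitqueue lock; the toy
 * is single-threaded, so no lock is modeled.
 */
static void toy_abort_exclusive_wait(struct toy_waitqueue *q,
				     struct toy_waiter *w)
{
	if (w->queued)
		toy_remove(q, w);	/* a later wake-up goes to the next waiter */
	else
		toy_wake_one(q);	/* already woken: pass the wake-up on */
}
```

If a waiter is woken first and then aborts, the wake-up is handed to the
next waiter; if it aborts first, it simply leaves the queue and a later
wake-up reaches the next waiter directly.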

[akpm@linux-foundation.org: coding-style fixes]
Reported-by: Chris Mason 
Signed-off-by: Johannes Weiner 
Mentored-by: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Matthew Wilcox 
Cc: Chuck Lever 
Cc: Nick Piggin 
Cc: Ingo Molnar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

System call wrappers part 08

2009-01-18T18:43:55+00:00

commit 17da2bd90abf428523de0fb98f7075e00e3ed42e upstream.

Signed-off-by: Heiko Carstens 
Signed-off-by: Greg Kroah-Hartman

System call wrappers part 07

2009-01-18T18:43:55+00:00

commit 754fe8d297bfae7b77f7ce866e2fb0c5fb186506 upstream.

Signed-off-by: Heiko Carstens 
Signed-off-by: Greg Kroah-Hartman

System call wrappers part 06

2009-01-18T18:43:54+00:00

commit 5add95d4f7cf08f6f62510f19576992912387501 upstream.

Signed-off-by: Heiko Carstens 
Signed-off-by: Greg Kroah-Hartman

sched: CPU remove deadlock fix

2008-12-09T18:27:03+00:00

Impact: fix possible deadlock in CPU hot-remove path

This patch fixes a possible deadlock scenario in the CPU remove path.
migration_call grabs rq->lock, then wakes up everything on rq->migration_queue
with the lock held. Then one of the tasks on the migration queue ends up
calling tg_shares_up which then also tries to acquire the same rq->lock.

[c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0
[c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c
[c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc
[c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4
[c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0
[c000000058eab640] c00000000007ada4 .complete+0x54/0x80
[c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8
[c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0
[c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4
[c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108
[c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8
[c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50
[c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c
[c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc
[c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114
[c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40
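
The message above describes the problem but not the change itself.  One
common way to avoid this kind of self-deadlock is to drop the lock around
each wake-up, sketched below as a single-threaded userspace model.  The
toy_* names are hypothetical, and the toy lock counts a double acquisition
instead of hanging; this is an assumption about the general pattern, not a
copy of the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical non-recursive lock that records would-be deadlocks. */
struct toy_lock {
	bool held;
	int deadlocks;	/* times a holder tried to re-acquire */
};

static void toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		l->deadlocks++;	/* a real spinlock would spin forever here */
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

struct toy_req {
	struct toy_req *next;
	bool done;
};

struct toy_rq {
	struct toy_lock lock;
	struct toy_req *migration_queue;
};

/* Stand-in for the wake-up path, which ends up taking the rq lock itself. */
static void toy_complete(struct toy_rq *rq, struct toy_req *req)
{
	toy_lock_acquire(&rq->lock);
	req->done = true;
	toy_lock_release(&rq->lock);
}

/* Drain the queue, never holding the lock across a wake-up. */
static void toy_drain_migration_queue(struct toy_rq *rq)
{
	toy_lock_acquire(&rq->lock);
	while (rq->migration_queue) {
		struct toy_req *req = rq->migration_queue;

		rq->migration_queue = req->next;
		req->next = NULL;

		toy_lock_release(&rq->lock);	/* drop before waking */
		toy_complete(rq, req);
		toy_lock_acquire(&rq->lock);	/* retake to look at the queue */
	}
	toy_lock_release(&rq->lock);
}
```

The queue is only modified with the lock held, while each wake-up runs with
the lock released, so the wake-up path is free to take it.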

Signed-off-by: Brian King 
Acked-by: Peter Zijlstra 
Signed-off-by: Ingo Molnar

sched: prevent divide by zero error in cpu_avg_load_per_task, update

2008-11-29T19:45:15+00:00

Regarding the bug addressed in:

  4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task

Linus points out that the fix is not complete:

> There's nothing that keeps gcc from deciding not to reload
> rq->nr_running.
>
> Of course, in _practice_, I don't think gcc ever will (if it decides
> that it will spill, gcc is likely going to decide that it will
> literally spill the local variable to the stack rather than decide to
> reload off the pointer), but it's a valid compiler optimization, and
> it even has a name (rematerialization).
>
> So I suspect that your patch does fix the bug, but it still leaves the
> fairly unlikely _potential_ for it to re-appear at some point.
>
> We have ACCESS_ONCE() as a macro to guarantee that the compiler
> doesn't rematerialize a pointer access. That also would clarify
> the fact that we access something unsafe outside a lock.

So make sure our nr_running value is immutable and cannot change
after we check it for nonzero.
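
A minimal userspace sketch of the idea, assuming GNU C (__typeof__ is a
compiler extension).  The macro has the same shape as the kernel's
ACCESS_ONCE(); the function name and parameters are hypothetical.

```c
/* Force exactly one load through a volatile-qualified lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/*
 * Hypothetical helper: *nr_running_ptr may be changed by someone else at
 * any time, so it is read once and only the snapshot is used afterwards.
 */
static unsigned long toy_avg_per_task(unsigned long *nr_running_ptr,
				      unsigned long weight)
{
	unsigned long nr_running = ACCESS_ONCE(*nr_running_ptr);

	/* Both the check and the division use the same snapshot. */
	return nr_running ? weight / nr_running : 0;
}
```

Because the access is volatile, the compiler may not discard the snapshot
and re-read the shared location between the check and the division.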

Signed-off-by: Ingo Molnar

sched: prevent divide by zero error in cpu_avg_load_per_task

2008-11-27T09:29:52+00:00

Impact: fix divide by zero crash in scheduler rebalance irq

While testing the branch profiler, I hit this crash:

divide error: 0000 [#1] PREEMPT SMP
[...]
RIP: 0010:[]  [] cpu_avg_load_per_task+0x50/0x7f
[...]
Call Trace:
  <0> [] find_busiest_group+0x3e5/0xcaa
 [] rebalance_domains+0x2da/0xa21
 [] ? find_next_bit+0x1b2/0x1e6
 [] run_rebalance_domains+0x112/0x19f
 [] __do_softirq+0xa8/0x232
 [] call_softirq+0x1c/0x3e
 [] do_softirq+0x94/0x1cd
 [] irq_exit+0x6b/0x10e
 [] smp_apic_timer_interrupt+0xd3/0xff
 [] apic_timer_interrupt+0x13/0x20

The code for cpu_avg_load_per_task has:

	if (rq->nr_running)
		rq->avg_load_per_task = rq->load.weight / rq->nr_running;

The runqueue lock is not held here, and there is nothing that prevents
the rq->nr_running from going to zero after it passes the if condition.

The branch profiler simply made the race window bigger.

This patch saves off the rq->nr_running to a local variable and uses that
for both the condition and the division.
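
As a rough userspace sketch of that change, modeling only the fragment
quoted above (the toy_* types and function name are hypothetical):

```c
/* Hypothetical stand-ins for the runqueue fields used in the fragment. */
struct toy_load {
	unsigned long weight;
};

struct toy_rq {
	unsigned long nr_running;
	struct toy_load load;
	unsigned long avg_load_per_task;
};

static void toy_update_avg_load(struct toy_rq *rq)
{
	/* Snapshot the counter; the check and the division share it. */
	unsigned long nr_running = rq->nr_running;

	if (nr_running)
		rq->avg_load_per_task = rq->load.weight / nr_running;
}
```

Once the check passes, the divisor is the local copy, so a later change to
rq->nr_running cannot turn it into zero.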

Signed-off-by: Steven Rostedt 
Peter Zijlstra 
Signed-off-by: Ingo Molnar

cpuset: fix regression when failed to generate sched domains

2008-11-18T07:44:51+00:00

Impact: properly rebuild sched-domains on kmalloc() failure

When cpuset fails to generate sched domains due to kmalloc()
failure, the scheduler should fall back to the single partition
'fallback_doms' and rebuild sched domains, but currently it only
destroys the sched domains and does not rebuild them.

The regression was introduced by:

| commit dfb512ec4834116124da61d6c1ee10fd0aa32bd6
| Author: Max Krasnyansky 
| Date:   Fri Aug 29 13:11:41 2008 -0700
|
|    sched: arch_reinit_sched_domains() must destroy domains to force rebuild

After the above commit, partition_sched_domains(0, NULL, NULL) will
only destroy sched domains and partition_sched_domains(1, NULL, NULL)
will create the default sched domain.

Signed-off-by: Li Zefan 
Cc: Max Krasnyansky 
Cc: 
Signed-off-by: Ingo Molnar

sched: fix init_idle()'s use of sched_clock()

2008-11-12T19:05:50+00:00

Maciej Rutecki reported:

> I have this bug during suspend to disk:
>
> [  188.592151] Enabling non-boot CPUs ...
> [  188.592151] SMP alternatives: switching to SMP code
> [  188.666058] BUG: using smp_processor_id() in preemptible
> [00000000]
> code: suspend_to_disk/2934
> [  188.666064] caller is native_sched_clock+0x2b/0x80

Which, as noted by Linus, was caused by me, via:

  7cbaef9c "sched: optimize sched_clock() a bit"

Move the rq locking a bit earlier in the initialization sequence;
that will make the sched_clock() call in init_idle() non-preemptible.

Reported-by: Maciej Rutecki 
Signed-off-by: Ingo Molnar

sched: fix stale value in average load per task

2008-11-12T11:33:50+00:00

Impact: fix load balancer load average calculation accuracy

cpu_avg_load_per_task() returns a stale value when nr_running is 0:
the value calculated the last time nr_running was nonzero.

This patch returns and sets rq->avg_load_per_task to zero when nr_running
is 0.
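
The difference can be sketched in userspace by putting the old and new
behaviour side by side.  The toy_* names are hypothetical and the struct
only models the three fields involved.

```c
/* Hypothetical stand-in for the runqueue fields involved. */
struct toy_rq {
	unsigned long nr_running;
	unsigned long load_weight;
	unsigned long avg_load_per_task;
};

/* Old behaviour: with no runnable tasks, the cached value is returned. */
static unsigned long toy_avg_load_stale(struct toy_rq *rq)
{
	if (rq->nr_running)
		rq->avg_load_per_task = rq->load_weight / rq->nr_running;

	return rq->avg_load_per_task;
}

/* New behaviour: with no runnable tasks, the average is reset to zero. */
static unsigned long toy_avg_load_fixed(struct toy_rq *rq)
{
	if (rq->nr_running)
		rq->avg_load_per_task = rq->load_weight / rq->nr_running;
	else
		rq->avg_load_per_task = 0;

	return rq->avg_load_per_task;
}
```

With two tasks and a weight of 2048 both variants report 1024; once
nr_running drops to 0, the old variant keeps reporting 1024 while the fixed
one reports 0.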

Compile and boot tested on an x86_64 box.

Signed-off-by: Balbir Singh 
Acked-by: Peter Zijlstra 
Signed-off-by: Ingo Molnar