linux-toradex.git/kernel/futex.c, branch v3.14.3

futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test

2014-04-14T13:50:05+00:00

commit 03b8c7b623c80af264c4c8d6111e5c6289933666 upstream.

If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there
is no runtime check necessary, allow to skip the test within futex_init().

This allows to get rid of some code which would always give the same result,
and also allows the compiler to optimize a couple of if statements away.

Signed-off-by: Heiko Carstens 
Cc: Finn Thain 
Cc: Geert Uytterhoeven 
Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

futex: avoid race between requeue and wake

2014-04-14T13:50:03+00:00

commit 69cd9eba38867a493a043bb13eb9b33cad5f1a9a upstream.

Jan Stancek reported:
 "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
  occasionally fails, because some threads fail to wake up.

  Testcase creates 5 threads, which are all waiting on same condition.
  Main thread then calls pthread_cond_broadcast() without holding mutex,
  which calls:

      futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

  This immediately wakes up single thread A, which unlocks mutex and
  tries to wake up another thread:

      futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

  If thread A manages to call futex_wake() before any waiters are
  requeued for uaddr2, no other thread is woken up"

The ordering constraints for the hash bucket waiter counting are that
the waiter counts have to be incremented _before_ getting the spinlock
(because the spinlock acts as part of the memory barrier), but the
"requeue" operation didn't honor those rules, and nobody had even
thought about that case.

This fairly simple patch just increments the waiter count for the target
hash bucket (hb2) when requeing a futex before taking the locks.  It
then decrements them again after releasing the lock - the code that
actually moves the futex(es) between hash buckets will do the additional
required waiter count housekeeping.

Reported-and-tested-by: Jan Stancek 
Acked-by: Davidlohr Bueso 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

futex: revert back to the explicit waiter counting code

2014-03-21T05:11:17+00:00

Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
taking the hb->lock if there's nothing to wake up") causes java threads
getting stuck on futexes when runing specjbb on a power7 numa box.

The cause appears to be that the powerpc spinlocks aren't using the same
ticket lock model that we use on x86 (and other) architectures, which in
turn result in the "spin_is_locked()" test in hb_waiters_pending()
occasionally reporting an unlocked spinlock even when there are pending
waiters.

So this reinstates Davidlohr Bueso's original explicit waiter counting
code, which I had convinced Davidlohr to drop in favor of figuring out
the pending waiters by just using the existing state of the spinlock and
the wait queue.

Reported-and-tested-by: Srikar Dronamraju 
Original-code-by: Davidlohr Bueso 
Signed-off-by: Linus Torvalds

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-01-20T18:42:08+00:00

Pull scheduler changes from Ingo Molnar:

 - Add the initial implementation of SCHED_DEADLINE support: a real-time
   scheduling policy where tasks that meet their deadlines and
   periodically execute their instances in less than their runtime quota
   see real-time scheduling and won't miss any of their deadlines.
   Tasks that go over their quota get delayed (Available to privileged
   users for now)

 - Clean up and fix preempt_enable_no_resched() abuse all around the
   tree

 - Do sched_clock() performance optimizations on x86 and elsewhere

 - Fix and improve auto-NUMA balancing

 - Fix and clean up the idle loop

 - Apply various cleanups and fixes

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  sched: Fix __sched_setscheduler() nice test
  sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
  sched: Fix up attr::sched_priority warning
  sched: Fix up scheduler syscall LTP fails
  sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
  sched/core: Fix htmldocs warnings
  sched/deadline: No need to check p if dl_se is valid
  sched/deadline: Remove unused variables
  sched/deadline: Fix sparse static warnings
  m68k: Fix build warning in mac_via.h
  sched, thermal: Clean up preempt_enable_no_resched() abuse
  sched, net: Fixup busy_loop_us_clock()
  sched, net: Clean up preempt_enable_no_resched() abuse
  sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
  sched/preempt, locking: Rework local_bh_{dis,en}able()
  sched/clock, x86: Avoid a runtime condition in native_sched_clock()
  sched/clock: Fix up clear_sched_clock_stable()
  sched/clock, x86: Use a static_key for sched_clock_stable
  sched/clock: Remove local_irq_disable() from the clocks
  sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
  ...

futexes: Fix futex_hashsize initialization

2014-01-16T14:14:32+00:00

"futexes: Increase hash table size for better performance"
introduces a new alloc_large_system_hash() call.

alloc_large_system_hash() however may allocate less memory than
requested, e.g. limited by MAX_ORDER.

Hence pass a pointer to alloc_large_system_hash() which will
contain the hash shift when the function returns. Afterwards
correctly set futex_hashsize.

Fixes a crash on s390 where the requested allocation size was
4MB but only 1MB was allocated.

Signed-off-by: Heiko Carstens 
Cc: Darren Hart 
Cc: Peter Zijlstra 
Cc: Paul E. McKenney 
Cc: Waiman Long 
Cc: Jason Low 
Cc: Davidlohr Bueso 
Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
Signed-off-by: Ingo Molnar

rtmutex: Turn the plist into an rb-tree

2014-01-13T12:41:50+00:00

Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.

This is done mainly because:
 - classical prio field of the plist is just an int, which might
   not be enough for representing a deadline;
 - manipulating such a list would become O(nr_deadline_tasks),
   which might be to much, as the number of -deadline task increases.

Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
 - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
   one with the higher (lower, actually!) prio wins;
 - among a -priority and a -deadline task, the latter always wins;
 - among two -deadline tasks, the one with the earliest deadline
   wins.

Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.

Signed-off-by: Peter Zijlstra 
Signed-off-by: Dario Faggioli 
Signed-off-by: Juri Lelli 
Signed-off-again-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar

futexes: Avoid taking the hb->lock if there's nothing to wake up

2014-01-13T10:45:21+00:00

In futex_wake() there is clearly no point in taking the hb->lock
if we know beforehand that there are no tasks to be woken. While
the hash bucket's plist head is a cheap way of knowing this, we
cannot rely 100% on it as there is a racy window between the
futex_wait call and when the task is actually added to the
plist. To this end, we couple it with the spinlock check as
tasks trying to enter the critical region are most likely
potential waiters that will be added to the plist, thus
preventing tasks sleeping forever if wakers don't acknowledge
all possible waiters.

Furthermore, the futex ordering guarantees are preserved,
ensuring that waiters either observe the changed user space
value before blocking or is woken by a concurrent waker. For
wakers, this is done by relying on the barriers in
get_futex_key_refs() -- for archs that do not have implicit mb
in atomic_inc(), we explicitly add them through a new
futex_get_mm function. For waiters we rely on the fact that
spin_lock calls already update the head counter, so spinners
are visible even if the lock hasn't been acquired yet.

For more details please refer to the updated comments in the
code and related discussion:

  https://lkml.org/lkml/2013/11/26/556

Special thanks to tglx for careful review and feedback.

Suggested-by: Linus Torvalds 
Reviewed-by: Darren Hart 
Reviewed-by: Thomas Gleixner 
Reviewed-by: Peter Zijlstra 
Signed-off-by: Davidlohr Bueso 
Cc: Paul E. McKenney 
Cc: Mike Galbraith 
Cc: Jeff Mahoney 
Cc: Scott Norton 
Cc: Tom Vaden 
Cc: Aswin Chandramouleeswaran 
Cc: Waiman Long 
Cc: Jason Low 
Cc: Andrew Morton 
Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar

futexes: Document multiprocessor ordering guarantees

2014-01-13T10:45:19+00:00

That's essential, if you want to hack on futexes.

Reviewed-by: Darren Hart 
Reviewed-by: Peter Zijlstra 
Reviewed-by: Paul E. McKenney 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Davidlohr Bueso 
Cc: Mike Galbraith 
Cc: Jeff Mahoney 
Cc: Linus Torvalds 
Cc: Randy Dunlap 
Cc: Scott Norton 
Cc: Tom Vaden 
Cc: Aswin Chandramouleeswaran 
Cc: Waiman Long 
Cc: Jason Low 
Cc: Andrew Morton 
Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar

futexes: Increase hash table size for better performance

2014-01-13T10:45:18+00:00

Currently, the futex global hash table suffers from its fixed,
smallish (for today's standards) size of 256 entries, as well as
its lack of NUMA awareness. Large systems, using many futexes,
can be prone to high amounts of collisions; where these futexes
hash to the same bucket and lead to extra contention on the same
hb->lock. Furthermore, cacheline bouncing is a reality when we
have multiple hb->locks residing on the same cacheline and
different futexes hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small
systems, or otherwise, 256 * ncpus (or larger as we need to
round the number to a power of 2). Note that this number of CPUs
accounts for all CPUs that can ever be available in the system,
taking into consideration things like hotpluging. While we do
impose extra overhead at bootup by making the hash table larger,
this is a one time thing, and does not shadow the benefits of
this patch.

Furthermore, as suggested by tglx, by cache aligning the hash
buckets we can avoid access across cacheline boundaries and also
avoid massive cache line bouncing if multiple cpus are hammering
away at different hash buckets which happen to reside in the
same cache line.

Also, similar to other core kernel components (pid, dcache,
tcp), by using alloc_large_system_hash() we benefit from its
NUMA awareness and thus the table is distributed among the nodes
instead of in a single one.

For a custom microbenchmark that pounds on the uaddr hashing --
making the wait path fail at futex_wait_setup() returning
-EWOULDBLOCK for large amounts of futexes, we can see the
following benefits on a 80-core, 8-socket 1Tb server:

 +---------+--------------------+------------------------+-----------------------+-------------------------------+
 | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
 +---------+--------------------+------------------------+-----------------------+-------------------------------+
 |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
 |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
 |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
 |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
 |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
 |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
 +---------+--------------------+------------------------+-----------------------+-------------------------------+

Reviewed-by: Darren Hart 
Reviewed-by: Peter Zijlstra 
Reviewed-by: Paul E. McKenney 
Reviewed-by: Waiman Long 
Reviewed-and-tested-by: Jason Low 
Reviewed-by: Thomas Gleixner 
Signed-off-by: Davidlohr Bueso 
Cc: Mike Galbraith 
Cc: Jeff Mahoney 
Cc: Linus Torvalds 
Cc: Scott Norton 
Cc: Tom Vaden 
Cc: Aswin Chandramouleeswaran 
Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar

futexes: Clean up various details

2014-01-13T10:45:17+00:00

- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Reviewed-by: Darren Hart 
Reviewed-by: Peter Zijlstra 
Reviewed-by: Paul E. McKenney 
Reviewed-by: Thomas Gleixner 
Signed-off-by: Jason Low 
Signed-off-by: Davidlohr Bueso 
Cc: Mike Galbraith 
Cc: Jeff Mahoney 
Cc: Linus Torvalds 
Cc: Scott Norton 
Cc: Tom Vaden 
Cc: Aswin Chandramouleeswaran 
Cc: Waiman Long 
Cc: Andrew Morton 
Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar