<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/kernel/futex.c, branch v3.14.3</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test</title>
<updated>2014-04-14T13:50:05+00:00</updated>
<author>
<name>Heiko Carstens</name>
<email>heiko.carstens@de.ibm.com</email>
</author>
<published>2014-03-02T12:09:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d232ed0c0eee44a5d3bbc4f21f136a303e77c34d'/>
<id>d232ed0c0eee44a5d3bbc4f21f136a303e77c34d</id>
<content type='text'>
commit 03b8c7b623c80af264c4c8d6111e5c6289933666 upstream.

If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there
is no runtime check necessary, allow to skip the test within futex_init().

This allows to get rid of some code which would always give the same result,
and also allows the compiler to optimize a couple of if statements away.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Finn Thain &lt;fthain@telegraphics.com.au&gt;
Cc: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;
Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 03b8c7b623c80af264c4c8d6111e5c6289933666 upstream.

If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there
is no runtime check necessary, allow to skip the test within futex_init().

This allows to get rid of some code which would always give the same result,
and also allows the compiler to optimize a couple of if statements away.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Finn Thain &lt;fthain@telegraphics.com.au&gt;
Cc: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;
Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>futex: avoid race between requeue and wake</title>
<updated>2014-04-14T13:50:03+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2014-04-08T22:30:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8e58cd80d042569da7af501de897c5e0538d99b0'/>
<id>8e58cd80d042569da7af501de897c5e0538d99b0</id>
<content type='text'>
commit 69cd9eba38867a493a043bb13eb9b33cad5f1a9a upstream.

Jan Stancek reported:
 "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
  occasionally fails, because some threads fail to wake up.

  Testcase creates 5 threads, which are all waiting on same condition.
  Main thread then calls pthread_cond_broadcast() without holding mutex,
  which calls:

      futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

  This immediately wakes up single thread A, which unlocks mutex and
  tries to wake up another thread:

      futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

  If thread A manages to call futex_wake() before any waiters are
  requeued for uaddr2, no other thread is woken up"

The ordering constraints for the hash bucket waiter counting are that
the waiter counts have to be incremented _before_ getting the spinlock
(because the spinlock acts as part of the memory barrier), but the
"requeue" operation didn't honor those rules, and nobody had even
thought about that case.

This fairly simple patch just increments the waiter count for the target
hash bucket (hb2) when requeing a futex before taking the locks.  It
then decrements them again after releasing the lock - the code that
actually moves the futex(es) between hash buckets will do the additional
required waiter count housekeeping.

Reported-and-tested-by: Jan Stancek &lt;jstancek@redhat.com&gt;
Acked-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 69cd9eba38867a493a043bb13eb9b33cad5f1a9a upstream.

Jan Stancek reported:
 "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
  occasionally fails, because some threads fail to wake up.

  Testcase creates 5 threads, which are all waiting on same condition.
  Main thread then calls pthread_cond_broadcast() without holding mutex,
  which calls:

      futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

  This immediately wakes up single thread A, which unlocks mutex and
  tries to wake up another thread:

      futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

  If thread A manages to call futex_wake() before any waiters are
  requeued for uaddr2, no other thread is woken up"

The ordering constraints for the hash bucket waiter counting are that
the waiter counts have to be incremented _before_ getting the spinlock
(because the spinlock acts as part of the memory barrier), but the
"requeue" operation didn't honor those rules, and nobody had even
thought about that case.

This fairly simple patch just increments the waiter count for the target
hash bucket (hb2) when requeing a futex before taking the locks.  It
then decrements them again after releasing the lock - the code that
actually moves the futex(es) between hash buckets will do the additional
required waiter count housekeeping.

Reported-and-tested-by: Jan Stancek &lt;jstancek@redhat.com&gt;
Acked-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>futex: revert back to the explicit waiter counting code</title>
<updated>2014-03-21T05:11:17+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2014-03-21T05:11:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=11d4616bd07f38d496bd489ed8fad1dc4d928823'/>
<id>11d4616bd07f38d496bd489ed8fad1dc4d928823</id>
<content type='text'>
Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
taking the hb-&gt;lock if there's nothing to wake up") causes java threads
getting stuck on futexes when runing specjbb on a power7 numa box.

The cause appears to be that the powerpc spinlocks aren't using the same
ticket lock model that we use on x86 (and other) architectures, which in
turn result in the "spin_is_locked()" test in hb_waiters_pending()
occasionally reporting an unlocked spinlock even when there are pending
waiters.

So this reinstates Davidlohr Bueso's original explicit waiter counting
code, which I had convinced Davidlohr to drop in favor of figuring out
the pending waiters by just using the existing state of the spinlock and
the wait queue.

Reported-and-tested-by: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Original-code-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
taking the hb-&gt;lock if there's nothing to wake up") causes java threads
getting stuck on futexes when runing specjbb on a power7 numa box.

The cause appears to be that the powerpc spinlocks aren't using the same
ticket lock model that we use on x86 (and other) architectures, which in
turn result in the "spin_is_locked()" test in hb_waiters_pending()
occasionally reporting an unlocked spinlock even when there are pending
waiters.

So this reinstates Davidlohr Bueso's original explicit waiter counting
code, which I had convinced Davidlohr to drop in favor of figuring out
the pending waiters by just using the existing state of the spinlock and
the wait queue.

Reported-and-tested-by: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Original-code-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2014-01-20T18:42:08+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2014-01-20T18:42:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a0fa1dd3cdbccec9597fe53b6177a9aa6e20f2f8'/>
<id>a0fa1dd3cdbccec9597fe53b6177a9aa6e20f2f8</id>
<content type='text'>
Pull scheduler changes from Ingo Molnar:

 - Add the initial implementation of SCHED_DEADLINE support: a real-time
   scheduling policy where tasks that meet their deadlines and
   periodically execute their instances in less than their runtime quota
   see real-time scheduling and won't miss any of their deadlines.
   Tasks that go over their quota get delayed (Available to privileged
   users for now)

 - Clean up and fix preempt_enable_no_resched() abuse all around the
   tree

 - Do sched_clock() performance optimizations on x86 and elsewhere

 - Fix and improve auto-NUMA balancing

 - Fix and clean up the idle loop

 - Apply various cleanups and fixes

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  sched: Fix __sched_setscheduler() nice test
  sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
  sched: Fix up attr::sched_priority warning
  sched: Fix up scheduler syscall LTP fails
  sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
  sched/core: Fix htmldocs warnings
  sched/deadline: No need to check p if dl_se is valid
  sched/deadline: Remove unused variables
  sched/deadline: Fix sparse static warnings
  m68k: Fix build warning in mac_via.h
  sched, thermal: Clean up preempt_enable_no_resched() abuse
  sched, net: Fixup busy_loop_us_clock()
  sched, net: Clean up preempt_enable_no_resched() abuse
  sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
  sched/preempt, locking: Rework local_bh_{dis,en}able()
  sched/clock, x86: Avoid a runtime condition in native_sched_clock()
  sched/clock: Fix up clear_sched_clock_stable()
  sched/clock, x86: Use a static_key for sched_clock_stable
  sched/clock: Remove local_irq_disable() from the clocks
  sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull scheduler changes from Ingo Molnar:

 - Add the initial implementation of SCHED_DEADLINE support: a real-time
   scheduling policy where tasks that meet their deadlines and
   periodically execute their instances in less than their runtime quota
   see real-time scheduling and won't miss any of their deadlines.
   Tasks that go over their quota get delayed (Available to privileged
   users for now)

 - Clean up and fix preempt_enable_no_resched() abuse all around the
   tree

 - Do sched_clock() performance optimizations on x86 and elsewhere

 - Fix and improve auto-NUMA balancing

 - Fix and clean up the idle loop

 - Apply various cleanups and fixes

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  sched: Fix __sched_setscheduler() nice test
  sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
  sched: Fix up attr::sched_priority warning
  sched: Fix up scheduler syscall LTP fails
  sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
  sched/core: Fix htmldocs warnings
  sched/deadline: No need to check p if dl_se is valid
  sched/deadline: Remove unused variables
  sched/deadline: Fix sparse static warnings
  m68k: Fix build warning in mac_via.h
  sched, thermal: Clean up preempt_enable_no_resched() abuse
  sched, net: Fixup busy_loop_us_clock()
  sched, net: Clean up preempt_enable_no_resched() abuse
  sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
  sched/preempt, locking: Rework local_bh_{dis,en}able()
  sched/clock, x86: Avoid a runtime condition in native_sched_clock()
  sched/clock: Fix up clear_sched_clock_stable()
  sched/clock, x86: Use a static_key for sched_clock_stable
  sched/clock: Remove local_irq_disable() from the clocks
  sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>futexes: Fix futex_hashsize initialization</title>
<updated>2014-01-16T14:14:32+00:00</updated>
<author>
<name>Heiko Carstens</name>
<email>heiko.carstens@de.ibm.com</email>
</author>
<published>2014-01-16T13:54:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=63b1a81699c2a45c9f737419b1ec1da0ecf92812'/>
<id>63b1a81699c2a45c9f737419b1ec1da0ecf92812</id>
<content type='text'>
"futexes: Increase hash table size for better performance"
introduces a new alloc_large_system_hash() call.

alloc_large_system_hash() however may allocate less memory than
requested, e.g. limited by MAX_ORDER.

Hence pass a pointer to alloc_large_system_hash() which will
contain the hash shift when the function returns. Afterwards
correctly set futex_hashsize.

Fixes a crash on s390 where the requested allocation size was
4MB but only 1MB was allocated.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Darren Hart &lt;dvhart@linux.intel.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Waiman Long &lt;Waiman.Long@hp.com&gt;
Cc: Jason Low &lt;jason.low2@hp.com&gt;
Cc: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
"futexes: Increase hash table size for better performance"
introduces a new alloc_large_system_hash() call.

alloc_large_system_hash() however may allocate less memory than
requested, e.g. limited by MAX_ORDER.

Hence pass a pointer to alloc_large_system_hash() which will
contain the hash shift when the function returns. Afterwards
correctly set futex_hashsize.

Fixes a crash on s390 where the requested allocation size was
4MB but only 1MB was allocated.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Darren Hart &lt;dvhart@linux.intel.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Waiman Long &lt;Waiman.Long@hp.com&gt;
Cc: Jason Low &lt;jason.low2@hp.com&gt;
Cc: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>rtmutex: Turn the plist into an rb-tree</title>
<updated>2014-01-13T12:41:50+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2013-11-07T13:43:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fb00aca474405f4fa8a8519c3179fed722eabd83'/>
<id>fb00aca474405f4fa8a8519c3179fed722eabd83</id>
<content type='text'>
Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.

This is done mainly because:
 - classical prio field of the plist is just an int, which might
   not be enough for representing a deadline;
 - manipulating such a list would become O(nr_deadline_tasks),
   which might be to much, as the number of -deadline task increases.

Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
 - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
   one with the higher (lower, actually!) prio wins;
 - among a -priority and a -deadline task, the latter always wins;
 - among two -deadline tasks, the one with the earliest deadline
   wins.

Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.

Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Dario Faggioli &lt;raistlin@linux.it&gt;
Signed-off-by: Juri Lelli &lt;juri.lelli@gmail.com&gt;
Signed-off-again-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.

This is done mainly because:
 - classical prio field of the plist is just an int, which might
   not be enough for representing a deadline;
 - manipulating such a list would become O(nr_deadline_tasks),
   which might be to much, as the number of -deadline task increases.

Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
 - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
   one with the higher (lower, actually!) prio wins;
 - among a -priority and a -deadline task, the latter always wins;
 - among two -deadline tasks, the one with the earliest deadline
   wins.

Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.

Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Dario Faggioli &lt;raistlin@linux.it&gt;
Signed-off-by: Juri Lelli &lt;juri.lelli@gmail.com&gt;
Signed-off-again-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>futexes: Avoid taking the hb-&gt;lock if there's nothing to wake up</title>
<updated>2014-01-13T10:45:21+00:00</updated>
<author>
<name>Davidlohr Bueso</name>
<email>davidlohr@hp.com</email>
</author>
<published>2014-01-12T23:31:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b0c29f79ecea0b6fbcefc999e70f2843ae8306db'/>
<id>b0c29f79ecea0b6fbcefc999e70f2843ae8306db</id>
<content type='text'>
In futex_wake() there is clearly no point in taking the hb-&gt;lock
if we know beforehand that there are no tasks to be woken. While
the hash bucket's plist head is a cheap way of knowing this, we
cannot rely 100% on it as there is a racy window between the
futex_wait call and when the task is actually added to the
plist. To this end, we couple it with the spinlock check as
tasks trying to enter the critical region are most likely
potential waiters that will be added to the plist, thus
preventing tasks sleeping forever if wakers don't acknowledge
all possible waiters.

Furthermore, the futex ordering guarantees are preserved,
ensuring that waiters either observe the changed user space
value before blocking or is woken by a concurrent waker. For
wakers, this is done by relying on the barriers in
get_futex_key_refs() -- for archs that do not have implicit mb
in atomic_inc(), we explicitly add them through a new
futex_get_mm function. For waiters we rely on the fact that
spin_lock calls already update the head counter, so spinners
are visible even if the lock hasn't been acquired yet.

For more details please refer to the updated comments in the
code and related discussion:

  https://lkml.org/lkml/2013/11/26/556

Special thanks to tglx for careful review and feedback.

Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Reviewed-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Jeff Mahoney &lt;jeffm@suse.com&gt;
Cc: Scott Norton &lt;scott.norton@hp.com&gt;
Cc: Tom Vaden &lt;tom.vaden@hp.com&gt;
Cc: Aswin Chandramouleeswaran &lt;aswin@hp.com&gt;
Cc: Waiman Long &lt;Waiman.Long@hp.com&gt;
Cc: Jason Low &lt;jason.low2@hp.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In futex_wake() there is clearly no point in taking the hb-&gt;lock
if we know beforehand that there are no tasks to be woken. While
the hash bucket's plist head is a cheap way of knowing this, we
cannot rely 100% on it as there is a racy window between the
futex_wait call and when the task is actually added to the
plist. To this end, we couple it with the spinlock check as
tasks trying to enter the critical region are most likely
potential waiters that will be added to the plist, thus
preventing tasks sleeping forever if wakers don't acknowledge
all possible waiters.

Furthermore, the futex ordering guarantees are preserved,
ensuring that waiters either observe the changed user space
value before blocking or is woken by a concurrent waker. For
wakers, this is done by relying on the barriers in
get_futex_key_refs() -- for archs that do not have implicit mb
in atomic_inc(), we explicitly add them through a new
futex_get_mm function. For waiters we rely on the fact that
spin_lock calls already update the head counter, so spinners
are visible even if the lock hasn't been acquired yet.

For more details please refer to the updated comments in the
code and related discussion:

  https://lkml.org/lkml/2013/11/26/556

Special thanks to tglx for careful review and feedback.

Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Reviewed-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Jeff Mahoney &lt;jeffm@suse.com&gt;
Cc: Scott Norton &lt;scott.norton@hp.com&gt;
Cc: Tom Vaden &lt;tom.vaden@hp.com&gt;
Cc: Aswin Chandramouleeswaran &lt;aswin@hp.com&gt;
Cc: Waiman Long &lt;Waiman.Long@hp.com&gt;
Cc: Jason Low &lt;jason.low2@hp.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>futexes: Document multiprocessor ordering guarantees</title>
<updated>2014-01-13T10:45:19+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2014-01-12T23:31:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=99b60ce69734dfeda58c6184a326b9475ce1dba3'/>
<id>99b60ce69734dfeda58c6184a326b9475ce1dba3</id>
<content type='text'>
That's essential, if you want to hack on futexes.

Reviewed-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Reviewed-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Reviewed-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Jeff Mahoney &lt;jeffm@suse.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: Scott Norton &lt;scott.norton@hp.com&gt;
Cc: Tom Vaden &lt;tom.vaden@hp.com&gt;
Cc: Aswin Chandramouleeswaran &lt;aswin@hp.com&gt;
Cc: Waiman Long &lt;Waiman.Long@hp.com&gt;
Cc: Jason Low &lt;jason.low2@hp.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
That's essential, if you want to hack on futexes.

Reviewed-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Reviewed-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Reviewed-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Jeff Mahoney &lt;jeffm@suse.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: Scott Norton &lt;scott.norton@hp.com&gt;
Cc: Tom Vaden &lt;tom.vaden@hp.com&gt;
Cc: Aswin Chandramouleeswaran &lt;aswin@hp.com&gt;
Cc: Waiman Long &lt;Waiman.Long@hp.com&gt;
Cc: Jason Low &lt;jason.low2@hp.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>futexes: Increase hash table size for better performance</title>
<updated>2014-01-13T10:45:18+00:00</updated>
<author>
<name>Davidlohr Bueso</name>
<email>davidlohr@hp.com</email>
</author>
<published>2014-01-12T23:31:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a52b89ebb6d4499be38780db8d176c5d3a6fbc17'/>
<id>a52b89ebb6d4499be38780db8d176c5d3a6fbc17</id>
<content type='text'>
Currently, the futex global hash table suffers from its fixed,
smallish (for today's standards) size of 256 entries, as well as
its lack of NUMA awareness. Large systems, using many futexes,
can be prone to high amounts of collisions; where these futexes
hash to the same bucket and lead to extra contention on the same
hb-&gt;lock. Furthermore, cacheline bouncing is a reality when we
have multiple hb-&gt;locks residing on the same cacheline and
different futexes hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small
systems, or otherwise, 256 * ncpus (or larger as we need to
round the number to a power of 2). Note that this number of CPUs
accounts for all CPUs that can ever be available in the system,
taking into consideration things like hotpluging. While we do
impose extra overhead at bootup by making the hash table larger,
this is a one time thing, and does not shadow the benefits of
this patch.

Furthermore, as suggested by tglx, by cache aligning the hash
buckets we can avoid access across cacheline boundaries and also
avoid massive cache line bouncing if multiple cpus are hammering
away at different hash buckets which happen to reside in the
same cache line.

Also, similar to other core kernel components (pid, dcache,
tcp), by using alloc_large_system_hash() we benefit from its
NUMA awareness and thus the table is distributed among the nodes
instead of in a single one.

For a custom microbenchmark that pounds on the uaddr hashing --
making the wait path fail at futex_wait_setup() returning
-EWOULDBLOCK for large amounts of futexes, we can see the
following benefits on a 80-core, 8-socket 1Tb server:

 +---------+--------------------+------------------------+-----------------------+-------------------------------+
 | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
 +---------+--------------------+------------------------+-----------------------+-------------------------------+
 |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
 |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
 |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
 |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
 |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
 |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
 +---------+--------------------+------------------------+-----------------------+-------------------------------+

Reviewed-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Reviewed-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Reviewed-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Reviewed-by: Waiman Long &lt;Waiman.Long@hp.com&gt;
Reviewed-and-tested-by: Jason Low &lt;jason.low2@hp.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Jeff Mahoney &lt;jeffm@suse.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Scott Norton &lt;scott.norton@hp.com&gt;
Cc: Tom Vaden &lt;tom.vaden@hp.com&gt;
Cc: Aswin Chandramouleeswaran &lt;aswin@hp.com&gt;
Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently, the futex global hash table suffers from its fixed,
smallish (for today's standards) size of 256 entries, as well as
its lack of NUMA awareness. Large systems, using many futexes,
can be prone to high amounts of collisions; where these futexes
hash to the same bucket and lead to extra contention on the same
hb-&gt;lock. Furthermore, cacheline bouncing is a reality when we
have multiple hb-&gt;locks residing on the same cacheline and
different futexes hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small
systems, or otherwise, 256 * ncpus (or larger as we need to
round the number to a power of 2). Note that this number of CPUs
accounts for all CPUs that can ever be available in the system,
taking into consideration things like hotpluging. While we do
impose extra overhead at bootup by making the hash table larger,
this is a one time thing, and does not shadow the benefits of
this patch.

Furthermore, as suggested by tglx, by cache aligning the hash
buckets we can avoid access across cacheline boundaries and also
avoid massive cache line bouncing if multiple cpus are hammering
away at different hash buckets which happen to reside in the
same cache line.

Also, similar to other core kernel components (pid, dcache,
tcp), by using alloc_large_system_hash() we benefit from its
NUMA awareness and thus the table is distributed among the nodes
instead of in a single one.

For a custom microbenchmark that pounds on the uaddr hashing --
making the wait path fail at futex_wait_setup() returning
-EWOULDBLOCK for large amounts of futexes, we can see the
following benefits on a 80-core, 8-socket 1Tb server:

 +---------+--------------------+------------------------+-----------------------+-------------------------------+
 | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
 +---------+--------------------+------------------------+-----------------------+-------------------------------+
 |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
 |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
 |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
 |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
 |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
 |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
 +---------+--------------------+------------------------+-----------------------+-------------------------------+

Reviewed-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Reviewed-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Reviewed-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Reviewed-by: Waiman Long &lt;Waiman.Long@hp.com&gt;
Reviewed-and-tested-by: Jason Low &lt;jason.low2@hp.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Jeff Mahoney &lt;jeffm@suse.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Scott Norton &lt;scott.norton@hp.com&gt;
Cc: Tom Vaden &lt;tom.vaden@hp.com&gt;
Cc: Aswin Chandramouleeswaran &lt;aswin@hp.com&gt;
Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>futexes: Clean up various details</title>
<updated>2014-01-13T10:45:17+00:00</updated>
<author>
<name>Jason Low</name>
<email>jason.low2@hp.com</email>
</author>
<published>2014-01-12T23:31:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0d00c7b20c7716ce08399570ea48813ecf001aa8'/>
<id>0d00c7b20c7716ce08399570ea48813ecf001aa8</id>
<content type='text'>
- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Reviewed-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Reviewed-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Reviewed-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Jason Low &lt;jason.low2@hp.com&gt;
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Jeff Mahoney &lt;jeffm@suse.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Scott Norton &lt;scott.norton@hp.com&gt;
Cc: Tom Vaden &lt;tom.vaden@hp.com&gt;
Cc: Aswin Chandramouleeswaran &lt;aswin@hp.com&gt;
Cc: Waiman Long &lt;Waiman.Long@hp.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Reviewed-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Reviewed-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Reviewed-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Jason Low &lt;jason.low2@hp.com&gt;
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Jeff Mahoney &lt;jeffm@suse.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Scott Norton &lt;scott.norton@hp.com&gt;
Cc: Tom Vaden &lt;tom.vaden@hp.com&gt;
Cc: Aswin Chandramouleeswaran &lt;aswin@hp.com&gt;
Cc: Waiman Long &lt;Waiman.Long@hp.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
