linux-toradex.git/kernel/locking/qspinlock.c, branch v4.6-rc3

locking/qspinlock: Use smp_cond_acquire() in pending code

2016-02-29T09:02:42+00:00

The newly introduced smp_cond_acquire() was used to replace the
slowpath lock acquisition loop. Similarly, the new function can also
be applied to the pending bit locking loop. This patch uses the new
function in that loop.

Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Douglas Hatch 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1449778666-13593-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar

locking/pvqspinlock: Queue node adaptive spinning

2015-12-04T10:39:51+00:00

In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue would have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon as the earlier ones have to be woken up and take
their turn to get the lock.

This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the previous vCPU is not running.

Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

		    Westmere			Haswell
  Patch		32 vCPUs    48 vCPUs	32 vCPUs    48 vCPUs
  -----		--------    --------    --------    --------
  Before patch   3m02.3s     5m00.2s     1m43.7s     3m03.5s
  After patch    3m03.0s     4m37.5s	 1m43.0s     2m47.2s

For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about 8%
performance improvement.

Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Davidlohr Bueso 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar

locking/pvqspinlock: Allow limited lock stealing

2015-12-04T10:39:51+00:00

This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath. To prevent lock starvation, the pending
bit will be set by the queue head vCPU when it is in the active lock
spinning loop to disable any lock stealing attempt.  This helps to
reduce the performance penalty caused by lock waiter preemption while
not having much of the downsides of a real unfair lock.

The pv_wait_head() function was renamed as pv_wait_head_or_lock()
as it was modified to acquire the lock before returning. This is
necessary because of possible lock stealing attempts from other tasks.

Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                    Westmere                    Haswell
  Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
  -----         --------    --------    --------    --------
  Before patch   3m15.6s    10m56.1s     1m44.1s     5m29.1s
  After patch    3m02.3s     5m00.2s     1m43.7s     3m03.5s

For the overcommited case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.

Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Davidlohr Bueso 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar

locking, sched: Introduce smp_cond_acquire() and use it

2015-12-04T09:33:41+00:00

Introduce smp_cond_acquire() which combines a control dependency and a
read barrier to form acquire semantics.

This primitive has two benefits:

 - it documents control dependencies,
 - its typically cheaper than using smp_load_acquire() in a loop.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Signed-off-by: Ingo Molnar

locking/qspinlock: Avoid redundant read of next pointer

2015-11-23T09:02:00+00:00

With optimistic prefetch of the next node cacheline, the next pointer
may have been properly inititalized. As a result, the reading
of node->next in the contended path may be redundant. This patch
eliminates the redundant read if the next pointer value is not NULL.

Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Davidlohr Bueso 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar

locking/qspinlock: Prefetch the next node cacheline

2015-11-23T09:01:59+00:00

A queue head CPU, after acquiring the lock, will have to notify
the next CPU in the wait queue that it has became the new queue
head. This involves loading a new cacheline from the MCS node of the
next CPU. That operation can be expensive and add to the latency of
locking operation.

This patch addes code to optmistically prefetch the next MCS node
cacheline if the next pointer is defined and it has been spinning
for the MCS lock for a while. This reduces the locking latency and
improves the system throughput.

The performance change will depend on whether the prefetch overhead
can be hidden within the latency of the lock spin loop. On really
short critical section, there may not be performance gain at all. With
longer critical section, however, it was found to have a performance
boost of 5-10% over a range of different queue depths with a spinlock
loop microbenchmark.

Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Davidlohr Bueso 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar

locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg()

2015-11-23T09:01:58+00:00

This patch replaces the cmpxchg() and xchg() calls in the native
qspinlock code with the more relaxed _acquire or _release versions of
those calls to enable other architectures to adopt queued spinlocks
with less memory barrier performance overhead.

Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Davidlohr Bueso 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1447114167-47185-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar

locking/qspinlock/x86: Fix performance regression under unaccelerated VMs

2015-09-11T05:49:42+00:00

Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS
set and Linus noted that the test-and-set implementation was retarded.

One should spin on the variable with a load, not a RMW.

While there, remove 'queued' from the name, as the lock isn't queued
at all, but a simple test-and-set.

Suggested-by: Linus Torvalds 
Reported-by: Dave Chinner 
Tested-by: Dave Chinner 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Cc: stable@vger.kernel.org # v4.2+
Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar

locking/pvqspinlock: Only kick CPU at unlock time

2015-08-03T08:57:11+00:00

For an over-committed guest with more vCPUs than physical CPUs
available, it is possible that a vCPU may be kicked twice before
getting the lock - once before it becomes queue head and once again
before it gets the lock. All these CPU kicking and halting (VMEXIT)
can be expensive and slow down system performance.

This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until at unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU. The original vcpu_halted
state will be used by pv_wait_node() only to differentiate other
queue nodes from the qeue head.

Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1436647018-49734-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar

locking/pvqspinlock: Implement simple paravirt support for the qspinlock

2015-05-08T10:37:05+00:00

Provide a separate (second) version of the spin_lock_slowpath for
paravirt along with a special unlock path.

The second slowpath is generated by adding a few pv hooks to the
normal slowpath, but where those will compile away for the native
case, they expand into special wait/wake code for the pv version.

The actual MCS queue can use extra storage in the mcs_nodes[] array to
keep track of state and therefore uses directed wakeups.

The head contender has no such storage directly visible to the
unlocker.  So the unlocker searches a hash table with open addressing
using a simple binary Galois linear feedback shift register.

Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Boris Ostrovsky 
Cc: Borislav Petkov 
Cc: Daniel J Blueman 
Cc: David Vrabel 
Cc: Douglas Hatch 
Cc: H. Peter Anvin 
Cc: Konrad Rzeszutek Wilk 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Paolo Bonzini 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Raghavendra K T 
Cc: Rik van Riel 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1429901803-29771-9-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar