linux-toradex.git/include/linux/timer.h, branch v4.8-rc1

timers/core: Convert to hotplug state machine

2016-07-15T08:41:42+00:00

When tearing down, call timers_dead_cpu() before notify_dead().
There is a hidden dependency between:

 - timers
 - block multiqueue
 - rcutree

If timers_dead_cpu() comes later than blk_mq_queue_reinit_notify()
that latter function causes a RCU stall.

Signed-off-by: Richard Cochran 
Signed-off-by: Anna-Maria Gleixner 
Reviewed-by: Sebastian Andrzej Siewior 
Cc: John Stultz 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Rasmus Villemoes 
Cc: Thomas Gleixner 
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.566790058@linutronix.de
Signed-off-by: Ingo Molnar

timers: Remove set_timer_slack() leftovers

2016-07-07T08:35:09+00:00

We now have implicit batching in the timer wheel. The slack API is no longer
used, so remove it.

Signed-off-by: Thomas Gleixner 
Cc: Alan Stern 
Cc: Andrew F. Davis 
Cc: Arjan van de Ven 
Cc: Chris Mason 
Cc: David S. Miller 
Cc: David Woodhouse 
Cc: Dmitry Eremin-Solenikov 
Cc: Eric Dumazet 
Cc: Frederic Weisbecker 
Cc: George Spelvin 
Cc: Greg Kroah-Hartman 
Cc: Jaehoon Chung 
Cc: Jens Axboe 
Cc: John Stultz 
Cc: Josh Triplett 
Cc: Len Brown 
Cc: Linus Torvalds 
Cc: Mathias Nyman 
Cc: Pali Rohár 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sebastian Reichel 
Cc: Ulf Hansson 
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
Signed-off-by: Ingo Molnar

timers: Switch to a non-cascading wheel

2016-07-07T08:35:09+00:00

The current timer wheel has some drawbacks:

1) Cascading:

   Cascading can be an unbound operation and is completely pointless in most
   cases because the vast majority of the timer wheel timers are canceled or
   rearmed before expiration. (They are used as timeout safeguards, not as
   real timers to measure time.)

2) No fast lookup of the next expiring timer:

   In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
   must fast forward the base time to the current value of jiffies. As we
   have no way to find the next expiring timer fast, the code loops linearly
   and increments the base time one by one and checks for expired timers
   in each step. This causes unbound overhead spikes exactly in the moment
   when we should wake up as fast as possible.

After a thorough analysis of real world data gathered on laptops,
workstations, webservers and other machines (thanks Chris!) I came to the
conclusion that the current 'classic' timer wheel implementation can be
modified to address the above issues.

The vast majority of timer wheel timers is canceled or rearmed before
expiry. Most of them are timeouts for networking and other I/O tasks. The
nature of timeouts is to catch the exception from normal operation (TCP ack
timed out, disk does not respond, etc.). For these kinds of timeouts the
accuracy of the timeout is not really a concern. Timeouts are very often
approximate worst-case values and in case the timeout fires, we already
waited for a long time and performance is down the drain already.

The few timers which actually expire can be split into two categories:

 1) Short expiry times which expect halfways accurate expiry

 2) Long term expiry times are inaccurate today already due to the
    batching which is done for NOHZ automatically and also via the
    set_timer_slack() API.

So for long term expiry timers we can avoid the cascading property and just
leave them in the less granular outer wheels until expiry or
cancelation. Timers which are armed with a timeout larger than the wheel
capacity are no longer cascaded. We expire them with the longest possible
timeout (6+ days). We have not observed such timeouts in our data collection,
but at least we handle them, applying the rule of the least surprise.

To avoid extending the wheel levels for HZ=1000 so we can accomodate the
longest observed timeouts (5 days in the network conntrack code) we reduce the
first level granularity on HZ=1000 to 4ms, which effectively is the same as
the HZ=250 behaviour. From our data analysis there is nothing which relies on
that 1ms granularity and as a side effect we get better batching and timer
locality for the networking code as well.

Contrary to the classic wheel the granularity of the next wheel is not the
capacity of the first wheel. The granularities of the wheels are in the
currently chosen setting 8 times the granularity of the previous wheel.

So for HZ=250 we end up with the following granularity levels:

 Level Offset   Granularity                  Range
     0      0          4 ms                 0 ms -        252 ms
     1     64         32 ms               256 ms -       2044 ms (256ms - ~2s)
     2    128        256 ms              2048 ms -      16380 ms (~2s   - ~16s)
     3    192       2048 ms (~2s)       16384 ms -     131068 ms (~16s  - ~2m)
     4    256      16384 ms (~16s)     131072 ms -    1048572 ms (~2m   - ~17m)
     5    320     131072 ms (~2m)     1048576 ms -    8388604 ms (~17m  - ~2h)
     6    384    1048576 ms (~17m)    8388608 ms -   67108863 ms (~2h   - ~18h)
     7    448    8388608 ms (~2h)    67108864 ms -  536870911 ms (~18h  - ~6d)

That's a worst case inaccuracy of 12.5% for the timers which are queued at the
beginning of a level.

So the new wheel concept addresses the old issues:

1) Cascading is avoided completely

2) By keeping the timers in the bucket until expiry/cancelation we can track
   the buckets which have timers enqueued in a bucket bitmap and therefore can
   look up the next expiring timer very fast and O(1).

A further benefit of the concept is that the slack calculation which is done
on every timer start is no longer necessary because the granularity levels
provide natural batching already.

Our extensive testing with various loads did not show any performance
degradation vs. the current wheel implementation.

This patch does not address the 'fast lookup' issue as we wanted to make sure
that there is no regression introduced by the wheel redesign. The
optimizations are in follow up patches.

This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.

Signed-off-by: Thomas Gleixner 
Cc: Arjan van de Ven 
Cc: Chris Mason 
Cc: Eric Dumazet 
Cc: Frederic Weisbecker 
Cc: George Spelvin 
Cc: Josh Triplett 
Cc: Len Brown 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de
Signed-off-by: Ingo Molnar

timers: Reduce the CPU index space to 256k

2016-07-07T08:35:08+00:00

We want to store the array index in the flags space. 256k CPUs should be
enough for a while.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Frederic Weisbecker 
Cc: Arjan van de Ven 
Cc: Chris Mason 
Cc: George Spelvin 
Cc: Josh Triplett 
Cc: Len Brown 
Cc: Linus Torvalds 
Cc: Paul McKenney 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.030144293@linutronix.de
Signed-off-by: Ingo Molnar

timers: Remove the deprecated mod_timer_pinned() API

2016-07-07T08:35:06+00:00

We switched all users to initialize the timers as pinned and call
mod_timer(). Remove the now unused timer API function.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Frederic Weisbecker 
Cc: Arjan van de Ven 
Cc: Chris Mason 
Cc: Eric Dumazet 
Cc: George Spelvin 
Cc: Josh Triplett 
Cc: Len Brown 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.706205231@linutronix.de
Signed-off-by: Ingo Molnar

timers: Make 'pinned' a timer property

2016-07-07T08:25:13+00:00

We want to move the timer migration logic from a 'push' to a 'pull' model.

Under the current 'push' model pinned timers are handled via
a runtime API variant: mod_timer_pinned().

The 'pull' model requires us to store the pinned attribute of a timer
in the timer_list structure itself, as a new TIMER_PINNED bit in
timer->flags.

This flag must be set at initialization time and the timer APIs
recognize the flag.

This patch:

 - Implements the new flag and associated new-style initialization
   methods

 - makes mod_timer() recognize new-style pinned timers,

 - and adds some migration helper facility to allow
   step by step conversion of old-style to new-style
   pinned timers.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Frederic Weisbecker 
Cc: Arjan van de Ven 
Cc: Chris Mason 
Cc: Eric Dumazet 
Cc: George Spelvin 
Cc: Josh Triplett 
Cc: Len Brown 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.049338558@linutronix.de
Signed-off-by: Ingo Molnar

timer: add setup_deferrable_timer macro

2016-05-06T08:42:29+00:00

Add the trivial missing macro to setup a deferrable timer.

Signed-off-by: Lucas Stach 
Acked-by: Thomas Gleixner

timer: Reduce timer migration overhead if disabled

2015-06-19T13:18:28+00:00

Eric reported that the timer_migration sysctl is not really nice
performance wise as it needs to check at every timer insertion whether
the feature is enabled or not. Further the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.

We can do better and store that information in the per cpu (hr)timer
bases. I pondered to use a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.

The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.

With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ
migration is enabled, if the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu


Reported-by: Eric Dumazet 
Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Paul McKenney 
Cc: Frederic Weisbecker 
Cc: Viresh Kumar 
Cc: John Stultz 
Cc: Joonwoo Park 
Cc: Wenbo Wang 
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner

timer: Stats: Simplify the flags handling

2015-06-19T13:18:27+00:00

Simplify the handling of the flag storage for the timer statistics. No
intermediate storage anymore. Just hand over the flags field.

I left the printout of 'deferrable' for now because changing this
would be an ABI update and I have no idea how strong people feel about
that. OTOH, I wonder whether we should kill the whole timer stats
stuff because all of that information can be retrieved via ftrace/perf
as well.

Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Paul McKenney 
Cc: Frederic Weisbecker 
Cc: Eric Dumazet 
Cc: Viresh Kumar 
Cc: John Stultz 
Cc: Joonwoo Park 
Cc: Wenbo Wang 
Link: http://lkml.kernel.org/r/20150526224512.046626248@linutronix.de
Signed-off-by: Thomas Gleixner

timer: Replace timer base by a cpu index

2015-06-19T13:18:27+00:00

Instead of storing a pointer to the per cpu tvec_base we can simply
cache a CPU index in the timer_list and use that to get hold of the
correct per cpu tvec_base. This is only used in lock_timer_base() and
the slightly larger code is peanuts versus the spinlock operation and
the d-cache foot print of the timer wheel.

Aside of that this allows to get rid of following nuisances:

 - boot_tvec_base

   That statically allocated 4k bss data is just kept around so the
   timer has a home when it gets statically initialized. It serves no
   other purpose.

   With the CPU index we assign the timer to CPU0 at static
   initialization time and therefor can avoid the whole boot_tvec_base
   dance.  That also simplifies the init code, which just can use the
   per cpu base.

   Before:
     text	   data	    bss	    dec	    hex	filename
    17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
   After:
     text	   data	    bss	    dec	    hex	filename
    17440	   9193	      0	  26633	   6809	../build/kernel/time/timer.o

 - Overloading the base pointer with various flags

   The CPU index has enough space to hold the flags (deferrable,
   irqsafe) so we can get rid of the extra masking and bit fiddling
   with the base pointer.

As a benefit we reduce the size of struct timer_list on 64 bit
machines. 4 - 8 bytes, a size reduction up to 15% per struct timer_list,
which is a real win as we have tons of them embedded in other structs.

This changes also the newly added deferrable printout of the timer
start trace point to capture and print all timer->flags, which allows
us to decode the target cpu of the timer as well.

We might have used bitfields for this, but that would change the
static initializers and the init function for no value to accomodate
big endian bitfields.

Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Paul McKenney 
Cc: Frederic Weisbecker 
Cc: Eric Dumazet 
Cc: Viresh Kumar 
Cc: John Stultz 
Cc: Joonwoo Park 
Cc: Wenbo Wang 
Cc: Steven Rostedt 
Cc: Badhri Jagan Sridharan 
Link: http://lkml.kernel.org/r/20150526224511.950084301@linutronix.de
Signed-off-by: Thomas Gleixner