| author | Linus Torvalds <torvalds@linux-foundation.org> | 2026-04-14 10:27:07 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2026-04-14 10:27:07 -0700 |
| commit | c1fe867b5bf9c57ab7856486d342720e2b205eed (patch) | |
| tree | 3c9a31afe6b81498b821304dfa107dad7ee2da60 /kernel/sched | |
| parent | 1d5e40351e7d521d7d143447d57315b6eb1e1160 (diff) | |
| parent | ff1c0c5d07028a84837950b619d30da623f8ddb2 (diff) | |
Merge tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
- A rework of the hrtimer subsystem to reduce the overhead for
frequently armed timers, especially the hrtick scheduler timer:
- Better timer locality decision
 - Simplified evaluation of the first expiry time by keeping track
   of the neighbor timers in the RB-tree, via an RB-tree variant
   with neighbor links. That avoids walking the RB-tree on removal
   to find the next expiry time and, more importantly, allows a
   quick check of whether a rearmed timer's modified expiry time
   changes its position in the RB-tree. If not, the dequeue/enqueue
   sequence, either of which can end up rebalancing the tree, can
   be avoided completely.
- Deferred reprogramming of the underlying clock event device. This
  optimizes the situation where a hrtimer callback sets the need
  resched bit. In that case the code attempts to defer the
  reprogramming of the clock event device up to the point where the
  scheduler has picked the next task and armed the next hrtick
  timer. If there is no immediate reschedule, or soft interrupts
  have to be handled before reaching the reschedule point in the
  interrupt entry code, the clock event is reprogrammed in one of
  those code paths to prevent the timer from becoming stale.
- Support for clocksource-coupled clockevents

  The TSC deadline timer is coupled to the TSC: the next event is
  programmed in TSC time. Currently this is done by converting the
  CLOCK_MONOTONIC based expiry value into a relative timeout,
  converting that into TSC ticks, reading the TSC, adding the delta
  ticks and writing the deadline MSR.

  As the timekeeping core already has the conversion factors for the
  TSC, the whole back-and-forth conversion can be avoided entirely.
  The timekeeping core calculates the reverse conversion factors
  from nanoseconds to TSC ticks and utilizes the base timestamps of
  TSC and CLOCK_MONOTONIC, which are updated once per tick. This
  allows a direct conversion into the TSC deadline value without
  reading the time and, as a bonus, keeps the deadline conversion in
  sync with the TSC conversion factors, which are updated by
  adjtimex() on systems with NTP/PTP enabled.
- Allow inlining of the clocksource read and clockevent write
functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.
With all these enhancements in place, a hrtick-enabled scheduler
provides the same performance as one without hrtick. Other hrtimer
users obviously benefit from these optimizations as well.
- Robustness improvements and cleanups of historical sins in the
hrtimer and timekeeping code.
- Rewrite of the clocksource watchdog.
The clocksource watchdog code has over time reached the state of an
impenetrable maze of duct tape and staples. The original design,
conceived for systems far smaller than today's, is based on the
assumption that the clocksource to be monitored (TSC) can be
trivially compared against a clocksource known to be stable
(HPET/ACPI-PM timer).
Over the years this rather naive approach turned out to have major
flaws. Long delays between watchdog invocations can cause wrap
arounds of the reference clocksource. Access to the reference
clocksource degrades on large multi-socket systems due to
interconnect congestion. This has been addressed with various
heuristics, which degraded the accuracy of the watchdog to the
point that it fails to detect actual TSC problems on older hardware
that exhibits slow inter-CPU drift due to firmware manipulating the
TSC to hide SMI time.
The rewrite addresses this by:
 - Restricting the validation against the reference clocksource to
   the boot CPU, which is usually closest to the legacy block that
   contains the reference clocksource (HPET/ACPI-PM)

 - Doing a round-robin validation between the boot CPU and the
   other CPUs based only on the TSC, with an algorithm similar to
   the TSC synchronization code used during CPU hotplug

 - Being more lenient about remote timeouts
- The usual tiny fixes, cleanups and enhancements all over the place
* tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
alarmtimer: Access timerqueue node under lock in suspend
hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check
posix-timers: Fix stale function name in comment
timers: Get this_cpu once while clearing the idle state
clocksource: Rewrite watchdog code completely
clocksource: Don't use non-continuous clocksources as watchdog
x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly
MIPS: Don't select CLOCKSOURCE_WATCHDOG
parisc: Remove unused clocksource flags
hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
hrtimer: Remove trailing comma after HRTIMER_MAX_CLOCK_BASES
hrtimer: Mark index and clockid of clock base as const
hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
hrtimer: Drop spurious space in 'enum hrtimer_base_type'
hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
hrtimer: Remove hrtimer_get_expires_ns()
timekeeping: Mark offsets array as const
timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
timer_list: Print offset as signed integer
tracing: Use explicit array size instead of sentinel elements in symbol printing
...
Diffstat (limited to 'kernel/sched')
| -rw-r--r-- | kernel/sched/core.c | 91 |
| -rw-r--r-- | kernel/sched/deadline.c | 2 |
| -rw-r--r-- | kernel/sched/fair.c | 55 |
| -rw-r--r-- | kernel/sched/features.h | 5 |
| -rw-r--r-- | kernel/sched/sched.h | 41 |
5 files changed, 127 insertions, 67 deletions
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..4495929f4c9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -872,7 +872,14 @@ void update_rq_clock(struct rq *rq)
  * Use HR-timers to deliver accurate preemption points.
  */
-static void hrtick_clear(struct rq *rq)
+enum {
+	HRTICK_SCHED_NONE		= 0,
+	HRTICK_SCHED_DEFER		= BIT(1),
+	HRTICK_SCHED_START		= BIT(2),
+	HRTICK_SCHED_REARM_HRTIMER	= BIT(3)
+};
+
+static void __used hrtick_clear(struct rq *rq)
 {
 	if (hrtimer_active(&rq->hrtick_timer))
 		hrtimer_cancel(&rq->hrtick_timer);
@@ -897,12 +904,24 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void __hrtick_restart(struct rq *rq)
+static inline bool hrtick_needs_rearm(struct hrtimer *timer, ktime_t expires)
+{
+	/*
+	 * Queued is false when the timer is not started or currently
+	 * running the callback. In both cases, restart. If queued check
+	 * whether the expiry time actually changes substantially.
+	 */
+	return !hrtimer_is_queued(timer) ||
+		abs(expires - hrtimer_get_expires(timer)) > 5000;
+}
+
+static void hrtick_cond_restart(struct rq *rq)
 {
 	struct hrtimer *timer = &rq->hrtick_timer;
 	ktime_t time = rq->hrtick_time;
 
-	hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
+	if (hrtick_needs_rearm(timer, time))
+		hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
 }
 
 /*
@@ -914,7 +933,7 @@ static void __hrtick_start(void *arg)
 	struct rq_flags rf;
 
 	rq_lock(rq, &rf);
-	__hrtick_restart(rq);
+	hrtick_cond_restart(rq);
 	rq_unlock(rq, &rf);
 }
 
@@ -925,7 +944,6 @@ static void __hrtick_start(void *arg)
  */
 void hrtick_start(struct rq *rq, u64 delay)
 {
-	struct hrtimer *timer = &rq->hrtick_timer;
 	s64 delta;
 
 	/*
@@ -933,27 +951,67 @@ void hrtick_start(struct rq *rq, u64 delay)
 	 * doesn't make sense and can cause timer DoS.
 	 */
 	delta = max_t(s64, delay, 10000LL);
-	rq->hrtick_time = ktime_add_ns(hrtimer_cb_get_time(timer), delta);
+
+	/*
+	 * If this is in the middle of schedule() only note the delay
+	 * and let hrtick_schedule_exit() deal with it.
+	 */
+	if (rq->hrtick_sched) {
+		rq->hrtick_sched |= HRTICK_SCHED_START;
+		rq->hrtick_delay = delta;
+		return;
+	}
+
+	rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
+	if (!hrtick_needs_rearm(&rq->hrtick_timer, rq->hrtick_time))
+		return;
 
 	if (rq == this_rq())
-		__hrtick_restart(rq);
+		hrtimer_start(&rq->hrtick_timer, rq->hrtick_time, HRTIMER_MODE_ABS_PINNED_HARD);
 	else
 		smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
 }
 
-static void hrtick_rq_init(struct rq *rq)
+static inline void hrtick_schedule_enter(struct rq *rq)
 {
-	INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
-	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+	rq->hrtick_sched = HRTICK_SCHED_DEFER;
+	if (hrtimer_test_and_clear_rearm_deferred())
+		rq->hrtick_sched |= HRTICK_SCHED_REARM_HRTIMER;
 }
 
-#else /* !CONFIG_SCHED_HRTICK: */
-static inline void hrtick_clear(struct rq *rq)
+static inline void hrtick_schedule_exit(struct rq *rq)
 {
+	if (rq->hrtick_sched & HRTICK_SCHED_START) {
+		rq->hrtick_time = ktime_add_ns(ktime_get(), rq->hrtick_delay);
+		hrtick_cond_restart(rq);
+	} else if (idle_rq(rq)) {
+		/*
+		 * No need for using hrtimer_is_active(). The timer is CPU local
+		 * and interrupts are disabled, so the callback cannot be
+		 * running and the queued state is valid.
+		 */
+		if (hrtimer_is_queued(&rq->hrtick_timer))
+			hrtimer_cancel(&rq->hrtick_timer);
+	}
+
+	if (rq->hrtick_sched & HRTICK_SCHED_REARM_HRTIMER)
+		__hrtimer_rearm_deferred();
+
+	rq->hrtick_sched = HRTICK_SCHED_NONE;
 }
 
-static inline void hrtick_rq_init(struct rq *rq)
+static void hrtick_rq_init(struct rq *rq)
 {
+	INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
+	rq->hrtick_sched = HRTICK_SCHED_NONE;
+	hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
+		      HRTIMER_MODE_REL_HARD | HRTIMER_MODE_LAZY_REARM);
 }
 
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline void hrtick_clear(struct rq *rq) { }
+static inline void hrtick_rq_init(struct rq *rq) { }
+static inline void hrtick_schedule_enter(struct rq *rq) { }
+static inline void hrtick_schedule_exit(struct rq *rq) { }
 #endif /* !CONFIG_SCHED_HRTICK */
 
 /*
@@ -5032,6 +5090,7 @@ static inline void finish_lock_switch(struct rq *rq)
 	 */
 	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	__balance_callbacks(rq, NULL);
+	hrtick_schedule_exit(rq);
 	raw_spin_rq_unlock_irq(rq);
 }
 
@@ -6785,9 +6844,6 @@ static void __sched notrace __schedule(int sched_mode)
 
 	schedule_debug(prev, preempt);
 
-	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
-		hrtick_clear(rq);
-
 	klp_sched_try_switch(prev);
 
 	local_irq_disable();
@@ -6814,6 +6870,8 @@ static void __sched notrace __schedule(int sched_mode)
 	rq_lock(rq, &rf);
 	smp_mb__after_spinlock();
 
+	hrtick_schedule_enter(rq);
+
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;
 	update_rq_clock(rq);
@@ -6916,6 +6974,7 @@ keep_resched:
 
 	rq_unpin_lock(rq, &rf);
 	__balance_callbacks(rq, NULL);
+	hrtick_schedule_exit(rq);
 	raw_spin_rq_unlock_irq(rq);
 	}
 	trace_sched_exit_tp(is_switch);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 674de6a48551..8c3c1fe8d3a6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1097,7 +1097,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 		act = ns_to_ktime(dl_next_period(dl_se));
 	}
 
-	now = hrtimer_cb_get_time(timer);
+	now = ktime_get();
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	act = ktime_add_ns(act, delta);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be7..2be80780ff51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5600,7 +5600,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * validating it and just reschedule.
 	 */
 	if (queued) {
-		resched_curr_lazy(rq_of(cfs_rq));
+		resched_curr(rq_of(cfs_rq));
 		return;
 	}
 #endif
@@ -6805,27 +6805,41 @@ static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	unsigned long scale = 1024;
+	unsigned long util = 0;
+	u64 vdelta;
+	u64 delta;
 
 	WARN_ON_ONCE(task_rq(p) != rq);
 
-	if (rq->cfs.h_nr_queued > 1) {
-		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
-		u64 slice = se->slice;
-		s64 delta = slice - ran;
+	if (rq->cfs.h_nr_queued <= 1)
+		return;
 
-		if (delta < 0) {
-			if (task_current_donor(rq, p))
-				resched_curr(rq);
-			return;
-		}
-		hrtick_start(rq, delta);
+	/*
+	 * Compute time until virtual deadline
+	 */
+	vdelta = se->deadline - se->vruntime;
+	if ((s64)vdelta < 0) {
+		if (task_current_donor(rq, p))
+			resched_curr(rq);
+		return;
 	}
+	delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+
+	/*
+	 * Correct for instantaneous load of other classes.
+	 */
+	util += cpu_util_irq(rq);
+	if (util && util < 1024) {
+		scale *= 1024;
+		scale /= (1024 - util);
+	}
+
+	hrtick_start(rq, (scale * delta) / 1024);
 }
 
 /*
- * called from enqueue/dequeue and updates the hrtick when the
- * current task is from our class and nr_running is low enough
- * to matter.
+ * Called on enqueue to start the hrtick when h_nr_queued becomes more than 1.
  */
 static void hrtick_update(struct rq *rq)
 {
@@ -6834,6 +6848,9 @@ static void hrtick_update(struct rq *rq)
 	if (!hrtick_enabled_fair(rq) || donor->sched_class != &fair_sched_class)
 		return;
 
+	if (hrtick_active(rq))
+		return;
+
 	hrtick_start_fair(rq, donor);
 }
 #else /* !CONFIG_SCHED_HRTICK: */
@@ -7156,9 +7173,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
 
-		/* Fix-up what dequeue_task_fair() skipped */
-		hrtick_update(rq);
-
 		/*
 		 * Fix-up what block_task() skipped.
 		 *
@@ -7192,8 +7206,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	/*
 	 * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
 	 */
-
-	hrtick_update(rq);
 	return true;
 }
@@ -13435,11 +13447,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
 
-	if (queued) {
-		if (!need_resched())
-			hrtick_start_fair(rq, curr);
+	if (queued)
 		return;
-	}
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 136a6584be79..d06228462607 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
 
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+SCHED_FEAT(HRTICK, true)
+SCHED_FEAT(HRTICK_DL, true)
+#else
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
+#endif
 
 /*
  * Decrement CPU capacity based on time not spent running tasks
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1ef9ba480f51..a67c73ecdf79 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1288,6 +1288,8 @@ struct rq {
 	call_single_data_t	hrtick_csd;
 	struct hrtimer		hrtick_timer;
 	ktime_t			hrtick_time;
+	ktime_t			hrtick_delay;
+	unsigned int		hrtick_sched;
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
@@ -3033,46 +3035,31 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
  * - enabled by features
  * - hrtimer is actually high res
  */
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_enabled(struct rq *rq)
 {
-	if (!cpu_active(cpu_of(rq)))
-		return 0;
-	return hrtimer_is_hres_active(&rq->hrtick_timer);
+	return cpu_active(cpu_of(rq)) && hrtimer_highres_enabled();
 }
 
-static inline int hrtick_enabled_fair(struct rq *rq)
+static inline bool hrtick_enabled_fair(struct rq *rq)
 {
-	if (!sched_feat(HRTICK))
-		return 0;
-	return hrtick_enabled(rq);
+	return sched_feat(HRTICK) && hrtick_enabled(rq);
 }
 
-static inline int hrtick_enabled_dl(struct rq *rq)
+static inline bool hrtick_enabled_dl(struct rq *rq)
 {
-	if (!sched_feat(HRTICK_DL))
-		return 0;
-	return hrtick_enabled(rq);
+	return sched_feat(HRTICK_DL) && hrtick_enabled(rq);
 }
 
 extern void hrtick_start(struct rq *rq, u64 delay);
-
-#else /* !CONFIG_SCHED_HRTICK: */
-
-static inline int hrtick_enabled_fair(struct rq *rq)
-{
-	return 0;
-}
-
-static inline int hrtick_enabled_dl(struct rq *rq)
-{
-	return 0;
-}
-
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_active(struct rq *rq)
 {
-	return 0;
+	return hrtimer_active(&rq->hrtick_timer);
 }
 
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline bool hrtick_enabled_fair(struct rq *rq) { return false; }
+static inline bool hrtick_enabled_dl(struct rq *rq) { return false; }
+static inline bool hrtick_enabled(struct rq *rq) { return false; }
 #endif /* !CONFIG_SCHED_HRTICK */
 
 #ifndef arch_scale_freq_tick
