author    Linus Torvalds <torvalds@linux-foundation.org>  2026-04-14 10:27:07 -0700
committer Linus Torvalds <torvalds@linux-foundation.org>  2026-04-14 10:27:07 -0700
commit    c1fe867b5bf9c57ab7856486d342720e2b205eed (patch)
tree      3c9a31afe6b81498b821304dfa107dad7ee2da60 /kernel/sched
parent    1d5e40351e7d521d7d143447d57315b6eb1e1160 (diff)
parent    ff1c0c5d07028a84837950b619d30da623f8ddb2 (diff)
Merge tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:

 - A rework of the hrtimer subsystem to reduce the overhead for
   frequently armed timers, especially the hrtick scheduler timer:

     - Better timer locality decisions

     - Simplification of the evaluation of the first expiry time by
       keeping track of the neighbor timers in the RB-tree, using a
       RB-tree variant with neighbor links. That avoids walking the
       RB-tree on removal to find the next expiry time, but more
       importantly allows a quick check whether a rearmed timer's
       modified expiry time changes its position in the RB-tree or
       not. If not, the dequeue/enqueue sequence, both halves of which
       can end up rebalancing the tree, can be avoided entirely.

     - Deferred reprogramming of the underlying clock event device.
       This optimizes for the situation where a hrtimer callback sets
       the need-resched bit. In that case the code attempts to defer
       the reprogramming of the clock event device up to the point
       where the scheduler has picked the next task and has the next
       hrtick timer armed. In case there is no immediate reschedule,
       or soft interrupts have to be handled before reaching the
       reschedule point in the interrupt entry code, the clock event
       is reprogrammed in one of those code paths to prevent the timer
       from becoming stale.

     - Support for clocksource-coupled clockevents.

       The TSC deadline timer is coupled to the TSC: the next event is
       programmed in TSC time. Currently this is done by converting
       the CLOCK_MONOTONIC based expiry value into a relative timeout,
       converting that into TSC ticks, reading the TSC, adding the
       delta ticks and writing the deadline MSR. As the timekeeping
       core already has the conversion factors for the TSC, the whole
       back-and-forth conversion can be avoided completely. The
       timekeeping core calculates the reverse conversion factors from
       nanoseconds to TSC ticks and utilizes the base timestamps of
       TSC and CLOCK_MONOTONIC, which are updated once per tick.

       This allows a direct conversion into the TSC deadline value
       without reading the time and, as a bonus, keeps the deadline
       conversion in sync with the TSC conversion factors, which are
       updated by adjtimex() on systems with NTP/PTP enabled.

     - Allow inlining of the clocksource read and clockevent write
       functions when they are tiny enough, e.g. RDTSC and WRMSR on
       x86.

   With all those enhancements in place a hrtick-enabled scheduler
   provides the same performance as without hrtick. Other hrtimer
   users obviously benefit from these optimizations as well.

 - Robustness improvements and cleanups of historical sins in the
   hrtimer and timekeeping code.

 - Rewrite of the clocksource watchdog.

   The clocksource watchdog code has over time reached the state of an
   impenetrable maze of duct tape and staples. The original design,
   made in the context of systems far smaller than today's, is based
   on the assumption that the clocksource to be monitored (TSC) can be
   trivially compared against a known-stable reference clocksource
   (HPET/ACPI-PM timer). Over the years this rather naive approach
   turned out to have major flaws: long delays between the watchdog
   invocations can cause wraparounds of the reference clocksource, and
   the access to the reference clocksource degrades on large
   multi-socket systems due to interconnect congestion. This has been
   addressed with various heuristics which degraded the accuracy of
   the watchdog to the point that it fails to detect actual TSC
   problems on older hardware, which exposes slow inter-CPU drifts due
   to firmware manipulating the TSC to hide SMI time.

   The rewrite addresses this by:

     - Restricting the validation against the reference clocksource to
       the boot CPU, which is usually closest to the legacy block
       containing the reference clocksource (HPET/ACPI-PM)

     - Doing a round-robin validation between the boot CPU and the
       other CPUs based only on the TSC, with an algorithm similar to
       the TSC synchronization code used during CPU hotplug

     - Being more lenient towards remote timeouts

 - The usual tiny fixes, cleanups and enhancements all over the place

* tag 'timers-core-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  alarmtimer: Access timerqueue node under lock in suspend
  hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check
  posix-timers: Fix stale function name in comment
  timers: Get this_cpu once while clearing the idle state
  clocksource: Rewrite watchdog code completely
  clocksource: Don't use non-continuous clocksources as watchdog
  x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly
  MIPS: Don't select CLOCKSOURCE_WATCHDOG
  parisc: Remove unused clocksource flags
  hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
  hrtimer: Remove trailing comma after HRTIMER_MAX_CLOCK_BASES
  hrtimer: Mark index and clockid of clock base as const
  hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
  hrtimer: Drop spurious space in 'enum hrtimer_base_type'
  hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
  hrtimer: Remove hrtimer_get_expires_ns()
  timekeeping: Mark offsets array as const
  timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
  timer_list: Print offset as signed integer
  tracing: Use explicit array size instead of sentinel elements in symbol printing
  ...
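The coupled-clockevent conversion described above boils down to simple fixed-point arithmetic. The following standalone sketch (not the kernel's code; all names here are hypothetical) shows how a precomputed nanoseconds-to-ticks mult/shift pair plus the per-tick base timestamps of CLOCK_MONOTONIC and the TSC turn an absolute expiry directly into a TSC deadline, with no clock read and no relative-timeout round trip:

```c
#include <stdint.h>

/*
 * Hypothetical snapshot of the timekeeper state mentioned in the log:
 * base timestamps updated once per tick, plus the reverse (ns -> TSC
 * ticks) conversion factors derived from the TSC frequency.
 */
struct ns_to_tsc {
	uint64_t base_ns;   /* CLOCK_MONOTONIC at the last tick, in ns */
	uint64_t base_tsc;  /* TSC reading taken at the same tick */
	uint32_t mult;      /* ns -> ticks multiplier */
	uint32_t shift;     /* ns -> ticks shift */
};

/*
 * Convert an absolute CLOCK_MONOTONIC expiry into a TSC deadline
 * value: ticks = (expiry - base_ns) * mult >> shift, offset from the
 * TSC base. adjtimex() frequency corrections flow in automatically
 * whenever mult/shift are recomputed.
 */
static uint64_t expiry_to_tsc_deadline(const struct ns_to_tsc *c,
				       uint64_t expiry_ns)
{
	uint64_t delta_ns = expiry_ns - c->base_ns;

	return c->base_tsc + ((delta_ns * c->mult) >> c->shift);
}
```

With mult = 5 and shift = 1 (a 2.5 GHz "TSC"), an expiry 1000 ns past the base maps to 2500 ticks past the TSC base.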
Diffstat (limited to 'kernel/sched')
-rw-r--r--  kernel/sched/core.c      91
-rw-r--r--  kernel/sched/deadline.c   2
-rw-r--r--  kernel/sched/fair.c      55
-rw-r--r--  kernel/sched/features.h   5
-rw-r--r--  kernel/sched/sched.h     41
5 files changed, 127 insertions, 67 deletions
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..4495929f4c9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -872,7 +872,14 @@ void update_rq_clock(struct rq *rq)
* Use HR-timers to deliver accurate preemption points.
*/
-static void hrtick_clear(struct rq *rq)
+enum {
+ HRTICK_SCHED_NONE = 0,
+ HRTICK_SCHED_DEFER = BIT(1),
+ HRTICK_SCHED_START = BIT(2),
+ HRTICK_SCHED_REARM_HRTIMER = BIT(3)
+};
+
+static void __used hrtick_clear(struct rq *rq)
{
if (hrtimer_active(&rq->hrtick_timer))
hrtimer_cancel(&rq->hrtick_timer);
@@ -897,12 +904,24 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
-static void __hrtick_restart(struct rq *rq)
+static inline bool hrtick_needs_rearm(struct hrtimer *timer, ktime_t expires)
+{
+ /*
+ * Queued is false when the timer is not started or currently
+ * running the callback. In both cases, restart. If queued check
+ * whether the expiry time actually changes substantially.
+ */
+ return !hrtimer_is_queued(timer) ||
+ abs(expires - hrtimer_get_expires(timer)) > 5000;
+}
+
+static void hrtick_cond_restart(struct rq *rq)
{
struct hrtimer *timer = &rq->hrtick_timer;
ktime_t time = rq->hrtick_time;
- hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
+ if (hrtick_needs_rearm(timer, time))
+ hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
}
/*
@@ -914,7 +933,7 @@ static void __hrtick_start(void *arg)
struct rq_flags rf;
rq_lock(rq, &rf);
- __hrtick_restart(rq);
+ hrtick_cond_restart(rq);
rq_unlock(rq, &rf);
}
@@ -925,7 +944,6 @@ static void __hrtick_start(void *arg)
*/
void hrtick_start(struct rq *rq, u64 delay)
{
- struct hrtimer *timer = &rq->hrtick_timer;
s64 delta;
/*
@@ -933,27 +951,67 @@ void hrtick_start(struct rq *rq, u64 delay)
* doesn't make sense and can cause timer DoS.
*/
delta = max_t(s64, delay, 10000LL);
- rq->hrtick_time = ktime_add_ns(hrtimer_cb_get_time(timer), delta);
+
+ /*
+ * If this is in the middle of schedule() only note the delay
+ * and let hrtick_schedule_exit() deal with it.
+ */
+ if (rq->hrtick_sched) {
+ rq->hrtick_sched |= HRTICK_SCHED_START;
+ rq->hrtick_delay = delta;
+ return;
+ }
+
+ rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
+ if (!hrtick_needs_rearm(&rq->hrtick_timer, rq->hrtick_time))
+ return;
if (rq == this_rq())
- __hrtick_restart(rq);
+ hrtimer_start(&rq->hrtick_timer, rq->hrtick_time, HRTIMER_MODE_ABS_PINNED_HARD);
else
smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
}
-static void hrtick_rq_init(struct rq *rq)
+static inline void hrtick_schedule_enter(struct rq *rq)
{
- INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
- hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+ rq->hrtick_sched = HRTICK_SCHED_DEFER;
+ if (hrtimer_test_and_clear_rearm_deferred())
+ rq->hrtick_sched |= HRTICK_SCHED_REARM_HRTIMER;
}
-#else /* !CONFIG_SCHED_HRTICK: */
-static inline void hrtick_clear(struct rq *rq)
+
+static inline void hrtick_schedule_exit(struct rq *rq)
{
+ if (rq->hrtick_sched & HRTICK_SCHED_START) {
+ rq->hrtick_time = ktime_add_ns(ktime_get(), rq->hrtick_delay);
+ hrtick_cond_restart(rq);
+ } else if (idle_rq(rq)) {
+ /*
+ * No need for using hrtimer_is_active(). The timer is CPU local
+ * and interrupts are disabled, so the callback cannot be
+ * running and the queued state is valid.
+ */
+ if (hrtimer_is_queued(&rq->hrtick_timer))
+ hrtimer_cancel(&rq->hrtick_timer);
+ }
+
+ if (rq->hrtick_sched & HRTICK_SCHED_REARM_HRTIMER)
+ __hrtimer_rearm_deferred();
+
+ rq->hrtick_sched = HRTICK_SCHED_NONE;
}
-static inline void hrtick_rq_init(struct rq *rq)
+static void hrtick_rq_init(struct rq *rq)
{
+ INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
+ rq->hrtick_sched = HRTICK_SCHED_NONE;
+ hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL_HARD | HRTIMER_MODE_LAZY_REARM);
}
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline void hrtick_clear(struct rq *rq) { }
+static inline void hrtick_rq_init(struct rq *rq) { }
+static inline void hrtick_schedule_enter(struct rq *rq) { }
+static inline void hrtick_schedule_exit(struct rq *rq) { }
#endif /* !CONFIG_SCHED_HRTICK */
/*
@@ -5032,6 +5090,7 @@ static inline void finish_lock_switch(struct rq *rq)
*/
spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
__balance_callbacks(rq, NULL);
+ hrtick_schedule_exit(rq);
raw_spin_rq_unlock_irq(rq);
}
@@ -6785,9 +6844,6 @@ static void __sched notrace __schedule(int sched_mode)
schedule_debug(prev, preempt);
- if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
- hrtick_clear(rq);
-
klp_sched_try_switch(prev);
local_irq_disable();
@@ -6814,6 +6870,8 @@ static void __sched notrace __schedule(int sched_mode)
rq_lock(rq, &rf);
smp_mb__after_spinlock();
+ hrtick_schedule_enter(rq);
+
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
@@ -6916,6 +6974,7 @@ keep_resched:
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq, NULL);
+ hrtick_schedule_exit(rq);
raw_spin_rq_unlock_irq(rq);
}
trace_sched_exit_tp(is_switch);
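The core of the core.c change above is the rearm-avoidance check: skip reprogramming the hrtick timer when it is already queued and the new expiry is within 5 us of the current one. A minimal standalone model of that predicate (plain integers standing in for ktime_t and the hrtimer queued state; the function name is made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Slack below which a queued timer is not reprogrammed, mirroring the
 * 5000 ns constant in hrtick_needs_rearm() above. */
#define REARM_SLACK_NS 5000LL

/*
 * Model of the check: a timer that is not queued (never started, or
 * currently running its callback) must always be restarted; a queued
 * one only when the expiry moves by more than the slack. Avoiding the
 * restart skips a dequeue/enqueue pair that could each rebalance the
 * RB-tree and a clock event device reprogram.
 */
static bool needs_rearm(bool queued, int64_t cur_expiry_ns,
			int64_t new_expiry_ns)
{
	if (!queued)
		return true;
	return llabs(new_expiry_ns - cur_expiry_ns) > REARM_SLACK_NS;
}
```

A timer rearmed to essentially the same deadline, as the hrtick typically is tick after tick, thus costs nothing beyond the comparison.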
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 674de6a48551..8c3c1fe8d3a6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1097,7 +1097,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
act = ns_to_ktime(dl_next_period(dl_se));
}
- now = hrtimer_cb_get_time(timer);
+ now = ktime_get();
delta = ktime_to_ns(now) - rq_clock(rq);
act = ktime_add_ns(act, delta);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be7..2be80780ff51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5600,7 +5600,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
* validating it and just reschedule.
*/
if (queued) {
- resched_curr_lazy(rq_of(cfs_rq));
+ resched_curr(rq_of(cfs_rq));
return;
}
#endif
@@ -6805,27 +6805,41 @@ static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
+ unsigned long scale = 1024;
+ unsigned long util = 0;
+ u64 vdelta;
+ u64 delta;
WARN_ON_ONCE(task_rq(p) != rq);
- if (rq->cfs.h_nr_queued > 1) {
- u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
- u64 slice = se->slice;
- s64 delta = slice - ran;
+ if (rq->cfs.h_nr_queued <= 1)
+ return;
- if (delta < 0) {
- if (task_current_donor(rq, p))
- resched_curr(rq);
- return;
- }
- hrtick_start(rq, delta);
+ /*
+ * Compute time until virtual deadline
+ */
+ vdelta = se->deadline - se->vruntime;
+ if ((s64)vdelta < 0) {
+ if (task_current_donor(rq, p))
+ resched_curr(rq);
+ return;
}
+ delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+
+ /*
+ * Correct for instantaneous load of other classes.
+ */
+ util += cpu_util_irq(rq);
+ if (util && util < 1024) {
+ scale *= 1024;
+ scale /= (1024 - util);
+ }
+
+ hrtick_start(rq, (scale * delta) / 1024);
}
/*
- * called from enqueue/dequeue and updates the hrtick when the
- * current task is from our class and nr_running is low enough
- * to matter.
+ * Called on enqueue to start the hrtick when h_nr_queued becomes more than 1.
*/
static void hrtick_update(struct rq *rq)
{
@@ -6834,6 +6848,9 @@ static void hrtick_update(struct rq *rq)
if (!hrtick_enabled_fair(rq) || donor->sched_class != &fair_sched_class)
return;
+ if (hrtick_active(rq))
+ return;
+
hrtick_start_fair(rq, donor);
}
#else /* !CONFIG_SCHED_HRTICK: */
@@ -7156,9 +7173,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
WARN_ON_ONCE(!task_sleep);
WARN_ON_ONCE(p->on_rq != 1);
- /* Fix-up what dequeue_task_fair() skipped */
- hrtick_update(rq);
-
/*
* Fix-up what block_task() skipped.
*
@@ -7192,8 +7206,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/*
* Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
*/
-
- hrtick_update(rq);
return true;
}
@@ -13435,11 +13447,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
entity_tick(cfs_rq, se, queued);
}
- if (queued) {
- if (!need_resched())
- hrtick_start_fair(rq, curr);
+ if (queued)
return;
- }
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
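The new hrtick_start_fair() above converts the distance to the virtual deadline into wall time and then stretches it for CPU time stolen by IRQ processing. A standalone sketch of that arithmetic (the function name is invented; constants follow the kernel's 1024-based fixed-point convention visible in the hunk):

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL	/* weight of a nice-0 task */

/*
 * Wall-clock delay until the entity's virtual deadline:
 *   delta = weight * (deadline - vruntime) / NICE_0_LOAD
 * then scaled up by 1024 / (1024 - irq_util) so that time consumed by
 * interrupts (irq_util in [0, 1024)) does not make the hrtick fire
 * before the entity has actually used its slice.
 */
static uint64_t hrtick_delay_ns(uint64_t weight, uint64_t vdelta_ns,
				uint64_t irq_util)
{
	uint64_t delta = (weight * vdelta_ns) / NICE_0_LOAD;
	uint64_t scale = 1024;

	if (irq_util && irq_util < 1024) {
		scale *= 1024;
		scale /= (1024 - irq_util);
	}
	return (scale * delta) / 1024;
}
```

For a nice-0 task (weight 1024) with 1 ms of virtual runtime left, no IRQ load yields a 1 ms delay, while irq_util = 512 (half the CPU eaten by interrupts) doubles it to 2 ms.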
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 136a6584be79..d06228462607 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+SCHED_FEAT(HRTICK, true)
+SCHED_FEAT(HRTICK_DL, true)
+#else
SCHED_FEAT(HRTICK, false)
SCHED_FEAT(HRTICK_DL, false)
+#endif
/*
* Decrement CPU capacity based on time not spent running tasks
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1ef9ba480f51..a67c73ecdf79 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1288,6 +1288,8 @@ struct rq {
call_single_data_t hrtick_csd;
struct hrtimer hrtick_timer;
ktime_t hrtick_time;
+ ktime_t hrtick_delay;
+ unsigned int hrtick_sched;
#endif
#ifdef CONFIG_SCHEDSTATS
@@ -3033,46 +3035,31 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
* - enabled by features
* - hrtimer is actually high res
*/
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_enabled(struct rq *rq)
{
- if (!cpu_active(cpu_of(rq)))
- return 0;
- return hrtimer_is_hres_active(&rq->hrtick_timer);
+ return cpu_active(cpu_of(rq)) && hrtimer_highres_enabled();
}
-static inline int hrtick_enabled_fair(struct rq *rq)
+static inline bool hrtick_enabled_fair(struct rq *rq)
{
- if (!sched_feat(HRTICK))
- return 0;
- return hrtick_enabled(rq);
+ return sched_feat(HRTICK) && hrtick_enabled(rq);
}
-static inline int hrtick_enabled_dl(struct rq *rq)
+static inline bool hrtick_enabled_dl(struct rq *rq)
{
- if (!sched_feat(HRTICK_DL))
- return 0;
- return hrtick_enabled(rq);
+ return sched_feat(HRTICK_DL) && hrtick_enabled(rq);
}
extern void hrtick_start(struct rq *rq, u64 delay);
-
-#else /* !CONFIG_SCHED_HRTICK: */
-
-static inline int hrtick_enabled_fair(struct rq *rq)
-{
- return 0;
-}
-
-static inline int hrtick_enabled_dl(struct rq *rq)
-{
- return 0;
-}
-
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_active(struct rq *rq)
{
- return 0;
+ return hrtimer_active(&rq->hrtick_timer);
}
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline bool hrtick_enabled_fair(struct rq *rq) { return false; }
+static inline bool hrtick_enabled_dl(struct rq *rq) { return false; }
+static inline bool hrtick_enabled(struct rq *rq) { return false; }
#endif /* !CONFIG_SCHED_HRTICK */
#ifndef arch_scale_freq_tick