From 22368fe1f9bbf39db2b5b52859589883273e80ce Mon Sep 17 00:00:00 2001
From: Juri Lelli
Date: Wed, 27 Nov 2024 07:37:40 +0100
Subject: sched/deadline: Fix replenish_dl_new_period dl_server condition

The condition in replenish_dl_new_period() that checks if a reservation
(dl_server) is deferred and is not handling a starvation case is
obviously wrong.

Fix it.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com
---
 kernel/sched/deadline.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..206691d35b7d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -781,7 +781,7 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
 	 * If it is a deferred reservation, and the server
 	 * is not handling an starvation case, defer it.
 	 */
-	if (dl_se->dl_defer & !dl_se->dl_defer_running) {
+	if (dl_se->dl_defer && !dl_se->dl_defer_running) {
 		dl_se->dl_throttled = 1;
 		dl_se->dl_defer_armed = 1;
 	}
-- 
cgit v1.2.3

From 70ee7947a29029736a1a06c73a48ff37674a851b Mon Sep 17 00:00:00 2001
From: Josh Don
Date: Mon, 11 Nov 2024 10:27:38 -0800
Subject: sched: fix warning in sched_setaffinity

Commit 8f9ea86fdf99b added some logic to sched_setaffinity that included
a WARN when a per-task affinity assignment races with a cpuset update.

Specifically, we can have a race where a cpuset update results in the
task affinity no longer being a subset of the cpuset. That's fine; we
have a fallback to instead use the cpuset mask. However, we have a WARN
set up that will trigger if the cpuset mask has no overlap at all with
the requested task affinity. This shouldn't be a warning condition; it's
trivial to create this condition.

Reproduced the warning by the following setup:

- $PID inside a cpuset cgroup
- another thread repeatedly switching the cpuset cpus from 1-2 to just 1
- another thread repeatedly setting the $PID affinity (via taskset) to 2

Fixes: 8f9ea86fdf99b ("sched: Always preserve the user requested cpumask")
Signed-off-by: Josh Don
Acked-and-tested-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Tested-by: Madadi Vineeth Reddy
Link: https://lkml.kernel.org/r/20241111182738.1832953-1-joshdon@google.com
---
 kernel/sched/syscalls.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 0d71fcbaf1e3..ff0e5ab4e37c 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -1200,7 +1200,7 @@ int __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx)
 			bool empty = !cpumask_and(new_mask, new_mask,
 						  ctx->user_mask);
 
-			if (WARN_ON_ONCE(empty))
+			if (empty)
 				cpumask_copy(new_mask, cpus_allowed);
 		}
 		__set_cpus_allowed_ptr(p, ctx);
-- 
cgit v1.2.3

From ea9cffc0a154124821531991d5afdd7e8b20d7aa Mon Sep 17 00:00:00 2001
From: K Prateek Nayak
Date: Tue, 19 Nov 2024 05:44:30 +0000
Subject: sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()

The need_resched() check currently in nohz_csd_func() can be traced
back to scheduler_ipi(), where it was added in 2011 via commit
ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")

Since then, it has travelled quite a bit but it seems like an idle_cpu()
check currently is sufficient to detect the need to bail out from an
idle load balancing.
To justify this removal, consider all the following cases where an idle
load balancing could race with a task wakeup:

o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
  on wakelist if wakee cpu is idle") a target perceived to be idle
  (target_rq->nr_running == 0) will return true for
  ttwu_queue_cond(target) which will offload the task wakeup to the idle
  target via an IPI.

  In all such cases target_rq->ttwu_pending will be set to 1 before
  queuing the wake function.

  If an idle load balance races here, following scenarios are possible:

  - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
    IPI is sent to the CPU to wake it out of idle. If the
    nohz_csd_func() queues before sched_ttwu_pending(), the idle load
    balance will bail out since idle_cpu(target) returns 0 since
    target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
    sched_ttwu_pending() it should see rq->nr_running to be non-zero and
    bail out of idle load balancing.

  - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
    the sender will simply set TIF_NEED_RESCHED for the target to put it
    out of idle and flush_smp_call_function_queue() in do_idle() will
    execute the call function. Depending on the ordering of the queuing
    of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
    nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
    target_rq->nr_running to be non-zero if there is a genuine task
    wakeup racing with the idle load balance kick.

o The waker CPU perceives the target CPU to be busy
  (target_rq->nr_running != 0) but the CPU is in fact going idle and due
  to a series of unfortunate events, the system reaches a case where the
  waker CPU decides to perform the wakeup by itself in ttwu_queue() on
  the target CPU but target is concurrently selected for idle load
  balance (XXX: Can this happen? I'm not sure, but we'll consider the
  mother of all coincidences to estimate the worst case scenario).

  ttwu_do_activate() calls enqueue_task() which would increment
  "rq->nr_running" post which it calls wakeup_preempt() which is
  responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
  setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU). The key
  thing to note in this case is that rq->nr_running is already non-zero
  in case of a wakeup before TIF_NEED_RESCHED is set which would
  lead to idle_cpu() check returning false.

In all cases, it seems that need_resched() check is unnecessary when
checking for idle_cpu() first since an impending wakeup racing with idle
load balancer will either set the "rq->ttwu_pending" or indicate a newly
woken task via "rq->nr_running".

Chasing the reason why this check might have existed in the first place,
I came across Peter's suggestion on the first iteration of Suresh's
patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:

	sched_ttwu_do_pending(list);

	if (unlikely((rq->idle == current) &&
	    rq->nohz_balance_kick &&
	    !need_resched()))
		raise_softirq_irqoff(SCHED_SOFTIRQ);

Since the condition to raise the SCHED_SOFTIRQ was preceded by
sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in
the current upstream kernel, the need_resched() check was necessary to
catch a newly queued task. Peter suggested modifying it to:

	if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
		raise_softirq_irqoff(SCHED_SOFTIRQ);

where idle_cpu() seems to have replaced "rq->idle == current" check.

Even back then, the idle_cpu() check would have been sufficient to catch
a new task being enqueued. Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") overloads the interpretation of
TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
on Peter's suggestion.
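
Editorial illustration, not part of the commit: the argument above can be
sketched as a small user-space model. It assumes idle_cpu() boils down to
the three conditions the message names (the idle task is current, no
tasks are enqueued, no remote wakeup is pending). struct toy_rq and the
toy_* helpers are hypothetical names invented for this sketch; they are
not kernel APIs.

```c
#include <stdbool.h>

/* Toy stand-in for the runqueue state that the message reasons about. */
struct toy_rq {
	bool curr_is_idle;        /* models rq->curr == rq->idle */
	unsigned int nr_running;  /* models rq->nr_running */
	bool ttwu_pending;        /* models rq->ttwu_pending */
};

/* Assumed shape of the idle_cpu() test: all three conditions must hold. */
static bool toy_idle_cpu(const struct toy_rq *rq)
{
	if (!rq->curr_is_idle)
		return false;
	if (rq->nr_running)
		return false;
	if (rq->ttwu_pending)
		return false;
	return true;
}

/*
 * Decision taken after the patch: need_resched is accepted only to show
 * that it is ignored. A polling idle CPU may have it set purely because
 * an IPI was replaced by setting TIF_NEED_RESCHED.
 */
static bool toy_should_raise_softirq(const struct toy_rq *rq, bool need_resched)
{
	(void)need_resched;
	return toy_idle_cpu(rq);
}
```

In this model a genuine wakeup always shows up as either ttwu_pending or
a non-zero nr_running, so consulting need_resched adds nothing.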

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra
Signed-off-by: K Prateek Nayak
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e40895a519..803b238da23c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1283,7 +1283,7 @@ static void nohz_csd_func(void *info)
 	WARN_ON(!(flags & NOHZ_KICK_MASK));
 
 	rq->idle_balance = idle_cpu(cpu);
-	if (rq->idle_balance && !need_resched()) {
+	if (rq->idle_balance) {
 		rq->nohz_idle_balance = flags;
 		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}
-- 
cgit v1.2.3

From ff47a0acfcce309cf9e175149c75614491953c8f Mon Sep 17 00:00:00 2001
From: K Prateek Nayak
Date: Tue, 19 Nov 2024 05:44:31 +0000
Subject: sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy

Commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the
TIF_NEED_RESCHED flag in idle task's thread info and relying on
flush_smp_call_function_queue() in idle exit path to run the
call-function. A softirq raised by the call-function is handled shortly
after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag
remains set and is only cleared later when schedule_idle() calls
__schedule().

need_resched() check in _nohz_idle_balance() exists to bail out of load
balancing if another task has woken up on the CPU currently in-charge of
idle load balancing which is being processed in SCHED_SOFTIRQ context.
Since the optimization mentioned above overloads the interpretation of
TIF_NEED_RESCHED, check for idle_cpu() before going with the existing
need_resched() check which can catch a genuine task wakeup on an idle
CPU processing SCHED_SOFTIRQ from do_softirq_post_smp_call_flush(), as
well as the case where ksoftirqd needs to be preempted as a result of
new task wakeup or slice expiry.

In case of PREEMPT_RT or threadirqs, although the idle load balancing
may be inhibited in some cases on the ilb CPU, the fact that ksoftirqd
is the only fair task going back to sleep will trigger a newidle balance
on the CPU which will alleviate some imbalance if it exists if idle
balance fails to do so.

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Signed-off-by: K Prateek Nayak
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lore.kernel.org/r/20241119054432.6405-4-kprateek.nayak@amd.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..05b8f1eb2c14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12568,7 +12568,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 		 * work being done for other CPUs. Next load
 		 * balancing owner will pick it up.
 		 */
-		if (need_resched()) {
+		if (!idle_cpu(this_cpu) && need_resched()) {
 			if (flags & NOHZ_STATS_KICK)
 				has_blocked_load = true;
 			if (flags & NOHZ_NEXT_KICK)
-- 
cgit v1.2.3

From e932c4ab38f072ce5894b2851fea8bc5754bb8e5 Mon Sep 17 00:00:00 2001
From: K Prateek Nayak
Date: Tue, 19 Nov 2024 05:44:32 +0000
Subject: sched/core: Prevent wakeup of ksoftirqd during idle load balance

Scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event from
the IPI handler on the idle CPU.
If the SMP function is invoked from an idle CPU via
flush_smp_call_function_queue() then the HARD-IRQ flag is not set and
raise_softirq_irqoff() needlessly wakes ksoftirqd because soft
interrupts are handled before ksoftirqd gets on the CPU.

Adding a trace_printk() in nohz_csd_func() at the spot of raising
SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the
current behavior:

  -0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func
  -0 [000] dN.4.: sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
  -0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]
  -0 [000] .Ns1.: softirq_exit: vec=7 [action=SCHED]
  -0 [000] d..2.: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120
  ksoftirqd/0-16 [000] d..2.: sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
  ...

Use __raise_softirq_irqoff() to raise the softirq. The SMP function call
is always invoked on the requested CPU in an interrupt handler. It is
guaranteed that soft interrupts are handled at the end.

Following are the observations with the changes when enabling the same
set of events:

  -0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance
  -0 [000] dN.1.: softirq_raise: vec=7 [action=SCHED]
  -0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]

No unnecessary ksoftirqd wakeups are seen from idle task's context to
service the softirq.
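
Editorial illustration, not part of the commit: a minimal user-space
model of the difference between the two raise helpers as the message
describes them. It assumes the only difference is that the non-underscore
variant also wakes ksoftirqd when called outside interrupt context.
struct toy_cpu and the toy_* functions are hypothetical names, not kernel
APIs; vector 7 is taken from the trace output above.

```c
#include <stdbool.h>

enum { TOY_SCHED_SOFTIRQ = 7 };  /* "vec=7 [action=SCHED]" in the trace */

/* Toy per-CPU softirq state. */
struct toy_cpu {
	unsigned int pending;           /* bitmask of raised vectors */
	bool in_interrupt;              /* models the HARD-IRQ/softirq context flag */
	unsigned int ksoftirqd_wakeups; /* counts wakeups of the helper thread */
};

/* Models __raise_softirq_irqoff(): only mark the vector pending. */
static void toy_mark_pending(struct toy_cpu *c, unsigned int nr)
{
	c->pending |= 1u << nr;
}

/*
 * Models raise_softirq_irqoff() under the assumption stated above:
 * mark pending, then wake ksoftirqd if not in interrupt context.
 */
static void toy_raise_and_maybe_wake(struct toy_cpu *c, unsigned int nr)
{
	toy_mark_pending(c, nr);
	if (!c->in_interrupt)
		c->ksoftirqd_wakeups++;
}
```

With in_interrupt false, as on the idle-exit path described above,
toy_raise_and_maybe_wake() counts a wakeup while toy_mark_pending() does
not; both leave the vector pending for the softirq run that follows.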

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1]
Reported-by: Julia Lawall
Suggested-by: Sebastian Andrzej Siewior
Signed-off-by: K Prateek Nayak
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Sebastian Andrzej Siewior
Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 803b238da23c..c6d8232ad9ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1285,7 +1285,7 @@ static void nohz_csd_func(void *info)
 	rq->idle_balance = idle_cpu(cpu);
 	if (rq->idle_balance) {
 		rq->nohz_idle_balance = flags;
-		raise_softirq_irqoff(SCHED_SOFTIRQ);
+		__raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}
 }
 
-- 
cgit v1.2.3

From 0664e2c311b9fa43b33e3e81429cd0c2d7f9c638 Mon Sep 17 00:00:00 2001
From: Wander Lairson Costa
Date: Wed, 24 Jul 2024 11:22:47 -0300
Subject: sched/deadline: Fix warning in migrate_enable for boosted tasks

When running the following command:

while true; do
    stress-ng --cyclic 30 --timeout 30s --minimize --quiet
done

a warning is eventually triggered:

WARNING: CPU: 43 PID: 2848 at kernel/sched/deadline.c:794 setup_new_dl_entity+0x13e/0x180
...
Call Trace:
 ? show_trace_log_lvl+0x1c4/0x2df
 ? enqueue_dl_entity+0x631/0x6e0
 ? setup_new_dl_entity+0x13e/0x180
 ? __warn+0x7e/0xd0
 ? report_bug+0x11a/0x1a0
 ? handle_bug+0x3c/0x70
 ? exc_invalid_op+0x14/0x70
 ? asm_exc_invalid_op+0x16/0x20
 enqueue_dl_entity+0x631/0x6e0
 enqueue_task_dl+0x7d/0x120
 __do_set_cpus_allowed+0xe3/0x280
 __set_cpus_allowed_ptr_locked+0x140/0x1d0
 __set_cpus_allowed_ptr+0x54/0xa0
 migrate_enable+0x7e/0x150
 rt_spin_unlock+0x1c/0x90
 group_send_sig_info+0xf7/0x1a0
 ? kill_pid_info+0x1f/0x1d0
 kill_pid_info+0x78/0x1d0
 kill_proc_info+0x5b/0x110
 __x64_sys_kill+0x93/0xc0
 do_syscall_64+0x5c/0xf0
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
 RIP: 0033:0x7f0dab31f92b

This warning occurs because set_cpus_allowed dequeues and enqueues tasks
with the ENQUEUE_RESTORE flag set. If the task is boosted, the warning
is triggered. A boosted task already had its parameters set by
rt_mutex_setprio, and a new call to setup_new_dl_entity is unnecessary,
hence the WARN_ON call.

Check if we are requeueing a boosted task and avoid calling
setup_new_dl_entity if that's the case.

Fixes: 295d6d5e3736 ("sched/deadline: Fix switching to -deadline")
Signed-off-by: Wander Lairson Costa
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Juri Lelli
Link: https://lore.kernel.org/r/20240724142253.27145-2-wander@redhat.com
---
 kernel/sched/deadline.c | 1 +
 1 file changed, 1 insertion(+)

(limited to 'kernel/sched')

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 206691d35b7d..db47f33cb7d2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2042,6 +2042,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se);
 	} else if ((flags & ENQUEUE_RESTORE) &&
+		   !is_dl_boosted(dl_se) &&
 		   dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
 		setup_new_dl_entity(dl_se);
 	}
-- 
cgit v1.2.3

From 5f1b64e9a9b7ee9cfd32c6b2fab796e29bfed075 Mon Sep 17 00:00:00 2001
From: Adrian Huang
Date: Wed, 13 Nov 2024 18:21:46 +0800
Subject: sched/numa: fix memory leak due to the overwritten vma->numab_state

[Problem Description]
When running the hackbench program of LTP, the following memory leak is
reported by kmemleak.

  # /opt/ltp/testcases/bin/hackbench 20 thread 1000
  Running with 20*40 (== 800) tasks.

  # dmesg | grep kmemleak
  ...
  kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
  kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

  # cat /sys/kernel/debug/kmemleak
  unreferenced object 0xffff888cd8ca2c40 (size 64):
    comm "hackbench", pid 17142, jiffies 4299780315
    hex dump (first 32 bytes):
      ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00  .tI.....L.I.....
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace (crc bff18fd4):
      [] __kmalloc_cache_noprof+0x2f9/0x3f0
      [] task_numa_work+0x725/0xa00
      [] task_work_run+0x58/0x90
      [] syscall_exit_to_user_mode+0x1c8/0x1e0
      [] do_syscall_64+0x85/0x150
      [] entry_SYSCALL_64_after_hwframe+0x76/0x7e
  ...

This issue can be consistently reproduced on three different servers:
  * a 448-core server
  * a 256-core server
  * a 192-core server

[Root Cause]
Since multiple threads are created by the hackbench program (along with
the command argument 'thread'), a shared vma might be accessed by two or
more cores simultaneously. When two or more cores observe that
vma->numab_state is NULL at the same time, vma->numab_state will be
overwritten.

Although current code ensures that only one thread scans the VMAs in a
single 'numa_scan_period', there might be a chance for another thread
to enter in the next 'numa_scan_period' before we have reached the
numab_state allocation [1].

Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000`
cannot reproduce the issue. It is verified with 200+ test runs.

[Solution]
Use the cmpxchg atomic operation to ensure that only one thread executes
the vma->numab_state assignment.
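
Editorial illustration, not part of the commit: the allocate-then-publish
pattern described in [Solution] can be sketched in user space with C11
atomics standing in for the kernel's cmpxchg(). struct toy_vma,
struct toy_state and toy_install_state() are hypothetical names made up
for this sketch, and calloc()/free() replace kzalloc()/kfree().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-ins for struct vma_numab_state and the owning vma. */
struct toy_state {
	int start_scan_seq;
};

struct toy_vma {
	_Atomic(struct toy_state *) state;  /* NULL until first published */
};

/*
 * Allocate a state object and publish it only if the slot is still NULL.
 * Returns true if this caller installed its object. On allocation
 * failure or a lost race it returns false, and the loser frees its own
 * allocation so nothing is overwritten or leaked.
 */
static bool toy_install_state(struct toy_vma *vma, int seq)
{
	struct toy_state *expected = NULL;
	struct toy_state *ptr = calloc(1, sizeof(*ptr));

	if (!ptr)
		return false;
	ptr->start_scan_seq = seq;

	/* Succeeds only when vma->state still equals expected (NULL). */
	if (!atomic_compare_exchange_strong(&vma->state, &expected, ptr)) {
		free(ptr);
		return false;
	}
	return true;
}
```

The key property is that the pointer is written at most once: a second
caller sees a non-NULL slot, frees its private allocation and leaves the
first caller's object in place.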

[1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/

Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com
Fixes: ef6a22b70f6d ("sched/numa: apply the scan delay to every new vma")
Signed-off-by: Adrian Huang
Reported-by: Jiwei Sun
Reviewed-by: Raghavendra K T
Reviewed-by: Vlastimil Babka
Cc: Ben Segall
Cc: Dietmar Eggemann
Cc: Ingo Molnar
Cc: Juri Lelli
Cc: Mel Gorman
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Valentin Schneider
Cc: Vincent Guittot
Cc:
Signed-off-by: Andrew Morton
---
 kernel/sched/fair.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..a59ae2e23daf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3399,10 +3399,16 @@ retry_pids:
 
 		/* Initialise new per-VMA NUMAB state. */
 		if (!vma->numab_state) {
-			vma->numab_state = kzalloc(sizeof(struct vma_numab_state),
-				GFP_KERNEL);
-			if (!vma->numab_state)
+			struct vma_numab_state *ptr;
+
+			ptr = kzalloc(sizeof(*ptr), GFP_KERNEL);
+			if (!ptr)
+				continue;
+
+			if (cmpxchg(&vma->numab_state, NULL, ptr)) {
+				kfree(ptr);
 				continue;
+			}
 
 			vma->numab_state->start_scan_seq = mm->numa_scan_seq;
 
-- 
cgit v1.2.3

From 493afbd187c4c9cc1642792c0d9ba400c3d6d90d Mon Sep 17 00:00:00 2001
From: K Prateek Nayak
Date: Thu, 28 Nov 2024 12:59:54 +0530
Subject: sched/fair: Fix NEXT_BUDDY

Adam reports that enabling NEXT_BUDDY instantly triggers a WARN in
pick_next_entity().

Moving clear_buddies() up before the delayed dequeue bits ensures
no ->next buddy becomes delayed. Further ensure no new ->next buddy
ever starts as delayed.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: Adam Li
Signed-off-by: K Prateek Nayak
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Adam Li
Link: https://lkml.kernel.org/r/670a0d54-e398-4b1f-8a6e-90784e2fdf89@amd.com
---
 kernel/sched/fair.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 05b8f1eb2c14..9d7a2dd2c260 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5478,6 +5478,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	bool sleep = flags & DEQUEUE_SLEEP;
 
 	update_curr(cfs_rq);
+	clear_buddies(cfs_rq, se);
 
 	if (flags & DEQUEUE_DELAYED) {
 		SCHED_WARN_ON(!se->sched_delayed);
@@ -5494,8 +5495,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 		if (sched_feat(DELAY_DEQUEUE) && delay &&
 		    !entity_eligible(cfs_rq, se)) {
-			if (cfs_rq->next == se)
-				cfs_rq->next = NULL;
 			update_load_avg(cfs_rq, se, 0);
 			se->sched_delayed = 1;
 			return false;
@@ -5520,8 +5519,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_stats_dequeue_fair(cfs_rq, se, flags);
 
-	clear_buddies(cfs_rq, se);
-
 	update_entity_lag(cfs_rq, se);
 	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
 		se->deadline -= se->vruntime;
@@ -8774,7 +8771,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
+	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
 		set_next_buddy(pse);
 	}
 
-- 
cgit v1.2.3

From c1f43c342e1f2e32f0620bf2e972e2a9ea0a1e60 Mon Sep 17 00:00:00 2001
From: Vincent Guittot
Date: Mon, 2 Dec 2024 18:45:56 +0100
Subject: sched/fair: Fix sched_can_stop_tick() for fair tasks

We can't stop the tick of a rq if there are at least 2 tasks enqueued in
the whole hierarchy and not only at the root cfs rq.

rq->cfs.nr_running tracks the number of sched_entity at one level
whereas rq->cfs.h_nr_running tracks all queued tasks in the
hierarchy.

Fixes: 11cc374f4643b ("sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick()")
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Link: https://lore.kernel.org/r/20241202174606.4074512-2-vincent.guittot@linaro.org
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c6d8232ad9ee..3e5a6bf587f9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1341,7 +1341,7 @@ bool sched_can_stop_tick(struct rq *rq)
 	if (scx_enabled() && !scx_can_stop_tick(rq))
 		return false;
 
-	if (rq->cfs.nr_running > 1)
+	if (rq->cfs.h_nr_running > 1)
 		return false;
 
 	/*
-- 
cgit v1.2.3

From 76f2f783294d7d55c2564e2dfb0a7279ba0bc264 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra
Date: Mon, 2 Dec 2024 18:45:57 +0100
Subject: sched/eevdf: More PELT vs DELAYED_DEQUEUE

Vincent and Dietmar noted that while
commit fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes
the entity runnable stats, it does not adjust the cfs_rq runnable stats,
which are based off of h_nr_running.

Track h_nr_delayed such that we can discount those and adjust the
signal.
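
Editorial illustration, not part of the commit: a small user-space model
of hierarchical counters, showing how a per-level h_nr_delayed lets the
runnable signal be computed as h_nr_running minus h_nr_delayed. The
struct toy_cfs_rq type and the toy_* helpers are hypothetical names for
this sketch. The model walks plain parent pointers and deliberately
omits the throttling cut-off that a real hierarchy walk would need.

```c
#include <stddef.h>

/* Toy runqueue level; parent is NULL at the root. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;
	unsigned int h_nr_running;  /* tasks queued at or below this level */
	unsigned int h_nr_delayed;  /* of those, tasks in delayed-dequeue state */
};

/* Account one newly queued task at its level and every ancestor. */
static void toy_enqueue_task(struct toy_cfs_rq *q)
{
	for (; q; q = q->parent)
		q->h_nr_running++;
}

/* Mark one queued task as delayed, propagating up the hierarchy. */
static void toy_set_delayed(struct toy_cfs_rq *q)
{
	for (; q; q = q->parent)
		q->h_nr_delayed++;
}

/* Undo toy_set_delayed() for one task. */
static void toy_clear_delayed(struct toy_cfs_rq *q)
{
	for (; q; q = q->parent)
		q->h_nr_delayed--;
}

/* Runnable contribution: queued tasks minus those only nominally queued. */
static unsigned int toy_runnable(const struct toy_cfs_rq *q)
{
	return q->h_nr_running - q->h_nr_delayed;
}
```

In this model, two tasks queued in a child group with one of them delayed
yield a runnable count of 1 at both the child and the root, while
h_nr_running still reports 2 at both levels.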

Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
Closes: https://lore.kernel.org/lkml/a9a45193-d0c6-4ba2-a822-464ad30b550e@arm.com/
Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/
[ Fixes checkpatch warnings and rebased ]
Signed-off-by: Peter Zijlstra (Intel)
Reported-by: Dietmar Eggemann
Reported-by: Vincent Guittot
Signed-off-by: "Peter Zijlstra (Intel)"
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Tested-by: K Prateek Nayak
Link: https://lore.kernel.org/r/20241202174606.4074512-3-vincent.guittot@linaro.org
---
 kernel/sched/debug.c |  1 +
 kernel/sched/fair.c  | 51 ++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/pelt.c  |  2 +-
 kernel/sched/sched.h |  8 ++++++--
 4 files changed, 54 insertions(+), 8 deletions(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index a48b2a701ec2..a1be00a988bf 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -845,6 +845,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
+	SEQ_printf(m, "  .%-30s: %d\n", "h_nr_delayed", cfs_rq->h_nr_delayed);
 	SEQ_printf(m, "  .%-30s: %d\n", "idle_nr_running",
 			cfs_rq->idle_nr_running);
 	SEQ_printf(m, "  .%-30s: %d\n", "idle_h_nr_running",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d7a2dd2c260..97ee48c8bf5e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5465,9 +5465,33 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
-static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
+static void set_delayed(struct sched_entity *se)
+{
+	se->sched_delayed = 1;
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		cfs_rq->h_nr_delayed++;
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+}
+
+static void clear_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 0;
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		cfs_rq->h_nr_delayed--;
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+}
+
+static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
+{
+	clear_delayed(se);
 	if (sched_feat(DELAY_ZERO) && se->vlag > 0)
 		se->vlag = 0;
 }
@@ -5496,7 +5520,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		if (sched_feat(DELAY_DEQUEUE) && delay &&
 		    !entity_eligible(cfs_rq, se)) {
 			update_load_avg(cfs_rq, se, 0);
-			se->sched_delayed = 1;
+			set_delayed(se);
 			return false;
 		}
 	}
@@ -5908,7 +5932,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta, dequeue = 1;
+	long task_delta, idle_task_delta, delayed_delta, dequeue = 1;
 	long rq_h_nr_running = rq->cfs.h_nr_running;
 
 	raw_spin_lock(&cfs_b->lock);
@@ -5941,6 +5965,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
+	delayed_delta = cfs_rq->h_nr_delayed;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 		int flags;
@@ -5964,6 +5989,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
+		qcfs_rq->h_nr_delayed -= delayed_delta;
 
 		if (qcfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5986,6 +6012,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
+		qcfs_rq->h_nr_delayed -= delayed_delta;
 	}
 
 	/* At this point se is NULL and we are at root level*/
@@ -6011,7 +6038,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta;
+	long task_delta, idle_task_delta, delayed_delta;
 	long rq_h_nr_running = rq->cfs.h_nr_running;
 
 	se = cfs_rq->tg->se[cpu_of(rq)];
@@ -6047,6 +6074,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
+	delayed_delta = cfs_rq->h_nr_delayed;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 
@@ -6064,6 +6092,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
+		qcfs_rq->h_nr_delayed += delayed_delta;
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -6081,6 +6110,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
+		qcfs_rq->h_nr_delayed += delayed_delta;
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -6934,7 +6964,7 @@ requeue_delayed_entity(struct sched_entity *se)
 	}
 
 	update_load_avg(cfs_rq, se, 0);
-	se->sched_delayed = 0;
+	clear_delayed(se);
 }
 
 /*
@@ -6948,6 +6978,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 	int idle_h_nr_running = task_has_idle_policy(p);
+	int h_nr_delayed = 0;
 	int task_new = !(flags & ENQUEUE_WAKEUP);
 	int rq_h_nr_running = rq->cfs.h_nr_running;
 	u64 slice = 0;
@@ -6974,6 +7005,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (p->in_iowait)
 		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
 
+	if (task_new)
+		h_nr_delayed = !!se->sched_delayed;
+
 	for_each_sched_entity(se) {
 		if (se->on_rq) {
 			if (se->sched_delayed)
@@ -6996,6 +7030,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
+		cfs_rq->h_nr_delayed += h_nr_delayed;
 
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
@@ -7019,6 +7054,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
+		cfs_rq->h_nr_delayed += h_nr_delayed;
 
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
@@ -7081,6 +7117,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	struct task_struct *p = NULL;
 	int idle_h_nr_running = 0;
 	int h_nr_running = 0;
+	int h_nr_delayed = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
 
@@ -7088,6 +7125,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		p = task_of(se);
 		h_nr_running = 1;
 		idle_h_nr_running = task_has_idle_policy(p);
+		if (!task_sleep && !task_delayed)
+			h_nr_delayed = !!se->sched_delayed;
 	} else {
 		cfs_rq = group_cfs_rq(se);
 		slice = cfs_rq_min_slice(cfs_rq);
@@ -7105,6 +7144,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		cfs_rq->h_nr_running -= h_nr_running;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+		cfs_rq->h_nr_delayed -= h_nr_delayed;
 
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = h_nr_running;
@@ -7143,6 +7183,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		cfs_rq->h_nr_running -= h_nr_running;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+		cfs_rq->h_nr_delayed -= h_nr_delayed;
 
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = h_nr_running;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index fc07382361a8..fee75cc2c47b 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -321,7 +321,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
-				cfs_rq->h_nr_running,
+				cfs_rq->h_nr_running - cfs_rq->h_nr_delayed,
 				cfs_rq->curr != NULL)) {
 
 		___update_load_avg(&cfs_rq->avg, 1);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..1e494af2cd23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -649,6 +649,7 @@ struct cfs_rq {
 	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
 	unsigned int		idle_nr_running;   /* SCHED_IDLE */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
+	unsigned int		h_nr_delayed;
 
 	s64			avg_vruntime;
 	u64			avg_load;
@@ -898,8 +899,11 @@ struct dl_rq {
 
 static inline void se_update_runnable(struct sched_entity *se)
 {
-	if (!entity_is_task(se))
-		se->runnable_weight = se->my_q->h_nr_running;
+	if (!entity_is_task(se)) {
+		struct cfs_rq *cfs_rq = se->my_q;
+
+		se->runnable_weight = cfs_rq->h_nr_running - cfs_rq->h_nr_delayed;
+	}
 }
 
 static inline long se_runnable(struct sched_entity *se)
-- 
cgit v1.2.3

From b53127db1dbf7f1047cf35c10922d801dcd40324 Mon Sep 17 00:00:00 2001
From: "Vineeth Pillai (Google)"
Date: Thu, 12 Dec 2024 22:22:36 -0500
Subject: sched/dlserver: Fix dlserver double enqueue

dlserver can get dequeued during a dlserver pick_task due to the delayed
dequeue feature and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to double enqueue of
dlserver.

Double enqueue of dlserver could happen for a couple of reasons:

Case 1
------

Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
  __pick_next_task
   pick_task_dl -> server_pick_task
    pick_task_fair
     pick_next_entity (if (sched_delayed))
      dequeue_entities
       dl_server_stop

server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.

Case 2
------
A race condition between a task dequeue on one cpu and same task's
enqueue on this cpu by a remote cpu while the lock is released causing
dlserver double enqueue.
One CPU would be in schedule(), releasing the RQ-lock:

current->state = TASK_INTERRUPTIBLE();
        schedule();
          deactivate_task()
            dl_server_stop();
          pick_next_task()
            pick_next_task_fair()
              sched_balance_newidle()
                rq_unlock(this_rq)

at which point another CPU can take our RQ-lock and do:

        try_to_wake_up()
          ttwu_queue()
            rq_lock()
            ...
            activate_task()
              dl_server_start() --> first enqueue
            wakeup_preempt() := check_preempt_wakeup_fair()
              update_curr()
                update_curr_task()
                  if (current->dl_server)
                    dl_server_update()
                      enqueue_dl_entity() --> second enqueue

This bug was not apparent because the enqueue in dl_server_start usually
doesn't happen, thanks to the defer logic. But as a side effect of the first
case (dequeue during dlserver pick), dl_throttled and dl_yielded will be
set. This messes up the dlserver time accounting and then leads to an
enqueue in dl_server_start.

Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dl_server_stop.
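As an illustration only, the effect of the flag on the pick path can be sketched in a userspace toy model. Every name below (struct toy_dl_server, the toy_* helpers, the accounted counter) is hypothetical and merely stands in for the kernel's sched_dl_entity state; the sketch assumes the only behaviour of interest is "skip yield and accounting when the server was stopped during the pick".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; the real state lives in struct sched_dl_entity. */
struct toy_dl_server {
	bool active;    /* plays the role of dl_server_active */
	bool yielded;   /* plays the role of dl_yielded */
	int  accounted; /* how many times pick-path accounting ran */
};

/* Modeled on dl_server_start(): mark the server active on enqueue. */
static void toy_server_start(struct toy_dl_server *s)
{
	s->active = true;
}

/* Modeled on dl_server_stop(): clear the flag on dequeue. */
static void toy_server_stop(struct toy_dl_server *s)
{
	s->active = false;
}

/*
 * Pick path after the fair pick returned no task. Yield and account
 * only if the server is still active; a server stopped during the
 * pick is left untouched.
 */
static void toy_pick_found_nothing(struct toy_dl_server *s)
{
	if (s->active) {
		s->yielded = true;
		s->accounted++; /* stand-in for update_curr_dl_se() */
	}
}

/* Replays Case 1 and returns the number of accounting passes that ran. */
static int toy_replay_case1(void)
{
	struct toy_dl_server s = { 0 };

	toy_server_start(&s);
	toy_server_stop(&s); /* delayed dequeue stops the server mid-pick */
	toy_pick_found_nothing(&s);
	assert(!s.yielded);
	return s.accounted;
}

/* Same pick path with the server still active, for comparison. */
static int toy_replay_active(void)
{
	struct toy_dl_server s = { 0 };

	toy_server_start(&s);
	toy_pick_found_nothing(&s);
	assert(s.yielded);
	return s.accounted;
}
```

In this model the stopped server sees no yield and no accounting pass, while an active server that finds nothing to run is yielded and accounted exactly once.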
Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Suggested-by: Peter Zijlstra
Signed-off-by: "Vineeth Pillai (Google)"
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Marcel Ziswiler # ROCK 5B
Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
---
 kernel/sched/deadline.c | 8 ++++++--
 kernel/sched/sched.h    | 5 +++++
 2 files changed, 11 insertions(+), 2 deletions(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index db47f33cb7d2..d94f2ed6d1f4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1647,6 +1647,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 	if (!dl_se->dl_runtime)
 		return;

+	dl_se->dl_server_active = 1;
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
 	if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
 		resched_curr(dl_se->rq);
@@ -1661,6 +1662,7 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	hrtimer_try_to_cancel(&dl_se->dl_timer);
 	dl_se->dl_defer_armed = 0;
 	dl_se->dl_throttled = 0;
+	dl_se->dl_server_active = 0;
 }

 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
@@ -2421,8 +2423,10 @@ again:
 	if (dl_server(dl_se)) {
 		p = dl_se->server_pick_task(dl_se);
 		if (!p) {
-			dl_se->dl_yielded = 1;
-			update_curr_dl_se(rq, dl_se, 0);
+			if (dl_server_active(dl_se)) {
+				dl_se->dl_yielded = 1;
+				update_curr_dl_se(rq, dl_se, 0);
+			}
 			goto again;
 		}
 		rq->dl_server = dl_se;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1e494af2cd23..c5d67a43fe52 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -398,6 +398,11 @@ extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);

+static inline bool dl_server_active(struct sched_dl_entity *dl_se)
+{
+	return dl_se->dl_server_active;
+}
+
 #ifdef CONFIG_CGROUP_SCHED

 extern struct list_head task_groups;
--
cgit v1.2.3

From c7f7e9c73178e0e342486fd31e7f363ef60e3f83 Mon Sep 17 00:00:00 2001
From: "Vineeth Pillai (Google)"
Date: Thu, 12 Dec 2024 22:22:37 -0500
Subject: sched/dlserver: Fix dlserver time accounting

dlserver time is accounted when:
 - dlserver is active and the dlserver proxies the cfs task.
 - dlserver is active but deferred, and the cfs task runs after being
   picked through the normal fair class pick.

dl_server_update is called in two places to make sure that both of the
above times are accounted for. But it doesn't check whether dlserver is
active. Now that we have the dl_server_active flag, we can consolidate
dl_server_update into one place, and all we need to check is whether
dlserver is active. When dlserver is active there are only two possible
conditions:
 - dlserver is deferred.
 - cfs task is running on behalf of dlserver.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)"
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Marcel Ziswiler # ROCK 5B
Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org
---
 kernel/sched/fair.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

(limited to 'kernel/sched')

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97ee48c8bf5e..53a4f78b8340 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1159,8 +1159,6 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
 	trace_sched_stat_runtime(p, delta_exec);
 	account_group_exec_runtime(p, delta_exec);
 	cgroup_account_cputime(p, delta_exec);
-	if (p->dl_server)
-		dl_server_update(p->dl_server, delta_exec);
 }

 static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
@@ -1237,11 +1235,16 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		update_curr_task(p, delta_exec);

 		/*
-		 * Any fair task that runs outside of fair_server should
-		 * account against fair_server such that it can account for
-		 * this time and possibly avoid running this period.
+		 * If the fair_server is active, we need to account for the
+		 * fair_server time whether or not the task is running on
+		 * behalf of fair_server or not:
+		 *  - If the task is running on behalf of fair_server, we need
+		 *    to limit its time based on the assigned runtime.
+		 *  - Fair task that runs outside of fair_server should account
+		 *    against fair_server such that it can account for this time
+		 *    and possibly avoid running this period.
 		 */
-		if (p->dl_server != &rq->fair_server)
+		if (dl_server_active(&rq->fair_server))
 			dl_server_update(&rq->fair_server, delta_exec);
 	}

--
cgit v1.2.3
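The consolidated accounting rule can be pictured with a small userspace sketch. It is not kernel code: struct toy_fair_server, toy_server_update(), toy_update_curr() and toy_charge() are hypothetical names, and the budget is an arbitrary integer. The sketch assumes only what the patch states, namely that there is a single call site and that it charges the server whenever the server is active, whether it is deferred or proxying the task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the rq's fair_server entity. */
struct toy_fair_server {
	bool    active;  /* plays the role of dl_server_active */
	int64_t runtime; /* remaining budget, arbitrary units */
};

/* Stand-in for dl_server_update(): charge delta_exec to the budget. */
static void toy_server_update(struct toy_fair_server *s, int64_t delta_exec)
{
	s->runtime -= delta_exec;
}

/*
 * Single call site modeled on update_curr() after this patch: charge
 * the server whenever it is active, regardless of how the running
 * fair task was picked.
 */
static void toy_update_curr(struct toy_fair_server *s, int64_t delta_exec)
{
	if (s->active)
		toy_server_update(s, delta_exec);
}

/* Helper for experiments: returns the budget left after one update. */
static int64_t toy_charge(bool active, int64_t budget, int64_t delta_exec)
{
	struct toy_fair_server s = { .active = active, .runtime = budget };

	toy_update_curr(&s, delta_exec);
	assert(s.runtime <= budget);
	return s.runtime;
}
```

With a budget of 100 and a delta of 30, the active server is left with 70 while an inactive server keeps its full 100, which is the distinction the removed p->dl_server checks could not express.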