| author | Alexei Starovoitov <ast@kernel.org> | 2025-09-08 09:56:59 -0700 |
|---|---|---|
| committer | Alexei Starovoitov <ast@kernel.org> | 2025-09-08 10:04:37 -0700 |
| commit | 60ef54156148ffaa719845bc5b1fdeafa67763fc (patch) | |
| tree | c2d395afe4ef6141d2169ffc22f8da6b0806937f | |
| parent | 93a83d044314b041ffe2bb1d43b8b0cea7f60921 (diff) | |
| parent | a857210b104f5c186e883d0f2b0eb660349c1e54 (diff) | |
Merge branch 'bpf-replace-wq-users-and-add-wq_percpu-to-alloc_workqueue-users'
Marco Crivellari says:
====================
Below is a summary of a discussion about the Workqueue API and cpu isolation
considerations. Details and more information are available here:
"workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND."
https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
=== Current situation: problems ===
Let's consider a nohz_full system with isolated CPUs: wq_unbound_cpumask is
set to the housekeeping CPUs, while for !WQ_UNBOUND workqueues the local CPU
is selected.
This leads to different behavior when a work item is scheduled on an isolated
CPU, depending on whether the "delay" value is 0 or greater than 0:
schedule_delayed_work(, 0);
This is handled by __queue_work(), which queues the work item on the current,
local (isolated) CPU, while:
schedule_delayed_work(, 1);
moves the timer to a housekeeping CPU and schedules the work there.
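As an illustration, here is a minimal sketch of the two cases above, using a
hypothetical work item (example_fn and example_dwork are made-up names, not
part of this series):

#include <linux/workqueue.h>

static void example_fn(struct work_struct *work)
{
	/* deferred work; runs on whichever CPU the workqueue picked */
}
static DECLARE_DELAYED_WORK(example_dwork, example_fn);

static void example_queue(bool delayed)
{
	if (!delayed)
		/* delay == 0: __queue_work() queues the item right away on
		 * the current, possibly isolated, CPU. */
		schedule_delayed_work(&example_dwork, 0);
	else
		/* delay > 0: the timer is armed on a housekeeping CPU and
		 * the work item is queued there when it fires. */
		schedule_delayed_work(&example_dwork, 1);
}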
Currently, when a user enqueues a work item using schedule_delayed_work(), the
workqueue used is "system_wq" (a per-CPU wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when no CPU is specified). The same applies to
schedule_work(), which uses system_wq and queue_work(), which in turn again
makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
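For reference, the current wrappers behave roughly as follows (a paraphrase of
include/linux/workqueue.h; exact definitions may differ between kernel
versions):

static inline bool queue_work(struct workqueue_struct *wq,
			      struct work_struct *work)
{
	/* no CPU specified: WORK_CPU_UNBOUND */
	return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}

static inline bool schedule_work(struct work_struct *work)
{
	/* system_wq is a per-CPU workqueue */
	return queue_work(system_wq, work);
}

static inline bool schedule_delayed_work(struct delayed_work *dwork,
					 unsigned long delay)
{
	return queue_delayed_work(system_wq, dwork, delay);
}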
=== Plan and future plans ===
This patchset is the first step of a refactoring needed to address the points
mentioned above; in the long term it will also have a positive impact on CPU
isolation, moving away from the per-CPU workqueue model in favor of an
unbound one.
These are the main steps:
1) API refactoring (introduced by this patchset)
   - Make the system wq names clearer and more uniform, both per-CPU and
     unbound, to avoid any confusion about what should be used.
   - Introduce WQ_PERCPU: this flag is the complement of WQ_UNBOUND. It is
     added in this patchset and applied to all callers that do not currently
     use WQ_UNBOUND (see the sketch after this list).
     WQ_UNBOUND will be removed in a future release cycle.
     Most users don't need to be per-CPU, because they have no locality
     requirements; because of that, a future step will be to make "unbound"
     the default behavior.
2) Check who really needs to be per-cpu
   - Remove the WQ_PERCPU flag where it is not strictly required.
3) Add a new API (prefer local cpu)
   - There are users that don't require local execution, as mentioned above;
     despite that, local execution yields a performance gain.
     This new API will prefer local execution, without requiring it.
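A minimal sketch of the convention introduced by step 1 (the workqueue names
below are hypothetical; WQ_PERCPU is provided by the prerequisite commits
listed under "For Maintainers"):

#include <linux/workqueue.h>

static struct workqueue_struct *example_percpu_wq;
static struct workqueue_struct *example_unbound_wq;

static int __init example_wq_init(void)
{
	/* Explicitly per-CPU instead of relying on the historical
	 * "no flag means per-CPU" default. */
	example_percpu_wq = alloc_workqueue("example_percpu", WQ_PERCPU, 1);
	if (!example_percpu_wq)
		return -ENOMEM;

	/* No locality requirement: the unbound pool respects
	 * wq_unbound_cpumask and thus CPU isolation. */
	example_unbound_wq = alloc_workqueue("example_unbound", WQ_UNBOUND, 0);
	if (!example_unbound_wq) {
		destroy_workqueue(example_percpu_wq);
		return -ENOMEM;
	}
	return 0;
}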
=== Introduced Changes by this series ===
1) [P 1-2] Replace use of system_wq and system_unbound_wq
   system_wq is a per-CPU workqueue, but its name does not make that clear.
   system_unbound_wq is to be used when locality is not required.
   Because of that, system_wq has been renamed to system_percpu_wq, and
   system_unbound_wq has been renamed to system_dfl_wq.
2) [P 3] add WQ_PERCPU to remaining alloc_workqueue() users
   Every alloc_workqueue() caller should use either WQ_PERCPU or WQ_UNBOUND.
   This is now enforced with a warning if both or neither of them are present
   at the same time.
   WQ_UNBOUND will be removed in a future release cycle.
=== For Maintainers ===
There are prerequisites for this series, already merged in the master branch.
The commits are:
128ea9f6ccfb6960293ae4212f4f97165e42222d ("workqueue: Add system_percpu_wq and
system_dfl_wq")
930c2ea566aff59e962c50b2421d5fcc3b98b8be ("workqueue: Add new WQ_PERCPU flag")
====================
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20250905085309.94596-1-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| -rw-r--r-- | kernel/bpf/cgroup.c | 5 |
| -rw-r--r-- | kernel/bpf/cpumap.c | 2 |
| -rw-r--r-- | kernel/bpf/helpers.c | 4 |
| -rw-r--r-- | kernel/bpf/memalloc.c | 2 |
| -rw-r--r-- | kernel/bpf/syscall.c | 2 |
5 files changed, 8 insertions, 7 deletions
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 9912c7b9a266..248f517d66d0 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -27,14 +27,15 @@ EXPORT_SYMBOL(cgroup_bpf_enabled_key);
 /*
  * cgroup bpf destruction makes heavy use of work items and there can be a lot
  * of concurrent destructions. Use a separate workqueue so that cgroup bpf
- * destruction work items don't end up filling up max_active of system_wq
+ * destruction work items don't end up filling up max_active of system_percpu_wq
  * which may lead to deadlock.
  */
 static struct workqueue_struct *cgroup_bpf_destroy_wq;
 
 static int __init cgroup_bpf_wq_init(void)
 {
-	cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
+	cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy",
+						WQ_PERCPU, 1);
 	if (!cgroup_bpf_destroy_wq)
 		panic("Failed to alloc workqueue for cgroup bpf destroy.\n");
 	return 0;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index b2b7b8ec2c2a..a9b347ccbec2 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -550,7 +550,7 @@ static void __cpu_map_entry_replace(struct bpf_cpu_map *cmap,
 	old_rcpu = unrcu_pointer(xchg(&cmap->cpu_map[key_cpu], RCU_INITIALIZER(rcpu)));
 	if (old_rcpu) {
 		INIT_RCU_WORK(&old_rcpu->free_work, __cpu_map_entry_free);
-		queue_rcu_work(system_wq, &old_rcpu->free_work);
+		queue_rcu_work(system_percpu_wq, &old_rcpu->free_work);
 	}
 }
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 588bc7e36436..1ef1e65bd7d0 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1594,7 +1594,7 @@ void bpf_timer_cancel_and_free(void *val)
 	 * timer callback.
 	 */
 	if (this_cpu_read(hrtimer_running)) {
-		queue_work(system_unbound_wq, &t->cb.delete_work);
+		queue_work(system_dfl_wq, &t->cb.delete_work);
 		return;
 	}
 
@@ -1607,7 +1607,7 @@ void bpf_timer_cancel_and_free(void *val)
 		if (hrtimer_try_to_cancel(&t->timer) >= 0)
 			kfree_rcu(t, cb.rcu);
 		else
-			queue_work(system_unbound_wq, &t->cb.delete_work);
+			queue_work(system_dfl_wq, &t->cb.delete_work);
 	} else {
 		bpf_timer_delete_work(&t->cb.delete_work);
 	}
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 889374722d0a..bd45dda9dc35 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -736,7 +736,7 @@ static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
 	/* Defer barriers into worker to let the rest of map memory to be freed */
 	memset(ma, 0, sizeof(*ma));
 	INIT_WORK(&copy->work, free_mem_alloc_deferred);
-	queue_work(system_unbound_wq, &copy->work);
+	queue_work(system_dfl_wq, &copy->work);
 }
 
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fbfa8532c39..3f178a0f8eb1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -905,7 +905,7 @@ static void bpf_map_free_in_work(struct bpf_map *map)
 	/* Avoid spawning kworkers, since they all might contend
 	 * for the same mutex like slab_mutex.
 	 */
-	queue_work(system_unbound_wq, &map->work);
+	queue_work(system_dfl_wq, &map->work);
 }
 
 static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
