path: root/kernel
Age | Commit message | Author
2026-01-13 | genirq: Update effective affinity for redirected interrupts | Radu Rendec
For redirected interrupts, irq_chip_redirect_set_affinity() does not update the effective affinity mask, which then triggers the warning in irq_validate_effective_affinity(). Also, because the effective affinity mask is empty, the cpumask_test_cpu(smp_processor_id(), m) condition in demux_redirect_remote() is always false, and the interrupt is always redirected, even if it's already running on the target CPU. Set the effective affinity mask to be the same as the requested affinity mask. It's worth noting that irq_do_set_affinity() filters out offline CPUs before calling chip->irq_set_affinity() (unless `force` is set), so the mask passed to irq_chip_redirect_set_affinity() is already filtered. The solution is not ideal because it may lie about the effective affinity of the demultiplexed ("child") interrupt. If the requested affinity mask includes multiple CPUs, the effective affinity, in reality, is the intersection between the requested mask and the demultiplexing ("parent") interrupt's effective affinity mask, plus the first CPU in the requested mask. Accurately describing the effective affinity of the demultiplexed interrupt is not trivial because it requires keeping track of the demultiplexing interrupt's effective affinity. That is tricky in the context of CPU hot(un)plugging, where interrupt migration ordering is not guaranteed. The solution in the initial version of the fixed patch, which stored the first CPU of the demultiplexing interrupt's effective affinity in the `target_cpu` field, has its own drawbacks and limitations. Fixes: fcc1d0dabdb6 ("genirq: Add interrupt redirection infrastructure") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Radu Rendec <rrendec@redhat.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Jon Hunter <jonathanh@nvidia.com> Link: https://patch.msgid.link/20260112211402.2927336-1-rrendec@redhat.com Closes: https://lore.kernel.org/all/44509520-f29b-4b8a-8986-5eae3e022eb7@nvidia.com/
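The effect of an empty effective affinity mask can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `cpu_mask_t`, `mask_test_cpu()`, `needs_redirect()` and `set_affinity()` are hypothetical stand-ins (a 64-bit bitmask replaces `struct cpumask`), not the kernel functions named in the commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t cpu_mask_t;

/* Simplified model of a "is this CPU in the mask" test. */
static bool mask_test_cpu(unsigned int cpu, cpu_mask_t mask)
{
	return cpu < 64 && ((mask >> cpu) & 1u);
}

/* Models the check described above: redirect unless the current CPU is in the effective mask. */
static bool needs_redirect(unsigned int this_cpu, cpu_mask_t effective)
{
	return !mask_test_cpu(this_cpu, effective);
}

/* Models the fix: the effective mask simply mirrors the (already filtered) requested mask. */
static cpu_mask_t set_affinity(cpu_mask_t requested)
{
	return requested;
}
```

With an empty effective mask, `needs_redirect()` is true for every CPU, which matches the "always redirected" symptom; once the effective mask mirrors the requested one, an interrupt already running on a requested CPU is handled locally.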
2026-01-13 | genirq: Warn about using IRQF_ONESHOT without a threaded handler | Sebastian Andrzej Siewior
IRQF_ONESHOT disables the interrupt source until after the threaded handler has completed its work. This is needed to allow the threaded handler to run; otherwise the CPU will get back to the interrupt handler because the interrupt source remains active, and the threaded handler will not be able to do its work. Specifying IRQF_ONESHOT without a threaded handler does not make sense. It could be a leftover if the handler _was_ threaded and changed back to primary and the flag was not removed. This can be problematic in the `threadirqs' case because the handler is exempt from forced-threading. This in turn can become a problem on a PREEMPT_RT system if the handler attempts to acquire sleeping locks. Warn about missing threaded handlers with the IRQF_ONESHOT flag. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Link: https://patch.msgid.link/20260112134013.eQWyReHR@linutronix.de
2026-01-12 | bpf, btf: Enforce destructor kfunc type with CFI | Sami Tolvanen
Ensure that registered destructor kfuncs have the same type as btf_dtor_kfunc_t to avoid a kernel panic on systems with CONFIG_CFI enabled. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260110082548.113748-10-samitolvanen@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-12 | bpf: crypto: Use the correct destructor kfunc type | Sami Tolvanen
With CONFIG_CFI enabled, the kernel strictly enforces that indirect function calls use a function pointer type that matches the target function. I ran into the following type mismatch when running BPF self-tests: CFI failure at bpf_obj_free_fields+0x190/0x238 (target: bpf_crypto_ctx_release+0x0/0x94; expected type: 0xa488ebfc) Internal error: Oops - CFI: 00000000f2008228 [#1] SMP ... As bpf_crypto_ctx_release() is also used in BPF programs and using a void pointer as the argument would make the verifier unhappy, add a simple stub function with the correct type and register it as the destructor kfunc instead. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Viktor Malik <vmalik@redhat.com> Link: https://lore.kernel.org/r/20260110082548.113748-7-samitolvanen@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
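The stub approach described above can be sketched in plain C. Everything here is illustrative: `struct demo_ctx`, `dtor_fn_t`, `demo_ctx_release()`, `demo_ctx_release_dtor()` and `free_field()` are hypothetical names, and `dtor_fn_t` is only assumed to resemble a destructor type taking a `void *`. The point is that the function registered for indirect calls has exactly the pointer type the caller uses, and merely forwards to the typed function.

```c
/* Hypothetical context type standing in for the real object. */
struct demo_ctx {
	int refs;
};

/* Assumed destructor type: the indirect caller only knows about void pointers. */
typedef void (*dtor_fn_t)(void *);

static int released;

/* Typed release function, as an API exposed to typed callers might declare it. */
static void demo_ctx_release(struct demo_ctx *ctx)
{
	if (--ctx->refs == 0)
		released++;
}

/* Stub with the exact destructor type; it only forwards to the typed function. */
static void demo_ctx_release_dtor(void *ctx)
{
	demo_ctx_release(ctx);
}

/* Indirect call through the generic pointer type, as a field-freeing path would do. */
static void free_field(dtor_fn_t dtor, void *obj)
{
	dtor(obj);
}
```

Registering `demo_ctx_release` directly would require casting it to `dtor_fn_t`, which is the kind of type mismatch a CFI check rejects at the indirect call site.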
2026-01-12 | Merge tag 'cgroup-for-6.19-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup | Linus Torvalds
Pull cgroup fix from Tejun Heo: - Fix -Wflex-array-member-not-at-end warnings in cgroup_root * tag 'cgroup-for-6.19-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Eliminate cgrp_ancestor_storage in cgroup_root
2026-01-12 | cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held() | Zhao Mengmeng
We already added lockdep_assert_cpuset_lock_held(); use this new function for consistency. Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 | cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change() | Waiman Long
As stated in commit 1c09b195d37f ("cpuset: fix a regression in validating config change"), it is not allowed to clear masks of a cpuset if there're tasks in it. This is specific to v1 since empty "cpuset.cpus" or "cpuset.mems" will cause the v2 cpuset to inherit the effective CPUs or memory nodes from its parent. So it is OK to have empty cpus or mems even if there are tasks in the cpuset. Move this empty cpus/mems check in validate_change() to cpuset1_validate_change() to allow more flexibility in setting cpus or mems in v2. cpuset_is_populated() needs to be moved into cpuset-internal.h as it is needed by the empty cpus/mems checking code. Also add a test case to test_cpuset_prs.sh to verify that. Reported-by: Chen Ridong <chenridong@huaweicloud.com> Closes: https://lore.kernel.org/lkml/7a3ec392-2e86-4693-aa9f-1e668a668b9c@huaweicloud.com/ Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 | cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict | Waiman Long
Currently, when setting a cpuset's cpuset.cpus to a value that conflicts with the cpuset.cpus/cpuset.cpus.exclusive of a sibling partition, the sibling's partition state becomes invalid. This is overly harsh and is probably not necessary.

The cpuset.cpus.exclusive control file, if set, will override the cpuset.cpus of the same cpuset when creating a cpuset partition. So cpuset.cpus has less priority than cpuset.cpus.exclusive in setting up a partition. However, it cannot override a conflicting cpuset.cpus file in a sibling cpuset and the partition creation process will fail. This is inconsistent. That will also make using cpuset.cpus.exclusive less valuable as a tool to set up cpuset partitions as the users have to check if such a cpuset.cpus conflict exists or not.

Fix these problems by making sure that once a cpuset.cpus.exclusive is set without failure, it will always be allowed to form a valid partition as long as at least one CPU can be granted from its parent irrespective of the state of the siblings' cpuset.cpus values. Of course, setting cpuset.cpus.exclusive will fail if it conflicts with the cpuset.cpus.exclusive or the cpuset.cpus.exclusive.effective value of a sibling.

A partition can still be created by setting only cpuset.cpus without setting cpuset.cpus.exclusive. However, any conflicting CPUs in sibling's cpuset.cpus.exclusive.effective and cpuset.cpus.exclusive values will be removed from its cpuset.cpus.exclusive.effective as long as there is still one or more CPUs left and can be granted from its parent. This CPU stripping is currently done in rm_siblings_excl_cpus().

The new code will now try its best to enable the creation of new partitions with only cpuset.cpus set without invalidating existing ones. However it is not guaranteed that all the CPUs requested in cpuset.cpus will be used in the new partition even when all these CPUs can be granted from the parent. This is similar to the fact that cpuset.cpus.effective may not be able to include all the CPUs requested in cpuset.cpus. In this case, the parent may not be able to grant all the exclusive CPUs requested in cpuset.cpus to cpuset.cpus.exclusive.effective if some of them have already been granted to other partitions earlier.

With the creation of multiple sibling partitions by setting only cpuset.cpus, this does have the side effect that their exact cpuset.cpus.exclusive.effective settings will depend on the order of partition creation if there are conflicts. Due to the exclusive nature of the CPUs in a partition, it is not easy to make it fair other than the old behavior of invalidating all the conflicting partitions. For example,

  # echo "0-2" > A1/cpuset.cpus
  # echo "root" > A1/cpuset.cpus.partition
  # cat A1/cpuset.cpus.partition
  root
  # cat A1/cpuset.cpus.exclusive.effective
  0-2
  # echo "2-4" > B1/cpuset.cpus
  # echo "root" > B1/cpuset.cpus.partition
  # cat B1/cpuset.cpus.partition
  root
  # cat B1/cpuset.cpus.exclusive.effective
  3-4
  # cat B1/cpuset.cpus.effective
  3-4

For users who want to be sure that they can get most of the CPUs they want, cpuset.cpus.exclusive should be used instead if they can set it successfully without failure. Setting cpuset.cpus.exclusive will guarantee that sibling conflicts from then onward are no longer possible.

To make this change, we have to separate out the is_cpu_exclusive() check in cpus_excl_conflict() into a cgroup v1 only cpuset1_cpus_excl_conflict() helper. The cpus_allowed_validate_change() helper is now no longer needed and can be removed.

Some existing tests in test_cpuset_prs.sh are updated and new ones are added to reflect the new behavior. The cgroup-v2.rst doc file is also updated to clarify what exclusive CPUs will be used when a partition is created.

Reported-by: Sun Shaojie <sunshaojie@kylinos.cn>
Closes: https://lore.kernel.org/lkml/20251117015708.977585-1-sunshaojie@kylinos.cn/
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
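The A1/B1 example in the message above can be reproduced with simple bitmask arithmetic. This is a minimal sketch, not the kernel's rm_siblings_excl_cpus(): `cpu_mask_t`, `cpu_range()` and `strip_sibling_excl()` are hypothetical helpers, masks are limited to 64 CPUs, and a zero result is assumed to mean "no exclusive CPUs left to form a partition".

```c
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t cpu_mask_t;

/* Build a mask covering CPUs first..last inclusive (CPUs >= 64 are ignored). */
static cpu_mask_t cpu_range(unsigned int first, unsigned int last)
{
	cpu_mask_t m = 0;

	for (unsigned int cpu = first; cpu <= last && cpu < 64; cpu++)
		m |= (cpu_mask_t)1 << cpu;
	return m;
}

/*
 * Remove the CPUs that siblings already hold exclusively from the
 * requested set. A result of 0 means nothing is left (assumption:
 * the caller then refuses to create the partition).
 */
static cpu_mask_t strip_sibling_excl(cpu_mask_t requested, cpu_mask_t sibling_excl)
{
	return requested & ~sibling_excl;
}
```

With A1 holding CPUs 0-2 exclusively, a request for 2-4 in B1 yields 3-4, matching the cpuset.cpus.exclusive.effective value shown in the example. Reversing the creation order would instead give B1 2-4 and leave A1 with 0-1, which is the order dependence the message mentions.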
2026-01-12 | cgroup/cpuset: Don't fail cpuset.cpus change in v2 | Waiman Long
Commit fe8cd2736e75 ("cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition") introduced a new check to disallow the setting of a new cpuset.cpus.exclusive value that is a superset of a sibling's cpuset.cpus value so that there will at least be one CPU left in the sibling in case the cpuset becomes a valid partition root. This new check does have the side effect of failing a cpuset.cpus change that makes it a subset of a sibling's cpuset.cpus.exclusive value. With v2, users are supposed to be allowed to set whatever value they want in cpuset.cpus without failure. To maintain this rule, the check is now restricted to only when cpuset.cpus.exclusive is being changed, not when cpuset.cpus is changed. The cgroup-v2.rst doc file is also updated to reflect this change. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 | cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier() | Waiman Long
Since commit f62a5d39368e ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition"), the compute_effective_exclusive_cpumask() helper was extended to strip exclusive CPUs from siblings when computing effective_xcpus (cpuset.cpus.exclusive.effective). This helper was later renamed to compute_excpus() in commit 86bbbd1f33ab ("cpuset: Refactor exclusive CPU mask computation logic"). This helper is supposed to be used consistently to compute effective_xcpus. However, there is an exception within the callback critical section in update_cpumasks_hier() when exclusive_cpus of a valid partition root is empty. This can cause effective_xcpus value to differ depending on where exactly it is last computed. Fix this by using compute_excpus() in this case to give a consistent result. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 | cgroup/cpuset: Streamline rm_siblings_excl_cpus() | Waiman Long
If exclusive_cpus is set, effective_xcpus must be a subset of exclusive_cpus. Currently, rm_siblings_excl_cpus() checks both exclusive_cpus and effective_xcpus consecutively. It is simpler to check only exclusive_cpus if non-empty or just effective_xcpus otherwise. No functional change is expected. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 | Merge back material related to system sleep for 6.20 | Rafael J. Wysocki
2026-01-12 | sched: Move clock related paravirt code to kernel/sched | Juergen Gross
Paravirt clock related functions are available in multiple archs. In order to share the common parts, move the common static keys to kernel/sched/ and remove them from the arch specific files. Make a common paravirt_steal_clock() implementation available in kernel/sched/cputime.c, guarding it with a new config option CONFIG_HAVE_PV_STEAL_CLOCK_GEN, which can be selected by an arch in case it wants to use that common variant. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260105110520.21356-7-jgross@suse.com
2026-01-12 | paravirt: Remove asm/paravirt_api_clock.h | Juergen Gross
All architectures supporting CONFIG_PARAVIRT share the same contents of asm/paravirt_api_clock.h: #include <asm/paravirt.h> So remove all incarnations of asm/paravirt_api_clock.h and remove the only place where it is included, as there asm/paravirt.h is included anyway. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc, scheduler bits Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260105110520.21356-6-jgross@suse.com
2026-01-12 | verification/rvgen: Remove unused variable declaration from containers | Gabriele Monaco
The monitor container source files contained a declaration and a definition for the rv_monitor variable. The former is superfluous and can be removed. Remove the variable declaration from the template as well as the existing monitor containers. Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20251126104241.291258-9-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2026-01-12 | verification/dot2c: Remove superfluous enum assignment and add last comma | Gabriele Monaco
The header files generated by dot2c currently create enums for states and events assigning the first element to 0. This is superfluous as it happens automatically if no value is specified. Also it doesn't add a comma to the last enum elements, which slightly complicates the diff if states or events are added. Remove the assignment to 0 and add a comma to last elements, this simplifies the logic for the code generator. Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20251126104241.291258-8-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2026-01-12 | rv: Refactor da_monitor to minimise macros | Gabriele Monaco
The da_monitor helper functions are generated from macros of the type:

  DECLARE_DA_FUNCTION(name, type) \
  static void da_func_x_##name(type arg) {} \
  static void da_func_y_##name(type arg) {} \

This is good to minimise code duplication, but long macros built from escaped line endings are rather hard to parse. Since functions are static, the advantage of naming them differently for each monitor is minimal.

Refactor the da_monitor.h file to minimise macros: instead of declaring functions from macros, we simply declare them with the same name for all monitors (e.g. da_func_x), and for any remaining reference to the monitor name (e.g. tracepoints, enums, global variables) we use the CONCATENATE macro. In this way the file is much easier to maintain while keeping the same generality.

Functions depending on the monitor types are now conditionally compiled according to the value of RV_MON_TYPE, which must be defined in the monitor source. The monitor type can be specified as in the original implementation, although it's best to keep the default implementation (unsigned char) as not all parts of code support larger data types, and likely there's no need.

We keep the empty macro definitions to ease review of this change with diff tools, but cleanup is required. Also adapt existing monitors to keep the build working.

Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20251126104241.291258-2-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
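The "same function name, concatenated global name" idea described above can be sketched as follows. This is an illustration only: `DEMO_CONCAT`, `DEMO_MONITOR_NAME`, `da_mon_count_` and the two functions are hypothetical and do not reproduce the contents of da_monitor.h; the two-level macro mirrors how a token-pasting helper such as CONCATENATE is usually written.

```c
/* Two-level concatenation so that macro arguments are expanded before pasting. */
#define DEMO_CONCAT_(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_(a, b)

/* Hypothetical monitor name, defined once per monitor source file. */
#define DEMO_MONITOR_NAME wip

/* A global that must stay unique per monitor: expands to da_mon_count_wip. */
static int DEMO_CONCAT(da_mon_count_, DEMO_MONITOR_NAME);

/* Plain static functions: the same names can be reused in every monitor file. */
static void da_handle_event(void)
{
	DEMO_CONCAT(da_mon_count_, DEMO_MONITOR_NAME)++;
}

static void da_monitor_reset(void)
{
	DEMO_CONCAT(da_mon_count_, DEMO_MONITOR_NAME) = 0;
}
```

Because the functions are `static`, each translation unit gets its own copy without name clashes, so only symbols visible outside the file need the monitor name pasted in.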
2026-01-11 | Merge tag 'sched-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull scheduler fix from Ingo Molnar: "Fix a crash in sched_mm_cid_after_execve()" * tag 'sched-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve()
2026-01-11 | Merge tag 'perf-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull perf event fix from Ingo Molnar: "Fix perf swevent hrtimer deinit regression" * tag 'perf-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Ensure swevent hrtimer is properly destroyed
2026-01-11 | treewide: Update email address | Thomas Gleixner
In a vain attempt to consolidate the email zoo switch everything to the kernel.org account. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-01-11 | Merge branch 'rcu-misc.20260111a' | Boqun Feng
* rcu-misc.20260111a: rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early srcu: Use suitable gfp_flags for the init_srcu_struct_nodes() rcu: Fix rcu_read_unlock() deadloop due to softirq rcutorture: Correctly compute probability to invoke ->exp_current() rcu: Make expedited RCU CPU stall warnings detect stall-end races
2026-01-11 | rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early | Joel Fernandes
The RCU grace period mechanism uses a two-phase FQS (Force Quiescent State) design where the first FQS saves dyntick-idle snapshots and the second FQS compares them. This results in long and unnecessary latency for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with 1000HZ) whenever one FQS wait sufficed.

Some investigations showed that the GP kthread's CPU is often the holdout CPU after the first FQS, because it cannot be detected as "idle" while it is actively running the FQS scan in the GP kthread.

Therefore, at the end of rcu_gp_init(), immediately report a quiescent state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The GP kthread cannot be in an RCU read-side critical section while running GP initialization, so this is safe and results in significant latency improvements.

The following tests were performed:

(1) synchronize_rcu() benchmarking

100 synchronize_rcu() calls with 32 CPUs, 10 runs each (default fqs jiffies settings):

Baseline (without fix):

| Run | Mean      | Min      | Max       |
|-----|-----------|----------|-----------|
| 1   | 10.088 ms | 9.989 ms | 18.848 ms |
| 2   | 10.064 ms | 9.982 ms | 16.470 ms |
| 3   | 10.051 ms | 9.988 ms | 15.113 ms |
| 4   | 10.125 ms | 9.929 ms | 22.411 ms |
| 5   | 8.695 ms  | 5.996 ms | 15.471 ms |
| 6   | 10.157 ms | 9.977 ms | 25.723 ms |
| 7   | 10.102 ms | 9.990 ms | 20.224 ms |
| 8   | 8.050 ms  | 5.985 ms | 10.007 ms |
| 9   | 10.059 ms | 9.978 ms | 15.934 ms |
| 10  | 10.077 ms | 9.984 ms | 17.703 ms |

With fix:

| Run | Mean     | Min      | Max       |
|-----|----------|----------|-----------|
| 1   | 6.027 ms | 5.915 ms | 8.589 ms  |
| 2   | 6.032 ms | 5.984 ms | 9.241 ms  |
| 3   | 6.010 ms | 5.986 ms | 7.004 ms  |
| 4   | 6.076 ms | 5.993 ms | 10.001 ms |
| 5   | 6.084 ms | 5.893 ms | 10.250 ms |
| 6   | 6.034 ms | 5.908 ms | 9.456 ms  |
| 7   | 6.051 ms | 5.993 ms | 10.000 ms |
| 8   | 6.057 ms | 5.941 ms | 10.001 ms |
| 9   | 6.016 ms | 5.927 ms | 7.540 ms  |
| 10  | 6.036 ms | 5.993 ms | 9.579 ms  |

Summary:
- Mean latency: 9.75 ms -> 6.04 ms (38% improvement)
- Max latency: 25.72 ms -> 10.25 ms (60% improvement)

(2) Bridge setup/teardown latency (Uladzislau Rezki)

x86_64 with 64 CPUs, 100 iterations of bridge add/configure/delete:

real time
  1 - default:                    24.221s
  2 - this patch:                 20.754s (14% faster)
  3 - this patch + wake_from_gp:  15.895s (34% faster)
  4 - wake_from_gp only:          18.947s (22% faster)

Per-synchronize_rcu() latency (in usec):

           1        2        3      4
  median:  37249.5  31540.5  15765  22480
  min:     7881     7918     9803   7857
  max:     63651    55639    31861  32040

This patch combined with rcu_normal_wake_from_gp reduces bridge setup/teardown time from 24 seconds to 16 seconds.

(3) CPU overhead verification (Uladzislau Rezki)

System CPU time across 5 runs showed no measurable increase:

  default:     1.698s - 1.937s
  this patch:  1.667s - 1.930s

Conclusion: variations are within noise, no CPU overhead regression.

(4) rcutorture

Tested TREE and SRCU configurations - no regressions.

Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Samir M <samir@linux.ibm.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
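The summary figures in test (1) follow directly from the per-run means in the two tables. The sketch below recomputes them; the data arrays are copied from the tables, while `mean()` and `improvement_pct()` are hypothetical helper names introduced only for this illustration.

```c
#include <stddef.h>

/* Per-run mean latencies in milliseconds, copied from the two tables above. */
static const double baseline_mean_ms[] = {
	10.088, 10.064, 10.051, 10.125, 8.695,
	10.157, 10.102, 8.050, 10.059, 10.077,
};
static const double fixed_mean_ms[] = {
	6.027, 6.032, 6.010, 6.076, 6.084,
	6.034, 6.051, 6.057, 6.016, 6.036,
};

/* Arithmetic mean of n samples (0.0 for an empty set). */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return n ? sum / (double)n : 0.0;
}

/* Relative reduction from 'before' to 'after', in percent. */
static double improvement_pct(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

Averaging the ten baseline means gives about 9.75 ms and the ten patched means about 6.04 ms, a reduction of roughly 38%; the worst-case maxima (25.723 ms and 10.250 ms) give roughly 60%.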
2026-01-09 | PM: EM: Add dump to get-perf-domains in the EM YNL spec | Changwoo Min
Add dump to get-perf-domains, so that a user can fetch either information about a specific performance domain with do or information about all performance domains with dump. Share the reply format of do and dump using perf-domain-attrs, so remove perf-domains. The YNL spec, autogenerated files, and the do implementation are updated, and the dump implementation is added. Suggested-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Changwoo Min <changwoo@igalia.com> Link: https://patch.msgid.link/20260108053212.642478-5-changwoo@igalia.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-09 | PM: EM: Change cpus' type from string to u64 array in the EM YNL spec | Changwoo Min
Previously, the cpus attribute was a string holding a "%*pb" stringification of a bitmap. That is not very consumable for a UAPI, so let’s change it to a u64 array of CPU ids. Suggested-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Changwoo Min <changwoo@igalia.com> Link: https://patch.msgid.link/20260108053212.642478-4-changwoo@igalia.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
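The difference between a stringified bitmap and an array of CPU ids can be sketched with a small conversion routine. This is not the netlink code from the commit: `mask_to_cpu_ids()` is a hypothetical helper, and a single 64-bit word stands in for the kernel bitmap.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write the index of every set bit in 'mask' to ids[] (capacity 'cap')
 * and return how many ids were written.
 */
static size_t mask_to_cpu_ids(uint64_t mask, uint64_t *ids, size_t cap)
{
	size_t n = 0;

	for (unsigned int cpu = 0; cpu < 64 && n < cap; cpu++) {
		if (mask & ((uint64_t)1 << cpu))
			ids[n++] = cpu;
	}
	return n;
}
```

A consumer receiving the array form can iterate over CPU ids directly, instead of parsing a hexadecimal bitmap string such as "35" back into bits 0, 2, 4 and 5.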
2026-01-09 | PM: EM: Rename em.yaml to dev-energymodel.yaml | Changwoo Min
The EM YNL specification used many acronyms, including ‘em’, ‘pd’, ‘ps’, etc. While the acronyms are short and convenient, they could be confusing. So, let’s spell them out to be more specific. The following changes were made in the spec. Note that the protocol name cannot exceed GENL_NAMSIZ (16).

  em -> dev-energymodel
  pds -> perf-domains
  pd -> perf-domain
  pd-id -> perf-domain-id
  pd-table -> perf-table
  ps -> perf-state
  get-pds -> get-perf-domains
  get-pd-table -> get-perf-table
  pd-created -> perf-domain-created
  pd-updated -> perf-domain-updated
  pd-deleted -> perf-domain-deleted

In addition, doc strings were added to the spec, based on the comments in energy_model.h. Two flag attributes (perf-state-flags and perf-domain-flags) were added for easily interpreting the bit flags. Finally, the autogenerated files and em_netlink.c were updated accordingly to reflect the name changes.

Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260108053212.642478-3-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-09 | Merge tag 'pm-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds
Pull power management fix from Rafael Wysocki: "This fixes a crash in the hibernation image saving code that can be triggered when the given compression algorithm is unavailable (Malaya Kumar Rout)" * tag 'pm-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: hibernate: Fix crash when freeing invalid crypto compressor
2026-01-09 | sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve() | Cong Wang
sched_mm_cid_after_execve() is called in bprm_execve()'s cleanup path even when exec_binprm() fails. For the init task's first execve(), this causes a problem:

  1. current->mm is NULL (kernel threads don't have an mm)
  2. sched_mm_cid_before_execve() exits early because mm is NULL
  3. exec_binprm() fails (e.g., ENOENT for missing script interpreter)
  4. sched_mm_cid_after_execve() is called with mm still NULL
  5. sched_mm_cid_fork() is called unconditionally, triggering WARN_ON

This is easily reproduced by booting with an init that is a shell script (#!/bin/sh) where the interpreter doesn't exist in the initramfs.

Fix this by checking if t->mm is NULL before calling sched_mm_cid_fork(), matching the behavior of sched_mm_cid_before_execve() which already handles this case via sched_mm_cid_exit()'s early return.

Fixes: b0c3d51b54f8 ("sched/mmcid: Provide precomputed maximal value")
Signed-off-by: Cong Wang <cwang@multikernel.io>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251223215113.639686-1-xiyou.wangcong@gmail.com
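The shape of the fix, an early return when the task has no mm, can be modelled in a few lines. The types and functions below (`struct demo_mm`, `struct demo_task`, `demo_mm_cid_fork()`, `demo_mm_cid_after_execve()`) are hypothetical simplifications, not the scheduler code.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for mm_struct and task_struct. */
struct demo_mm {
	int users;
};

struct demo_task {
	struct demo_mm *mm; /* NULL for a kernel thread */
};

static int fork_calls;

/* Dereferences t->mm, so it must never run for a task without an mm. */
static void demo_mm_cid_fork(struct demo_task *t)
{
	t->mm->users++;
	fork_calls++;
}

/* Models the fix: bail out early when the task has no mm. */
static void demo_mm_cid_after_execve(struct demo_task *t)
{
	if (!t->mm)
		return;
	demo_mm_cid_fork(t);
}
```

A task whose exec failed before it ever acquired an mm simply skips the fork-side setup, mirroring the early return that the "before" hook already performs.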
2026-01-08 | PM: EM: Fix memory leak in em_create_pd() error path | Malaya Kumar Rout
When ida_alloc() fails in em_create_pd(), the function returns without freeing the previously allocated 'pd' structure, leading to a memory leak. The 'pd' pointer is allocated either at line 436 (for CPU devices with cpumask) or line 442 (for other devices) using kzalloc(). Additionally, the function incorrectly returns -ENOMEM when ida_alloc() fails, ignoring the actual error code returned by ida_alloc(), which can fail for reasons other than memory exhaustion. Fix both issues by: 1. Freeing the 'pd' structure with kfree() when ida_alloc() fails 2. Returning the actual error code from ida_alloc() instead of -ENOMEM This ensures proper cleanup on the error path and accurate error reporting. Fixes: cbe5aeedecc7 ("PM: EM: Assign a unique ID when creating a performance domain") Signed-off-by: Malaya Kumar Rout <mrout@redhat.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Link: https://patch.msgid.link/20260105103730.65626-1-mrout@redhat.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
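Both parts of the fix, freeing the allocation and propagating the real error code, can be shown with a user-space sketch. `struct demo_pd`, `demo_id_alloc()` and `demo_create_pd()` are hypothetical; `calloc()`/`free()` stand in for kzalloc()/kfree(), and the fake ID allocator returns either a fixed ID or the negative errno it is told to fail with.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the performance domain structure. */
struct demo_pd {
	int id;
};

static int live_allocs; /* counts allocations that have not been freed */

/* Hypothetical ID allocator: returns a non-negative ID or a negative errno. */
static int demo_id_alloc(int fail_with)
{
	return fail_with ? -fail_with : 7;
}

static int demo_create_pd(struct demo_pd **out, int id_fail_with)
{
	struct demo_pd *pd = calloc(1, sizeof(*pd));
	int id;

	if (!pd)
		return -ENOMEM;
	live_allocs++;

	id = demo_id_alloc(id_fail_with);
	if (id < 0) {
		/* Release the structure and report the allocator's own error. */
		free(pd);
		live_allocs--;
		return id;
	}

	pd->id = id;
	*out = pd;
	return 0;
}
```

Returning `id` rather than a hard-coded -ENOMEM lets the caller distinguish, for example, an exhausted ID space from memory pressure.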
2026-01-08 | sched: Further restrict the preemption modes | Peter Zijlstra
The introduction of PREEMPT_LAZY was for multiple reasons:

 - PREEMPT_RT suffered from over-scheduling, hurting performance compared to !PREEMPT_RT.

 - the introduction of (more) features that rely on preemption; like folio_zero_user() which can do large memset() without preemption checks. (Xen already had a horrible hack to deal with long running hypercalls)

 - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo cult or in response to hard-to-replicate workloads.

By moving to a model that is fundamentally preemptable these things become manageable and avoid needing to introduce more horrible hacks.

Since this is a requirement; limit PREEMPT_NONE to architectures that do not support preemption at all. Further limit PREEMPT_VOLUNTARY to those architectures that do not yet have PREEMPT_LAZY support (with the eventual goal to make this the empty set and completely remove voluntary preemption and cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)

This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390, x86) with only two preemption models: full and lazy.

While Lazy has been the recommended setting for a while, not all distributions have managed to make the switch yet. Force things along. Keep the patch minimal in case of hard to address regressions that might pop up.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20251219101502.GB1132199@noisy.programming.kicks-ass.net
2026-01-08 | sched: Reorder some fields in struct rq | Blake Jones
This colocates some hot fields in "struct rq" to be on the same cache line as others that are often accessed at the same time or in similar ways. Using data from a Google-internal fleet-scale profiler, I found three distinct groups of hot fields in struct rq: - (1) The runqueue lock: __lock. - (2) Those accessed from hot code in pick_next_task_fair(): nr_running, nr_numa_running, nr_preferred_running, ttwu_pending, cpu_capacity, curr, idle. - (3) Those accessed from some other hot codepaths, e.g. update_curr(), update_rq_clock(), and scheduler_tick(): clock_task, clock_pelt, clock, lost_idle_time, clock_update_flags, clock_pelt_idle, clock_idle. The cycles spent on accessing these different groups of fields broke down roughly as follows: - 50% on group (1) (the runqueue lock, always read-write) - 39% on group (2) (load:store ratio around 38:1) - 8% on group (3) (load:store ratio around 5:1) - 3% on all the other fields Most of the fields in group (3) are already in a cache line grouping; this patch just adds "clock" and "clock_update_flags" to that group. The fields in group (2) are scattered across several cache lines; the main effect of this patch is to group them together, on a single line at the beginning of the structure. A few other less performance-critical fields (nr_switches, numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick) were also reordered to reduce holes in the data structure. Since the runqueue lock is acquired from so many different contexts, and is basically always accessed using an atomic operation, putting it on either of the cache lines for groups (2) or (3) would slow down accesses to those fields dramatically, since those groups are read-mostly accesses. To test this, I wrote a focused load test that would put load on the pick_next_task_fair() path. A parent process would fork many child processes, and each child would nanosleep() for 1 msec many times in a loop. 
The load test was monitored with "perf", and I looked at the amount of cycles that were spent with sched_balance_rq() on the stack. The test was reliably spending ~5% of all of its cycles there. I ran it 100 times on a pair of 2-socket Intel Haswell machines (72 vCPUs per machine) - one running the tip of sched/core, the other running this change - using 360 child processes and 8192 1-msec sleeps per child. The mean cycle count dropped from 5.14B to 4.91B, or a *4.6% decrease* in relevant scheduler cycles. Given that this change reduces cache misses in a very hot kernel codepath, there's likely to be additional application performance improvement due to reduced cache conflicts from kernel data structures. On a Power11 system with 128-byte cache lines, my test showed a ~5% decrease in relevant scheduler cycles, along with a slight increase in user time - both positive indicators. This data comes from https://lore.kernel.org/lkml/affdc6b1-9980-44d1-89db-d90730c1e384@linux.ibm.com/ This is the case even though the additional "____cacheline_aligned" that puts the runqueue lock on the next cache line adds an additional 64 bytes of padding on those machines. This patch does not change the size of "struct rq" on machines with 64-byte cache lines. I also ran "hackbench" to try to test this change, but it didn't show conclusive results. Looking at a CPU cycle profile of the hackbench run, it was spending 95% of its cycles inside __alloc_skb(), __kfree_skb(), or kmem_cache_free() - almost all of which was spent updating memcg counters or contending on the list_lock in kmem_cache_node. And it spent less than 0.5% of its cycles inside either schedule() or try_to_wake_up(). So it's not surprising that it didn't show useful results here. The "__no_randomize_layout" was added to reflect the fact that performance of code that references this data structure is unusually sensitive to placement of its members. 
Signed-off-by: Blake Jones <blakejones@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Reviewed-by: Josh Don <joshdon@google.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://patch.msgid.link/20251202023743.1524247-1-blakejones@google.com
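The grouping described above can be illustrated with a small userspace sketch. It is not the layout from the patch: `struct ex_rq`, `EX_CACHELINE`, and `ex_line_of()` are hypothetical names, the member types are simplified stand-ins, and a 64-byte cache line is assumed. It only shows the idea of keeping the read-mostly group on the first line, the lock on a line of its own, and the clock fields together on another.

```c
#include <stdalign.h>
#include <stddef.h>

#define EX_CACHELINE 64 /* assumed cache line size */

/* Hypothetical, heavily simplified stand-in for "struct rq". */
struct ex_rq {
	/* Group (2): read-mostly fields, packed at the start of the struct. */
	unsigned int nr_running;
	unsigned int nr_numa_running;
	unsigned int nr_preferred_running;
	unsigned int ttwu_pending;
	unsigned long cpu_capacity;
	void *curr;
	void *idle;

	/* Group (1): the lock starts a new line so atomics do not disturb group (2). */
	alignas(EX_CACHELINE) unsigned int lock;

	/* Group (3): clock-related fields share a third line. */
	alignas(EX_CACHELINE) unsigned long long clock_task;
	unsigned long long clock_pelt;
	unsigned long long clock;
	unsigned long lost_idle_time;
	unsigned int clock_update_flags;
};

/* Index of the cache line (relative to the struct start) holding an offset. */
static size_t ex_line_of(size_t offset)
{
	return offset / EX_CACHELINE;
}
```

Calling `ex_line_of(offsetof(struct ex_rq, member))` for two members shows whether they share a line. The kernel inspects the real layout with tools such as pahole rather than with helpers like this.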
2026-01-08sched/fair: Use cpumask_weight_and() in sched_balance_find_dst_group()Yury Norov (NVIDIA)
In the group_has_spare case, the function creates a temporary cpumask just to calculate the weight of (p->cpus_ptr & sched_group_span(local)). There is a dedicated helper for that, cpumask_weight_and(). Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Link: https://patch.msgid.link/20251207034247.402926-1-yury.norov@gmail.com
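The difference between the two approaches can be sketched in plain C on fixed-size bitmaps. This is a userspace illustration only: `EX_WORDS` and the `ex_*` helpers are hypothetical and do not mirror the kernel's cpumask implementation.

```c
#include <stddef.h>

#define EX_WORDS 2 /* hypothetical mask size, in unsigned long words */

/* Count set bits in one word (Kernighan's method). */
static unsigned int ex_popcount(unsigned long w)
{
	unsigned int n = 0;

	while (w) {
		w &= w - 1; /* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Old shape: materialise the intersection in a temporary, then count it. */
static unsigned int ex_weight_via_temp(const unsigned long *a, const unsigned long *b)
{
	unsigned long tmp[EX_WORDS];
	unsigned int n = 0;

	for (size_t i = 0; i < EX_WORDS; i++)
		tmp[i] = a[i] & b[i];
	for (size_t i = 0; i < EX_WORDS; i++)
		n += ex_popcount(tmp[i]);
	return n;
}

/* New shape: count the intersection word by word, with no temporary mask. */
static unsigned int ex_weight_and(const unsigned long *a, const unsigned long *b)
{
	unsigned int n = 0;

	for (size_t i = 0; i < EX_WORDS; i++)
		n += ex_popcount(a[i] & b[i]);
	return n;
}
```

Both functions return the same count; the second avoids the scratch mask and the extra pass over it.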
2026-01-08sched/fair: Simplify task_numa_find_cpu()Yury Norov (NVIDIA)
Use for_each_cpu_and() and drop some housekeeping code. Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20251207033037.399608-1-yury.norov@gmail.com
2026-01-08sched/fair: Drop useless cpumask_empty() in find_energy_efficient_cpu()Yury Norov (NVIDIA)
The cpumask_empty() call is O(N) and redundant, because the preceding cpumask_and() already returns false when the resulting 'cpus' mask is empty. Drop it. Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20251207040543.407695-1-yury.norov@gmail.com
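A minimal userspace sketch of why the extra emptiness check adds nothing, assuming an AND helper that reports whether any bit survived. `EX_MASK_WORDS`, `ex_mask_and()`, and `ex_mask_empty()` are hypothetical stand-ins, not the kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_MASK_WORDS 2 /* hypothetical mask size, in unsigned long words */

/* dst = a & b; returns true if the result has at least one bit set. */
static bool ex_mask_and(unsigned long *dst, const unsigned long *a,
			const unsigned long *b)
{
	unsigned long any = 0;

	for (size_t i = 0; i < EX_MASK_WORDS; i++) {
		dst[i] = a[i] & b[i];
		any |= dst[i];
	}
	return any != 0;
}

/* Separate O(N) scan; redundant when the caller already has ex_mask_and()'s result. */
static bool ex_mask_empty(const unsigned long *m)
{
	for (size_t i = 0; i < EX_MASK_WORDS; i++)
		if (m[i])
			return false;
	return true;
}
```

After `ok = ex_mask_and(dst, a, b)`, the value of `!ok` always equals `ex_mask_empty(dst)`, so a caller that tests `ok` gains nothing from scanning the mask a second time.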
2026-01-07bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str()Deepanshu Kartikey
BPF_MAP_TYPE_INSN_ARRAY maps store instruction pointers in their ips array, not string data. The map_direct_value_addr callback for this map type returns the address of the ips array, which is not suitable for use as a constant string argument. When a BPF program passes a pointer to an insn_array map value as ARG_PTR_TO_CONST_STR (e.g., to bpf_snprintf), the verifier's null-termination check in check_reg_const_str() operates on the wrong memory region, and at runtime bpf_bprintf_prepare() can read out of bounds searching for a null terminator. Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() since this map type is not designed to hold string data. Reported-by: syzbot+2c29addf92581b410079@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2c29addf92581b410079 Tested-by: syzbot+2c29addf92581b410079@syzkaller.appspotmail.com Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps") Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Acked-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260107021037.289644-1-kartikey406@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-07cgroup: Eliminate cgrp_ancestor_storage in cgroup_rootMichal Koutný
The cgrp_ancestor_storage has two drawbacks: - it's not guaranteed that the member immediately follows struct cgrp in cgroup_root (the root cgroup's ancestors[0] might thus point into padding rather than into cgrp_ancestor_storage proper), - this idiom raises warnings with -Wflex-array-member-not-at-end. Instead of relying on the auxiliary member in cgroup_root, define the 0-th level ancestor inside struct cgroup (needed for static allocation of cgrp_dfl_root); deeper cgroups allocate the flexible _low_ancestors[]. A union alias through ancestors[] transparently joins the two ranges. The above change would still leave the flexible array at the end of struct cgroup inside cgroup_root, so also move cgrp towards the end of cgroup_root to resolve the -Wflex-array-member-not-at-end warning. Link: https://lore.kernel.org/r/5fb74444-2fbb-476e-b1bf-3f3e279d0ced@embeddedor.com/ Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Closes: https://lore.kernel.org/r/b3eb050d-9451-4b60-b06c-ace7dab57497@embeddedor.com/ Cc: David Laight <david.laight.linux@gmail.com> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-08dma-mapping: Remove dma_mark_clean (again)Robin Murphy
With IA-64 now gone, there are no users of the dma_mark_clean hook, so we can retire it for good. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/c004927f01962726ff1dcf94d1b4efff84db805a.1767727673.git.robin.murphy@arm.com
2026-01-07trace: ftrace_dump_on_oops[] is not exported, make it staticBen Dooks
The ftrace_dump_on_oops string is not used outside of trace.c so make it static to avoid the export warning from sparse: kernel/trace/trace.c:141:6: warning: symbol 'ftrace_dump_on_oops' was not declared. Should it be static? Fixes: dd293df6395a2 ("tracing: Move trace sysctls into trace.c") Link: https://patch.msgid.link/20260106231054.84270-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07tracing: Add recursion protection in kernel stack trace recordingSteven Rostedt
A bug was reported about an infinite recursion caused by tracing the rcu events with the kernel stack trace trigger enabled. The stack trace code called back into RCU which then called the stack trace again. Expand the ftrace recursion protection to add a set of bits to protect events from recursion. Each bit represents the context that the event is in (normal, softirq, interrupt and NMI). Have the stack trace code use the interrupt context to protect against recursion. Note, the bug showed an issue in both the RCU code as well as the tracing stacktrace code. This only handles the tracing stack trace side of the bug. The RCU fix will be handled separately. Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home Reported-by: Yao Kai <yaokai34@huawei.com> Tested-by: Yao Kai <yaokai34@huawei.com> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
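The per-context guard described above can be sketched in userspace C. This is an illustration under stated assumptions, not the ftrace code: the `ex_*` names are hypothetical, the bitmask is a single global rather than per-task or per-CPU state, and the recursive call merely stands in for the stack walk calling back into code that fires another traced event.

```c
#include <stdbool.h>

/* One guard bit per execution context, as listed in the commit message. */
enum ex_ctx { EX_CTX_NORMAL, EX_CTX_SOFTIRQ, EX_CTX_IRQ, EX_CTX_NMI };

static unsigned int ex_recursion_bits; /* hypothetical global guard state */
static int ex_records;                 /* number of stack traces actually recorded */

/* Try to enter the guarded region; fails if this context is already inside. */
static bool ex_enter(enum ex_ctx ctx)
{
	unsigned int bit = 1u << ctx;

	if (ex_recursion_bits & bit)
		return false;
	ex_recursion_bits |= bit;
	return true;
}

static void ex_exit(enum ex_ctx ctx)
{
	ex_recursion_bits &= ~(1u << ctx);
}

/* Records one stack trace; returns the total number recorded so far. */
static int ex_record_stack(enum ex_ctx ctx)
{
	if (!ex_enter(ctx))
		return ex_records; /* nested call in the same context: drop it */

	ex_records++;
	/* Stand-in for: stack walk -> RCU -> trace event -> stack trace again. */
	ex_record_stack(ctx);

	ex_exit(ctx);
	return ex_records;
}
```

Without the `ex_enter()` check, `ex_record_stack()` would recurse without bound. With it, the nested call returns immediately, and a different context (for example an NMI arriving during interrupt handling) can still record its own trace because it uses a different bit.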
2026-01-07ring-buffer: Avoid softlockup in ring_buffer_resize() during memory freeWupeng Ma
When a user resizes all trace ring buffers through the 'buffer_size_kb' file, ring_buffer_resize() allocates buffer pages for each CPU in a loop. If the kernel preemption model is PREEMPT_NONE, there are many CPUs, and there are many buffer pages to be freed, the kernel may not give up the CPU for a long time, eventually causing a softlockup. To avoid this, call cond_resched() after each per-CPU buffer is freed, as commit f6bd2c92488c ("ring-buffer: Avoid softlockup in ring_buffer_resize()") does. The detailed call trace follows: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 24-....: (14837 ticks this GP) idle=521c/1/0x4000000000000000 softirq=230597/230597 fqs=5329 rcu: (t=15004 jiffies g=26003221 q=211022 ncpus=96) CPU: 24 UID: 0 PID: 11253 Comm: bash Kdump: loaded Tainted: G EL 6.18.2+ #278 NONE pc : arch_local_irq_restore+0x8/0x20 arch_local_irq_restore+0x8/0x20 (P) free_frozen_page_commit+0x28c/0x3b0 __free_frozen_pages+0x1c0/0x678 ___free_pages+0xc0/0xe0 free_pages+0x3c/0x50 ring_buffer_resize.part.0+0x6a8/0x880 ring_buffer_resize+0x3c/0x58 __tracing_resize_ring_buffer.part.0+0x34/0xd8 tracing_resize_ring_buffer+0x8c/0xd0 tracing_entries_write+0x74/0xd8 vfs_write+0xcc/0x288 ksys_write+0x74/0x118 __arm64_sys_write+0x24/0x38 Cc: <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251228065008.2396573-1-mawupeng1@huawei.com Signed-off-by: Wupeng Ma <mawupeng1@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07tracing: Drop unneeded assignment to soft_modeJulia Lawall
soft_mode is not read in the enable case, so drop the assignment. Also drop the comment text that refers to the assignment and realign the comment. Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Gabriele Paoloni <gpaoloni@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251226110531.4129794-1-Julia.Lawall@inria.fr Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()Zqiang
When init_srcu_struct*() is used to initialize an srcu structure, the structure's ->srcu_sup and ->sda are allocated with GFP_KERNEL. Similarly, if SRCU_SIZING_INIT is set, the srcu_sup's ->node array can also be allocated with GFP_KERNEL; there is no need to use GFP_ATOMIC all the time. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-07rcu: Fix rcu_read_unlock() deadloop due to softirqYao Kai
Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") removes the recursion-protection code from __rcu_read_unlock(). Therefore, we could invoke the deadloop in raise_softirq_irqoff() with ftrace enabled as follows: WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180 Modules linked in: my_irq_work(O) CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180 RSP: 0018:ffffc900000034a8 EFLAGS: 00010002 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000 RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329 RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000 R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054 FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <IRQ> trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 unwind_next_frame+0x203/0x9b0 __unwind_start+0x15d/0x1c0 arch_stack_walk+0x62/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 unwind_next_frame+0x203/0x9b0 __unwind_start+0x15d/0x1c0 arch_stack_walk+0x62/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 
trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 unwind_next_frame+0x203/0x9b0 __unwind_start+0x15d/0x1c0 arch_stack_walk+0x62/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 __is_insn_slot_addr+0x54/0x70 kernel_text_address+0x48/0xc0 __kernel_text_address+0xd/0x40 unwind_get_return_address+0x1e/0x40 arch_stack_walk+0x9c/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 __raise_softirq_irqoff+0x61/0x80 __flush_smp_call_function_queue+0x115/0x420 __sysvec_call_function_single+0x17/0xb0 sysvec_call_function_single+0x8c/0xc0 </IRQ> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work") fixed the infinite loop in rcu_read_unlock_special() for IRQ work by setting a flag before calling irq_work_queue_on(). Fix this issue by setting the same flag before calling raise_softirq_irqoff(), and rename the flag to defer_qs_pending so that the name covers both cases. Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Reported-by: Tengda Wu <wutengda2@huawei.com> Signed-off-by: Yao Kai <yaokai34@huawei.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-07rcutorture: Correctly compute probability to invoke ->exp_current()Paul E. McKenney
Lack of parentheses causes the ->exp_current() function, for example, srcu_expedite_current(), to be called only once in four billion times instead of the intended once in 256 times. This commit therefore adds the needed parentheses. Reported-by: Chris Mason <clm@meta.com> Reported-by: Joel Fernandes <joelagnelf@nvidia.com> Fixes: 950063c6e897 ("rcutorture: Test srcu_expedite_current()") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
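The precedence problem can be reproduced in a few lines of standalone C. The message does not quote the rcutorture expression, so the sketch assumes a 32-bit random value tested against a 0xff mask; the `ex_*` functions are hypothetical and are not taken from rcutorture.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Unary '!' binds tighter than '&', so "!r & 0xff" parses as "(!r) & 0xff".
 * That is nonzero only when r == 0, roughly once in 2^32 random values.
 * The parentheses are written out here to show the parse explicitly.
 */
static bool ex_buggy(uint32_t r)
{
	return ((!r) & 0xff) != 0;
}

/* Intended test: true when the low 8 bits are all zero, once in 256 values. */
static bool ex_fixed(uint32_t r)
{
	return !(r & 0xff);
}

/* Count how many values in [0, limit) satisfy each predicate. */
static unsigned int ex_count(bool (*pred)(uint32_t), uint32_t limit)
{
	unsigned int hits = 0;

	for (uint32_t r = 0; r < limit; r++)
		if (pred(r))
			hits++;
	return hits;
}
```

Over the range 0 to 65535, `ex_fixed` matches every multiple of 256 while `ex_buggy` matches only the value 0, which is the once-in-256 versus once-in-four-billion gap the message describes.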
2026-01-07rcu: Make expedited RCU CPU stall warnings detect stall-end racesPaul E. McKenney
If an expedited RCU CPU stall ends just at the stall-warning timeout, the current code will print an expedited stall-warning message, but one that doesn't identify any CPUs or tasks causing the stall. This is most likely to happen for short-timeout stalls, for example, the 20-millisecond timeouts that are sometimes used for small embedded devices. Needless to say, these semi-empty stall-warning messages can be rather confusing. One option would be to suppress the stall-warning message entirely in this case, but the near-miss information can be quite valuable. Detect this race condition and emit an "INFO: Expedited stall ended before state dump start" message to clarify matters. [boqun: Apply feedback from Borislav] Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-06bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_cgroup_storage mapsLeon Hwang
Introduce BPF_F_ALL_CPUS flag support for percpu_cgroup_storage maps to allow updating values for all CPUs with a single value for update_elem API. Introduce BPF_F_CPU flag support for percpu_cgroup_storage maps to allow: * update value for specified CPU for update_elem API. * lookup value for specified CPU for lookup_elem API. The BPF_F_CPU flag is passed via map_flags along with embedded cpu info. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-6-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06bpf: Copy map value using copy_map_value_long for percpu_cgroup_storage mapsLeon Hwang
Copy the map value using 'copy_map_value_long()'. This keeps the style consistent with the other percpu maps. No functional change intended. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-5-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_hash mapsLeon Hwang
Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_hash maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs. Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash maps to allow: * update value for specified CPU for both update_elem and update_batch APIs. * lookup value for specified CPU for both lookup_elem and lookup_batch APIs. The BPF_F_CPU flag is passed via: * map_flags along with embedded cpu info. * elem_flags along with embedded cpu info. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-4-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array mapsLeon Hwang
Introduce support for the BPF_F_ALL_CPUS flag in percpu_array maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs. Introduce support for the BPF_F_CPU flag in percpu_array maps to allow: * update value for specified CPU for both update_elem and update_batch APIs. * lookup value for specified CPU for both lookup_elem and lookup_batch APIs. The BPF_F_CPU flag is passed via: * map_flags of lookup_elem and update_elem APIs along with embedded cpu info. * elem_flags of lookup_batch and update_batch APIs along with embedded cpu info. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-3-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flagsLeon Hwang
Introduce the BPF_F_CPU and BPF_F_ALL_CPUS flags and check them for the following APIs: * 'map_lookup_elem()' * 'map_update_elem()' * 'generic_map_lookup_batch()' * 'generic_map_update_batch()' Also compute the correct value size for these APIs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
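One way to picture a flag that travels "along with embedded cpu info" is a 64-bit flags word with the CPU number in its upper half. The sketch below is an assumption made for illustration, not the UAPI definition: the `EX_F_*` bit values, the upper-32-bit placement, and the validation rules in `ex_check_flags()` are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical flag bits; the real BPF_F_CPU / BPF_F_ALL_CPUS values may differ. */
#define EX_F_CPU      (1ULL << 3)
#define EX_F_ALL_CPUS (1ULL << 4)

/* Build a flags word that targets one CPU (assumed layout: cpu in bits 63..32). */
static uint64_t ex_flags_for_cpu(uint32_t cpu)
{
	return ((uint64_t)cpu << 32) | EX_F_CPU;
}

/* Extract the embedded CPU number from a flags word. */
static uint32_t ex_flags_cpu(uint64_t flags)
{
	return (uint32_t)(flags >> 32);
}

/* Assumed validation rules: returns 0 if the flags are acceptable, -1 otherwise. */
static int ex_check_flags(uint64_t flags, uint32_t nr_cpus)
{
	/* The two modes are treated as mutually exclusive. */
	if ((flags & EX_F_CPU) && (flags & EX_F_ALL_CPUS))
		return -1;
	/* A CPU number without EX_F_CPU would be silently ignored, so reject it. */
	if (!(flags & EX_F_CPU) && ex_flags_cpu(flags) != 0)
		return -1;
	/* The target CPU must exist. */
	if ((flags & EX_F_CPU) && ex_flags_cpu(flags) >= nr_cpus)
		return -1;
	return 0;
}
```

Under these assumptions, `ex_flags_for_cpu(5)` describes "operate on CPU 5 only", while a bare `EX_F_ALL_CPUS` describes "apply one value to every CPU". The UAPI header and the patches themselves remain the authority on the real encoding.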
2026-01-07powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memoryGaurav Batra
Leverage the ARCH_HAS_DMA_MAP_DIRECT config option for coherent allocations as well. This bypasses the DMA ops for allocations from memory that has been pre-mapped. Always set the device bus_dma_limit when memory is pre-mapped. On some architectures, such as PowerPC, persistent memory can be converted to regular memory with the daxctl command. With the limit set, dma_coherent_ok() restricts coherent allocations to pre-mapped RAM only. Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251107161105.85999-1-gbatra@linux.ibm.com