|
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-6-frederic@kernel.org
|
|
hierarchy
The CPU doing the prepare work for a remote target must be online from
the tree's point of view and its hierarchy must be active; otherwise,
propagating its active state up to the new root branch would be either
incorrect or racy.
Assert those conditions with more sanity checks.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-5-frederic@kernel.org
|
|
When a CPU from a new node boots, the old root may happen to be
connected to the new root even if their nodes mismatch, as depicted in
the following scenario:
1) CPU 0 boots and creates the first group for node 0.
             [GRP0:0]
              node 0
                 |
               CPU 0
2) CPU 1 from node 1 boots and creates a new top that corresponds to
node 1, but it also connects the old root from node 0 to the new root
from node 1 by mistake.
                    [GRP1:0]
                     node 1
                    /      \
                   /        \
             [GRP0:0]      [GRP0:1]
              node 0        node 1
                 |             |
               CPU 0         CPU 1
3) This eventually leads to an imbalanced tree where some node 0 CPUs
migrate node 1 timers (and vice versa) way before reaching the
crossnode groups, resulting in more frequent remote memory accesses
than expected.
                           [GRP2:0]
                         NUMA_NO_NODE
                         /          \
                  [GRP1:0]          [GRP1:1]
                   node 1            node 0
                  /      \              |
                 /        \           [...]
           [GRP0:0]      [GRP0:1]
            node 0        node 1
               |             |
            CPU 0...      CPU 1...
A balanced tree should only contain groups having children that belong
to the same node:
                           [GRP2:0]
                         NUMA_NO_NODE
                         /          \
                  [GRP1:0]          [GRP1:1]
                   node 0            node 1
                  /      \          /      \
                 /        \        /        \
           [GRP0:0]     [...]    [...]    [GRP0:1]
            node 0                          node 1
               |                               |
            CPU 0...                        CPU 1...
In order to fix this, the hierarchy must be unfolded up to the crossnode
level as soon as a node mismatch is detected. For example, stage 2
above should lead to this layout:
                           [GRP2:0]
                         NUMA_NO_NODE
                         /          \
                  [GRP1:0]          [GRP1:1]
                   node 0            node 1
                     /                  \
                    /                    \
             [GRP0:0]                  [GRP0:1]
              node 0                    node 1
                 |                         |
               CPU 0                     CPU 1
This means that not only GRP1:0 must be created, but also GRP1:1 and
GRP2:0, in order to prepare a balanced tree for the next CPUs to boot.
Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-4-frederic@kernel.org
|
|
Initializing the tmc's group, the group's number of children and the
group's parent can all be done without locking because:
1) Reading the group's parent and its group mask is done locklessly.
2) The connections prepared for a given CPU hierarchy are visible to the
target CPU once online, thanks to the CPU hotplug enforced memory
ordering.
3) In case of a newly created upper level, the new root and its
connections and initialization are made visible by the CPU which made
the connections. When that CPU goes idle in the future, the new link
is published by tmigr_inactive_up() through the atomic RmW on
->migr_state.
4) If CPUs were still walking up the active hierarchy, they could observe
the new root earlier. In this case the ordering is enforced by an
early initialization of the group mask and by barriers that maintain
address dependency as explained in:
b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit")
de3ced72a792 ("timers/migration: Enforce group initialization visibility to tree walkers")
5) Timers are propagated by a chain of group locking from the bottom to
the top. While doing so, the tree also propagates group links and
initialization. Therefore remote expiration, which also relies on group
locking, will observe those links and initialization while holding the
root lock, before walking the tree remotely and updating remote timers.
This is especially important for migrators in the active hierarchy that
may observe the new root early.
Therefore the locking is unnecessary at initialization. If anything, it
just brings confusion. Remove it.
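To illustrate points 3) and 4), here is a generic sketch of the init-then-publish pattern (purely illustrative; the struct, field and bit names are placeholders, not the actual tmigr code):
  struct sketch_group {
          struct sketch_group *parent;
          atomic_t             migr_state;
          unsigned int         num_children;
  };

  static void sketch_connect_new_root(struct sketch_group *child,
                                      struct sketch_group *new_root)
  {
          /* Plain stores: nobody can reach the new root yet. */
          new_root->num_children = 1;
          child->parent = new_root;

          /*
           * The link is published later by an atomic RmW on ->migr_state
           * (e.g. when the CPU goes idle), which provides the release
           * ordering that concurrent tree walkers rely on.
           */
          atomic_fetch_or_release(0x1 /* placeholder "active" bit */,
                                  &child->migr_state);
  }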
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-3-frederic@kernel.org
|
|
Both the "do while" and "while" loops in tmigr_setup_groups() eventually
mimic the behaviour of "for" loops.
Simplify accordingly.
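For illustration (hypothetical helpers, not the actual tmigr_setup_groups() code), the conversion amounts to:
  /* before: a do/while that effectively iterates over all levels */
  int lvl = 0;
  do {
          setup_level(lvl);
          lvl++;
  } while (lvl < nr_levels);

  /* after: the equivalent, more readable for loop */
  for (int lvl = 0; lvl < nr_levels; lvl++)
          setup_level(lvl);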
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-2-frederic@kernel.org
|
|
On large NUMA systems, while running a test program that saturates the
inter-processor and inter-NUMA links, acquiring the jiffies_lock can be
very expensive.
If the CPU designated to do jiffies updates (tick_do_timer_cpu) gets
delayed and other CPUs decide to do the jiffies update themselves, a large
number of them tend to do so at the same time.
The inexpensive check against tick_next_period is far quicker than actually
acquiring the lock, so most of these get in line to obtain the lock. If
obtaining the lock is slow enough, this spirals into the vast majority of
CPUs continuously being stuck waiting for this lock, just to obtain it and
find out that time has already been updated by another CPU. For example, on
one random entry to kdb via a manually injected NMI, 2912 of 3840 CPUs were
observed to be stuck there.
To avoid this, allow only one non-timekeeper CPU to call
tick_do_update_jiffies64() at any given time, resetting ts->stalled_jiffies
only if the jiffies update function is actually called.
With this change, when manually interrupting the test, at most two CPUs are
observed to invoke tick_do_update_jiffies64() - the timekeeper and one
other.
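A minimal sketch of the single-updater gate described above (the flag and helper names are illustrative, not the actual tick-sched code):
  static atomic_t jiffies_update_in_progress;

  static void sketch_update_jiffies(ktime_t now)
  {
          /* Let only one non-timekeeper CPU fall through to the update. */
          if (atomic_cmpxchg(&jiffies_update_in_progress, 0, 1) != 0)
                  return;

          tick_do_update_jiffies64(now);

          atomic_set(&jiffies_update_in_progress, 0);
  }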
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251027183456.343407-1-steve.wahl@hpe.com
|
|
Pull bpf fixes from Alexei Starovoitov:
- Mark migrate_disable/enable() as always_inline to avoid issues with
partial inlining (Yonghong Song)
- Fix powerpc stack register definition in libbpf bpf_tracing.h (Andrii
Nakryiko)
- Reject negative head_room in __bpf_skb_change_head (Daniel Borkmann)
- Conditionally include dynptr copy kfuncs (Malin Jonsson)
- Sync pending IRQ work before freeing BPF ring buffer (Noorain Eqbal)
- Do not audit capability check in x86 do_jit() (Ondrej Mosnacek)
- Fix arm64 JIT of BPF_ST insn when it writes into arena memory
(Puranjay Mohan)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf/arm64: Fix BPF_ST into arena memory
bpf: Make migrate_disable always inline to avoid partial inlining
bpf: Reject negative head_room in __bpf_skb_change_head
bpf: Conditionally include dynptr copy kfuncs
libbpf: Fix powerpc's stack register definition in bpf_tracing.h
bpf: Do not audit capability check in do_jit()
bpf: Sync pending IRQ work before freeing ring buffer
|
|
Reading /proc/irq/N/smp_affinity* races with irq_set_affinity() and
irq_move_masked_irq(), leading to old or torn output for users.
After a user writes a new CPU mask to /proc/irq/N/affinity*, the syscall
returns success, yet a subsequent read of the same file immediately returns
a value different from what was just written.
That's due to a race between show_irq_affinity() and irq_move_masked_irq()
which lets the read observe a transient, inconsistent affinity mask.
Cure it by guarding the read with irq_desc::lock.
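A simplified sketch of the cure (not the exact show_irq_affinity() code; the effective-affinity and pending-move cases are omitted):
  static int show_irq_affinity_sketch(struct seq_file *m, struct irq_desc *desc)
  {
          unsigned long flags;
          cpumask_var_t mask;

          if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
                  return -ENOMEM;

          /* Snapshot the mask under irq_desc::lock so it cannot be torn. */
          raw_spin_lock_irqsave(&desc->lock, flags);
          cpumask_copy(mask, desc->irq_common_data.affinity);
          raw_spin_unlock_irqrestore(&desc->lock, flags);

          seq_printf(m, "%*pb\n", cpumask_pr_args(mask));
          free_cpumask_var(mask);
          return 0;
  }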
[ tglx: Massaged change log ]
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251028090408.76331-1-songmuchun@bytedance.com
|
|
The 'ret' local variable in fprobe_remove_node_in_module() was used
for checking the error state in the loop, but commit dfe0d675df82
("tracing: fprobe: use rhltable for fprobe_ip_table") removed the loop.
So we don't need it anymore.
Link: https://lore.kernel.org/all/175867358989.600222.6175459620045800878.stgit@devnote2/
Fixes: e5a4cc28a052 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Menglong Dong <menglong8.dong@gmail.com>
|
|
strcpy() is deprecated; use memcpy() instead.
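Generic illustration of the replacement pattern (not the exact call site this patch touches); with the length already known, the NUL terminator is copied explicitly:
  size_t len = strlen(src);

  /* before */
  strcpy(dst, src);

  /* after */
  memcpy(dst, src, len + 1);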
Link: https://lore.kernel.org/all/20250820214717.778243-3-thorsten.blum@linux.dev/
Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
rcu_read_lock() is not needed in fprobe_entry(), but rcu_dereference_check()
is used in rhltable_lookup(), which causes a suspicious RCU usage warning:
WARNING: suspicious RCU usage
6.17.0-rc1-00001-gdfe0d675df82 #1 Tainted: G S
-----------------------------
include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
......
stack backtrace:
CPU: 1 UID: 0 PID: 4652 Comm: ftracetest Tainted: G S
Tainted: [S]=CPU_OUT_OF_SPEC, [I]=FIRMWARE_WORKAROUND
Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
Call Trace:
<TASK>
dump_stack_lvl+0x7c/0x90
lockdep_rcu_suspicious+0x14f/0x1c0
__rhashtable_lookup+0x1e0/0x260
? __pfx_kernel_clone+0x10/0x10
fprobe_entry+0x9a/0x450
? __lock_acquire+0x6b0/0xca0
? find_held_lock+0x2b/0x80
? __pfx_fprobe_entry+0x10/0x10
? __pfx_kernel_clone+0x10/0x10
? lock_acquire+0x14c/0x2d0
? __might_fault+0x74/0xc0
function_graph_enter_regs+0x2a0/0x550
? __do_sys_clone+0xb5/0x100
? __pfx_function_graph_enter_regs+0x10/0x10
? _copy_to_user+0x58/0x70
? __pfx_kernel_clone+0x10/0x10
? __x64_sys_rt_sigprocmask+0x114/0x180
? __pfx___x64_sys_rt_sigprocmask+0x10/0x10
? __pfx_kernel_clone+0x10/0x10
ftrace_graph_func+0x87/0xb0
As discussed in [1], fix this by using guard(rcu)() in fprobe_entry()
to protect rhltable_lookup() and rhl_for_each_entry_rcu() with
rcu_read_lock() and suppress this warning.
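A simplified sketch of the fix (not the full fprobe_entry() body; the params variable and the rhlist member name are assumed for illustration):
  static void fprobe_entry_sketch(unsigned long ip)
  {
          struct fprobe_hlist_node *node;
          struct rhlist_head *head, *pos;

          guard(rcu)();   /* rcu_read_lock() held until end of scope */

          head = rhltable_lookup(&fprobe_ip_table, &ip, fprobe_rht_params);
          rhl_for_each_entry_rcu(node, pos, head, hlist) {
                  /* invoke the handlers registered for this ip */
          }
  }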
Link: https://lore.kernel.org/all/20250904062729.151931-1-dongml2@chinatelecom.cn/
Link: https://lore.kernel.org/all/20250829021436.19982-1-dongml2@chinatelecom.cn/ [1]
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202508281655.54c87330-lkp@intel.com
Fixes: dfe0d675df82 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Since traceprobe_parse_context is reusable across a probe's arguments,
it is more efficient to allocate it outside of the argument parsing
loop, as the kprobe and fprobe events do.
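The change boils down to the following pattern (illustrative, with a hypothetical parse helper):
  /* before: one context allocated and freed per argument */
  for (i = 0; i < argc; i++) {
          ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
          if (!ctx)
                  return -ENOMEM;
          ret = parse_probe_arg(argv[i], ctx);
          kfree(ctx);
          if (ret)
                  return ret;
  }

  /* after: allocate once, reuse for every argument */
  ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
  if (!ctx)
          return -ENOMEM;
  for (i = 0; i < argc; i++) {
          ret = parse_probe_arg(argv[i], ctx);
          if (ret)
                  break;
  }
  kfree(ctx);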
Link: https://lore.kernel.org/all/175509541393.193596.16330324746701582114.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Use __free() to clean up the ugly gotos in __trace_uprobe_create().
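A generic illustration of the __free() pattern (hypothetical helpers, not the actual __trace_uprobe_create() code): the annotated pointer is passed to kfree() automatically when it goes out of scope, so the error paths no longer need goto labels.
  char *buf __free(kfree) = kstrdup(arg, GFP_KERNEL);

  if (!buf)
          return -ENOMEM;
  if (!argument_is_valid(buf))
          return -EINVAL;           /* buf freed automatically */

  return register_the_probe(buf);   /* buf is only read here and freed on return */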
Link: https://lore.kernel.org/all/175509540482.193596.6541098946023873304.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Use __free(trace_event_probe_cleanup) to remove unneeded gotos and
clean up the last part of trace_eprobe_parse_filter().
Link: https://lore.kernel.org/all/175509539571.193596.4674012182718751429.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Use __free() for trace_probe_log_clear() to clean up the error log interface.
Link: https://lore.kernel.org/all/175509538609.193596.16646724647358218778.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Currently, all kernel functions that are hooked by fprobe are added to
the hash table "fprobe_ip_table". Its key is the function address and
its value is "struct fprobe_hlist_node".
The bucket count of the hash table is FPROBE_IP_TABLE_SIZE, which is 256.
This means the overhead of a hash table lookup grows linearly once the
number of functions hooked by fprobe exceeds 256. When trying to hook all
kernel functions, the overhead becomes huge.
Therefore, replace the hash table with rhltable to reduce the overhead.
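A sketch of what the rhltable-based layout can look like (the params and member names are illustrative and may differ from the actual patch):
  static const struct rhashtable_params fprobe_rht_params = {
          .head_offset         = offsetof(struct fprobe_hlist_node, hlist),
          .key_offset          = offsetof(struct fprobe_hlist_node, addr),
          .key_len             = sizeof(unsigned long),
          .automatic_shrinking = true,
  };

  static struct rhltable fprobe_ip_table;

  static int fprobe_node_insert(struct fprobe_hlist_node *node)
  {
          /* Multiple nodes can share the same function address key. */
          return rhltable_insert(&fprobe_ip_table, &node->hlist,
                                 fprobe_rht_params);
  }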
Link: https://lore.kernel.org/all/20250819031825.55653-1-dongml2@chinatelecom.cn/
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
|
|
node_to_ns() already checks for NULL, so the assert isn't really helpful,
and it will have to be dropped later anyway.
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-7-2e6f823ebdc0@kernel.org
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Otherwise we trip VFS_WARN_ON_ONCE() in __ns_tree_add_raw().
Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-6-2e6f823ebdc0@kernel.org
Fixes: 7c6059398533 ("cgroup: support ns lookup")
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix three regressions, two recent ones and one introduced during
the 6.17 development cycle:
- Add an exit latency check to the menu cpuidle governor in the case
when it considers using a real idle state instead of a polling one
to address a performance regression (Rafael Wysocki)
- Revert an attempted cleanup of a system suspend code path that
introduced a regression elsewhere (Samuel Wu)
- Allow pm_restrict_gfp_mask() to be called multiple times in a row
and adjust pm_restore_gfp_mask() accordingly to avoid having to
play nasty games with these calls during hibernation (Rafael
Wysocki)"
* tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: Allow pm_restrict_gfp_mask() stacking
cpuidle: governors: menu: Select polling state in some more cases
Revert "PM: sleep: Make pm_wakeup_clear() call more clear"
|
|
Merge a cpuidle fix and two fixes related to system sleep for 6.18-rc4:
- Add an exit latency check to the menu cpuidle governor in the case
when it considers using a real idle state instead of a polling one to
address a performance regression (Rafael Wysocki)
- Revert an attempted cleanup of a system suspend code path that
introduced a regression elsewhere (Samuel Wu)
- Allow pm_restrict_gfp_mask() to be called multiple times in a row
and adjust pm_restore_gfp_mask() accordingly to avoid having to play
nasty games with these calls during hibernation (Rafael Wysocki)
* pm-cpuidle:
cpuidle: governors: menu: Select polling state in some more cases
* pm-sleep:
PM: sleep: Allow pm_restrict_gfp_mask() stacking
Revert "PM: sleep: Make pm_wakeup_clear() call more clear"
|
|
cgroup1 freezer piggybacks on the PM freezer, which inadvertently allowed
userspace to produce uninterruptible tasks at will. To avoid the issue,
cgroup2 freezer switched to a separate job control based mechanism. While
this happened a long time ago, the code and comments haven't been updated,
making it confusing to people who aren't familiar with the history.
Rename cgroup_freezing() to cgroup1_freezing() and update comments on top of
freezing() and frozen() to clarify that cgroup2 freezer isn't covered by the
PM freezer mechanism.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Qu Wenruo <wqu@suse.com>
Link: https://patch.msgid.link/aPZ3q6Hm865NicBC@slm.duckdns.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add a sysfs attribute `/sys/power/hibernate_compression_threads` to
allow runtime configuration of the number of threads used for
compressing and decompressing hibernation images.
The new sysfs interface enables dynamic adjustment at runtime:
# cat /sys/power/hibernate_compression_threads
3
# echo 4 > /sys/power/hibernate_compression_threads
This change provides greater flexibility for debugging and performance
tuning of hibernation without requiring a reboot.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/c68c62f97fabf32507b8794ad8c16cd22ee656ac.1761046167.git.luoxueqin@kylinos.cn
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The number of compression/decompression threads has a direct impact on
hibernate image generation and resume latency. Using more threads can
reduce overall resume time, but on systems with fewer CPU cores it may
also introduce contention and reduce efficiency.
Performance was evaluated on an 8-core ARM system, averaged over 10 runs:
    Threads   Hibernate(s)   Resume(s)
    -----------------------------------
       3         12.14         18.86
       4         12.28         17.48
       5         11.09         16.77
       6         11.08         16.44
With 5–6 threads, resume latency improves by approximately 12% compared
to the default 3-thread configuration, with negligible impact on
hibernate time.
Introduce a new kernel parameter `hibernate_compression_threads=` that
allows users and integrators to tune the number of
compression/decompression threads at boot. This provides a way to
balance performance and CPU utilization across a wide range of hardware
without recompiling the kernel.
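For example, assuming the parameter takes a plain thread count as described above, it can be set on the kernel command line like this (illustrative value):
  hibernate_compression_threads=5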
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/f24b3ca6416e230a515a154ed4c121d72a7e05a6.1761046167.git.luoxueqin@kylinos.cn
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Convert crc->unc_len and crc->unc from fixed-size arrays to dynamically
allocated arrays, sized according to the actual number of threads selected
at runtime. This removes the fixed limit imposed by CMP_THREADS.
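A minimal sketch of the conversion, reusing the field names from this changelog (allocation flags and error handling are illustrative):
  crc->unc     = kcalloc(nr_threads, sizeof(*crc->unc), GFP_KERNEL);
  crc->unc_len = kcalloc(nr_threads, sizeof(*crc->unc_len), GFP_KERNEL);
  if (!crc->unc || !crc->unc_len)
          goto out_free;          /* hypothetical error label */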
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/b5db63bb95729482d2649b12d3a11cb7547b7fcc.1761046167.git.luoxueqin@kylinos.cn
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
emitted record
printk() tries to flush messages with NBCON_PRIO_EMERGENCY on
nbcon consoles immediately. It might take seconds to flush all
pending lines on slow serial consoles. Note that there might be
hundreds of messages, for example:
[ 3.771531][ T1] pci 0000:3e:08.1: [8086:324
** replaying previous printk message **
[ 3.771531][ T1] pci 0000:3e:08.1: [8086:3246] type 00 class 0x088000 PCIe Root Complex Integrated Endpoint
[ ... more than 2000 lines, about 200kB messages ... ]
[ 3.837752][ T1] pci 0000:20:01.0: Adding to iommu group 18
[ 3.837851][ T
** replaying previous printk message **
[ 3.837851][ T1] pci 0000:20:03.0: Adding to iommu group 19
[ 3.837946][ T1] pci 0000:20:05.0: Adding to iommu group 20
[ ... more than 500 messages for iommu groups 21-590 ...]
[ 3.912932][ T1] pci 0000:f6:00.1: Adding to iommu group 591
[ 3.913070][ T1] pci 0000:f6:00.2: Adding to iommu group 592
[ 3.913243][ T1] DMAR: Intel(R) Virtualization Technology for Directed I/O
[ 3.913245][ T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 3.913245][ T1] software IO TLB: mapped [mem 0x000000004f000000-0x0000000053000000] (64MB)
[ 3.913324][ T1] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 655360 ms ovfl timer
[ 3.913325][ T1] RAPL PMU: hw unit of domain package 2^-14 Joules
[ 3.913326][ T1] RAPL PMU: hw unit of domain dram 2^-14 Joules
[ 3.913327][ T1] RAPL PMU: hw unit of domain psys 2^-0 Joules
[ 3.933486][ T1] ------------[ cut here ]------------
[ 3.933488][ T1] WARNING: CPU: 2 PID: 1 at arch/x86/events/intel/uncore.c:1156 uncore_pci_pmu_register+0x15e/0x180
[ 3.930291][ C0] watchdog: Watchdog detected hard LOCKUP on cpu 0
[ 3.930291][ C0] Kernel panic - not syncing: Hard LOCKUP
[...]
[ 3.930291][ C0] CPU: 0 UID: 0 PID: 18 Comm: pr/ttyS0 Not tainted...
[...]
[ 3.930291][ C0] RIP: 0010:nbcon_reacquire_nobuf+0x11/0x50
[ 3.930291][ C0] Call Trace:
[...]
[ 3.930291][ C0] <TASK>
[ 3.930291][ C0] serial8250_console_write+0x16d/0x5c0
[ 3.930291][ C0] nbcon_emit_next_record+0x22c/0x250
[ 3.930291][ C0] nbcon_emit_one+0x93/0xe0
[ 3.930291][ C0] nbcon_kthread_func+0x13c/0x1c0
There are two visible takeovers of the console ownership:
- The 1st one is triggered by the "WARNING: CPU: 2 PID: 1 at
arch/x86/..." line printed with NBCON_PRIO_EMERGENCY.
- The 2nd one is triggered by the "Kernel panic - not syncing:
Hard LOCKUP" line printed with NBCON_PRIO_PANIC.
There are more than 2500 lines, at about 240kB, emitted between
the takeover and the 1st "WARNING" line in the emergency context.
This amount of pending messages had to be flushed by
nbcon_atomic_flush_pending() when WARN() printed its first line.
The atomic flush was holding the nbcon console context for too long so
that it triggered hard lockup on the CPU running the printk kthread
"pr/ttyS0". The kthread needed to reacquire the console ownership
for restoring the original serial port state in serial8250_console_write().
Prevent the hardlockup by releasing the nbcon console ownership after
each emitted record.
Note that __nbcon_atomic_flush_pending_con() used to hold the console
ownership the whole time in order to block the printk kthread. Otherwise,
the kthread would try to flush the messages in parallel, which caused
repeated takeovers and more replayed messages.
It is no longer a problem because the repeated takeovers are blocked
by the counter of emergency contexts, see nbcon_cpu_emergency_cnt.
Link: https://lore.kernel.org/all/aNQO-zl3k1l4ENfy@pathway.suse.cz
Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20250926124912.243464-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The printk kthread might be running when there is a panic in progress,
but it is not able to acquire the console ownership any longer.
Prevent the desperate attempts to acquire the ownership and allow the
kthread to sleep in panic. This makes it behave the same as when any
CPU is in an emergency context.
Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20250926124912.243464-3-pmladek@suse.com
[pmladek@suse.com: Rebased on top of 6.18-rc1 (panic_in_progress() moved to panic.c)]
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
In emergency contexts, printk() tries to flush messages directly even
on nbcon consoles. And it is allowed to takeover the console ownership
and interrupt the printk kthread in the middle of a message.
Only one takeover and one repeated message should be enough in most
situations. The first emergency message flushes the backlog and printk
kthreads get to sleep. Next emergency messages are flushed directly
and printk() does not wake up the kthreads.
However, the one takeover is not guaranteed. Any printk() in normal
context on another CPU could wake up the kthreads. Or a new emergency
message might be added before the kthreads get to sleep. Note that
the interrupted .write_thread() callbacks usually have to call
nbcon_reacquire_nobuf() and restore the original device setting
before checking for pending messages.
The risk of repeated takeovers will become even bigger because
__nbcon_atomic_flush_pending_con() is going to release the console
ownership after each emitted record. That release will be needed to
prevent hardlockup reports on other CPUs which are busy waiting for
the context ownership, for example, in nbcon_reacquire_nobuf() or
__uart_port_nbcon_acquire().
The repeated takeovers break the output, for example:
[ 5042.650211][ T2220] Call Trace:
[ 5042.6511
** replaying previous printk message **
[ 5042.651192][ T2220] <TASK>
[ 5042.652160][ T2220] kunit_run_
** replaying previous printk message **
[ 5042.652160][ T2220] kunit_run_tests+0x72/0x90
[ 5042.653340][ T22
** replaying previous printk message **
[ 5042.653340][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5042.654628][ T2220] ? stack_trace_save+0x4d/0x70
[ 5042.6553
** replaying previous printk message **
[ 5042.655394][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5042.656713][ T2220] ? save_trace+0x5b/0x180
A more robust solution is to block the printk kthread entirely whenever
*any* CPU enters an emergency context. This ensures that critical messages
can be flushed without contention from the normal, non-atomic printing
path.
Link: https://lore.kernel.org/all/aNQO-zl3k1l4ENfy@pathway.suse.cz
Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20250926124912.243464-2-pmladek@suse.com
[pmladek@suse.com: Added changes proposed by John Ogness]
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
BPF stream kfuncs need to be non-sleeping as they can be called from
programs running in any context, which requires a way to allocate memory
from any context. Currently, this is done by a custom per-CPU NMI-safe
bump allocation mechanism, backed by the alloc_pages_nolock() and
free_pages_nolock() primitives.
Now that the kmalloc_nolock() and kfree_nolock() primitives are available,
the custom allocator can be removed in favor of them.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251023161448.4263-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Disable propagation and unwinding of the waiter queue in case the head
waiter detects a deadlock condition, but keep it enabled in case of the
timeout fallback.
Currently, when the head waiter experiences an AA deadlock, it will
signal all its successors in the queue to exit with an error. This is
not ideal for cases where the same lock is held in contexts which can
cause errors in an unrestricted fashion (e.g., BPF programs, or kernel
paths invoked through BPF programs), and core kernel logic which is
written in a correct fashion and does not expect deadlocks.
The same reasoning can be extended to ABBA situations. Depending on the
actual runtime schedule, one or both of the head waiters involved in an
ABBA situation can detect and exit directly without terminating their
waiter queue. If the ABBA situation manifests again, the waiters will
keep exiting until progress can be made, or a timeout is triggered in
case of more complicated locking dependencies.
We still preserve the queue destruction in case of timeouts, as either
the locking dependencies are too complex to be captured by AA and ABBA
heuristics, or the owner is perpetually stuck. As such, it would be
unwise to continue to apply the timeout for each new head waiter without
terminating the queue, since we may end up waiting for more than 250 ms
in aggregate with all participants in the locking transaction.
The patch itself is fairly simple; we can simply signal our successor to
become the next head waiter, and leave the queue without attempting to
acquire the lock.
With this change, the behavior for waiters in case of deadlocks
experienced by a predecessor changes. It is guaranteed that call sites
will no longer receive errors if the predecessors encounter deadlocks
and the successors do not participate in one. This should lower the
failure rate for waiters that are not doing improper locking operations
and were merely unlucky enough to queue behind a misbehaving waiter.
However, timeouts are still a possibility, hence they must be accounted
for, so users cannot rely on errors not occurring at all.
Suggested-by: Amery Hung <ameryhung@gmail.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251029181828.231529-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Allow pm_restrict_gfp_mask() to be called many times in a row to avoid
issues with calling dpm_suspend_start() when the GFP mask has been
already restricted.
Only the first invocation of pm_restrict_gfp_mask() will actually
restrict the GFP mask and the subsequent calls will warn if there is
a mismatch between the expected allowed GFP mask and the actual one.
Moreover, if pm_restrict_gfp_mask() is called many times in a row,
pm_restore_gfp_mask() needs to be called a matching number of times in
a row to actually restore the GFP mask. Calling it when the GFP mask
has not been restricted will cause it to warn.
This is necessary for the GFP mask restriction starting in
hibernation_snapshot() to continue throughout the entire hibernation
flow until it completes or it is aborted (either by a wakeup event or
by an error).
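A minimal sketch of the stacking idea (the counter name is illustrative and may differ from the actual patch):
  static unsigned int restrict_gfp_count;
  static gfp_t saved_gfp_mask;

  void pm_restrict_gfp_mask(void)
  {
          if (!restrict_gfp_count++) {
                  saved_gfp_mask = gfp_allowed_mask;
                  gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
          } else {
                  /* Nested call: the mask must already be restricted. */
                  WARN_ON(gfp_allowed_mask !=
                          (saved_gfp_mask & ~(__GFP_IO | __GFP_FS)));
          }
  }

  void pm_restore_gfp_mask(void)
  {
          if (WARN_ON(!restrict_gfp_count))
                  return;
          if (!--restrict_gfp_count)
                  gfp_allowed_mask = saved_gfp_mask;
  }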
Fixes: 449c9c02537a1 ("PM: hibernate: Restrict GFP mask in hibernation_snapshot()")
Fixes: 469d80a3712c ("PM: hibernate: Fix hybrid-sleep")
Reported-by: Askar Safin <safinaskar@gmail.com>
Closes: https://lore.kernel.org/linux-pm/20251025050812.421905-1-safinaskar@gmail.com/
Link: https://lore.kernel.org/linux-pm/20251028111730.2261404-1-safinaskar@gmail.com/
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: Mario Limonciello (AMD) <superm1@kernel.org>
Cc: 6.16+ <stable@vger.kernel.org> # 6.16+
Link: https://patch.msgid.link/5935682.DvuYhMxLoT@rafael.j.wysocki
|
|
The ops.cpu_acquire/release() callbacks miss events under multiple conditions.
There are two distinct task dispatch gaps that can cause cpu_released flag
desynchronization:
1. balance-to-pick_task gap: This is what was originally reported. balance_scx()
can enqueue a task, but during consume_remote_task() when the rq lock is
released, a higher priority task can be enqueued and ultimately picked while
cpu_released remains false. This gap is closeable via RETRY_TASK handling.
2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local
DSQ. By the time the sched path runs on the target CPU, higher class tasks may
already be queued. In such cases, nothing on sched_ext side will be invoked,
and the only solution would be a hook invoked regardless of sched class, which
isn't desirable.
Rather than adding invasive core hooks, BPF schedulers can use generic BPF
mechanisms like tracepoints. From an SCX scheduler's perspective, this is congruent
with other mechanisms it already uses and doesn't add further friction.
The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU gets preempted by a higher priority scheduling class. However, the old
scx_bpf_reenqueue_local() could only be called from cpu_release() context.
Add a new version of scx_bpf_reenqueue_local() that can be called from any
context by deferring the actual re-enqueue operation. This eliminates the need
for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.
Update scx_qmap to demonstrate the new approach using sched_switch instead of
cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
v6.23.
Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a
new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the
BPF kfunc checks and calls reenq_local() to perform the actual work.
This is a prep patch to allow reenq_local() to be called from other contexts.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
schedule_deferred() currently requires the rq lock to be held so that it can
use scheduler hooks for efficiency when available. However, there are cases
where deferred actions need to be scheduled from contexts that don't hold the
rq lock.
Split into schedule_deferred() which can be called from any context and just
queues irq_work, and schedule_deferred_locked() which requires the rq lock and
can optimize by using scheduler hooks when available. Update the existing call
site to use the _locked variant.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Pull to receive f4fa7c25f632 ("sched_ext: Fix use of uninitialized variable
in scx_bpf_cpuperf_set()") which conflicts with changes for planned
sub-sched changes.
|
|
scx_bpf_cpuperf_set() has a typo where it dereferences the local
variable @sch, instead of the global @scx_root pointer. Fix by
dereferencing the correct variable.
Fixes: 956f2b11a8a4f ("sched_ext: Drop kf_cpu_valid()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
printk_legacy_map is used to hide lock nesting violations caused by
legacy drivers and is using the wrong override type. LD_WAIT_SLEEP is
for always-sleeping lock types such as mutex_t. LD_WAIT_CONFIG is for
lock types which sleep on PREEMPT_RT while spinning otherwise, such as
spinlock_t.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20251026150726.GA23223@redhat.com
[pmladek@suse.com: Fixed indentation.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
When em_create_perf_table() returns failure, pd is freed, so dev->em_pd
is not valid. Accessing dev->em_pd->node then triggers a kernel panic
in em_dev_register_pd_no_update(). So return early if 'ret' is non-zero.
Kernel dump:
cpu cpu0: EM: invalid power: 0
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000008
Mem abort info:
pc : em_dev_register_pd_no_update+0xb4/0x79c
lr : em_dev_register_pd_no_update+0x9c/0x79c
Call trace:
em_dev_register_pd_no_update+0xb4/0x79c (P)
em_dev_register_perf_domain+0x18/0x58
scmi_cpufreq_register_em+0x84/0xb8
cpufreq_online+0x48c/0xb74
cpufreq_add_dev+0x80/0x98
subsys_interface_register+0x100/0x11c
cpufreq_register_driver+0x158/0x278
scmi_cpufreq_probe+0x1f8/0x2e0
scmi_dev_probe+0x28/0x3c
really_probe+0xbc/0x29c
__driver_probe_device+0x78/0x12c
driver_probe_device+0x3c/0x15c
__device_attach_driver+0xb8/0x134
bus_for_each_drv+0x84/0xe4
Fixes: cbe5aeedecc7 ("PM: EM: Assign a unique ID when creating a performance domain")
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20251028-fix-energy-v1-1-ab854fd6a97c@nxp.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Add support for deferred userspace unwind to perf.
Perf currently relies on in-place stack unwinding, from NMI context
and all that. This moves the userspace part of the unwind to right
before the return-to-userspace.
This has two distinct benefits. The biggest is that it moves the
unwind to a faultable context. It becomes possible to fault in debug
info (.eh_frame, SFrame etc.) that might not otherwise be readily
available. And secondly, it de-duplicates the user callchain where
multiple samples happen during the same kernel entry.
To facilitate this the perf interface is extended with a new record
type:
PERF_RECORD_CALLCHAIN_DEFERRED
and two new attribute flags:
perf_event_attr::defer_callchain - to request the user unwind be deferred
perf_event_attr::defer_output - to request PERF_RECORD_CALLCHAIN_DEFERRED records
The existing PERF_RECORD_SAMPLE callchain section gets a new
context type:
PERF_CONTEXT_USER_DEFERRED
After which will come a single entry, denoting the 'cookie' of the
deferred callchain that should be attached here, matching the 'cookie'
field of the above mentioned PERF_RECORD_CALLCHAIN_DEFERRED.
The 'defer_callchain' flag is expected on all events with
PERF_SAMPLE_CALLCHAIN. The 'defer_output' flag is expected on the event
responsible for collecting side-band events (like mmap, comm etc.).
Setting 'defer_output' on multiple events will get you duplicated
PERF_RECORD_CALLCHAIN_DEFERRED records.
Based on earlier patches by Josh and Steven.
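A sketch of how a profiling tool might request the new behaviour (the two defer_* bits come from this changelog; the rest of the attribute setup is illustrative):
  struct perf_event_attr attr = {
          .type            = PERF_TYPE_HARDWARE,
          .config          = PERF_COUNT_HW_CPU_CYCLES,
          .sample_type     = PERF_SAMPLE_CALLCHAIN,
          .defer_callchain = 1,   /* defer the user part of the unwind */
  };

  /* Only the event collecting side-band data (mmap, comm, ...) sets: */
  attr.defer_output = 1;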
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251023150002.GR4067720@noisy.programming.kicks-ass.net
|
|
When userspace is interrupted at the start of a function, before it
gets a chance to complete the frame, the unwinder will miss one caller.
x86 has a uprobe-specific fixup for this; add bits to the generic
unwinder to support it.
Suggested-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251024145156.GM4068168@noisy.programming.kicks-ass.net
|
|
It is important to be able to unwind compat tasks too.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20250924080119.613695709@infradead.org
|
|
2^log_2(n) == n
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080119.497867836@infradead.org
|
|
The unwind_task_info::unwind_mask was manipulated using a mixture of:
regular store
WRITE_ONCE()
try_cmpxchg()
set_bit()
atomic_long_*()
Clean up and make it consistently atomic_long_t.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20250924080119.384384486@infradead.org
|
|
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20250924080119.271671514@infradead.org
|
|
The get_cookie() function strictly relies on IRQs being disabled, but this
isn't immediately obvious when reading the function.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080119.122507632@infradead.org
|
|
task_work_add(TWA_RESUME) isn't NMI-safe; use TWA_NMI_CURRENT when
called from NMI context.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080119.005422353@infradead.org
|
|
Explain why unwind_deferred_task_exit() exists and what its constraints are.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080118.893367437@infradead.org
|
|
  __schedule()
    // disable irqs
    <NMI>
      task_work_add(current, work, TWA_NMI_CURRENT);
    </NMI>
    // current = next;
    // enable irqs
    <IRQ>
      task_work_set_notify_irq()
        test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); // wrong task!
    </IRQ>
  // original task skips task work on its next return to user (or exit!)
Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode.")
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080118.425949403@infradead.org
|
|
After conversion of arch code to use physical address mapping,
there are no users of .map_page() and .unmap_page() callbacks,
so let's remove them.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20251015-remove-map-page-v5-14-3bbfe3a25cdf@kernel.org
|
|
After ARM and XEN conversions to use physical addresses for the mapping,
there are no in-kernel users for map_resource/unmap_resource callbacks,
so remove them.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20251015-remove-map-page-v5-6-3bbfe3a25cdf@kernel.org
|