path: root/kernel/events
2025-08-21  uprobes: Remove redundant __GFP_NOWARN  (Qianfeng Rong)
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. No functional changes. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250805025000.346647-1-rongqianfeng@vivo.com
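A minimal sketch of the cleanup pattern this entry describes; the call site below is hypothetical, not one of the actual uprobes call sites:

    /* Before: __GFP_NOWARN spelled out alongside GFP_NOWAIT. */
    buf = kzalloc(size, GFP_NOWAIT | __GFP_NOWARN);

    /* After: GFP_NOWAIT already implies __GFP_NOWARN since commit
     * 16f5dfbc851b, so the explicit flag is redundant.
     */
    buf = kzalloc(size, GFP_NOWAIT);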
2025-08-21  uprobes/x86: Add support to optimize uprobes  (Jiri Olsa)
Putting together all the previously added pieces to support optimized uprobes on top of a 5-byte nop instruction.

The current uprobe execution goes through the following steps:
 - install the breakpoint instruction over the original instruction
 - the exception handler hits and calls the related uprobe consumers
 - either simulate the original instruction or do out-of-line single step execution of it
 - return to user space

The optimized uprobe path does the following:
 - check the original instruction is a 5-byte nop (plus other checks)
 - add (or use an existing) user space trampoline with the uprobe syscall
 - overwrite the original instruction (5-byte nop) with a call to the user space trampoline
 - the user space trampoline executes the uprobe syscall that calls the related uprobe consumers
 - the trampoline returns back to the next instruction

This approach won't speed up all uprobes, as it's limited to using nop5 as the original instruction, but we plan to use nop5 as the USDT probe instruction (which currently uses a single-byte nop) and speed up the USDT probes.

The arch_uprobe_optimize hook triggers the uprobe optimization and is called after the first uprobe hit. I originally had it called on uprobe installation, but it clashed with the elf loader, because the user space trampoline was added in a place where the loader might need to put elf segments, so I decided to do it after the first uprobe hit, when loading is done. The uprobe is un-optimized in the arch specific set_orig_insn call.

The instruction overwrite is x86 arch specific and needs to go through 3 updates on top of the nop5 instruction (see the sketch after this entry):
 - write int3 into the 1st byte
 - write the last 4 bytes of the call instruction
 - update the call instruction opcode

And cleanup goes through similar reverse stages:
 - overwrite the call opcode with breakpoint (int3)
 - write the last 4 bytes of the nop5 instruction
 - write the nop5 first instruction byte

We do not unmap and release the uprobe trampoline when it's no longer needed, because there's no easy way to make sure none of the threads is still inside the trampoline. But we do not waste memory, because there's just a single page for all the uprobe trampoline mappings. We do waste a page frame per 4GB of address space by keeping the uprobe trampoline page mapped, but that seems ok.

We take benefit from the fact that set_swbp and set_orig_insn are called under mmap_write_lock(mm), so we can use the current instruction as the state the uprobe is in - nop5/breakpoint/call trampoline - and decide the needed action (optimize/un-optimize) based on that.

Attaching the speed up from the benchs/run_bench_uprobes.sh script:

  current:
      usermode-count :  152.604 ± 0.044M/s
      syscall-count  :   13.359 ± 0.042M/s
  --> uprobe-nop     :    3.229 ± 0.002M/s
      uprobe-push    :    3.086 ± 0.004M/s
      uprobe-ret     :    1.114 ± 0.004M/s
      uprobe-nop5    :    1.121 ± 0.005M/s
      uretprobe-nop  :    2.145 ± 0.002M/s
      uretprobe-push :    2.070 ± 0.001M/s
      uretprobe-ret  :    0.931 ± 0.001M/s
      uretprobe-nop5 :    0.957 ± 0.001M/s

  after the change:
      usermode-count :  152.448 ± 0.244M/s
      syscall-count  :   14.321 ± 0.059M/s
      uprobe-nop     :    3.148 ± 0.007M/s
      uprobe-push    :    2.976 ± 0.004M/s
      uprobe-ret     :    1.068 ± 0.003M/s
  --> uprobe-nop5    :    7.038 ± 0.007M/s
      uretprobe-nop  :    2.109 ± 0.004M/s
      uretprobe-push :    2.035 ± 0.001M/s
      uretprobe-ret  :    0.908 ± 0.001M/s
      uretprobe-nop5 :    3.377 ± 0.009M/s

I see a bit more speed up on Intel (above) compared to AMD. The big nop5 speed up is partly due to emulating nop5 and partly due to the optimization.

The key speed up we do this for is the USDT switch from nop to nop5:

      uprobe-nop     :    3.148 ± 0.007M/s
      uprobe-nop5    :    7.038 ± 0.007M/s

Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-11-jolsa@kernel.org
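A hedged sketch of the three-stage overwrite above: write_op() is a hypothetical stand-in for the uprobe_write()-based helpers the series actually uses, while INT3_INSN_OPCODE (0xcc) and CALL_INSN_OPCODE (0xe8) are the real x86 text-patching constants:

    u8  int3  = INT3_INSN_OPCODE;            /* 0xcc */
    u8  call  = CALL_INSN_OPCODE;            /* 0xe8 */
    u32 rel32 = tramp_vaddr - (vaddr + 5);   /* call is rip-relative */

    /* Never expose a half-written instruction: int3 first, so any
     * thread hitting the address meanwhile takes the breakpoint path.
     */
    write_op(vaddr, &int3, 1);       /* 1) int3 into the 1st byte       */
    write_op(vaddr + 1, &rel32, 4);  /* 2) last 4 bytes of the call     */
    write_op(vaddr, &call, 1);       /* 3) update opcode: int3 -> call  */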
2025-08-21  uprobes/x86: Add uprobe syscall to speed up uprobe  (Jiri Olsa)
Adding a new uprobe syscall that calls uprobe handlers for a given 'breakpoint' address. The idea is that the 'breakpoint' address calls the user space trampoline, which executes the uprobe syscall. The syscall handler reads the return address of the initial call to retrieve the original 'breakpoint' address. With this address we find the related uprobe object and call its consumers. Adding the arch_uprobe_trampoline_mapping function that provides the uprobe trampoline mapping. This mapping is backed with one global page initialized at __init time and shared by all the mapping instances. We do not allow the uprobe syscall to execute if the caller is not from the uprobe trampoline mapping. The uprobe syscall ensures the consumer (bpf program) sees register values in the state before the trampoline was called. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-10-jolsa@kernel.org
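A hedged sketch of the caller check described above; vma_lookup() is the real mm API, while the helpers for reading the return address and recognizing the trampoline VMA are hypothetical:

    /* Reject uprobe() syscalls whose return address is not inside the
     * uprobe trampoline mapping (helper names are assumptions).
     */
    unsigned long ret_addr = read_user_return_address(regs); /* hypothetical */
    struct vm_area_struct *vma = vma_lookup(current->mm, ret_addr);

    if (!vma || !is_uprobe_trampoline_vma(vma))               /* hypothetical */
        return -EINVAL;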
2025-08-21  uprobes/x86: Add mapping for optimized uprobe trampolines  (Jiri Olsa)
Adding support to add a special mapping for the user space trampoline, with the following functions:

  uprobe_trampoline_get - find or add uprobe_trampoline
  uprobe_trampoline_put - remove or destroy uprobe_trampoline

The user space trampoline is exported as an arch specific user space special mapping through tramp_mapping, which is initialized in the following changes with the new uprobe syscall. The uprobe trampoline needs to be callable/reachable from the probed address, so while searching for an available address we use the is_reachable_by_call function to decide if the uprobe trampoline is callable from the probe address. All uprobe_trampoline objects are stored in the uprobes_state object and are cleaned up when the process mm_struct goes down. Adding new arch hooks for that, because this change is x86_64 specific. Locking is provided by callers in the following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-9-jolsa@kernel.org
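A usage sketch of the pairing described above; the signatures are assumptions based on the function names given in this entry:

    struct uprobe_trampoline *tramp;

    tramp = uprobe_trampoline_get(vaddr);  /* find or add one reachable    */
    if (!tramp)                            /* by a call from the probed    */
        return -ENOMEM;                    /* address                      */
    /* ... install the call to the trampoline ... */
    uprobe_trampoline_put(tramp);          /* remove or destroy when idle  */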
2025-08-21  uprobes: Add do_ref_ctr argument to uprobe_write function  (Jiri Olsa)
Making the update_ref_ctr call in uprobe_write conditional on the do_ref_ctr argument. This way we can use uprobe_write for instruction updates without doing the ref_ctr_offset update. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-8-jolsa@kernel.org
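A hedged sketch of the conditional inside uprobe_write(); the surrounding argument handling is an assumption, update_ref_ctr() is the existing semaphore-update helper in kernel/events/uprobes.c:

    /* Skip the SDT semaphore update when the caller only rewrites the
     * instruction bytes (do_ref_ctr == false).
     */
    if (do_ref_ctr && uprobe->ref_ctr_offset)
        ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1);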
2025-08-21  uprobes: Add is_register argument to uprobe_write and uprobe_write_opcode  (Jiri Olsa)
uprobe_write has a special path to restore the original page when we write the original instruction back. This happens when uprobe_write detects that we want to write anything other than the breakpoint instruction. Move the detection out of uprobe_write and pass it in as an argument, so it's possible to write instructions other than just the breakpoint and the restored original. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-7-jolsa@kernel.org
2025-08-21  uprobes: Add nbytes argument to uprobe_write  (Jiri Olsa)
Adding an nbytes argument to uprobe_write and related functions as preparation for writing whole instructions in the following changes. Also renaming the opcode arguments to insn, which seems to fit better. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-6-jolsa@kernel.org
2025-08-21  uprobes: Add uprobe_write function  (Jiri Olsa)
Adding a uprobe_write function that does what uprobe_write_opcode did so far, but allows passing a verify callback function that checks the memory location before writing the opcode. It will be used in the following changes to implement specific checking logic for instruction updates. uprobe_write_opcode now calls uprobe_write with verify_opcode as the verify callback. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-5-jolsa@kernel.org
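A hedged sketch of the relationship this entry describes; the exact parameter list of uprobe_write() is an assumption (it also grows further in the following entries):

    int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
                            unsigned long vaddr, uprobe_opcode_t opcode)
    {
        /* Keep the old behavior: verify_opcode() checks the location
         * before the write, exactly as uprobe_write_opcode() did itself.
         */
        return uprobe_write(auprobe, mm, vaddr, &opcode, verify_opcode);
    }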
2025-08-21  uprobes: Make copy_from_page global  (Jiri Olsa)
Making copy_from_page global and adding the uprobe prefix. Adding the uprobe prefix to copy_to_page as well, for symmetry. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-4-jolsa@kernel.org
2025-08-21  uprobes: Rename arch_uretprobe_trampoline function  (Jiri Olsa)
We are about to add a uprobe trampoline, so clean up the namespace. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-3-jolsa@kernel.org
2025-08-21  uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock  (Jiri Olsa)
Currently unapply_uprobe takes mmap_read_lock, but it might call remove_breakpoint, which eventually changes user pages. The current code writes either the breakpoint or the original instruction, so it can get away with the read lock, as explained in [1]. But with the upcoming change that writes multiple instructions on the probed address, we need to ensure that any update to the mm's pages is exclusive. [1] https://lore.kernel.org/all/20240710140045.GA1084@redhat.com/ Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-2-jolsa@kernel.org
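The locking change, shown as an illustrative diff of unapply_uprobe():

    -	mmap_read_lock(mm);
    +	mmap_write_lock(mm);
     	/* ... remove_breakpoint() may now rewrite user pages ... */
    -	mmap_read_unlock(mm);
    +	mmap_write_unlock(mm);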
2025-08-15  perf: Convert mmap() refcounts to refcount_t  (Thomas Gleixner)
The recently fixed reference count leaks could have been detected by using refcount_t and refcount_t would have mitigated the potential overflow at least. Now that the code is properly structured, convert the mmap() related mmap_count variants over to refcount_t. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104020.071507932@infradead.org
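The conversion pattern, illustrated on event->mmap_count (one of the counters this entry converts); refcount_t warns on 0->1 increments, overflow and underflow, which is what makes the earlier leaks detectable:

    -	atomic_t		mmap_count;
    +	refcount_t		mmap_count;

    -	atomic_inc(&event->mmap_count);
    +	refcount_inc(&event->mmap_count);

    -	if (atomic_dec_and_test(&event->mmap_count))
    +	if (refcount_dec_and_test(&event->mmap_count))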
2025-08-15  perf: Identify the 0->1 transition for event::mmap_count  (Peter Zijlstra)
Needed because refcount_inc() doesn't allow the 0->1 transition. Specifically, this is the case where we've just created the RB: that means there was no RB before, and as such there could not have been an mmap, so this must be the first one. Additionally we hold mmap_mutex to serialize everything. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250812104019.956479989@infradead.org
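A hedged sketch of the distinction; the variable naming is an assumption:

    if (rb_created)
        refcount_set(&event->mmap_count, 1); /* 0 -> 1: provably the first
                                                mmap, no RB existed before */
    else
        refcount_inc(&event->mmap_count);    /* later mmaps: 1 -> n        */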
2025-08-15  perf: Use scoped_guard() for mmap_mutex in perf_mmap()  (Peter Zijlstra)
Mostly just re-indent noise. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.838047976@infradead.org
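The idiom from <linux/cleanup.h> that this entry switches to; the body shown is illustrative:

    scoped_guard(mutex, &event->mmap_mutex) {
        /* mmap_mutex is held throughout this block and released
         * automatically on every exit path - no unlock labels needed.
         */
    }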
2025-08-15  perf: Split out the RB allocation  (Peter Zijlstra)
Move the RB buffer allocation branch into its own function. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.722214699@infradead.org
2025-08-15  perf: Make RB allocation branch self sufficient  (Peter Zijlstra)
Ensure @rb usage doesn't extend out of the branch block. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.605285302@infradead.org
2025-08-15  perf: Split out the AUX buffer allocation  (Peter Zijlstra)
Move the AUX buffer allocation branch into its own function. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.494205648@infradead.org
2025-08-15  perf: Reflow to get rid of aux_success label  (Peter Zijlstra)
Mostly re-indent noise needed to get rid of that label. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.362581570@infradead.org
2025-08-15  perf: Use guard() for aux_mutex in perf_mmap()  (Peter Zijlstra)
After duplicating the common code into the rb/aux branches, it is possible to use a simple guard() for the aux_mutex, making the aux branch self-contained. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.246250452@infradead.org
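The guard() form, which pins the lock to the enclosing scope rather than an explicit block; rb->aux_mutex per the surrounding perf_mmap() code:

    guard(mutex)(&rb->aux_mutex);
    /* aux_mutex is now held until the end of the enclosing scope and
     * released automatically, keeping the aux branch self-contained.
     */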
2025-08-15  perf: Remove redundant aux_unlock label  (Peter Zijlstra)
unlock and aux_unlock are now identical; remove the aux_unlock one. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.131293512@infradead.org
2025-08-15  perf: Move common code into both rb and aux branches  (Peter Zijlstra)
Turn:

  if (cond) {
      A;
  } else {
      B;
  }
  C;

into:

  if (cond) {
      A;
      C;
  } else {
      B;
      C;
  }

Notably C has a success branch and both A and B have two places for success. For A (rb case), duplicate the success case because later patches will result in them no longer being identical. For B (aux case), share using goto (cleaned up later). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.016252852@infradead.org
2025-08-15  perf: Merge consecutive conditionals in perf_mmap()  (Peter Zijlstra)
Turn:

  if (cond) {
      A;
  } else {
      B;
  }
  if (cond) {
      C;
  } else {
      D;
  }

into:

  if (cond) {
      A;
      C;
  } else {
      B;
      D;
  }

Notably the conditions are not identical in form, but are equivalent. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.900078502@infradead.org
2025-08-15  perf: Move perf_mmap_calc_limits() into both rb and aux branches  (Peter Zijlstra)
Turn:

  if (cond) {
      A;
  } else {
      B;
  }
  C;

into:

  if (cond) {
      A;
      C;
  } else {
      B;
      C;
  }

Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.781244099@infradead.org
2025-08-15  perf: Split out VM accounting  (Thomas Gleixner)
Similarly to the mlock limit calculation, the VM accounting is required for both the ringbuffer and the AUX buffer allocations. To prepare for splitting them out into separate functions, move the accounting into a helper function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.660347811@infradead.org
2025-08-15  perf: Split out mlock limit handling  (Thomas Gleixner)
To prepare for splitting the buffer allocation out into separate functions for the ring buffer and the AUX buffer, split out mlock limit handling into a helper function, which can be called from both. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.541975109@infradead.org
2025-08-15  perf: Remove redundant condition for AUX buffer size  (Thomas Gleixner)
It is already checked whether the VMA size is the same as nr_pages * PAGE_SIZE, so the later check of both:

  aux_size == vma_size && aux_size == nr_pages * PAGE_SIZE

is redundant. Remove the vma_size check, as nr_pages is what is actually used in the allocation function. That prepares for splitting out the buffer allocation into separate functions, so that only nr_pages needs to be handed in. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.424519320@infradead.org
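As an illustrative diff (the condition is quoted from the entry above; the surrounding code is simplified):

    -	if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
    +	if (aux_size != nr_pages * PAGE_SIZE)
    		return -EINVAL;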
2025-08-15  perf: Avoid undefined behavior from stopping/starting inactive events  (Yunseong Kim)
Calling pmu->start()/stop() on perf events in PERF_EVENT_STATE_OFF can leave event->hw.idx at -1. When PMU drivers later attempt to use this negative index as a shift exponent in bitwise operations, it leads to UBSAN shift-out-of-bounds reports.

The issue is a logical flaw in how event groups handle throttling when some members are intentionally disabled. This is based on the analysis and the reproducer provided by Mark Rutland (the issue reproduces on both arm64 and x86-64). The scenario unfolds as follows:

 1. A group leader event is configured with a very aggressive sampling period (e.g., sample_period = 1). This causes frequent interrupts and triggers the throttling mechanism.
 2. A child event in the same group is created in a disabled state (.disabled = 1). This event remains in PERF_EVENT_STATE_OFF. Since it hasn't been scheduled onto the PMU, its event->hw.idx remains initialized at -1.
 3. When throttling occurs, perf_event_throttle_group() and later perf_event_unthrottle_group() iterate through all siblings, including the disabled child event.
 4. perf_event_throttle()/unthrottle() are called on this inactive child event, which then call event->pmu->start()/stop().
 5. The PMU driver receives the event with hw.idx == -1 and attempts to use it as a shift exponent, e.g. in macros like PMCNTENSET(idx), leading to the UBSAN report.

The throttling mechanism attempts to start/stop events that are not actively scheduled on the hardware. Move the state check into perf_event_throttle()/perf_event_unthrottle() so that inactive events are skipped entirely (see the sketch after this entry). This ensures only active events with a valid hw.idx are processed, preventing undefined behavior and silencing UBSAN warnings. The corrected check ensures the event is active before proceeding with PMU operations. The problem can be reproduced with the syzkaller reproducer. Fixes: 9734e25fbf5a ("perf: Fix the throttle logic for a group") Signed-off-by: Yunseong Kim <ysk@kzalloc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250812181046.292382-2-ysk@kzalloc.com
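A hedged sketch of the fix; the function body is reduced to the added check:

    static void perf_event_throttle(struct perf_event *event)
    {
        /* Inactive events (e.g. created with .disabled = 1) were never
         * scheduled on the PMU and have hw.idx == -1; skip them.
         */
        if (event->state != PERF_EVENT_STATE_ACTIVE)
            return;

        /* ... existing stop + MAX_INTERRUPTS logic ... */
    }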
2025-08-05  perf/core: Prevent VMA split of buffer mappings  (Thomas Gleixner)
The perf mmap code is careful about mmap()'ing the user page with the ringbuffer, and additionally the auxiliary buffer when the event supports it. Once the first mapping is established, subsequent mappings have to use the same offset and the same size in both cases. The reference counting for the ringbuffer and the auxiliary buffer depends on this being correct.

However, perf does not prevent a related mapping from being split via mmap(2), munmap(2) or mremap(2). A split of a VMA results in perf_mmap_open() calls, which take reference counts, but the subsequent perf_mmap_close() calls then no longer fulfil the offset and size checks. This leads to reference count leaks.

As perf already requires subsequent mappings to match the initial mapping, the obvious consequence is that VMA splits, caused by resizing of a mapping or partial unmapping, have to be prevented. Implement the vm_operations_struct::may_split() callback and return unconditionally -EINVAL (see the sketch after this entry). That ensures that the mapping offsets and sizes cannot be changed after the fact. Remapping to a different fixed address with the same size is still possible, as it takes the references for the new mapping and drops those of the old mapping. Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams") Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27504 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: stable@vger.kernel.org
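A sketch of the callback; the signature matches vm_operations_struct::may_split, while the function and ops names are assumptions:

    static int perf_mmap_may_split(struct vm_area_struct *vma,
                                   unsigned long addr)
    {
        /* Splitting a perf buffer mapping would desynchronize the
         * offset/size checks that the buffer refcounting relies on.
         */
        return -EINVAL;
    }

    static const struct vm_operations_struct perf_mmap_vmops = {
        /* .open, .close, .fault etc. elided */
        .may_split = perf_mmap_may_split,
    };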
2025-08-05  perf/core: Handle buffer mapping fail correctly in perf_mmap()  (Thomas Gleixner)
After successful allocation of a buffer, or a successful attachment to an existing buffer, perf_mmap() tries to map the buffer read only into the page table. If that fails, the already set up page table entries are zapped, but the other perf specific side effects of that failure are not handled. The calling code just cleans up the VMA and does not invoke perf_mmap_close(). This leaks reference counts, corrupts the user->vm accounting and also results in an unbalanced invocation of event::event_mapped().

Cure this by moving the event::event_mapped() invocation before the map_range() call, so that on map_range() failure perf_mmap_close() can be invoked without causing an unbalanced event::event_unmapped() call. perf_mmap_close() undoes the reference counts and eventually frees buffers. Fixes: b709eb872e19 ("perf: map pages in advance") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org
2025-08-05  perf/core: Exit early on perf_mmap() fail  (Thomas Gleixner)
When perf_mmap() fails to allocate a buffer, it still invokes the event_mapped() callback of the related event. On x86 this might increase the perf_rdpmc_allowed reference counter. But nothing undoes this, as perf_mmap_close() is never called in this case, which causes another reference count leak. Return early on failure to prevent that. Fixes: 1e0fb9ec679c ("perf: Add pmu callbacks to track event mapping and unmapping") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org
2025-08-05  perf/core: Don't leak AUX buffer refcount on allocation failure  (Thomas Gleixner)
Failure of the AUX buffer allocation leaks the reference count. Set the reference count to 1 only when the allocation succeeds. Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org
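A hedged sketch of the reordering (field names and arguments are assumptions):

    ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages, flags);
    if (!ret)
        atomic_set(&rb->aux_mmap_count, 1); /* take the ref only on success */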
2025-08-05  perf/core: Preserve AUX buffer allocation failure result  (Thomas Gleixner)
A recent overhaul sets the return value to 0 unconditionally after the allocations, which causes reference count leaks and corrupts the user->vm accounting. Preserve the AUX buffer allocation failure return value, so that the subsequent code works correctly. Fixes: 0983593f32c4 ("perf/core: Lift event->mmap_mutex in perf_mmap()") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org
2025-08-03  Merge tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)
Pull non-MM updates from Andrew Morton:
"Significant patch series in this pull request:

 - "squashfs: Remove page->mapping references" (Matthew Wilcox) gets us closer to being able to remove page->mapping

 - "relayfs: misc changes" (Jason Xing) does some maintenance and minor feature addition work in relayfs

 - "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches us from static preallocation of the kdump crashkernel's working memory over to dynamic allocation. So the difficulty of a-priori estimation of the second kernel's needs is removed and the first kernel obtains extra memory

 - "generalize panic_print's dump function to be used by other kernel parts" (Feng Tang) implements some consolidation and rationalization of the various ways in which a failing kernel splats information at the operator"

* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
  tools/getdelays: add backward compatibility for taskstats version
  kho: add test for kexec handover
  delaytop: enhance error logging and add PSI feature description
  samples: Kconfig: fix spelling mistake "instancess" -> "instances"
  fat: fix too many log in fat_chain_add()
  scripts/spelling.txt: add notifer||notifier to spelling.txt
  xen/xenbus: fix typo "notifer"
  net: mvneta: fix typo "notifer"
  drm/xe: fix typo "notifer"
  cxl: mce: fix typo "notifer"
  KVM: x86: fix typo "notifer"
  MAINTAINERS: add maintainers for delaytop
  ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
  ucount: fix atomic_long_inc_below() argument type
  kexec: enable CMA based contiguous allocation
  stackdepot: make max number of pools boot-time configurable
  lib/xxhash: remove unused functions
  init/Kconfig: restore CONFIG_BROKEN help text
  lib/raid6: update recov_rvv.c zero page usage
  docs: update docs after introducing delaytop
  ...
2025-07-31  Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets. 21 of those are partially or fully cleanup work. "cleans up", "cleanup", "maintainability", "rationalizes", etc. I never knew the MM code was so dirty.

 - "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes) addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs.

 - "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park) adds a new kernel module which simplifies the setup and usage of DAMON in production environments.

 - "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig) is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control.

 - "drivers/base/node.c: optimization and cleanups" (Donet Tom) contains largely uncorrelated cleanups to the NUMA node setup and management code.

 - "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman) does some maintenance work on the userfaultfd code.

 - "Readahead tweaks for larger folios" (Ryan Roberts) implements some tuneups for pagecache readahead when it is reading into order>0 folios.

 - "selftests/mm: Tweaks to the cow test" (Mark Brown) provides some cleanups and consistency improvements to the selftests code.

 - "Optimize mremap() for large folios" (Dev Jain) does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark.

 - "Remove zero_user()" (Matthew Wilcox) expunges zero_user() in favor of the more modern memzero_page().

 - "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand) addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time.

 - "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD}" (SeongJae Park) provides some cleanup and consolidation work in DAMON.

 - "use vm_flags_t consistently" (Lorenzo Stoakes) uses vm_flags_t in places where we were inappropriately using other types.

 - "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy) increases the reliability of large page allocation in the memfd code.

 - "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple) removes several now-unneeded PFN_* flags.

 - "mm/damon: decouple sysfs from core" (SeongJae Park) implements some cleanup and maintainability work in the DAMON sysfs layer.

 - "madvise cleanup" (Lorenzo Stoakes) does quite a lot of cleanup/maintenance work in the madvise() code.

 - "madvise anon_name cleanups" (Vlastimil Babka) provides additional cleanups on top of Lorenzo's effort.

 - "Implement numa node notifier" (Oscar Salvador) creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier.

 - "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan) cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice.

 - "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park) adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite.

 - "Misc rework on hugetlb faulting path" (Oscar Salvador) fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups.

 - "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport) rationalizes and cleans up the highmem-specific code in the CMA allocator.

 - "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand) provides cleanups and future-preparedness to the migration code.

 - "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park) adds some tracepoints to some DAMON auto-tuning code.

 - "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park) does that.

 - "mm/damon: misc cleanups" (SeongJae Park) also does what it claims.

 - "mm: folio_pte_batch() improvements" (David Hildenbrand) cleans up the large folio PTE batching code.

 - "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park) facilitates dynamic alteration of DAMON's inter-node allocation policy.

 - "Remove unmap_and_put_page()" (Vishal Moola) provides a couple of page->folio conversions.

 - "mm: per-node proactive reclaim" (Davidlohr Bueso) implements a per-node control of proactive reclaim - beyond the current memcg-based implementation.

 - "mm/damon: remove damon_callback" (SeongJae Park) replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface.

 - "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes) implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably.

 - "drop hugetlb_free_pgd_range()" (Anthony Yznaga) switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range().

 - "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park) augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval.

 - "Some randome fixes and cleanups to swapfile" (Kemeng Shi) does what it claims.

 - "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand) provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly.

 - "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan) addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface.

 - "__folio_split() clean up" (Zi Yan) cleans up __folio_split()!

 - "Optimize mprotect() for large folios" (Dev Jain) provides some quite large (>3x) speedups to mprotect() when dealing with large folios.

 - "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian) does some cleanup work in the selftests code.

 - "tools/testing: expand mremap testing" (Lorenzo Stoakes) extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature.

 - "selftests/damon/sysfs.py: test all parameters" (SeongJae Park) extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters, rather than the present minimal subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to mempory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-09  uprobes: revert ref_ctr_offset in uprobe_unregister error path  (Jiri Olsa)
There's an error path that could lead to an inactive uprobe:

 1) uprobe_register succeeds - updates the instruction to int3 and changes ref_ctr from 0 to 1
 2) uprobe_unregister fails - int3 stays in place, but ref_ctr is changed to 0 (it's not restored to 1 in the fail path); the uprobe is leaked
 3) another uprobe_register comes, re-uses the leaked uprobe and succeeds - but int3 is already in place, so the ref_ctr update is skipped and it stays 0 - the uprobe CAN NOT be triggered now
 4) uprobe_unregister fails because the ref_ctr value is unexpected

Fix this by reverting the updated ref_ctr value back to 1 in step 2), which is the case when uprobe_unregister fails (int3 stays in place) but we have already updated the refctr. The new scenario will go as follows:

 1) uprobe_register succeeds - updates the instruction to int3 and changes ref_ctr from 0 to 1
 2) uprobe_unregister fails - int3 stays in place and ref_ctr is reverted to 1; the uprobe is leaked
 3) another uprobe_register comes, re-uses the leaked uprobe and succeeds - but int3 is already in place, so the ref_ctr update is skipped and it stays 1 - the uprobe CAN be triggered now
 4) uprobe_unregister succeeds

Link: https://lkml.kernel.org/r/20250514101809.2010193-1-jolsa@kernel.org Fixes: 1cc33161a83d ("uprobes: Support SDT markers having reference count (semaphore)") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  mm: filter zone device pages returned from folio_walk_start()  (Alistair Popple)
Previously dax pages were skipped by the pagewalk code as pud_special() or vm_normal_page{_pmd}() would be false for DAX pages. Now that dax pages are refcounted normally that is no longer the case, so the pagewalk code will start returning them. Most callers already explicitly filter for DAX or zone device pages so don't need updating. However some don't, so add checks to those callers. Link: https://lkml.kernel.org/r/4ecb7b357fc5b435588024770b88bbb695c30090.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  mm: use folio_expected_ref_count() helper for reference counting  (Shivank Garg)
Replace open-coded folio reference count calculations with the folio_expected_ref_count() helper. No functional changes intended. Link: https://lkml.kernel.org/r/20250611052706.515408-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
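A hedged sketch of the pattern as it might appear in the uprobes page-replacement path; the exact open-coded expression being replaced is simplified:

    /* Only proceed if nobody else holds extra references: the expected
     * count (mappings etc.) plus our own reference from the lookup.
     */
    if (folio_ref_count(folio) != folio_expected_ref_count(folio) + 1)
        return -EBUSY; /* illustrative error handling */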
2025-07-09  perf/core: Fix WARN in perf_sigtrap()  (Tetsuo Handa)
Since exit_task_work() runs after perf_event_exit_task_context() has updated ctx->task to TASK_TOMBSTONE, perf_sigtrap() called from perf_pending_task() might observe event->ctx->task == TASK_TOMBSTONE. Swap the early exit tests in order not to hit the WARN_ON_ONCE(). Closes: https://syzkaller.appspot.com/bug?extid=2fe61cb2a86066be6985 Reported-by: syzbot <syzbot+2fe61cb2a86066be6985@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/b1c224bd-97f9-462c-a3e3-125d5e19c983@I-love.SAKURA.ne.jp
2025-07-03  perf: Revert to requiring CAP_SYS_ADMIN for uprobes  (Peter Zijlstra)
Jann reports that uprobes can be used destructively when used in the middle of an instruction. The kernel only verifies there is a valid instruction at the requested offset, but due to variable instruction length cannot determine if this is an instruction as seen by the intended execution stream. Additionally, Mark Rutland notes that on architectures that mix data in the text segment (like arm64), a similar thing can be done if the data word is 'mistaken' for an instruction. As such, require CAP_SYS_ADMIN for uprobes. Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
2025-06-30  perf/core: Fix the WARN_ON_ONCE is out of lock protected region  (Luo Gengkun)
Commit 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()") tried to fix a concurrency problem between perf_cgroup_switch and perf_cgroup_event_disable. But it did not move the WARN_ON_ONCE into the lock-protected region, so the warning can still be triggered. Fixes: 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250626135403.2454105-1-luogengkun@huaweicloud.com
2025-06-26  perf/aux: Fix pending disable flow when the AUX ring buffer overruns  (Leo Yan)
If an AUX event overruns, the event core layer intends to disable the event by setting the 'pending_disable' flag. Unfortunately, the event is not actually disabled afterwards. In commit: ca6c21327c6a ("perf: Fix missing SIGTRAPs") the 'pending_disable' flag was changed to a boolean. However, the AUX event code was not updated accordingly. The flag ends up holding a CPU number. If this number is zero, the flag is taken as false and the IRQ work is never triggered. Later, with commit: 2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()") a new IRQ work 'pending_disable_irq' was introduced to handle event disabling. The AUX event path was not updated to kick off the work queue. To fix this bug, when an AUX ring buffer overrun is detected, call perf_event_disable_inatomic() to initiate the pending disable flow. Also update the outdated comment for setting the flag, to reflect the boolean values (0 or 1). Fixes: 2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()") Fixes: ca6c21327c6a ("perf: Fix missing SIGTRAPs") Signed-off-by: Leo Yan <leo.yan@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: James Clark <james.clark@linaro.org> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Liang Kan <kan.liang@linux.intel.com> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-perf-users@vger.kernel.org Link: https://lore.kernel.org/r/20250625170737.2918295-1-leo.yan@arm.com
2025-06-22  Merge tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fixes from Borislav Petkov:

 - Avoid a crash on a heterogeneous machine where not all cores support the same hw events features
 - Avoid a deadlock when throttling events
 - Document the perf event states more
 - Make sure a number of perf paths switching off or rescheduling events call perf_cgroup_event_disable()
 - Make sure perf does task sampling before its userspace mapping is torn down, and not after

* tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix crash in icl_update_topdown_event()
  perf: Fix the throttle error of some clock events
  perf: Add comment to enum perf_event_state
  perf/core: Fix WARN in perf_cgroup_switch()
  perf: Fix dangling cgroup pointer in cpuctx
  perf: Fix cgroup state vs ERROR
  perf: Fix sample vs do_exit()
2025-06-11  perf: Fix the throttle error of some clock events  (Kan Liang)
Both the ARM and IBM CI report an RCU stall, which can be reproduced by the perf command below:

  perf record -a -e cpu-clock -- sleep 2

The issue was introduced by the generic throttle patch set, which unconditionally invokes event_stop() when the throttle is triggered. The cpu-clock and task-clock are two special SW events which rely on the hrtimer. The throttle is invoked in the hrtimer handler, and event_stop()->hrtimer_cancel() waits for the handler to finish, which is a deadlock. Instead of invoking stop(), HRTIMER_NORESTART should be used to stop the timer.

There may be two ways to fix it:
 - Introduce a PMU flag to track the case. Avoid the event_stop in perf_event_throttle() if the flag is detected. It has been implemented in https://lore.kernel.org/lkml/20250528175832.2999139-1-kan.liang@linux.intel.com/ The new flag was thought to be an overkill for the issue.
 - Add a check in the event_stop. Return immediately if the throttle is invoked in the hrtimer handler. Rely on the existing HRTIMER_NORESTART method to stop the timer.

The latter is implemented here (see the sketch after this entry).

Move event->hw.interrupts = MAX_INTERRUPTS before the stop(). That makes the order the same as in perf_event_unthrottle(). Apart from this patch, no one checks hw.interrupts in the stop(), so there is no impact from the order change.

When stopping in the throttle, the event should not be updated: stop(event, 0). But cpu_clock_event_stop() doesn't handle the flag. Logically that is wrong, but it didn't cause any problems with the old code, because stop() was not invoked when handling the throttle. Check the flag before updating the event. Fixes: 9734e25fbf5a ("perf: Fix the throttle logic for a group") Closes: https://lore.kernel.org/lkml/20250527161656.GJ2566836@e132581.arm.com/ Closes: https://lore.kernel.org/lkml/djxlh5fx326gcenwrr52ry3pk4wxmugu4jccdjysza7tlc5fef@ktp4rffawgcw/ Closes: https://lore.kernel.org/lkml/8e8f51d8-af64-4d9e-934b-c0ee9f131293@linux.ibm.com/ Closes: https://lore.kernel.org/lkml/4ce106d0-950c-aadc-0b6a-f0215cd39913@maine.edu/ Reported-by: Leo Yan <leo.yan@arm.com> Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com> Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20250606192546.915765-1-kan.liang@linux.intel.com
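A hedged sketch of the added check; the throttle-detection condition is an assumption based on the description above (hw.interrupts is set to MAX_INTERRUPTS before stop() by this very patch):

    static void cpu_clock_event_stop(struct perf_event *event, int flags)
    {
        /* Throttling from within the hrtimer handler: don't
         * hrtimer_cancel() ourselves (deadlock); returning
         * HRTIMER_NORESTART from the handler stops the timer.
         */
        if (event->hw.interrupts == MAX_INTERRUPTS)
            return;

        /* ... cancel the hrtimer and update the event as before ... */
    }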
2025-06-05  perf/core: Fix WARN in perf_cgroup_switch()  (Luo Gengkun)
There may be concurrency between perf_cgroup_switch and perf_cgroup_event_disable. Consider the following scenario: after a new perf cgroup event is created on CPU0, the new event may not trigger a reprogramming, causing ctx->is_active to be 0. In this case, when CPU1 disables this perf event, it executes __perf_remove_from_context->list_del_event->perf_cgroup_event_disable on CPU1, which causes a race with perf_cgroup_switch running on CPU0.

The following describes the details of this concurrency scenario:

  CPU0                                        CPU1

  perf_cgroup_switch:
     ...
     # cpuctx->cgrp is not NULL here
     if (READ_ONCE(cpuctx->cgrp) == NULL)
          return;

                                              perf_remove_from_context:
                                                 ...
                                                 raw_spin_lock_irq(&ctx->lock);
                                                 ...
                                                 # ctx->is_active == 0 because reprogramming
                                                 # was not triggered, so CPU1 can do
                                                 # __perf_remove_from_context for CPU0
                                                 __perf_remove_from_context:
                                                      perf_cgroup_event_disable:
                                                           ...
                                                           if (--ctx->nr_cgroups)
                                                           ...

     # this warning will happen because CPU1
     # changed ctx.nr_cgroups to 0.
     WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);

[peterz: use guard instead of goto unlock] Fixes: db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250604033924.3914647-3-luogengkun@huaweicloud.com
2025-06-05  perf: Fix dangling cgroup pointer in cpuctx  (Yeoreum Yun)
Commit a3c3c6667 ("perf/core: Fix child_total_time_enabled accounting bug at task exit") moves the event->state update to before list_del_event(). This makes the event->state test in list_del_event() always false, so perf_cgroup_event_disable() is never called. As a result, cpuctx->cgrp won't be cleared properly, causing havoc. Fixes: a3c3c6667 ("perf/core: Fix child_total_time_enabled accounting bug at task exit") Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: David Wang <00107082@163.com> Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/
2025-06-05  perf: Fix cgroup state vs ERROR  (Peter Zijlstra)
While chasing down a missing perf_cgroup_event_disable() elsewhere, Leo Yan found that both perf_put_aux_event() and perf_remove_sibling_event() were also missing one. Specifically, the rule is that events that switch to OFF,ERROR need to call perf_cgroup_event_disable(). Unify the disable paths to ensure this. Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data") Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability") Reported-by: Leo Yan <leo.yan@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250605123343.GD35970@noisy.programming.kicks-ass.net
2025-06-05  perf: Fix sample vs do_exit()  (Peter Zijlstra)
Baisheng Gao reported an ARM64 crash, which Mark decoded as being a synchronous external abort -- most likely due to trying to access MMIO in bad ways. The crash further shows perf trying to do a user stack sample while in exit_mmap()'s tlb_finish_mmu() -- i.e. while tearing down the address space it is trying to access.

It turns out that we stop perf after we tear down the userspace mm; a recipe for disaster, since perf likes to access userspace for various reasons. Flip this order by moving up where we stop perf in do_exit(). Additionally, harden PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER to abort when the current task does not have an mm (exit_mm() makes sure to set current->mm = NULL before commencing with the actual teardown), such that CPU wide events don't trip on this same problem. Fixes: c5ebcedb566e ("perf: Add ability to attach user stack dump to sample") Reported-by: Baisheng Gao <baisheng.gao@unisoc.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250605110815.GQ39944@noisy.programming.kicks-ass.net
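A hedged sketch of the hardening described above, as it would appear at the top of the user stack/callchain sampling paths:

    /* exit_mm() sets current->mm = NULL before the actual teardown;
     * a task past that point has no user space left to sample.
     */
    if (!current->mm)
        return;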
2025-05-31  Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)
Pull MM updates from Andrew Morton:

 - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architectures must implement to provide this.

 - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work.

 - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size.

 - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invocation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic.

 - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleanups and a small efficiency improvement to this part of our swap handling code.

 - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscall arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value". This series should have been incorporated into mm.git's "non-MM" branch, but I goofed.

 - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions.

 - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error.

 - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities.

 - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM.

 - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecessary work, but this is fixed up where it is anticipated to occur.

 - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures.

 - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years.

 - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation.

 - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramatically reduced.

 - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code.

 - "Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit.

 - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering.

 - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection.

 - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible, because presently reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to be violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently.

 - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in the huge page splitting and migrating code.

 - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization.

 - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim and MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios.

 - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved.

 - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range, by skipping ranges of invalid pfns.

 - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned to a single NUMA node. Dramatic performance benefits were seen in some real world cases.

 - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS.

 - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c.

 - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function.

 - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

 - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest.

 - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging.

 - "memcg: decouple memcg and objcg stocks" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement.

 - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents.

 - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement.

 - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
  mm: pcp: increase pcp->free_count threshold to trigger free_high
  mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
  mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: pass folio instead of page to unmap_ref_private()
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: memcg_rstat_updated re-entrant safe against irqs
  mm: khugepaged: decouple SHMEM and file folios' collapse
  selftests/eventfd: correct test name and improve messages
  alloc_tag: check mem_profiling_support in alloc_tag_init
  Docs/damon: update titles and brief introductions to explain DAMOS
  selftests/damon/_damon_sysfs: read tried regions directories in order
  mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
  mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
  mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
  ...
2025-05-21  perf: Only dump the throttle log for the leader  (Kan Liang)
The PERF_RECORD_THROTTLE records are dumped for all throttled events. That's not necessary for group events, which are throttled altogether. Optimize it by only dumping the throttle log for the leader. The sample right after the THROTTLE record must be generated by the actual target event, which is good enough for the perf tool to locate the actual target event. Suggested-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250520181644.2673067-3-kan.liang@linux.intel.com
2025-05-21  perf: Fix the throttle logic for a group  (Kan Liang)
The current throttle logic doesn't work well with a group, e.g., the following sampling-read case:

  $ perf record -e "{cycles,cycles}:S" ...
  $ perf report -D | grep THROTTLE | tail -2
            THROTTLE events:        426  ( 9.0%)
          UNTHROTTLE events:        425  ( 9.0%)
  $ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
   0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
  ... sample_read:
  .... group nr 2
  ..... id 0000000000000327, value 000000000cbb993a, lost 0
  ..... id 0000000000000328, value 00000002211c26df, lost 0

The second cycles event has a much larger value than the first cycles event in the same group.

The current throttle logic in the generic code only logs the THROTTLE event; it relies on the specific driver implementation to disable events. For all ARCHs the implementation is similar: only the event is disabled, rather than the group. The logic to disable the group should be generic for all ARCHs, so add it in the generic code. The following patch will remove the buggy driver-specific implementation.

The throttle only happens when an event overflows. Stop the entire group when any event in the group triggers the throttle. MAX_INTERRUPTS is set on all throttled events (see the sketch after this entry).

The unthrottle can happen in 3 places:
 - event/group sched. All events in the group are scheduled one by one; all of them will be unthrottled eventually. Nothing needs to be changed.
 - perf_adjust_freq_unthr_events for each tick. Needs to restart the group altogether.
 - __perf_event_period(). The whole group needs to be restarted altogether as well.

With the fix:

  $ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
   0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
  ... sample_read:
  .... group nr 2
  ..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
  ..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0

Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250520181644.2673067-2-kan.liang@linux.intel.com
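A hedged sketch of the generic group throttling; perf_event_throttle_group() is the helper name used by a later fix in this log, while the body is an assumption:

    static void perf_event_throttle_group(struct perf_event *event)
    {
        struct perf_event *leader = event->group_leader, *sibling;

        /* Throttle the whole group, not just the overflowing member;
         * each throttled event gets hw.interrupts = MAX_INTERRUPTS.
         */
        perf_event_throttle(leader);
        for_each_sibling_event(sibling, leader)
            perf_event_throttle(sibling);
    }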