From f0cfe17bcc1dd2f0872966b554a148e888833ee9 Mon Sep 17 00:00:00 2001 From: Daniel Bristot de Oliveira Date: Wed, 9 Mar 2022 14:13:02 +0100 Subject: tracing/osnoise: Do not unregister events twice Nicolas reported that using: # trace-cmd record -e all -M 10 -p osnoise --poll Resulted in the following kernel warning: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1217 at kernel/tracepoint.c:404 tracepoint_probe_unregister+0x280/0x370 [...] CPU: 0 PID: 1217 Comm: trace-cmd Not tainted 5.17.0-rc6-next-20220307-nico+ #19 RIP: 0010:tracepoint_probe_unregister+0x280/0x370 [...] CR2: 00007ff919b29497 CR3: 0000000109da4005 CR4: 0000000000170ef0 Call Trace: osnoise_workload_stop+0x36/0x90 tracing_set_tracer+0x108/0x260 tracing_set_trace_write+0x94/0xd0 ? __check_object_size.part.0+0x10a/0x150 ? selinux_file_permission+0x104/0x150 vfs_write+0xb5/0x290 ksys_write+0x5f/0xe0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7ff919a18127 [...] ---[ end trace 0000000000000000 ]--- The warning complains about an attempt to unregister an unregistered tracepoint. This happens on trace-cmd because it first stops tracing, and then switches the tracer to nop. Which is equivalent to: # cd /sys/kernel/tracing/ # echo osnoise > current_tracer # echo 0 > tracing_on # echo nop > current_tracer The osnoise tracer stops the workload when no trace instance is actually collecting data. This can be caused both by disabling tracing or disabling the tracer itself. To avoid unregistering events twice, use the existing trace_osnoise_callback_enabled variable to check if the events (and the workload) are actually active before trying to deactivate them. Link: https://lore.kernel.org/all/c898d1911f7f9303b7e14726e7cc9678fbfb4a0e.camel@redhat.com/ Link: https://lkml.kernel.org/r/938765e17d5a781c2df429a98f0b2e7cc317b022.1646823913.git.bristot@kernel.org Cc: stable@vger.kernel.org Cc: Marcelo Tosatti Fixes: 2fac8d6486d5 ("tracing/osnoise: Allow multiple instances of the same tracer") Reported-by: Nicolas Saenz Julienne Signed-off-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_osnoise.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index cfddb30e65ab..2aa3efdca755 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -2200,6 +2200,17 @@ static void osnoise_workload_stop(void) if (osnoise_has_registered_instances()) return; + /* + * If callbacks were already disabled in a previous stop + * call, there is no need to disable then again. + * + * For instance, this happens when tracing is stopped via: + * echo 0 > tracing_on + * echo nop > current_tracer. + */ + if (!trace_osnoise_callback_enabled) + return; + trace_osnoise_callback_enabled = false; /* * Make sure that ftrace_nmi_enter/exit() see -- cgit v1.2.3 From caf4c86bf136845982c5103b2661751b40c474c0 Mon Sep 17 00:00:00 2001 From: Nicolas Saenz Julienne Date: Mon, 7 Mar 2022 19:07:40 +0100 Subject: tracing/osnoise: Force quiescent states while tracing At the moment running osnoise on a nohz_full CPU or uncontested FIFO priority and a PREEMPT_RCU kernel might have the side effect of extending grace periods too much. This will entice RCU to force a context switch on the wayward CPU to end the grace period, all while introducing unwarranted noise into the tracer. This behaviour is unavoidable as overly extending grace periods might exhaust the system's memory. This same exact problem is what extended quiescent states (EQS) were created for, conversely, rcu_momentary_dyntick_idle() emulates them by performing a zero duration EQS. So let's make use of it. In the common case rcu_momentary_dyntick_idle() is fairly inexpensive: atomically incrementing a local per-CPU counter and doing a store. So it shouldn't affect osnoise's measurements (which has a 1us granularity), so we'll call it unanimously. The uncommon case involve calling rcu_momentary_dyntick_idle() after having the osnoise process: - Receive an expedited quiescent state IPI with preemption disabled or during an RCU critical section. (activates rdp->cpu_no_qs.b.exp code-path). - Being preempted within in an RCU critical section and having the subsequent outermost rcu_read_unlock() called with interrupts disabled. (t->rcu_read_unlock_special.b.blocked code-path). Neither of those are possible at the moment, and are unlikely to be in the future given the osnoise's loop design. On top of this, the noise generated by the situations described above is unavoidable, and if not exposed by rcu_momentary_dyntick_idle() will be eventually seen in subsequent rcu_read_unlock() calls or schedule operations. Link: https://lkml.kernel.org/r/20220307180740.577607-1-nsaenzju@redhat.com Cc: stable@vger.kernel.org Fixes: bce29ac9ce0b ("trace: Add osnoise tracer") Signed-off-by: Nicolas Saenz Julienne Acked-by: Paul E. McKenney Acked-by: Daniel Bristot de Oliveira Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_osnoise.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index 2aa3efdca755..5e3c62a08fc0 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -1386,6 +1386,26 @@ static int run_osnoise(void) osnoise_stop_tracing(); } + /* + * In some cases, notably when running on a nohz_full CPU with + * a stopped tick PREEMPT_RCU has no way to account for QSs. + * This will eventually cause unwarranted noise as PREEMPT_RCU + * will force preemption as the means of ending the current + * grace period. We avoid this problem by calling + * rcu_momentary_dyntick_idle(), which performs a zero duration + * EQS allowing PREEMPT_RCU to end the current grace period. + * This call shouldn't be wrapped inside an RCU critical + * section. + * + * Note that in non PREEMPT_RCU kernels QSs are handled through + * cond_resched() + */ + if (IS_ENABLED(CONFIG_PREEMPT_RCU)) { + local_irq_disable(); + rcu_momentary_dyntick_idle(); + local_irq_enable(); + } + /* * For the non-preemptive kernel config: let threads runs, if * they so wish. -- cgit v1.2.3 From 78cbc6513217b00be6a9904415ef7ff3619eb035 Mon Sep 17 00:00:00 2001 From: Jiapeng Chong Date: Mon, 7 Mar 2022 08:43:03 +0800 Subject: ftrace: Fix some W=1 warnings in kernel doc comments Clean up the following clang-w1 warning: kernel/trace/ftrace.c:7827: warning: Function parameter or member 'ops' not described in 'unregister_ftrace_function'. kernel/trace/ftrace.c:7805: warning: Function parameter or member 'ops' not described in 'register_ftrace_function'. Link: https://lkml.kernel.org/r/20220307004303.26399-1-jiapeng.chong@linux.alibaba.com Reported-by: Abaci Robot Signed-off-by: Jiapeng Chong Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ftrace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index a4b462b6f944..6105b7036482 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -7790,7 +7790,7 @@ int ftrace_is_dead(void) /** * register_ftrace_function - register a function for profiling - * @ops - ops structure that holds the function for profiling. + * @ops: ops structure that holds the function for profiling. * * Register a function to be called by all functions in the * kernel. @@ -7817,7 +7817,7 @@ EXPORT_SYMBOL_GPL(register_ftrace_function); /** * unregister_ftrace_function - unregister a function for profiling. - * @ops - ops structure that holds the function to unregister + * @ops: ops structure that holds the function to unregister * * Unregister a function that was added to be called by ftrace profiling. */ -- cgit v1.2.3 From c993ee0f9f81caf5767a50d1faeba39a0dc82af2 Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 11 Mar 2022 13:23:31 +0000 Subject: watch_queue: Fix filter limit check In watch_queue_set_filter(), there are a couple of places where we check that the filter type value does not exceed what the type_filter bitmap can hold. One place calculates the number of bits by: if (tf[i].type >= sizeof(wfilter->type_filter) * 8) which is fine, but the second does: if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG) which is not. This can lead to a couple of out-of-bounds writes due to a too-large type: (1) __set_bit() on wfilter->type_filter (2) Writing more elements in wfilter->filters[] than we allocated. Fix this by just using the proper WATCH_TYPE__NR instead, which is the number of types we actually know about. The bug may cause an oops looking something like: BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740 Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611 ... Call Trace: dump_stack_lvl+0x45/0x59 print_address_description.constprop.0+0x1f/0x150 ... kasan_report.cold+0x7f/0x11b ... watch_queue_set_filter+0x659/0x740 ... __x64_sys_ioctl+0x127/0x190 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Allocated by task 611: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x81/0xa0 watch_queue_set_filter+0x23a/0x740 __x64_sys_ioctl+0x127/0x190 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88800d2c66a0 which belongs to the cache kmalloc-32 of size 32 The buggy address is located 28 bytes inside of 32-byte region [ffff88800d2c66a0, ffff88800d2c66c0) Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/watch_queue.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c index 9c9eb20dd2c5..427b0318e303 100644 --- a/kernel/watch_queue.c +++ b/kernel/watch_queue.c @@ -320,7 +320,7 @@ long watch_queue_set_filter(struct pipe_inode_info *pipe, tf[i].info_mask & WATCH_INFO_LENGTH) goto err_filter; /* Ignore any unknown types */ - if (tf[i].type >= sizeof(wfilter->type_filter) * 8) + if (tf[i].type >= WATCH_TYPE__NR) continue; nr_filter++; } @@ -336,7 +336,7 @@ long watch_queue_set_filter(struct pipe_inode_info *pipe, q = wfilter->filters; for (i = 0; i < filter.nr_filters; i++) { - if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG) + if (tf[i].type >= WATCH_TYPE__NR) continue; q->type = tf[i].type; -- cgit v1.2.3 From c1853fbadcba1497f4907971e7107888e0714c81 Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 11 Mar 2022 13:23:46 +0000 Subject: watch_queue: Fix to release page in ->release() When a pipe ring descriptor points to a notification message, the refcount on the backing page is incremented by the generic get function, but the release function, which marks the bitmap, doesn't drop the page ref. Fix this by calling generic_pipe_buf_release() at the end of watch_queue_pipe_buf_release(). Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/watch_queue.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c index 427b0318e303..dfb3a7e28280 100644 --- a/kernel/watch_queue.c +++ b/kernel/watch_queue.c @@ -54,6 +54,7 @@ static void watch_queue_pipe_buf_release(struct pipe_inode_info *pipe, bit += page->index; set_bit(bit, wqueue->notes_bitmap); + generic_pipe_buf_release(pipe, buf); } // No try_steal function => no stealing -- cgit v1.2.3 From 96a4d8912b28451cd62825fd7caa0e66e091d938 Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 11 Mar 2022 13:24:08 +0000 Subject: watch_queue: Fix to always request a pow-of-2 pipe ring size The pipe ring size must always be a power of 2 as the head and tail pointers are masked off by AND'ing with the size of the ring - 1. watch_queue_set_size(), however, lets you specify any number of notes between 1 and 511. This number is passed through to pipe_resize_ring() without checking/forcing its alignment. Fix this by rounding the number of slots required up to the nearest power of two. The request is meant to guarantee that at least that many notifications can be generated before the queue is full, so rounding down isn't an option, but, alternatively, it may be better to give an error if we aren't allowed to allocate that much ring space. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/watch_queue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c index dfb3a7e28280..4bcd400984a7 100644 --- a/kernel/watch_queue.c +++ b/kernel/watch_queue.c @@ -244,7 +244,7 @@ long watch_queue_set_size(struct pipe_inode_info *pipe, unsigned int nr_notes) goto error; } - ret = pipe_resize_ring(pipe, nr_notes); + ret = pipe_resize_ring(pipe, roundup_pow_of_two(nr_notes)); if (ret < 0) goto error; -- cgit v1.2.3 From a66bd7575b5f449ee0ba20cfd21c3bc5b04ef361 Mon Sep 17 00:00:00 2001 From: Christophe JAILLET Date: Fri, 11 Mar 2022 13:24:15 +0000 Subject: watch_queue: Use the bitmap API when applicable Use bitmap_alloc() to simplify code, improve the semantic and reduce some open-coded arithmetic in allocator arguments. Also change a memset(0xff) into an equivalent bitmap_fill() to keep consistency. Signed-off-by: Christophe JAILLET Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/watch_queue.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c index 4bcd400984a7..5b516eb2c7cc 100644 --- a/kernel/watch_queue.c +++ b/kernel/watch_queue.c @@ -220,7 +220,6 @@ long watch_queue_set_size(struct pipe_inode_info *pipe, unsigned int nr_notes) struct page **pages; unsigned long *bitmap; unsigned long user_bufs; - unsigned int bmsize; int ret, i, nr_pages; if (!wqueue) @@ -259,13 +258,11 @@ long watch_queue_set_size(struct pipe_inode_info *pipe, unsigned int nr_notes) pages[i]->index = i * WATCH_QUEUE_NOTES_PER_PAGE; } - bmsize = (nr_notes + BITS_PER_LONG - 1) / BITS_PER_LONG; - bmsize *= sizeof(unsigned long); - bitmap = kmalloc(bmsize, GFP_KERNEL); + bitmap = bitmap_alloc(nr_notes, GFP_KERNEL); if (!bitmap) goto error_p; - memset(bitmap, 0xff, bmsize); + bitmap_fill(bitmap, nr_notes); wqueue->notes = pages; wqueue->notes_bitmap = bitmap; wqueue->nr_pages = nr_pages; -- cgit v1.2.3 From 3b4c0371928c17af03e8397ac842346624017ce6 Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 11 Mar 2022 13:24:22 +0000 Subject: watch_queue: Fix the alloc bitmap size to reflect notes allocated Currently, watch_queue_set_size() sets the number of notes available in wqueue->nr_notes according to the number of notes allocated, but sets the size of the bitmap to the unrounded number of notes originally asked for. Fix this by setting the bitmap size to the number of notes we're actually going to make available (ie. the number allocated). Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/watch_queue.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c index 5b516eb2c7cc..9c476d2cbac0 100644 --- a/kernel/watch_queue.c +++ b/kernel/watch_queue.c @@ -243,6 +243,7 @@ long watch_queue_set_size(struct pipe_inode_info *pipe, unsigned int nr_notes) goto error; } + nr_notes = nr_pages * WATCH_QUEUE_NOTES_PER_PAGE; ret = pipe_resize_ring(pipe, roundup_pow_of_two(nr_notes)); if (ret < 0) goto error; @@ -266,7 +267,7 @@ long watch_queue_set_size(struct pipe_inode_info *pipe, unsigned int nr_notes) wqueue->notes = pages; wqueue->notes_bitmap = bitmap; wqueue->nr_pages = nr_pages; - wqueue->nr_notes = nr_pages * WATCH_QUEUE_NOTES_PER_PAGE; + wqueue->nr_notes = nr_notes; return 0; error_p: -- cgit v1.2.3 From 7ea1a0124b6da246b5bc8c66cddaafd36acf3ecb Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 11 Mar 2022 13:24:29 +0000 Subject: watch_queue: Free the alloc bitmap when the watch_queue is torn down Free the watch_queue note allocation bitmap when the watch_queue is destroyed. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/watch_queue.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c index 9c476d2cbac0..c12267ccc70e 100644 --- a/kernel/watch_queue.c +++ b/kernel/watch_queue.c @@ -370,6 +370,7 @@ static void __put_watch_queue(struct kref *kref) for (i = 0; i < wqueue->nr_pages; i++) __free_page(wqueue->notes[i]); + bitmap_free(wqueue->notes_bitmap); wfilter = rcu_access_pointer(wqueue->filter); if (wfilter) -- cgit v1.2.3 From 2ed147f015af2b48f41c6f0b6746aa9ea85c19f3 Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 11 Mar 2022 13:24:36 +0000 Subject: watch_queue: Fix lack of barrier/sync/lock between post and read There's nothing to synchronise post_one_notification() versus pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the reader only takes pipe->mutex which cannot bar notification posting as that may need to be made from contexts that cannot sleep. Fix this by setting pipe->head with a barrier in post_one_notification() and reading pipe->head with a barrier in pipe_read(). If that's not sufficient, the rd_wait.lock will need to be taken, possibly in a ->confirm() op so that it only applies to notifications. The lock would, however, have to be dropped before copy_page_to_iter() is invoked. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/watch_queue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c index c12267ccc70e..37bcd900fd77 100644 --- a/kernel/watch_queue.c +++ b/kernel/watch_queue.c @@ -113,7 +113,7 @@ static bool post_one_notification(struct watch_queue *wqueue, buf->offset = offset; buf->len = len; buf->flags = PIPE_BUF_FLAG_WHOLE; - pipe->head = head + 1; + smp_store_release(&pipe->head, head + 1); /* vs pipe_read() */ if (!test_and_clear_bit(note, wqueue->notes_bitmap)) { spin_unlock_irq(&pipe->rd_wait.lock); -- cgit v1.2.3 From 4edc0760412b0c4ecefc7e02cb855b310b122825 Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 11 Mar 2022 13:24:47 +0000 Subject: watch_queue: Make comment about setting ->defunct more accurate watch_queue_clear() has a comment stating that setting ->defunct to true preventing new additions as well as preventing notifications. Whilst the latter is true, the first bit is superfluous since at the time this function is called, the pipe cannot be accessed to add new event sources. Remove the "new additions" bit from the comment. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn Signed-off-by: David Howells Signed-off-by: Linus Torvalds --- kernel/watch_queue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c index 37bcd900fd77..00703444a219 100644 --- a/kernel/watch_queue.c +++ b/kernel/watch_queue.c @@ -566,7 +566,7 @@ void watch_queue_clear(struct watch_queue *wqueue) rcu_read_lock(); spin_lock_bh(&wqueue->lock); - /* Prevent new additions and prevent notifications from happening */ + /* Prevent new notifications from being stored. */ wqueue->defunct = true; while (!hlist_empty(&wqueue->watches)) { -- cgit v1.2.3 From 8208257d2d04d4953a8cb9f1426d245a95c4fea2 Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Wed, 16 Mar 2022 16:15:06 -0700 Subject: configs/debug: restore DEBUG_INFO=y for overriding Previously, I failed to realize that Kees' patch [1] has not been merged into the mainline yet, and dropped DEBUG_INFO=y too eagerly from the mainline. As the results, "make debug.config" won't be able to flip DEBUG_INFO=n from the existing .config. This should close the gaps of a few weeks before Kees' patch is there, and work regardless of their merging status anyway. Link: https://lore.kernel.org/all/20220125075126.891825-1-keescook@chromium.org/ [1] Link: https://lkml.kernel.org/r/20220308153524.8618-1-quic_qiancai@quicinc.com Signed-off-by: Qian Cai Reported-by: Daniel Thompson Reviewed-by: Daniel Thompson Cc: Kees Cook Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/configs/debug.config | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/configs/debug.config b/kernel/configs/debug.config index 07df6d93c4df..e8db8d938661 100644 --- a/kernel/configs/debug.config +++ b/kernel/configs/debug.config @@ -16,6 +16,7 @@ CONFIG_SYMBOLIC_ERRNAME=y # # Compile-time checks and compiler options # +CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y CONFIG_DEBUG_SECTION_MISMATCH=y CONFIG_FRAME_WARN=2048 -- cgit v1.2.3