From f8858d96061f5942216c6abb0194c3ea7b78e1e8 Mon Sep 17 00:00:00 2001 From: Shrikanth Hegde Date: Sat, 2 Sep 2023 13:42:04 +0530 Subject: sched/fair: Optimize should_we_balance() for large SMT systems MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit should_we_balance() is called in load_balance() to find out if the CPU that is trying to do the load balance is the right one or not. With commit: b1bfeab9b002("sched/fair: Consider the idle state of the whole core for load balance") the code tries to find an idle core to do the load balancing and falls back on an idle sibling CPU if there is no idle core. However, on larger SMT systems, it could be needlessly iterating to find a idle by scanning all the CPUs in an non-idle core. If the core is not idle, and first SMT sibling which is idle has been found, then its not needed to check other SMT siblings for idleness Lets say in SMT4, Core0 has 0,2,4,6 and CPU0 is BUSY and rest are IDLE. balancing domain is MC/DIE. CPU2 will be set as the first idle_smt and same process would be repeated for CPU4 and CPU6 but this is unnecessary. Since calling is_core_idle loops through all CPU's in the SMT mask, effect is multiplied by weight of smt_mask. For example,when say 1 CPU is busy, we would skip loop for 2 CPU's and skip iterating over 8CPU's. That effect would be more in DIE/NUMA domain where there are more cores. Testing and performance evaluation ================================== The test has been done on this system which has 12 cores, i.e 24 small cores with SMT=4: lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 96 On-line CPU(s) list: 0-95 Model name: POWER10 (architected), altivec supported Thread(s) per core: 8 Used funclatency bcc tool to evaluate the time taken by should_we_balance(). For base tip/sched/core the time taken is collected by making the should_we_balance() noinline. time is in nanoseconds. The values are collected by running the funclatency tracer for 60 seconds. values are average of 3 such runs. This represents the expected reduced time with patch. tip/sched/core was at commit: 2f88c8e802c8 ("sched/eevdf/doc: Modify the documented knob to base_slice_ns as well") Results: ------------------------------------------------------------------------------ workload tip/sched/core with_patch(%gain) ------------------------------------------------------------------------------ idle system 809.3 695.0(16.45) stress ng – 12 threads -l 100 1013.5 893.1(13.49) stress ng – 24 threads -l 100 1073.5 980.0(9.54) stress ng – 48 threads -l 100 683.0 641.0(6.55) stress ng – 96 threads -l 100 2421.0 2300(5.26) stress ng – 96 threads -l 15 375.5 377.5(-0.53) stress ng – 96 threads -l 25 635.5 637.5(-0.31) stress ng – 96 threads -l 35 934.0 891.0(4.83) Ran schbench(old), hackbench and stress_ng to evaluate the workload performance between tip/sched/core and with patch. No modification to tip/sched/core TL;DR: Good improvement is seen with schbench. when hackbench and stress_ng runs for longer good improvement is seen. ------------------------------------------------------------------------------ schbench(old) tip +patch(%gain) 10 iterations sched/core ------------------------------------------------------------------------------ 1 Threads 50.0th: 8.00 9.00(-12.50) 75.0th: 9.60 9.00(6.25) 90.0th: 11.80 10.20(13.56) 95.0th: 12.60 10.40(17.46) 99.0th: 13.60 11.90(12.50) 99.5th: 14.10 12.60(10.64) 99.9th: 15.90 14.60(8.18) 2 Threads 50.0th: 9.90 9.20(7.07) 75.0th: 12.60 10.10(19.84) 90.0th: 15.50 12.00(22.58) 95.0th: 17.70 14.00(20.90) 99.0th: 21.20 16.90(20.28) 99.5th: 22.60 17.50(22.57) 99.9th: 30.40 19.40(36.18) 4 Threads 50.0th: 12.50 10.60(15.20) 75.0th: 15.30 12.00(21.57) 90.0th: 18.60 14.10(24.19) 95.0th: 21.30 16.20(23.94) 99.0th: 26.00 20.70(20.38) 99.5th: 27.60 22.50(18.48) 99.9th: 33.90 31.40(7.37) 8 Threads 50.0th: 16.30 14.30(12.27) 75.0th: 20.20 17.40(13.86) 90.0th: 24.50 21.90(10.61) 95.0th: 27.30 24.70(9.52) 99.0th: 35.00 31.20(10.86) 99.5th: 46.40 33.30(28.23) 99.9th: 89.30 57.50(35.61) 16 Threads 50.0th: 22.70 20.70(8.81) 75.0th: 30.10 27.40(8.97) 90.0th: 36.00 32.80(8.89) 95.0th: 39.60 36.40(8.08) 99.0th: 49.20 44.10(10.37) 99.5th: 64.90 50.50(22.19) 99.9th: 143.50 100.60(29.90) 32 Threads 50.0th: 34.60 35.50(-2.60) 75.0th: 48.20 50.50(-4.77) 90.0th: 59.20 62.40(-5.41) 95.0th: 65.20 69.00(-5.83) 99.0th: 80.40 83.80(-4.23) 99.5th: 102.10 98.90(3.13) 99.9th: 727.10 506.80(30.30) schbench does improve in general. There is some run to run variation with schbench. Did a validation run to confirm that trend is similar. ------------------------------------------------------------------------------ hackbench tip +patch(%gain) 20 iterations, 50000 loops sched/core ------------------------------------------------------------------------------ Process 10 groups : 11.74 11.70(0.34) Process 20 groups : 22.73 22.69(0.18) Process 30 groups : 33.39 33.40(-0.03) Process 40 groups : 43.73 43.61(0.27) Process 50 groups : 53.82 54.35(-0.98) Process 60 groups : 64.16 65.29(-1.76) thread 10 Time : 12.81 12.79(0.16) thread 20 Time : 24.63 24.47(0.65) Process(Pipe) 10 Time : 6.40 6.34(0.94) Process(Pipe) 20 Time : 10.62 10.63(-0.09) Process(Pipe) 30 Time : 15.09 14.84(1.66) Process(Pipe) 40 Time : 19.42 19.01(2.11) Process(Pipe) 50 Time : 24.04 23.34(2.91) Process(Pipe) 60 Time : 28.94 27.51(4.94) thread(Pipe) 10 Time : 6.96 6.87(1.29) thread(Pipe) 20 Time : 11.74 11.73(0.09) hackbench shows slight improvement with pipe. Slight degradation in process. ------------------------------------------------------------------------------ stress_ng tip +patch(%gain) 10 iterations 100000 cpu_ops sched/core ------------------------------------------------------------------------------ --cpu=96 -util=100 Time taken : 5.30, 5.01(5.47) --cpu=48 -util=100 Time taken : 7.94, 6.73(15.24) --cpu=24 -util=100 Time taken : 11.67, 8.75(25.02) --cpu=12 -util=100 Time taken : 15.71, 15.02(4.39) --cpu=96 -util=10 Time taken : 22.71, 22.19(2.29) --cpu=96 -util=20 Time taken : 12.14, 12.37(-1.89) --cpu=96 -util=30 Time taken : 8.76, 8.86(-1.14) --cpu=96 -util=40 Time taken : 7.13, 7.14(-0.14) --cpu=96 -util=50 Time taken : 6.10, 6.13(-0.49) --cpu=96 -util=60 Time taken : 5.42, 5.41(0.18) --cpu=96 -util=70 Time taken : 4.94, 4.94(0.00) --cpu=96 -util=80 Time taken : 4.56, 4.53(0.66) --cpu=96 -util=90 Time taken : 4.27, 4.26(0.23) Good improvement seen with 24 CPUs. In this case only one CPU is busy, and no core is idle. Decent improvement with 100% utilization case. no difference in other utilization. Fixes: b1bfeab9b002 ("sched/fair: Consider the idle state of the whole core for load balance") Signed-off-by: Shrikanth Hegde Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20230902081204.232218-1-sshegde@linux.vnet.ibm.com --- kernel/sched/fair.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8dbff6e7ad4f..33a2b6bba676 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6619,6 +6619,7 @@ dequeue_throttle: /* Working cpumask for: load_balance, load_balance_newidle. */ static DEFINE_PER_CPU(cpumask_var_t, load_balance_mask); static DEFINE_PER_CPU(cpumask_var_t, select_rq_mask); +static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask); #ifdef CONFIG_NO_HZ_COMMON @@ -10917,6 +10918,7 @@ static int active_load_balance_cpu_stop(void *data); static int should_we_balance(struct lb_env *env) { + struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(should_we_balance_tmpmask); struct sched_group *sg = env->sd->groups; int cpu, idle_smt = -1; @@ -10940,8 +10942,9 @@ static int should_we_balance(struct lb_env *env) return 1; } + cpumask_copy(swb_cpus, group_balance_mask(sg)); /* Try to find first idle CPU */ - for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) { + for_each_cpu_and(cpu, swb_cpus, env->cpus) { if (!idle_cpu(cpu)) continue; @@ -10953,6 +10956,14 @@ static int should_we_balance(struct lb_env *env) if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) { if (idle_smt == -1) idle_smt = cpu; + /* + * If the core is not idle, and first SMT sibling which is + * idle has been found, then its not needed to check other + * SMT siblings for idleness: + */ +#ifdef CONFIG_SCHED_SMT + cpumask_andnot(swb_cpus, swb_cpus, cpu_smt_mask(cpu)); +#endif continue; } @@ -12918,6 +12929,8 @@ __init void init_sched_fair_class(void) for_each_possible_cpu(i) { zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i)); zalloc_cpumask_var_node(&per_cpu(select_rq_mask, i), GFP_KERNEL, cpu_to_node(i)); + zalloc_cpumask_var_node(&per_cpu(should_we_balance_tmpmask, i), + GFP_KERNEL, cpu_to_node(i)); #ifdef CONFIG_CFS_BANDWIDTH INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i)); -- cgit v1.2.3 From f5ca233e2e66dc1c249bf07eefa37e34a6c9346a Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Wed, 6 Sep 2023 22:47:12 -0400 Subject: tracing: Increase trace array ref count on enable and filter files When the trace event enable and filter files are opened, increment the trace array ref counter, otherwise they can be accessed when the trace array is being deleted. The ref counter keeps the trace array from being deleted while those files are opened. Link: https://lkml.kernel.org/r/20230907024803.456187066@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Andrew Morton Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing Tested-by: Naresh Kamboju Reported-by: Zheng Yejian Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 27 +++++++++++++++++++++++++++ kernel/trace/trace.h | 2 ++ kernel/trace/trace_events.c | 6 ++++-- 3 files changed, 33 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 35783a7baf15..0827037ee3b8 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4973,6 +4973,33 @@ int tracing_open_generic_tr(struct inode *inode, struct file *filp) return 0; } +/* + * The private pointer of the inode is the trace_event_file. + * Update the tr ref count associated to it. + */ +int tracing_open_file_tr(struct inode *inode, struct file *filp) +{ + struct trace_event_file *file = inode->i_private; + int ret; + + ret = tracing_check_open_get_tr(file->tr); + if (ret) + return ret; + + filp->private_data = inode->i_private; + + return 0; +} + +int tracing_release_file_tr(struct inode *inode, struct file *filp) +{ + struct trace_event_file *file = inode->i_private; + + trace_array_put(file->tr); + + return 0; +} + static int tracing_mark_open(struct inode *inode, struct file *filp) { stream_open(inode, filp); diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 5669dd1f90d9..77debe53f07c 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -610,6 +610,8 @@ void tracing_reset_all_online_cpus(void); void tracing_reset_all_online_cpus_unlocked(void); int tracing_open_generic(struct inode *inode, struct file *filp); int tracing_open_generic_tr(struct inode *inode, struct file *filp); +int tracing_open_file_tr(struct inode *inode, struct file *filp); +int tracing_release_file_tr(struct inode *inode, struct file *filp); bool tracing_is_disabled(void); bool tracer_tracing_is_on(struct trace_array *tr); void tracer_tracing_on(struct trace_array *tr); diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index ed367d713be0..2af92177b765 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -2103,9 +2103,10 @@ static const struct file_operations ftrace_set_event_notrace_pid_fops = { }; static const struct file_operations ftrace_enable_fops = { - .open = tracing_open_generic, + .open = tracing_open_file_tr, .read = event_enable_read, .write = event_enable_write, + .release = tracing_release_file_tr, .llseek = default_llseek, }; @@ -2122,9 +2123,10 @@ static const struct file_operations ftrace_event_id_fops = { }; static const struct file_operations ftrace_event_filter_fops = { - .open = tracing_open_generic, + .open = tracing_open_file_tr, .read = event_filter_read, .write = event_filter_write, + .release = tracing_release_file_tr, .llseek = default_llseek, }; -- cgit v1.2.3 From 7d660c9b2bc95107f90a9f4c4759be85309a6550 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Wed, 6 Sep 2023 22:47:13 -0400 Subject: tracing: Have tracing_max_latency inc the trace array ref count The tracing_max_latency file points to the trace_array max_latency field. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when tracing_max_latency is opened. Link: https://lkml.kernel.org/r/20230907024803.666889383@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Andrew Morton Cc: Zheng Yejian Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing Tested-by: Naresh Kamboju Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 0827037ee3b8..c8b8b4c6feaf 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1772,7 +1772,7 @@ static void trace_create_maxlat_file(struct trace_array *tr, init_irq_work(&tr->fsnotify_irqwork, latency_fsnotify_workfn_irq); tr->d_max_latency = trace_create_file("tracing_max_latency", TRACE_MODE_WRITE, - d_tracer, &tr->max_latency, + d_tracer, tr, &tracing_max_lat_fops); } @@ -1805,7 +1805,7 @@ void latency_fsnotify(struct trace_array *tr) #define trace_create_maxlat_file(tr, d_tracer) \ trace_create_file("tracing_max_latency", TRACE_MODE_WRITE, \ - d_tracer, &tr->max_latency, &tracing_max_lat_fops) + d_tracer, tr, &tracing_max_lat_fops) #endif @@ -6717,14 +6717,18 @@ static ssize_t tracing_max_lat_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) { - return tracing_nsecs_read(filp->private_data, ubuf, cnt, ppos); + struct trace_array *tr = filp->private_data; + + return tracing_nsecs_read(&tr->max_latency, ubuf, cnt, ppos); } static ssize_t tracing_max_lat_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) { - return tracing_nsecs_write(filp->private_data, ubuf, cnt, ppos); + struct trace_array *tr = filp->private_data; + + return tracing_nsecs_write(&tr->max_latency, ubuf, cnt, ppos); } #endif @@ -7778,10 +7782,11 @@ static const struct file_operations tracing_thresh_fops = { #ifdef CONFIG_TRACER_MAX_TRACE static const struct file_operations tracing_max_lat_fops = { - .open = tracing_open_generic, + .open = tracing_open_generic_tr, .read = tracing_max_lat_read, .write = tracing_max_lat_write, .llseek = generic_file_llseek, + .release = tracing_release_generic_tr, }; #endif -- cgit v1.2.3 From 9b37febc578b2e1ad76a105aab11d00af5ec3d27 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Wed, 6 Sep 2023 22:47:14 -0400 Subject: tracing: Have current_trace inc the trace array ref count The current_trace updates the trace array tracer. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace array when current_trace is opened. Link: https://lkml.kernel.org/r/20230907024803.877687227@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Andrew Morton Cc: Zheng Yejian Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing Tested-by: Naresh Kamboju Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index c8b8b4c6feaf..b82df33d20ff 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7791,10 +7791,11 @@ static const struct file_operations tracing_max_lat_fops = { #endif static const struct file_operations set_tracer_fops = { - .open = tracing_open_generic, + .open = tracing_open_generic_tr, .read = tracing_set_trace_read, .write = tracing_set_trace_write, .llseek = generic_file_llseek, + .release = tracing_release_generic_tr, }; static const struct file_operations tracing_pipe_fops = { -- cgit v1.2.3 From 7e2cfbd2d3c86afcd5c26b5c4b1dd251f63c5838 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Wed, 6 Sep 2023 22:47:15 -0400 Subject: tracing: Have option files inc the trace array ref count The option files update the options for a given trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when an option file is opened. Link: https://lkml.kernel.org/r/20230907024804.086679464@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Andrew Morton Cc: Zheng Yejian Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing Tested-by: Naresh Kamboju Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index b82df33d20ff..0608ad20cf30 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8988,12 +8988,33 @@ trace_options_write(struct file *filp, const char __user *ubuf, size_t cnt, return cnt; } +static int tracing_open_options(struct inode *inode, struct file *filp) +{ + struct trace_option_dentry *topt = inode->i_private; + int ret; + + ret = tracing_check_open_get_tr(topt->tr); + if (ret) + return ret; + + filp->private_data = inode->i_private; + return 0; +} + +static int tracing_release_options(struct inode *inode, struct file *file) +{ + struct trace_option_dentry *topt = file->private_data; + + trace_array_put(topt->tr); + return 0; +} static const struct file_operations trace_options_fops = { - .open = tracing_open_generic, + .open = tracing_open_options, .read = trace_options_read, .write = trace_options_write, .llseek = generic_file_llseek, + .release = tracing_release_options, }; /* -- cgit v1.2.3 From e5c624f027ac74f97e97c8f36c69228ac9f1102d Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Wed, 6 Sep 2023 22:47:16 -0400 Subject: tracing: Have event inject files inc the trace array ref count The event inject files add events for a specific trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when a event inject file is opened. Link: https://lkml.kernel.org/r/20230907024804.292337868@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Andrew Morton Cc: Zheng Yejian Fixes: 6c3edaf9fd6a ("tracing: Introduce trace event injection") Tested-by: Linux Kernel Functional Testing Tested-by: Naresh Kamboju Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_events_inject.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_inject.c b/kernel/trace/trace_events_inject.c index abe805d471eb..8650562bdaa9 100644 --- a/kernel/trace/trace_events_inject.c +++ b/kernel/trace/trace_events_inject.c @@ -328,7 +328,8 @@ event_inject_read(struct file *file, char __user *buf, size_t size, } const struct file_operations event_inject_fops = { - .open = tracing_open_generic, + .open = tracing_open_file_tr, .read = event_inject_read, .write = event_inject_write, + .release = tracing_release_file_tr, }; -- cgit v1.2.3 From f6bd2c92488c30ef53b5bd80c52f0a7eee9d545a Mon Sep 17 00:00:00 2001 From: Zheng Yejian Date: Wed, 6 Sep 2023 16:19:30 +0800 Subject: ring-buffer: Avoid softlockup in ring_buffer_resize() When user resize all trace ring buffer through file 'buffer_size_kb', then in ring_buffer_resize(), kernel allocates buffer pages for each cpu in a loop. If the kernel preemption model is PREEMPT_NONE and there are many cpus and there are many buffer pages to be allocated, it may not give up cpu for a long time and finally cause a softlockup. To avoid it, call cond_resched() after each cpu buffer allocation. Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com Cc: Signed-off-by: Zheng Yejian Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 78502d4c7214..72ccf75defd0 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -2198,6 +2198,8 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, err = -ENOMEM; goto out_err; } + + cond_resched(); } cpus_read_lock(); -- cgit v1.2.3 From 41bc46c12a8053a1b3279a379bd6b5e87b045b85 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 7 Sep 2023 22:06:51 +0200 Subject: bpf: Add override check to kprobe multi link attach Currently the multi_kprobe link attach does not check error injection list for programs with bpf_override_return helper and allows them to attach anywhere. Adding the missing check. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Signed-off-by: Jiri Olsa Signed-off-by: Andrii Nakryiko Reviewed-by: Alan Maguire Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20230907200652.926951-1-jolsa@kernel.org --- kernel/trace/bpf_trace.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index a7264b2c17ad..c1c1af63ced2 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -2853,6 +2853,17 @@ static int get_modules_for_addrs(struct module ***mods, unsigned long *addrs, u3 return arr.mods_cnt; } +static int addrs_check_error_injection_list(unsigned long *addrs, u32 cnt) +{ + u32 i; + + for (i = 0; i < cnt; i++) { + if (!within_error_injection_list(addrs[i])) + return -EINVAL; + } + return 0; +} + int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog) { struct bpf_kprobe_multi_link *link = NULL; @@ -2930,6 +2941,11 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr goto error; } + if (prog->kprobe_override && addrs_check_error_injection_list(addrs, cnt)) { + err = -EINVAL; + goto error; + } + link = kzalloc(sizeof(*link), GFP_KERNEL); if (!link) { err = -ENOMEM; -- cgit v1.2.3 From 95a404bd60af6c4d9d8db01ad14fe8957ece31ca Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Thu, 7 Sep 2023 12:28:20 -0400 Subject: ring-buffer: Do not attempt to read past "commit" When iterating over the ring buffer while the ring buffer is active, the writer can corrupt the reader. There's barriers to help detect this and handle it, but that code missed the case where the last event was at the very end of the page and has only 4 bytes left. The checks to detect the corruption by the writer to reads needs to see the length of the event. If the length in the first 4 bytes is zero then the length is stored in the second 4 bytes. But if the writer is in the process of updating that code, there's a small window where the length in the first 4 bytes could be zero even though the length is only 4 bytes. That will cause rb_event_length() to read the next 4 bytes which could happen to be off the allocated page. To protect against this, fail immediately if the next event pointer is less than 8 bytes from the end of the commit (last byte of data), as all events must be a minimum of 8 bytes anyway. Link: https://lore.kernel.org/all/20230905141245.26470-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230907122820.0899019c@gandalf.local.home Cc: Masami Hiramatsu Cc: Mark Rutland Reported-by: Tze-nan Wu Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 72ccf75defd0..a1651edc48d5 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -2390,6 +2390,11 @@ rb_iter_head_event(struct ring_buffer_iter *iter) */ commit = rb_page_commit(iter_head_page); smp_rmb(); + + /* An event needs to be at least 8 bytes in size */ + if (iter->head > commit - 8) + goto reset; + event = __rb_page_index(iter_head_page, iter->head); length = rb_event_length(event); -- cgit v1.2.3 From 1ef26d8b2ca5d8715563c951083cf6c385c77d1f Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Thu, 7 Sep 2023 22:19:11 -0400 Subject: tracing: Use the new eventfs descriptor for print trigger The check to create the print event "trigger" was using the obsolete "dir" value of the trace_event_file to determine if it should create the trigger or not. But that value will now be NULL because it uses the event file descriptor. Change it to test the "ef" field of the trace_event_file structure so that the trace_marker "trigger" file appears again. Link: https://lkml.kernel.org/r/20230908022001.371815239@goodmis.org Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Andrew Morton Cc: Ajay Kaher Fixes: 27152bceea1df ("eventfs: Move tracing/events to eventfs") Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 0608ad20cf30..122c23c9eb28 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -9792,8 +9792,8 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer) tr, &tracing_mark_fops); file = __find_event_file(tr, "ftrace", "print"); - if (file && file->dir) - trace_create_file("trigger", TRACE_MODE_WRITE, file->dir, + if (file && file->ef) + eventfs_add_file("trigger", TRACE_MODE_WRITE, file->ef, file, &event_trigger_fops); tr->trace_marker_file = file; -- cgit v1.2.3 From 6fdac58c560e4d164eb8161987bee045147cabe4 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Thu, 7 Sep 2023 22:19:12 -0400 Subject: tracing: Remove unused trace_event_file dir field Now that eventfs structure is used to create the events directory via the eventfs dynamically allocate code, the "dir" field of the trace_event_file structure is no longer used. Remove it. Link: https://lkml.kernel.org/r/20230908022001.580400115@goodmis.org Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Andrew Morton Cc: Ajay Kaher Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_events.c | 13 ------------- 1 file changed, 13 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 2af92177b765..065c63991858 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -992,19 +992,6 @@ static void remove_subsystem(struct trace_subsystem_dir *dir) static void remove_event_file_dir(struct trace_event_file *file) { - struct dentry *dir = file->dir; - struct dentry *child; - - if (dir) { - spin_lock(&dir->d_lock); /* probably unneeded */ - list_for_each_entry(child, &dir->d_subdirs, d_child) { - if (d_really_is_positive(child)) /* probably unneeded */ - d_inode(child)->i_private = NULL; - } - spin_unlock(&dir->d_lock); - - tracefs_remove(dir); - } eventfs_remove(file->ef); list_del(&file->list); remove_subsystem(file->system); -- cgit v1.2.3 From d52b59315bf5e86e83c00bfae47cedd388dad6a8 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Fri, 8 Sep 2023 21:39:20 +0800 Subject: bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The following warning was reported when running "./test_progs -a link_api -a linked_list" on a RISC-V QEMU VM: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill Modules linked in: bpf_testmod(OE) CPU: 3 PID: 261 Comm: test_progs- ... 6.5.0-rc5-01743-gdcb152bb8328 #2 Hardware name: riscv-virtio,qemu (DT) epc : bpf_mem_refill+0x1fc/0x206 ra : irq_work_single+0x68/0x70 epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20 gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600 t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70 s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000 a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060 a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049 s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000 s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30 s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000 t5 : ff6000008fd28278 t6 : 0000000000040000 [] bpf_mem_refill+0x1fc/0x206 [] irq_work_single+0x68/0x70 [] irq_work_run_list+0x28/0x36 [] irq_work_run+0x38/0x66 [] handle_IPI+0x3a/0xb4 [] handle_percpu_devid_irq+0xa4/0x1f8 [] generic_handle_domain_irq+0x28/0x36 [] ipi_mux_process+0xac/0xfa [] sbi_ipi_handle+0x2e/0x88 [] generic_handle_domain_irq+0x28/0x36 [] riscv_intc_irq+0x36/0x4e [] handle_riscv_irq+0x54/0x86 [] do_irq+0x66/0x98 ---[ end trace 0000000000000000 ]--- The warning is due to WARN_ON_ONCE(tgt->unit_size != c->unit_size) in free_bulk(). The direct reason is that a object is allocated and freed by bpf_mem_caches with different unit_size. The root cause is that KMALLOC_MIN_SIZE is 64 and there is no 96-bytes slab cache in the specific VM. When linked_list test allocates a 72-bytes object through bpf_obj_new(), bpf_global_ma will allocate it from a bpf_mem_cache with 96-bytes unit_size, but this bpf_mem_cache is backed by 128-bytes slab cache. When the object is freed, bpf_mem_free() uses ksize() to choose the corresponding bpf_mem_cache. Because the object is allocated from 128-bytes slab cache, ksize() returns 128, bpf_mem_free() chooses a 128-bytes bpf_mem_cache to free the object and triggers the warning. A similar warning will also be reported when using CONFIG_SLAB instead of CONFIG_SLUB in a x86-64 kernel. Because CONFIG_SLUB defines KMALLOC_MIN_SIZE as 8 but CONFIG_SLAB defines KMALLOC_MIN_SIZE as 32. An alternative fix is to use kmalloc_size_round() in bpf_mem_alloc() to choose a bpf_mem_cache which has the same unit_size with the backing slab cache, but it may introduce performance degradation, so fix the warning by adjusting the indexes in size_index according to the value of KMALLOC_MIN_SIZE just like setup_kmalloc_cache_index_table() does. Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Björn Töpel Closes: https://lore.kernel.org/bpf/87jztjmmy4.fsf@all.your.base.are.belong.to.us Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20230908133923.2675053-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/memalloc.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 9c49ae53deaf..98d9e96fba3c 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -916,3 +916,41 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags) return !ret ? NULL : ret + LLIST_NODE_SZ; } + +/* Most of the logic is taken from setup_kmalloc_cache_index_table() */ +static __init int bpf_mem_cache_adjust_size(void) +{ + unsigned int size, index; + + /* Normally KMALLOC_MIN_SIZE is 8-bytes, but it can be + * up-to 256-bytes. + */ + size = KMALLOC_MIN_SIZE; + if (size <= 192) + index = size_index[(size - 1) / 8]; + else + index = fls(size - 1) - 1; + for (size = 8; size < KMALLOC_MIN_SIZE && size <= 192; size += 8) + size_index[(size - 1) / 8] = index; + + /* The minimal alignment is 64-bytes, so disable 96-bytes cache and + * use 128-bytes cache instead. + */ + if (KMALLOC_MIN_SIZE >= 64) { + index = size_index[(128 - 1) / 8]; + for (size = 64 + 8; size <= 96; size += 8) + size_index[(size - 1) / 8] = index; + } + + /* The minimal alignment is 128-bytes, so disable 192-bytes cache and + * use 256-bytes cache instead. + */ + if (KMALLOC_MIN_SIZE >= 128) { + index = fls(256 - 1) - 1; + for (size = 128 + 8; size <= 192; size += 8) + size_index[(size - 1) / 8] = index; + } + + return 0; +} +subsys_initcall(bpf_mem_cache_adjust_size); -- cgit v1.2.3 From b1d53958b69312e43c118d4093d8f93d3f6f80af Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Fri, 8 Sep 2023 21:39:21 +0800 Subject: bpf: Don't prefill for unused bpf_mem_cache When the unit_size of a bpf_mem_cache is unmatched with the object_size of the underlying slab cache, the bpf_mem_cache will not be used, and the allocation will be redirected to a bpf_mem_cache with a bigger unit_size instead, so there is no need to prefill for these unused bpf_mem_caches. Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20230908133923.2675053-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/memalloc.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 98d9e96fba3c..90c1ed8210a2 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -459,8 +459,7 @@ static void notrace irq_work_raise(struct bpf_mem_cache *c) * Typical case will be between 11K and 116K closer to 11K. * bpf progs can and should share bpf_mem_cache when possible. */ - -static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu) +static void init_refill_work(struct bpf_mem_cache *c) { init_irq_work(&c->refill_work, bpf_mem_refill); if (c->unit_size <= 256) { @@ -476,7 +475,10 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu) c->high_watermark = max(96 * 256 / c->unit_size, 3); } c->batch = max((c->high_watermark - c->low_watermark) / 4 * 3, 1); +} +static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu) +{ /* To avoid consuming memory assume that 1st run of bpf * prog won't be doing more than 4 map_update_elem from * irq disabled region @@ -521,6 +523,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->objcg = objcg; c->percpu_size = percpu_size; c->tgt = c; + init_refill_work(c); prefill_mem_cache(c, cpu); } ma->cache = pc; @@ -544,6 +547,15 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) c->unit_size = sizes[i]; c->objcg = objcg; c->tgt = c; + + init_refill_work(c); + /* Another bpf_mem_cache will be used when allocating + * c->unit_size in bpf_mem_alloc(), so doesn't prefill + * for the bpf_mem_cache because these free objects will + * never be used. + */ + if (i != bpf_mem_cache_idx(c->unit_size)) + continue; prefill_mem_cache(c, cpu); } } -- cgit v1.2.3 From c930472552022bd09aab3cd946ba3f243070d5c7 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Fri, 8 Sep 2023 21:39:22 +0800 Subject: bpf: Ensure unit_size is matched with slab cache object size Add extra check in bpf_mem_alloc_init() to ensure the unit_size of bpf_mem_cache is matched with the object_size of underlying slab cache. If these two sizes are unmatched, print a warning once and return -EINVAL in bpf_mem_alloc_init(), so the mismatch can be found early and the potential issue can be prevented. Suggested-by: Alexei Starovoitov Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20230908133923.2675053-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/memalloc.c | 33 +++++++++++++++++++++++++++++++-- 1 file changed, 31 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 90c1ed8210a2..1c22b90e754a 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -486,6 +486,24 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu) alloc_bulk(c, c->unit_size <= 256 ? 4 : 1, cpu_to_node(cpu), false); } +static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx) +{ + struct llist_node *first; + unsigned int obj_size; + + first = c->free_llist.first; + if (!first) + return 0; + + obj_size = ksize(first); + if (obj_size != c->unit_size) { + WARN_ONCE(1, "bpf_mem_cache[%u]: unexpected object size %u, expect %u\n", + idx, obj_size, c->unit_size); + return -EINVAL; + } + return 0; +} + /* When size != 0 bpf_mem_cache for each cpu. * This is typical bpf hash map use case when all elements have equal size. * @@ -496,10 +514,10 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu) int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) { static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096}; + int cpu, i, err, unit_size, percpu_size = 0; struct bpf_mem_caches *cc, __percpu *pcc; struct bpf_mem_cache *c, __percpu *pc; struct obj_cgroup *objcg = NULL; - int cpu, i, unit_size, percpu_size = 0; if (size) { pc = __alloc_percpu_gfp(sizeof(*pc), 8, GFP_KERNEL); @@ -537,6 +555,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) pcc = __alloc_percpu_gfp(sizeof(*cc), 8, GFP_KERNEL); if (!pcc) return -ENOMEM; + err = 0; #ifdef CONFIG_MEMCG_KMEM objcg = get_obj_cgroup_from_current(); #endif @@ -557,10 +576,20 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) if (i != bpf_mem_cache_idx(c->unit_size)) continue; prefill_mem_cache(c, cpu); + err = check_obj_size(c, i); + if (err) + goto out; } } + +out: ma->caches = pcc; - return 0; + /* refill_work is either zeroed or initialized, so it is safe to + * call irq_work_sync(). + */ + if (err) + bpf_mem_alloc_destroy(ma); + return err; } static void drain_mem_cache(struct bpf_mem_cache *c) -- cgit v1.2.3 From 62663b849662c1a5126b6274d91671b90566ef13 Mon Sep 17 00:00:00 2001 From: Tero Kristo Date: Mon, 11 Sep 2023 17:17:04 +0300 Subject: tracing/synthetic: Print out u64 values properly The synth traces incorrectly print pointer to the synthetic event values instead of the actual value when using u64 type. Fix by addressing the contents of the union properly. Link: https://lore.kernel.org/linux-trace-kernel/20230911141704.3585965-1-tero.kristo@linux.intel.com Fixes: ddeea494a16f ("tracing/synthetic: Use union instead of casts") Cc: stable@vger.kernel.org Signed-off-by: Tero Kristo Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_events_synth.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c index 9897d0bfcab7..14cb275a0bab 100644 --- a/kernel/trace/trace_events_synth.c +++ b/kernel/trace/trace_events_synth.c @@ -337,7 +337,7 @@ static void print_synth_event_num_val(struct trace_seq *s, break; default: - trace_seq_printf(s, print_fmt, name, val, space); + trace_seq_printf(s, print_fmt, name, val->as_u64, space); break; } } -- cgit v1.2.3 From a34a9f1a19afe9c60ca0ea61dfeee63a1c2baac8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Toke=20H=C3=B8iland-J=C3=B8rgensen?= Date: Mon, 11 Sep 2023 15:28:14 +0200 Subject: bpf: Avoid deadlock when using queue and stack maps from NMI MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sysbot discovered that the queue and stack maps can deadlock if they are being used from a BPF program that can be called from NMI context (such as one that is attached to a perf HW counter event). To fix this, add an in_nmi() check and use raw_spin_trylock() in NMI context, erroring out if grabbing the lock fails. Fixes: f1a2e44a3aec ("bpf: add queue and stack maps") Reported-by: Hsin-Wei Hung Tested-by: Hsin-Wei Hung Co-developed-by: Hsin-Wei Hung Signed-off-by: Toke Høiland-Jørgensen Link: https://lore.kernel.org/r/20230911132815.717240-1-toke@redhat.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/queue_stack_maps.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c index 8d2ddcb7566b..d869f51ea93a 100644 --- a/kernel/bpf/queue_stack_maps.c +++ b/kernel/bpf/queue_stack_maps.c @@ -98,7 +98,12 @@ static long __queue_map_get(struct bpf_map *map, void *value, bool delete) int err = 0; void *ptr; - raw_spin_lock_irqsave(&qs->lock, flags); + if (in_nmi()) { + if (!raw_spin_trylock_irqsave(&qs->lock, flags)) + return -EBUSY; + } else { + raw_spin_lock_irqsave(&qs->lock, flags); + } if (queue_stack_map_is_empty(qs)) { memset(value, 0, qs->map.value_size); @@ -128,7 +133,12 @@ static long __stack_map_get(struct bpf_map *map, void *value, bool delete) void *ptr; u32 index; - raw_spin_lock_irqsave(&qs->lock, flags); + if (in_nmi()) { + if (!raw_spin_trylock_irqsave(&qs->lock, flags)) + return -EBUSY; + } else { + raw_spin_lock_irqsave(&qs->lock, flags); + } if (queue_stack_map_is_empty(qs)) { memset(value, 0, qs->map.value_size); @@ -193,7 +203,12 @@ static long queue_stack_map_push_elem(struct bpf_map *map, void *value, if (flags & BPF_NOEXIST || flags > BPF_EXIST) return -EINVAL; - raw_spin_lock_irqsave(&qs->lock, irq_flags); + if (in_nmi()) { + if (!raw_spin_trylock_irqsave(&qs->lock, irq_flags)) + return -EBUSY; + } else { + raw_spin_lock_irqsave(&qs->lock, irq_flags); + } if (queue_stack_map_is_full(qs)) { if (!replace) { -- cgit v1.2.3 From 1a49f4195d3498fe458a7f5ff7ec5385da70d92e Mon Sep 17 00:00:00 2001 From: Eduard Zingerman Date: Tue, 12 Sep 2023 03:55:37 +0300 Subject: bpf: Avoid dummy bpf_offload_netdev in __bpf_prog_dev_bound_init Fix for a bug observable under the following sequence of events: 1. Create a network device that does not support XDP offload. 2. Load a device bound XDP program with BPF_F_XDP_DEV_BOUND_ONLY flag (such programs are not offloaded). 3. Load a device bound XDP program with zero flags (such programs are offloaded). At step (2) __bpf_prog_dev_bound_init() associates with device (1) a dummy bpf_offload_netdev struct with .offdev field set to NULL. At step (3) __bpf_prog_dev_bound_init() would reuse dummy struct allocated at step (2). However, downstream usage of the bpf_offload_netdev assumes that .offdev field can't be NULL, e.g. in bpf_prog_offload_verifier_prep(). Adjust __bpf_prog_dev_bound_init() to require bpf_offload_netdev with non-NULL .offdev for offloaded BPF programs. Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs") Reported-by: syzbot+291100dcb32190ec02a8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/000000000000d97f3c060479c4f8@google.com/ Signed-off-by: Eduard Zingerman Link: https://lore.kernel.org/r/20230912005539.2248244-2-eddyz87@gmail.com Signed-off-by: Martin KaFai Lau --- kernel/bpf/offload.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index 3e4f2ec1af06..87d6693d8233 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -199,12 +199,14 @@ static int __bpf_prog_dev_bound_init(struct bpf_prog *prog, struct net_device *n offload->netdev = netdev; ondev = bpf_offload_find_netdev(offload->netdev); + /* When program is offloaded require presence of "true" + * bpf_offload_netdev, avoid the one created for !ondev case below. + */ + if (bpf_prog_is_offloaded(prog->aux) && (!ondev || !ondev->offdev)) { + err = -EINVAL; + goto err_free; + } if (!ondev) { - if (bpf_prog_is_offloaded(prog->aux)) { - err = -EINVAL; - goto err_free; - } - /* When only binding to the device, explicitly * create an entry in the hashtable. */ -- cgit v1.2.3 From 40d84e198b0ae64df71ac0e70675b16900b90bde Mon Sep 17 00:00:00 2001 From: Chen Yu Date: Wed, 6 Sep 2023 12:18:41 +0800 Subject: PM: hibernate: Rename function parameter from snapshot_test to exclusive Several functions reply on snapshot_test to decide whether to open the resume device exclusively. However there is no strict connection between the snapshot_test and the open mode. Rename the 'snapshot_test' input parameter to 'exclusive' to better reflect the use case. No functional change is expected. Signed-off-by: Chen Yu Signed-off-by: Rafael J. Wysocki --- kernel/power/power.h | 4 ++-- kernel/power/swap.c | 14 ++++++++------ 2 files changed, 10 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/power/power.h b/kernel/power/power.h index 46eb14dc50c3..a98f95e309a3 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -168,11 +168,11 @@ extern int swsusp_swap_in_use(void); #define SF_HW_SIG 8 /* kernel/power/hibernate.c */ -int swsusp_check(bool snapshot_test); +int swsusp_check(bool exclusive); extern void swsusp_free(void); extern int swsusp_read(unsigned int *flags_p); extern int swsusp_write(unsigned int flags); -void swsusp_close(bool snapshot_test); +void swsusp_close(bool exclusive); #ifdef CONFIG_SUSPEND extern int swsusp_unmark(void); #endif diff --git a/kernel/power/swap.c b/kernel/power/swap.c index f6ebcd00c410..74edbce2320b 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -1513,12 +1513,13 @@ end: static void *swsusp_holder; /** - * swsusp_check - Check for swsusp signature in the resume device + * swsusp_check - Check for swsusp signature in the resume device + * @exclusive: Open the resume device exclusively. */ -int swsusp_check(bool snapshot_test) +int swsusp_check(bool exclusive) { - void *holder = snapshot_test ? &swsusp_holder : NULL; + void *holder = exclusive ? &swsusp_holder : NULL; int error; hib_resume_bdev = blkdev_get_by_dev(swsusp_resume_device, BLK_OPEN_READ, @@ -1563,17 +1564,18 @@ put: } /** - * swsusp_close - close swap device. + * swsusp_close - close swap device. + * @exclusive: Close the resume device which is exclusively opened. */ -void swsusp_close(bool snapshot_test) +void swsusp_close(bool exclusive) { if (IS_ERR(hib_resume_bdev)) { pr_debug("Image device not initialised\n"); return; } - blkdev_put(hib_resume_bdev, snapshot_test ? &swsusp_holder : NULL); + blkdev_put(hib_resume_bdev, exclusive ? &swsusp_holder : NULL); } /** -- cgit v1.2.3 From 148b6f4cc3920e563094540fe1a12d00d3bbccae Mon Sep 17 00:00:00 2001 From: Chen Yu Date: Wed, 6 Sep 2023 12:18:52 +0800 Subject: PM: hibernate: Fix the exclusive get block device in test_resume mode Commit 5904de0d735b ("PM: hibernate: Do not get block device exclusively in test_resume mode") fixes a hibernation issue under test_resume mode. That commit is supposed to open the block device in non-exclusive mode when in test_resume. However the code does the opposite, which is against its description. In summary, the swap device is only opened exclusively by swsusp_check() with its corresponding *close(), and must be in non test_resume mode. This is to avoid the race condition that different processes scribble the device at the same time. All the other cases should use non-exclusive mode. Fix it by really disabling exclusive mode under test_resume. Fixes: 5904de0d735b ("PM: hibernate: Do not get block device exclusively in test_resume mode") Closes: https://lore.kernel.org/lkml/000000000000761f5f0603324129@google.com/ Reported-by: Pengfei Xu Signed-off-by: Chen Yu Tested-by: Chenzhou Feng Signed-off-by: Rafael J. Wysocki --- kernel/power/hibernate.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index 2b4a946a6ff5..8d35b9f9aaa3 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -786,9 +786,9 @@ int hibernate(void) unlock_device_hotplug(); if (snapshot_test) { pm_pr_dbg("Checking hibernation image\n"); - error = swsusp_check(snapshot_test); + error = swsusp_check(false); if (!error) - error = load_image_and_restore(snapshot_test); + error = load_image_and_restore(false); } thaw_processes(); @@ -945,14 +945,14 @@ static int software_resume(void) pm_pr_dbg("Looking for hibernation image.\n"); mutex_lock(&system_transition_mutex); - error = swsusp_check(false); + error = swsusp_check(true); if (error) goto Unlock; /* The snapshot device should not be opened while we're running */ if (!hibernate_acquire()) { error = -EBUSY; - swsusp_close(false); + swsusp_close(true); goto Unlock; } @@ -973,7 +973,7 @@ static int software_resume(void) goto Close_Finish; } - error = load_image_and_restore(false); + error = load_image_and_restore(true); thaw_processes(); Finish: pm_notifier_call_chain(PM_POST_RESTORE); @@ -987,7 +987,7 @@ static int software_resume(void) pm_pr_dbg("Hibernation image not present or could not be loaded.\n"); return error; Close_Finish: - swsusp_close(false); + swsusp_close(true); goto Finish; } -- cgit v1.2.3 From c8414dab164a74bd3bb859a2d836cb537d6b9298 Mon Sep 17 00:00:00 2001 From: Jinjie Ruan Date: Tue, 12 Sep 2023 21:47:52 +0800 Subject: eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec() Inject fault while probing btrfs.ko, if kstrdup() fails in eventfs_prepare_ef() in eventfs_add_dir(), it will return ERR_PTR to assign file->ef. But the eventfs_remove() check NULL in trace_module_remove_events(), which causes the below NULL pointer dereference. As both Masami and Steven suggest, allocater side should handle the error carefully and remove it, so fix the places where it failed. Could not create tracefs 'raid56_write' directory Btrfs loaded, zoned=no, fsverity=no Unable to handle kernel NULL pointer dereference at virtual address 000000000000001c Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102544000 [000000000000001c] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: btrfs(-) libcrc32c xor xor_neon raid6_pq cfg80211 rfkill 8021q garp mrp stp llc ipv6 [last unloaded: btrfs] CPU: 15 PID: 1343 Comm: rmmod Tainted: G N 6.5.0+ #40 Hardware name: linux,dummy-virt (DT) pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : eventfs_remove_rec+0x24/0xc0 lr : eventfs_remove+0x68/0x1d8 sp : ffff800082d63b60 x29: ffff800082d63b60 x28: ffffb84b80ddd00c x27: ffffb84b3054ba40 x26: 0000000000000002 x25: ffff800082d63bf8 x24: ffffb84b8398e440 x23: ffffb84b82af3000 x22: dead000000000100 x21: dead000000000122 x20: ffff800082d63bf8 x19: fffffffffffffff4 x18: ffffb84b82508820 x17: 0000000000000000 x16: 0000000000000000 x15: 000083bc876a3166 x14: 000000000000006d x13: 000000000000006d x12: 0000000000000000 x11: 0000000000000001 x10: 00000000000017e0 x9 : 0000000000000001 x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffffb84b84289804 x5 : 0000000000000000 x4 : 9696969696969697 x3 : ffff33a5b7601f38 x2 : 0000000000000000 x1 : ffff800082d63bf8 x0 : fffffffffffffff4 Call trace: eventfs_remove_rec+0x24/0xc0 eventfs_remove+0x68/0x1d8 remove_event_file_dir+0x88/0x100 event_remove+0x140/0x15c trace_module_notify+0x1fc/0x230 notifier_call_chain+0x98/0x17c blocking_notifier_call_chain+0x4c/0x74 __arm64_sys_delete_module+0x1a4/0x298 invoke_syscall+0x44/0x100 el0_svc_common.constprop.1+0x68/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x3c/0xc4 el0t_64_sync_handler+0xa0/0xc4 el0t_64_sync+0x174/0x178 Code: 5400052c a90153b3 aa0003f3 aa0103f4 (f9401400) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: 0x384b00c00000 from 0xffff800080000000 PHYS_OFFSET: 0xffffcc5b80000000 CPU features: 0x88000203,3c020000,1000421b Memory Limit: none Rebooting in 1 seconds.. Link: https://lore.kernel.org/linux-trace-kernel/20230912134752.1838524-1-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20230912025808.668187-1-ruanjinjie@huawei.com/ Link: https://lore.kernel.org/all/20230911052818.1020547-1-ruanjinjie@huawei.com/ Link: https://lore.kernel.org/all/20230909072817.182846-1-ruanjinjie@huawei.com/ Link: https://lore.kernel.org/all/20230908074816.3724716-1-ruanjinjie@huawei.com/ Cc: Ajay Kaher Fixes: 5bdcd5f5331a ("eventfs: Implement removal of meta data from eventfs") Signed-off-by: Jinjie Ruan Suggested-by: Masami Hiramatsu (Google) Suggested-by: Steven Rostedt Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_events.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 065c63991858..91951d038ba4 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -2286,6 +2286,7 @@ event_subsystem_dir(struct trace_array *tr, const char *name, { struct event_subsystem *system, *iter; struct trace_subsystem_dir *dir; + struct eventfs_file *ef; int res; /* First see if we did not already create this dir */ @@ -2318,13 +2319,14 @@ event_subsystem_dir(struct trace_array *tr, const char *name, } else __get_system(system); - dir->ef = eventfs_add_subsystem_dir(name, parent); - if (IS_ERR(dir->ef)) { + ef = eventfs_add_subsystem_dir(name, parent); + if (IS_ERR(ef)) { pr_warn("Failed to create system directory %s\n", name); __put_system(system); goto out_free; } + dir->ef = ef; dir->tr = tr; dir->ref_count = 1; dir->nr_events = 1; @@ -2404,6 +2406,7 @@ event_create_dir(struct dentry *parent, struct trace_event_file *file) struct trace_event_call *call = file->event_call; struct eventfs_file *ef_subsystem = NULL; struct trace_array *tr = file->tr; + struct eventfs_file *ef; const char *name; int ret; @@ -2420,12 +2423,14 @@ event_create_dir(struct dentry *parent, struct trace_event_file *file) return -ENOMEM; name = trace_event_name(call); - file->ef = eventfs_add_dir(name, ef_subsystem); - if (IS_ERR(file->ef)) { + ef = eventfs_add_dir(name, ef_subsystem); + if (IS_ERR(ef)) { pr_warn("Could not create tracefs '%s' directory\n", name); return -1; } + file->ef = ef; + if (call->class->reg && !(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) eventfs_add_file("enable", TRACE_MODE_WRITE, file->ef, file, &ftrace_enable_fops); -- cgit v1.2.3 From a8f12572860ad8ba659d96eee9cf09e181f6ebcc Mon Sep 17 00:00:00 2001 From: Christophe JAILLET Date: Fri, 8 Sep 2023 18:33:35 +0200 Subject: bpf: Fix a erroneous check after snprintf() snprintf() does not return negative error code on error, it returns the number of characters which *would* be generated for the given input. Fix the error handling check. Fixes: 57539b1c0ac2 ("bpf: Enable annotating trusted nested pointers") Signed-off-by: Christophe JAILLET Link: https://lore.kernel.org/r/393bdebc87b22563c08ace094defa7160eb7a6c0.1694190795.git.christophe.jaillet@wanadoo.fr Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 1095bbe29859..8090d7fb11ef 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -8501,7 +8501,7 @@ bool btf_nested_type_is_trusted(struct bpf_verifier_log *log, tname = btf_name_by_offset(btf, walk_type->name_off); ret = snprintf(safe_tname, sizeof(safe_tname), "%s%s", tname, suffix); - if (ret < 0) + if (ret >= sizeof(safe_tname)) return false; safe_id = btf_find_by_name_kind(btf, safe_tname, BTF_INFO_KIND(walk_type->info)); -- cgit v1.2.3 From 214bfd267f4929722b374b43fda456c21cd6f016 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Mon, 11 Sep 2023 23:08:12 -0700 Subject: bpf, cgroup: fix multiple kernel-doc warnings Fix missing or extra function parameter kernel-doc warnings in cgroup.c: kernel/bpf/cgroup.c:1359: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_skb' kernel/bpf/cgroup.c:1359: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_skb' kernel/bpf/cgroup.c:1439: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sk' kernel/bpf/cgroup.c:1439: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sk' kernel/bpf/cgroup.c:1467: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_addr' kernel/bpf/cgroup.c:1467: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_addr' kernel/bpf/cgroup.c:1512: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_ops' kernel/bpf/cgroup.c:1512: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_ops' kernel/bpf/cgroup.c:1685: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sysctl' kernel/bpf/cgroup.c:1685: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sysctl' kernel/bpf/cgroup.c:795: warning: Excess function parameter 'type' description in '__cgroup_bpf_replace' kernel/bpf/cgroup.c:795: warning: Function parameter or member 'new_prog' not described in '__cgroup_bpf_replace' Signed-off-by: Randy Dunlap Cc: Martin KaFai Lau Cc: bpf@vger.kernel.org Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Andrii Nakryiko Link: https://lore.kernel.org/r/20230912060812.1715-1-rdunlap@infradead.org Signed-off-by: Alexei Starovoitov --- kernel/bpf/cgroup.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 5b2741aa0d9b..03b3d4492980 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -785,7 +785,8 @@ found: * to descendants * @cgrp: The cgroup which descendants to traverse * @link: A link for which to replace BPF program - * @type: Type of attach operation + * @new_prog: &struct bpf_prog for the target BPF program with its refcnt + * incremented * * Must be called with cgroup_mutex held. */ @@ -1334,7 +1335,7 @@ int cgroup_bpf_prog_query(const union bpf_attr *attr, * __cgroup_bpf_run_filter_skb() - Run a program for packet filtering * @sk: The socket sending or receiving traffic * @skb: The skb that is being sent or received - * @type: The type of program to be executed + * @atype: The type of program to be executed * * If no socket is passed, or the socket is not of type INET or INET6, * this function does nothing and returns 0. @@ -1424,7 +1425,7 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb); /** * __cgroup_bpf_run_filter_sk() - Run a program on a sock * @sk: sock structure to manipulate - * @type: The type of program to be executed + * @atype: The type of program to be executed * * socket is passed is expected to be of type INET or INET6. * @@ -1449,7 +1450,7 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk); * provided by user sockaddr * @sk: sock struct that will use sockaddr * @uaddr: sockaddr struct provided by user - * @type: The type of program to be executed + * @atype: The type of program to be executed * @t_ctx: Pointer to attach type specific context * @flags: Pointer to u32 which contains higher bits of BPF program * return value (OR'ed together). @@ -1496,7 +1497,7 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sock_addr); * @sock_ops: bpf_sock_ops_kern struct to pass to program. Contains * sk with connection information (IP addresses, etc.) May not contain * cgroup info if it is a req sock. - * @type: The type of program to be executed + * @atype: The type of program to be executed * * socket passed is expected to be of type INET or INET6. * @@ -1670,7 +1671,7 @@ const struct bpf_verifier_ops cg_dev_verifier_ops = { * @ppos: value-result argument: value is position at which read from or write * to sysctl is happening, result is new position if program overrode it, * initial value otherwise - * @type: type of program to be executed + * @atype: type of program to be executed * * Program is run when sysctl is being accessed, either read or written, and * can allow or deny such access. -- cgit v1.2.3 From a6a241764f69c62d23fc6960920cc662ae4069e9 Mon Sep 17 00:00:00 2001 From: Ross Lagerwall Date: Mon, 11 Sep 2023 11:32:51 +0100 Subject: swiotlb: use the calculated number of areas Commit 8ac04063354a ("swiotlb: reduce the number of areas to match actual memory pool size") calculated the reduced number of areas in swiotlb_init_remap() but didn't actually use the value. Replace usage of default_nareas accordingly. Fixes: 8ac04063354a ("swiotlb: reduce the number of areas to match actual memory pool size") Signed-off-by: Ross Lagerwall Signed-off-by: Christoph Hellwig --- kernel/dma/swiotlb.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 394494a6b1f3..85dd94323b98 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -399,14 +399,13 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags, } mem->areas = memblock_alloc(array_size(sizeof(struct io_tlb_area), - default_nareas), SMP_CACHE_BYTES); + nareas), SMP_CACHE_BYTES); if (!mem->areas) { pr_warn("%s: Failed to allocate mem->areas.\n", __func__); return; } - swiotlb_init_io_tlb_pool(mem, __pa(tlb), nslabs, false, - default_nareas); + swiotlb_init_io_tlb_pool(mem, __pa(tlb), nslabs, false, nareas); add_mem_pool(&io_tlb_default_mem, mem); if (flags & SWIOTLB_VERBOSE) -- cgit v1.2.3 From 450e749707bc1755f22b505d9cd942d4869dc535 Mon Sep 17 00:00:00 2001 From: Tim Chen Date: Thu, 7 Sep 2023 10:42:21 -0700 Subject: sched/fair: Fix SMT4 group_smt_balance handling For SMT4, any group with more than 2 tasks will be marked as group_smt_balance. Retain the behaviour of group_has_spare by marking the busiest group as the group which has the least number of idle_cpus. Also, handle rounding effect of adding (ncores_local + ncores_busy) when the local is fully idle and busy group imbalance is less than 2 tasks. Local group should try to pull at least 1 task in this case so imbalance should be set to 2 instead. Fixes: fee1759e4f04 ("sched/fair: Determine active load balance for SMT sched groups") Acked-by: Shrikanth Hegde Signed-off-by: Tim Chen Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: http://lkml.kernel.org/r/6cd1633036bb6b651af575c32c2a9608a106702c.camel@linux.intel.com --- kernel/sched/fair.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 33a2b6bba676..cb225921bbca 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9580,7 +9580,7 @@ static inline long sibling_imbalance(struct lb_env *env, imbalance /= ncores_local + ncores_busiest; /* Take advantage of resource in an empty sched group */ - if (imbalance == 0 && local->sum_nr_running == 0 && + if (imbalance <= 1 && local->sum_nr_running == 0 && busiest->sum_nr_running > 1) imbalance = 2; @@ -9768,6 +9768,15 @@ static bool update_sd_pick_busiest(struct lb_env *env, break; case group_smt_balance: + /* + * Check if we have spare CPUs on either SMT group to + * choose has spare or fully busy handling. + */ + if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) + goto has_spare; + + fallthrough; + case group_fully_busy: /* * Select the fully busy group with highest avg_load. In @@ -9807,6 +9816,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, else return true; } +has_spare: /* * Select not overloaded group with lowest number of idle cpus -- cgit v1.2.3 From cccd32816506cbac3a4c65d9dff51b3125ef1a03 Mon Sep 17 00:00:00 2001 From: Lukas Wunner Date: Fri, 15 Sep 2023 09:55:39 +0200 Subject: panic: Reenable preemption in WARN slowpath Commit: 5a5d7e9badd2 ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG") amended warn_slowpath_fmt() to disable preemption until the WARN splat has been emitted. However the commit neglected to reenable preemption in the !fmt codepath, i.e. when a WARN splat is emitted without additional format string. One consequence is that users may see more splats than intended. E.g. a WARN splat emitted in a work item results in at least two extra splats: BUG: workqueue leaked lock or atomic (emitted by process_one_work()) BUG: scheduling while atomic (emitted by worker_thread() -> schedule()) Ironically the point of the commit was to *avoid* extra splats. ;) Fix it. Fixes: 5a5d7e9badd2 ("cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG") Signed-off-by: Lukas Wunner Signed-off-by: Ingo Molnar Cc: Linus Torvalds Cc: Thomas Gleixner Cc: Paul E. McKenney Link: https://lore.kernel.org/r/3ec48fde01e4ee6505f77908ba351bad200ae3d1.1694763684.git.lukas@wunner.de --- kernel/panic.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/panic.c b/kernel/panic.c index 07239d4ad81e..ffa037fa777d 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -697,6 +697,7 @@ void warn_slowpath_fmt(const char *file, int line, unsigned taint, if (!fmt) { __warn(file, line, __builtin_return_address(0), taint, NULL, NULL); + warn_rcu_exit(rcu); return; } -- cgit v1.2.3 From dca7acd84e93f2881e3f63465bbb5d89a40b5d17 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Wed, 13 Sep 2023 21:59:43 +0800 Subject: bpf: Skip unit_size checking for global per-cpu allocator For global per-cpu allocator, the size of free object in free list doesn't match with unit_size and now there is no way to get the size of per-cpu pointer saved in free object, so just skip the checking. Reported-by: Stephen Rothwell Closes: https://lore.kernel.org/bpf/20230913133436.0eeec4cb@canb.auug.org.au/ Signed-off-by: Hou Tao Tested-by: Biju Das Link: https://lore.kernel.org/r/20230913135943.3137292-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/memalloc.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 1c22b90e754a..cf1941516643 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -491,6 +491,13 @@ static int check_obj_size(struct bpf_mem_cache *c, unsigned int idx) struct llist_node *first; unsigned int obj_size; + /* For per-cpu allocator, the size of free objects in free list doesn't + * match with unit_size and now there is no way to get the size of + * per-cpu pointer saved in free object, so just skip the checking. + */ + if (c->percpu_size) + return 0; + first = c->free_llist.first; if (!first) return 0; -- cgit v1.2.3 From 57eb5e1c5c57972c95e8efab6bc81b87161b0b07 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Fri, 15 Sep 2023 12:14:20 +0200 Subject: bpf: Fix uprobe_multi get_pid_task error path Dan reported Smatch static checker warning due to missing error value set in uprobe multi link's get_pid_task error path. Reported-by: Dan Carpenter Closes: https://lore.kernel.org/bpf/c5ffa7c0-6b06-40d5-aca2-63833b5cd9af@moroto.mountain/ Signed-off-by: Jiri Olsa Reviewed-by: Song Liu Link: https://lore.kernel.org/r/20230915101420.1193800-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/trace/bpf_trace.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index c1c1af63ced2..868008f56fec 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -3223,8 +3223,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr rcu_read_lock(); task = get_pid_task(find_vpid(pid), PIDTYPE_PID); rcu_read_unlock(); - if (!task) + if (!task) { + err = -ESRCH; goto error_path_put; + } } err = -ENOMEM; -- cgit v1.2.3 From a6828214480e2f00a8a7e64c7a55fc42b0f54e1c Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Tue, 5 Sep 2023 17:49:35 -0400 Subject: workqueue: Removed double allocation of wq_update_pod_attrs_buf First commit 2930155b2e272 ("workqueue: Initialize unbound CPU pods later in the boot") added the initialization of wq_update_pod_attrs_buf to workqueue_init_early(), and then latter on, commit 84193c07105c6 ("workqueue: Generalize unbound CPU pods") added it as well. This appeared in a kmemleak run where the second allocation made the first allocation leak. Fixes: 84193c07105c6 ("workqueue: Generalize unbound CPU pods") Signed-off-by: Steven Rostedt (Google) Reviewed-by: Geert Uytterhoeven Signed-off-by: Tejun Heo --- kernel/workqueue.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index c85825e17df8..129328b765fb 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -6535,9 +6535,6 @@ void __init workqueue_init_early(void) BUG_ON(!zalloc_cpumask_var_node(&pt->pod_cpus[0], GFP_KERNEL, NUMA_NO_NODE)); - wq_update_pod_attrs_buf = alloc_workqueue_attrs(); - BUG_ON(!wq_update_pod_attrs_buf); - pt->nr_pods = 1; cpumask_copy(pt->pod_cpus[0], cpu_possible_mask); pt->pod_node[0] = NUMA_NO_NODE; -- cgit v1.2.3 From dd64c873ed11cdae340be06dcd2364870fd3e4fc Mon Sep 17 00:00:00 2001 From: Zqiang Date: Mon, 11 Sep 2023 16:27:22 +0800 Subject: workqueue: Fix missed pwq_release_worker creation in wq_cpu_intensive_thresh_init() Currently, if the wq_cpu_intensive_thresh_us is set to specific value, will cause the wq_cpu_intensive_thresh_init() early exit and missed creation of pwq_release_worker. this commit therefore create the pwq_release_worker in advance before checking the wq_cpu_intensive_thresh_us. Signed-off-by: Zqiang Signed-off-by: Tejun Heo Fixes: 967b494e2fd1 ("workqueue: Use a kthread_worker to release pool_workqueues") --- kernel/workqueue.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 129328b765fb..b9f053a5a5f0 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -6602,13 +6602,13 @@ static void __init wq_cpu_intensive_thresh_init(void) unsigned long thresh; unsigned long bogo; + pwq_release_worker = kthread_create_worker(0, "pool_workqueue_release"); + BUG_ON(IS_ERR(pwq_release_worker)); + /* if the user set it to a specific value, keep it */ if (wq_cpu_intensive_thresh_us != ULONG_MAX) return; - pwq_release_worker = kthread_create_worker(0, "pool_workqueue_release"); - BUG_ON(IS_ERR(pwq_release_worker)); - /* * The default of 10ms is derived from the fact that most modern (as of * 2023) processors can do a lot in 10ms and that it's just below what -- cgit v1.2.3 From cff9b2332ab762b7e0586c793c431a8f2ea4db04 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 15 Sep 2023 13:44:44 -0400 Subject: kernel/sched: Modify initial boot task idle setup Initial booting is setting the task flag to idle (PF_IDLE) by the call path sched_init() -> init_idle(). Having the task idle and calling call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be set. Subsequent calls to any cond_resched() will enable IRQs, potentially earlier than the IRQ setup has completed. Recent changes have caused just this scenario and IRQs have been enabled early. This causes a warning later in start_kernel() as interrupts are enabled before they are fully set up. Fix this issue by setting the PF_IDLE flag later in the boot sequence. Although the boot task was marked as idle since (at least) d80e4fda576d, I am not sure that it is wrong to do so. The forced context-switch on idle task was introduced in the tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") Signed-off-by: Liam R. Howlett Signed-off-by: Peter Zijlstra (Intel) Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/ --- kernel/sched/core.c | 2 +- kernel/sched/idle.c | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2299a5cfbfb9..802551e0009b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu) * PF_KTHREAD should already be set at this point; regardless, make it * look like a proper per-CPU kthread. */ - idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY; + idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY; kthread_set_per_cpu(idle, cpu); #ifdef CONFIG_SMP diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 342f58a329f5..5007b25c5bc6 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -373,6 +373,7 @@ EXPORT_SYMBOL_GPL(play_idle_precise); void cpu_startup_entry(enum cpuhp_state state) { + current->flags |= PF_IDLE; arch_cpu_idle_prepare(); cpuhp_online_idle(state); while (1) -- cgit v1.2.3 From 4653e5dd04cb869526477a76b87d0aa1a5c65101 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Mon, 11 Sep 2023 14:24:06 -0600 Subject: task_work: add kerneldoc annotation for 'data' argument A previous commit changed the arguments to task_work_cancel_match(), but didn't document all of them. Link: https://lkml.kernel.org/r/93938bff-baa3-4091-85f5-784aae297a07@kernel.dk Fixes: c7aab1a7c52b ("task_work: add helper for more targeted task_work canceling") Signed-off-by: Jens Axboe Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202309120307.zis3yQGe-lkp@intel.com/ Acked-by: Oleg Nesterov Signed-off-by: Andrew Morton --- kernel/task_work.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/task_work.c b/kernel/task_work.c index 065e1ef8fc8d..95a7e1b7f1da 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -78,6 +78,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work, * task_work_cancel_match - cancel a pending work added by task_work_add() * @task: the task which should execute the work * @match: match function to call + * @data: data to be passed in to match function * * RETURNS: * The found work or NULL if not found. -- cgit v1.2.3 From 0c7752d5b1ac62c8b926c907f34073ef7e9ad42b Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Mon, 11 Sep 2023 23:08:22 -0700 Subject: pidfd: prevent a kernel-doc warning Change the comment to match the function name that the SYSCALL_DEFINE() macros generate to prevent a kernel-doc warning. kernel/pid.c:628: warning: expecting prototype for pidfd_open(). Prototype was for sys_pidfd_open() instead Link: https://lkml.kernel.org/r/20230912060822.2500-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap Cc: Christian Brauner Signed-off-by: Andrew Morton --- kernel/pid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/pid.c b/kernel/pid.c index fee14a4486a3..6500ef956f2f 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -609,7 +609,7 @@ int pidfd_create(struct pid *pid, unsigned int flags) } /** - * pidfd_open() - Open new pid file descriptor. + * sys_pidfd_open() - Open new pid file descriptor. * * @pid: pid for which to retrieve a pidfd * @flags: flags to pass -- cgit v1.2.3 From 81335f90e8a88b81932df011105c46e708744f44 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Mon, 18 Sep 2023 14:01:10 -0700 Subject: bpf: unconditionally reset backtrack_state masks on global func exit In mark_chain_precision() logic, when we reach the entry to a global func, it is expected that R1-R5 might be still requested to be marked precise. This would correspond to some integer input arguments being tracked as precise. This is all expected and handled as a special case. What's not expected is that we'll leave backtrack_state structure with some register bits set. This is because for subsequent precision propagations backtrack_state is reused without clearing masks, as all code paths are carefully written in a way to leave empty backtrack_state with zeroed out masks, for speed. The fix is trivial, we always clear register bit in the register mask, and then, optionally, set reg->precise if register is SCALAR_VALUE type. Reported-by: Chris Mason Fixes: be2ef8161572 ("bpf: allow precision tracking for programs with subprogs") Signed-off-by: Andrii Nakryiko Link: https://lore.kernel.org/r/20230918210110.2241458-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index bb78212fa5b2..c0c7d137066a 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4047,11 +4047,9 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno) bitmap_from_u64(mask, bt_reg_mask(bt)); for_each_set_bit(i, mask, 32) { reg = &st->frame[0]->regs[i]; - if (reg->type != SCALAR_VALUE) { - bt_clear_reg(bt, i); - continue; - } - reg->precise = true; + bt_clear_reg(bt, i); + if (reg->type == SCALAR_VALUE) + reg->precise = true; } return 0; } -- cgit v1.2.3 From 45d99ea451d0c30bfd4864f0fe485d7dac014902 Mon Sep 17 00:00:00 2001 From: Zheng Yejian Date: Thu, 21 Sep 2023 20:54:25 +0800 Subject: ring-buffer: Fix bytes info in per_cpu buffer stats The 'bytes' info in file 'per_cpu/cpu/stats' means the number of bytes in cpu buffer that have not been consumed. However, currently after consuming data by reading file 'trace_pipe', the 'bytes' info was not changed as expected. # cat per_cpu/cpu0/stats entries: 0 overrun: 0 commit overrun: 0 bytes: 568 <--- 'bytes' is problematical !!! oldest event ts: 8651.371479 now ts: 8653.912224 dropped events: 0 read events: 8 The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it: 1. When stat 'read_bytes', account consumed event in rb_advance_reader(); 2. When stat 'entries_bytes', exclude the discarded padding event which is smaller than minimum size because it is invisible to reader. Then use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for page-based read/remove/overrun. Also correct the comments of ring_buffer_bytes_cpu() in this patch. Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events") Signed-off-by: Zheng Yejian Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 28 +++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index a1651edc48d5..28daf0ce95c5 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -354,6 +354,11 @@ static void rb_init_page(struct buffer_data_page *bpage) local_set(&bpage->commit, 0); } +static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage) +{ + return local_read(&bpage->page->commit); +} + static void free_buffer_page(struct buffer_page *bpage) { free_page((unsigned long)bpage->page); @@ -2003,7 +2008,7 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages) * Increment overrun to account for the lost events. */ local_add(page_entries, &cpu_buffer->overrun); - local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes); + local_sub(rb_page_commit(to_remove_page), &cpu_buffer->entries_bytes); local_inc(&cpu_buffer->pages_lost); } @@ -2367,11 +2372,6 @@ rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer) cpu_buffer->reader_page->read); } -static __always_inline unsigned rb_page_commit(struct buffer_page *bpage) -{ - return local_read(&bpage->page->commit); -} - static struct ring_buffer_event * rb_iter_head_event(struct ring_buffer_iter *iter) { @@ -2517,7 +2517,7 @@ rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer, * the counters. */ local_add(entries, &cpu_buffer->overrun); - local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes); + local_sub(rb_page_commit(next_page), &cpu_buffer->entries_bytes); local_inc(&cpu_buffer->pages_lost); /* @@ -2660,9 +2660,6 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer, event = __rb_page_index(tail_page, tail); - /* account for padding bytes */ - local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes); - /* * Save the original length to the meta data. * This will be used by the reader to add lost event @@ -2676,7 +2673,8 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer, * write counter enough to allow another writer to slip * in on this page. * We put in a discarded commit instead, to make sure - * that this space is not used again. + * that this space is not used again, and this space will + * not be accounted into 'entries_bytes'. * * If we are less than the minimum size, we don't need to * worry about it. @@ -2701,6 +2699,9 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer, /* time delta must be non zero */ event->time_delta = 1; + /* account for padding bytes */ + local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes); + /* Make sure the padding is visible before the tail_page->write update */ smp_wmb(); @@ -4215,7 +4216,7 @@ u64 ring_buffer_oldest_event_ts(struct trace_buffer *buffer, int cpu) EXPORT_SYMBOL_GPL(ring_buffer_oldest_event_ts); /** - * ring_buffer_bytes_cpu - get the number of bytes consumed in a cpu buffer + * ring_buffer_bytes_cpu - get the number of bytes unconsumed in a cpu buffer * @buffer: The ring buffer * @cpu: The per CPU buffer to read from. */ @@ -4723,6 +4724,7 @@ static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer) length = rb_event_length(event); cpu_buffer->reader_page->read += length; + cpu_buffer->read_bytes += length; } static void rb_advance_iter(struct ring_buffer_iter *iter) @@ -5816,7 +5818,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, } else { /* update the entry counter */ cpu_buffer->read += rb_page_entries(reader); - cpu_buffer->read_bytes += BUF_PAGE_SIZE; + cpu_buffer->read_bytes += rb_page_commit(reader); /* swap the pages */ rb_init_page(bpage); -- cgit v1.2.3 From 2d5780bbef8dbe6375d481cbea212606a80e4453 Mon Sep 17 00:00:00 2001 From: Petr Tesarik Date: Tue, 26 Sep 2023 20:55:56 +0200 Subject: swiotlb: fix the check whether a device has used software IO TLB When CONFIG_SWIOTLB_DYNAMIC=y, devices which do not use the software IO TLB can avoid swiotlb lookup. A flag is added by commit 1395706a1490 ("swiotlb: search the software IO TLB only if the device makes use of it"), the flag is correctly set, but it is then never checked. Add the actual check here. Note that this code is an alternative to the default pool check, not an additional check, because: 1. swiotlb_find_pool() also searches the default pool; 2. if dma_uses_io_tlb is false, the default swiotlb pool is not used. Tested in a KVM guest against a QEMU RAM-backed SATA disk over virtio and *not* using software IO TLB, this patch increases IOPS by approx 2% for 4-way parallel I/O. The write memory barrier in swiotlb_dyn_alloc() is not needed, because a newly allocated pool must always be observed by swiotlb_find_slots() before an address from that pool is passed to is_swiotlb_buffer(). Correctness was verified using the following litmus test: C swiotlb-new-pool (* * Result: Never * * Check that a newly allocated pool is always visible when the * corresponding swiotlb buffer is visible. *) { mem_pools = default; } P0(int **mem_pools, int *pool) { /* add_mem_pool() */ WRITE_ONCE(*pool, 999); rcu_assign_pointer(*mem_pools, pool); } P1(int **mem_pools, int *flag, int *buf) { /* swiotlb_find_slots() */ int *r0; int r1; rcu_read_lock(); r0 = READ_ONCE(*mem_pools); r1 = READ_ONCE(*r0); rcu_read_unlock(); if (r1) { WRITE_ONCE(*flag, 1); smp_mb(); } /* device driver (presumed) */ WRITE_ONCE(*buf, r1); } P2(int **mem_pools, int *flag, int *buf) { /* device driver (presumed) */ int r0 = READ_ONCE(*buf); /* is_swiotlb_buffer() */ int r1; int *r2; int r3; smp_rmb(); r1 = READ_ONCE(*flag); if (r1) { /* swiotlb_find_pool() */ rcu_read_lock(); r2 = READ_ONCE(*mem_pools); r3 = READ_ONCE(*r2); rcu_read_unlock(); } } exists (2:r0<>0 /\ 2:r3=0) (* Not found. *) Fixes: 1395706a1490 ("swiotlb: search the software IO TLB only if the device makes use of it") Reported-by: Jonathan Corbet Closes: https://lore.kernel.org/linux-iommu/87a5uz3ob8.fsf@meer.lwn.net/ Signed-off-by: Petr Tesarik Reviewed-by: Catalin Marinas Signed-off-by: Christoph Hellwig --- kernel/dma/swiotlb.c | 26 ++++++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 85dd94323b98..01637677736f 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -728,9 +728,6 @@ static void swiotlb_dyn_alloc(struct work_struct *work) } add_mem_pool(mem, pool); - - /* Pairs with smp_rmb() in is_swiotlb_buffer(). */ - smp_wmb(); } /** @@ -1151,9 +1148,26 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr, spin_unlock_irqrestore(&dev->dma_io_tlb_lock, flags); found: - dev->dma_uses_io_tlb = true; - /* Pairs with smp_rmb() in is_swiotlb_buffer() */ - smp_wmb(); + WRITE_ONCE(dev->dma_uses_io_tlb, true); + + /* + * The general barrier orders reads and writes against a presumed store + * of the SWIOTLB buffer address by a device driver (to a driver private + * data structure). It serves two purposes. + * + * First, the store to dev->dma_uses_io_tlb must be ordered before the + * presumed store. This guarantees that the returned buffer address + * cannot be passed to another CPU before updating dev->dma_uses_io_tlb. + * + * Second, the load from mem->pools must be ordered before the same + * presumed store. This guarantees that the returned buffer address + * cannot be observed by another CPU before an update of the RCU list + * that was made by swiotlb_dyn_alloc() on a third CPU (cf. multicopy + * atomicity). + * + * See also the comment in is_swiotlb_buffer(). + */ + smp_mb(); *retpool = pool; return index; -- cgit v1.2.3 From fc09027786c900368de98d03d40af058bcb01ad9 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Sat, 23 Sep 2023 01:14:08 +0000 Subject: sched/rt: Fix live lock between select_fallback_rq() and RT push During RCU-boost testing with the TREE03 rcutorture config, I found that after a few hours, the machine locks up. On tracing, I found that there is a live lock happening between 2 CPUs. One CPU has an RT task running, while another CPU is being offlined which also has an RT task running. During this offlining, all threads are migrated. The migration thread is repeatedly scheduled to migrate actively running tasks on the CPU being offlined. This results in a live lock because select_fallback_rq() keeps picking the CPU that an RT task is already running on only to get pushed back to the CPU being offlined. It is anyway pointless to pick CPUs for pushing tasks to if they are being offlined only to get migrated away to somewhere else. This could also add unwanted latency to this task. Fix these issues by not selecting CPUs in RT if they are not 'active' for scheduling, using the cpu_active_mask. Other parts in core.c already use cpu_active_mask to prevent tasks from being put on CPUs going offline. With this fix I ran the tests for days and could not reproduce the hang. Without the patch, I hit it in a few hours. Signed-off-by: Joel Fernandes (Google) Signed-off-by: Ingo Molnar Tested-by: Paul E. McKenney Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org --- kernel/sched/cpupri.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index a286e726eb4b..42c40cfdf836 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, if (lowest_mask) { cpumask_and(lowest_mask, &p->cpus_mask, vec->mask); + cpumask_and(lowest_mask, lowest_mask, cpu_active_mask); /* * We have to ensure that we have at least one bit -- cgit v1.2.3 From f9b0e1088bbf35933e25c839b75094039059b3be Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 29 Sep 2023 22:41:20 +0200 Subject: bpf, mprog: Fix maximum program check on mprog attachment After Paul's recent improvement to syzkaller to improve coverage for bpf_mprog and tcx, it hit a splat that the program limit was surpassed. What happened is that the maximum number of progs got added, followed by another prog add request which adds with BPF_F_BEFORE flag relative to the last program in the array. The idx >= bpf_mprog_max() check in bpf_mprog_attach() still passes because the index is below the maximum but the maximum will be surpassed. We need to add a check upfront for insertions to catch this situation. Fixes: 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs") Reported-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com Reported-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com Reported-by: syzbot+2558ca3567a77b7af4e3@syzkaller.appspotmail.com Co-developed-by: Nikolay Aleksandrov Signed-off-by: Nikolay Aleksandrov Signed-off-by: Daniel Borkmann Signed-off-by: Andrii Nakryiko Tested-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com Tested-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com Link: https://github.com/google/syzkaller/pull/4207 Link: https://lore.kernel.org/bpf/20230929204121.20305-1-daniel@iogearbox.net --- kernel/bpf/mprog.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c index 32d2c4829eb8..007d98c799e2 100644 --- a/kernel/bpf/mprog.c +++ b/kernel/bpf/mprog.c @@ -253,6 +253,9 @@ int bpf_mprog_attach(struct bpf_mprog_entry *entry, goto out; } idx = tidx; + } else if (bpf_mprog_total(entry) == bpf_mprog_max()) { + ret = -ERANGE; + goto out; } if (flags & BPF_F_BEFORE) { tidx = bpf_mprog_pos_before(entry, &rtuple); -- cgit v1.2.3 From e2a8f20dd8e9df695f736e51cd9115ae55be92d1 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 26 Sep 2023 20:09:05 +0800 Subject: Crash: add lock to serialize crash hotplug handling Eric reported that handling corresponding crash hotplug event can be failed easily when many memory hotplug event are notified in a short period. They failed because failing to take __kexec_lock. ======= [ 78.714569] Fallback order for Node 0: 0 [ 78.714575] Built 1 zonelists, mobility grouping on. Total pages: 1817886 [ 78.717133] Policy zone: Normal [ 78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate [ 80.056643] PEFILE: Unsigned PE binary ======= The memory hotplug events are notified very quickly and very many, while the handling of crash hotplug is much slower relatively. So the atomic variable __kexec_lock and kexec_trylock() can't guarantee the serialization of crash hotplug handling. Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug handling specifically. This doesn't impact the usage of __kexec_lock. Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com Fixes: 247262756121 ("crash: add generic infrastructure for crash hotplug support") Signed-off-by: Baoquan He Tested-by: Eric DeVolder Reviewed-by: Eric DeVolder Reviewed-by: Valentin Schneider Cc: Sourabh Jain Cc: Signed-off-by: Andrew Morton --- kernel/crash_core.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'kernel') diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 03a7932cde0a..2f675ef045d4 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -739,6 +739,17 @@ subsys_initcall(crash_notes_memory_init); #undef pr_fmt #define pr_fmt(fmt) "crash hp: " fmt +/* + * Different than kexec/kdump loading/unloading/jumping/shrinking which + * usually rarely happen, there will be many crash hotplug events notified + * during one short period, e.g one memory board is hot added and memory + * regions are online. So mutex lock __crash_hotplug_lock is used to + * serialize the crash hotplug handling specifically. + */ +DEFINE_MUTEX(__crash_hotplug_lock); +#define crash_hotplug_lock() mutex_lock(&__crash_hotplug_lock) +#define crash_hotplug_unlock() mutex_unlock(&__crash_hotplug_lock) + /* * This routine utilized when the crash_hotplug sysfs node is read. * It reflects the kernel's ability/permission to update the crash @@ -748,9 +759,11 @@ int crash_check_update_elfcorehdr(void) { int rc = 0; + crash_hotplug_lock(); /* Obtain lock while reading crash information */ if (!kexec_trylock()) { pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n"); + crash_hotplug_unlock(); return 0; } if (kexec_crash_image) { @@ -761,6 +774,7 @@ int crash_check_update_elfcorehdr(void) } /* Release lock now that update complete */ kexec_unlock(); + crash_hotplug_unlock(); return rc; } @@ -783,9 +797,11 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) { struct kimage *image; + crash_hotplug_lock(); /* Obtain lock while changing crash information */ if (!kexec_trylock()) { pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n"); + crash_hotplug_unlock(); return; } @@ -852,6 +868,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu) out: /* Release lock now that update complete */ kexec_unlock(); + crash_hotplug_unlock(); } static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v) -- cgit v1.2.3 From 9077fc228f09c9f975c498c55f5d2e882cd0da59 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Thu, 28 Sep 2023 18:15:58 +0800 Subject: bpf: Use kmalloc_size_roundup() to adjust size_index Commit d52b59315bf5 ("bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as reported by Nathan, the adjustment is not enough, because __kmalloc_minalign() also decides the minimal alignment of slab object as shown in new_kmalloc_cache() and its value may be greater than KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM). Instead of invoking __kmalloc_minalign() in bpf subsystem to find the maximal alignment, just using kmalloc_size_roundup() directly to get the corresponding slab object size for each allocation size. If these two sizes are unmatched, adjust size_index to select a bpf_mem_cache with unit_size equal to the object_size of the underlying slab cache for the allocation size. Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Nathan Chancellor Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/ Signed-off-by: Hou Tao Tested-by: Emil Renner Berthing Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/memalloc.c | 44 +++++++++++++++++++------------------------- 1 file changed, 19 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index cf1941516643..d93ddac283d4 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -965,37 +965,31 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags) return !ret ? NULL : ret + LLIST_NODE_SZ; } -/* Most of the logic is taken from setup_kmalloc_cache_index_table() */ static __init int bpf_mem_cache_adjust_size(void) { - unsigned int size, index; + unsigned int size; - /* Normally KMALLOC_MIN_SIZE is 8-bytes, but it can be - * up-to 256-bytes. + /* Adjusting the indexes in size_index() according to the object_size + * of underlying slab cache, so bpf_mem_alloc() will select a + * bpf_mem_cache with unit_size equal to the object_size of + * the underlying slab cache. + * + * The maximal value of KMALLOC_MIN_SIZE and __kmalloc_minalign() is + * 256-bytes, so only do adjustment for [8-bytes, 192-bytes]. */ - size = KMALLOC_MIN_SIZE; - if (size <= 192) - index = size_index[(size - 1) / 8]; - else - index = fls(size - 1) - 1; - for (size = 8; size < KMALLOC_MIN_SIZE && size <= 192; size += 8) - size_index[(size - 1) / 8] = index; + for (size = 192; size >= 8; size -= 8) { + unsigned int kmalloc_size, index; - /* The minimal alignment is 64-bytes, so disable 96-bytes cache and - * use 128-bytes cache instead. - */ - if (KMALLOC_MIN_SIZE >= 64) { - index = size_index[(128 - 1) / 8]; - for (size = 64 + 8; size <= 96; size += 8) - size_index[(size - 1) / 8] = index; - } + kmalloc_size = kmalloc_size_roundup(size); + if (kmalloc_size == size) + continue; - /* The minimal alignment is 128-bytes, so disable 192-bytes cache and - * use 256-bytes cache instead. - */ - if (KMALLOC_MIN_SIZE >= 128) { - index = fls(256 - 1) - 1; - for (size = 128 + 8; size <= 192; size += 8) + if (kmalloc_size <= 192) + index = size_index[(kmalloc_size - 1) / 8]; + else + index = fls(kmalloc_size - 1) - 1; + /* Only overwrite if necessary */ + if (size_index[(size - 1) / 8] != index) size_index[(size - 1) / 8] = index; } -- cgit v1.2.3 From 1e0cb399c7653462d9dadf8ab9425337c355d358 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (Google)" Date: Fri, 29 Sep 2023 18:01:13 -0400 Subject: ring-buffer: Update "shortest_full" in polling It was discovered that the ring buffer polling was incorrectly stating that read would not block, but that's because polling did not take into account that reads will block if the "buffer-percent" was set. Instead, the ring buffer polling would say reads would not block if there was any data in the ring buffer. This was incorrect behavior from a user space point of view. This was fixed by commit 42fb0a1e84ff by having the polling code check if the ring buffer had more data than what the user specified "buffer percent" had. The problem now is that the polling code did not register itself to the writer that it wanted to wait for a specific "full" value of the ring buffer. The result was that the writer would wake the polling waiter whenever there was a new event. The polling waiter would then wake up, see that there's not enough data in the ring buffer to notify user space and then go back to sleep. The next event would wake it up again. Before the polling fix was added, the code would wake up around 100 times for a hackbench 30 benchmark. After the "fix", due to the constant waking of the writer, it would wake up over 11,0000 times! It would never leave the kernel, so the user space behavior was still "correct", but this definitely is not the desired effect. To fix this, have the polling code add what it's waiting for to the "shortest_full" variable, to tell the writer not to wake it up if the buffer is not as full as it expects to be. Note, after this fix, it appears that the waiter is now woken up around 2x the times it was before (~200). This is a tremendous improvement from the 11,000 times, but I will need to spend some time to see why polling is more aggressive in its wakeups than the read blocking code. Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu Cc: Mark Rutland Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark") Reported-by: Julia Lawall Tested-by: Julia Lawall Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 28daf0ce95c5..515cafdb18d9 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1137,6 +1137,9 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, if (full) { poll_wait(filp, &work->full_waiters, poll_table); work->full_waiters_pending = true; + if (!cpu_buffer->shortest_full || + cpu_buffer->shortest_full > full) + cpu_buffer->shortest_full = full; } else { poll_wait(filp, &work->waiters, poll_table); work->waiters_pending = true; -- cgit v1.2.3 From 23cce5f25491968b23fb9c399bbfb25f13870cd9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Cl=C3=A9ment=20L=C3=A9ger?= Date: Fri, 29 Sep 2023 21:16:37 +0200 Subject: tracing: relax trace_event_eval_update() execution with cond_resched() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When kernel is compiled without preemption, the eval_map_work_func() (which calls trace_event_eval_update()) will not be preempted up to its complete execution. This can actually cause a problem since if another CPU call stop_machine(), the call will have to wait for the eval_map_work_func() function to finish executing in the workqueue before being able to be scheduled. This problem was observe on a SMP system at boot time, when the CPU calling the initcalls executed clocksource_done_booting() which in the end calls stop_machine(). We observed a 1 second delay because one CPU was executing eval_map_work_func() and was not preempted by the stop_machine() task. Adding a call to cond_resched() in trace_event_eval_update() allows other tasks to be executed and thus continue working asynchronously like before without blocking any pending task at boot time. Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com Cc: Masami Hiramatsu Signed-off-by: Clément Léger Tested-by: Atish Patra Reviewed-by: Atish Patra Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_events.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 91951d038ba4..f49d6ddb6342 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -2770,6 +2770,7 @@ void trace_event_eval_update(struct trace_eval_map **map, int len) update_event_fields(call, map[i]); } } + cond_resched(); } up_write(&trace_event_sem); } -- cgit v1.2.3 From 2de9ee94054263940122aee8720e902b30c27930 Mon Sep 17 00:00:00 2001 From: Beau Belgrave Date: Mon, 25 Sep 2023 23:08:28 +0000 Subject: tracing/user_events: Align set_bit() address for all archs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit All architectures should use a long aligned address passed to set_bit(). User processes can pass either a 32-bit or 64-bit sized value to be updated when tracing is enabled when on a 64-bit kernel. Both cases are ensured to be naturally aligned, however, that is not enough. The address must be long aligned without affecting checks on the value within the user process which require different adjustments for the bit for little and big endian CPUs. Add a compat flag to user_event_enabler that indicates when a 32-bit value is being used on a 64-bit kernel. Long align addresses and correct the bit to be used by set_bit() to account for this alignment. Ensure compat flags are copied during forks and used during deletion clears. Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/ Cc: stable@vger.kernel.org Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement") Reported-by: Clément Léger Suggested-by: Clément Léger Signed-off-by: Beau Belgrave Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_events_user.c | 58 +++++++++++++++++++++++++++++++++++----- 1 file changed, 51 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index 6f046650e527..b87f41187c6a 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -127,8 +127,13 @@ struct user_event_enabler { /* Bit 7 is for freeing status of enablement */ #define ENABLE_VAL_FREEING_BIT 7 -/* Only duplicate the bit value */ -#define ENABLE_VAL_DUP_MASK ENABLE_VAL_BIT_MASK +/* Bit 8 is for marking 32-bit on 64-bit */ +#define ENABLE_VAL_32_ON_64_BIT 8 + +#define ENABLE_VAL_COMPAT_MASK (1 << ENABLE_VAL_32_ON_64_BIT) + +/* Only duplicate the bit and compat values */ +#define ENABLE_VAL_DUP_MASK (ENABLE_VAL_BIT_MASK | ENABLE_VAL_COMPAT_MASK) #define ENABLE_BITOPS(e) (&(e)->values) @@ -174,6 +179,30 @@ struct user_event_validator { int flags; }; +static inline void align_addr_bit(unsigned long *addr, int *bit, + unsigned long *flags) +{ + if (IS_ALIGNED(*addr, sizeof(long))) { +#ifdef __BIG_ENDIAN + /* 32 bit on BE 64 bit requires a 32 bit offset when aligned. */ + if (test_bit(ENABLE_VAL_32_ON_64_BIT, flags)) + *bit += 32; +#endif + return; + } + + *addr = ALIGN_DOWN(*addr, sizeof(long)); + + /* + * We only support 32 and 64 bit values. The only time we need + * to align is a 32 bit value on a 64 bit kernel, which on LE + * is always 32 bits, and on BE requires no change when unaligned. + */ +#ifdef __LITTLE_ENDIAN + *bit += 32; +#endif +} + typedef void (*user_event_func_t) (struct user_event *user, struct iov_iter *i, void *tpdata, bool *faulted); @@ -482,6 +511,7 @@ static int user_event_enabler_write(struct user_event_mm *mm, unsigned long *ptr; struct page *page; void *kaddr; + int bit = ENABLE_BIT(enabler); int ret; lockdep_assert_held(&event_mutex); @@ -497,6 +527,8 @@ static int user_event_enabler_write(struct user_event_mm *mm, test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler)))) return -EBUSY; + align_addr_bit(&uaddr, &bit, ENABLE_BITOPS(enabler)); + ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT, &page, NULL); @@ -515,9 +547,9 @@ static int user_event_enabler_write(struct user_event_mm *mm, /* Update bit atomically, user tracers must be atomic as well */ if (enabler->event && enabler->event->status) - set_bit(ENABLE_BIT(enabler), ptr); + set_bit(bit, ptr); else - clear_bit(ENABLE_BIT(enabler), ptr); + clear_bit(bit, ptr); kunmap_local(kaddr); unpin_user_pages_dirty_lock(&page, 1, true); @@ -849,6 +881,12 @@ static struct user_event_enabler enabler->event = user; enabler->addr = uaddr; enabler->values = reg->enable_bit; + +#if BITS_PER_LONG >= 64 + if (reg->enable_size == 4) + set_bit(ENABLE_VAL_32_ON_64_BIT, ENABLE_BITOPS(enabler)); +#endif + retry: /* Prevents state changes from racing with new enablers */ mutex_lock(&event_mutex); @@ -2377,7 +2415,8 @@ static long user_unreg_get(struct user_unreg __user *ureg, } static int user_event_mm_clear_bit(struct user_event_mm *user_mm, - unsigned long uaddr, unsigned char bit) + unsigned long uaddr, unsigned char bit, + unsigned long flags) { struct user_event_enabler enabler; int result; @@ -2385,7 +2424,7 @@ static int user_event_mm_clear_bit(struct user_event_mm *user_mm, memset(&enabler, 0, sizeof(enabler)); enabler.addr = uaddr; - enabler.values = bit; + enabler.values = bit | flags; retry: /* Prevents state changes from racing with new enablers */ mutex_lock(&event_mutex); @@ -2415,6 +2454,7 @@ static long user_events_ioctl_unreg(unsigned long uarg) struct user_event_mm *mm = current->user_event_mm; struct user_event_enabler *enabler, *next; struct user_unreg reg; + unsigned long flags; long ret; ret = user_unreg_get(ureg, ®); @@ -2425,6 +2465,7 @@ static long user_events_ioctl_unreg(unsigned long uarg) if (!mm) return -ENOENT; + flags = 0; ret = -ENOENT; /* @@ -2441,6 +2482,9 @@ static long user_events_ioctl_unreg(unsigned long uarg) ENABLE_BIT(enabler) == reg.disable_bit) { set_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler)); + /* We must keep compat flags for the clear */ + flags |= enabler->values & ENABLE_VAL_COMPAT_MASK; + if (!test_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler))) user_event_enabler_destroy(enabler, true); @@ -2454,7 +2498,7 @@ static long user_events_ioctl_unreg(unsigned long uarg) /* Ensure bit is now cleared for user, regardless of event status */ if (!ret) ret = user_event_mm_clear_bit(mm, reg.disable_addr, - reg.disable_bit); + reg.disable_bit, flags); return ret; } -- cgit v1.2.3 From 2f2fc17bab0011430ceb6f2dc1959e7d1f981444 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 15 Sep 2023 00:48:55 +0200 Subject: sched/eevdf: Also update slice on placement Tasks that never consume their full slice would not update their slice value. This means that tasks that are spawned before the sysctl scaling keep their original (UP) slice length. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20230915124822.847197830@noisy.programming.kicks-ass.net --- kernel/sched/fair.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index cb225921bbca..7d73652acbb2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4919,10 +4919,12 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {} static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { - u64 vslice = calc_delta_fair(se->slice, se); - u64 vruntime = avg_vruntime(cfs_rq); + u64 vslice, vruntime = avg_vruntime(cfs_rq); s64 lag = 0; + se->slice = sysctl_sched_base_slice; + vslice = calc_delta_fair(se->slice, se); + /* * Due to how V is constructed as the weighted average of entities, * adding tasks with positive lag, or removing tasks with negative lag -- cgit v1.2.3 From 650cad561cce04b62a8c8e0446b685ef171bc3bb Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 26 Sep 2023 14:29:50 +0200 Subject: sched/eevdf: Fix avg_vruntime() The expectation is that placing a task at avg_vruntime() makes it eligible. Turns out there is a corner case where this is not the case. Specifically, avg_vruntime() relies on the fact that integer division is a flooring function (eg. it discards the remainder). By this property the value returned is slightly left of the true average. However! when the average is a negative (relative to min_vruntime) the effect is flipped and it becomes a ceil, with the result that the returned value is just right of the average and thus not eligible. Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime") Signed-off-by: Peter Zijlstra (Intel) --- kernel/sched/fair.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7d73652acbb2..ef7490c4b8b4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -664,6 +664,10 @@ void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta; } +/* + * Specifically: avg_runtime() + 0 must result in entity_eligible() := true + * For this to be so, the result of this function must have a left bias. + */ u64 avg_vruntime(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; @@ -677,8 +681,12 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq) load += weight; } - if (load) + if (load) { + /* sign flips effective floor / ceil */ + if (avg < 0) + avg -= (load - 1); avg = div_s64(avg, load); + } return cfs_rq->min_vruntime + avg; } -- cgit v1.2.3 From b21f18ef964b2c71aa0b451df6d17b7bcad8280d Mon Sep 17 00:00:00 2001 From: Pavankumar Kondeti Date: Wed, 4 Oct 2023 10:31:15 +0530 Subject: PM: hibernate: Fix copying the zero bitmap to safe pages The following crash is observed 100% of the time during resume from the hibernation on a x86 QEMU system. [ 12.931887] ? __die_body+0x1a/0x60 [ 12.932324] ? page_fault_oops+0x156/0x420 [ 12.932824] ? search_exception_tables+0x37/0x50 [ 12.933389] ? fixup_exception+0x21/0x300 [ 12.933889] ? exc_page_fault+0x69/0x150 [ 12.934371] ? asm_exc_page_fault+0x26/0x30 [ 12.934869] ? get_buffer.constprop.0+0xac/0x100 [ 12.935428] snapshot_write_next+0x7c/0x9f0 [ 12.935929] ? submit_bio_noacct_nocheck+0x2c2/0x370 [ 12.936530] ? submit_bio_noacct+0x44/0x2c0 [ 12.937035] ? hib_submit_io+0xa5/0x110 [ 12.937501] load_image+0x83/0x1a0 [ 12.937919] swsusp_read+0x17f/0x1d0 [ 12.938355] ? create_basic_memory_bitmaps+0x1b7/0x240 [ 12.938967] load_image_and_restore+0x45/0xc0 [ 12.939494] software_resume+0x13c/0x180 [ 12.939994] resume_store+0xa3/0x1d0 The commit being fixed introduced a bug in copying the zero bitmap to safe pages. A temporary bitmap is allocated with PG_ANY flag in prepare_image() to make a copy of zero bitmap after the unsafe pages are marked. Freeing this temporary bitmap with PG_UNSAFE_KEEP later results in an inconsistent state of unsafe pages. Since free bit is left as is for this temporary bitmap after free, these pages are treated as unsafe pages when they are allocated again. This results in incorrect calculation of the number of pages pre-allocated for the image. nr_pages = (nr_zero_pages + nr_copy_pages) - nr_highmem - allocated_unsafe_pages; The allocate_unsafe_pages is estimated to be higher than the actual which results in running short of pages in safe_pages_list. Hence the crash is observed in get_buffer() due to NULL pointer access of safe_pages_list. Fix this issue by creating the temporary zero bitmap from safe pages (free bit not set) so that the corresponding free bits can be cleared while freeing this bitmap. Fixes: 005e8dddd497 ("PM: hibernate: don't store zero pages in the image file") Suggested-by:: Brian Geffon Signed-off-by: Pavankumar Kondeti Reviewed-by: Brian Geffon Signed-off-by: Rafael J. Wysocki --- kernel/power/snapshot.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 87e9f7e2bdc0..0f12e0a97e43 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -2647,7 +2647,7 @@ static int prepare_image(struct memory_bitmap *new_bm, struct memory_bitmap *bm, memory_bm_free(bm, PG_UNSAFE_KEEP); /* Make a copy of zero_bm so it can be created in safe pages */ - error = memory_bm_create(&tmp, GFP_ATOMIC, PG_ANY); + error = memory_bm_create(&tmp, GFP_ATOMIC, PG_SAFE); if (error) goto Free; @@ -2660,7 +2660,7 @@ static int prepare_image(struct memory_bitmap *new_bm, struct memory_bitmap *bm, goto Free; duplicate_memory_bitmap(zero_bm, &tmp); - memory_bm_free(&tmp, PG_UNSAFE_KEEP); + memory_bm_free(&tmp, PG_UNSAFE_CLEAR); /* At this point zero_bm is in safe pages and it can be used for restoring. */ if (nr_highmem > 0) { -- cgit v1.2.3 From 643445531829d89dc5ddbe0c5ee4ff8f84ce8687 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Wed, 20 Sep 2023 14:07:04 +0800 Subject: workqueue: Fix UAF report by KASAN in pwq_release_workfn() Currently, for UNBOUND wq, if the apply_wqattrs_prepare() return error, the apply_wqattr_cleanup() will be called and use the pwq_release_worker kthread to release resources asynchronously. however, the kfree(wq) is invoked directly in failure path of alloc_workqueue(), if the kfree(wq) has been executed and when the pwq_release_workfn() accesses wq, this leads to the following scenario: BUG: KASAN: slab-use-after-free in pwq_release_workfn+0x339/0x380 kernel/workqueue.c:4124 Read of size 4 at addr ffff888027b831c0 by task pool_workqueue_/3 CPU: 0 PID: 3 Comm: pool_workqueue_ Not tainted 6.5.0-rc7-next-20230825-syzkaller #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0xc4/0x620 mm/kasan/report.c:475 kasan_report+0xda/0x110 mm/kasan/report.c:588 pwq_release_workfn+0x339/0x380 kernel/workqueue.c:4124 kthread_worker_fn+0x2fc/0xa80 kernel/kthread.c:823 kthread+0x33a/0x430 kernel/kthread.c:388 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 Allocated by task 5054: kasan_save_stack+0x33/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 ____kasan_kmalloc mm/kasan/common.c:374 [inline] __kasan_kmalloc+0xa2/0xb0 mm/kasan/common.c:383 kmalloc include/linux/slab.h:599 [inline] kzalloc include/linux/slab.h:720 [inline] alloc_workqueue+0x16f/0x1490 kernel/workqueue.c:4684 kvm_mmu_init_tdp_mmu+0x23/0x100 arch/x86/kvm/mmu/tdp_mmu.c:19 kvm_mmu_init_vm+0x248/0x2e0 arch/x86/kvm/mmu/mmu.c:6180 kvm_arch_init_vm+0x39/0x720 arch/x86/kvm/x86.c:12311 kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1222 [inline] kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:5089 [inline] kvm_dev_ioctl+0xa31/0x1c20 arch/x86/kvm/../../../virt/kvm/kvm_main.c:5131 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl fs/ioctl.c:857 [inline] __x64_sys_ioctl+0x18f/0x210 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 5054: kasan_save_stack+0x33/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x40 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] ____kasan_slab_free+0x15b/0x1b0 mm/kasan/common.c:200 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1800 [inline] slab_free_freelist_hook+0x114/0x1e0 mm/slub.c:1826 slab_free mm/slub.c:3809 [inline] __kmem_cache_free+0xb8/0x2f0 mm/slub.c:3822 alloc_workqueue+0xe76/0x1490 kernel/workqueue.c:4746 kvm_mmu_init_tdp_mmu+0x23/0x100 arch/x86/kvm/mmu/tdp_mmu.c:19 kvm_mmu_init_vm+0x248/0x2e0 arch/x86/kvm/mmu/mmu.c:6180 kvm_arch_init_vm+0x39/0x720 arch/x86/kvm/x86.c:12311 kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1222 [inline] kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:5089 [inline] kvm_dev_ioctl+0xa31/0x1c20 arch/x86/kvm/../../../virt/kvm/kvm_main.c:5131 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl fs/ioctl.c:857 [inline] __x64_sys_ioctl+0x18f/0x210 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd This commit therefore flush pwq_release_worker in the alloc_and_link_pwqs() before invoke kfree(wq). Reported-by: syzbot+60db9f652c92d5bacba4@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=60db9f652c92d5bacba4 Signed-off-by: Zqiang Signed-off-by: Tejun Heo --- kernel/workqueue.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index b9f053a5a5f0..ebe24a5e1435 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4600,6 +4600,12 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq) } cpus_read_unlock(); + /* for unbound pwq, flush the pwq_release_worker ensures that the + * pwq_release_workfn() completes before calling kfree(wq). + */ + if (ret) + kthread_flush_worker(pwq_release_worker); + return ret; enomem: -- cgit v1.2.3 From 9e0bc36ab07c550d791bf17feeb479f1dfc42d89 Mon Sep 17 00:00:00 2001 From: Xuewen Yan Date: Wed, 19 Jul 2023 21:05:27 +0800 Subject: cpufreq: schedutil: Update next_freq when cpufreq_limits change When cpufreq's policy is 'single', there is a scenario that will cause sg_policy's next_freq to be unable to update. When the CPU's util is always max, the cpufreq will be max, and then if we change the policy's scaling_max_freq to be a lower freq, indeed, the sg_policy's next_freq need change to be the lower freq, however, because the cpu_is_busy, the next_freq would keep the max_freq. For example: The cpu7 is a single CPU: unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737 pid 4737's current affinity mask: ff pid 4737's new affinity mask: 80 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq 2301000 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq 2301000 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq 2171000 At this time, the sg_policy's next_freq would stay at 2301000, which is wrong. To fix this, add a check for the ->need_freq_update flag. [ mingo: Clarified the changelog. ] Co-developed-by: Guohua Yan Signed-off-by: Xuewen Yan Signed-off-by: Guohua Yan Signed-off-by: Ingo Molnar Acked-by: "Rafael J. Wysocki" Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com --- kernel/sched/cpufreq_schedutil.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 4492608b7d7f..458d359f5991 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -350,7 +350,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time, * Except when the rq is capped by uclamp_max. */ if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) && - sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq) { + sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq && + !sg_policy->need_freq_update) { next_f = sg_policy->next_freq; /* Restore cached freq as next_freq has changed */ -- cgit v1.2.3 From a4fe78386afb94780f8e6fcd10a67c4d4dfe4da8 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Sat, 7 Oct 2023 00:06:49 +0200 Subject: bpf: Fix BPF_PROG_QUERY last field check While working on the ebpf-go [0] library integration for bpf_mprog and tcx, Lorenz noticed that two subsequent BPF_PROG_QUERY requests currently fail. A typical workflow is to first gather the bpf_mprog count without passing program/ link arrays, followed by the second request which contains the actual array pointers. The initial call populates count and revision fields. The second call gets rejected due to a BPF_PROG_QUERY_LAST_FIELD bug which should point to query.revision instead of query.link_attach_flags since the former is really the last member. It was not noticed in libbpf as bpf_prog_query_opts() always calls bpf(2) with an on-stack bpf_attr that is memset() each time (and therefore query.revision was reset to zero). [0] https://ebpf-go.dev Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Reported-by: Lorenz Bauer Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/r/20231006220655.1653-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau --- kernel/bpf/syscall.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index eb01c31ed591..453a43695a23 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -3913,7 +3913,7 @@ static int bpf_prog_detach(const union bpf_attr *attr) return ret; } -#define BPF_PROG_QUERY_LAST_FIELD query.link_attach_flags +#define BPF_PROG_QUERY_LAST_FIELD query.revision static int bpf_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) -- cgit v1.2.3 From edfa9af0a73ecc2000d7bb81d0b0fd3158cc9a65 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Sat, 7 Oct 2023 00:06:50 +0200 Subject: bpf: Handle bpf_mprog_query with NULL entry Improve consistency for bpf_mprog_query() API and let the latter also handle a NULL entry as can be the case for tcx. Instead of returning -ENOENT, we copy a count of 0 and revision of 1 to user space, so that this can be fed into a subsequent bpf_mprog_attach() call as expected_revision. A BPF self- test as part of this series has been added to assert this case. Suggested-by: Lorenz Bauer Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/r/20231006220655.1653-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau --- kernel/bpf/mprog.c | 10 ++++++---- kernel/bpf/tcx.c | 8 +------- 2 files changed, 7 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/mprog.c b/kernel/bpf/mprog.c index 007d98c799e2..1394168062e8 100644 --- a/kernel/bpf/mprog.c +++ b/kernel/bpf/mprog.c @@ -401,14 +401,16 @@ int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr, struct bpf_mprog_cp *cp; struct bpf_prog *prog; const u32 flags = 0; + u32 id, count = 0; + u64 revision = 1; int i, ret = 0; - u32 id, count; - u64 revision; if (attr->query.query_flags || attr->query.attach_flags) return -EINVAL; - revision = bpf_mprog_revision(entry); - count = bpf_mprog_total(entry); + if (entry) { + revision = bpf_mprog_revision(entry); + count = bpf_mprog_total(entry); + } if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags))) return -EFAULT; if (copy_to_user(&uattr->query.revision, &revision, sizeof(revision))) diff --git a/kernel/bpf/tcx.c b/kernel/bpf/tcx.c index 13f0b5dc8262..1338a13a8b64 100644 --- a/kernel/bpf/tcx.c +++ b/kernel/bpf/tcx.c @@ -123,7 +123,6 @@ int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) { bool ingress = attr->query.attach_type == BPF_TCX_INGRESS; struct net *net = current->nsproxy->net_ns; - struct bpf_mprog_entry *entry; struct net_device *dev; int ret; @@ -133,12 +132,7 @@ int tcx_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) ret = -ENODEV; goto out; } - entry = tcx_entry_fetch(dev, ingress); - if (!entry) { - ret = -ENOENT; - goto out; - } - ret = bpf_mprog_query(attr, uattr, entry); + ret = bpf_mprog_query(attr, uattr, tcx_entry_fetch(dev, ingress)); out: rtnl_unlock(); return ret; -- cgit v1.2.3 From ba62d61128bda71fd02622f320ac59d861fc4baa Mon Sep 17 00:00:00 2001 From: Lorenz Bauer Date: Sat, 7 Oct 2023 00:06:51 +0200 Subject: bpf: Refuse unused attributes in bpf_prog_{attach,detach} The recently added tcx attachment extended the BPF UAPI for attaching and detaching by a couple of fields. Those fields are currently only supported for tcx, other types like cgroups and flow dissector silently ignore the new fields except for the new flags. This is problematic once we extend bpf_mprog to older attachment types, since it's hard to figure out whether the syscall really was successful if the kernel silently ignores non-zero values. Explicitly reject non-zero fields relevant to bpf_mprog for attachment types which don't use the latter yet. Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Signed-off-by: Lorenz Bauer Co-developed-by: Daniel Borkmann Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/r/20231006220655.1653-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau --- kernel/bpf/syscall.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 453a43695a23..d77b2f8b9364 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -3796,7 +3796,6 @@ static int bpf_prog_attach(const union bpf_attr *attr) { enum bpf_prog_type ptype; struct bpf_prog *prog; - u32 mask; int ret; if (CHECK_ATTR(BPF_PROG_ATTACH)) @@ -3805,10 +3804,16 @@ static int bpf_prog_attach(const union bpf_attr *attr) ptype = attach_type_to_prog_type(attr->attach_type); if (ptype == BPF_PROG_TYPE_UNSPEC) return -EINVAL; - mask = bpf_mprog_supported(ptype) ? - BPF_F_ATTACH_MASK_MPROG : BPF_F_ATTACH_MASK_BASE; - if (attr->attach_flags & ~mask) - return -EINVAL; + if (bpf_mprog_supported(ptype)) { + if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG) + return -EINVAL; + } else { + if (attr->attach_flags & ~BPF_F_ATTACH_MASK_BASE) + return -EINVAL; + if (attr->relative_fd || + attr->expected_revision) + return -EINVAL; + } prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype); if (IS_ERR(prog)) @@ -3878,6 +3883,10 @@ static int bpf_prog_detach(const union bpf_attr *attr) if (IS_ERR(prog)) return PTR_ERR(prog); } + } else if (attr->attach_flags || + attr->relative_fd || + attr->expected_revision) { + return -EINVAL; } switch (ptype) { -- cgit v1.2.3 From 8dafa9d0eb1a1550a0f4d462db9354161bc51e0c Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 6 Oct 2023 21:24:45 +0200 Subject: sched/eevdf: Fix min_deadline heap integrity Marek and Biju reported instances of: "EEVDF scheduling fail, picking leftmost" which Mike correlated with cgroup scheduling and the min_deadline heap getting corrupted; some trace output confirms: > And yeah, min_deadline is hosed somehow: > > validate_cfs_rq: --- / > __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr) > __print_se: ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31) > __print_se: ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33) > validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256 Turns out that reweight_entity(), which tries really hard to be fast, does not do the normal dequeue+update+enqueue pattern but *does* scale the deadline. However, it then fails to propagate the updated deadline value up the heap. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Reported-by: Marek Szyprowski Reported-by: Biju Das Reported-by: Mike Galbraith Signed-off-by: Peter Zijlstra (Intel) Tested-by: Marek Szyprowski Tested-by: Biju Das Tested-by: Mike Galbraith Link: https://lkml.kernel.org/r/20231006192445.GE743@noisy.programming.kicks-ass.net --- kernel/sched/fair.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ef7490c4b8b4..a4b904a010c6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3613,6 +3613,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, */ deadline = div_s64(deadline * old_weight, weight); se->deadline = se->vruntime + deadline; + min_deadline_cb_propagate(&se->run_node, NULL); } #ifdef CONFIG_SMP -- cgit v1.2.3 From b01db23d5923a35023540edc4f0c5f019e11ac7d Mon Sep 17 00:00:00 2001 From: Benjamin Segall Date: Fri, 29 Sep 2023 17:09:30 -0700 Subject: sched/eevdf: Fix pick_eevdf() The old pick_eevdf() could fail to find the actual earliest eligible deadline when it descended to the right looking for min_deadline, but it turned out that that min_deadline wasn't actually eligible. In that case we need to go back and search through any left branches we skipped looking for the actual best _eligible_ min_deadline. This is more expensive, but still O(log n), and at worst should only involve descending two branches of the rbtree. I've run this through a userspace stress test (thank you tools/lib/rbtree.c), so hopefully this implementation doesn't miss any corner cases. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Ben Segall Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com --- kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 58 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a4b904a010c6..061a30a8925a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -872,14 +872,16 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq) * * Which allows an EDF like search on (sub)trees. */ -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq) +static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq) { struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node; struct sched_entity *curr = cfs_rq->curr; struct sched_entity *best = NULL; + struct sched_entity *best_left = NULL; if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) curr = NULL; + best = curr; /* * Once selected, run a task until it either becomes non-eligible or @@ -900,33 +902,75 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq) } /* - * If this entity has an earlier deadline than the previous - * best, take this one. If it also has the earliest deadline - * of its subtree, we're done. + * Now we heap search eligible trees for the best (min_)deadline */ - if (!best || deadline_gt(deadline, best, se)) { + if (!best || deadline_gt(deadline, best, se)) best = se; - if (best->deadline == best->min_deadline) - break; - } /* - * If the earlest deadline in this subtree is in the fully - * eligible left half of our space, go there. + * Every se in a left branch is eligible, keep track of the + * branch with the best min_deadline */ + if (node->rb_left) { + struct sched_entity *left = __node_2_se(node->rb_left); + + if (!best_left || deadline_gt(min_deadline, best_left, left)) + best_left = left; + + /* + * min_deadline is in the left branch. rb_left and all + * descendants are eligible, so immediately switch to the second + * loop. + */ + if (left->min_deadline == se->min_deadline) + break; + } + + /* min_deadline is at this node, no need to look right */ + if (se->deadline == se->min_deadline) + break; + + /* else min_deadline is in the right branch. */ + node = node->rb_right; + } + + /* + * We ran into an eligible node which is itself the best. + * (Or nr_running == 0 and both are NULL) + */ + if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0) + return best; + + /* + * Now best_left and all of its children are eligible, and we are just + * looking for deadline == min_deadline + */ + node = &best_left->run_node; + while (node) { + struct sched_entity *se = __node_2_se(node); + + /* min_deadline is the current node */ + if (se->deadline == se->min_deadline) + return se; + + /* min_deadline is in the left branch */ if (node->rb_left && __node_2_se(node->rb_left)->min_deadline == se->min_deadline) { node = node->rb_left; continue; } + /* else min_deadline is in the right branch */ node = node->rb_right; } + return NULL; +} - if (!best || (curr && deadline_gt(deadline, best, curr))) - best = curr; +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq) +{ + struct sched_entity *se = __pick_eevdf(cfs_rq); - if (unlikely(!best)) { + if (!se) { struct sched_entity *left = __pick_first_entity(cfs_rq); if (left) { pr_err("EEVDF scheduling fail, picking leftmost\n"); @@ -934,7 +978,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq) } } - return best; + return se; } #ifdef CONFIG_SCHED_DEBUG -- cgit v1.2.3 From 054c22bd784d6953ac85f545bed4a2a27b0e4ddb Mon Sep 17 00:00:00 2001 From: John Ogness Date: Fri, 6 Oct 2023 10:21:50 +0200 Subject: printk: flush consoles before checking progress Commit 9e70a5e109a4 ("printk: Add per-console suspended state") removed console lock usage during resume and replaced it with the clearly defined console_list_lock and srcu mechanisms. However, the console lock usage had an important side-effect of flushing the consoles. After its removal, consoles were no longer flushed before checking their progress. Add the console_lock/console_unlock dance to the beginning of __pr_flush() to actually flush the consoles before checking their progress. Also add comments to clarify this additional usage of the console lock. Note that console_unlock() does not guarantee flushing all messages since the commit dbdda842fe96f89 ("printk: Add console owner and waiter logic to load balance console writes"). Reported-by: Todd Brandt Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217955 Fixes: 9e70a5e109a4 ("printk: Add per-console suspended state") Co-developed-by: Petr Mladek Signed-off-by: Petr Mladek Signed-off-by: John Ogness Link: https://lore.kernel.org/r/20231006082151.6969-2-pmladek@suse.com --- kernel/printk/printk.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 8787d3a72114..1919899b6897 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -3738,12 +3738,18 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre seq = prb_next_seq(prb); + /* Flush the consoles so that records up to @seq are printed. */ + console_lock(); + console_unlock(); + for (;;) { diff = 0; /* * Hold the console_lock to guarantee safe access to - * console->seq. + * console->seq. Releasing console_lock flushes more + * records in case @seq is still not printed on all + * usable consoles. */ console_lock(); -- cgit v1.2.3 From 1ca0b605150501b7dc59f3016271da4eb3e96fce Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Michal=20Koutn=C3=BD?= Date: Mon, 9 Oct 2023 15:58:11 +0200 Subject: cgroup: Remove duplicates in cgroup v1 tasks file MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit One PID may appear multiple times in a preloaded pidlist. (Possibly due to PID recycling but we have reports of the same task_struct appearing with different PIDs, thus possibly involving transfer of PID via de_thread().) Because v1 seq_file iterator uses PIDs as position, it leads to a message: > seq_file: buggy .next function kernfs_seq_next did not update position index Conservative and quick fix consists of removing duplicates from `tasks` file (as opposed to removing pidlists altogether). It doesn't affect correctness (it's sufficient to show a PID once), performance impact would be hidden by unconditional sorting of the pidlist already in place (asymptotically). Link: https://lore.kernel.org/r/20230823174804.23632-1-mkoutny@suse.com/ Suggested-by: Firo Yang Signed-off-by: Michal Koutný Signed-off-by: Tejun Heo Cc: stable@vger.kernel.org --- kernel/cgroup/cgroup-v1.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index c487ffef6652..76db6c67e39a 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -360,10 +360,9 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type, } css_task_iter_end(&it); length = n; - /* now sort & (if procs) strip out duplicates */ + /* now sort & strip out duplicates (tgids or recycled thread PIDs) */ sort(array, length, sizeof(pid_t), cmppid, NULL); - if (type == CGROUP_FILE_PROCS) - length = pidlist_uniq(array, length); + length = pidlist_uniq(array, length); l = cgroup_pidlist_find_create(cgrp, type); if (!l) { -- cgit v1.2.3 From 829955981c557c7fc7416581c4cd68a8a0c28620 Mon Sep 17 00:00:00 2001 From: David Vernet Date: Mon, 9 Oct 2023 11:14:13 -0500 Subject: bpf: Fix verifier log for async callback return values The verifier, as part of check_return_code(), verifies that async callbacks such as from e.g. timers, will return 0. It does this by correctly checking that R0->var_off is in tnum_const(0), which effectively checks that it's in a range of 0. If this condition fails, however, it prints an error message which says that the value should have been in (0x0; 0x1). This results in possibly confusing output such as the following in which an async callback returns 1: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1) The fix is easy -- we should just pass the tnum_const(0) as the correct range to verbose_invalid_scalar(), which will then print the following: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0) Fixes: bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.") Signed-off-by: David Vernet Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20231009161414.235829-1-void@manifault.com --- kernel/bpf/verifier.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index c0c7d137066a..873ade146f3d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -14479,7 +14479,7 @@ static int check_return_code(struct bpf_verifier_env *env) struct tnum enforce_attach_type_range = tnum_unknown; const struct bpf_prog *prog = env->prog; struct bpf_reg_state *reg; - struct tnum range = tnum_range(0, 1); + struct tnum range = tnum_range(0, 1), const_0 = tnum_const(0); enum bpf_prog_type prog_type = resolve_prog_type(env->prog); int err; struct bpf_func_state *frame = env->cur_state->frame[0]; @@ -14527,8 +14527,8 @@ static int check_return_code(struct bpf_verifier_env *env) return -EINVAL; } - if (!tnum_in(tnum_const(0), reg->var_off)) { - verbose_invalid_scalar(env, reg, &range, "async callback", "R0"); + if (!tnum_in(const_0, reg->var_off)) { + verbose_invalid_scalar(env, reg, &const_0, "async callback", "R0"); return -EINVAL; } return 0; -- cgit v1.2.3 From 7b42f401fc6571b6604441789d892d440829e33c Mon Sep 17 00:00:00 2001 From: Zqiang Date: Wed, 11 Oct 2023 16:27:59 +0800 Subject: workqueue: Use the kmem_cache_free() instead of kfree() to release pwq Currently, the kfree() be used for pwq objects allocated with kmem_cache_alloc() in alloc_and_link_pwqs(), this isn't wrong. but usually, use "trace_kmem_cache_alloc/trace_kmem_cache_free" to track memory allocation and free. this commit therefore use kmem_cache_free() instead of kfree() in alloc_and_link_pwqs() and also consistent with release of the pwq in rcu_free_pwq(). Signed-off-by: Zqiang Signed-off-by: Tejun Heo --- kernel/workqueue.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index ebe24a5e1435..6f74cab2bd5a 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4610,8 +4610,12 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq) enomem: if (wq->cpu_pwq) { - for_each_possible_cpu(cpu) - kfree(*per_cpu_ptr(wq->cpu_pwq, cpu)); + for_each_possible_cpu(cpu) { + struct pool_workqueue *pwq = *per_cpu_ptr(wq->cpu_pwq, cpu); + + if (pwq) + kmem_cache_free(pwq_cache, pwq); + } free_percpu(wq->cpu_pwq); wq->cpu_pwq = NULL; } -- cgit v1.2.3 From ca10d851b9ad0338c19e8e3089e24d565ebfffd7 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 10 Oct 2023 22:48:42 -0400 Subject: workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask() Commit 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") enabled implicit ordered attribute to be added to WQ_UNBOUND workqueues with max_active of 1. This prevented the changing of attributes to these workqueues leading to fix commit 0a94efb5acbb ("workqueue: implicit ordered attribute should be overridable"). However, workqueue_apply_unbound_cpumask() was not updated at that time. So sysfs changes to wq_unbound_cpumask has no effect on WQ_UNBOUND workqueues with implicit ordered attribute. Since not all WQ_UNBOUND workqueues are visible on sysfs, we are not able to make all the necessary cpumask changes even if we iterates all the workqueue cpumasks in sysfs and changing them one by one. Fix this problem by applying the corresponding change made to apply_workqueue_attrs_locked() in the fix commit to workqueue_apply_unbound_cpumask(). Fixes: 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") Signed-off-by: Waiman Long Signed-off-by: Tejun Heo --- kernel/workqueue.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 6f74cab2bd5a..51177ffe16f1 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -5792,9 +5792,13 @@ static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask) list_for_each_entry(wq, &workqueues, list) { if (!(wq->flags & WQ_UNBOUND)) continue; + /* creating multiple pwqs breaks ordering guarantee */ - if (wq->flags & __WQ_ORDERED) - continue; + if (!list_empty(&wq->pwqs)) { + if (wq->flags & __WQ_ORDERED_EXPLICIT) + continue; + wq->flags &= ~__WQ_ORDERED; + } ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs, unbound_cpumask); if (IS_ERR(ctx)) { -- cgit v1.2.3 From 5d9c7a1e3e8e18db8e10c546de648cda2a57be52 Mon Sep 17 00:00:00 2001 From: Lucy Mielke Date: Mon, 9 Oct 2023 19:09:46 +0200 Subject: workqueue: fix -Wformat-truncation in create_worker MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Compiling with W=1 emitted the following warning (Compiler: gcc (x86-64, ver. 13.2.1, .config: result of make allyesconfig, "Treat warnings as errors" turned off): kernel/workqueue.c:2188:54: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size between 5 and 14 [-Wformat-truncation=] kernel/workqueue.c:2188:50: note: directive argument in the range [0, 2147483647] kernel/workqueue.c:2188:17: note: ‘snprintf’ output between 4 and 23 bytes into a destination of size 16 setting "id_buf" to size 23 will silence the warning, since GCC determines snprintf's output to be max. 23 bytes in line 2188. Please let me know if there are any mistakes in my patch! Signed-off-by: Lucy Mielke Signed-off-by: Tejun Heo --- kernel/workqueue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 51177ffe16f1..a3522b70218d 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2166,7 +2166,7 @@ static struct worker *create_worker(struct worker_pool *pool) { struct worker *worker; int id; - char id_buf[16]; + char id_buf[23]; /* ID is needed to determine kthread name */ id = ida_alloc(&pool->worker_ida, GFP_KERNEL); -- cgit v1.2.3 From 03adc61edad49e1bbecfb53f7ea5d78f398fe368 Mon Sep 17 00:00:00 2001 From: Dan Clash Date: Thu, 12 Oct 2023 14:55:18 -0700 Subject: audit,io_uring: io_uring openat triggers audit reference count underflow An io_uring openat operation can update an audit reference count from multiple threads resulting in the call trace below. A call to io_uring_submit() with a single openat op with a flag of IOSQE_ASYNC results in the following reference count updates. These first part of the system call performs two increments that do not race. do_syscall_64() __do_sys_io_uring_enter() io_submit_sqes() io_openat_prep() __io_openat_prep() getname() getname_flags() /* update 1 (increment) */ __audit_getname() /* update 2 (increment) */ The openat op is queued to an io_uring worker thread which starts the opportunity for a race. The system call exit performs one decrement. do_syscall_64() syscall_exit_to_user_mode() syscall_exit_to_user_mode_prepare() __audit_syscall_exit() audit_reset_context() putname() /* update 3 (decrement) */ The io_uring worker thread performs one increment and two decrements. These updates can race with the system call decrement. io_wqe_worker() io_worker_handle_work() io_wq_submit_work() io_issue_sqe() io_openat() io_openat2() do_filp_open() path_openat() __audit_inode() /* update 4 (increment) */ putname() /* update 5 (decrement) */ __audit_uring_exit() audit_reset_context() putname() /* update 6 (decrement) */ The fix is to change the refcnt member of struct audit_names from int to atomic_t. kernel BUG at fs/namei.c:262! Call Trace: ... ? putname+0x68/0x70 audit_reset_context.part.0.constprop.0+0xe1/0x300 __audit_uring_exit+0xda/0x1c0 io_issue_sqe+0x1f3/0x450 ? lock_timer_base+0x3b/0xd0 io_wq_submit_work+0x8d/0x2b0 ? __try_to_del_timer_sync+0x67/0xa0 io_worker_handle_work+0x17c/0x2b0 io_wqe_worker+0x10a/0x350 Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/ Fixes: 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring") Signed-off-by: Dan Clash Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Reviewed-by: Jens Axboe Signed-off-by: Christian Brauner --- kernel/auditsc.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 21d2fa815e78..6f0d6fb6523f 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -2212,7 +2212,7 @@ __audit_reusename(const __user char *uptr) if (!n->name) continue; if (n->name->uptr == uptr) { - n->name->refcnt++; + atomic_inc(&n->name->refcnt); return n->name; } } @@ -2241,7 +2241,7 @@ void __audit_getname(struct filename *name) n->name = name; n->name_len = AUDIT_NAME_FULL; name->aname = n; - name->refcnt++; + atomic_inc(&name->refcnt); } static inline int audit_copy_fcaps(struct audit_names *name, @@ -2373,7 +2373,7 @@ out_alloc: return; if (name) { n->name = name; - name->refcnt++; + atomic_inc(&name->refcnt); } out: @@ -2500,7 +2500,7 @@ void __audit_inode_child(struct inode *parent, if (found_parent) { found_child->name = found_parent->name; found_child->name_len = AUDIT_NAME_FULL; - found_child->name->refcnt++; + atomic_inc(&found_child->name->refcnt); } } -- cgit v1.2.3 From 700b2b439766e8aab8a7174991198497345bd411 Mon Sep 17 00:00:00 2001 From: "Masami Hiramatsu (Google)" Date: Tue, 17 Oct 2023 08:49:45 +0900 Subject: fprobe: Fix to ensure the number of active retprobes is not zero The number of active retprobes can be zero but it is not acceptable, so return EINVAL error if detected. Link: https://lore.kernel.org/all/169750018550.186853.11198884812017796410.stgit@devnote2/ Reported-by: wuqiang.matt Closes: https://lore.kernel.org/all/20231016222103.cb9f426edc60220eabd8aa6a@kernel.org/ Fixes: 5b0ab78998e3 ("fprobe: Add exit_handler support") Signed-off-by: Masami Hiramatsu (Google) --- kernel/trace/fprobe.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c index 3b21f4063258..881f90f0cbcf 100644 --- a/kernel/trace/fprobe.c +++ b/kernel/trace/fprobe.c @@ -189,7 +189,7 @@ static int fprobe_init_rethook(struct fprobe *fp, int num) { int i, size; - if (num < 0) + if (num <= 0) return -EINVAL; if (!fp->exit_handler) { @@ -202,8 +202,8 @@ static int fprobe_init_rethook(struct fprobe *fp, int num) size = fp->nr_maxactive; else size = num * num_possible_cpus() * 2; - if (size < 0) - return -E2BIG; + if (size <= 0) + return -EINVAL; fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler); if (!fp->rethook) -- cgit v1.2.3 From d2929762cc3f85528b0ca12f6f63c2a714f24778 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 17 Oct 2023 16:59:47 +0200 Subject: sched/eevdf: Fix heap corruption more Because someone is a flaming idiot... and forgot we have current as se->on_rq but not actually in the tree itself, and walking rb_parent() on an entry not in the tree is 'funky' and KASAN complains. Fixes: 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap integrity") Reported-by: 0599jiangyc@gmail.com Reported-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: Peter Zijlstra (Intel) Tested-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218020 Link: https://lkml.kernel.org/r/CAJwJo6ZGXO07%3DQvW4fgQfbsDzQPs9xj5sAQ1zp%3DmAyPMNbHYww%40mail.gmail.com --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 061a30a8925a..df348aa55d3c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3657,7 +3657,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, */ deadline = div_s64(deadline * old_weight, weight); se->deadline = se->vruntime + deadline; - min_deadline_cb_propagate(&se->run_node, NULL); + if (se != cfs_rq->curr) + min_deadline_cb_propagate(&se->run_node, NULL); } #ifdef CONFIG_SMP -- cgit v1.2.3 From 32671e3799ca2e4590773fd0e63aaa4229e50c06 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 18 Oct 2023 13:56:54 +0200 Subject: perf: Disallow mis-matched inherited group reads Because group consistency is non-atomic between parent (filedesc) and children (inherited) events, it is possible for PERF_FORMAT_GROUP read() to try and sum non-matching counter groups -- with non-sensical results. Add group_generation to distinguish the case where a parent group removes and adds an event and thus has the same number, but a different configuration of events as inherited groups. This became a problem when commit fa8c269353d5 ("perf/core: Invert perf_read_group() loops") flipped the order of child_list and sibling_list. Previously it would iterate the group (sibling_list) first, and for each sibling traverse the child_list. In this order, only the group composition of the parent is relevant. By flipping the order the group composition of the child (inherited) events becomes an issue and the mis-match in group composition becomes evident. That said; even prior to this commit, while reading of a group that is not equally inherited was not broken, it still made no sense. (Ab)use ECHILD as error return to indicate issues with child process group composition. Fixes: fa8c269353d5 ("perf/core: Invert perf_read_group() loops") Reported-by: Budimir Markovic Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20231018115654.GK33217@noisy.programming.kicks-ass.net --- kernel/events/core.c | 39 +++++++++++++++++++++++++++++++++------ 1 file changed, 33 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 4c72a41f11af..d0663b9324e7 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1954,6 +1954,7 @@ static void perf_group_attach(struct perf_event *event) list_add_tail(&event->sibling_list, &group_leader->sibling_list); group_leader->nr_siblings++; + group_leader->group_generation++; perf_event__header_size(group_leader); @@ -2144,6 +2145,7 @@ static void perf_group_detach(struct perf_event *event) if (leader != event) { list_del_init(&event->sibling_list); event->group_leader->nr_siblings--; + event->group_leader->group_generation++; goto out; } @@ -5440,7 +5442,7 @@ static int __perf_read_group_add(struct perf_event *leader, u64 read_format, u64 *values) { struct perf_event_context *ctx = leader->ctx; - struct perf_event *sub; + struct perf_event *sub, *parent; unsigned long flags; int n = 1; /* skip @nr */ int ret; @@ -5450,6 +5452,33 @@ static int __perf_read_group_add(struct perf_event *leader, return ret; raw_spin_lock_irqsave(&ctx->lock, flags); + /* + * Verify the grouping between the parent and child (inherited) + * events is still in tact. + * + * Specifically: + * - leader->ctx->lock pins leader->sibling_list + * - parent->child_mutex pins parent->child_list + * - parent->ctx->mutex pins parent->sibling_list + * + * Because parent->ctx != leader->ctx (and child_list nests inside + * ctx->mutex), group destruction is not atomic between children, also + * see perf_event_release_kernel(). Additionally, parent can grow the + * group. + * + * Therefore it is possible to have parent and child groups in a + * different configuration and summing over such a beast makes no sense + * what so ever. + * + * Reject this. + */ + parent = leader->parent; + if (parent && + (parent->group_generation != leader->group_generation || + parent->nr_siblings != leader->nr_siblings)) { + ret = -ECHILD; + goto unlock; + } /* * Since we co-schedule groups, {enabled,running} times of siblings @@ -5483,8 +5512,9 @@ static int __perf_read_group_add(struct perf_event *leader, values[n++] = atomic64_read(&sub->lost_samples); } +unlock: raw_spin_unlock_irqrestore(&ctx->lock, flags); - return 0; + return ret; } static int perf_read_group(struct perf_event *event, @@ -5503,10 +5533,6 @@ static int perf_read_group(struct perf_event *event, values[0] = 1 + leader->nr_siblings; - /* - * By locking the child_mutex of the leader we effectively - * lock the child list of all siblings.. XXX explain how. - */ mutex_lock(&leader->child_mutex); ret = __perf_read_group_add(leader, read_format, values); @@ -13346,6 +13372,7 @@ static int inherit_group(struct perf_event *parent_event, !perf_get_aux_event(child_ctr, leader)) return -EINVAL; } + leader->group_generation = parent_event->group_generation; return 0; } -- cgit v1.2.3 From b022f0c7e404887a7c5229788fc99eff9f9a80d5 Mon Sep 17 00:00:00 2001 From: Francis Laniel Date: Fri, 20 Oct 2023 13:42:49 +0300 Subject: tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols When a kprobe is attached to a function that's name is not unique (is static and shares the name with other functions in the kernel), the kprobe is attached to the first function it finds. This is a bug as the function that it is attaching to is not necessarily the one that the user wants to attach to. Instead of blindly picking a function to attach to what is ambiguous, error with EADDRNOTAVAIL to let the user know that this function is not unique, and that the user must use another unique function with an address offset to get to the function they want to attach to. Link: https://lore.kernel.org/all/20231020104250.9537-2-flaniel@linux.microsoft.com/ Cc: stable@vger.kernel.org Fixes: 413d37d1eb69 ("tracing: Add kprobe-based event tracer") Suggested-by: Masami Hiramatsu Signed-off-by: Francis Laniel Link: https://lore.kernel.org/lkml/20230819101105.b0c104ae4494a7d1f2eea742@kernel.org/ Acked-by: Masami Hiramatsu (Google) Signed-off-by: Masami Hiramatsu (Google) --- kernel/trace/trace_kprobe.c | 63 +++++++++++++++++++++++++++++++++++++++++++++ kernel/trace/trace_probe.h | 1 + 2 files changed, 64 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 3d7a180a8427..a8fef6ab0872 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -705,6 +705,25 @@ static struct notifier_block trace_kprobe_module_nb = { .priority = 1 /* Invoked after kprobe module callback */ }; +static int count_symbols(void *data, unsigned long unused) +{ + unsigned int *count = data; + + (*count)++; + + return 0; +} + +static unsigned int number_of_same_symbols(char *func_name) +{ + unsigned int count; + + count = 0; + kallsyms_on_each_match_symbol(count_symbols, func_name, &count); + + return count; +} + static int __trace_kprobe_create(int argc, const char *argv[]) { /* @@ -836,6 +855,31 @@ static int __trace_kprobe_create(int argc, const char *argv[]) } } + if (symbol && !strchr(symbol, ':')) { + unsigned int count; + + count = number_of_same_symbols(symbol); + if (count > 1) { + /* + * Users should use ADDR to remove the ambiguity of + * using KSYM only. + */ + trace_probe_log_err(0, NON_UNIQ_SYMBOL); + ret = -EADDRNOTAVAIL; + + goto error; + } else if (count == 0) { + /* + * We can return ENOENT earlier than when register the + * kprobe. + */ + trace_probe_log_err(0, BAD_PROBE_ADDR); + ret = -ENOENT; + + goto error; + } + } + trace_probe_log_set_index(0); if (event) { ret = traceprobe_parse_event_name(&event, &group, gbuf, @@ -1695,6 +1739,7 @@ static int unregister_kprobe_event(struct trace_kprobe *tk) } #ifdef CONFIG_PERF_EVENTS + /* create a trace_kprobe, but don't add it to global lists */ struct trace_event_call * create_local_trace_kprobe(char *func, void *addr, unsigned long offs, @@ -1705,6 +1750,24 @@ create_local_trace_kprobe(char *func, void *addr, unsigned long offs, int ret; char *event; + if (func) { + unsigned int count; + + count = number_of_same_symbols(func); + if (count > 1) + /* + * Users should use addr to remove the ambiguity of + * using func only. + */ + return ERR_PTR(-EADDRNOTAVAIL); + else if (count == 0) + /* + * We can return ENOENT earlier than when register the + * kprobe. + */ + return ERR_PTR(-ENOENT); + } + /* * local trace_kprobes are not added to dyn_event, so they are never * searched in find_trace_kprobe(). Therefore, there is no concern of diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index 02b432ae7513..850d9ecb6765 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -450,6 +450,7 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call, C(BAD_MAXACT, "Invalid maxactive number"), \ C(MAXACT_TOO_BIG, "Maxactive is too big"), \ C(BAD_PROBE_ADDR, "Invalid probed address or symbol"), \ + C(NON_UNIQ_SYMBOL, "The symbol is not unique"), \ C(BAD_RETPROBE, "Retprobe address must be an function entry"), \ C(NO_TRACEPOINT, "Tracepoint is not found"), \ C(BAD_ADDR_SUFFIX, "Invalid probed address suffix"), \ -- cgit v1.2.3