From 0351dc57b95b8b56f2a467122c13b6b16e0dc53f Mon Sep 17 00:00:00 2001 From: Jilin Yuan Date: Sun, 14 Aug 2022 17:39:41 +0800 Subject: audit: fix repeated words in comments Delete the redundant word 'doesn't'. Signed-off-by: Jilin Yuan [PM: subject line tweak] Signed-off-by: Paul Moore --- kernel/auditsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index dd8d9ab747c3..9226746dcf0a 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1833,7 +1833,7 @@ void __audit_free(struct task_struct *tsk) /* We are called either by do_exit() or the fork() error handling code; * in the former case tsk == current and in the latter tsk is a - * random task_struct that doesn't doesn't have any meaningful data we + * random task_struct that doesn't have any meaningful data we * need to log via audit_log_exit(). */ if (tsk == current && !context->dummy) { -- cgit v1.2.3 From 7cd4c5c2101cb092db00f61f69d24380cf7a0ee8 Mon Sep 17 00:00:00 2001 From: Frederick Lawler Date: Mon, 15 Aug 2022 11:20:25 -0500 Subject: security, lsm: Introduce security_create_user_ns() User namespaces are an effective tool to allow programs to run with permission without requiring the need for a program to run as root. User namespaces may also be used as a sandboxing technique. However, attackers sometimes leverage user namespaces as an initial attack vector to perform some exploit. [1,2,3] While it is not the unprivileged user namespace functionality, which causes the kernel to be exploitable, users/administrators might want to more granularly limit or at least monitor how various processes use this functionality, while vulnerable kernel subsystems are being patched. Preventing user namespace already creation comes in a few of forms in order of granularity: 1. /proc/sys/user/max_user_namespaces sysctl 2. Distro specific patch(es) 3. CONFIG_USER_NS To block a task based on its attributes, the LSM hook cred_prepare is a decent candidate for use because it provides more granular control, and it is called before create_user_ns(): cred = prepare_creds() security_prepare_creds() call_int_hook(cred_prepare, ... if (cred) create_user_ns(cred) Since security_prepare_creds() is meant for LSMs to copy and prepare credentials, access control is an unintended use of the hook. [4] Further, security_prepare_creds() will always return a ENOMEM if the hook returns any non-zero error code. This hook also does not handle the clone3 case which requires us to access a user space pointer to know if we're in the CLONE_NEW_USER call path which may be subject to a TOCTTOU attack. Lastly, cred_prepare is called in many call paths, and a targeted hook further limits the frequency of calls which is a beneficial outcome. Therefore introduce a new function security_create_user_ns() with an accompanying userns_create LSM hook. With the new userns_create hook, users will have more control over the observability and access control over user namespace creation. Users should expect that normal operation of user namespaces will behave as usual, and only be impacted when controls are implemented by users or administrators. This hook takes the prepared creds for LSM authors to write policy against. On success, the new namespace is applied to credentials, otherwise an error is returned. Links: 1. https://nvd.nist.gov/vuln/detail/CVE-2022-0492 2. https://nvd.nist.gov/vuln/detail/CVE-2022-25636 3. https://nvd.nist.gov/vuln/detail/CVE-2022-34918 4. https://lore.kernel.org/all/1c4b1c0d-12f6-6e9e-a6a3-cdce7418110c@schaufler-ca.com/ Reviewed-by: Christian Brauner (Microsoft) Reviewed-by: KP Singh Signed-off-by: Frederick Lawler Signed-off-by: Paul Moore --- kernel/user_namespace.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'kernel') diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 5481ba44a8d6..3f464bbda0e9 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -113,6 +114,10 @@ int create_user_ns(struct cred *new) !kgid_has_mapping(parent_ns, group)) goto fail_dec; + ret = security_create_user_ns(new); + if (ret < 0) + goto fail_dec; + ret = -ENOMEM; ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL); if (!ns) -- cgit v1.2.3 From 401e64b3a4af4c7a2f6a00337232a3cf0bb757ed Mon Sep 17 00:00:00 2001 From: Frederick Lawler Date: Mon, 15 Aug 2022 11:20:26 -0500 Subject: bpf-lsm: Make bpf_lsm_userns_create() sleepable Users may want to audit calls to security_create_user_ns() and access user space memory. Also create_user_ns() runs without pagefault_disabled(). Therefore, make bpf_lsm_userns_create() sleepable for mandatory access control policies. Acked-by: Alexei Starovoitov Acked-by: Christian Brauner (Microsoft) Acked-by: KP Singh Signed-off-by: Frederick Lawler Signed-off-by: Paul Moore --- kernel/bpf/bpf_lsm.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c index fa71d58b7ded..761998fda762 100644 --- a/kernel/bpf/bpf_lsm.c +++ b/kernel/bpf/bpf_lsm.c @@ -335,6 +335,7 @@ BTF_ID(func, bpf_lsm_task_getsecid_obj) BTF_ID(func, bpf_lsm_task_prctl) BTF_ID(func, bpf_lsm_task_setscheduler) BTF_ID(func, bpf_lsm_task_to_inode) +BTF_ID(func, bpf_lsm_userns_create) BTF_SET_END(sleepable_lsm_hooks) bool bpf_lsm_is_sleepable_hook(u32 btf_id) -- cgit v1.2.3 From 6d5afdc97ea71958287364a1f1d07e59ef151b11 Mon Sep 17 00:00:00 2001 From: Lukasz Luba Date: Tue, 16 Aug 2022 14:06:29 +0100 Subject: cpufreq: schedutil: Move max CPU capacity to sugov_policy There is no need to keep the max CPU capacity in the per_cpu instance. Furthermore, there is no need to check and update that variable (sg_cpu->max) every time in the frequency change request, which is part of hot path. Instead use struct sugov_policy to store that information. Initialize the max CPU capacity during the setup and start callback. We can do that since all CPUs in the same frequency domain have the same max capacity (capacity setup and thermal pressure are based on that). Acked-by: Viresh Kumar Signed-off-by: Lukasz Luba Signed-off-by: Rafael J. Wysocki --- kernel/sched/cpufreq_schedutil.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 1207c78f85c1..9161d1136d01 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -25,6 +25,9 @@ struct sugov_policy { unsigned int next_freq; unsigned int cached_raw_freq; + /* max CPU capacity, which is equal for all CPUs in freq. domain */ + unsigned long max; + /* The next fields are only needed if fast switch cannot be used: */ struct irq_work irq_work; struct kthread_work work; @@ -48,7 +51,6 @@ struct sugov_cpu { unsigned long util; unsigned long bw_dl; - unsigned long max; /* The field below is for single-CPU policies only: */ #ifdef CONFIG_NO_HZ_COMMON @@ -158,7 +160,6 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) { struct rq *rq = cpu_rq(sg_cpu->cpu); - sg_cpu->max = arch_scale_cpu_capacity(sg_cpu->cpu); sg_cpu->bw_dl = cpu_bw_dl(rq); sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), FREQUENCY_UTIL, NULL); @@ -253,6 +254,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, */ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time) { + struct sugov_policy *sg_policy = sg_cpu->sg_policy; unsigned long boost; /* No boost currently required */ @@ -280,7 +282,8 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time) * sg_cpu->util is already in capacity scale; convert iowait_boost * into the same scale so we can compare. */ - boost = (sg_cpu->iowait_boost * sg_cpu->max) >> SCHED_CAPACITY_SHIFT; + boost = sg_cpu->iowait_boost * sg_policy->max; + boost >>= SCHED_CAPACITY_SHIFT; boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL); if (sg_cpu->util < boost) sg_cpu->util = boost; @@ -337,7 +340,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time, if (!sugov_update_single_common(sg_cpu, time, flags)) return; - next_f = get_next_freq(sg_policy, sg_cpu->util, sg_cpu->max); + next_f = get_next_freq(sg_policy, sg_cpu->util, sg_policy->max); /* * Do not reduce the frequency if the CPU has not been idle * recently, as the reduction is likely to be premature then. @@ -373,6 +376,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time, unsigned int flags) { struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; unsigned long prev_util = sg_cpu->util; /* @@ -399,7 +403,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time, sg_cpu->util = prev_util; cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl), - map_util_perf(sg_cpu->util), sg_cpu->max); + map_util_perf(sg_cpu->util), + sg_policy->max); sg_cpu->sg_policy->last_freq_update_time = time; } @@ -408,25 +413,19 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) { struct sugov_policy *sg_policy = sg_cpu->sg_policy; struct cpufreq_policy *policy = sg_policy->policy; - unsigned long util = 0, max = 1; + unsigned long util = 0; unsigned int j; for_each_cpu(j, policy->cpus) { struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j); - unsigned long j_util, j_max; sugov_get_util(j_sg_cpu); sugov_iowait_apply(j_sg_cpu, time); - j_util = j_sg_cpu->util; - j_max = j_sg_cpu->max; - if (j_util * max > j_max * util) { - util = j_util; - max = j_max; - } + util = max(j_sg_cpu->util, util); } - return get_next_freq(sg_policy, util, max); + return get_next_freq(sg_policy, util, sg_policy->max); } static void @@ -752,7 +751,7 @@ static int sugov_start(struct cpufreq_policy *policy) { struct sugov_policy *sg_policy = policy->governor_data; void (*uu)(struct update_util_data *data, u64 time, unsigned int flags); - unsigned int cpu; + unsigned int cpu = cpumask_first(policy->cpus); sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC; sg_policy->last_freq_update_time = 0; @@ -760,6 +759,7 @@ static int sugov_start(struct cpufreq_policy *policy) sg_policy->work_in_progress = false; sg_policy->limits_changed = false; sg_policy->cached_raw_freq = 0; + sg_policy->max = arch_scale_cpu_capacity(cpu); sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS); -- cgit v1.2.3 From e84d9f5214cb854fcd584aa78b5634794604d306 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Thu, 25 Aug 2022 15:32:37 -0400 Subject: audit: audit_context pid unused, context enum comment fix The pid member of struct audit_context is never used. Remove it. The audit_reset_context() comment about unconditionally resetting "ctx->state" should read "ctx->context". Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/audit.h | 2 +- kernel/auditsc.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.h b/kernel/audit.h index 58b66543b4d5..d6eb7b59c791 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -133,7 +133,7 @@ struct audit_context { struct sockaddr_storage *sockaddr; size_t sockaddr_len; /* Save things to print about task_struct */ - pid_t pid, ppid; + pid_t ppid; kuid_t uid, euid, suid, fsuid; kgid_t gid, egid, sgid, fsgid; unsigned long personality; diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 9226746dcf0a..21e50e6d0fc0 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -965,7 +965,7 @@ static void audit_reset_context(struct audit_context *ctx) if (!ctx) return; - /* if ctx is non-null, reset the "ctx->state" regardless */ + /* if ctx is non-null, reset the "ctx->context" regardless */ ctx->context = AUDIT_CTX_UNUSED; if (ctx->dummy) return; @@ -1002,7 +1002,7 @@ static void audit_reset_context(struct audit_context *ctx) kfree(ctx->sockaddr); ctx->sockaddr = NULL; ctx->sockaddr_len = 0; - ctx->pid = ctx->ppid = 0; + ctx->ppid = 0; ctx->uid = ctx->euid = ctx->suid = ctx->fsuid = KUIDT_INIT(0); ctx->gid = ctx->egid = ctx->sgid = ctx->fsgid = KGIDT_INIT(0); ctx->personality = 0; -- cgit v1.2.3 From 3ed66951f952ed8f1a5d03e171722bf2631e8d58 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Thu, 25 Aug 2022 15:32:38 -0400 Subject: audit: explicitly check audit_context->context enum value Be explicit in checking the struct audit_context "context" member enum value rather than assuming the order of context enum values. Fixes: 12c5e81d3fd0 ("audit: prepare audit_context for use in calling contexts beyond syscalls") Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/auditsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 21e50e6d0fc0..d77c9805c6b1 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -2068,7 +2068,7 @@ void __audit_syscall_exit(int success, long return_code) /* run through both filters to ensure we set the filterkey properly */ audit_filter_syscall(current, context); audit_filter_inodes(current, context); - if (context->current_state < AUDIT_STATE_RECORD) + if (context->current_state != AUDIT_STATE_RECORD) goto out; audit_return_fixup(context, success, return_code); -- cgit v1.2.3 From c3f3ea8af44d0c5fba79fe8b198087342d0c7e04 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Thu, 25 Aug 2022 15:32:39 -0400 Subject: audit: free audit_proctitle only on task exit Since audit_proctitle is generated at syscall exit time, its value is used immediately and cached for the next syscall. Since this is the case, then only clear it at task exit time. Otherwise, there is no point in caching the value OR bearing the overhead of regenerating it. Fixes: 12c5e81d3fd0 ("audit: prepare audit_context for use in calling contexts beyond syscalls") Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/auditsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index d77c9805c6b1..280b4720c7a0 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1016,7 +1016,6 @@ static void audit_reset_context(struct audit_context *ctx) WARN_ON(!list_empty(&ctx->killed_trees)); audit_free_module(ctx); ctx->fds[0] = -1; - audit_proctitle_free(ctx); ctx->type = 0; /* reset last for audit_free_*() */ } @@ -1077,6 +1076,7 @@ static inline void audit_free_context(struct audit_context *context) { /* resetting is extra work, but it is likely just noise */ audit_reset_context(context); + audit_proctitle_free(context); free_tree_refs(context); kfree(context->filterkey); kfree(context); -- cgit v1.2.3 From 501e4bb102ee2f380ab94abbf3d35dff9d4719a2 Mon Sep 17 00:00:00 2001 From: wuchi Date: Tue, 30 Aug 2022 00:26:09 +0800 Subject: audit: use time_after to compare time Using time_{*} macro to compare time is better Signed-off-by: wuchi Signed-off-by: Paul Moore --- kernel/audit.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index a75978ae38ad..9bc0b0301198 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -321,7 +321,6 @@ static inline int audit_rate_check(void) static DEFINE_SPINLOCK(lock); unsigned long flags; unsigned long now; - unsigned long elapsed; int retval = 0; if (!audit_rate_limit) return 1; @@ -330,9 +329,8 @@ static inline int audit_rate_check(void) if (++messages < audit_rate_limit) { retval = 1; } else { - now = jiffies; - elapsed = now - last_check; - if (elapsed > HZ) { + now = jiffies; + if (time_after(now, last_check + HZ)) { last_check = now; messages = 0; retval = 1; @@ -366,7 +364,7 @@ void audit_log_lost(const char *message) if (!print) { spin_lock_irqsave(&lock, flags); now = jiffies; - if (now - last_msg > HZ) { + if (time_after(now, last_msg + HZ)) { print = 1; last_msg = now; } -- cgit v1.2.3 From 6d60ea03ac2d3dcf6ddee6b45aa7213d8b0461c5 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Thu, 16 Jun 2022 21:53:47 +0800 Subject: rcu: Fix rcu_read_unlock_strict() strict QS reporting Kernels built with CONFIG_PREEMPT=n and CONFIG_RCU_STRICT_GRACE_PERIOD=y report the quiescent state directly from the outermost rcu_read_unlock(). However, the current CPU's rcu_data structure's ->cpu_no_qs.b.norm might still be set, in which case rcu_report_qs_rdp() will exit early, thus failing to report quiescent state. This commit therefore causes rcu_read_unlock_strict() to clear CPU's rcu_data structure's ->cpu_no_qs.b.norm field before invoking rcu_report_qs_rdp(). Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 438ecae6bd7e..86772c95ed0a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -824,6 +824,7 @@ void rcu_read_unlock_strict(void) if (irqs_disabled() || preempt_count() || !rcu_state.gp_kthread) return; rdp = this_cpu_ptr(&rcu_data); + rdp->cpu_no_qs.b.norm = false; rcu_report_qs_rdp(rdp); udelay(rcu_unlock_delay); } -- cgit v1.2.3 From bca4fa8cb0f4c096b515952f64e560fd784a0514 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Mon, 20 Jun 2022 14:42:24 +0800 Subject: rcu: Update rcu_preempt_deferred_qs() comments for !PREEMPT kernels In non-premptible kernels, tasks never do context switches within RCU read-side critical sections. Therefore, in such kernels, each leaf rcu_node structure's ->blkd_tasks list will always be empty. The comment on the non-preemptible version of rcu_preempt_deferred_qs() confuses this point, so this commit therefore fixes it. Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 86772c95ed0a..4152816dd29f 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -932,10 +932,13 @@ static notrace bool rcu_preempt_need_deferred_qs(struct task_struct *t) return false; } -// Except that we do need to respond to a request by an expedited grace -// period for a quiescent state from this CPU. Note that requests from -// tasks are handled when removing the task from the blocked-tasks list -// below. +// Except that we do need to respond to a request by an expedited +// grace period for a quiescent state from this CPU. Note that in +// non-preemptible kernels, there can be no context switches within RCU +// read-side critical sections, which in turn means that the leaf rcu_node +// structure's blocked-tasks list is always empty. is therefore no need to +// actually check it. Instead, a quiescent state from this CPU suffices, +// and this function is only called from such a quiescent state. notrace void rcu_preempt_deferred_qs(struct task_struct *t) { struct rcu_data *rdp = this_cpu_ptr(&rcu_data); -- cgit v1.2.3 From fcb42c9a77d490ed0974e4d394519481aa06e585 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Tue, 5 Jul 2022 12:09:51 -0700 Subject: rcu: Add QS check in rcu_exp_handler() for non-preemptible kernels Kernels built with CONFIG_PREEMPTION=n and CONFIG_PREEMPT_COUNT=y maintain preempt_count() state. Because such kernels map __rcu_read_lock() and __rcu_read_unlock() to preempt_disable() and preempt_enable(), respectively, this allows the expedited grace period's !CONFIG_PREEMPT_RCU version of the rcu_exp_handler() IPI handler function to use preempt_count() to detect quiescent states. This preempt_count() usage might seem to risk failures due to use of implicit RCU readers in portions of the kernel under #ifndef CONFIG_PREEMPTION, except that rcu_core() already disallows such implicit RCU readers. The moral of this story is that you must use explicit read-side markings such as rcu_read_lock() or preempt_disable() even if the code knows that this kernel does not support preemption. This commit therefore adds a preempt_count()-based check for a quiescent state in the !CONFIG_PREEMPT_RCU version of the rcu_exp_handler() function for kernels built with CONFIG_PREEMPT_COUNT=y, reporting an immediate quiescent state when the interrupted code had both preemption and softirqs enabled. This change results in about a 2% reduction in expedited grace-period latency in kernels built with both CONFIG_PREEMPT_RCU=n and CONFIG_PREEMPT_COUNT=y. Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney Link: https://lore.kernel.org/all/20220622103549.2840087-1-qiang1.zhang@intel.com/ --- kernel/rcu/tree_exp.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index be667583a554..b07998159d1f 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -828,11 +828,13 @@ static void rcu_exp_handler(void *unused) { struct rcu_data *rdp = this_cpu_ptr(&rcu_data); struct rcu_node *rnp = rdp->mynode; + bool preempt_bh_enabled = !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)); if (!(READ_ONCE(rnp->expmask) & rdp->grpmask) || __this_cpu_read(rcu_data.cpu_no_qs.b.exp)) return; - if (rcu_is_cpu_rrupt_from_idle()) { + if (rcu_is_cpu_rrupt_from_idle() || + (IS_ENABLED(CONFIG_PREEMPT_COUNT) && preempt_bh_enabled)) { rcu_report_exp_rdp(this_cpu_ptr(&rcu_data)); return; } -- cgit v1.2.3 From 6ca0292ccf968e0132e64d6a699d36ba3e92cb81 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Fri, 1 Jul 2022 10:44:04 +0800 Subject: rcu: Make tiny RCU support leak callbacks for debug-object errors Currently, only Tree RCU leaks callbacks setting when it detects a duplicate call_rcu(). This commit causes Tiny RCU to also leak callbacks in this situation. Because this is Tiny RCU, kernel size is important: 1. CONFIG_TINY_RCU=y and CONFIG_DEBUG_OBJECTS_RCU_HEAD=n (Production kernel) Original: text data bss dec hex filename 26290663 20159823 15212544 61663030 3ace736 vmlinux With this commit: text data bss dec hex filename 26290663 20159823 15212544 61663030 3ace736 vmlinux 2. CONFIG_TINY_RCU=y and CONFIG_DEBUG_OBJECTS_RCU_HEAD=y (Debugging kernel) Original: text data bss dec hex filename 26291319 20160143 15212544 61664006 3aceb06 vmlinux With this commit: text data bss dec hex filename 26291319 20160431 15212544 61664294 3acec26 vmlinux These results show that the kernel size is unchanged for production kernels, as desired. Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney --- kernel/rcu/tiny.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c index f0561ee16b9c..943d431b908f 100644 --- a/kernel/rcu/tiny.c +++ b/kernel/rcu/tiny.c @@ -158,6 +158,10 @@ void synchronize_rcu(void) } EXPORT_SYMBOL_GPL(synchronize_rcu); +static void tiny_rcu_leak_callback(struct rcu_head *rhp) +{ +} + /* * Post an RCU callback to be invoked after the end of an RCU grace * period. But since we have but one CPU, that would be after any @@ -165,9 +169,20 @@ EXPORT_SYMBOL_GPL(synchronize_rcu); */ void call_rcu(struct rcu_head *head, rcu_callback_t func) { + static atomic_t doublefrees; unsigned long flags; - debug_rcu_head_queue(head); + if (debug_rcu_head_queue(head)) { + if (atomic_inc_return(&doublefrees) < 4) { + pr_err("%s(): Double-freed CB %p->%pS()!!! ", __func__, head, head->func); + mem_dump_obj(head); + } + + if (!__is_kvfree_rcu_offset((unsigned long)head->func)) + WRITE_ONCE(head->func, tiny_rcu_leak_callback); + return; + } + head->func = func; head->next = NULL; -- cgit v1.2.3 From 089254fd386eb6800dd7d7863f12a04ada0c35fa Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Aug 2022 08:48:12 -0700 Subject: rcu: Document reason for rcu_all_qs() call to preempt_disable() Given that rcu_all_qs() is in non-preemptible kernels, why on earth should it invoke preempt_disable()? This commit adds the reason, which is to work nicely with debugging enabled in CONFIG_PREEMPT_COUNT=y kernels. Reported-by: Neeraj Upadhyay Reported-by: Boqun Feng Reported-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 4152816dd29f..c46b3c74dad1 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -870,7 +870,7 @@ void rcu_all_qs(void) if (!raw_cpu_read(rcu_data.rcu_urgent_qs)) return; - preempt_disable(); + preempt_disable(); // For CONFIG_PREEMPT_COUNT=y kernels /* Load rcu_urgent_qs before other flags. */ if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) { preempt_enable(); -- cgit v1.2.3 From e73dfe30930b75c98746152e7a2f6a8ab6067b51 Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Thu, 4 Aug 2022 10:34:19 +0800 Subject: sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task() The trigger_all_cpu_backtrace() function attempts to send an NMI to the target CPU, which usually provides much better stack traces than the dump_cpu_task() function's approach of dumping that stack from some other CPU. So much so that most calls to dump_cpu_task() only happen after a call to trigger_all_cpu_backtrace() has failed. And the exception to this rule really should attempt to use trigger_all_cpu_backtrace() first. Therefore, move the trigger_all_cpu_backtrace() invocation into dump_cpu_task(). Signed-off-by: Zhen Lei Signed-off-by: Paul E. McKenney Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Valentin Schneider --- kernel/rcu/tree_stall.h | 5 ++--- kernel/sched/core.c | 3 +++ kernel/smp.c | 3 +-- 3 files changed, 6 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index c3fbbcc09327..5653560573e2 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -368,7 +368,7 @@ static void rcu_dump_cpu_stacks(void) if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { if (cpu_is_offline(cpu)) pr_err("Offline CPU %d blocking current GP.\n", cpu); - else if (!trigger_single_cpu_backtrace(cpu)) + else dump_cpu_task(cpu); } raw_spin_unlock_irqrestore_rcu_node(rnp, flags); @@ -511,8 +511,7 @@ static void rcu_check_gp_kthread_starvation(void) pr_err("RCU GP kthread last ran on offline CPU %d.\n", cpu); } else { pr_err("Stack dump where RCU GP kthread last ran:\n"); - if (!trigger_single_cpu_backtrace(cpu)) - dump_cpu_task(cpu); + dump_cpu_task(cpu); } } wake_up_process(gpk); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ee28253c9ac0..e15b6a7f34f4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -11183,6 +11183,9 @@ struct cgroup_subsys cpu_cgrp_subsys = { void dump_cpu_task(int cpu) { + if (trigger_single_cpu_backtrace(cpu)) + return; + pr_info("Task dump for CPU %d:\n", cpu); sched_show_task(cpu_curr(cpu)); } diff --git a/kernel/smp.c b/kernel/smp.c index 650810a6f29b..e8cdc025a046 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -370,8 +370,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * if (cpu >= 0) { if (static_branch_unlikely(&csdlock_debug_extended)) csd_lock_print_extended(csd, cpu); - if (!trigger_single_cpu_backtrace(cpu)) - dump_cpu_task(cpu); + dump_cpu_task(cpu); if (!cpu_cur_csd) { pr_alert("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", *bug_id, raw_smp_processor_id(), cpu); arch_send_call_function_single_ipi(cpu); -- cgit v1.2.3 From bc1cca97e6da6c7c34db7c5b864bb354ca5305ac Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Thu, 4 Aug 2022 10:34:20 +0800 Subject: sched/debug: Show the registers of 'current' in dump_cpu_task() The dump_cpu_task() function does not print registers on architectures that do not support NMIs. However, registers can be useful for debugging. Fortunately, in the case where dump_cpu_task() is invoked from an interrupt handler and is dumping the current CPU's stack, the get_irq_regs() function can be used to get the registers. Therefore, this commit makes dump_cpu_task() check to see if it is being asked to dump the current CPU's stack from within an interrupt handler, and, if so, it uses the get_irq_regs() function to obtain the registers. On systems that do support NMIs, this commit has the further advantage of avoiding a self-NMI in this case. This is an example of rcu self-detected stall on arm64, which does not support NMIs: [ 27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.502238] rcu: 0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619 [ 27.502632] (t=1251 jiffies g=2989 q=29 ncpus=4) [ 27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46 [ 27.504732] Hardware name: linux,dummy-virt (DT) [ 27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 27.504998] pc : arch_counter_read+0x18/0x24 [ 27.505301] lr : arch_counter_read+0x18/0x24 [ 27.505328] sp : ffff80000b29bdf0 [ 27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000 [ 27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0 [ 27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff [ 27.505654] x17: 0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296 [ 27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426 [ 27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18 [ 27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013 [ 27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017 [ 27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653 [ 27.505937] Call trace: [ 27.506002] arch_counter_read+0x18/0x24 [ 27.506171] ktime_get+0x48/0xa0 [ 27.506207] test_task+0x70/0xf0 [ 27.506227] kthread+0x10c/0x110 [ 27.506243] ret_from_fork+0x10/0x20 This is a marked improvement over the old output: [ 27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.944980] rcu: 0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614 [ 27.945407] (t=1251 jiffies g=2681 q=28 ncpus=4) [ 27.945731] Task dump for CPU 0: [ 27.945844] task:test0 state:R running task stack: 0 pid: 306 ppid: 2 flags:0x0000000a [ 27.946073] Call trace: [ 27.946151] dump_backtrace.part.0+0xc8/0xd4 [ 27.946378] show_stack+0x18/0x70 [ 27.946405] sched_show_task+0x150/0x180 [ 27.946427] dump_cpu_task+0x44/0x54 [ 27.947193] rcu_dump_cpu_stacks+0xec/0x130 [ 27.947212] rcu_sched_clock_irq+0xb18/0xef0 [ 27.947231] update_process_times+0x68/0xac [ 27.947248] tick_sched_handle+0x34/0x60 [ 27.947266] tick_sched_timer+0x4c/0xa4 [ 27.947281] __hrtimer_run_queues+0x178/0x360 [ 27.947295] hrtimer_interrupt+0xe8/0x244 [ 27.947309] arch_timer_handler_virt+0x38/0x4c [ 27.947326] handle_percpu_devid_irq+0x88/0x230 [ 27.947342] generic_handle_domain_irq+0x2c/0x44 [ 27.947357] gic_handle_irq+0x44/0xc4 [ 27.947376] call_on_irq_stack+0x2c/0x54 [ 27.947415] do_interrupt_handler+0x80/0x94 [ 27.947431] el1_interrupt+0x34/0x70 [ 27.947447] el1h_64_irq_handler+0x18/0x24 [ 27.947462] el1h_64_irq+0x64/0x68 <--- the above backtrace is worthless [ 27.947474] arch_counter_read+0x18/0x24 [ 27.947487] ktime_get+0x48/0xa0 [ 27.947501] test_task+0x70/0xf0 [ 27.947520] kthread+0x10c/0x110 [ 27.947538] ret_from_fork+0x10/0x20 Signed-off-by: Zhen Lei Signed-off-by: Paul E. McKenney Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Valentin Schneider --- kernel/sched/core.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e15b6a7f34f4..60fdc0faf1c9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -73,6 +73,7 @@ #include +#include #include #include @@ -11183,6 +11184,16 @@ struct cgroup_subsys cpu_cgrp_subsys = { void dump_cpu_task(int cpu) { + if (cpu == smp_processor_id() && in_hardirq()) { + struct pt_regs *regs; + + regs = get_irq_regs(); + if (regs) { + show_regs(regs); + return; + } + } + if (trigger_single_cpu_backtrace(cpu)) return; -- cgit v1.2.3 From 621189a1fe93cb2b34d62c5cdb9e258bca044813 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Mon, 8 Aug 2022 10:26:26 +0800 Subject: rcu: Avoid triggering strict-GP irq-work when RCU is idle Kernels built with PREEMPT_RCU=y and RCU_STRICT_GRACE_PERIOD=y trigger irq-work from rcu_read_unlock(), and the resulting irq-work handler invokes rcu_preempt_deferred_qs_handle(). The point of this triggering is to force grace periods to end quickly in order to give tools like KASAN a better chance of detecting RCU usage bugs such as leaking RCU-protected pointers out of an RCU read-side critical section. However, this irq-work triggering is unconditional. This works, but there is no point in doing this irq-work unless the current grace period is waiting on the running CPU or task, which is not the common case. After all, in the common case there are many rcu_read_unlock() calls per CPU per grace period. This commit therefore triggers the irq-work only when the current grace period is waiting on the running CPU or task. This change was tested as follows on a four-CPU system: echo rcu_preempt_deferred_qs_handler > /sys/kernel/debug/tracing/set_ftrace_filter echo 1 > /sys/kernel/debug/tracing/function_profile_enabled insmod rcutorture.ko sleep 20 rmmod rcutorture.ko echo 0 > /sys/kernel/debug/tracing/function_profile_enabled echo > /sys/kernel/debug/tracing/set_ftrace_filter This procedure produces results in this per-CPU set of files: /sys/kernel/debug/tracing/trace_stat/function* Sample output from one of these files is as follows: Function Hit Time Avg s^2 -------- --- ---- --- --- rcu_preempt_deferred_qs_handle 838746 182650.3 us 0.217 us 0.004 us The baseline sum of the "Hit" values (the number of calls to this function) was 3,319,015. With this commit, that sum was 1,140,359, for a 2.9x reduction. The worst-case variance across the CPUs was less than 25%, so this large effect size is statistically significant. The raw data is available in the Link: URL. Link: https://lore.kernel.org/all/20220808022626.12825-1-qiang1.zhang@intel.com/ Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index c46b3c74dad1..207617f69aa5 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -641,7 +641,8 @@ static void rcu_read_unlock_special(struct task_struct *t) expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) || (rdp->grpmask & READ_ONCE(rnp->expmask)) || - IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) || + (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) && + ((rdp->grpmask & READ_ONCE(rnp->qsmask)) || t->rcu_blocked_node)) || (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node); // Need to defer quiescent state until everything is enabled. -- cgit v1.2.3 From 7634b1eaa0cd135d5eedadb04ad3c91b1ecf28a9 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 24 Aug 2022 14:46:56 -0700 Subject: rcu: Exclude outgoing CPU when it is the last to leave The rcu_boost_kthread_setaffinity() function removes the outgoing CPU from the set_cpus_allowed() mask for the corresponding leaf rcu_node structure's rcub priority-boosting kthread. Except that if the outgoing CPU will leave that structure without any online CPUs, the mask is set to the housekeeping CPU mask from housekeeping_cpumask(). Which is fine unless the outgoing CPU happens to be a housekeeping CPU. This commit therefore removes the outgoing CPU from the housekeeping mask. This would of course be problematic if the outgoing CPU was the last online housekeeping CPU, but in that case you are in a world of hurt anyway. If someone comes up with a valid use case for a system needing all the housekeeping CPUs to be offline, further adjustments can be made. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 207617f69aa5..32b424b571bd 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1243,8 +1243,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu) cpu != outgoingcpu) cpumask_set_cpu(cpu, cm); cpumask_and(cm, cm, housekeeping_cpumask(HK_TYPE_RCU)); - if (cpumask_empty(cm)) + if (cpumask_empty(cm)) { cpumask_copy(cm, housekeeping_cpumask(HK_TYPE_RCU)); + if (outgoingcpu >= 0) + cpumask_clear_cpu(outgoingcpu, cm); + } set_cpus_allowed_ptr(t, cm); mutex_unlock(&rnp->boost_kthread_mutex); free_cpumask_var(cm); -- cgit v1.2.3 From 093590c16b447f53e66771c8579ae66c96f6ef61 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 22 Jun 2022 13:47:11 +0200 Subject: rcu: Back off upon fill_page_cache_func() allocation failure The fill_page_cache_func() function allocates couple of pages to store kvfree_rcu_bulk_data structures. This is a lightweight (GFP_NORETRY) allocation which can fail under memory pressure. The function will, however keep retrying even when the previous attempt has failed. This retrying is in theory correct, but in practice the allocation is invoked from workqueue context, which means that if the memory reclaim gets stuck, these retries can hog the worker for quite some time. Although the workqueues subsystem automatically adjusts concurrency, such adjustment is not guaranteed to happen until the worker context sleeps. And the fill_page_cache_func() function's retry loop is not guaranteed to sleep (see the should_reclaim_retry() function). And we have seen this function cause workqueue lockups: kernel: BUG: workqueue lockup - pool cpus=93 node=1 flags=0x1 nice=0 stuck for 32s! [...] kernel: pool 74: cpus=37 node=0 flags=0x1 nice=0 hung=32s workers=2 manager: 2146 kernel: pwq 498: cpus=249 node=1 flags=0x1 nice=0 active=4/256 refcnt=5 kernel: in-flight: 1917:fill_page_cache_func kernel: pending: dbs_work_handler, free_work, kfree_rcu_monitor Originally, we thought that the root cause of this lockup was several retries with direct reclaim, but this is not yet confirmed. Furthermore, we have seen similar lockups without any heavy memory pressure. This suggests that there are other factors contributing to these lockups. However, it is not really clear that endless retries are desireable. So let's make the fill_page_cache_func() function back off after allocation failure. Cc: Uladzislau Rezki (Sony) Cc: "Paul E. McKenney" Cc: Frederic Weisbecker Cc: Neeraj Upadhyay Cc: Josh Triplett Cc: Steven Rostedt Cc: Mathieu Desnoyers Cc: Lai Jiangshan Cc: Joel Fernandes Signed-off-by: Michal Hocko Reviewed-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 79aea7df4345..eb435941e92f 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3183,15 +3183,16 @@ static void fill_page_cache_func(struct work_struct *work) bnode = (struct kvfree_rcu_bulk_data *) __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); - if (bnode) { - raw_spin_lock_irqsave(&krcp->lock, flags); - pushed = put_cached_bnode(krcp, bnode); - raw_spin_unlock_irqrestore(&krcp->lock, flags); + if (!bnode) + break; - if (!pushed) { - free_page((unsigned long) bnode); - break; - } + raw_spin_lock_irqsave(&krcp->lock, flags); + pushed = put_cached_bnode(krcp, bnode); + raw_spin_unlock_irqrestore(&krcp->lock, flags); + + if (!pushed) { + free_page((unsigned long) bnode); + break; } } -- cgit v1.2.3 From 38269096351806bf7315f971c53205b676ada259 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Wed, 22 Jun 2022 22:51:02 +0000 Subject: rcu/kfree: Fix kfree_rcu_shrink_count() return value As per the comments in include/linux/shrinker.h, .count_objects callback should return the number of freeable items, but if there are no objects to free, SHRINK_EMPTY should be returned. The only time 0 is returned should be when we are unable to determine the number of objects, or the cache should be skipped for another reason. Signed-off-by: Joel Fernandes (Google) Reviewed-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index eb435941e92f..3d234d536d4c 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3372,7 +3372,7 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) atomic_set(&krcp->backoff_page_cache_fill, 1); } - return count; + return count == 0 ? SHRINK_EMPTY : count; } static unsigned long -- cgit v1.2.3 From 51824b780b719c53113dc39e027fbf670dc66028 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Thu, 30 Jun 2022 18:33:35 +0200 Subject: rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval Currently the monitor work is scheduled with a fixed interval of HZ/20, which is roughly 50 milliseconds. The drawback of this approach is low utilization of the 512 page slots in scenarios with infrequence kvfree_rcu() calls. For example on an Android system: kworker/3:3-507 [003] .... 470.286305: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=6 kworker/6:1-76 [006] .... 470.416613: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000ea0d6556 nr_records=1 kworker/6:1-76 [006] .... 470.416625: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000003e025849 nr_records=9 kworker/3:3-507 [003] .... 471.390000: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000815a8713 nr_records=48 kworker/1:1-73 [001] .... 471.725785: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000fda9bf20 nr_records=3 kworker/1:1-73 [001] .... 471.725833: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000a425b67b nr_records=76 kworker/0:4-1411 [000] .... 472.085673: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007996be9d nr_records=1 kworker/0:4-1411 [000] .... 472.085728: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=5 kworker/6:1-76 [006] .... 472.260340: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000065630ee4 nr_records=102 In many cases, out of 512 slots, fewer than 10 were actually used. In order to improve batching and make utilization more efficient this commit sets a drain interval to a fixed 5-seconds interval. Floods are detected when a page fills quickly, and in that case, the reclaim work is re-scheduled for the next scheduling-clock tick (jiffy). After this change: kworker/7:1-371 [007] .... 5630.725708: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000005ab0ffb3 nr_records=121 kworker/7:1-371 [007] .... 5630.989702: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000060c84761 nr_records=47 kworker/7:1-371 [007] .... 5630.989714: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000000babf308 nr_records=510 kworker/7:1-371 [007] .... 5631.553790: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000bb7bd0ef nr_records=169 kworker/7:1-371 [007] .... 5631.553808: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000044c78753 nr_records=510 kworker/5:6-9428 [005] .... 5631.746102: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d98519aa nr_records=123 kworker/4:7-9434 [004] .... 5632.001758: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000526c9d44 nr_records=322 kworker/4:7-9434 [004] .... 5632.002073: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000002c6a8afa nr_records=185 kworker/7:1-371 [007] .... 5632.277515: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007f4a962f nr_records=510 Here, all but one of the cases, more than one hundreds slots were used, representing an order-of-magnitude improvement. Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 3d234d536d4c..7b90478b752e 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2832,7 +2832,7 @@ EXPORT_SYMBOL_GPL(call_rcu); /* Maximum number of jiffies to wait before draining a batch. */ -#define KFREE_DRAIN_JIFFIES (HZ / 50) +#define KFREE_DRAIN_JIFFIES (5 * HZ) #define KFREE_N_BATCHES 2 #define FREE_N_CHANNELS 2 @@ -3093,6 +3093,21 @@ need_offload_krc(struct kfree_rcu_cpu *krcp) return !!krcp->head; } +static void +schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp) +{ + long delay, delay_left; + + delay = READ_ONCE(krcp->count) >= KVFREE_BULK_MAX_ENTR ? 1:KFREE_DRAIN_JIFFIES; + if (delayed_work_pending(&krcp->monitor_work)) { + delay_left = krcp->monitor_work.timer.expires - jiffies; + if (delay < delay_left) + mod_delayed_work(system_wq, &krcp->monitor_work, delay); + return; + } + queue_delayed_work(system_wq, &krcp->monitor_work, delay); +} + /* * This function is invoked after the KFREE_DRAIN_JIFFIES timeout. */ @@ -3150,7 +3165,7 @@ static void kfree_rcu_monitor(struct work_struct *work) // work to repeat an attempt. Because previous batches are // still in progress. if (need_offload_krc(krcp)) - schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES); + schedule_delayed_monitor_work(krcp); raw_spin_unlock_irqrestore(&krcp->lock, flags); } @@ -3339,7 +3354,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) // Set timer to drain after KFREE_DRAIN_JIFFIES. if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING) - schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES); + schedule_delayed_monitor_work(krcp); unlock_return: krc_this_cpu_unlock(krcp, flags); @@ -3415,7 +3430,7 @@ void __init kfree_rcu_scheduler_running(void) raw_spin_lock_irqsave(&krcp->lock, flags); if (need_offload_krc(krcp)) - schedule_delayed_work_on(cpu, &krcp->monitor_work, KFREE_DRAIN_JIFFIES); + schedule_delayed_monitor_work(krcp); raw_spin_unlock_irqrestore(&krcp->lock, flags); } } -- cgit v1.2.3 From 5334da2af25eceb6a88ae162e69d6586b5cb9abc Mon Sep 17 00:00:00 2001 From: Zqiang Date: Fri, 17 Jun 2022 22:15:19 +0800 Subject: rcu/nocb: Choose the right rcuog/rcuop kthreads to output The show_rcu_nocb_gp_state() function is supposed to dump out the rcuog kthread and the show_rcu_nocb_state() function is supposed to dump out the rcuo[ps] kthread. Currently, both do a mixture, which is not optimal for debugging, even though it does not affect functionality. This commit therefore adjusts these two functions to focus on their respective kthreads. Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_nocb.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index a8f574d8850d..f20aec4f4394 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -1452,8 +1452,8 @@ static void show_rcu_nocb_gp_state(struct rcu_data *rdp) (long)rdp->nocb_gp_seq, rnp->grplo, rnp->grphi, READ_ONCE(rdp->nocb_gp_loops), rdp->nocb_gp_kthread ? task_state_to_char(rdp->nocb_gp_kthread) : '.', - rdp->nocb_cb_kthread ? (int)task_cpu(rdp->nocb_gp_kthread) : -1, - show_rcu_should_be_on_cpu(rdp->nocb_cb_kthread)); + rdp->nocb_gp_kthread ? (int)task_cpu(rdp->nocb_gp_kthread) : -1, + show_rcu_should_be_on_cpu(rdp->nocb_gp_kthread)); } /* Dump out nocb kthread state for the specified rcu_data structure. */ @@ -1497,7 +1497,7 @@ static void show_rcu_nocb_state(struct rcu_data *rdp) ".B"[!!rcu_cblist_n_cbs(&rdp->nocb_bypass)], rcu_segcblist_n_cbs(&rdp->cblist), rdp->nocb_cb_kthread ? task_state_to_char(rdp->nocb_cb_kthread) : '.', - rdp->nocb_cb_kthread ? (int)task_cpu(rdp->nocb_gp_kthread) : -1, + rdp->nocb_cb_kthread ? (int)task_cpu(rdp->nocb_cb_kthread) : -1, show_rcu_should_be_on_cpu(rdp->nocb_cb_kthread)); /* It is OK for GP kthreads to have GP state. */ -- cgit v1.2.3 From 638dce227a41534bbc4fdd73280e73e0b3570048 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Aug 2022 17:39:28 -0700 Subject: rcu/nocb: Add CPU number to CPU-{,de}offload failure messages Offline CPUs cannot be offloaded or deoffloaded. Any attempt to offload or deoffload an offline CPU causes a message to be printed on the console, which is good, but this message does not contain the CPU number, which is bad. Such a CPU number can be helpful when debugging, as it gives a clear indication that the CPU in question is in fact offline. This commit therefore adds the CPU number to the CPU-{,de}offload failure messages. Cc: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_nocb.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index f20aec4f4394..0a5f0ef41484 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -1111,7 +1111,7 @@ int rcu_nocb_cpu_deoffload(int cpu) if (!ret) cpumask_clear_cpu(cpu, rcu_nocb_mask); } else { - pr_info("NOCB: Can't CB-deoffload an offline CPU\n"); + pr_info("NOCB: Cannot CB-deoffload offline CPU %d\n", rdp->cpu); ret = -EINVAL; } } @@ -1196,7 +1196,7 @@ int rcu_nocb_cpu_offload(int cpu) if (!ret) cpumask_set_cpu(cpu, rcu_nocb_mask); } else { - pr_info("NOCB: Can't CB-offload an offline CPU\n"); + pr_info("NOCB: Cannot CB-offload offline CPU %d\n", rdp->cpu); ret = -EINVAL; } } -- cgit v1.2.3 From 91a967fd6934abc0c7e4b0d26728e38674278707 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 Jul 2022 15:37:05 -0700 Subject: rcu: Add full-sized polling for get_completed*() and poll_state*() The get_completed_synchronize_rcu() and poll_state_synchronize_rcu() APIs compress the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds the first members of the full-state RCU grace-period polling API, namely the get_completed_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions. These use up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but which are guaranteed not to miss grace periods, at least in situations where the single-CPU grace-period optimization does not apply. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 9 ++++++ kernel/rcu/tiny.c | 10 +++++++ kernel/rcu/tree.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++--- 3 files changed, 91 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index d8e1b270a065..b31e6ed64d1b 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -336,8 +336,10 @@ struct rcu_torture_ops { void (*cond_sync_exp)(unsigned long oldstate); unsigned long (*get_gp_state)(void); unsigned long (*get_gp_completed)(void); + void (*get_gp_completed_full)(struct rcu_gp_oldstate *rgosp); unsigned long (*start_gp_poll)(void); bool (*poll_gp_state)(unsigned long oldstate); + bool (*poll_gp_state_full)(struct rcu_gp_oldstate *rgosp); void (*cond_sync)(unsigned long oldstate); call_rcu_func_t call; void (*cb_barrier)(void); @@ -503,8 +505,10 @@ static struct rcu_torture_ops rcu_ops = { .exp_sync = synchronize_rcu_expedited, .get_gp_state = get_state_synchronize_rcu, .get_gp_completed = get_completed_synchronize_rcu, + .get_gp_completed_full = get_completed_synchronize_rcu_full, .start_gp_poll = start_poll_synchronize_rcu, .poll_gp_state = poll_state_synchronize_rcu, + .poll_gp_state_full = poll_state_synchronize_rcu_full, .cond_sync = cond_synchronize_rcu, .get_gp_state_exp = get_state_synchronize_rcu, .start_gp_poll_exp = start_poll_synchronize_rcu_expedited, @@ -1212,6 +1216,7 @@ rcu_torture_writer(void *arg) bool boot_ended; bool can_expedite = !rcu_gp_is_expedited() && !rcu_gp_is_normal(); unsigned long cookie; + struct rcu_gp_oldstate cookie_full; int expediting = 0; unsigned long gp_snap; int i; @@ -1277,6 +1282,10 @@ rcu_torture_writer(void *arg) } cur_ops->readunlock(idx); } + if (cur_ops->get_gp_completed_full && cur_ops->poll_gp_state_full) { + cur_ops->get_gp_completed_full(&cookie_full); + WARN_ON_ONCE(!cur_ops->poll_gp_state_full(&cookie_full)); + } switch (synctype[torture_random(&rand) % nsynctypes]) { case RTWS_DEF_FREE: rcu_torture_writer_state = RTWS_DEF_FREE; diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c index f0561ee16b9c..435edc785412 100644 --- a/kernel/rcu/tiny.c +++ b/kernel/rcu/tiny.c @@ -183,6 +183,16 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) } EXPORT_SYMBOL_GPL(call_rcu); +/* + * Store a grace-period-counter "cookie". For more information, + * see the Tree RCU header comment. + */ +void get_completed_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) +{ + rgosp->rgos_norm = RCU_GET_STATE_COMPLETED; +} +EXPORT_SYMBOL_GPL(get_completed_synchronize_rcu_full); + /* * Return a grace-period-counter "cookie". For more information, * see the Tree RCU header comment. diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 79aea7df4345..d47c9b6d8106 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3522,6 +3522,22 @@ void synchronize_rcu(void) } EXPORT_SYMBOL_GPL(synchronize_rcu); +/** + * get_completed_synchronize_rcu_full - Return a full pre-completed polled state cookie + * @rgosp: Place to put state cookie + * + * Stores into @rgosp a value that will always be treated by functions + * like poll_state_synchronize_rcu_full() as a cookie whose grace period + * has already completed. + */ +void get_completed_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) +{ + rgosp->rgos_norm = RCU_GET_STATE_COMPLETED; + rgosp->rgos_exp = RCU_GET_STATE_COMPLETED; + rgosp->rgos_polled = RCU_GET_STATE_COMPLETED; +} +EXPORT_SYMBOL_GPL(get_completed_synchronize_rcu_full); + /** * get_state_synchronize_rcu - Snapshot current RCU state * @@ -3580,7 +3596,7 @@ unsigned long start_poll_synchronize_rcu(void) EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); /** - * poll_state_synchronize_rcu - Conditionally wait for an RCU grace period + * poll_state_synchronize_rcu - Has the specified RCU grace period completed? * * @oldstate: value from get_state_synchronize_rcu() or start_poll_synchronize_rcu() * @@ -3595,9 +3611,10 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); * But counter wrap is harmless. If the counter wraps, we have waited for * more than a billion grace periods (and way more on a 64-bit system!). * Those needing to keep oldstate values for very long time periods - * (many hours even on 32-bit systems) should check them occasionally - * and either refresh them or set a flag indicating that the grace period - * has completed. + * (many hours even on 32-bit systems) should check them occasionally and + * either refresh them or set a flag indicating that the grace period has + * completed. Alternatively, they can use get_completed_synchronize_rcu() + * to get a guaranteed-completed grace-period state. * * This function provides the same memory-ordering guarantees that * would be provided by a synchronize_rcu() that was invoked at the call @@ -3615,6 +3632,57 @@ bool poll_state_synchronize_rcu(unsigned long oldstate) } EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); +/** + * poll_state_synchronize_rcu_full - Has the specified RCU grace period completed? + * @rgosp: value from get_state_synchronize_rcu_full() or start_poll_synchronize_rcu_full() + * + * If a full RCU grace period has elapsed since the earlier call from + * which *rgosp was obtained, return @true, otherwise return @false. + * If @false is returned, it is the caller's responsibility to invoke this + * function later on until it does return @true. Alternatively, the caller + * can explicitly wait for a grace period, for example, by passing @rgosp + * to cond_synchronize_rcu() or by directly invoking synchronize_rcu(). + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited + * for more than a billion grace periods (and way more on a 64-bit + * system!). Those needing to keep rcu_gp_oldstate values for very + * long time periods (many hours even on 32-bit systems) should check + * them occasionally and either refresh them or set a flag indicating + * that the grace period has completed. Alternatively, they can use + * get_completed_synchronize_rcu_full() to get a guaranteed-completed + * grace-period state. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu() that was invoked at the call to + * the function that provided @rgosp, and that returned at the end of this + * function. And this guarantee requires that the root rcu_node structure's + * ->gp_seq field be checked instead of that of the rcu_state structure. + * The problem is that the just-ending grace-period's callbacks can be + * invoked between the time that the root rcu_node structure's ->gp_seq + * field is updated and the time that the rcu_state structure's ->gp_seq + * field is updated. Therefore, if a single synchronize_rcu() is to + * cause a subsequent poll_state_synchronize_rcu_full() to return @true, + * then the root rcu_node structure is the one that needs to be polled. + */ +bool poll_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) +{ + struct rcu_node *rnp = rcu_get_root(); + + smp_mb(); // Order against root rcu_node structure grace-period cleanup. + if (rgosp->rgos_norm == RCU_GET_STATE_COMPLETED || + rcu_seq_done_exact(&rnp->gp_seq, rgosp->rgos_norm) || + rgosp->rgos_exp == RCU_GET_STATE_COMPLETED || + rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp) || + rgosp->rgos_polled == RCU_GET_STATE_COMPLETED || + rcu_seq_done_exact(&rcu_state.gp_seq_polled, rgosp->rgos_polled)) { + smp_mb(); /* Ensure GP ends before subsequent accesses. */ + return true; + } + return false; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_full); + /** * cond_synchronize_rcu - Conditionally wait for an RCU grace period * -- cgit v1.2.3 From 3fdefca9b42c8bebe3beea5c1a067c9718ca0fc7 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 Jul 2022 19:58:13 -0700 Subject: rcu: Add full-sized polling for get_state() The get_state_synchronize_rcu() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds the next member of the full-state RCU grace-period polling API, namely the get_state_synchronize_rcu_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 10 ++++++---- kernel/rcu/tree.c | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 39 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index b31e6ed64d1b..4f196ebce7f2 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -335,6 +335,7 @@ struct rcu_torture_ops { bool (*poll_gp_state_exp)(unsigned long oldstate); void (*cond_sync_exp)(unsigned long oldstate); unsigned long (*get_gp_state)(void); + void (*get_gp_state_full)(struct rcu_gp_oldstate *rgosp); unsigned long (*get_gp_completed)(void); void (*get_gp_completed_full)(struct rcu_gp_oldstate *rgosp); unsigned long (*start_gp_poll)(void); @@ -504,6 +505,7 @@ static struct rcu_torture_ops rcu_ops = { .sync = synchronize_rcu, .exp_sync = synchronize_rcu_expedited, .get_gp_state = get_state_synchronize_rcu, + .get_gp_state_full = get_state_synchronize_rcu_full, .get_gp_completed = get_completed_synchronize_rcu, .get_gp_completed_full = get_completed_synchronize_rcu_full, .start_gp_poll = start_poll_synchronize_rcu, @@ -1293,12 +1295,12 @@ rcu_torture_writer(void *arg) break; case RTWS_EXP_SYNC: rcu_torture_writer_state = RTWS_EXP_SYNC; - if (cur_ops->get_gp_state && cur_ops->poll_gp_state) - cookie = cur_ops->get_gp_state(); + if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) + cur_ops->get_gp_state_full(&cookie_full); cur_ops->exp_sync(); cur_ops->exp_sync(); - if (cur_ops->get_gp_state && cur_ops->poll_gp_state) - WARN_ON_ONCE(!cur_ops->poll_gp_state(cookie)); + if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) + WARN_ON_ONCE(!cur_ops->poll_gp_state_full(&cookie_full)); rcu_torture_pipe_update(old_rp); break; case RTWS_COND_GET: diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d47c9b6d8106..3fa79ee78b5b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1755,6 +1755,8 @@ static noinline void rcu_gp_cleanup(void) dump_blkd_tasks(rnp, 10); WARN_ON_ONCE(rnp->qsmask); WRITE_ONCE(rnp->gp_seq, new_gp_seq); + if (!rnp->parent) + smp_mb(); // Order against failing poll_state_synchronize_rcu_full(). rdp = this_cpu_ptr(&rcu_data); if (rnp == rdp->mynode) needgp = __note_gp_changes(rnp, rdp) || needgp; @@ -3556,6 +3558,37 @@ unsigned long get_state_synchronize_rcu(void) } EXPORT_SYMBOL_GPL(get_state_synchronize_rcu); +/** + * get_state_synchronize_rcu_full - Snapshot RCU state, both normal and expedited + * @rgosp: location to place combined normal/expedited grace-period state + * + * Places the normal and expedited grace-period states in @rgosp. This + * state value can be passed to a later call to cond_synchronize_rcu_full() + * or poll_state_synchronize_rcu_full() to determine whether or not a + * grace period (whether normal or expedited) has elapsed in the meantime. + * The rcu_gp_oldstate structure takes up twice the memory of an unsigned + * long, but is guaranteed to see all grace periods. In contrast, the + * combined state occupies less memory, but can sometimes fail to take + * grace periods into account. + * + * This does not guarantee that the needed grace period will actually + * start. + */ +void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) +{ + struct rcu_node *rnp = rcu_get_root(); + + /* + * Any prior manipulation of RCU-protected data must happen + * before the loads from ->gp_seq and ->expedited_sequence. + */ + smp_mb(); /* ^^^ */ + rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq); + rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence); + rgosp->rgos_polled = rcu_seq_snap(&rcu_state.gp_seq_polled); +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_full); + /** * start_poll_synchronize_rcu - Snapshot and start RCU grace period * -- cgit v1.2.3 From ccb42229fb34bc3e93f7aa081da3e78eac68cd27 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 29 Jul 2022 22:05:17 -0700 Subject: rcutorture: Abstract synchronous and polled API testing This commit abstracts a do_rtws_sync() function that does synchronous grace-period testing, but also testing the polled API 25% of the time each for the normal and full-state variants of the polled API. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 48 ++++++++++++++++++++++++++++++++++++------------ 1 file changed, 36 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 4f196ebce7f2..c3c94e343eb2 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1207,6 +1207,40 @@ static void rcu_torture_write_types(void) } } +/* + * Do the specified rcu_torture_writer() synchronous grace period, + * while also testing out the polled APIs. Note well that the single-CPU + * grace-period optimizations must be accounted for. + */ +static void do_rtws_sync(struct torture_random_state *trsp, void (*sync)(void)) +{ + unsigned long cookie; + struct rcu_gp_oldstate cookie_full; + bool dopoll; + bool dopoll_full; + unsigned long r = torture_random(trsp); + + dopoll = cur_ops->get_gp_state && cur_ops->poll_gp_state && !(r & 0x300); + dopoll_full = cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full && !(r & 0xc00); + if (dopoll || dopoll_full) + cpus_read_lock(); + if (dopoll) + cookie = cur_ops->get_gp_state(); + if (dopoll_full) + cur_ops->get_gp_state_full(&cookie_full); + if (dopoll || (!IS_ENABLED(CONFIG_TINY_RCU) && dopoll_full && num_online_cpus() <= 1)) + sync(); + sync(); + WARN_ONCE(dopoll && !cur_ops->poll_gp_state(cookie), + "%s: Cookie check 3 failed %pS() online %*pbl.", + __func__, sync, cpumask_pr_args(cpu_online_mask)); + WARN_ONCE(dopoll_full && !cur_ops->poll_gp_state_full(&cookie_full), + "%s: Cookie check 4 failed %pS() online %*pbl", + __func__, sync, cpumask_pr_args(cpu_online_mask)); + if (dopoll || dopoll_full) + cpus_read_unlock(); +} + /* * RCU torture writer kthread. Repeatedly substitutes a new structure * for that pointed to by rcu_torture_current, freeing the old structure @@ -1295,12 +1329,7 @@ rcu_torture_writer(void *arg) break; case RTWS_EXP_SYNC: rcu_torture_writer_state = RTWS_EXP_SYNC; - if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) - cur_ops->get_gp_state_full(&cookie_full); - cur_ops->exp_sync(); - cur_ops->exp_sync(); - if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) - WARN_ON_ONCE(!cur_ops->poll_gp_state_full(&cookie_full)); + do_rtws_sync(&rand, cur_ops->exp_sync); rcu_torture_pipe_update(old_rp); break; case RTWS_COND_GET: @@ -1339,12 +1368,7 @@ rcu_torture_writer(void *arg) break; case RTWS_SYNC: rcu_torture_writer_state = RTWS_SYNC; - if (cur_ops->get_gp_state && cur_ops->poll_gp_state) - cookie = cur_ops->get_gp_state(); - cur_ops->sync(); - cur_ops->sync(); - if (cur_ops->get_gp_state && cur_ops->poll_gp_state) - WARN_ON_ONCE(!cur_ops->poll_gp_state(cookie)); + do_rtws_sync(&rand, cur_ops->sync); rcu_torture_pipe_update(old_rp); break; default: -- cgit v1.2.3 From ed7d2f1abee48a90fae5fdf5c60d71803a5a1d10 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 1 Aug 2022 16:57:17 -0700 Subject: rcutorture: Allow per-RCU-flavor polled double-GP check Only vanilla RCU needs a double grace period for its compressed polled grace-period old-state cookie. This commit therefore adds an rcu_torture_ops per-flavor function ->poll_need_2gp to allow this check to be adapted to the RCU flavor under test. A NULL pointer for this function says that doubled grace periods are never needed. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index c3c94e343eb2..f2564c7633a8 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -341,6 +341,7 @@ struct rcu_torture_ops { unsigned long (*start_gp_poll)(void); bool (*poll_gp_state)(unsigned long oldstate); bool (*poll_gp_state_full)(struct rcu_gp_oldstate *rgosp); + bool (*poll_need_2gp)(bool poll, bool poll_full); void (*cond_sync)(unsigned long oldstate); call_rcu_func_t call; void (*cb_barrier)(void); @@ -492,6 +493,11 @@ static void rcu_sync_torture_init(void) INIT_LIST_HEAD(&rcu_torture_removed); } +static bool rcu_poll_need_2gp(bool poll, bool poll_full) +{ + return poll || (!IS_ENABLED(CONFIG_TINY_RCU) && poll_full && num_online_cpus() <= 1); +} + static struct rcu_torture_ops rcu_ops = { .ttype = RCU_FLAVOR, .init = rcu_sync_torture_init, @@ -511,6 +517,7 @@ static struct rcu_torture_ops rcu_ops = { .start_gp_poll = start_poll_synchronize_rcu, .poll_gp_state = poll_state_synchronize_rcu, .poll_gp_state_full = poll_state_synchronize_rcu_full, + .poll_need_2gp = rcu_poll_need_2gp, .cond_sync = cond_synchronize_rcu, .get_gp_state_exp = get_state_synchronize_rcu, .start_gp_poll_exp = start_poll_synchronize_rcu_expedited, @@ -1228,7 +1235,7 @@ static void do_rtws_sync(struct torture_random_state *trsp, void (*sync)(void)) cookie = cur_ops->get_gp_state(); if (dopoll_full) cur_ops->get_gp_state_full(&cookie_full); - if (dopoll || (!IS_ENABLED(CONFIG_TINY_RCU) && dopoll_full && num_online_cpus() <= 1)) + if (cur_ops->poll_need_2gp && cur_ops->poll_need_2gp(dopoll, dopoll_full)) sync(); sync(); WARN_ONCE(dopoll && !cur_ops->poll_gp_state(cookie), -- cgit v1.2.3 From d594231aa50a0f113cfb4749da5162e6a0c3fa73 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 1 Aug 2022 17:33:24 -0700 Subject: rcutorture: Verify RCU reader prevents full polling from completing This commit adds a test to rcu_torture_writer() that verifies that a ->get_gp_state_full() and ->poll_gp_state_full() polled grace-period sequence does not claim that a grace period elapsed within the confines of the corresponding read-side critical section. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index f2564c7633a8..050f4d0a987f 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1309,6 +1309,8 @@ rcu_torture_writer(void *arg) atomic_inc(&rcu_torture_wcount[i]); WRITE_ONCE(old_rp->rtort_pipe_count, old_rp->rtort_pipe_count + 1); + + // Make sure readers block polled grace periods. if (cur_ops->get_gp_state && cur_ops->poll_gp_state) { idx = cur_ops->readlock(); cookie = cur_ops->get_gp_state(); @@ -1325,9 +1327,20 @@ rcu_torture_writer(void *arg) } cur_ops->readunlock(idx); } - if (cur_ops->get_gp_completed_full && cur_ops->poll_gp_state_full) { - cur_ops->get_gp_completed_full(&cookie_full); - WARN_ON_ONCE(!cur_ops->poll_gp_state_full(&cookie_full)); + if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) { + idx = cur_ops->readlock(); + cur_ops->get_gp_state_full(&cookie_full); + WARN_ONCE(cur_ops->poll_gp_state_full(&cookie_full), + "%s: Cookie check 5 failed %s(%d) online %*pbl\n", + __func__, + rcu_torture_writer_state_getname(), + rcu_torture_writer_state, + cpumask_pr_args(cpu_online_mask)); + if (cur_ops->get_gp_completed_full) { + cur_ops->get_gp_completed_full(&cookie_full); + WARN_ON_ONCE(!cur_ops->poll_gp_state_full(&cookie_full)); + } + cur_ops->readunlock(idx); } switch (synctype[torture_random(&rand) % nsynctypes]) { case RTWS_DEF_FREE: -- cgit v1.2.3 From 37d6ade31cf83305495a2875ab5d283e17322032 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Aug 2022 10:22:12 -0700 Subject: rcutorture: Remove redundant RTWS_DEF_FREE check This check does nothing because the state at this point in the code because the rcu_torture_writer_state value is guaranteed to instead be RTWS_REPLACE. This commit therefore removes this check. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 050f4d0a987f..236bd6b57277 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1314,8 +1314,7 @@ rcu_torture_writer(void *arg) if (cur_ops->get_gp_state && cur_ops->poll_gp_state) { idx = cur_ops->readlock(); cookie = cur_ops->get_gp_state(); - WARN_ONCE(rcu_torture_writer_state != RTWS_DEF_FREE && - cur_ops->poll_gp_state(cookie), + WARN_ONCE(cur_ops->poll_gp_state(cookie), "%s: Cookie check 1 failed %s(%d) %lu->%lu\n", __func__, rcu_torture_writer_state_getname(), -- cgit v1.2.3 From f4754ad2922e5a2b08c8aecf33d1ec03d7219fb4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Aug 2022 16:40:48 -0700 Subject: rcutorture: Verify long-running reader prevents full polling from completing This commit adds full-state polling checks to accompany the old-style polling checks in the rcu_torture_one_read() function. If a polling cycle within an RCU reader completes, a WARN_ONCE() is triggered. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 236bd6b57277..3d8542010847 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1770,6 +1770,7 @@ rcutorture_loop_extend(int *readstate, struct torture_random_state *trsp, static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid) { unsigned long cookie; + struct rcu_gp_oldstate cookie_full; int i; unsigned long started; unsigned long completed; @@ -1787,6 +1788,8 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid) rcutorture_one_extend(&readstate, newstate, trsp, rtrsp++); if (cur_ops->get_gp_state && cur_ops->poll_gp_state) cookie = cur_ops->get_gp_state(); + if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) + cur_ops->get_gp_state_full(&cookie_full); started = cur_ops->get_gp_seq(); ts = rcu_trace_clock_local(); p = rcu_dereference_check(rcu_torture_current, @@ -1827,6 +1830,13 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid) rcu_torture_writer_state_getname(), rcu_torture_writer_state, cookie, cur_ops->get_gp_state()); + if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) + WARN_ONCE(cur_ops->poll_gp_state_full(&cookie_full), + "%s: Cookie check 6 failed %s(%d) online %*pbl\n", + __func__, + rcu_torture_writer_state_getname(), + rcu_torture_writer_state, + cpumask_pr_args(cpu_online_mask)); rcutorture_one_extend(&readstate, 0, trsp, rtrsp); WARN_ON_ONCE(readstate); // This next splat is expected behavior if leakpointer, especially -- cgit v1.2.3 From 76ea364161e72b1878126edf8d507d2a62fb47b0 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Aug 2022 17:04:54 -0700 Subject: rcu: Add full-sized polling for start_poll() The start_poll_synchronize_rcu() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds the next member of the full-state RCU grace-period polling API, namely the start_poll_synchronize_rcu_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 49 +++++++++++++++++++++++++++++++++-------- kernel/rcu/tree.c | 58 ++++++++++++++++++++++++++++++++++++++----------- 2 files changed, 85 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 3d8542010847..68387ccc7ddf 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -88,6 +88,7 @@ torture_param(bool, gp_exp, false, "Use expedited GP wait primitives"); torture_param(bool, gp_normal, false, "Use normal (non-expedited) GP wait primitives"); torture_param(bool, gp_poll, false, "Use polling GP wait primitives"); torture_param(bool, gp_poll_exp, false, "Use polling expedited GP wait primitives"); +torture_param(bool, gp_poll_full, false, "Use polling full-state GP wait primitives"); torture_param(bool, gp_sync, false, "Use synchronous GP wait primitives"); torture_param(int, irqreader, 1, "Allow RCU readers from irq handlers"); torture_param(int, leakpointer, 0, "Leak pointer dereferences from readers"); @@ -198,12 +199,14 @@ static int rcu_torture_writer_state; #define RTWS_COND_SYNC 7 #define RTWS_COND_SYNC_EXP 8 #define RTWS_POLL_GET 9 -#define RTWS_POLL_GET_EXP 10 -#define RTWS_POLL_WAIT 11 -#define RTWS_POLL_WAIT_EXP 12 -#define RTWS_SYNC 13 -#define RTWS_STUTTER 14 -#define RTWS_STOPPING 15 +#define RTWS_POLL_GET_FULL 10 +#define RTWS_POLL_GET_EXP 11 +#define RTWS_POLL_WAIT 12 +#define RTWS_POLL_WAIT_FULL 13 +#define RTWS_POLL_WAIT_EXP 14 +#define RTWS_SYNC 15 +#define RTWS_STUTTER 16 +#define RTWS_STOPPING 17 static const char * const rcu_torture_writer_state_names[] = { "RTWS_FIXED_DELAY", "RTWS_DELAY", @@ -215,8 +218,10 @@ static const char * const rcu_torture_writer_state_names[] = { "RTWS_COND_SYNC", "RTWS_COND_SYNC_EXP", "RTWS_POLL_GET", + "RTWS_POLL_GET_FULL", "RTWS_POLL_GET_EXP", "RTWS_POLL_WAIT", + "RTWS_POLL_WAIT_FULL", "RTWS_POLL_WAIT_EXP", "RTWS_SYNC", "RTWS_STUTTER", @@ -339,6 +344,7 @@ struct rcu_torture_ops { unsigned long (*get_gp_completed)(void); void (*get_gp_completed_full)(struct rcu_gp_oldstate *rgosp); unsigned long (*start_gp_poll)(void); + void (*start_gp_poll_full)(struct rcu_gp_oldstate *rgosp); bool (*poll_gp_state)(unsigned long oldstate); bool (*poll_gp_state_full)(struct rcu_gp_oldstate *rgosp); bool (*poll_need_2gp)(bool poll, bool poll_full); @@ -515,6 +521,7 @@ static struct rcu_torture_ops rcu_ops = { .get_gp_completed = get_completed_synchronize_rcu, .get_gp_completed_full = get_completed_synchronize_rcu_full, .start_gp_poll = start_poll_synchronize_rcu, + .start_gp_poll_full = start_poll_synchronize_rcu_full, .poll_gp_state = poll_state_synchronize_rcu, .poll_gp_state_full = poll_state_synchronize_rcu_full, .poll_need_2gp = rcu_poll_need_2gp, @@ -1163,13 +1170,13 @@ static void rcu_torture_write_types(void) { bool gp_cond1 = gp_cond, gp_cond_exp1 = gp_cond_exp, gp_exp1 = gp_exp; bool gp_poll_exp1 = gp_poll_exp, gp_normal1 = gp_normal, gp_poll1 = gp_poll; - bool gp_sync1 = gp_sync; + bool gp_poll_full1 = gp_poll_full, gp_sync1 = gp_sync; /* Initialize synctype[] array. If none set, take default. */ if (!gp_cond1 && !gp_cond_exp1 && !gp_exp1 && !gp_poll_exp && - !gp_normal1 && !gp_poll1 && !gp_sync1) + !gp_normal1 && !gp_poll1 && !gp_poll_full1 && !gp_sync1) gp_cond1 = gp_cond_exp1 = gp_exp1 = gp_poll_exp1 = - gp_normal1 = gp_poll1 = gp_sync1 = true; + gp_normal1 = gp_poll1 = gp_poll_full1 = gp_sync1 = true; if (gp_cond1 && cur_ops->get_gp_state && cur_ops->cond_sync) { synctype[nsynctypes++] = RTWS_COND_GET; pr_info("%s: Testing conditional GPs.\n", __func__); @@ -1200,6 +1207,12 @@ static void rcu_torture_write_types(void) } else if (gp_poll && (!cur_ops->start_gp_poll || !cur_ops->poll_gp_state)) { pr_alert("%s: gp_poll without primitives.\n", __func__); } + if (gp_poll_full1 && cur_ops->start_gp_poll_full && cur_ops->poll_gp_state_full) { + synctype[nsynctypes++] = RTWS_POLL_GET_FULL; + pr_info("%s: Testing polling full-state GPs.\n", __func__); + } else if (gp_poll_full && (!cur_ops->start_gp_poll_full || !cur_ops->poll_gp_state_full)) { + pr_alert("%s: gp_poll_full without primitives.\n", __func__); + } if (gp_poll_exp1 && cur_ops->start_gp_poll_exp && cur_ops->poll_gp_state_exp) { synctype[nsynctypes++] = RTWS_POLL_GET_EXP; pr_info("%s: Testing polling expedited GPs.\n", __func__); @@ -1262,6 +1275,7 @@ rcu_torture_writer(void *arg) struct rcu_gp_oldstate cookie_full; int expediting = 0; unsigned long gp_snap; + struct rcu_gp_oldstate gp_snap_full; int i; int idx; int oldnice = task_nice(current); @@ -1376,6 +1390,15 @@ rcu_torture_writer(void *arg) &rand); rcu_torture_pipe_update(old_rp); break; + case RTWS_POLL_GET_FULL: + rcu_torture_writer_state = RTWS_POLL_GET_FULL; + cur_ops->start_gp_poll_full(&gp_snap_full); + rcu_torture_writer_state = RTWS_POLL_WAIT_FULL; + while (!cur_ops->poll_gp_state_full(&gp_snap_full)) + torture_hrtimeout_jiffies(torture_random(&rand) % 16, + &rand); + rcu_torture_pipe_update(old_rp); + break; case RTWS_POLL_GET_EXP: rcu_torture_writer_state = RTWS_POLL_GET_EXP; gp_snap = cur_ops->start_gp_poll_exp(); @@ -1454,6 +1477,7 @@ static int rcu_torture_fakewriter(void *arg) { unsigned long gp_snap; + struct rcu_gp_oldstate gp_snap_full; DEFINE_TORTURE_RANDOM(rand); VERBOSE_TOROUT_STRING("rcu_torture_fakewriter task started"); @@ -1499,6 +1523,13 @@ rcu_torture_fakewriter(void *arg) &rand); } break; + case RTWS_POLL_GET_FULL: + cur_ops->start_gp_poll_full(&gp_snap_full); + while (!cur_ops->poll_gp_state_full(&gp_snap_full)) { + torture_hrtimeout_jiffies(torture_random(&rand) % 16, + &rand); + } + break; case RTWS_POLL_GET_EXP: gp_snap = cur_ops->start_gp_poll_exp(); while (!cur_ops->poll_gp_state_exp(gp_snap)) { diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 3fa79ee78b5b..89572385fd1a 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3589,22 +3589,13 @@ void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) } EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_full); -/** - * start_poll_synchronize_rcu - Snapshot and start RCU grace period - * - * Returns a cookie that is used by a later call to cond_synchronize_rcu() - * or poll_state_synchronize_rcu() to determine whether or not a full - * grace period has elapsed in the meantime. If the needed grace period - * is not already slated to start, notifies RCU core of the need for that - * grace period. - * - * Interrupts must be enabled for the case where it is necessary to awaken - * the grace-period kthread. +/* + * Helper function for start_poll_synchronize_rcu() and + * start_poll_synchronize_rcu_full(). */ -unsigned long start_poll_synchronize_rcu(void) +static void start_poll_synchronize_rcu_common(void) { unsigned long flags; - unsigned long gp_seq = get_state_synchronize_rcu(); bool needwake; struct rcu_data *rdp; struct rcu_node *rnp; @@ -3624,10 +3615,51 @@ unsigned long start_poll_synchronize_rcu(void) raw_spin_unlock_irqrestore_rcu_node(rnp, flags); if (needwake) rcu_gp_kthread_wake(); +} + +/** + * start_poll_synchronize_rcu - Snapshot and start RCU grace period + * + * Returns a cookie that is used by a later call to cond_synchronize_rcu() + * or poll_state_synchronize_rcu() to determine whether or not a full + * grace period has elapsed in the meantime. If the needed grace period + * is not already slated to start, notifies RCU core of the need for that + * grace period. + * + * Interrupts must be enabled for the case where it is necessary to awaken + * the grace-period kthread. + */ +unsigned long start_poll_synchronize_rcu(void) +{ + unsigned long gp_seq = get_state_synchronize_rcu(); + + start_poll_synchronize_rcu_common(); return gp_seq; } EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); +/** + * start_poll_synchronize_rcu_full - Take a full snapshot and start RCU grace period + * @rgosp: value from get_state_synchronize_rcu_full() or start_poll_synchronize_rcu_full() + * + * Places the normal and expedited grace-period states in *@rgos. This + * state value can be passed to a later call to cond_synchronize_rcu_full() + * or poll_state_synchronize_rcu_full() to determine whether or not a + * grace period (whether normal or expedited) has elapsed in the meantime. + * If the needed grace period is not already slated to start, notifies + * RCU core of the need for that grace period. + * + * Interrupts must be enabled for the case where it is necessary to awaken + * the grace-period kthread. + */ +void start_poll_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) +{ + get_state_synchronize_rcu_full(rgosp); + + start_poll_synchronize_rcu_common(); +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_full); + /** * poll_state_synchronize_rcu - Has the specified RCU grace period completed? * -- cgit v1.2.3 From 6c502b14ba66da0670a59e20354469fa56eab26c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Aug 2022 12:38:51 -0700 Subject: rcu: Add full-sized polling for start_poll_expedited() The start_poll_synchronize_rcu_expedited() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds yet another member of the full-state RCU grace-period polling API, which is the start_poll_synchronize_rcu_expedited_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. [ paulmck: Apply feedback from kernel test robot and Julia Lawall. ] Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 51 +++++++++++++++++++++++++++++++++++++++---------- kernel/rcu/tree_exp.h | 18 +++++++++++++++++ 2 files changed, 59 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 68387ccc7ddf..f9ca33555deb 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -89,6 +89,7 @@ torture_param(bool, gp_normal, false, "Use normal (non-expedited) GP wait primit torture_param(bool, gp_poll, false, "Use polling GP wait primitives"); torture_param(bool, gp_poll_exp, false, "Use polling expedited GP wait primitives"); torture_param(bool, gp_poll_full, false, "Use polling full-state GP wait primitives"); +torture_param(bool, gp_poll_exp_full, false, "Use polling full-state expedited GP wait primitives"); torture_param(bool, gp_sync, false, "Use synchronous GP wait primitives"); torture_param(int, irqreader, 1, "Allow RCU readers from irq handlers"); torture_param(int, leakpointer, 0, "Leak pointer dereferences from readers"); @@ -201,12 +202,14 @@ static int rcu_torture_writer_state; #define RTWS_POLL_GET 9 #define RTWS_POLL_GET_FULL 10 #define RTWS_POLL_GET_EXP 11 -#define RTWS_POLL_WAIT 12 -#define RTWS_POLL_WAIT_FULL 13 -#define RTWS_POLL_WAIT_EXP 14 -#define RTWS_SYNC 15 -#define RTWS_STUTTER 16 -#define RTWS_STOPPING 17 +#define RTWS_POLL_GET_EXP_FULL 12 +#define RTWS_POLL_WAIT 13 +#define RTWS_POLL_WAIT_FULL 14 +#define RTWS_POLL_WAIT_EXP 15 +#define RTWS_POLL_WAIT_EXP_FULL 16 +#define RTWS_SYNC 17 +#define RTWS_STUTTER 18 +#define RTWS_STOPPING 19 static const char * const rcu_torture_writer_state_names[] = { "RTWS_FIXED_DELAY", "RTWS_DELAY", @@ -220,9 +223,11 @@ static const char * const rcu_torture_writer_state_names[] = { "RTWS_POLL_GET", "RTWS_POLL_GET_FULL", "RTWS_POLL_GET_EXP", + "RTWS_POLL_GET_EXP_FULL", "RTWS_POLL_WAIT", "RTWS_POLL_WAIT_FULL", "RTWS_POLL_WAIT_EXP", + "RTWS_POLL_WAIT_EXP_FULL", "RTWS_SYNC", "RTWS_STUTTER", "RTWS_STOPPING", @@ -337,6 +342,7 @@ struct rcu_torture_ops { void (*exp_sync)(void); unsigned long (*get_gp_state_exp)(void); unsigned long (*start_gp_poll_exp)(void); + void (*start_gp_poll_exp_full)(struct rcu_gp_oldstate *rgosp); bool (*poll_gp_state_exp)(unsigned long oldstate); void (*cond_sync_exp)(unsigned long oldstate); unsigned long (*get_gp_state)(void); @@ -528,6 +534,7 @@ static struct rcu_torture_ops rcu_ops = { .cond_sync = cond_synchronize_rcu, .get_gp_state_exp = get_state_synchronize_rcu, .start_gp_poll_exp = start_poll_synchronize_rcu_expedited, + .start_gp_poll_exp_full = start_poll_synchronize_rcu_expedited_full, .poll_gp_state_exp = poll_state_synchronize_rcu, .cond_sync_exp = cond_synchronize_rcu_expedited, .call = call_rcu, @@ -1169,13 +1176,14 @@ static int nsynctypes; static void rcu_torture_write_types(void) { bool gp_cond1 = gp_cond, gp_cond_exp1 = gp_cond_exp, gp_exp1 = gp_exp; - bool gp_poll_exp1 = gp_poll_exp, gp_normal1 = gp_normal, gp_poll1 = gp_poll; - bool gp_poll_full1 = gp_poll_full, gp_sync1 = gp_sync; + bool gp_poll_exp1 = gp_poll_exp, gp_poll_exp_full1 = gp_poll_exp_full; + bool gp_normal1 = gp_normal, gp_poll1 = gp_poll, gp_poll_full1 = gp_poll_full; + bool gp_sync1 = gp_sync; /* Initialize synctype[] array. If none set, take default. */ - if (!gp_cond1 && !gp_cond_exp1 && !gp_exp1 && !gp_poll_exp && + if (!gp_cond1 && !gp_cond_exp1 && !gp_exp1 && !gp_poll_exp && !gp_poll_exp_full1 && !gp_normal1 && !gp_poll1 && !gp_poll_full1 && !gp_sync1) - gp_cond1 = gp_cond_exp1 = gp_exp1 = gp_poll_exp1 = + gp_cond1 = gp_cond_exp1 = gp_exp1 = gp_poll_exp1 = gp_poll_exp_full1 = gp_normal1 = gp_poll1 = gp_poll_full1 = gp_sync1 = true; if (gp_cond1 && cur_ops->get_gp_state && cur_ops->cond_sync) { synctype[nsynctypes++] = RTWS_COND_GET; @@ -1219,6 +1227,13 @@ static void rcu_torture_write_types(void) } else if (gp_poll_exp && (!cur_ops->start_gp_poll_exp || !cur_ops->poll_gp_state_exp)) { pr_alert("%s: gp_poll_exp without primitives.\n", __func__); } + if (gp_poll_exp_full1 && cur_ops->start_gp_poll_exp_full && cur_ops->poll_gp_state_full) { + synctype[nsynctypes++] = RTWS_POLL_GET_EXP_FULL; + pr_info("%s: Testing polling full-state expedited GPs.\n", __func__); + } else if (gp_poll_exp_full && + (!cur_ops->start_gp_poll_exp_full || !cur_ops->poll_gp_state_full)) { + pr_alert("%s: gp_poll_exp_full without primitives.\n", __func__); + } if (gp_sync1 && cur_ops->sync) { synctype[nsynctypes++] = RTWS_SYNC; pr_info("%s: Testing normal GPs.\n", __func__); @@ -1408,6 +1423,15 @@ rcu_torture_writer(void *arg) &rand); rcu_torture_pipe_update(old_rp); break; + case RTWS_POLL_GET_EXP_FULL: + rcu_torture_writer_state = RTWS_POLL_GET_EXP_FULL; + cur_ops->start_gp_poll_exp_full(&gp_snap_full); + rcu_torture_writer_state = RTWS_POLL_WAIT_EXP_FULL; + while (!cur_ops->poll_gp_state_full(&gp_snap_full)) + torture_hrtimeout_jiffies(torture_random(&rand) % 16, + &rand); + rcu_torture_pipe_update(old_rp); + break; case RTWS_SYNC: rcu_torture_writer_state = RTWS_SYNC; do_rtws_sync(&rand, cur_ops->sync); @@ -1537,6 +1561,13 @@ rcu_torture_fakewriter(void *arg) &rand); } break; + case RTWS_POLL_GET_EXP_FULL: + cur_ops->start_gp_poll_exp_full(&gp_snap_full); + while (!cur_ops->poll_gp_state_full(&gp_snap_full)) { + torture_hrtimeout_jiffies(torture_random(&rand) % 16, + &rand); + } + break; case RTWS_SYNC: cur_ops->sync(); break; diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index be667583a554..18128ee0d36c 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -1027,6 +1027,24 @@ unsigned long start_poll_synchronize_rcu_expedited(void) } EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); +/** + * start_poll_synchronize_rcu_expedited_full - Take a full snapshot and start expedited grace period + * @rgosp: Place to put snapshot of grace-period state + * + * Places the normal and expedited grace-period states in rgosp. This + * state value can be passed to a later call to cond_synchronize_rcu_full() + * or poll_state_synchronize_rcu_full() to determine whether or not a + * grace period (whether normal or expedited) has elapsed in the meantime. + * If the needed expedited grace period is not already slated to start, + * initiates that grace period. + */ +void start_poll_synchronize_rcu_expedited_full(struct rcu_gp_oldstate *rgosp) +{ + get_state_synchronize_rcu_full(rgosp); + (void)start_poll_synchronize_rcu_expedited(); +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited_full); + /** * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period * -- cgit v1.2.3 From f21e014345e0abf11fdc2e59fb6eb6d6aa6ae4eb Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 3 Aug 2022 16:57:47 -0700 Subject: rcu: Remove blank line from poll_state_synchronize_rcu() docbook header This commit removes the blank line preceding the oldstate parameter to the docbook header for the poll_state_synchronize_rcu() function and marks uses of this parameter later in that header. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 89572385fd1a..0a24ef4d6b82 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3662,11 +3662,10 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_full); /** * poll_state_synchronize_rcu - Has the specified RCU grace period completed? - * * @oldstate: value from get_state_synchronize_rcu() or start_poll_synchronize_rcu() * * If a full RCU grace period has elapsed since the earlier call from - * which oldstate was obtained, return @true, otherwise return @false. + * which @oldstate was obtained, return @true, otherwise return @false. * If @false is returned, it is the caller's responsibility to invoke this * function later on until it does return @true. Alternatively, the caller * can explicitly wait for a grace period, for example, by passing @oldstate @@ -3675,7 +3674,7 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_full); * Yes, this function does not take counter wrap into account. * But counter wrap is harmless. If the counter wraps, we have waited for * more than a billion grace periods (and way more on a 64-bit system!). - * Those needing to keep oldstate values for very long time periods + * Those needing to keep old state values for very long time periods * (many hours even on 32-bit systems) should check them occasionally and * either refresh them or set a flag indicating that the grace period has * completed. Alternatively, they can use get_completed_synchronize_rcu() -- cgit v1.2.3 From b6fe4917ae4353b397079902cb024ae01f20dfb2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 13:46:05 -0700 Subject: rcu: Add full-sized polling for cond_sync_full() The cond_synchronize_rcu() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds yet another member of the full-state RCU grace-period polling API, which is the cond_synchronize_rcu_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. [ paulmck: Apply feedback from kernel test robot and Julia Lawall. ] Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 67 ++++++++++++++++++++++++++++++++++--------------- kernel/rcu/tree.c | 28 ++++++++++++++++++++- 2 files changed, 74 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index f9ca33555deb..9d22161bf770 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -84,6 +84,7 @@ torture_param(int, fwd_progress_holdoff, 60, "Time between forward-progress test torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched() behind need_resched()"); torture_param(bool, gp_cond, false, "Use conditional/async GP wait primitives"); torture_param(bool, gp_cond_exp, false, "Use conditional/async expedited GP wait primitives"); +torture_param(bool, gp_cond_full, false, "Use conditional/async full-state GP wait primitives"); torture_param(bool, gp_exp, false, "Use expedited GP wait primitives"); torture_param(bool, gp_normal, false, "Use normal (non-expedited) GP wait primitives"); torture_param(bool, gp_poll, false, "Use polling GP wait primitives"); @@ -196,20 +197,22 @@ static int rcu_torture_writer_state; #define RTWS_DEF_FREE 3 #define RTWS_EXP_SYNC 4 #define RTWS_COND_GET 5 -#define RTWS_COND_GET_EXP 6 -#define RTWS_COND_SYNC 7 -#define RTWS_COND_SYNC_EXP 8 -#define RTWS_POLL_GET 9 -#define RTWS_POLL_GET_FULL 10 -#define RTWS_POLL_GET_EXP 11 -#define RTWS_POLL_GET_EXP_FULL 12 -#define RTWS_POLL_WAIT 13 -#define RTWS_POLL_WAIT_FULL 14 -#define RTWS_POLL_WAIT_EXP 15 -#define RTWS_POLL_WAIT_EXP_FULL 16 -#define RTWS_SYNC 17 -#define RTWS_STUTTER 18 -#define RTWS_STOPPING 19 +#define RTWS_COND_GET_FULL 6 +#define RTWS_COND_GET_EXP 7 +#define RTWS_COND_SYNC 8 +#define RTWS_COND_SYNC_FULL 9 +#define RTWS_COND_SYNC_EXP 10 +#define RTWS_POLL_GET 11 +#define RTWS_POLL_GET_FULL 12 +#define RTWS_POLL_GET_EXP 13 +#define RTWS_POLL_GET_EXP_FULL 14 +#define RTWS_POLL_WAIT 15 +#define RTWS_POLL_WAIT_FULL 16 +#define RTWS_POLL_WAIT_EXP 17 +#define RTWS_POLL_WAIT_EXP_FULL 18 +#define RTWS_SYNC 19 +#define RTWS_STUTTER 20 +#define RTWS_STOPPING 21 static const char * const rcu_torture_writer_state_names[] = { "RTWS_FIXED_DELAY", "RTWS_DELAY", @@ -217,8 +220,10 @@ static const char * const rcu_torture_writer_state_names[] = { "RTWS_DEF_FREE", "RTWS_EXP_SYNC", "RTWS_COND_GET", + "RTWS_COND_GET_FULL", "RTWS_COND_GET_EXP", "RTWS_COND_SYNC", + "RTWS_COND_SYNC_FULL", "RTWS_COND_SYNC_EXP", "RTWS_POLL_GET", "RTWS_POLL_GET_FULL", @@ -355,6 +360,7 @@ struct rcu_torture_ops { bool (*poll_gp_state_full)(struct rcu_gp_oldstate *rgosp); bool (*poll_need_2gp)(bool poll, bool poll_full); void (*cond_sync)(unsigned long oldstate); + void (*cond_sync_full)(struct rcu_gp_oldstate *rgosp); call_rcu_func_t call; void (*cb_barrier)(void); void (*fqs)(void); @@ -532,6 +538,7 @@ static struct rcu_torture_ops rcu_ops = { .poll_gp_state_full = poll_state_synchronize_rcu_full, .poll_need_2gp = rcu_poll_need_2gp, .cond_sync = cond_synchronize_rcu, + .cond_sync_full = cond_synchronize_rcu_full, .get_gp_state_exp = get_state_synchronize_rcu, .start_gp_poll_exp = start_poll_synchronize_rcu_expedited, .start_gp_poll_exp_full = start_poll_synchronize_rcu_expedited_full, @@ -1175,16 +1182,17 @@ static int nsynctypes; */ static void rcu_torture_write_types(void) { - bool gp_cond1 = gp_cond, gp_cond_exp1 = gp_cond_exp, gp_exp1 = gp_exp; - bool gp_poll_exp1 = gp_poll_exp, gp_poll_exp_full1 = gp_poll_exp_full; + bool gp_cond1 = gp_cond, gp_cond_exp1 = gp_cond_exp, gp_cond_full1 = gp_cond_full; + bool gp_exp1 = gp_exp, gp_poll_exp1 = gp_poll_exp, gp_poll_exp_full1 = gp_poll_exp_full; bool gp_normal1 = gp_normal, gp_poll1 = gp_poll, gp_poll_full1 = gp_poll_full; bool gp_sync1 = gp_sync; /* Initialize synctype[] array. If none set, take default. */ - if (!gp_cond1 && !gp_cond_exp1 && !gp_exp1 && !gp_poll_exp && !gp_poll_exp_full1 && - !gp_normal1 && !gp_poll1 && !gp_poll_full1 && !gp_sync1) - gp_cond1 = gp_cond_exp1 = gp_exp1 = gp_poll_exp1 = gp_poll_exp_full1 = - gp_normal1 = gp_poll1 = gp_poll_full1 = gp_sync1 = true; + if (!gp_cond1 && !gp_cond_exp1 && !gp_cond_full1 && !gp_exp1 && !gp_poll_exp && + !gp_poll_exp_full1 && !gp_normal1 && !gp_poll1 && !gp_poll_full1 && !gp_sync1) + gp_cond1 = gp_cond_exp1 = gp_cond_full1 = gp_exp1 = gp_poll_exp1 = + gp_poll_exp_full1 = gp_normal1 = gp_poll1 = gp_poll_full1 = + gp_sync1 = true; if (gp_cond1 && cur_ops->get_gp_state && cur_ops->cond_sync) { synctype[nsynctypes++] = RTWS_COND_GET; pr_info("%s: Testing conditional GPs.\n", __func__); @@ -1197,6 +1205,12 @@ static void rcu_torture_write_types(void) } else if (gp_cond_exp && (!cur_ops->get_gp_state_exp || !cur_ops->cond_sync_exp)) { pr_alert("%s: gp_cond_exp without primitives.\n", __func__); } + if (gp_cond_full1 && cur_ops->get_gp_state && cur_ops->cond_sync_full) { + synctype[nsynctypes++] = RTWS_COND_GET_FULL; + pr_info("%s: Testing conditional full-state GPs.\n", __func__); + } else if (gp_cond_full && (!cur_ops->get_gp_state || !cur_ops->cond_sync_full)) { + pr_alert("%s: gp_cond_full without primitives.\n", __func__); + } if (gp_exp1 && cur_ops->exp_sync) { synctype[nsynctypes++] = RTWS_EXP_SYNC; pr_info("%s: Testing expedited GPs.\n", __func__); @@ -1396,6 +1410,14 @@ rcu_torture_writer(void *arg) cur_ops->cond_sync_exp(gp_snap); rcu_torture_pipe_update(old_rp); break; + case RTWS_COND_GET_FULL: + rcu_torture_writer_state = RTWS_COND_GET_FULL; + cur_ops->get_gp_state_full(&gp_snap_full); + torture_hrtimeout_jiffies(torture_random(&rand) % 16, &rand); + rcu_torture_writer_state = RTWS_COND_SYNC_FULL; + cur_ops->cond_sync_full(&gp_snap_full); + rcu_torture_pipe_update(old_rp); + break; case RTWS_POLL_GET: rcu_torture_writer_state = RTWS_POLL_GET; gp_snap = cur_ops->start_gp_poll(); @@ -1540,6 +1562,11 @@ rcu_torture_fakewriter(void *arg) torture_hrtimeout_jiffies(torture_random(&rand) % 16, &rand); cur_ops->cond_sync_exp(gp_snap); break; + case RTWS_COND_GET_FULL: + cur_ops->get_gp_state_full(&gp_snap_full); + torture_hrtimeout_jiffies(torture_random(&rand) % 16, &rand); + cur_ops->cond_sync_full(&gp_snap_full); + break; case RTWS_POLL_GET: gp_snap = cur_ops->start_gp_poll(); while (!cur_ops->poll_gp_state(gp_snap)) { diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 0a24ef4d6b82..5c46c0d34ef0 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3749,7 +3749,6 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_full); /** * cond_synchronize_rcu - Conditionally wait for an RCU grace period - * * @oldstate: value from get_state_synchronize_rcu(), start_poll_synchronize_rcu(), or start_poll_synchronize_rcu_expedited() * * If a full RCU grace period has elapsed since the earlier call to @@ -3773,6 +3772,33 @@ void cond_synchronize_rcu(unsigned long oldstate) } EXPORT_SYMBOL_GPL(cond_synchronize_rcu); +/** + * cond_synchronize_rcu_full - Conditionally wait for an RCU grace period + * @rgosp: value from get_state_synchronize_rcu_full(), start_poll_synchronize_rcu_full(), or start_poll_synchronize_rcu_expedited_full() + * + * If a full RCU grace period has elapsed since the call to + * get_state_synchronize_rcu_full(), start_poll_synchronize_rcu_full(), + * or start_poll_synchronize_rcu_expedited_full() from which @rgosp was + * obtained, just return. Otherwise, invoke synchronize_rcu() to wait + * for a full grace period. + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!), + * so waiting for a couple of additional grace periods should be just fine. + * + * This function provides the same memory-ordering guarantees that + * would be provided by a synchronize_rcu() that was invoked at the call + * to the function that provided @rgosp and that returned at the end of + * this function. + */ +void cond_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) +{ + if (!poll_state_synchronize_rcu_full(rgosp)) + synchronize_rcu(); +} +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_full); + /* * Check to see if there is any immediate RCU-related work to be done by * the current CPU, returning 1 if so and zero otherwise. The checks are -- cgit v1.2.3 From 8df13f01608ea48712956c0b1afce35bdba5a1c5 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 15:23:26 -0700 Subject: rcu: Add full-sized polling for cond_sync_exp_full() The cond_synchronize_rcu_expedited() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds yet another member of the full-state RCU grace-period polling API, which is the cond_synchronize_rcu_exp_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 72 ++++++++++++++++++++++++++++++++++--------------- kernel/rcu/tree_exp.h | 27 +++++++++++++++++++ 2 files changed, 77 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 9d22161bf770..8995429c6f1c 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -85,6 +85,8 @@ torture_param(bool, fwd_progress_need_resched, 1, "Hide cond_resched() behind ne torture_param(bool, gp_cond, false, "Use conditional/async GP wait primitives"); torture_param(bool, gp_cond_exp, false, "Use conditional/async expedited GP wait primitives"); torture_param(bool, gp_cond_full, false, "Use conditional/async full-state GP wait primitives"); +torture_param(bool, gp_cond_exp_full, false, + "Use conditional/async full-stateexpedited GP wait primitives"); torture_param(bool, gp_exp, false, "Use expedited GP wait primitives"); torture_param(bool, gp_normal, false, "Use normal (non-expedited) GP wait primitives"); torture_param(bool, gp_poll, false, "Use polling GP wait primitives"); @@ -199,20 +201,22 @@ static int rcu_torture_writer_state; #define RTWS_COND_GET 5 #define RTWS_COND_GET_FULL 6 #define RTWS_COND_GET_EXP 7 -#define RTWS_COND_SYNC 8 -#define RTWS_COND_SYNC_FULL 9 -#define RTWS_COND_SYNC_EXP 10 -#define RTWS_POLL_GET 11 -#define RTWS_POLL_GET_FULL 12 -#define RTWS_POLL_GET_EXP 13 -#define RTWS_POLL_GET_EXP_FULL 14 -#define RTWS_POLL_WAIT 15 -#define RTWS_POLL_WAIT_FULL 16 -#define RTWS_POLL_WAIT_EXP 17 -#define RTWS_POLL_WAIT_EXP_FULL 18 -#define RTWS_SYNC 19 -#define RTWS_STUTTER 20 -#define RTWS_STOPPING 21 +#define RTWS_COND_GET_EXP_FULL 8 +#define RTWS_COND_SYNC 9 +#define RTWS_COND_SYNC_FULL 10 +#define RTWS_COND_SYNC_EXP 11 +#define RTWS_COND_SYNC_EXP_FULL 12 +#define RTWS_POLL_GET 13 +#define RTWS_POLL_GET_FULL 14 +#define RTWS_POLL_GET_EXP 15 +#define RTWS_POLL_GET_EXP_FULL 16 +#define RTWS_POLL_WAIT 17 +#define RTWS_POLL_WAIT_FULL 18 +#define RTWS_POLL_WAIT_EXP 19 +#define RTWS_POLL_WAIT_EXP_FULL 20 +#define RTWS_SYNC 21 +#define RTWS_STUTTER 22 +#define RTWS_STOPPING 23 static const char * const rcu_torture_writer_state_names[] = { "RTWS_FIXED_DELAY", "RTWS_DELAY", @@ -222,9 +226,11 @@ static const char * const rcu_torture_writer_state_names[] = { "RTWS_COND_GET", "RTWS_COND_GET_FULL", "RTWS_COND_GET_EXP", + "RTWS_COND_GET_EXP_FULL", "RTWS_COND_SYNC", "RTWS_COND_SYNC_FULL", "RTWS_COND_SYNC_EXP", + "RTWS_COND_SYNC_EXP_FULL", "RTWS_POLL_GET", "RTWS_POLL_GET_FULL", "RTWS_POLL_GET_EXP", @@ -350,6 +356,7 @@ struct rcu_torture_ops { void (*start_gp_poll_exp_full)(struct rcu_gp_oldstate *rgosp); bool (*poll_gp_state_exp)(unsigned long oldstate); void (*cond_sync_exp)(unsigned long oldstate); + void (*cond_sync_exp_full)(struct rcu_gp_oldstate *rgosp); unsigned long (*get_gp_state)(void); void (*get_gp_state_full)(struct rcu_gp_oldstate *rgosp); unsigned long (*get_gp_completed)(void); @@ -1183,16 +1190,17 @@ static int nsynctypes; static void rcu_torture_write_types(void) { bool gp_cond1 = gp_cond, gp_cond_exp1 = gp_cond_exp, gp_cond_full1 = gp_cond_full; - bool gp_exp1 = gp_exp, gp_poll_exp1 = gp_poll_exp, gp_poll_exp_full1 = gp_poll_exp_full; - bool gp_normal1 = gp_normal, gp_poll1 = gp_poll, gp_poll_full1 = gp_poll_full; - bool gp_sync1 = gp_sync; + bool gp_cond_exp_full1 = gp_cond_exp_full, gp_exp1 = gp_exp, gp_poll_exp1 = gp_poll_exp; + bool gp_poll_exp_full1 = gp_poll_exp_full, gp_normal1 = gp_normal, gp_poll1 = gp_poll; + bool gp_poll_full1 = gp_poll_full, gp_sync1 = gp_sync; /* Initialize synctype[] array. If none set, take default. */ - if (!gp_cond1 && !gp_cond_exp1 && !gp_cond_full1 && !gp_exp1 && !gp_poll_exp && - !gp_poll_exp_full1 && !gp_normal1 && !gp_poll1 && !gp_poll_full1 && !gp_sync1) - gp_cond1 = gp_cond_exp1 = gp_cond_full1 = gp_exp1 = gp_poll_exp1 = - gp_poll_exp_full1 = gp_normal1 = gp_poll1 = gp_poll_full1 = - gp_sync1 = true; + if (!gp_cond1 && !gp_cond_exp1 && !gp_cond_full1 && !gp_cond_exp_full1 && !gp_exp1 && + !gp_poll_exp && !gp_poll_exp_full1 && !gp_normal1 && !gp_poll1 && !gp_poll_full1 && + !gp_sync1) + gp_cond1 = gp_cond_exp1 = gp_cond_full1 = gp_cond_exp_full1 = gp_exp1 = + gp_poll_exp1 = gp_poll_exp_full1 = gp_normal1 = gp_poll1 = + gp_poll_full1 = gp_sync1 = true; if (gp_cond1 && cur_ops->get_gp_state && cur_ops->cond_sync) { synctype[nsynctypes++] = RTWS_COND_GET; pr_info("%s: Testing conditional GPs.\n", __func__); @@ -1211,6 +1219,13 @@ static void rcu_torture_write_types(void) } else if (gp_cond_full && (!cur_ops->get_gp_state || !cur_ops->cond_sync_full)) { pr_alert("%s: gp_cond_full without primitives.\n", __func__); } + if (gp_cond_exp_full1 && cur_ops->get_gp_state_exp && cur_ops->cond_sync_exp_full) { + synctype[nsynctypes++] = RTWS_COND_GET_EXP_FULL; + pr_info("%s: Testing conditional full-state expedited GPs.\n", __func__); + } else if (gp_cond_exp_full && + (!cur_ops->get_gp_state_exp || !cur_ops->cond_sync_exp_full)) { + pr_alert("%s: gp_cond_exp_full without primitives.\n", __func__); + } if (gp_exp1 && cur_ops->exp_sync) { synctype[nsynctypes++] = RTWS_EXP_SYNC; pr_info("%s: Testing expedited GPs.\n", __func__); @@ -1418,6 +1433,14 @@ rcu_torture_writer(void *arg) cur_ops->cond_sync_full(&gp_snap_full); rcu_torture_pipe_update(old_rp); break; + case RTWS_COND_GET_EXP_FULL: + rcu_torture_writer_state = RTWS_COND_GET_EXP_FULL; + cur_ops->get_gp_state_full(&gp_snap_full); + torture_hrtimeout_jiffies(torture_random(&rand) % 16, &rand); + rcu_torture_writer_state = RTWS_COND_SYNC_EXP_FULL; + cur_ops->cond_sync_exp_full(&gp_snap_full); + rcu_torture_pipe_update(old_rp); + break; case RTWS_POLL_GET: rcu_torture_writer_state = RTWS_POLL_GET; gp_snap = cur_ops->start_gp_poll(); @@ -1567,6 +1590,11 @@ rcu_torture_fakewriter(void *arg) torture_hrtimeout_jiffies(torture_random(&rand) % 16, &rand); cur_ops->cond_sync_full(&gp_snap_full); break; + case RTWS_COND_GET_EXP_FULL: + cur_ops->get_gp_state_full(&gp_snap_full); + torture_hrtimeout_jiffies(torture_random(&rand) % 16, &rand); + cur_ops->cond_sync_exp_full(&gp_snap_full); + break; case RTWS_POLL_GET: gp_snap = cur_ops->start_gp_poll(); while (!cur_ops->poll_gp_state(gp_snap)) { diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 18128ee0d36c..9c0ae834ef07 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -1071,3 +1071,30 @@ void cond_synchronize_rcu_expedited(unsigned long oldstate) synchronize_rcu_expedited(); } EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited); + +/** + * cond_synchronize_rcu_expedited_full - Conditionally wait for an expedited RCU grace period + * @rgosp: value from get_state_synchronize_rcu_full(), start_poll_synchronize_rcu_full(), or start_poll_synchronize_rcu_expedited_full() + * + * If a full RCU grace period has elapsed since the call to + * get_state_synchronize_rcu_full(), start_poll_synchronize_rcu_full(), + * or start_poll_synchronize_rcu_expedited_full() from which @rgosp was + * obtained, just return. Otherwise, invoke synchronize_rcu_expedited() + * to wait for a full grace period. + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!), + * so waiting for a couple of additional grace periods should be just fine. + * + * This function provides the same memory-ordering guarantees that + * would be provided by a synchronize_rcu() that was invoked at the call + * to the function that provided @rgosp and that returned at the end of + * this function. + */ +void cond_synchronize_rcu_expedited_full(struct rcu_gp_oldstate *rgosp) +{ + if (!poll_state_synchronize_rcu_full(rgosp)) + synchronize_rcu_expedited(); +} +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited_full); -- cgit v1.2.3 From 258f887aba60c8fc7946a9f379f9a3889f92fc85 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 16:07:04 -0700 Subject: rcu: Disable run-time single-CPU grace-period optimization The run-time single-CPU grace-period optimization applies only to kernels built with CONFIG_SMP=y && CONFIG_PREEMPTION=y that are running on a single-CPU system. But a kernel intended for a single-CPU system should instead be built with CONFIG_SMP=n, and in any case, single-CPU systems running Linux no longer appear to be the common case. Plus this optimization results in the rcu_gp_oldstate structure being half again larger than it needs to be. This commit therefore disables the run-time single-CPU grace-period optimization, so that this optimization applies only during the pre-scheduler portion of the boot sequence. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 40 +++++++++------------------------------- 1 file changed, 9 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 5c46c0d34ef0..5c4ec9dd4ce7 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3423,42 +3423,20 @@ void __init kfree_rcu_scheduler_running(void) /* * During early boot, any blocking grace-period wait automatically - * implies a grace period. Later on, this is never the case for PREEMPTION. + * implies a grace period. * - * However, because a context switch is a grace period for !PREEMPTION, any - * blocking grace-period wait automatically implies a grace period if - * there is only one CPU online at any point time during execution of - * either synchronize_rcu() or synchronize_rcu_expedited(). It is OK to - * occasionally incorrectly indicate that there are multiple CPUs online - * when there was in fact only one the whole time, as this just adds some - * overhead: RCU still operates correctly. + * Later on, this could in theory be the case for kernels built with + * CONFIG_SMP=y && CONFIG_PREEMPTION=y running on a single CPU, but this + * is not a common case. Furthermore, this optimization would cause + * the rcu_gp_oldstate structure to expand by 50%, so this potential + * grace-period optimization is ignored once the scheduler is running. */ static int rcu_blocking_is_gp(void) { - int ret; - - // Invoking preempt_model_*() too early gets a splat. - if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE || - preempt_model_full() || preempt_model_rt()) - return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE; + if (rcu_scheduler_active != RCU_SCHEDULER_INACTIVE) + return false; might_sleep(); /* Check for RCU read-side critical section. */ - preempt_disable(); - /* - * If the rcu_state.n_online_cpus counter is equal to one, - * there is only one CPU, and that CPU sees all prior accesses - * made by any CPU that was online at the time of its access. - * Furthermore, if this counter is equal to one, its value cannot - * change until after the preempt_enable() below. - * - * Furthermore, if rcu_state.n_online_cpus is equal to one here, - * all later CPUs (both this one and any that come online later - * on) are guaranteed to see all accesses prior to this point - * in the code, without the need for additional memory barriers. - * Those memory barriers are provided by CPU-hotplug code. - */ - ret = READ_ONCE(rcu_state.n_online_cpus) <= 1; - preempt_enable(); - return ret; + return true; } /** -- cgit v1.2.3 From a5d1b0b68a62afb1bce0b36cc9a1875acf8a6dff Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 17:01:55 -0700 Subject: rcu: Set rcu_data structures' initial ->gpwrap value to true It would be good do reduce the size of the rcu_gp_oldstate structure from three unsigned long instances to two, but this requires that the boot-time optimized grace periods update the various ->gp_seq fields. Updating these fields in the rcu_state structure and in all of the rcu_node structures is at least semi-reasonable, but updating them in all of the rcu_data structures is a bridge too far. This means that if there are too many early boot-time grace periods, the ->gp_seq field in the rcu_data structure cannot be trusted. This commit therefore sets each rcu_data structure's ->gpwrap field to provide the necessary impetus for a suitable level of distrust. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 5c4ec9dd4ce7..03b089184b37 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -76,6 +76,7 @@ /* Data structures. */ static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = { + .gpwrap = true, #ifdef CONFIG_RCU_NOCB_CPU .cblist.flags = SEGCBLIST_RCU_CORE, #endif -- cgit v1.2.3 From 5f11bad6b7228858e06729de6dd4079dfc082648 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 17:16:24 -0700 Subject: rcu-tasks: Remove grace-period fast-path rcu-tasks helper Now that the grace-period fast path can only happen during the pre-scheduler portion of early boot, this fast path can no longer block run-time RCU Tasks and RCU Tasks Trace grace periods. This commit therefore removes the conditional cond_resched_tasks_rcu_qs() invocation. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 03b089184b37..0ff7d5eaa376 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3492,8 +3492,6 @@ void synchronize_rcu(void) // which allows reuse of ->gp_seq_polled_snap. rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap); rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap); - if (rcu_init_invoked()) - cond_resched_tasks_rcu_qs(); return; // Context allows vacuous grace periods. } if (rcu_gp_is_expedited()) -- cgit v1.2.3 From 910e12092eac8a9f19b507ed0fdc1c21d8da9483 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 17:28:01 -0700 Subject: rcu: Make synchronize_rcu() fast path update ->gp_seq counters This commit causes the early boot single-CPU synchronize_rcu() fastpath to update the rcu_state and rcu_node structures' ->gp_seq and ->gp_seq_needed counters. This will allow the full-state polled grace-period APIs to detect all normal grace periods without the need to track the special combined polling-only counter, which is a step towards removing the ->rgos_polled field from the rcu_gp_oldstate, thereby reducing its size by one third. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 39 ++++++++++++++++++++++++++------------- 1 file changed, 26 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 0ff7d5eaa376..8fa5ec0f3d11 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3480,24 +3480,37 @@ static int rcu_blocking_is_gp(void) */ void synchronize_rcu(void) { + unsigned long flags; + struct rcu_node *rnp; + RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) || lock_is_held(&rcu_lock_map) || lock_is_held(&rcu_sched_lock_map), "Illegal synchronize_rcu() in RCU read-side critical section"); - if (rcu_blocking_is_gp()) { - // Note well that this code runs with !PREEMPT && !SMP. - // In addition, all code that advances grace periods runs at - // process level. Therefore, this normal GP overlaps with - // other normal GPs only by being fully nested within them, - // which allows reuse of ->gp_seq_polled_snap. - rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap); - rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap); - return; // Context allows vacuous grace periods. + if (!rcu_blocking_is_gp()) { + if (rcu_gp_is_expedited()) + synchronize_rcu_expedited(); + else + wait_rcu_gp(call_rcu); + return; } - if (rcu_gp_is_expedited()) - synchronize_rcu_expedited(); - else - wait_rcu_gp(call_rcu); + + // Context allows vacuous grace periods. + // Note well that this code runs with !PREEMPT && !SMP. + // In addition, all code that advances grace periods runs at + // process level. Therefore, this normal GP overlaps with other + // normal GPs only by being fully nested within them, which allows + // reuse of ->gp_seq_polled_snap. + rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap); + rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap); + + // Update normal grace-period counters to record grace period. + local_irq_save(flags); + WARN_ON_ONCE(num_online_cpus() > 1); + rcu_state.gp_seq += (1 << RCU_SEQ_CTR_SHIFT); + rcu_for_each_node_breadth_first(rnp) + rnp->gp_seq_needed = rnp->gp_seq = rcu_state.gp_seq; + local_irq_restore(flags); } EXPORT_SYMBOL_GPL(synchronize_rcu); -- cgit v1.2.3 From e8755d2bde7c296026fe5d47f1b75c7b19ba46fd Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 17:34:30 -0700 Subject: rcu: Remove expedited grace-period fast-path forward-progress helper Now that the expedited grace-period fast path can only happen during the pre-scheduler portion of early boot, this fast path can no longer block run-time RCU Trace grace periods. This commit therefore removes the conditional cond_resched() invocation. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 9c0ae834ef07..1a51f9301ebf 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -924,8 +924,6 @@ void synchronize_rcu_expedited(void) // them, which allows reuse of ->gp_seq_polled_exp_snap. rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_exp_snap); rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_exp_snap); - if (rcu_init_invoked()) - cond_resched(); return; // Context allows vacuous grace periods. } -- cgit v1.2.3 From 43ff97cc997f5641127152f97e1fd0fc9fa060f6 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 17:43:53 -0700 Subject: rcu: Make synchronize_rcu_expedited() fast path update .expedited_sequence This commit causes the early boot single-CPU synchronize_rcu_expedited() fastpath to update the rcu_state structure's ->expedited_sequence counter. This will allow the full-state polled grace-period APIs to detect all expedited grace periods without the need to track the special combined polling-only counter, which is another step towards removing the ->rgos_polled field from the rcu_gp_oldstate, thereby reducing its size by one third. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 1a51f9301ebf..54e05d13d151 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -906,6 +906,7 @@ static int rcu_print_task_exp_stall(struct rcu_node *rnp) void synchronize_rcu_expedited(void) { bool boottime = (rcu_scheduler_active == RCU_SCHEDULER_INIT); + unsigned long flags; struct rcu_exp_work rew; struct rcu_node *rnp; unsigned long s; @@ -924,6 +925,11 @@ void synchronize_rcu_expedited(void) // them, which allows reuse of ->gp_seq_polled_exp_snap. rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_exp_snap); rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_exp_snap); + + local_irq_save(flags); + WARN_ON_ONCE(num_online_cpus() > 1); + rcu_state.expedited_sequence += (1 << RCU_SEQ_CTR_SHIFT); + local_irq_restore(flags); return; // Context allows vacuous grace periods. } -- cgit v1.2.3 From 7ecef0871dd9a879038dbe8a681ab48bd0c92988 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 17:54:53 -0700 Subject: rcu: Remove ->rgos_polled field from rcu_gp_oldstate structure Because both normal and expedited grace periods increment their respective counters on their pre-scheduler early boot fastpaths, the rcu_gp_oldstate structure no longer needs its ->rgos_polled field. This commit therefore removes this field, shrinking this structure so that it is the same size as an rcu_head structure. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8fa5ec0f3d11..b9e8ed00536d 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3526,7 +3526,6 @@ void get_completed_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) { rgosp->rgos_norm = RCU_GET_STATE_COMPLETED; rgosp->rgos_exp = RCU_GET_STATE_COMPLETED; - rgosp->rgos_polled = RCU_GET_STATE_COMPLETED; } EXPORT_SYMBOL_GPL(get_completed_synchronize_rcu_full); @@ -3575,7 +3574,6 @@ void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) smp_mb(); /* ^^^ */ rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq); rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence); - rgosp->rgos_polled = rcu_seq_snap(&rcu_state.gp_seq_polled); } EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_full); @@ -3727,9 +3725,7 @@ bool poll_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp) if (rgosp->rgos_norm == RCU_GET_STATE_COMPLETED || rcu_seq_done_exact(&rnp->gp_seq, rgosp->rgos_norm) || rgosp->rgos_exp == RCU_GET_STATE_COMPLETED || - rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp) || - rgosp->rgos_polled == RCU_GET_STATE_COMPLETED || - rcu_seq_done_exact(&rcu_state.gp_seq_polled, rgosp->rgos_polled)) { + rcu_seq_done_exact(&rcu_state.expedited_sequence, rgosp->rgos_exp)) { smp_mb(); /* Ensure GP ends before subsequent accesses. */ return true; } -- cgit v1.2.3 From b3cdd0a79c875d5e9cac9f6555485031ce5bea81 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 4 Aug 2022 18:05:46 -0700 Subject: rcutorture: Adjust rcu_poll_need_2gp() for rcu_gp_oldstate field removal Now that rcu_gp_oldstate can accurately track both normal and expedited grace periods regardless of system state, rcutorture's rcu_poll_need_2gp() function need only call for a second grace period for the old single-unsigned-long grace-period polling APIs This commit therefore adjusts rcu_poll_need_2gp() accordingly. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 8995429c6f1c..029de67a9da9 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -520,7 +520,7 @@ static void rcu_sync_torture_init(void) static bool rcu_poll_need_2gp(bool poll, bool poll_full) { - return poll || (!IS_ENABLED(CONFIG_TINY_RCU) && poll_full && num_online_cpus() <= 1); + return poll; } static struct rcu_torture_ops rcu_ops = { -- cgit v1.2.3 From d761de8a7dcef8e8e9e20a543f85a2c079ae3d0d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 5 Aug 2022 15:42:25 -0700 Subject: rcu: Make synchronize_rcu() fastpath update only boot-CPU counters Large systems can have hundreds of rcu_node structures, and updating counters in each of them might slow down booting. This commit therefore updates only the counters in those rcu_node structures corresponding to the boot CPU, up to and including the root rcu_node structure. The counters for the remaining rcu_node structures are updated by the rcu_scheduler_starting() function, which executes just before the first non-boot kthread is spawned. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b9e8ed00536d..ef15bae3c7c7 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3504,11 +3504,14 @@ void synchronize_rcu(void) rcu_poll_gp_seq_start_unlocked(&rcu_state.gp_seq_polled_snap); rcu_poll_gp_seq_end_unlocked(&rcu_state.gp_seq_polled_snap); - // Update normal grace-period counters to record grace period. + // Update the normal grace-period counters to record + // this grace period, but only those used by the boot CPU. + // The rcu_scheduler_starting() will take care of the rest of + // these counters. local_irq_save(flags); WARN_ON_ONCE(num_online_cpus() > 1); rcu_state.gp_seq += (1 << RCU_SEQ_CTR_SHIFT); - rcu_for_each_node_breadth_first(rnp) + for (rnp = this_cpu_ptr(&rcu_data)->mynode; rnp; rnp = rnp->parent) rnp->gp_seq_needed = rnp->gp_seq = rcu_state.gp_seq; local_irq_restore(flags); } @@ -4456,9 +4459,20 @@ early_initcall(rcu_spawn_gp_kthread); */ void rcu_scheduler_starting(void) { + unsigned long flags; + struct rcu_node *rnp; + WARN_ON(num_online_cpus() != 1); WARN_ON(nr_context_switches() > 0); rcu_test_sync_prims(); + + // Fix up the ->gp_seq counters. + local_irq_save(flags); + rcu_for_each_node_breadth_first(rnp) + rnp->gp_seq_needed = rnp->gp_seq = rcu_state.gp_seq; + local_irq_restore(flags); + + // Switch out of early boot mode. rcu_scheduler_active = RCU_SCHEDULER_INIT; rcu_test_sync_prims(); } -- cgit v1.2.3 From cc8faf5b65997a994c77a0122b94240b82c95f23 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 22 Aug 2022 10:02:56 -0700 Subject: rcutorture: Use 1-suffixed variable in rcu_torture_write_types() check This commit changes the use of gp_poll_exp to gp_poll_exp1 in the first check in rcu_torture_write_types(). No functional effect, but consistency is a good thing. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 029de67a9da9..71d1af9c060e 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1196,7 +1196,7 @@ static void rcu_torture_write_types(void) /* Initialize synctype[] array. If none set, take default. */ if (!gp_cond1 && !gp_cond_exp1 && !gp_cond_full1 && !gp_cond_exp_full1 && !gp_exp1 && - !gp_poll_exp && !gp_poll_exp_full1 && !gp_normal1 && !gp_poll1 && !gp_poll_full1 && + !gp_poll_exp1 && !gp_poll_exp_full1 && !gp_normal1 && !gp_poll1 && !gp_poll_full1 && !gp_sync1) gp_cond1 = gp_cond_exp1 = gp_cond_full1 = gp_cond_exp_full1 = gp_exp1 = gp_poll_exp1 = gp_poll_exp_full1 = gp_normal1 = gp_poll1 = -- cgit v1.2.3 From 5d7801f20170a50252b791c2ce05fb982616665b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 23 Aug 2022 12:11:12 -0700 Subject: rcutorture: Expand rcu_torture_write_types() first "if" statement This commit expands the rcu_torture_write_types() function's first "if" condition and body, placing one element per line, in order to make the compiler's error messages more helpful. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 29 +++++++++++++++++++++++------ 1 file changed, 23 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 71d1af9c060e..fe1836aad646 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1195,12 +1195,29 @@ static void rcu_torture_write_types(void) bool gp_poll_full1 = gp_poll_full, gp_sync1 = gp_sync; /* Initialize synctype[] array. If none set, take default. */ - if (!gp_cond1 && !gp_cond_exp1 && !gp_cond_full1 && !gp_cond_exp_full1 && !gp_exp1 && - !gp_poll_exp1 && !gp_poll_exp_full1 && !gp_normal1 && !gp_poll1 && !gp_poll_full1 && - !gp_sync1) - gp_cond1 = gp_cond_exp1 = gp_cond_full1 = gp_cond_exp_full1 = gp_exp1 = - gp_poll_exp1 = gp_poll_exp_full1 = gp_normal1 = gp_poll1 = - gp_poll_full1 = gp_sync1 = true; + if (!gp_cond1 && + !gp_cond_exp1 && + !gp_cond_full1 && + !gp_cond_exp_full1 && + !gp_exp1 && + !gp_poll_exp1 && + !gp_poll_exp_full1 && + !gp_normal1 && + !gp_poll1 && + !gp_poll_full1 && + !gp_sync1) { + gp_cond1 = true; + gp_cond_exp1 = true; + gp_cond_full1 = true; + gp_cond_exp_full1 = true; + gp_exp1 = true; + gp_poll_exp1 = true; + gp_poll_exp_full1 = true; + gp_normal1 = true; + gp_poll1 = true; + gp_poll_full1 = true; + gp_sync1 = true; + } if (gp_cond1 && cur_ops->get_gp_state && cur_ops->cond_sync) { synctype[nsynctypes++] = RTWS_COND_GET; pr_info("%s: Testing conditional GPs.\n", __func__); -- cgit v1.2.3 From 967c298d650e2a8562fb4ea488af36bdf2a71ac4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 24 Aug 2022 18:57:45 -0700 Subject: rcutorture: Limit read-side polling-API testing RCU's polled grace-period API is reasonably lightweight, but still contains heavyweight memory barriers. This commit therefore limits testing of this API from rcutorture's readers in order to avoid the false negatives that these heavyweight operations could provoke. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 41 +++++++++++++++++++++++------------------ 1 file changed, 23 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index fe1836aad646..91103279d7b4 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1903,6 +1903,7 @@ rcutorture_loop_extend(int *readstate, struct torture_random_state *trsp, */ static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid) { + bool checkpolling = !(torture_random(trsp) & 0xfff); unsigned long cookie; struct rcu_gp_oldstate cookie_full; int i; @@ -1920,10 +1921,12 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid) WARN_ON_ONCE(!rcu_is_watching()); newstate = rcutorture_extend_mask(readstate, trsp); rcutorture_one_extend(&readstate, newstate, trsp, rtrsp++); - if (cur_ops->get_gp_state && cur_ops->poll_gp_state) - cookie = cur_ops->get_gp_state(); - if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) - cur_ops->get_gp_state_full(&cookie_full); + if (checkpolling) { + if (cur_ops->get_gp_state && cur_ops->poll_gp_state) + cookie = cur_ops->get_gp_state(); + if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) + cur_ops->get_gp_state_full(&cookie_full); + } started = cur_ops->get_gp_seq(); ts = rcu_trace_clock_local(); p = rcu_dereference_check(rcu_torture_current, @@ -1957,20 +1960,22 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid) } __this_cpu_inc(rcu_torture_batch[completed]); preempt_enable(); - if (cur_ops->get_gp_state && cur_ops->poll_gp_state) - WARN_ONCE(cur_ops->poll_gp_state(cookie), - "%s: Cookie check 2 failed %s(%d) %lu->%lu\n", - __func__, - rcu_torture_writer_state_getname(), - rcu_torture_writer_state, - cookie, cur_ops->get_gp_state()); - if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) - WARN_ONCE(cur_ops->poll_gp_state_full(&cookie_full), - "%s: Cookie check 6 failed %s(%d) online %*pbl\n", - __func__, - rcu_torture_writer_state_getname(), - rcu_torture_writer_state, - cpumask_pr_args(cpu_online_mask)); + if (checkpolling) { + if (cur_ops->get_gp_state && cur_ops->poll_gp_state) + WARN_ONCE(cur_ops->poll_gp_state(cookie), + "%s: Cookie check 2 failed %s(%d) %lu->%lu\n", + __func__, + rcu_torture_writer_state_getname(), + rcu_torture_writer_state, + cookie, cur_ops->get_gp_state()); + if (cur_ops->get_gp_state_full && cur_ops->poll_gp_state_full) + WARN_ONCE(cur_ops->poll_gp_state_full(&cookie_full), + "%s: Cookie check 6 failed %s(%d) online %*pbl\n", + __func__, + rcu_torture_writer_state_getname(), + rcu_torture_writer_state, + cpumask_pr_args(cpu_online_mask)); + } rcutorture_one_extend(&readstate, 0, trsp, rtrsp); WARN_ON_ONCE(readstate); // This next splat is expected behavior if leakpointer, especially -- cgit v1.2.3 From 599d97e3f2236a0f02a59da05798e8af43d5ce72 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 27 Jul 2022 16:16:41 -0700 Subject: rcutorture: Make "srcud" option also test polled grace-period API This commit brings the "srcud" (dynamically allocated) SRCU test in line with the "srcu" (statically allocated) test, so that both test the full SRCU polled grace-period API. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index d8e1b270a065..7168dc8d61e9 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -709,6 +709,9 @@ static struct rcu_torture_ops srcud_ops = { .deferred_free = srcu_torture_deferred_free, .sync = srcu_torture_synchronize, .exp_sync = srcu_torture_synchronize_expedited, + .get_gp_state = srcu_torture_get_gp_state, + .start_gp_poll = srcu_torture_start_gp_poll, + .poll_gp_state = srcu_torture_poll_gp_state, .call = srcu_torture_call, .cb_barrier = srcu_torture_barrier, .stats = srcu_torture_stats, -- cgit v1.2.3 From de3f2671ae563d24c679dcca36c9e0ebd9564ebd Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Aug 2022 11:59:49 -0700 Subject: srcu: Make Tiny SRCU poll_state_synchronize_srcu() more precise This commit applies the more-precise grace-period-state check used by rcu_seq_done_exact() to poll_state_synchronize_srcu(). This is important because Tiny SRCU uses a 16-bit counter, which can wrap quite quickly. If counter wrap continues to be a problem, then expanding ->srcu_idx and ->srcu_idx_max to 32 bits might be warranted. Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutiny.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 92c002d65482..a2af24f21467 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -240,10 +240,10 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_srcu); */ bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie) { - bool ret = USHORT_CMP_GE(READ_ONCE(ssp->srcu_idx), cookie); + unsigned short cur_s = READ_ONCE(ssp->srcu_idx); barrier(); - return ret; + return USHORT_CMP_GE(cur_s, cookie) || USHORT_CMP_LT(cur_s, cookie - 3); } EXPORT_SYMBOL_GPL(poll_state_synchronize_srcu); -- cgit v1.2.3 From 5fe89191e43fb37e874b8e5177fb2a5c72379b06 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Aug 2022 15:32:47 -0700 Subject: srcu: Make Tiny SRCU use full-sized grace-period counters This commit makes Tiny SRCU use full-sized grace-period counters to further avoid counter-wrap issues when using polled grace-period APIs. Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutiny.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index a2af24f21467..33adafdad261 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -117,7 +117,7 @@ void srcu_drive_gp(struct work_struct *wp) struct srcu_struct *ssp; ssp = container_of(wp, struct srcu_struct, srcu_work); - if (ssp->srcu_gp_running || USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) + if (ssp->srcu_gp_running || ULONG_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) return; /* Already running or nothing to do. */ /* Remove recently arrived callbacks and wait for readers. */ @@ -150,17 +150,17 @@ void srcu_drive_gp(struct work_struct *wp) * straighten that out. */ WRITE_ONCE(ssp->srcu_gp_running, false); - if (USHORT_CMP_LT(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) + if (ULONG_CMP_LT(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) schedule_work(&ssp->srcu_work); } EXPORT_SYMBOL_GPL(srcu_drive_gp); static void srcu_gp_start_if_needed(struct srcu_struct *ssp) { - unsigned short cookie; + unsigned long cookie; cookie = get_state_synchronize_srcu(ssp); - if (USHORT_CMP_GE(READ_ONCE(ssp->srcu_idx_max), cookie)) + if (ULONG_CMP_GE(READ_ONCE(ssp->srcu_idx_max), cookie)) return; WRITE_ONCE(ssp->srcu_idx_max, cookie); if (!READ_ONCE(ssp->srcu_gp_running)) { @@ -215,7 +215,7 @@ unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp) barrier(); ret = (READ_ONCE(ssp->srcu_idx) + 3) & ~0x1; barrier(); - return ret & USHRT_MAX; + return ret; } EXPORT_SYMBOL_GPL(get_state_synchronize_srcu); @@ -240,10 +240,10 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_srcu); */ bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie) { - unsigned short cur_s = READ_ONCE(ssp->srcu_idx); + unsigned long cur_s = READ_ONCE(ssp->srcu_idx); barrier(); - return USHORT_CMP_GE(cur_s, cookie) || USHORT_CMP_LT(cur_s, cookie - 3); + return ULONG_CMP_GE(cur_s, cookie) || ULONG_CMP_LT(cur_s, cookie - 3); } EXPORT_SYMBOL_GPL(poll_state_synchronize_srcu); -- cgit v1.2.3 From fcd53c8a4dfa38bafb89efdd0b0f718f3a03f884 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Tue, 12 Jul 2022 16:26:05 +0800 Subject: rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() Kernels built with CONFIG_PROVE_RCU=y and CONFIG_DEBUG_LOCK_ALLOC=y attempt to emit a warning when the synchronize_rcu_tasks_generic() function is called during early boot while the rcu_scheduler_active variable is RCU_SCHEDULER_INACTIVE. However the warnings is not actually be printed because the debug_lockdep_rcu_enabled() returns false, exactly because the rcu_scheduler_active variable is still equal to RCU_SCHEDULER_INACTIVE. This commit therefore replaces RCU_LOCKDEP_WARN() with WARN_ONCE() to force these warnings to actually be printed. Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index 83c7e6620d40..469bf2a3b505 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -560,7 +560,7 @@ static int __noreturn rcu_tasks_kthread(void *arg) static void synchronize_rcu_tasks_generic(struct rcu_tasks *rtp) { /* Complain if the scheduler has not started. */ - RCU_LOCKDEP_WARN(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE, + WARN_ONCE(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE, "synchronize_rcu_tasks called too soon"); // If the grace-period kthread is running, use it. -- cgit v1.2.3 From d6ad60635cafe900bcd11ad588d8accb36c36b1b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 18 Jul 2022 10:57:26 -0700 Subject: rcu-tasks: Ensure RCU Tasks Trace loops have quiescent states The RCU Tasks Trace grace-period kthread loops across all CPUs, and there can be quite a few CPUs, with some commercially available systems sporting well over a thousand of them. Some of these loops can feature IPIs, which can take some time. This commit therefore places a call to cond_resched_tasks_rcu_qs() in each such loop. Link: https://docs.google.com/document/d/1V0YnG1HTWMt9WHJjroiJL9lf-hMrud4v8Fn3fhyY0cI/edit?usp=sharing Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index 469bf2a3b505..f5bf6fb430da 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -1500,6 +1500,7 @@ static void rcu_tasks_trace_pregp_step(struct list_head *hop) if (rcu_tasks_trace_pertask_prep(t, true)) trc_add_holdout(t, hop); rcu_read_unlock(); + cond_resched_tasks_rcu_qs(); } // Only after all running tasks have been accounted for is it @@ -1520,6 +1521,7 @@ static void rcu_tasks_trace_pregp_step(struct list_head *hop) raw_spin_lock_irqsave_rcu_node(rtpcp, flags); } raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags); + cond_resched_tasks_rcu_qs(); } // Re-enable CPU hotplug now that the holdout list is populated. @@ -1619,6 +1621,7 @@ static void check_all_holdout_tasks_trace(struct list_head *hop, trc_del_holdout(t); else if (needreport) show_stalled_task_trace(t, firstreport); + cond_resched_tasks_rcu_qs(); } // Re-enable CPU hotplug now that the holdout list scan has completed. -- cgit v1.2.3 From 528262f50274079740b53e29bcaaabf219aa7417 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Tue, 19 Jul 2022 12:39:00 +0800 Subject: rcu-tasks: Make RCU Tasks Trace check for userspace execution Userspace execution is a valid quiescent state for RCU Tasks Trace, but the scheduling-clock interrupt does not currently report such quiescent states. Of course, the scheduling-clock interrupt is not strictly speaking userspace execution. However, the only way that this code is not in a quiescent state is if something invoked rcu_read_lock_trace(), and that would be reflected in the ->trc_reader_nesting field in the task_struct structure. Furthermore, this field is checked by rcu_tasks_trace_qs(), which is invoked by rcu_tasks_qs() which is in turn invoked by rcu_note_voluntary_context_switch() in kernels building at least one of the RCU Tasks flavors. It is therefore safe to invoke rcu_tasks_trace_qs() from the rcu_sched_clock_irq(). But rcu_tasks_qs() also invokes rcu_tasks_classic_qs() for RCU Tasks, which lacks the read-side markers provided by RCU Tasks Trace. This raises the possibility that an RCU Tasks grace period could start after the interrupt from userspace execution, but before the call to rcu_sched_clock_irq(). However, it turns out that this is safe because the RCU Tasks grace period waits for an RCU grace period, which will wait for the entire scheduling-clock interrupt handler, including any RCU Tasks read-side critical section that this handler might contain. This commit therefore updates the rcu_sched_clock_irq() function's check for usermode execution and its call to rcu_tasks_classic_qs() to instead check for both usermode execution and interrupt from idle, and to instead call rcu_note_voluntary_context_switch(). This consolidates code and provides more faster RCU Tasks Trace reporting of quiescent states in kernels that do scheduling-clock interrupts for userspace execution. [ paulmck: Consolidate checks into rcu_sched_clock_irq(). ] Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 4 ++-- kernel/rcu/tree_plugin.h | 4 ---- 2 files changed, 2 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 79aea7df4345..11d5aefd1696 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2341,8 +2341,8 @@ void rcu_sched_clock_irq(int user) rcu_flavor_sched_clock_irq(user); if (rcu_pending(user)) invoke_rcu_core(); - if (user) - rcu_tasks_classic_qs(current, false); + if (user || rcu_is_cpu_rrupt_from_idle()) + rcu_note_voluntary_context_switch(current); lockdep_assert_irqs_disabled(); trace_rcu_utilization(TPS("End scheduler-tick")); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 438ecae6bd7e..aa64b035c24f 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -718,9 +718,6 @@ static void rcu_flavor_sched_clock_irq(int user) struct task_struct *t = current; lockdep_assert_irqs_disabled(); - if (user || rcu_is_cpu_rrupt_from_idle()) { - rcu_note_voluntary_context_switch(current); - } if (rcu_preempt_depth() > 0 || (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) { /* No QS, force context switch if deferred. */ @@ -972,7 +969,6 @@ static void rcu_flavor_sched_clock_irq(int user) * neither access nor modify, at least not while the * corresponding CPU is online. */ - rcu_qs(); } } -- cgit v1.2.3 From 48297a22a39adcde7a3ba52b913c5aaa9a990364 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Sun, 31 Jul 2022 18:53:56 +0800 Subject: rcutorture: Use the barrier operation specified by cur_ops The rcutorture_oom_notify() function unconditionally invokes rcu_barrier(), which is OK when the rcutorture.torture_type value is "rcu", but unhelpful otherwise. The purpose of these barrier calls is to wait for all outstanding callback-flooding callbacks to be invoked before cleaning up their data. Using the wrong barrier function therefore risks arbitrary memory corruption. Thus, this commit changes these rcu_barrier() calls into cur_ops->cb_barrier() to make things work when torturing non-vanilla flavors of RCU. Signed-off-by: Zqiang Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index d8e1b270a065..08b7b59d5d05 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -2600,12 +2600,12 @@ static int rcutorture_oom_notify(struct notifier_block *self, for (i = 0; i < fwd_progress; i++) ncbs += rcu_torture_fwd_prog_cbfree(&rfp[i]); pr_info("%s: Freed %lu RCU callbacks.\n", __func__, ncbs); - rcu_barrier(); + cur_ops->cb_barrier(); ncbs = 0; for (i = 0; i < fwd_progress; i++) ncbs += rcu_torture_fwd_prog_cbfree(&rfp[i]); pr_info("%s: Freed %lu RCU callbacks.\n", __func__, ncbs); - rcu_barrier(); + cur_ops->cb_barrier(); ncbs = 0; for (i = 0; i < fwd_progress; i++) ncbs += rcu_torture_fwd_prog_cbfree(&rfp[i]); -- cgit v1.2.3 From 6b959ba22d34ca793ffdb15b5715457c78e38b1a Mon Sep 17 00:00:00 2001 From: Yang Jihong Date: Fri, 2 Sep 2022 16:29:18 +0800 Subject: perf/core: Fix reentry problem in perf_output_read_group() perf_output_read_group may respond to IPI request of other cores and invoke __perf_install_in_context function. As a result, hwc configuration is modified. causing inconsistency and unexpected consequences. Interrupts are not disabled when perf_output_read_group reads PMU counter. In this case, IPI request may be received from other cores. As a result, PMU configuration is modified and an error occurs when reading PMU counter: CPU0 CPU1 __se_sys_perf_event_open perf_install_in_context perf_output_read_group smp_call_function_single for_each_sibling_event(sub, leader) { generic_exec_single if ((sub != event) && remote_function (sub->state == PERF_EVENT_STATE_ACTIVE)) | <----RAISE IPI-----+ __perf_install_in_context ctx_resched event_sched_out armpmu_del ... hwc->idx = -1; // event->hwc.idx is set to -1 ... sub->pmu->read(sub); armpmu_read armv8pmu_read_counter armv8pmu_read_hw_counter int idx = event->hw.idx; // idx = -1 u64 val = armv8pmu_read_evcntr(idx); u32 counter = ARMV8_IDX_TO_COUNTER(idx); // invalid counter = 30 read_pmevcntrn(counter) // undefined instruction Signed-off-by: Yang Jihong Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20220902082918.179248-1-yangjihong1@huawei.com --- kernel/events/core.c | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 2621fd24ad26..ff4bffc502c6 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6893,9 +6893,16 @@ static void perf_output_read_group(struct perf_output_handle *handle, { struct perf_event *leader = event->group_leader, *sub; u64 read_format = event->attr.read_format; + unsigned long flags; u64 values[6]; int n = 0; + /* + * Disabling interrupts avoids all counter scheduling + * (context switches, timer based rotation and IPIs). + */ + local_irq_save(flags); + values[n++] = 1 + leader->nr_siblings; if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) @@ -6931,6 +6938,8 @@ static void perf_output_read_group(struct perf_output_handle *handle, __output_copy(handle, values, n * sizeof(u64)); } + + local_irq_restore(flags); } #define PERF_FORMAT_TOTAL_TIMES (PERF_FORMAT_TOTAL_TIME_ENABLED|\ -- cgit v1.2.3 From 1e660f7ebe0ff6ac65ee0000280392d878630a67 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Tue, 6 Sep 2022 19:38:53 -0700 Subject: bpf: Replace __ksize with ksize. __ksize() was made private. Use ksize() instead. Reported-by: Stephen Rothwell Signed-off-by: Alexei Starovoitov --- kernel/bpf/memalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 5cc952da7d41..20621f5407d8 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -610,7 +610,7 @@ void notrace bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr) if (!ptr) return; - idx = bpf_mem_cache_idx(__ksize(ptr - LLIST_NODE_SZ)); + idx = bpf_mem_cache_idx(ksize(ptr - LLIST_NODE_SZ)); if (idx < 0) return; -- cgit v1.2.3 From 720e6a435194fb5237833a4a7ec6aa60a78964a8 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Wed, 31 Aug 2022 08:26:46 -0700 Subject: bpf: Allow struct argument in trampoline based programs Allow struct argument in trampoline based programs where the struct size should be <= 16 bytes. In such cases, the argument will be put into up to 2 registers for bpf, x86_64 and arm64 architectures. To support arch-specific trampoline manipulation, add arg_flags for additional struct information about arguments in btf_func_model. Such information will be used in arch specific function arch_prepare_bpf_trampoline() to prepare argument access properly in trampoline. Signed-off-by: Yonghong Song Link: https://lore.kernel.org/r/20220831152646.2078089-1-yhs@fb.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 45 +++++++++++++++++++++++++++++++++++++++------ 1 file changed, 39 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 903719b89238..ea94527e5d70 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -5328,6 +5328,34 @@ static bool is_int_ptr(struct btf *btf, const struct btf_type *t) return btf_type_is_int(t); } +static u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, + int off) +{ + const struct btf_param *args; + const struct btf_type *t; + u32 offset = 0, nr_args; + int i; + + if (!func_proto) + return off / 8; + + nr_args = btf_type_vlen(func_proto); + args = (const struct btf_param *)(func_proto + 1); + for (i = 0; i < nr_args; i++) { + t = btf_type_skip_modifiers(btf, args[i].type, NULL); + offset += btf_type_is_ptr(t) ? 8 : roundup(t->size, 8); + if (off < offset) + return i; + } + + t = btf_type_skip_modifiers(btf, func_proto->type, NULL); + offset += btf_type_is_ptr(t) ? 8 : roundup(t->size, 8); + if (off < offset) + return nr_args; + + return nr_args + 1; +} + bool btf_ctx_access(int off, int size, enum bpf_access_type type, const struct bpf_prog *prog, struct bpf_insn_access_aux *info) @@ -5347,7 +5375,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, tname, off); return false; } - arg = off / 8; + arg = get_ctx_arg_idx(btf, t, off); args = (const struct btf_param *)(t + 1); /* if (t == NULL) Fall back to default BPF prog with * MAX_BPF_FUNC_REG_ARGS u64 arguments. @@ -5417,7 +5445,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, /* skip modifiers */ while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); - if (btf_type_is_small_int(t) || btf_is_any_enum(t)) + if (btf_type_is_small_int(t) || btf_is_any_enum(t) || __btf_type_is_struct(t)) /* accessing a scalar */ return true; if (!btf_type_is_ptr(t)) { @@ -5881,7 +5909,7 @@ static int __get_type_size(struct btf *btf, u32 btf_id, if (btf_type_is_ptr(t)) /* kernel size of pointer. Not BPF's size of pointer*/ return sizeof(void *); - if (btf_type_is_int(t) || btf_is_any_enum(t)) + if (btf_type_is_int(t) || btf_is_any_enum(t) || __btf_type_is_struct(t)) return t->size; return -EINVAL; } @@ -5901,8 +5929,10 @@ int btf_distill_func_proto(struct bpf_verifier_log *log, /* BTF function prototype doesn't match the verifier types. * Fall back to MAX_BPF_FUNC_REG_ARGS u64 args. */ - for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) + for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) { m->arg_size[i] = 8; + m->arg_flags[i] = 0; + } m->ret_size = 8; m->nr_args = MAX_BPF_FUNC_REG_ARGS; return 0; @@ -5916,7 +5946,7 @@ int btf_distill_func_proto(struct bpf_verifier_log *log, return -EINVAL; } ret = __get_type_size(btf, func->type, &t); - if (ret < 0) { + if (ret < 0 || __btf_type_is_struct(t)) { bpf_log(log, "The function %s return type %s is unsupported.\n", tname, btf_kind_str[BTF_INFO_KIND(t->info)]); @@ -5932,7 +5962,9 @@ int btf_distill_func_proto(struct bpf_verifier_log *log, return -EINVAL; } ret = __get_type_size(btf, args[i].type, &t); - if (ret < 0) { + + /* No support of struct argument size greater than 16 bytes */ + if (ret < 0 || ret > 16) { bpf_log(log, "The function %s arg%d type %s is unsupported.\n", tname, i, btf_kind_str[BTF_INFO_KIND(t->info)]); @@ -5945,6 +5977,7 @@ int btf_distill_func_proto(struct bpf_verifier_log *log, return -EINVAL; } m->arg_size[i] = ret; + m->arg_flags[i] = __btf_type_is_struct(t) ? BTF_FMODEL_STRUCT_ARG : 0; } m->nr_args = nargs; return 0; -- cgit v1.2.3 From 934f70d9d498a7fffd611edf48356698f1d72e91 Mon Sep 17 00:00:00 2001 From: Xiu Jianfeng Date: Tue, 6 Sep 2022 21:25:08 +0800 Subject: audit: remove selinux_audit_rule_update() declaration selinux_audit_rule_update() has been renamed to audit_update_lsm_rules() since commit d7a96f3a1ae2 ("Audit: internally use the new LSM audit hooks"), so remove it. Signed-off-by: Xiu Jianfeng Signed-off-by: Paul Moore --- kernel/audit.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.h b/kernel/audit.h index d6eb7b59c791..c57b008b9914 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -245,8 +245,6 @@ struct audit_netlink_list { int audit_send_list_thread(void *_dest); -extern int selinux_audit_rule_update(void); - extern struct mutex audit_filter_mutex; extern int audit_del_rule(struct audit_entry *entry); extern void audit_free_rule_rcu(struct rcu_head *head); -- cgit v1.2.3 From 95f2f26f3cac06cfc046d2b29e60719d7848ea54 Mon Sep 17 00:00:00 2001 From: Benjamin Tissoires Date: Tue, 6 Sep 2022 17:12:58 +0200 Subject: bpf: split btf_check_subprog_arg_match in two btf_check_subprog_arg_match() was used twice in verifier.c: - when checking for the type mismatches between a (sub)prog declaration and BTF - when checking the call of a subprog to see if the provided arguments are correct and valid This is problematic when we check if the first argument of a program (pointer to ctx) is correctly accessed: To be able to ensure we access a valid memory in the ctx, the verifier assumes the pointer to context is not null. This has the side effect of marking the program accessing the entire context, even if the context is never dereferenced. For example, by checking the context access with the current code, the following eBPF program would fail with -EINVAL if the ctx is set to null from the userspace: ``` SEC("syscall") int prog(struct my_ctx *args) { return 0; } ``` In that particular case, we do not want to actually check that the memory is correct while checking for the BTF validity, but we just want to ensure that the (sub)prog definition matches the BTF we have. So split btf_check_subprog_arg_match() in two so we can actually check for the memory used when in a call, and ignore that part when not. Note that a further patch is in preparation to disentangled btf_check_func_arg_match() from these two purposes, and so right now we just add a new hack around that by adding a boolean to this function. Signed-off-by: Benjamin Tissoires Acked-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220906151303.2780789-3-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++----- kernel/bpf/verifier.c | 2 +- 2 files changed, 50 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index ea94527e5d70..9291e2b2c950 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -6203,7 +6203,8 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, const struct btf *btf, u32 func_id, struct bpf_reg_state *regs, bool ptr_to_mem_ok, - u32 kfunc_flags) + u32 kfunc_flags, + bool processing_call) { enum bpf_prog_type prog_type = resolve_prog_type(env->prog); bool rel = false, kptr_get = false, trusted_arg = false; @@ -6389,7 +6390,7 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, reg_ref_tname); return -EINVAL; } - } else if (ptr_to_mem_ok) { + } else if (ptr_to_mem_ok && processing_call) { const struct btf_type *resolve_ret; u32 type_size; @@ -6464,7 +6465,7 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, return rel ? ref_regno : 0; } -/* Compare BTF of a function with given bpf_reg_state. +/* Compare BTF of a function declaration with given bpf_reg_state. * Returns: * EFAULT - there is a verifier bug. Abort verification. * EINVAL - there is a type mismatch or BTF is not available. @@ -6491,7 +6492,50 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog, return -EINVAL; is_global = prog->aux->func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL; - err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, 0); + err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, 0, false); + + /* Compiler optimizations can remove arguments from static functions + * or mismatched type can be passed into a global function. + * In such cases mark the function as unreliable from BTF point of view. + */ + if (err) + prog->aux->func_info_aux[subprog].unreliable = true; + return err; +} + +/* Compare BTF of a function call with given bpf_reg_state. + * Returns: + * EFAULT - there is a verifier bug. Abort verification. + * EINVAL - there is a type mismatch or BTF is not available. + * 0 - BTF matches with what bpf_reg_state expects. + * Only PTR_TO_CTX and SCALAR_VALUE states are recognized. + * + * NOTE: the code is duplicated from btf_check_subprog_arg_match() + * because btf_check_func_arg_match() is still doing both. Once that + * function is split in 2, we can call from here btf_check_subprog_arg_match() + * first, and then treat the calling part in a new code path. + */ +int btf_check_subprog_call(struct bpf_verifier_env *env, int subprog, + struct bpf_reg_state *regs) +{ + struct bpf_prog *prog = env->prog; + struct btf *btf = prog->aux->btf; + bool is_global; + u32 btf_id; + int err; + + if (!prog->aux->func_info) + return -EINVAL; + + btf_id = prog->aux->func_info[subprog].type_id; + if (!btf_id) + return -EFAULT; + + if (prog->aux->func_info_aux[subprog].unreliable) + return -EINVAL; + + is_global = prog->aux->func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL; + err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, 0, true); /* Compiler optimizations can remove arguments from static functions * or mismatched type can be passed into a global function. @@ -6507,7 +6551,7 @@ int btf_check_kfunc_arg_match(struct bpf_verifier_env *env, struct bpf_reg_state *regs, u32 kfunc_flags) { - return btf_check_func_arg_match(env, btf, func_id, regs, true, kfunc_flags); + return btf_check_func_arg_match(env, btf, func_id, regs, true, kfunc_flags, true); } /* Convert BTF of a function into bpf_reg_state if possible diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 003f7ba19558..7d9a2e18ca8a 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6629,7 +6629,7 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn func_info_aux = env->prog->aux->func_info_aux; if (func_info_aux) is_global = func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL; - err = btf_check_subprog_arg_match(env, subprog, caller->regs); + err = btf_check_subprog_call(env, subprog, caller->regs); if (err == -EFAULT) return err; if (is_global) { -- cgit v1.2.3 From 15baa55ff5b00b81bcd9874b89cb8e0b0daaa13d Mon Sep 17 00:00:00 2001 From: Benjamin Tissoires Date: Tue, 6 Sep 2022 17:12:59 +0200 Subject: bpf/verifier: allow all functions to read user provided context When a function was trying to access data from context in a syscall eBPF program, the verifier was rejecting the call unless it was accessing the first element. This is because the syscall context is not known at compile time, and so we need to check this when actually accessing it. Check for the valid memory access if there is no convert_ctx callback, and allow such situation to happen. Acked-by: Kumar Kartikeya Dwivedi Signed-off-by: Benjamin Tissoires Link: https://lore.kernel.org/r/20220906151303.2780789-4-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 7d9a2e18ca8a..3cfe60206de6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5233,6 +5233,25 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno, env, regno, reg->off, access_size, zero_size_allowed, ACCESS_HELPER, meta); + case PTR_TO_CTX: + /* in case the function doesn't know how to access the context, + * (because we are in a program of type SYSCALL for example), we + * can not statically check its size. + * Dynamically check it now. + */ + if (!env->ops->convert_ctx_access) { + enum bpf_access_type atype = meta && meta->raw_mode ? BPF_WRITE : BPF_READ; + int offset = access_size - 1; + + /* Allow zero-byte read from PTR_TO_CTX */ + if (access_size == 0) + return zero_size_allowed ? 0 : -EACCES; + + return check_mem_access(env, env->insn_idx, regno, offset, BPF_B, + atype, -1, false); + } + + fallthrough; default: /* scalar_value or invalid ptr */ /* Allow zero-byte read from NULL, regardless of pointer type */ if (zero_size_allowed && access_size == 0 && -- cgit v1.2.3 From f9b348185f4d684cc19e6bd9b87904823d5aa5ed Mon Sep 17 00:00:00 2001 From: Benjamin Tissoires Date: Tue, 6 Sep 2022 17:13:01 +0200 Subject: bpf/btf: bump BTF_KFUNC_SET_MAX_CNT net/bpf/test_run.c is already presenting 20 kfuncs. net/netfilter/nf_conntrack_bpf.c is also presenting an extra 10 kfuncs. Given that all the kfuncs are regrouped into one unique set, having only 2 space left prevent us to add more selftests. Bump it to 256. Signed-off-by: Benjamin Tissoires Link: https://lore.kernel.org/r/20220906151303.2780789-6-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 9291e2b2c950..2c2d8190ca4a 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -208,7 +208,7 @@ enum btf_kfunc_hook { }; enum { - BTF_KFUNC_SET_MAX_CNT = 32, + BTF_KFUNC_SET_MAX_CNT = 256, BTF_DTOR_KFUNC_MAX_CNT = 256, }; -- cgit v1.2.3 From eb1f7f71c126c8fd50ea81af98f97c4b581ea4ae Mon Sep 17 00:00:00 2001 From: Benjamin Tissoires Date: Tue, 6 Sep 2022 17:13:02 +0200 Subject: bpf/verifier: allow kfunc to return an allocated mem For drivers (outside of network), the incoming data is not statically defined in a struct. Most of the time the data buffer is kzalloc-ed and thus we can not rely on eBPF and BTF to explore the data. This commit allows to return an arbitrary memory, previously allocated by the driver. An interesting extra point is that the kfunc can mark the exported memory region as read only or read/write. So, when a kfunc is not returning a pointer to a struct but to a plain type, we can consider it is a valid allocated memory assuming that: - one of the arguments is either called rdonly_buf_size or rdwr_buf_size - and this argument is a const from the caller point of view We can then use this parameter as the size of the allocated memory. The memory is either read-only or read-write based on the name of the size parameter. Acked-by: Kumar Kartikeya Dwivedi Signed-off-by: Benjamin Tissoires Link: https://lore.kernel.org/r/20220906151303.2780789-7-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 101 ++++++++++++++++++++++++++++++++++++++++---------- kernel/bpf/verifier.c | 45 +++++++++++++++------- 2 files changed, 113 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 2c2d8190ca4a..9d12212fcd61 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -6199,11 +6199,36 @@ static bool is_kfunc_arg_mem_size(const struct btf *btf, return true; } +static bool btf_is_kfunc_arg_mem_size(const struct btf *btf, + const struct btf_param *arg, + const struct bpf_reg_state *reg, + const char *name) +{ + int len, target_len = strlen(name); + const struct btf_type *t; + const char *param_name; + + t = btf_type_skip_modifiers(btf, arg->type, NULL); + if (!btf_type_is_scalar(t) || reg->type != SCALAR_VALUE) + return false; + + param_name = btf_name_by_offset(btf, arg->name_off); + if (str_is_empty(param_name)) + return false; + len = strlen(param_name); + if (len != target_len) + return false; + if (strcmp(param_name, name)) + return false; + + return true; +} + static int btf_check_func_arg_match(struct bpf_verifier_env *env, const struct btf *btf, u32 func_id, struct bpf_reg_state *regs, bool ptr_to_mem_ok, - u32 kfunc_flags, + struct bpf_kfunc_arg_meta *kfunc_meta, bool processing_call) { enum bpf_prog_type prog_type = resolve_prog_type(env->prog); @@ -6241,12 +6266,12 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, return -EINVAL; } - if (is_kfunc) { + if (is_kfunc && kfunc_meta) { /* Only kfunc can be release func */ - rel = kfunc_flags & KF_RELEASE; - kptr_get = kfunc_flags & KF_KPTR_GET; - trusted_arg = kfunc_flags & KF_TRUSTED_ARGS; - sleepable = kfunc_flags & KF_SLEEPABLE; + rel = kfunc_meta->flags & KF_RELEASE; + kptr_get = kfunc_meta->flags & KF_KPTR_GET; + trusted_arg = kfunc_meta->flags & KF_TRUSTED_ARGS; + sleepable = kfunc_meta->flags & KF_SLEEPABLE; } /* check that BTF function arguments match actual types that the @@ -6259,6 +6284,38 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, t = btf_type_skip_modifiers(btf, args[i].type, NULL); if (btf_type_is_scalar(t)) { + if (is_kfunc && kfunc_meta) { + bool is_buf_size = false; + + /* check for any const scalar parameter of name "rdonly_buf_size" + * or "rdwr_buf_size" + */ + if (btf_is_kfunc_arg_mem_size(btf, &args[i], reg, + "rdonly_buf_size")) { + kfunc_meta->r0_rdonly = true; + is_buf_size = true; + } else if (btf_is_kfunc_arg_mem_size(btf, &args[i], reg, + "rdwr_buf_size")) + is_buf_size = true; + + if (is_buf_size) { + if (kfunc_meta->r0_size) { + bpf_log(log, "2 or more rdonly/rdwr_buf_size parameters for kfunc"); + return -EINVAL; + } + + if (!tnum_is_const(reg->var_off)) { + bpf_log(log, "R%d is not a const\n", regno); + return -EINVAL; + } + + kfunc_meta->r0_size = reg->var_off.value; + ret = mark_chain_precision(env, regno); + if (ret) + return ret; + } + } + if (reg->type == SCALAR_VALUE) continue; bpf_log(log, "R%d is not a scalar\n", regno); @@ -6289,6 +6346,17 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, if (ret < 0) return ret; + if (is_kfunc && reg->ref_obj_id) { + /* Ensure only one argument is referenced PTR_TO_BTF_ID */ + if (ref_obj_id) { + bpf_log(log, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n", + regno, reg->ref_obj_id, ref_obj_id); + return -EFAULT; + } + ref_regno = regno; + ref_obj_id = reg->ref_obj_id; + } + /* kptr_get is only true for kfunc */ if (i == 0 && kptr_get) { struct bpf_map_value_off_desc *off_desc; @@ -6361,16 +6429,6 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, if (reg->type == PTR_TO_BTF_ID) { reg_btf = reg->btf; reg_ref_id = reg->btf_id; - /* Ensure only one argument is referenced PTR_TO_BTF_ID */ - if (reg->ref_obj_id) { - if (ref_obj_id) { - bpf_log(log, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n", - regno, reg->ref_obj_id, ref_obj_id); - return -EFAULT; - } - ref_regno = regno; - ref_obj_id = reg->ref_obj_id; - } } else { reg_btf = btf_vmlinux; reg_ref_id = *reg2btf_ids[base_type(reg->type)]; @@ -6461,6 +6519,9 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, return -EINVAL; } + if (kfunc_meta && ref_obj_id) + kfunc_meta->ref_obj_id = ref_obj_id; + /* returns argument register number > 0 in case of reference release kfunc */ return rel ? ref_regno : 0; } @@ -6492,7 +6553,7 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog, return -EINVAL; is_global = prog->aux->func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL; - err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, 0, false); + err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, NULL, false); /* Compiler optimizations can remove arguments from static functions * or mismatched type can be passed into a global function. @@ -6535,7 +6596,7 @@ int btf_check_subprog_call(struct bpf_verifier_env *env, int subprog, return -EINVAL; is_global = prog->aux->func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL; - err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, 0, true); + err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, NULL, true); /* Compiler optimizations can remove arguments from static functions * or mismatched type can be passed into a global function. @@ -6549,9 +6610,9 @@ int btf_check_subprog_call(struct bpf_verifier_env *env, int subprog, int btf_check_kfunc_arg_match(struct bpf_verifier_env *env, const struct btf *btf, u32 func_id, struct bpf_reg_state *regs, - u32 kfunc_flags) + struct bpf_kfunc_arg_meta *meta) { - return btf_check_func_arg_match(env, btf, func_id, regs, true, kfunc_flags, true); + return btf_check_func_arg_match(env, btf, func_id, regs, true, meta, true); } /* Convert BTF of a function into bpf_reg_state if possible diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 3cfe60206de6..f3344a86d88d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -2908,7 +2908,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno, return 0; } -static int mark_chain_precision(struct bpf_verifier_env *env, int regno) +int mark_chain_precision(struct bpf_verifier_env *env, int regno) { return __mark_chain_precision(env, regno, -1); } @@ -7595,6 +7595,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, { const struct btf_type *t, *func, *func_proto, *ptr_type; struct bpf_reg_state *regs = cur_regs(env); + struct bpf_kfunc_arg_meta meta = { 0 }; const char *func_name, *ptr_type_name; u32 i, nargs, func_id, ptr_type_id; int err, insn_idx = *insn_idx_p; @@ -7629,8 +7630,10 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, acq = *kfunc_flags & KF_ACQUIRE; + meta.flags = *kfunc_flags; + /* Check the arguments */ - err = btf_check_kfunc_arg_match(env, desc_btf, func_id, regs, *kfunc_flags); + err = btf_check_kfunc_arg_match(env, desc_btf, func_id, regs, &meta); if (err < 0) return err; /* In case of release function, we get register number of refcounted @@ -7651,7 +7654,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, /* Check return type */ t = btf_type_skip_modifiers(desc_btf, func_proto->type, NULL); - if (acq && !btf_type_is_ptr(t)) { + if (acq && !btf_type_is_struct_ptr(desc_btf, t)) { verbose(env, "acquire kernel function does not return PTR_TO_BTF_ID\n"); return -EINVAL; } @@ -7663,17 +7666,33 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, ptr_type = btf_type_skip_modifiers(desc_btf, t->type, &ptr_type_id); if (!btf_type_is_struct(ptr_type)) { - ptr_type_name = btf_name_by_offset(desc_btf, - ptr_type->name_off); - verbose(env, "kernel function %s returns pointer type %s %s is not supported\n", - func_name, btf_type_str(ptr_type), - ptr_type_name); - return -EINVAL; + if (!meta.r0_size) { + ptr_type_name = btf_name_by_offset(desc_btf, + ptr_type->name_off); + verbose(env, + "kernel function %s returns pointer type %s %s is not supported\n", + func_name, + btf_type_str(ptr_type), + ptr_type_name); + return -EINVAL; + } + + mark_reg_known_zero(env, regs, BPF_REG_0); + regs[BPF_REG_0].type = PTR_TO_MEM; + regs[BPF_REG_0].mem_size = meta.r0_size; + + if (meta.r0_rdonly) + regs[BPF_REG_0].type |= MEM_RDONLY; + + /* Ensures we don't access the memory after a release_reference() */ + if (meta.ref_obj_id) + regs[BPF_REG_0].ref_obj_id = meta.ref_obj_id; + } else { + mark_reg_known_zero(env, regs, BPF_REG_0); + regs[BPF_REG_0].btf = desc_btf; + regs[BPF_REG_0].type = PTR_TO_BTF_ID; + regs[BPF_REG_0].btf_id = ptr_type_id; } - mark_reg_known_zero(env, regs, BPF_REG_0); - regs[BPF_REG_0].btf = desc_btf; - regs[BPF_REG_0].type = PTR_TO_BTF_ID; - regs[BPF_REG_0].btf_id = ptr_type_id; if (*kfunc_flags & KF_RET_NULL) { regs[BPF_REG_0].type |= PTR_MAYBE_NULL; /* For mark_ptr_or_null_reg, see 93c230e3f5bd6 */ -- cgit v1.2.3 From 9fad7fe5b29803584c7f17a2abe6c2936fec6828 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Wed, 7 Sep 2022 16:24:20 +0100 Subject: bpf: Fix resetting logic for unreferenced kptrs Sparse reported a warning at bpf_map_free_kptrs() "warning: Using plain integer as NULL pointer" During the process of fixing this warning, it was discovered that the current code erroneously writes to the pointer variable instead of deferencing and writing to the actual kptr. Hence, Sparse tool accidentally helped to uncover this problem. Fix this by doing WRITE_ONCE(*p, 0) instead of WRITE_ONCE(p, 0). Note that the effect of this bug is that unreferenced kptrs will not be cleared during check_and_free_fields. It is not a problem if the clearing is not done during map_free stage, as there is nothing to free for them. Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Signed-off-by: Jules Irenge Link: https://lore.kernel.org/r/Yxi3pJaK6UDjVJSy@playground Signed-off-by: Alexei Starovoitov --- kernel/bpf/syscall.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 4fb08c43420d..d35a6aa3aa96 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -598,7 +598,7 @@ void bpf_map_free_kptrs(struct bpf_map *map, void *map_value) if (off_desc->type == BPF_KPTR_UNREF) { u64 *p = (u64 *)btf_id_ptr; - WRITE_ONCE(p, 0); + WRITE_ONCE(*p, 0); continue; } old_ptr = xchg(btf_id_ptr, 0); -- cgit v1.2.3 From 6df4ea1ff0ff70798ff1e7eed79f98ccb7b5b0a2 Mon Sep 17 00:00:00 2001 From: Kumar Kartikeya Dwivedi Date: Sun, 4 Sep 2022 22:41:15 +0200 Subject: bpf: Support kptrs in percpu arraymap Enable support for kptrs in percpu BPF arraymap by wiring up the freeing of these kptrs from percpu map elements. Signed-off-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220904204145.3089-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/arraymap.c | 33 ++++++++++++++++++++++++--------- kernel/bpf/syscall.c | 3 ++- 2 files changed, 26 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 624527401d4d..832b2659e96e 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -279,7 +279,8 @@ int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value) rcu_read_lock(); pptr = array->pptrs[index & array->index_mask]; for_each_possible_cpu(cpu) { - bpf_long_memcpy(value + off, per_cpu_ptr(pptr, cpu), size); + copy_map_value_long(map, value + off, per_cpu_ptr(pptr, cpu)); + check_and_init_map_value(map, value + off); off += size; } rcu_read_unlock(); @@ -338,8 +339,9 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value, return -EINVAL; if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { - memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]), - value, map->value_size); + val = this_cpu_ptr(array->pptrs[index & array->index_mask]); + copy_map_value(map, val, value); + check_and_free_fields(array, val); } else { val = array->value + (u64)array->elem_size * (index & array->index_mask); @@ -383,7 +385,8 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value, rcu_read_lock(); pptr = array->pptrs[index & array->index_mask]; for_each_possible_cpu(cpu) { - bpf_long_memcpy(per_cpu_ptr(pptr, cpu), value + off, size); + copy_map_value_long(map, per_cpu_ptr(pptr, cpu), value + off); + check_and_free_fields(array, per_cpu_ptr(pptr, cpu)); off += size; } rcu_read_unlock(); @@ -421,8 +424,20 @@ static void array_map_free(struct bpf_map *map) int i; if (map_value_has_kptrs(map)) { - for (i = 0; i < array->map.max_entries; i++) - bpf_map_free_kptrs(map, array_map_elem_ptr(array, i)); + if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { + for (i = 0; i < array->map.max_entries; i++) { + void __percpu *pptr = array->pptrs[i & array->index_mask]; + int cpu; + + for_each_possible_cpu(cpu) { + bpf_map_free_kptrs(map, per_cpu_ptr(pptr, cpu)); + cond_resched(); + } + } + } else { + for (i = 0; i < array->map.max_entries; i++) + bpf_map_free_kptrs(map, array_map_elem_ptr(array, i)); + } bpf_map_free_kptr_off_tab(map); } @@ -608,9 +623,9 @@ static int __bpf_array_map_seq_show(struct seq_file *seq, void *v) pptr = v; size = array->elem_size; for_each_possible_cpu(cpu) { - bpf_long_memcpy(info->percpu_value_buf + off, - per_cpu_ptr(pptr, cpu), - size); + copy_map_value_long(map, info->percpu_value_buf + off, + per_cpu_ptr(pptr, cpu)); + check_and_init_map_value(map, info->percpu_value_buf + off); off += size; } ctx.value = info->percpu_value_buf; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index d35a6aa3aa96..69be1c612daa 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1049,7 +1049,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, } if (map->map_type != BPF_MAP_TYPE_HASH && map->map_type != BPF_MAP_TYPE_LRU_HASH && - map->map_type != BPF_MAP_TYPE_ARRAY) { + map->map_type != BPF_MAP_TYPE_ARRAY && + map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY) { ret = -EOPNOTSUPP; goto free_map_tab; } -- cgit v1.2.3 From b239da34203f49c40b5d656220c39647c3ff0b3c Mon Sep 17 00:00:00 2001 From: Kumar Kartikeya Dwivedi Date: Sun, 4 Sep 2022 22:41:28 +0200 Subject: bpf: Add helper macro bpf_for_each_reg_in_vstate For a lot of use cases in future patches, we will want to modify the state of registers part of some same 'group' (e.g. same ref_obj_id). It won't just be limited to releasing reference state, but setting a type flag dynamically based on certain actions, etc. Hence, we need a way to easily pass a callback to the function that iterates over all registers in current bpf_verifier_state in all frames upto (and including) the curframe. While in C++ we would be able to easily use a lambda to pass state and the callback together, sadly we aren't using C++ in the kernel. The next best thing to avoid defining a function for each case seems like statement expressions in GNU C. The kernel already uses them heavily, hence they can passed to the macro in the style of a lambda. The statement expression will then be substituted in the for loop bodies. Variables __state and __reg are set to current bpf_func_state and reg for each invocation of the expression inside the passed in verifier state. Then, convert mark_ptr_or_null_regs, clear_all_pkt_pointers, release_reference, find_good_pkt_pointers, find_equal_scalars to use bpf_for_each_reg_in_vstate. Signed-off-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220904204145.3089-16-memxor@gmail.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 135 +++++++++++--------------------------------------- 1 file changed, 28 insertions(+), 107 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f3344a86d88d..c0f175ac187a 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6513,31 +6513,15 @@ static int check_func_proto(const struct bpf_func_proto *fn, int func_id) /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END] * are now invalid, so turn them into unknown SCALAR_VALUE. */ -static void __clear_all_pkt_pointers(struct bpf_verifier_env *env, - struct bpf_func_state *state) +static void clear_all_pkt_pointers(struct bpf_verifier_env *env) { - struct bpf_reg_state *regs = state->regs, *reg; - int i; - - for (i = 0; i < MAX_BPF_REG; i++) - if (reg_is_pkt_pointer_any(®s[i])) - mark_reg_unknown(env, regs, i); + struct bpf_func_state *state; + struct bpf_reg_state *reg; - bpf_for_each_spilled_reg(i, state, reg) { - if (!reg) - continue; + bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({ if (reg_is_pkt_pointer_any(reg)) __mark_reg_unknown(env, reg); - } -} - -static void clear_all_pkt_pointers(struct bpf_verifier_env *env) -{ - struct bpf_verifier_state *vstate = env->cur_state; - int i; - - for (i = 0; i <= vstate->curframe; i++) - __clear_all_pkt_pointers(env, vstate->frame[i]); + })); } enum { @@ -6566,41 +6550,24 @@ static void mark_pkt_end(struct bpf_verifier_state *vstate, int regn, bool range reg->range = AT_PKT_END; } -static void release_reg_references(struct bpf_verifier_env *env, - struct bpf_func_state *state, - int ref_obj_id) -{ - struct bpf_reg_state *regs = state->regs, *reg; - int i; - - for (i = 0; i < MAX_BPF_REG; i++) - if (regs[i].ref_obj_id == ref_obj_id) - mark_reg_unknown(env, regs, i); - - bpf_for_each_spilled_reg(i, state, reg) { - if (!reg) - continue; - if (reg->ref_obj_id == ref_obj_id) - __mark_reg_unknown(env, reg); - } -} - /* The pointer with the specified id has released its reference to kernel * resources. Identify all copies of the same pointer and clear the reference. */ static int release_reference(struct bpf_verifier_env *env, int ref_obj_id) { - struct bpf_verifier_state *vstate = env->cur_state; + struct bpf_func_state *state; + struct bpf_reg_state *reg; int err; - int i; err = release_reference_state(cur_func(env), ref_obj_id); if (err) return err; - for (i = 0; i <= vstate->curframe; i++) - release_reg_references(env, vstate->frame[i], ref_obj_id); + bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({ + if (reg->ref_obj_id == ref_obj_id) + __mark_reg_unknown(env, reg); + })); return 0; } @@ -9335,34 +9302,14 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) return 0; } -static void __find_good_pkt_pointers(struct bpf_func_state *state, - struct bpf_reg_state *dst_reg, - enum bpf_reg_type type, int new_range) -{ - struct bpf_reg_state *reg; - int i; - - for (i = 0; i < MAX_BPF_REG; i++) { - reg = &state->regs[i]; - if (reg->type == type && reg->id == dst_reg->id) - /* keep the maximum range already checked */ - reg->range = max(reg->range, new_range); - } - - bpf_for_each_spilled_reg(i, state, reg) { - if (!reg) - continue; - if (reg->type == type && reg->id == dst_reg->id) - reg->range = max(reg->range, new_range); - } -} - static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, struct bpf_reg_state *dst_reg, enum bpf_reg_type type, bool range_right_open) { - int new_range, i; + struct bpf_func_state *state; + struct bpf_reg_state *reg; + int new_range; if (dst_reg->off < 0 || (dst_reg->off == 0 && range_right_open)) @@ -9427,9 +9374,11 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, * the range won't allow anything. * dst_reg->off is known < MAX_PACKET_OFF, therefore it fits in a u16. */ - for (i = 0; i <= vstate->curframe; i++) - __find_good_pkt_pointers(vstate->frame[i], dst_reg, type, - new_range); + bpf_for_each_reg_in_vstate(vstate, state, reg, ({ + if (reg->type == type && reg->id == dst_reg->id) + /* keep the maximum range already checked */ + reg->range = max(reg->range, new_range); + })); } static int is_branch32_taken(struct bpf_reg_state *reg, u32 val, u8 opcode) @@ -9918,7 +9867,7 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, if (!reg_may_point_to_spin_lock(reg)) { /* For not-NULL ptr, reg->ref_obj_id will be reset - * in release_reg_references(). + * in release_reference(). * * reg->id is still used by spin_lock ptr. Other * than spin_lock ptr type, reg->id can be reset. @@ -9928,22 +9877,6 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, } } -static void __mark_ptr_or_null_regs(struct bpf_func_state *state, u32 id, - bool is_null) -{ - struct bpf_reg_state *reg; - int i; - - for (i = 0; i < MAX_BPF_REG; i++) - mark_ptr_or_null_reg(state, &state->regs[i], id, is_null); - - bpf_for_each_spilled_reg(i, state, reg) { - if (!reg) - continue; - mark_ptr_or_null_reg(state, reg, id, is_null); - } -} - /* The logic is similar to find_good_pkt_pointers(), both could eventually * be folded together at some point. */ @@ -9951,10 +9884,9 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno, bool is_null) { struct bpf_func_state *state = vstate->frame[vstate->curframe]; - struct bpf_reg_state *regs = state->regs; + struct bpf_reg_state *regs = state->regs, *reg; u32 ref_obj_id = regs[regno].ref_obj_id; u32 id = regs[regno].id; - int i; if (ref_obj_id && ref_obj_id == id && is_null) /* regs[regno] is in the " == NULL" branch. @@ -9963,8 +9895,9 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno, */ WARN_ON_ONCE(release_reference_state(state, id)); - for (i = 0; i <= vstate->curframe; i++) - __mark_ptr_or_null_regs(vstate->frame[i], id, is_null); + bpf_for_each_reg_in_vstate(vstate, state, reg, ({ + mark_ptr_or_null_reg(state, reg, id, is_null); + })); } static bool try_match_pkt_pointers(const struct bpf_insn *insn, @@ -10077,23 +10010,11 @@ static void find_equal_scalars(struct bpf_verifier_state *vstate, { struct bpf_func_state *state; struct bpf_reg_state *reg; - int i, j; - for (i = 0; i <= vstate->curframe; i++) { - state = vstate->frame[i]; - for (j = 0; j < MAX_BPF_REG; j++) { - reg = &state->regs[j]; - if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) - *reg = *known_reg; - } - - bpf_for_each_spilled_reg(j, state, reg) { - if (!reg) - continue; - if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) - *reg = *known_reg; - } - } + bpf_for_each_reg_in_vstate(vstate, state, reg, ({ + if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) + *reg = *known_reg; + })); } static int check_cond_jmp_op(struct bpf_verifier_env *env, -- cgit v1.2.3 From cf7de6a53600ea554a8358e44fbcf47b449235f9 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Sun, 11 Sep 2022 00:07:11 +0900 Subject: bpf: add missing percpu_counter_destroy() in htab_map_alloc() syzbot is reporting ODEBUG bug in htab_map_alloc() [1], for commit 86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated hash map.") added percpu_counter_init() to htab_map_alloc() but forgot to add percpu_counter_destroy() to the error path. Link: https://syzkaller.appspot.com/bug?extid=5d1da78b375c3b5e6c2b [1] Reported-by: syzbot Signed-off-by: Tetsuo Handa Fixes: 86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated hash map.") Reviewed-by: Stanislav Fomichev Link: https://lore.kernel.org/r/e2e4cc0e-9d36-4ca1-9bfa-ce23e6f8310b@I-love.SAKURA.ne.jp Signed-off-by: Alexei Starovoitov --- kernel/bpf/hashtab.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 0fe3f136cbbe..86aec20c22d0 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -622,6 +622,8 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) free_prealloc: prealloc_destroy(htab); free_map_locked: + if (htab->use_percpu_counter) + percpu_counter_destroy(&htab->pcount); for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++) free_percpu(htab->map_locked[i]); bpf_map_area_free(htab->buckets); -- cgit v1.2.3 From 57c92f11a215717bf90880828b7a23c736c3c0d9 Mon Sep 17 00:00:00 2001 From: Punit Agrawal Date: Wed, 7 Sep 2022 16:57:46 +0100 Subject: bpf: Simplify code by using for_each_cpu_wrap() In the percpu freelist code, it is a common pattern to iterate over the possible CPUs mask starting with the current CPU. The pattern is implemented using a hand rolled while loop with the loop variable increment being open-coded. Simplify the code by using for_each_cpu_wrap() helper to iterate over the possible cpus starting with the current CPU. As a result, some of the special-casing in the loop also gets simplified. No functional change intended. Signed-off-by: Punit Agrawal Acked-by: Song Liu Link: https://lore.kernel.org/r/20220907155746.1750329-1-punit.agrawal@bytedance.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/percpu_freelist.c | 48 +++++++++++++++----------------------------- 1 file changed, 16 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c index 00b874c8e889..b6e7f5c5b9ab 100644 --- a/kernel/bpf/percpu_freelist.c +++ b/kernel/bpf/percpu_freelist.c @@ -58,23 +58,21 @@ static inline void ___pcpu_freelist_push_nmi(struct pcpu_freelist *s, { int cpu, orig_cpu; - orig_cpu = cpu = raw_smp_processor_id(); + orig_cpu = raw_smp_processor_id(); while (1) { - struct pcpu_freelist_head *head; + for_each_cpu_wrap(cpu, cpu_possible_mask, orig_cpu) { + struct pcpu_freelist_head *head; - head = per_cpu_ptr(s->freelist, cpu); - if (raw_spin_trylock(&head->lock)) { - pcpu_freelist_push_node(head, node); - raw_spin_unlock(&head->lock); - return; + head = per_cpu_ptr(s->freelist, cpu); + if (raw_spin_trylock(&head->lock)) { + pcpu_freelist_push_node(head, node); + raw_spin_unlock(&head->lock); + return; + } } - cpu = cpumask_next(cpu, cpu_possible_mask); - if (cpu >= nr_cpu_ids) - cpu = 0; /* cannot lock any per cpu lock, try extralist */ - if (cpu == orig_cpu && - pcpu_freelist_try_push_extra(s, node)) + if (pcpu_freelist_try_push_extra(s, node)) return; } } @@ -125,13 +123,12 @@ static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s) { struct pcpu_freelist_head *head; struct pcpu_freelist_node *node; - int orig_cpu, cpu; + int cpu; - orig_cpu = cpu = raw_smp_processor_id(); - while (1) { + for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) { head = per_cpu_ptr(s->freelist, cpu); if (!READ_ONCE(head->first)) - goto next_cpu; + continue; raw_spin_lock(&head->lock); node = head->first; if (node) { @@ -140,12 +137,6 @@ static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s) return node; } raw_spin_unlock(&head->lock); -next_cpu: - cpu = cpumask_next(cpu, cpu_possible_mask); - if (cpu >= nr_cpu_ids) - cpu = 0; - if (cpu == orig_cpu) - break; } /* per cpu lists are all empty, try extralist */ @@ -164,13 +155,12 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s) { struct pcpu_freelist_head *head; struct pcpu_freelist_node *node; - int orig_cpu, cpu; + int cpu; - orig_cpu = cpu = raw_smp_processor_id(); - while (1) { + for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) { head = per_cpu_ptr(s->freelist, cpu); if (!READ_ONCE(head->first)) - goto next_cpu; + continue; if (raw_spin_trylock(&head->lock)) { node = head->first; if (node) { @@ -180,12 +170,6 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s) } raw_spin_unlock(&head->lock); } -next_cpu: - cpu = cpumask_next(cpu, cpu_possible_mask); - if (cpu >= nr_cpu_ids) - cpu = 0; - if (cpu == orig_cpu) - break; } /* cannot pop from per cpu lists, try extralist */ -- cgit v1.2.3 From 65269888c695cf4643c6fdb989ea28bf1623685d Mon Sep 17 00:00:00 2001 From: Daniel Xu Date: Wed, 7 Sep 2022 10:40:36 -0600 Subject: bpf: Remove duplicate PTR_TO_BTF_ID RO check Since commit 27ae7997a661 ("bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS") there has existed bpf_verifier_ops:btf_struct_access. When btf_struct_access is _unset_ for a prog type, the verifier runs the default implementation, which is to enforce read only: if (env->ops->btf_struct_access) { [...] } else { if (atype != BPF_READ) { verbose(env, "only read is supported\n"); return -EACCES; } [...] } When btf_struct_access is _set_, the expectation is that btf_struct_access has full control over accesses, including if writes are allowed. Rather than carve out an exception for each prog type that may write to BTF ptrs, delete the redundant check and give full control to btf_struct_access. Signed-off-by: Daniel Xu Acked-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/962da2bff1238746589e332ff1aecc49403cd7ce.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index c0f175ac187a..c3efd461f36c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -13406,9 +13406,6 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE((insn)->code); env->prog->aux->num_exentries++; - } else if (resolve_prog_type(env->prog) != BPF_PROG_TYPE_STRUCT_OPS) { - verbose(env, "Writes through BTF pointers are not allowed\n"); - return -EINVAL; } continue; default: -- cgit v1.2.3 From 84c6ac417ceacd086efc330afece8922969610b7 Mon Sep 17 00:00:00 2001 From: Daniel Xu Date: Wed, 7 Sep 2022 10:40:39 -0600 Subject: bpf: Export btf_type_by_id() and bpf_log() These symbols will be used in nf_conntrack.ko to support direct writes to `nf_conn`. Signed-off-by: Daniel Xu Link: https://lore.kernel.org/r/3c98c19dc50d3b18ea5eca135b4fc3a5db036060.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 1 + kernel/bpf/verifier.c | 1 + 2 files changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 9d12212fcd61..98be25d13325 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -818,6 +818,7 @@ const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id) return NULL; return btf->types[type_id]; } +EXPORT_SYMBOL_GPL(btf_type_by_id); /* * Regular int is not a bit field and it must be either diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index c3efd461f36c..9109e07b759a 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -370,6 +370,7 @@ __printf(2, 3) void bpf_log(struct bpf_verifier_log *log, bpf_verifier_vlog(log, fmt, args); va_end(args); } +EXPORT_SYMBOL_GPL(bpf_log); static const char *ltrim(const char *s) { -- cgit v1.2.3 From a37a32583e282d8d815e22add29bc1e91e19951a Mon Sep 17 00:00:00 2001 From: Lorenz Bauer Date: Sat, 10 Sep 2022 11:01:20 +0000 Subject: bpf: btf: fix truncated last_member_type_id in btf_struct_resolve When trying to finish resolving a struct member, btf_struct_resolve saves the member type id in a u16 temporary variable. This truncates the 32 bit type id value if it exceeds UINT16_MAX. As a result, structs that have members with type ids > UINT16_MAX and which need resolution will fail with a message like this: [67414] STRUCT ff_device size=120 vlen=12 effect_owners type_id=67434 bits_offset=960 Member exceeds struct_size Fix this by changing the type of last_member_type_id to u32. Fixes: a0791f0df7d2 ("bpf: fix BTF limits") Reviewed-by: Stanislav Fomichev Signed-off-by: Lorenz Bauer Link: https://lore.kernel.org/r/20220910110120.339242-1-oss@lmb.io Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 7e64447659f3..36fd4b509294 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -3128,7 +3128,7 @@ static int btf_struct_resolve(struct btf_verifier_env *env, if (v->next_member) { const struct btf_type *last_member_type; const struct btf_member *last_member; - u16 last_member_type_id; + u32 last_member_type_id; last_member = btf_type_member(v->t) + v->next_member - 1; last_member_type_id = last_member->type; -- cgit v1.2.3 From 1bfe26fb082724be453e4d7fd9bb358e3ba669b2 Mon Sep 17 00:00:00 2001 From: Dave Marchevsky Date: Thu, 8 Sep 2022 16:07:16 -0700 Subject: bpf: Add verifier support for custom callback return range Verifier logic to confirm that a callback function returns 0 or 1 was added in commit 69c087ba6225b ("bpf: Add bpf_for_each_map_elem() helper"). At the time, callback return value was only used to continue or stop iteration. In order to support callbacks with a broader return value range, such as those added in rbtree series[0] and others, add a callback_ret_range to bpf_func_state. Verifier's helpers which set in_callback_fn will also set the new field, which the verifier will later use to check return value bounds. Default to tnum_range(0, 0) instead of using tnum_unknown as a sentinel value as the latter would prevent the valid range (0, U64_MAX) being used. Previous global default tnum_range(0, 1) is explicitly set for extant callback helpers. The change to global default was made after discussion around this patch in rbtree series [1], goal here is to make it more obvious that callback_ret_range should be explicitly set. [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com/ [1]: lore.kernel.org/bpf/20220830172759.4069786-2-davemarchevsky@fb.com/ Signed-off-by: Dave Marchevsky Reviewed-by: Stanislav Fomichev Link: https://lore.kernel.org/r/20220908230716.2751723-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 9109e07b759a..c259d734f863 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1750,6 +1750,7 @@ static void init_func_state(struct bpf_verifier_env *env, state->callsite = callsite; state->frameno = frameno; state->subprogno = subprogno; + state->callback_ret_range = tnum_range(0, 0); init_reg_state(env, state); mark_verifier_state_scratched(env); } @@ -6790,6 +6791,7 @@ static int set_map_elem_callback_state(struct bpf_verifier_env *env, return err; callee->in_callback_fn = true; + callee->callback_ret_range = tnum_range(0, 1); return 0; } @@ -6811,6 +6813,7 @@ static int set_loop_callback_state(struct bpf_verifier_env *env, __mark_reg_not_init(env, &callee->regs[BPF_REG_5]); callee->in_callback_fn = true; + callee->callback_ret_range = tnum_range(0, 1); return 0; } @@ -6840,6 +6843,7 @@ static int set_timer_callback_state(struct bpf_verifier_env *env, __mark_reg_not_init(env, &callee->regs[BPF_REG_4]); __mark_reg_not_init(env, &callee->regs[BPF_REG_5]); callee->in_async_callback_fn = true; + callee->callback_ret_range = tnum_range(0, 1); return 0; } @@ -6867,6 +6871,7 @@ static int set_find_vma_callback_state(struct bpf_verifier_env *env, __mark_reg_not_init(env, &callee->regs[BPF_REG_4]); __mark_reg_not_init(env, &callee->regs[BPF_REG_5]); callee->in_callback_fn = true; + callee->callback_ret_range = tnum_range(0, 1); return 0; } @@ -6894,7 +6899,7 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx) caller = state->frame[state->curframe]; if (callee->in_callback_fn) { /* enforce R0 return value range [0, 1]. */ - struct tnum range = tnum_range(0, 1); + struct tnum range = callee->callback_ret_range; if (r0->type != SCALAR_VALUE) { verbose(env, "R0 not a scalar value\n"); -- cgit v1.2.3 From 47e34cb74d376ddfeaef94abb1d6dfb3c905ee51 Mon Sep 17 00:00:00 2001 From: Dave Marchevsky Date: Mon, 12 Sep 2022 08:45:44 -0700 Subject: bpf: Add verifier check for BPF_PTR_POISON retval and arg BPF_PTR_POISON was added in commit c0a5a21c25f37 ("bpf: Allow storing referenced kptr in map") to denote a bpf_func_proto btf_id which the verifier will replace with a dynamically-determined btf_id at verification time. This patch adds verifier 'poison' functionality to BPF_PTR_POISON in order to prepare for expanded use of the value to poison ret- and arg-btf_id in ongoing work, namely rbtree and linked list patchsets [0, 1]. Specifically, when the verifier checks helper calls, it assumes that BPF_PTR_POISON'ed ret type will be replaced with a valid type before - or in lieu of - the default ret_btf_id logic. Similarly for arg btf_id. If poisoned btf_id reaches default handling block for either, consider this a verifier internal error and fail verification. Otherwise a helper w/ poisoned btf_id but no verifier logic replacing the type will cause a crash as the invalid pointer is dereferenced. Also move BPF_PTR_POISON to existing include/linux/posion.h header and remove unnecessary shift. [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com [1]: lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com Signed-off-by: Dave Marchevsky Acked-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220912154544.1398199-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/helpers.c | 6 +++--- kernel/bpf/verifier.c | 30 +++++++++++++++++++++++------- 2 files changed, 26 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index fc08035f14ed..41aeaf3862ec 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -1376,10 +1377,9 @@ BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr) } /* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg() - * helper is determined dynamically by the verifier. + * helper is determined dynamically by the verifier. Use BPF_PTR_POISON to + * denote type that verifier will determine. */ -#define BPF_PTR_POISON ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA)) - static const struct bpf_func_proto bpf_kptr_xchg_proto = { .func = bpf_kptr_xchg, .gpl_only = false, diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index c259d734f863..8c6fbcd0afaf 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "disasm.h" @@ -5782,13 +5783,22 @@ found: if (meta->func_id == BPF_FUNC_kptr_xchg) { if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno)) return -EACCES; - } else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off, - btf_vmlinux, *arg_btf_id, - strict_type_match)) { - verbose(env, "R%d is of type %s but %s is expected\n", - regno, kernel_type_name(reg->btf, reg->btf_id), - kernel_type_name(btf_vmlinux, *arg_btf_id)); - return -EACCES; + } else { + if (arg_btf_id == BPF_PTR_POISON) { + verbose(env, "verifier internal error:"); + verbose(env, "R%d has non-overwritten BPF_PTR_POISON type\n", + regno); + return -EACCES; + } + + if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off, + btf_vmlinux, *arg_btf_id, + strict_type_match)) { + verbose(env, "R%d is of type %s but %s is expected\n", + regno, kernel_type_name(reg->btf, reg->btf_id), + kernel_type_name(btf_vmlinux, *arg_btf_id)); + return -EACCES; + } } } @@ -7457,6 +7467,12 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn ret_btf = meta.kptr_off_desc->kptr.btf; ret_btf_id = meta.kptr_off_desc->kptr.btf_id; } else { + if (fn->ret_btf_id == BPF_PTR_POISON) { + verbose(env, "verifier internal error:"); + verbose(env, "func %s has non-overwritten BPF_PTR_POISON return type\n", + func_id_name(func_id)); + return -EINVAL; + } ret_btf = btf_vmlinux; ret_btf_id = *fn->ret_btf_id; } -- cgit v1.2.3 From 83c10cc362d91c0d8d25e60779ee52fdbbf3894d Mon Sep 17 00:00:00 2001 From: Lee Jones Date: Mon, 12 Sep 2022 14:38:55 +0100 Subject: bpf: Ensure correct locking around vulnerable function find_vpid() The documentation for find_vpid() clearly states: "Must be called with the tasklist_lock or rcu_read_lock() held." Presently we do neither for find_vpid() instance in bpf_task_fd_query(). Add proper rcu_read_lock/unlock() to fix the issue. Fixes: 41bdc4b40ed6f ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY") Signed-off-by: Lee Jones Signed-off-by: Daniel Borkmann Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20220912133855.1218900-1-lee@kernel.org --- kernel/bpf/syscall.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 27760627370d..1bd18af8af83 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -4395,7 +4395,9 @@ static int bpf_task_fd_query(const union bpf_attr *attr, if (attr->task_fd_query.flags != 0) return -EINVAL; + rcu_read_lock(); task = get_pid_task(find_vpid(pid), PIDTYPE_PID); + rcu_read_unlock(); if (!task) return -ENOENT; -- cgit v1.2.3 From a02c118ee9e898612cbae42121b9e8663455b515 Mon Sep 17 00:00:00 2001 From: Wang Yufen Date: Tue, 13 Sep 2022 16:40:33 +0800 Subject: bpf: use kvmemdup_bpfptr helper Use kvmemdup_bpfptr helper instead of open-coding to simplify the code. Signed-off-by: Wang Yufen Acked-by: Stanislav Fomichev Link: https://lore.kernel.org/r/1663058433-14089-1-git-send-email-wangyufen@huawei.com Signed-off-by: Martin KaFai Lau --- kernel/bpf/syscall.c | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 69be1c612daa..dab156f09f8d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1417,19 +1417,14 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr) } value_size = bpf_map_value_size(map); - - err = -ENOMEM; - value = kvmalloc(value_size, GFP_USER | __GFP_NOWARN); - if (!value) + value = kvmemdup_bpfptr(uvalue, value_size); + if (IS_ERR(value)) { + err = PTR_ERR(value); goto free_key; - - err = -EFAULT; - if (copy_from_bpfptr(value, uvalue, value_size) != 0) - goto free_value; + } err = bpf_map_update_value(map, f, key, value, attr->flags); -free_value: kvfree(value); free_key: kvfree(key); -- cgit v1.2.3 From 9440155ccb948f8e3ce5308907a2e7378799be60 Mon Sep 17 00:00:00 2001 From: "Peter Zijlstra (Intel)" Date: Sat, 3 Sep 2022 15:11:53 +0200 Subject: ftrace: Add HAVE_DYNAMIC_FTRACE_NO_PATCHABLE x86 will shortly start using -fpatchable-function-entry for purposes other than ftrace, make sure the __patchable_function_entry section isn't merged in the mcount_loc section. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Signed-off-by: Jiri Olsa Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20220903131154.420467-2-jolsa@kernel.org --- kernel/trace/Kconfig | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 1052126bdca2..e9e95c790b8e 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -51,6 +51,12 @@ config HAVE_DYNAMIC_FTRACE_WITH_ARGS This allows for use of regs_get_kernel_argument() and kernel_stack_pointer(). +config HAVE_DYNAMIC_FTRACE_NO_PATCHABLE + bool + help + If the architecture generates __patchable_function_entries sections + but does not want them included in the ftrace locations. + config HAVE_FTRACE_MCOUNT_RECORD bool help -- cgit v1.2.3 From 571f9738bfb3d4b42253c1d0ad26da9fede85f36 Mon Sep 17 00:00:00 2001 From: Peilin Ye Date: Fri, 16 Sep 2022 13:28:00 -0700 Subject: bpf/btf: Use btf_type_str() whenever possible We have btf_type_str(). Use it whenever possible in btf.c, instead of "btf_kind_str[BTF_INFO_KIND(t->info)]". Signed-off-by: Peilin Ye Link: https://lore.kernel.org/r/20220916202800.31421-1-yepeilin.cs@gmail.com Signed-off-by: Martin KaFai Lau --- kernel/bpf/btf.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 98be25d13325..b3940c605aac 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -1397,7 +1397,6 @@ __printf(4, 5) static void __btf_verifier_log_type(struct btf_verifier_env *env, const char *fmt, ...) { struct bpf_verifier_log *log = &env->log; - u8 kind = BTF_INFO_KIND(t->info); struct btf *btf = env->btf; va_list args; @@ -1413,7 +1412,7 @@ __printf(4, 5) static void __btf_verifier_log_type(struct btf_verifier_env *env, __btf_verifier_log(log, "[%u] %s %s%s", env->log_type_id, - btf_kind_str[kind], + btf_type_str(t), __btf_name_by_offset(btf, t->name_off), log_details ? " " : ""); @@ -5427,7 +5426,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, if (!btf_type_is_small_int(t)) { bpf_log(log, "ret type %s not allowed for fmod_ret\n", - btf_kind_str[BTF_INFO_KIND(t->info)]); + btf_type_str(t)); return false; } break; @@ -5454,7 +5453,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, "func '%s' arg%d '%s' has type %s. Only pointer access is allowed\n", tname, arg, __btf_name_by_offset(btf, t->name_off), - btf_kind_str[BTF_INFO_KIND(t->info)]); + btf_type_str(t)); return false; } @@ -5538,11 +5537,11 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, if (!btf_type_is_struct(t)) { bpf_log(log, "func '%s' arg%d type %s is not a struct\n", - tname, arg, btf_kind_str[BTF_INFO_KIND(t->info)]); + tname, arg, btf_type_str(t)); return false; } bpf_log(log, "func '%s' arg%d has btf_id %d type %s '%s'\n", - tname, arg, info->btf_id, btf_kind_str[BTF_INFO_KIND(t->info)], + tname, arg, info->btf_id, btf_type_str(t), __btf_name_by_offset(btf, t->name_off)); return true; } @@ -5950,7 +5949,7 @@ int btf_distill_func_proto(struct bpf_verifier_log *log, if (ret < 0 || __btf_type_is_struct(t)) { bpf_log(log, "The function %s return type %s is unsupported.\n", - tname, btf_kind_str[BTF_INFO_KIND(t->info)]); + tname, btf_type_str(t)); return -EINVAL; } m->ret_size = ret; @@ -5968,7 +5967,7 @@ int btf_distill_func_proto(struct bpf_verifier_log *log, if (ret < 0 || ret > 16) { bpf_log(log, "The function %s arg%d type %s is unsupported.\n", - tname, i, btf_kind_str[BTF_INFO_KIND(t->info)]); + tname, i, btf_type_str(t)); return -EINVAL; } if (ret == 0) { @@ -6727,7 +6726,7 @@ int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog, continue; } bpf_log(log, "Arg#%d type %s in %s() is not supported yet.\n", - i, btf_kind_str[BTF_INFO_KIND(t->info)], tname); + i, btf_type_str(t), tname); return -EINVAL; } return 0; -- cgit v1.2.3 From c31b38cb948ee7d3317139f005fa1f90de4a06b7 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Mon, 19 Sep 2022 22:48:11 +0800 Subject: bpf: Check whether or not node is NULL before free it in free_bulk llnode could be NULL if there are new allocations after the checking of c-free_cnt > c->high_watermark in bpf_mem_refill() and before the calling of __llist_del_first() in free_bulk (e.g. a PREEMPT_RT kernel or allocation in NMI context). And it will incur oops as shown below: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT_RT SMP CPU: 39 PID: 373 Comm: irq_work/39 Tainted: G W 6.0.0-rc6-rt9+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:bpf_mem_refill+0x66/0x130 ...... Call Trace: irq_work_single+0x24/0x60 irq_work_run_list+0x24/0x30 run_irq_workd+0x18/0x20 smpboot_thread_fn+0x13f/0x2c0 kthread+0x121/0x140 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 Simply fixing it by checking whether or not llnode is NULL in free_bulk(). Fixes: 8d5a8011b35d ("bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.") Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20220919144811.3570825-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/memalloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 20621f5407d8..5f83be1d2018 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -277,7 +277,8 @@ static void free_bulk(struct bpf_mem_cache *c) local_dec(&c->active); if (IS_ENABLED(CONFIG_PREEMPT_RT)) local_irq_restore(flags); - enque_to_free(c, llnode); + if (llnode) + enque_to_free(c, llnode); } while (cnt > (c->high_watermark + c->low_watermark) / 2); /* and drain free_llist_extra */ -- cgit v1.2.3 From 3a74904ceff3ecdb9d6cc0844ed67df417968eb6 Mon Sep 17 00:00:00 2001 From: William Dean Date: Sat, 17 Sep 2022 16:42:48 +0800 Subject: bpf: simplify code in btf_parse_hdr It could directly return 'btf_check_sec_info' to simplify code. Signed-off-by: William Dean Acked-by: Yonghong Song Link: https://lore.kernel.org/r/20220917084248.3649-1-williamsukatube@163.com Signed-off-by: Martin KaFai Lau --- kernel/bpf/btf.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index b3940c605aac..6ccd4f4d731e 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -4854,7 +4854,6 @@ static int btf_parse_hdr(struct btf_verifier_env *env) u32 hdr_len, hdr_copy, btf_data_size; const struct btf_header *hdr; struct btf *btf; - int err; btf = env->btf; btf_data_size = btf->data_size; @@ -4911,11 +4910,7 @@ static int btf_parse_hdr(struct btf_verifier_env *env) return -EINVAL; } - err = btf_check_sec_info(env, btf_data_size); - if (err) - return err; - - return 0; + return btf_check_sec_info(env, btf_data_size); } static int btf_check_type_tags(struct btf_verifier_env *env, -- cgit v1.2.3 From 0e426a3ae030a9e891899370229e117158b35de6 Mon Sep 17 00:00:00 2001 From: Pu Lehui Date: Wed, 21 Sep 2022 10:46:02 +0000 Subject: bpf, cgroup: Reject prog_attach_flags array when effective query Attach flags is only valid for attached progs of this layer cgroup, but not for effective progs. For querying with EFFECTIVE flags, exporting attach flags does not make sense. So when effective query, we reject prog_attach_flags array and don't need to populate it. Also we limit attach_flags to output 0 during effective query. Fixes: b79c9fc9551b ("bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP") Signed-off-by: Pu Lehui Link: https://lore.kernel.org/r/20220921104604.2340580-2-pulehui@huaweicloud.com Signed-off-by: Martin KaFai Lau --- kernel/bpf/cgroup.c | 28 ++++++++++++++++++---------- 1 file changed, 18 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 4a400cd63731..22888aaa68b6 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -1020,6 +1020,7 @@ static int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, union bpf_attr __user *uattr) { __u32 __user *prog_attach_flags = u64_to_user_ptr(attr->query.prog_attach_flags); + bool effective_query = attr->query.query_flags & BPF_F_QUERY_EFFECTIVE; __u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids); enum bpf_attach_type type = attr->query.attach_type; enum cgroup_bpf_attach_type from_atype, to_atype; @@ -1029,8 +1030,12 @@ static int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, int total_cnt = 0; u32 flags; + if (effective_query && prog_attach_flags) + return -EINVAL; + if (type == BPF_LSM_CGROUP) { - if (attr->query.prog_cnt && prog_ids && !prog_attach_flags) + if (!effective_query && attr->query.prog_cnt && + prog_ids && !prog_attach_flags) return -EINVAL; from_atype = CGROUP_LSM_START; @@ -1045,7 +1050,7 @@ static int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, } for (atype = from_atype; atype <= to_atype; atype++) { - if (attr->query.query_flags & BPF_F_QUERY_EFFECTIVE) { + if (effective_query) { effective = rcu_dereference_protected(cgrp->bpf.effective[atype], lockdep_is_held(&cgroup_mutex)); total_cnt += bpf_prog_array_length(effective); @@ -1054,6 +1059,8 @@ static int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, } } + /* always output uattr->query.attach_flags as 0 during effective query */ + flags = effective_query ? 0 : flags; if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags))) return -EFAULT; if (copy_to_user(&uattr->query.prog_cnt, &total_cnt, sizeof(total_cnt))) @@ -1068,7 +1075,7 @@ static int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, } for (atype = from_atype; atype <= to_atype && total_cnt; atype++) { - if (attr->query.query_flags & BPF_F_QUERY_EFFECTIVE) { + if (effective_query) { effective = rcu_dereference_protected(cgrp->bpf.effective[atype], lockdep_is_held(&cgroup_mutex)); cnt = min_t(int, bpf_prog_array_length(effective), total_cnt); @@ -1090,15 +1097,16 @@ static int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, if (++i == cnt) break; } - } - if (prog_attach_flags) { - flags = cgrp->bpf.flags[atype]; + if (prog_attach_flags) { + flags = cgrp->bpf.flags[atype]; - for (i = 0; i < cnt; i++) - if (copy_to_user(prog_attach_flags + i, &flags, sizeof(flags))) - return -EFAULT; - prog_attach_flags += cnt; + for (i = 0; i < cnt; i++) + if (copy_to_user(prog_attach_flags + i, + &flags, sizeof(flags))) + return -EFAULT; + prog_attach_flags += cnt; + } } prog_ids += cnt; -- cgit v1.2.3 From 8addbfc7b308d591f8a5f2f6bb24d08d9d79dfbb Mon Sep 17 00:00:00 2001 From: Kumar Kartikeya Dwivedi Date: Wed, 21 Sep 2022 16:35:50 +0200 Subject: bpf: Gate dynptr API behind CAP_BPF This has been enabled for unprivileged programs for only one kernel release, hence the expected annoyances due to this move are low. Users using ringbuf can stick to non-dynptr APIs. The actual use cases dynptr is meant to serve may not make sense in unprivileged BPF programs. Hence, gate these helpers behind CAP_BPF and limit use to privileged BPF programs. Fixes: 263ae152e962 ("bpf: Add bpf_dynptr_from_mem for local dynptrs") Fixes: bc34dee65a65 ("bpf: Dynptr support for ring buffers") Fixes: 13bbbfbea759 ("bpf: Add bpf_dynptr_read and bpf_dynptr_write") Fixes: 34d4ef5775f7 ("bpf: Add dynptr data slices") Signed-off-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220921143550.30247-1-memxor@gmail.com Acked-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov --- kernel/bpf/helpers.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 1f961f9982d2..3814b0fd3a2c 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1627,26 +1627,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_ringbuf_discard_proto; case BPF_FUNC_ringbuf_query: return &bpf_ringbuf_query_proto; - case BPF_FUNC_ringbuf_reserve_dynptr: - return &bpf_ringbuf_reserve_dynptr_proto; - case BPF_FUNC_ringbuf_submit_dynptr: - return &bpf_ringbuf_submit_dynptr_proto; - case BPF_FUNC_ringbuf_discard_dynptr: - return &bpf_ringbuf_discard_dynptr_proto; case BPF_FUNC_for_each_map_elem: return &bpf_for_each_map_elem_proto; case BPF_FUNC_loop: return &bpf_loop_proto; case BPF_FUNC_strncmp: return &bpf_strncmp_proto; - case BPF_FUNC_dynptr_from_mem: - return &bpf_dynptr_from_mem_proto; - case BPF_FUNC_dynptr_read: - return &bpf_dynptr_read_proto; - case BPF_FUNC_dynptr_write: - return &bpf_dynptr_write_proto; - case BPF_FUNC_dynptr_data: - return &bpf_dynptr_data_proto; default: break; } @@ -1675,6 +1661,20 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_timer_cancel_proto; case BPF_FUNC_kptr_xchg: return &bpf_kptr_xchg_proto; + case BPF_FUNC_ringbuf_reserve_dynptr: + return &bpf_ringbuf_reserve_dynptr_proto; + case BPF_FUNC_ringbuf_submit_dynptr: + return &bpf_ringbuf_submit_dynptr_proto; + case BPF_FUNC_ringbuf_discard_dynptr: + return &bpf_ringbuf_discard_dynptr_proto; + case BPF_FUNC_dynptr_from_mem: + return &bpf_dynptr_from_mem_proto; + case BPF_FUNC_dynptr_read: + return &bpf_dynptr_read_proto; + case BPF_FUNC_dynptr_write: + return &bpf_dynptr_write_proto; + case BPF_FUNC_dynptr_data: + return &bpf_dynptr_data_proto; default: break; } -- cgit v1.2.3 From 583c1f420173f7d84413a1a1fbf5109d798b4faa Mon Sep 17 00:00:00 2001 From: David Vernet Date: Mon, 19 Sep 2022 19:00:57 -0500 Subject: bpf: Define new BPF_MAP_TYPE_USER_RINGBUF map type We want to support a ringbuf map type where samples are published from user-space, to be consumed by BPF programs. BPF currently supports a kernel -> user-space circular ring buffer via the BPF_MAP_TYPE_RINGBUF map type. We'll need to define a new map type for user-space -> kernel, as none of the helpers exported for BPF_MAP_TYPE_RINGBUF will apply to a user-space producer ring buffer, and we'll want to add one or more helper functions that would not apply for a kernel-producer ring buffer. This patch therefore adds a new BPF_MAP_TYPE_USER_RINGBUF map type definition. The map type is useless in its current form, as there is no way to access or use it for anything until we one or more BPF helpers. A follow-on patch will therefore add a new helper function that allows BPF programs to run callbacks on samples that are published to the ring buffer. Signed-off-by: David Vernet Signed-off-by: Andrii Nakryiko Acked-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20220920000100.477320-2-void@manifault.com --- kernel/bpf/ringbuf.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++----- kernel/bpf/verifier.c | 3 +++ 2 files changed, 59 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index b483aea35f41..754e915748fb 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -38,10 +38,27 @@ struct bpf_ringbuf { struct page **pages; int nr_pages; spinlock_t spinlock ____cacheline_aligned_in_smp; - /* Consumer and producer counters are put into separate pages to allow - * mapping consumer page as r/w, but restrict producer page to r/o. - * This protects producer position from being modified by user-space - * application and ruining in-kernel position tracking. + /* Consumer and producer counters are put into separate pages to + * allow each position to be mapped with different permissions. + * This prevents a user-space application from modifying the + * position and ruining in-kernel tracking. The permissions of the + * pages depend on who is producing samples: user-space or the + * kernel. + * + * Kernel-producer + * --------------- + * The producer position and data pages are mapped as r/o in + * userspace. For this approach, bits in the header of samples are + * used to signal to user-space, and to other producers, whether a + * sample is currently being written. + * + * User-space producer + * ------------------- + * Only the page containing the consumer position is mapped r/o in + * user-space. User-space producers also use bits of the header to + * communicate to the kernel, but the kernel must carefully check and + * validate each sample to ensure that they're correctly formatted, and + * fully contained within the ring buffer. */ unsigned long consumer_pos __aligned(PAGE_SIZE); unsigned long producer_pos __aligned(PAGE_SIZE); @@ -224,7 +241,7 @@ static int ringbuf_map_get_next_key(struct bpf_map *map, void *key, return -ENOTSUPP; } -static int ringbuf_map_mmap(struct bpf_map *map, struct vm_area_struct *vma) +static int ringbuf_map_mmap_kern(struct bpf_map *map, struct vm_area_struct *vma) { struct bpf_ringbuf_map *rb_map; @@ -242,6 +259,26 @@ static int ringbuf_map_mmap(struct bpf_map *map, struct vm_area_struct *vma) vma->vm_pgoff + RINGBUF_PGOFF); } +static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma) +{ + struct bpf_ringbuf_map *rb_map; + + rb_map = container_of(map, struct bpf_ringbuf_map, map); + + if (vma->vm_flags & VM_WRITE) { + if (vma->vm_pgoff == 0) + /* Disallow writable mappings to the consumer pointer, + * and allow writable mappings to both the producer + * position, and the ring buffer data itself. + */ + return -EPERM; + } else { + vma->vm_flags &= ~VM_MAYWRITE; + } + /* remap_vmalloc_range() checks size and offset constraints */ + return remap_vmalloc_range(vma, rb_map->rb, vma->vm_pgoff + RINGBUF_PGOFF); +} + static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb) { unsigned long cons_pos, prod_pos; @@ -269,7 +306,7 @@ const struct bpf_map_ops ringbuf_map_ops = { .map_meta_equal = bpf_map_meta_equal, .map_alloc = ringbuf_map_alloc, .map_free = ringbuf_map_free, - .map_mmap = ringbuf_map_mmap, + .map_mmap = ringbuf_map_mmap_kern, .map_poll = ringbuf_map_poll, .map_lookup_elem = ringbuf_map_lookup_elem, .map_update_elem = ringbuf_map_update_elem, @@ -278,6 +315,19 @@ const struct bpf_map_ops ringbuf_map_ops = { .map_btf_id = &ringbuf_map_btf_ids[0], }; +BTF_ID_LIST_SINGLE(user_ringbuf_map_btf_ids, struct, bpf_ringbuf_map) +const struct bpf_map_ops user_ringbuf_map_ops = { + .map_meta_equal = bpf_map_meta_equal, + .map_alloc = ringbuf_map_alloc, + .map_free = ringbuf_map_free, + .map_mmap = ringbuf_map_mmap_user, + .map_lookup_elem = ringbuf_map_lookup_elem, + .map_update_elem = ringbuf_map_update_elem, + .map_delete_elem = ringbuf_map_delete_elem, + .map_get_next_key = ringbuf_map_get_next_key, + .map_btf_id = &user_ringbuf_map_btf_ids[0], +}; + /* Given pointer to ring buffer record metadata and struct bpf_ringbuf itself, * calculate offset from record metadata to ring buffer in pages, rounded * down. This page offset is stored as part of record metadata and allows to diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8c6fbcd0afaf..83710b60e708 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6240,6 +6240,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_ringbuf_discard_dynptr) goto error; break; + case BPF_MAP_TYPE_USER_RINGBUF: + goto error; case BPF_MAP_TYPE_STACK_TRACE: if (func_id != BPF_FUNC_get_stackid) goto error; @@ -12635,6 +12637,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, case BPF_MAP_TYPE_ARRAY_OF_MAPS: case BPF_MAP_TYPE_HASH_OF_MAPS: case BPF_MAP_TYPE_RINGBUF: + case BPF_MAP_TYPE_USER_RINGBUF: case BPF_MAP_TYPE_INODE_STORAGE: case BPF_MAP_TYPE_SK_STORAGE: case BPF_MAP_TYPE_TASK_STORAGE: -- cgit v1.2.3 From 20571567384428dfc9fe5cf9f2e942e1df13c2dd Mon Sep 17 00:00:00 2001 From: David Vernet Date: Mon, 19 Sep 2022 19:00:58 -0500 Subject: bpf: Add bpf_user_ringbuf_drain() helper In a prior change, we added a new BPF_MAP_TYPE_USER_RINGBUF map type which will allow user-space applications to publish messages to a ring buffer that is consumed by a BPF program in kernel-space. In order for this map-type to be useful, it will require a BPF helper function that BPF programs can invoke to drain samples from the ring buffer, and invoke callbacks on those samples. This change adds that capability via a new BPF helper function: bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx, u64 flags) BPF programs may invoke this function to run callback_fn() on a series of samples in the ring buffer. callback_fn() has the following signature: long callback_fn(struct bpf_dynptr *dynptr, void *context); Samples are provided to the callback in the form of struct bpf_dynptr *'s, which the program can read using BPF helper functions for querying struct bpf_dynptr's. In order to support bpf_ringbuf_drain(), a new PTR_TO_DYNPTR register type is added to the verifier to reflect a dynptr that was allocated by a helper function and passed to a BPF program. Unlike PTR_TO_STACK dynptrs which are allocated on the stack by a BPF program, PTR_TO_DYNPTR dynptrs need not use reference tracking, as the BPF helper is trusted to properly free the dynptr before returning. The verifier currently only supports PTR_TO_DYNPTR registers that are also DYNPTR_TYPE_LOCAL. Note that while the corresponding user-space libbpf logic will be added in a subsequent patch, this patch does contain an implementation of the .map_poll() callback for BPF_MAP_TYPE_USER_RINGBUF maps. This .map_poll() callback guarantees that an epoll-waiting user-space producer will receive at least one event notification whenever at least one sample is drained in an invocation of bpf_user_ringbuf_drain(), provided that the function is not invoked with the BPF_RB_NO_WAKEUP flag. If the BPF_RB_FORCE_WAKEUP flag is provided, a wakeup notification is sent even if no sample was drained. Signed-off-by: David Vernet Signed-off-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20220920000100.477320-3-void@manifault.com --- kernel/bpf/helpers.c | 2 + kernel/bpf/ringbuf.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++-- kernel/bpf/verifier.c | 61 +++++++++++++++-- 3 files changed, 235 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 41aeaf3862ec..cb5564c77482 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1659,6 +1659,8 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_for_each_map_elem_proto; case BPF_FUNC_loop: return &bpf_loop_proto; + case BPF_FUNC_user_ringbuf_drain: + return &bpf_user_ringbuf_drain_proto; default: break; } diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index 754e915748fb..9e832acf4692 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -38,6 +38,22 @@ struct bpf_ringbuf { struct page **pages; int nr_pages; spinlock_t spinlock ____cacheline_aligned_in_smp; + /* For user-space producer ring buffers, an atomic_t busy bit is used + * to synchronize access to the ring buffers in the kernel, rather than + * the spinlock that is used for kernel-producer ring buffers. This is + * done because the ring buffer must hold a lock across a BPF program's + * callback: + * + * __bpf_user_ringbuf_peek() // lock acquired + * -> program callback_fn() + * -> __bpf_user_ringbuf_sample_release() // lock released + * + * It is unsafe and incorrect to hold an IRQ spinlock across what could + * be a long execution window, so we instead simply disallow concurrent + * access to the ring buffer by kernel consumers, and return -EBUSY from + * __bpf_user_ringbuf_peek() if the busy bit is held by another task. + */ + atomic_t busy ____cacheline_aligned_in_smp; /* Consumer and producer counters are put into separate pages to * allow each position to be mapped with different permissions. * This prevents a user-space application from modifying the @@ -153,6 +169,7 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) return NULL; spin_lock_init(&rb->spinlock); + atomic_set(&rb->busy, 0); init_waitqueue_head(&rb->waitq); init_irq_work(&rb->work, bpf_ringbuf_notify); @@ -288,8 +305,13 @@ static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb) return prod_pos - cons_pos; } -static __poll_t ringbuf_map_poll(struct bpf_map *map, struct file *filp, - struct poll_table_struct *pts) +static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb) +{ + return rb->mask + 1; +} + +static __poll_t ringbuf_map_poll_kern(struct bpf_map *map, struct file *filp, + struct poll_table_struct *pts) { struct bpf_ringbuf_map *rb_map; @@ -301,13 +323,26 @@ static __poll_t ringbuf_map_poll(struct bpf_map *map, struct file *filp, return 0; } +static __poll_t ringbuf_map_poll_user(struct bpf_map *map, struct file *filp, + struct poll_table_struct *pts) +{ + struct bpf_ringbuf_map *rb_map; + + rb_map = container_of(map, struct bpf_ringbuf_map, map); + poll_wait(filp, &rb_map->rb->waitq, pts); + + if (ringbuf_avail_data_sz(rb_map->rb) < ringbuf_total_data_sz(rb_map->rb)) + return EPOLLOUT | EPOLLWRNORM; + return 0; +} + BTF_ID_LIST_SINGLE(ringbuf_map_btf_ids, struct, bpf_ringbuf_map) const struct bpf_map_ops ringbuf_map_ops = { .map_meta_equal = bpf_map_meta_equal, .map_alloc = ringbuf_map_alloc, .map_free = ringbuf_map_free, .map_mmap = ringbuf_map_mmap_kern, - .map_poll = ringbuf_map_poll, + .map_poll = ringbuf_map_poll_kern, .map_lookup_elem = ringbuf_map_lookup_elem, .map_update_elem = ringbuf_map_update_elem, .map_delete_elem = ringbuf_map_delete_elem, @@ -321,6 +356,7 @@ const struct bpf_map_ops user_ringbuf_map_ops = { .map_alloc = ringbuf_map_alloc, .map_free = ringbuf_map_free, .map_mmap = ringbuf_map_mmap_user, + .map_poll = ringbuf_map_poll_user, .map_lookup_elem = ringbuf_map_lookup_elem, .map_update_elem = ringbuf_map_update_elem, .map_delete_elem = ringbuf_map_delete_elem, @@ -362,7 +398,7 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) return NULL; len = round_up(size + BPF_RINGBUF_HDR_SZ, 8); - if (len > rb->mask + 1) + if (len > ringbuf_total_data_sz(rb)) return NULL; cons_pos = smp_load_acquire(&rb->consumer_pos); @@ -509,7 +545,7 @@ BPF_CALL_2(bpf_ringbuf_query, struct bpf_map *, map, u64, flags) case BPF_RB_AVAIL_DATA: return ringbuf_avail_data_sz(rb); case BPF_RB_RING_SIZE: - return rb->mask + 1; + return ringbuf_total_data_sz(rb); case BPF_RB_CONS_POS: return smp_load_acquire(&rb->consumer_pos); case BPF_RB_PROD_POS: @@ -603,3 +639,138 @@ const struct bpf_func_proto bpf_ringbuf_discard_dynptr_proto = { .arg1_type = ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_RINGBUF | OBJ_RELEASE, .arg2_type = ARG_ANYTHING, }; + +static int __bpf_user_ringbuf_peek(struct bpf_ringbuf *rb, void **sample, u32 *size) +{ + int err; + u32 hdr_len, sample_len, total_len, flags, *hdr; + u64 cons_pos, prod_pos; + + /* Synchronizes with smp_store_release() in user-space producer. */ + prod_pos = smp_load_acquire(&rb->producer_pos); + if (prod_pos % 8) + return -EINVAL; + + /* Synchronizes with smp_store_release() in __bpf_user_ringbuf_sample_release() */ + cons_pos = smp_load_acquire(&rb->consumer_pos); + if (cons_pos >= prod_pos) + return -ENODATA; + + hdr = (u32 *)((uintptr_t)rb->data + (uintptr_t)(cons_pos & rb->mask)); + /* Synchronizes with smp_store_release() in user-space producer. */ + hdr_len = smp_load_acquire(hdr); + flags = hdr_len & (BPF_RINGBUF_BUSY_BIT | BPF_RINGBUF_DISCARD_BIT); + sample_len = hdr_len & ~flags; + total_len = round_up(sample_len + BPF_RINGBUF_HDR_SZ, 8); + + /* The sample must fit within the region advertised by the producer position. */ + if (total_len > prod_pos - cons_pos) + return -EINVAL; + + /* The sample must fit within the data region of the ring buffer. */ + if (total_len > ringbuf_total_data_sz(rb)) + return -E2BIG; + + /* The sample must fit into a struct bpf_dynptr. */ + err = bpf_dynptr_check_size(sample_len); + if (err) + return -E2BIG; + + if (flags & BPF_RINGBUF_DISCARD_BIT) { + /* If the discard bit is set, the sample should be skipped. + * + * Update the consumer pos, and return -EAGAIN so the caller + * knows to skip this sample and try to read the next one. + */ + smp_store_release(&rb->consumer_pos, cons_pos + total_len); + return -EAGAIN; + } + + if (flags & BPF_RINGBUF_BUSY_BIT) + return -ENODATA; + + *sample = (void *)((uintptr_t)rb->data + + (uintptr_t)((cons_pos + BPF_RINGBUF_HDR_SZ) & rb->mask)); + *size = sample_len; + return 0; +} + +static void __bpf_user_ringbuf_sample_release(struct bpf_ringbuf *rb, size_t size, u64 flags) +{ + u64 consumer_pos; + u32 rounded_size = round_up(size + BPF_RINGBUF_HDR_SZ, 8); + + /* Using smp_load_acquire() is unnecessary here, as the busy-bit + * prevents another task from writing to consumer_pos after it was read + * by this task with smp_load_acquire() in __bpf_user_ringbuf_peek(). + */ + consumer_pos = rb->consumer_pos; + /* Synchronizes with smp_load_acquire() in user-space producer. */ + smp_store_release(&rb->consumer_pos, consumer_pos + rounded_size); +} + +BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map, + void *, callback_fn, void *, callback_ctx, u64, flags) +{ + struct bpf_ringbuf *rb; + long samples, discarded_samples = 0, ret = 0; + bpf_callback_t callback = (bpf_callback_t)callback_fn; + u64 wakeup_flags = BPF_RB_NO_WAKEUP | BPF_RB_FORCE_WAKEUP; + int busy = 0; + + if (unlikely(flags & ~wakeup_flags)) + return -EINVAL; + + rb = container_of(map, struct bpf_ringbuf_map, map)->rb; + + /* If another consumer is already consuming a sample, wait for them to finish. */ + if (!atomic_try_cmpxchg(&rb->busy, &busy, 1)) + return -EBUSY; + + for (samples = 0; samples < BPF_MAX_USER_RINGBUF_SAMPLES && ret == 0; samples++) { + int err; + u32 size; + void *sample; + struct bpf_dynptr_kern dynptr; + + err = __bpf_user_ringbuf_peek(rb, &sample, &size); + if (err) { + if (err == -ENODATA) { + break; + } else if (err == -EAGAIN) { + discarded_samples++; + continue; + } else { + ret = err; + goto schedule_work_return; + } + } + + bpf_dynptr_init(&dynptr, sample, BPF_DYNPTR_TYPE_LOCAL, 0, size); + ret = callback((uintptr_t)&dynptr, (uintptr_t)callback_ctx, 0, 0, 0); + __bpf_user_ringbuf_sample_release(rb, size, flags); + } + ret = samples - discarded_samples; + +schedule_work_return: + /* Prevent the clearing of the busy-bit from being reordered before the + * storing of any rb consumer or producer positions. + */ + smp_mb__before_atomic(); + atomic_set(&rb->busy, 0); + + if (flags & BPF_RB_FORCE_WAKEUP) + irq_work_queue(&rb->work); + else if (!(flags & BPF_RB_NO_WAKEUP) && samples > 0) + irq_work_queue(&rb->work); + return ret; +} + +const struct bpf_func_proto bpf_user_ringbuf_drain_proto = { + .func = bpf_user_ringbuf_drain, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_FUNC, + .arg3_type = ARG_PTR_TO_STACK_OR_NULL, + .arg4_type = ARG_ANYTHING, +}; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 83710b60e708..c76fa45a5906 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -563,6 +563,7 @@ static const char *reg_type_str(struct bpf_verifier_env *env, [PTR_TO_BUF] = "buf", [PTR_TO_FUNC] = "func", [PTR_TO_MAP_KEY] = "map_key", + [PTR_TO_DYNPTR] = "dynptr_ptr", }; if (type & PTR_MAYBE_NULL) { @@ -5688,6 +5689,12 @@ static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } }; static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } }; static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } }; +static const struct bpf_reg_types dynptr_types = { + .types = { + PTR_TO_STACK, + PTR_TO_DYNPTR | DYNPTR_TYPE_LOCAL, + } +}; static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = { [ARG_PTR_TO_MAP_KEY] = &map_key_value_types, @@ -5714,7 +5721,7 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = { [ARG_PTR_TO_CONST_STR] = &const_str_ptr_types, [ARG_PTR_TO_TIMER] = &timer_types, [ARG_PTR_TO_KPTR] = &kptr_types, - [ARG_PTR_TO_DYNPTR] = &stack_ptr_types, + [ARG_PTR_TO_DYNPTR] = &dynptr_types, }; static int check_reg_type(struct bpf_verifier_env *env, u32 regno, @@ -6066,6 +6073,13 @@ skip_type_check: err = check_mem_size_reg(env, reg, regno, true, meta); break; case ARG_PTR_TO_DYNPTR: + /* We only need to check for initialized / uninitialized helper + * dynptr args if the dynptr is not PTR_TO_DYNPTR, as the + * assumption is that if it is, that a helper function + * initialized the dynptr on behalf of the BPF program. + */ + if (base_type(reg->type) == PTR_TO_DYNPTR) + break; if (arg_type & MEM_UNINIT) { if (!is_dynptr_reg_valid_uninit(env, reg)) { verbose(env, "Dynptr has to be an uninitialized dynptr\n"); @@ -6241,7 +6255,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, goto error; break; case BPF_MAP_TYPE_USER_RINGBUF: - goto error; + if (func_id != BPF_FUNC_user_ringbuf_drain) + goto error; + break; case BPF_MAP_TYPE_STACK_TRACE: if (func_id != BPF_FUNC_get_stackid) goto error; @@ -6361,6 +6377,10 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, if (map->map_type != BPF_MAP_TYPE_RINGBUF) goto error; break; + case BPF_FUNC_user_ringbuf_drain: + if (map->map_type != BPF_MAP_TYPE_USER_RINGBUF) + goto error; + break; case BPF_FUNC_get_stackid: if (map->map_type != BPF_MAP_TYPE_STACK_TRACE) goto error; @@ -6887,6 +6907,29 @@ static int set_find_vma_callback_state(struct bpf_verifier_env *env, return 0; } +static int set_user_ringbuf_callback_state(struct bpf_verifier_env *env, + struct bpf_func_state *caller, + struct bpf_func_state *callee, + int insn_idx) +{ + /* bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void + * callback_ctx, u64 flags); + * callback_fn(struct bpf_dynptr_t* dynptr, void *callback_ctx); + */ + __mark_reg_not_init(env, &callee->regs[BPF_REG_0]); + callee->regs[BPF_REG_1].type = PTR_TO_DYNPTR | DYNPTR_TYPE_LOCAL; + __mark_reg_known_zero(&callee->regs[BPF_REG_1]); + callee->regs[BPF_REG_2] = caller->regs[BPF_REG_3]; + + /* unused */ + __mark_reg_not_init(env, &callee->regs[BPF_REG_3]); + __mark_reg_not_init(env, &callee->regs[BPF_REG_4]); + __mark_reg_not_init(env, &callee->regs[BPF_REG_5]); + + callee->in_callback_fn = true; + return 0; +} + static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx) { struct bpf_verifier_state *state = env->cur_state; @@ -7346,12 +7389,18 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn case BPF_FUNC_dynptr_data: for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) { if (arg_type_is_dynptr(fn->arg_type[i])) { + struct bpf_reg_state *reg = ®s[BPF_REG_1 + i]; + if (meta.ref_obj_id) { verbose(env, "verifier internal error: meta.ref_obj_id already set\n"); return -EFAULT; } - /* Find the id of the dynptr we're tracking the reference of */ - meta.ref_obj_id = stack_slot_get_id(env, ®s[BPF_REG_1 + i]); + + if (base_type(reg->type) != PTR_TO_DYNPTR) + /* Find the id of the dynptr we're + * tracking the reference of + */ + meta.ref_obj_id = stack_slot_get_id(env, reg); break; } } @@ -7360,6 +7409,10 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn return -EFAULT; } break; + case BPF_FUNC_user_ringbuf_drain: + err = __check_func_call(env, insn, insn_idx_p, meta.subprogno, + set_user_ringbuf_callback_state); + break; } if (err) -- cgit v1.2.3 From d15bf1501c7533826a616478002c601fcc7671f3 Mon Sep 17 00:00:00 2001 From: KP Singh Date: Tue, 20 Sep 2022 09:59:39 +0200 Subject: bpf: Allow kfuncs to be used in LSM programs In preparation for the addition of new kfuncs, allow kfuncs defined in the tracing subsystem to be used in LSM programs by mapping the LSM program type to the TRACING hook. Signed-off-by: KP Singh Signed-off-by: Roberto Sassu Acked-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220920075951.929132-2-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 6ccd4f4d731e..dbcf020c4124 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -7376,6 +7376,7 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type) case BPF_PROG_TYPE_STRUCT_OPS: return BTF_KFUNC_HOOK_STRUCT_OPS; case BPF_PROG_TYPE_TRACING: + case BPF_PROG_TYPE_LSM: return BTF_KFUNC_HOOK_TRACING; case BPF_PROG_TYPE_SYSCALL: return BTF_KFUNC_HOOK_SYSCALL; -- cgit v1.2.3 From 00f146413ccb6c84308e559281449755c83f54c5 Mon Sep 17 00:00:00 2001 From: Roberto Sassu Date: Tue, 20 Sep 2022 09:59:40 +0200 Subject: btf: Export bpf_dynptr definition eBPF dynamic pointers is a new feature recently added to upstream. It binds together a pointer to a memory area and its size. The internal kernel structure bpf_dynptr_kern is not accessible by eBPF programs in user space. They instead see bpf_dynptr, which is then translated to the internal kernel structure by the eBPF verifier. The problem is that it is not possible to include at the same time the uapi include linux/bpf.h and the vmlinux BTF vmlinux.h, as they both contain the definition of some structures/enums. The compiler complains saying that the structures/enums are redefined. As bpf_dynptr is defined in the uapi include linux/bpf.h, this makes it impossible to include vmlinux.h. However, in some cases, e.g. when using kfuncs, vmlinux.h has to be included. The only option until now was to include vmlinux.h and add the definition of bpf_dynptr directly in the eBPF program source code from linux/bpf.h. Solve the problem by using the same approach as for bpf_timer (which also follows the same scheme with the _kern suffix for the internal kernel structure). Add the following line in one of the dynamic pointer helpers, bpf_dynptr_from_mem(): BTF_TYPE_EMIT(struct bpf_dynptr); Cc: stable@vger.kernel.org Cc: Joanne Koong Fixes: 97e03f521050c ("bpf: Add verifier support for dynptrs") Signed-off-by: Roberto Sassu Acked-by: Yonghong Song Tested-by: KP Singh Link: https://lore.kernel.org/r/20220920075951.929132-3-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/helpers.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index cb5564c77482..6d69e30f42d8 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1446,6 +1446,8 @@ BPF_CALL_4(bpf_dynptr_from_mem, void *, data, u32, size, u64, flags, struct bpf_ { int err; + BTF_TYPE_EMIT(struct bpf_dynptr); + err = bpf_dynptr_check_size(size); if (err) goto error; -- cgit v1.2.3 From e9e315b4a5de32d0482b92f482517095d5d844e4 Mon Sep 17 00:00:00 2001 From: Roberto Sassu Date: Tue, 20 Sep 2022 09:59:41 +0200 Subject: bpf: Move dynptr type check to is_dynptr_type_expected() Move dynptr type check to is_dynptr_type_expected() from is_dynptr_reg_valid_init(), so that callers can better determine the cause of a negative result (dynamic pointer not valid/initialized, dynamic pointer of the wrong type). It will be useful for example for BTF, to restrict which dynamic pointer types can be passed to kfuncs, as initially only the local type will be supported. Also, splitting makes the code more readable, since checking the dynamic pointer type is not necessarily related to validity and initialization. Split the validity/initialization and dynamic pointer type check also in the verifier, and adjust the expected error message in the test (a test for an unexpected dynptr type passed to a helper cannot be added due to missing suitable helpers, but this case has been tested manually). Cc: Joanne Koong Cc: Kumar Kartikeya Dwivedi Signed-off-by: Roberto Sassu Acked-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220920075951.929132-4-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 35 +++++++++++++++++++++++++++-------- 1 file changed, 27 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index c76fa45a5906..c08dde19eb67 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -782,8 +782,8 @@ static bool is_dynptr_reg_valid_uninit(struct bpf_verifier_env *env, struct bpf_ return true; } -static bool is_dynptr_reg_valid_init(struct bpf_verifier_env *env, struct bpf_reg_state *reg, - enum bpf_arg_type arg_type) +static bool is_dynptr_reg_valid_init(struct bpf_verifier_env *env, + struct bpf_reg_state *reg) { struct bpf_func_state *state = func(env, reg); int spi = get_spi(reg->off); @@ -799,11 +799,24 @@ static bool is_dynptr_reg_valid_init(struct bpf_verifier_env *env, struct bpf_re return false; } + return true; +} + +static bool is_dynptr_type_expected(struct bpf_verifier_env *env, + struct bpf_reg_state *reg, + enum bpf_arg_type arg_type) +{ + struct bpf_func_state *state = func(env, reg); + enum bpf_dynptr_type dynptr_type; + int spi = get_spi(reg->off); + /* ARG_PTR_TO_DYNPTR takes any type of dynptr */ if (arg_type == ARG_PTR_TO_DYNPTR) return true; - return state->stack[spi].spilled_ptr.dynptr.type == arg_to_dynptr_type(arg_type); + dynptr_type = arg_to_dynptr_type(arg_type); + + return state->stack[spi].spilled_ptr.dynptr.type == dynptr_type; } /* The reg state of a pointer or a bounded scalar was saved when @@ -6095,21 +6108,27 @@ skip_type_check: } meta->uninit_dynptr_regno = regno; - } else if (!is_dynptr_reg_valid_init(env, reg, arg_type)) { + } else if (!is_dynptr_reg_valid_init(env, reg)) { + verbose(env, + "Expected an initialized dynptr as arg #%d\n", + arg + 1); + return -EINVAL; + } else if (!is_dynptr_type_expected(env, reg, arg_type)) { const char *err_extra = ""; switch (arg_type & DYNPTR_TYPE_FLAG_MASK) { case DYNPTR_TYPE_LOCAL: - err_extra = "local "; + err_extra = "local"; break; case DYNPTR_TYPE_RINGBUF: - err_extra = "ringbuf "; + err_extra = "ringbuf"; break; default: + err_extra = ""; break; } - - verbose(env, "Expected an initialized %sdynptr as arg #%d\n", + verbose(env, + "Expected a dynptr of type %s as arg #%d\n", err_extra, arg + 1); return -EINVAL; } -- cgit v1.2.3 From b8d31762a0ae6861e1115302ee338560d853e317 Mon Sep 17 00:00:00 2001 From: Roberto Sassu Date: Tue, 20 Sep 2022 09:59:42 +0200 Subject: btf: Allow dynamic pointer parameters in kfuncs Allow dynamic pointers (struct bpf_dynptr_kern *) to be specified as parameters in kfuncs. Also, ensure that dynamic pointers passed as argument are valid and initialized, are a pointer to the stack, and of the type local. More dynamic pointer types can be supported in the future. To properly detect whether a parameter is of the desired type, introduce the stringify_struct() macro to compare the returned structure name with the desired name. In addition, protect against structure renames, by halting the build with BUILD_BUG_ON(), so that developers have to revisit the code. To check if a dynamic pointer passed to the kfunc is valid and initialized, and if its type is local, export the existing functions is_dynptr_reg_valid_init() and is_dynptr_type_expected(). Cc: Joanne Koong Cc: Kumar Kartikeya Dwivedi Signed-off-by: Roberto Sassu Acked-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220920075951.929132-5-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 33 +++++++++++++++++++++++++++++++++ kernel/bpf/verifier.c | 10 +++++----- 2 files changed, 38 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index dbcf020c4124..13faede0f2b4 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -6449,15 +6449,20 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, if (is_kfunc) { bool arg_mem_size = i + 1 < nargs && is_kfunc_arg_mem_size(btf, &args[i + 1], ®s[regno + 1]); + bool arg_dynptr = btf_type_is_struct(ref_t) && + !strcmp(ref_tname, + stringify_struct(bpf_dynptr_kern)); /* Permit pointer to mem, but only when argument * type is pointer to scalar, or struct composed * (recursively) of scalars. * When arg_mem_size is true, the pointer can be * void *. + * Also permit initialized local dynamic pointers. */ if (!btf_type_is_scalar(ref_t) && !__btf_type_is_scalar_struct(log, btf, ref_t, 0) && + !arg_dynptr && (arg_mem_size ? !btf_type_is_void(ref_t) : 1)) { bpf_log(log, "arg#%d pointer type %s %s must point to %sscalar, or struct with scalar\n", @@ -6465,6 +6470,34 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, return -EINVAL; } + if (arg_dynptr) { + if (reg->type != PTR_TO_STACK) { + bpf_log(log, "arg#%d pointer type %s %s not to stack\n", + i, btf_type_str(ref_t), + ref_tname); + return -EINVAL; + } + + if (!is_dynptr_reg_valid_init(env, reg)) { + bpf_log(log, + "arg#%d pointer type %s %s must be valid and initialized\n", + i, btf_type_str(ref_t), + ref_tname); + return -EINVAL; + } + + if (!is_dynptr_type_expected(env, reg, + ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_LOCAL)) { + bpf_log(log, + "arg#%d pointer type %s %s points to unsupported dynamic pointer type\n", + i, btf_type_str(ref_t), + ref_tname); + return -EINVAL; + } + + continue; + } + /* Check for mem, len pair */ if (arg_mem_size) { if (check_kfunc_mem_size_reg(env, ®s[regno + 1], regno + 1)) { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index c08dde19eb67..6f6d2d511c06 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -782,8 +782,8 @@ static bool is_dynptr_reg_valid_uninit(struct bpf_verifier_env *env, struct bpf_ return true; } -static bool is_dynptr_reg_valid_init(struct bpf_verifier_env *env, - struct bpf_reg_state *reg) +bool is_dynptr_reg_valid_init(struct bpf_verifier_env *env, + struct bpf_reg_state *reg) { struct bpf_func_state *state = func(env, reg); int spi = get_spi(reg->off); @@ -802,9 +802,9 @@ static bool is_dynptr_reg_valid_init(struct bpf_verifier_env *env, return true; } -static bool is_dynptr_type_expected(struct bpf_verifier_env *env, - struct bpf_reg_state *reg, - enum bpf_arg_type arg_type) +bool is_dynptr_type_expected(struct bpf_verifier_env *env, + struct bpf_reg_state *reg, + enum bpf_arg_type arg_type) { struct bpf_func_state *state = func(env, reg); enum bpf_dynptr_type dynptr_type; -- cgit v1.2.3 From 51df4865718540f51bb5d3e552c50dc88e1333d6 Mon Sep 17 00:00:00 2001 From: Roberto Sassu Date: Tue, 20 Sep 2022 09:59:43 +0200 Subject: bpf: Export bpf_dynptr_get_size() Export bpf_dynptr_get_size(), so that kernel code dealing with eBPF dynamic pointers can obtain the real size of data carried by this data structure. Signed-off-by: Roberto Sassu Reviewed-by: Joanne Koong Acked-by: KP Singh Acked-by: Kumar Kartikeya Dwivedi Link: https://lore.kernel.org/r/20220920075951.929132-6-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/helpers.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 6d69e30f42d8..b069517a3da0 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1408,7 +1408,7 @@ static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_typ ptr->size |= type << DYNPTR_TYPE_SHIFT; } -static u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr) +u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr) { return ptr->size & DYNPTR_SIZE_MASK; } -- cgit v1.2.3 From f3cf4134c5c6c47b9b5c7aa3cb2d67e107887a7b Mon Sep 17 00:00:00 2001 From: Roberto Sassu Date: Tue, 20 Sep 2022 09:59:45 +0200 Subject: bpf: Add bpf_lookup_*_key() and bpf_key_put() kfuncs Add the bpf_lookup_user_key(), bpf_lookup_system_key() and bpf_key_put() kfuncs, to respectively search a key with a given key handle serial number and flags, obtain a key from a pre-determined ID defined in include/linux/verification.h, and cleanup. Introduce system_keyring_id_check() to validate the keyring ID parameter of bpf_lookup_system_key(). Signed-off-by: Roberto Sassu Acked-by: Kumar Kartikeya Dwivedi Acked-by: Song Liu Link: https://lore.kernel.org/r/20220920075951.929132-8-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/trace/bpf_trace.c | 135 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 135 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 68e5cdd24cef..ab183dbaa8d1 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -20,6 +20,8 @@ #include #include #include +#include +#include #include @@ -1181,6 +1183,139 @@ static const struct bpf_func_proto bpf_get_func_arg_cnt_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +#ifdef CONFIG_KEYS +__diag_push(); +__diag_ignore_all("-Wmissing-prototypes", + "kfuncs which will be used in BPF programs"); + +/** + * bpf_lookup_user_key - lookup a key by its serial + * @serial: key handle serial number + * @flags: lookup-specific flags + * + * Search a key with a given *serial* and the provided *flags*. + * If found, increment the reference count of the key by one, and + * return it in the bpf_key structure. + * + * The bpf_key structure must be passed to bpf_key_put() when done + * with it, so that the key reference count is decremented and the + * bpf_key structure is freed. + * + * Permission checks are deferred to the time the key is used by + * one of the available key-specific kfuncs. + * + * Set *flags* with KEY_LOOKUP_CREATE, to attempt creating a requested + * special keyring (e.g. session keyring), if it doesn't yet exist. + * Set *flags* with KEY_LOOKUP_PARTIAL, to lookup a key without waiting + * for the key construction, and to retrieve uninstantiated keys (keys + * without data attached to them). + * + * Return: a bpf_key pointer with a valid key pointer if the key is found, a + * NULL pointer otherwise. + */ +struct bpf_key *bpf_lookup_user_key(u32 serial, u64 flags) +{ + key_ref_t key_ref; + struct bpf_key *bkey; + + if (flags & ~KEY_LOOKUP_ALL) + return NULL; + + /* + * Permission check is deferred until the key is used, as the + * intent of the caller is unknown here. + */ + key_ref = lookup_user_key(serial, flags, KEY_DEFER_PERM_CHECK); + if (IS_ERR(key_ref)) + return NULL; + + bkey = kmalloc(sizeof(*bkey), GFP_KERNEL); + if (!bkey) { + key_put(key_ref_to_ptr(key_ref)); + return NULL; + } + + bkey->key = key_ref_to_ptr(key_ref); + bkey->has_ref = true; + + return bkey; +} + +/** + * bpf_lookup_system_key - lookup a key by a system-defined ID + * @id: key ID + * + * Obtain a bpf_key structure with a key pointer set to the passed key ID. + * The key pointer is marked as invalid, to prevent bpf_key_put() from + * attempting to decrement the key reference count on that pointer. The key + * pointer set in such way is currently understood only by + * verify_pkcs7_signature(). + * + * Set *id* to one of the values defined in include/linux/verification.h: + * 0 for the primary keyring (immutable keyring of system keys); + * VERIFY_USE_SECONDARY_KEYRING for both the primary and secondary keyring + * (where keys can be added only if they are vouched for by existing keys + * in those keyrings); VERIFY_USE_PLATFORM_KEYRING for the platform + * keyring (primarily used by the integrity subsystem to verify a kexec'ed + * kerned image and, possibly, the initramfs signature). + * + * Return: a bpf_key pointer with an invalid key pointer set from the + * pre-determined ID on success, a NULL pointer otherwise + */ +struct bpf_key *bpf_lookup_system_key(u64 id) +{ + struct bpf_key *bkey; + + if (system_keyring_id_check(id) < 0) + return NULL; + + bkey = kmalloc(sizeof(*bkey), GFP_ATOMIC); + if (!bkey) + return NULL; + + bkey->key = (struct key *)(unsigned long)id; + bkey->has_ref = false; + + return bkey; +} + +/** + * bpf_key_put - decrement key reference count if key is valid and free bpf_key + * @bkey: bpf_key structure + * + * Decrement the reference count of the key inside *bkey*, if the pointer + * is valid, and free *bkey*. + */ +void bpf_key_put(struct bpf_key *bkey) +{ + if (bkey->has_ref) + key_put(bkey->key); + + kfree(bkey); +} + +__diag_pop(); + +BTF_SET8_START(key_sig_kfunc_set) +BTF_ID_FLAGS(func, bpf_lookup_user_key, KF_ACQUIRE | KF_RET_NULL | KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_lookup_system_key, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_key_put, KF_RELEASE) +BTF_SET8_END(key_sig_kfunc_set) + +static const struct btf_kfunc_id_set bpf_key_sig_kfunc_set = { + .owner = THIS_MODULE, + .set = &key_sig_kfunc_set, +}; + +static int __init bpf_key_sig_kfuncs_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &bpf_key_sig_kfunc_set); +} + +late_initcall(bpf_key_sig_kfuncs_init); +#endif /* CONFIG_KEYS */ + static const struct bpf_func_proto * bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { -- cgit v1.2.3 From 865b0566d8f1a0c3937e5eb4bd6ba4ef03e7e98c Mon Sep 17 00:00:00 2001 From: Roberto Sassu Date: Tue, 20 Sep 2022 09:59:46 +0200 Subject: bpf: Add bpf_verify_pkcs7_signature() kfunc Add the bpf_verify_pkcs7_signature() kfunc, to give eBPF security modules the ability to check the validity of a signature against supplied data, by using user-provided or system-provided keys as trust anchor. The new kfunc makes it possible to enforce mandatory policies, as eBPF programs might be allowed to make security decisions only based on data sources the system administrator approves. The caller should provide the data to be verified and the signature as eBPF dynamic pointers (to minimize the number of parameters) and a bpf_key structure containing a reference to the keyring with keys trusted for signature verification, obtained from bpf_lookup_user_key() or bpf_lookup_system_key(). For bpf_key structures obtained from the former lookup function, bpf_verify_pkcs7_signature() completes the permission check deferred by that function by calling key_validate(). key_task_permission() is already called by the PKCS#7 code. Signed-off-by: Roberto Sassu Acked-by: KP Singh Acked-by: Song Liu Link: https://lore.kernel.org/r/20220920075951.929132-9-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/trace/bpf_trace.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ab183dbaa8d1..9df53c40cffd 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1294,12 +1294,57 @@ void bpf_key_put(struct bpf_key *bkey) kfree(bkey); } +#ifdef CONFIG_SYSTEM_DATA_VERIFICATION +/** + * bpf_verify_pkcs7_signature - verify a PKCS#7 signature + * @data_ptr: data to verify + * @sig_ptr: signature of the data + * @trusted_keyring: keyring with keys trusted for signature verification + * + * Verify the PKCS#7 signature *sig_ptr* against the supplied *data_ptr* + * with keys in a keyring referenced by *trusted_keyring*. + * + * Return: 0 on success, a negative value on error. + */ +int bpf_verify_pkcs7_signature(struct bpf_dynptr_kern *data_ptr, + struct bpf_dynptr_kern *sig_ptr, + struct bpf_key *trusted_keyring) +{ + int ret; + + if (trusted_keyring->has_ref) { + /* + * Do the permission check deferred in bpf_lookup_user_key(). + * See bpf_lookup_user_key() for more details. + * + * A call to key_task_permission() here would be redundant, as + * it is already done by keyring_search() called by + * find_asymmetric_key(). + */ + ret = key_validate(trusted_keyring->key); + if (ret < 0) + return ret; + } + + return verify_pkcs7_signature(data_ptr->data, + bpf_dynptr_get_size(data_ptr), + sig_ptr->data, + bpf_dynptr_get_size(sig_ptr), + trusted_keyring->key, + VERIFYING_UNSPECIFIED_SIGNATURE, NULL, + NULL); +} +#endif /* CONFIG_SYSTEM_DATA_VERIFICATION */ + __diag_pop(); BTF_SET8_START(key_sig_kfunc_set) BTF_ID_FLAGS(func, bpf_lookup_user_key, KF_ACQUIRE | KF_RET_NULL | KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_lookup_system_key, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_key_put, KF_RELEASE) +#ifdef CONFIG_SYSTEM_DATA_VERIFICATION +BTF_ID_FLAGS(func, bpf_verify_pkcs7_signature, KF_SLEEPABLE) +#endif BTF_SET8_END(key_sig_kfunc_set) static const struct btf_kfunc_id_set bpf_key_sig_kfunc_set = { -- cgit v1.2.3 From 05b24ff9b2cfabfcfd951daaa915a036ab53c9e1 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Fri, 16 Sep 2022 09:19:14 +0200 Subject: bpf: Prevent bpf program recursion for raw tracepoint probes We got report from sysbot [1] about warnings that were caused by bpf program attached to contention_begin raw tracepoint triggering the same tracepoint by using bpf_trace_printk helper that takes trace_printk_lock lock. Call Trace: ? trace_event_raw_event_bpf_trace_printk+0x5f/0x90 bpf_trace_printk+0x2b/0xe0 bpf_prog_a9aec6167c091eef_prog+0x1f/0x24 bpf_trace_run2+0x26/0x90 native_queued_spin_lock_slowpath+0x1c6/0x2b0 _raw_spin_lock_irqsave+0x44/0x50 bpf_trace_printk+0x3f/0xe0 bpf_prog_a9aec6167c091eef_prog+0x1f/0x24 bpf_trace_run2+0x26/0x90 native_queued_spin_lock_slowpath+0x1c6/0x2b0 _raw_spin_lock_irqsave+0x44/0x50 bpf_trace_printk+0x3f/0xe0 bpf_prog_a9aec6167c091eef_prog+0x1f/0x24 bpf_trace_run2+0x26/0x90 native_queued_spin_lock_slowpath+0x1c6/0x2b0 _raw_spin_lock_irqsave+0x44/0x50 bpf_trace_printk+0x3f/0xe0 bpf_prog_a9aec6167c091eef_prog+0x1f/0x24 bpf_trace_run2+0x26/0x90 native_queued_spin_lock_slowpath+0x1c6/0x2b0 _raw_spin_lock_irqsave+0x44/0x50 __unfreeze_partials+0x5b/0x160 ... The can be reproduced by attaching bpf program as raw tracepoint on contention_begin tracepoint. The bpf prog calls bpf_trace_printk helper. Then by running perf bench the spin lock code is forced to take slow path and call contention_begin tracepoint. Fixing this by skipping execution of the bpf program if it's already running, Using bpf prog 'active' field, which is being currently used by trampoline programs for the same reason. Moving bpf_prog_inc_misses_counter to syscall.c because trampoline.c is compiled in just for CONFIG_BPF_JIT option. Reviewed-by: Stanislav Fomichev Reported-by: syzbot+2251879aa068ad9c960d@syzkaller.appspotmail.com [1] https://lore.kernel.org/bpf/YxhFe3EwqchC%2FfYf@krava/T/#t Signed-off-by: Jiri Olsa Link: https://lore.kernel.org/r/20220916071914.7156-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/bpf/syscall.c | 11 +++++++++++ kernel/bpf/trampoline.c | 15 ++------------- kernel/trace/bpf_trace.c | 6 ++++++ 3 files changed, 19 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index dab156f09f8d..372fad5ef3d3 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2093,6 +2093,17 @@ struct bpf_prog_kstats { u64 misses; }; +void notrace bpf_prog_inc_misses_counter(struct bpf_prog *prog) +{ + struct bpf_prog_stats *stats; + unsigned int flags; + + stats = this_cpu_ptr(prog->stats); + flags = u64_stats_update_begin_irqsave(&stats->syncp); + u64_stats_inc(&stats->misses); + u64_stats_update_end_irqrestore(&stats->syncp, flags); +} + static void bpf_prog_get_stats(const struct bpf_prog *prog, struct bpf_prog_kstats *stats) { diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index ad76940b02cc..41b67eb83ab3 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -863,17 +863,6 @@ static __always_inline u64 notrace bpf_prog_start_time(void) return start; } -static void notrace inc_misses_counter(struct bpf_prog *prog) -{ - struct bpf_prog_stats *stats; - unsigned int flags; - - stats = this_cpu_ptr(prog->stats); - flags = u64_stats_update_begin_irqsave(&stats->syncp); - u64_stats_inc(&stats->misses); - u64_stats_update_end_irqrestore(&stats->syncp, flags); -} - /* The logic is similar to bpf_prog_run(), but with an explicit * rcu_read_lock() and migrate_disable() which are required * for the trampoline. The macro is split into @@ -896,7 +885,7 @@ u64 notrace __bpf_prog_enter(struct bpf_prog *prog, struct bpf_tramp_run_ctx *ru run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx); if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) { - inc_misses_counter(prog); + bpf_prog_inc_misses_counter(prog); return 0; } return bpf_prog_start_time(); @@ -967,7 +956,7 @@ u64 notrace __bpf_prog_enter_sleepable(struct bpf_prog *prog, struct bpf_tramp_r might_fault(); if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) { - inc_misses_counter(prog); + bpf_prog_inc_misses_counter(prog); return 0; } diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 9df53c40cffd..b05f0310dbd3 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -2222,9 +2222,15 @@ static __always_inline void __bpf_trace_run(struct bpf_prog *prog, u64 *args) { cant_sleep(); + if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) { + bpf_prog_inc_misses_counter(prog); + goto out; + } rcu_read_lock(); (void) bpf_prog_run(prog, args); rcu_read_unlock(); +out: + this_cpu_dec(*(prog->active)); } #define UNPACK(...) __VA_ARGS__ -- cgit v1.2.3 From 1d8b82c613297f24354b4d750413a7456b5cd92c Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Wed, 21 Sep 2022 15:38:26 +0800 Subject: bpf: Always use raw spinlock for hash bucket lock For a non-preallocated hash map on RT kernel, regular spinlock instead of raw spinlock is used for bucket lock. The reason is that on RT kernel memory allocation is forbidden under atomic context and regular spinlock is sleepable under RT. Now hash map has been fully converted to use bpf_map_alloc, and there will be no synchronous memory allocation for non-preallocated hash map, so it is safe to always use raw spinlock for bucket lock on RT. So removing the usage of htab_use_raw_lock() and updating the comments accordingly. Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20220921073826.2365800-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/hashtab.c | 66 +++++++++++----------------------------------------- 1 file changed, 14 insertions(+), 52 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 86aec20c22d0..ed3f8a53603b 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -68,24 +68,16 @@ * In theory the BPF locks could be converted to regular spinlocks as well, * but the bucket locks and percpu_freelist locks can be taken from * arbitrary contexts (perf, kprobes, tracepoints) which are required to be - * atomic contexts even on RT. These mechanisms require preallocated maps, - * so there is no need to invoke memory allocations within the lock held - * sections. - * - * BPF maps which need dynamic allocation are only used from (forced) - * thread context on RT and can therefore use regular spinlocks which in - * turn allows to invoke memory allocations from the lock held section. - * - * On a non RT kernel this distinction is neither possible nor required. - * spinlock maps to raw_spinlock and the extra code is optimized out by the - * compiler. + * atomic contexts even on RT. Before the introduction of bpf_mem_alloc, + * it is only safe to use raw spinlock for preallocated hash map on a RT kernel, + * because there is no memory allocation within the lock held sections. However + * after hash map was fully converted to use bpf_mem_alloc, there will be + * non-synchronous memory allocation for non-preallocated hash map, so it is + * safe to always use raw spinlock for bucket lock. */ struct bucket { struct hlist_nulls_head head; - union { - raw_spinlock_t raw_lock; - spinlock_t lock; - }; + raw_spinlock_t raw_lock; }; #define HASHTAB_MAP_LOCK_COUNT 8 @@ -141,26 +133,15 @@ static inline bool htab_is_prealloc(const struct bpf_htab *htab) return !(htab->map.map_flags & BPF_F_NO_PREALLOC); } -static inline bool htab_use_raw_lock(const struct bpf_htab *htab) -{ - return (!IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab)); -} - static void htab_init_buckets(struct bpf_htab *htab) { unsigned int i; for (i = 0; i < htab->n_buckets; i++) { INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); - if (htab_use_raw_lock(htab)) { - raw_spin_lock_init(&htab->buckets[i].raw_lock); - lockdep_set_class(&htab->buckets[i].raw_lock, + raw_spin_lock_init(&htab->buckets[i].raw_lock); + lockdep_set_class(&htab->buckets[i].raw_lock, &htab->lockdep_key); - } else { - spin_lock_init(&htab->buckets[i].lock); - lockdep_set_class(&htab->buckets[i].lock, - &htab->lockdep_key); - } cond_resched(); } } @@ -170,28 +151,17 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab, unsigned long *pflags) { unsigned long flags; - bool use_raw_lock; hash = hash & HASHTAB_MAP_LOCK_MASK; - use_raw_lock = htab_use_raw_lock(htab); - if (use_raw_lock) - preempt_disable(); - else - migrate_disable(); + preempt_disable(); if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) { __this_cpu_dec(*(htab->map_locked[hash])); - if (use_raw_lock) - preempt_enable(); - else - migrate_enable(); + preempt_enable(); return -EBUSY; } - if (use_raw_lock) - raw_spin_lock_irqsave(&b->raw_lock, flags); - else - spin_lock_irqsave(&b->lock, flags); + raw_spin_lock_irqsave(&b->raw_lock, flags); *pflags = flags; return 0; @@ -201,18 +171,10 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab, struct bucket *b, u32 hash, unsigned long flags) { - bool use_raw_lock = htab_use_raw_lock(htab); - hash = hash & HASHTAB_MAP_LOCK_MASK; - if (use_raw_lock) - raw_spin_unlock_irqrestore(&b->raw_lock, flags); - else - spin_unlock_irqrestore(&b->lock, flags); + raw_spin_unlock_irqrestore(&b->raw_lock, flags); __this_cpu_dec(*(htab->map_locked[hash])); - if (use_raw_lock) - preempt_enable(); - else - migrate_enable(); + preempt_enable(); } static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); -- cgit v1.2.3 From eed807f626101f6a4227bd53942892c5983b95a7 Mon Sep 17 00:00:00 2001 From: Kumar Kartikeya Dwivedi Date: Wed, 21 Sep 2022 18:48:25 +0200 Subject: bpf: Tweak definition of KF_TRUSTED_ARGS Instead of forcing all arguments to be referenced pointers with non-zero reg->ref_obj_id, tweak the definition of KF_TRUSTED_ARGS to mean that only PTR_TO_BTF_ID (and socket types translated to PTR_TO_BTF_ID) have that constraint, and require their offset to be set to 0. The rest of pointer types are also accomodated in this definition of trusted pointers, but with more relaxed rules regarding offsets. The inherent meaning of setting this flag is that all kfunc pointer arguments have a guranteed lifetime, and kernel object pointers (PTR_TO_BTF_ID, PTR_TO_CTX) are passed in their unmodified form (with offset 0). In general, this is not true for PTR_TO_BTF_ID as it can be obtained using pointer walks. Signed-off-by: Kumar Kartikeya Dwivedi Signed-off-by: Lorenzo Bianconi Link: https://lore.kernel.org/r/cdede0043c47ed7a357f0a915d16f9ce06a1d589.1663778601.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/bpf/btf.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 13faede0f2b4..a44ad4b347ff 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -6227,7 +6227,7 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, bool processing_call) { enum bpf_prog_type prog_type = resolve_prog_type(env->prog); - bool rel = false, kptr_get = false, trusted_arg = false; + bool rel = false, kptr_get = false, trusted_args = false; bool sleepable = false; struct bpf_verifier_log *log = &env->log; u32 i, nargs, ref_id, ref_obj_id = 0; @@ -6265,7 +6265,7 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, /* Only kfunc can be release func */ rel = kfunc_meta->flags & KF_RELEASE; kptr_get = kfunc_meta->flags & KF_KPTR_GET; - trusted_arg = kfunc_meta->flags & KF_TRUSTED_ARGS; + trusted_args = kfunc_meta->flags & KF_TRUSTED_ARGS; sleepable = kfunc_meta->flags & KF_SLEEPABLE; } @@ -6276,6 +6276,7 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, enum bpf_arg_type arg_type = ARG_DONTCARE; u32 regno = i + 1; struct bpf_reg_state *reg = ®s[regno]; + bool obj_ptr = false; t = btf_type_skip_modifiers(btf, args[i].type, NULL); if (btf_type_is_scalar(t)) { @@ -6323,10 +6324,17 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, return -EINVAL; } + /* These register types have special constraints wrt ref_obj_id + * and offset checks. The rest of trusted args don't. + */ + obj_ptr = reg->type == PTR_TO_CTX || reg->type == PTR_TO_BTF_ID || + reg2btf_ids[base_type(reg->type)]; + /* Check if argument must be a referenced pointer, args + i has * been verified to be a pointer (after skipping modifiers). + * PTR_TO_CTX is ok without having non-zero ref_obj_id. */ - if (is_kfunc && trusted_arg && !reg->ref_obj_id) { + if (is_kfunc && trusted_args && (obj_ptr && reg->type != PTR_TO_CTX) && !reg->ref_obj_id) { bpf_log(log, "R%d must be referenced\n", regno); return -EINVAL; } @@ -6335,7 +6343,7 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, ref_tname = btf_name_by_offset(btf, ref_t->name_off); /* Trusted args have the same offset checks as release arguments */ - if (trusted_arg || (rel && reg->ref_obj_id)) + if ((trusted_args && obj_ptr) || (rel && reg->ref_obj_id)) arg_type |= OBJ_RELEASE; ret = check_func_arg_reg_off(env, reg, regno, arg_type); if (ret < 0) @@ -6435,7 +6443,7 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, reg_ref_t->name_off); if (!btf_struct_ids_match(log, reg_btf, reg_ref_id, reg->off, btf, ref_id, - trusted_arg || (rel && reg->ref_obj_id))) { + trusted_args || (rel && reg->ref_obj_id))) { bpf_log(log, "kernel function %s args#%d expected pointer to %s %s but R%d has a pointer to %s %s\n", func_name, i, btf_type_str(ref_t), ref_tname, -- cgit v1.2.3 From 9fca7115827b2e5f48d84e50bceb4edfd4cb6375 Mon Sep 17 00:00:00 2001 From: Sami Tolvanen Date: Thu, 8 Sep 2022 14:54:45 -0700 Subject: cfi: Remove CONFIG_CFI_CLANG_SHADOW In preparation to switching to -fsanitize=kcfi, remove support for the CFI module shadow that will no longer be needed. Signed-off-by: Sami Tolvanen Reviewed-by: Kees Cook Tested-by: Kees Cook Tested-by: Nathan Chancellor Acked-by: Peter Zijlstra (Intel) Tested-by: Peter Zijlstra (Intel) Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220908215504.3686827-4-samitolvanen@google.com --- kernel/cfi.c | 236 +-------------------------------------------------- kernel/module/main.c | 15 ---- 2 files changed, 1 insertion(+), 250 deletions(-) (limited to 'kernel') diff --git a/kernel/cfi.c b/kernel/cfi.c index 2046276ee234..e8bc1b370edc 100644 --- a/kernel/cfi.c +++ b/kernel/cfi.c @@ -32,237 +32,6 @@ static inline void handle_cfi_failure(void *ptr) } #ifdef CONFIG_MODULES -#ifdef CONFIG_CFI_CLANG_SHADOW -/* - * Index type. A 16-bit index can address at most (2^16)-2 pages (taking - * into account SHADOW_INVALID), i.e. ~256M with 4k pages. - */ -typedef u16 shadow_t; -#define SHADOW_INVALID ((shadow_t)~0UL) - -struct cfi_shadow { - /* Page index for the beginning of the shadow */ - unsigned long base; - /* An array of __cfi_check locations (as indices to the shadow) */ - shadow_t shadow[1]; -} __packed; - -/* - * The shadow covers ~128M from the beginning of the module region. If - * the region is larger, we fall back to __module_address for the rest. - */ -#define __SHADOW_RANGE (_UL(SZ_128M) >> PAGE_SHIFT) - -/* The in-memory size of struct cfi_shadow, always at least one page */ -#define __SHADOW_PAGES ((__SHADOW_RANGE * sizeof(shadow_t)) >> PAGE_SHIFT) -#define SHADOW_PAGES max(1UL, __SHADOW_PAGES) -#define SHADOW_SIZE (SHADOW_PAGES << PAGE_SHIFT) - -/* The actual size of the shadow array, minus metadata */ -#define SHADOW_ARR_SIZE (SHADOW_SIZE - offsetof(struct cfi_shadow, shadow)) -#define SHADOW_ARR_SLOTS (SHADOW_ARR_SIZE / sizeof(shadow_t)) - -static DEFINE_MUTEX(shadow_update_lock); -static struct cfi_shadow __rcu *cfi_shadow __read_mostly; - -/* Returns the index in the shadow for the given address */ -static inline int ptr_to_shadow(const struct cfi_shadow *s, unsigned long ptr) -{ - unsigned long index; - unsigned long page = ptr >> PAGE_SHIFT; - - if (unlikely(page < s->base)) - return -1; /* Outside of module area */ - - index = page - s->base; - - if (index >= SHADOW_ARR_SLOTS) - return -1; /* Cannot be addressed with shadow */ - - return (int)index; -} - -/* Returns the page address for an index in the shadow */ -static inline unsigned long shadow_to_ptr(const struct cfi_shadow *s, - int index) -{ - if (unlikely(index < 0 || index >= SHADOW_ARR_SLOTS)) - return 0; - - return (s->base + index) << PAGE_SHIFT; -} - -/* Returns the __cfi_check function address for the given shadow location */ -static inline unsigned long shadow_to_check_fn(const struct cfi_shadow *s, - int index) -{ - if (unlikely(index < 0 || index >= SHADOW_ARR_SLOTS)) - return 0; - - if (unlikely(s->shadow[index] == SHADOW_INVALID)) - return 0; - - /* __cfi_check is always page aligned */ - return (s->base + s->shadow[index]) << PAGE_SHIFT; -} - -static void prepare_next_shadow(const struct cfi_shadow __rcu *prev, - struct cfi_shadow *next) -{ - int i, index, check; - - /* Mark everything invalid */ - memset(next->shadow, 0xFF, SHADOW_ARR_SIZE); - - if (!prev) - return; /* No previous shadow */ - - /* If the base address didn't change, an update is not needed */ - if (prev->base == next->base) { - memcpy(next->shadow, prev->shadow, SHADOW_ARR_SIZE); - return; - } - - /* Convert the previous shadow to the new address range */ - for (i = 0; i < SHADOW_ARR_SLOTS; ++i) { - if (prev->shadow[i] == SHADOW_INVALID) - continue; - - index = ptr_to_shadow(next, shadow_to_ptr(prev, i)); - if (index < 0) - continue; - - check = ptr_to_shadow(next, - shadow_to_check_fn(prev, prev->shadow[i])); - if (check < 0) - continue; - - next->shadow[index] = (shadow_t)check; - } -} - -static void add_module_to_shadow(struct cfi_shadow *s, struct module *mod, - unsigned long min_addr, unsigned long max_addr) -{ - int check_index; - unsigned long check = (unsigned long)mod->cfi_check; - unsigned long ptr; - - if (unlikely(!PAGE_ALIGNED(check))) { - pr_warn("cfi: not using shadow for module %s\n", mod->name); - return; - } - - check_index = ptr_to_shadow(s, check); - if (check_index < 0) - return; /* Module not addressable with shadow */ - - /* For each page, store the check function index in the shadow */ - for (ptr = min_addr; ptr <= max_addr; ptr += PAGE_SIZE) { - int index = ptr_to_shadow(s, ptr); - - if (index >= 0) { - /* Each page must only contain one module */ - WARN_ON_ONCE(s->shadow[index] != SHADOW_INVALID); - s->shadow[index] = (shadow_t)check_index; - } - } -} - -static void remove_module_from_shadow(struct cfi_shadow *s, struct module *mod, - unsigned long min_addr, unsigned long max_addr) -{ - unsigned long ptr; - - for (ptr = min_addr; ptr <= max_addr; ptr += PAGE_SIZE) { - int index = ptr_to_shadow(s, ptr); - - if (index >= 0) - s->shadow[index] = SHADOW_INVALID; - } -} - -typedef void (*update_shadow_fn)(struct cfi_shadow *, struct module *, - unsigned long min_addr, unsigned long max_addr); - -static void update_shadow(struct module *mod, unsigned long base_addr, - update_shadow_fn fn) -{ - struct cfi_shadow *prev; - struct cfi_shadow *next; - unsigned long min_addr, max_addr; - - next = vmalloc(SHADOW_SIZE); - - mutex_lock(&shadow_update_lock); - prev = rcu_dereference_protected(cfi_shadow, - mutex_is_locked(&shadow_update_lock)); - - if (next) { - next->base = base_addr >> PAGE_SHIFT; - prepare_next_shadow(prev, next); - - min_addr = (unsigned long)mod->core_layout.base; - max_addr = min_addr + mod->core_layout.text_size; - fn(next, mod, min_addr & PAGE_MASK, max_addr & PAGE_MASK); - - set_memory_ro((unsigned long)next, SHADOW_PAGES); - } - - rcu_assign_pointer(cfi_shadow, next); - mutex_unlock(&shadow_update_lock); - synchronize_rcu(); - - if (prev) { - set_memory_rw((unsigned long)prev, SHADOW_PAGES); - vfree(prev); - } -} - -void cfi_module_add(struct module *mod, unsigned long base_addr) -{ - update_shadow(mod, base_addr, add_module_to_shadow); -} - -void cfi_module_remove(struct module *mod, unsigned long base_addr) -{ - update_shadow(mod, base_addr, remove_module_from_shadow); -} - -static inline cfi_check_fn ptr_to_check_fn(const struct cfi_shadow __rcu *s, - unsigned long ptr) -{ - int index; - - if (unlikely(!s)) - return NULL; /* No shadow available */ - - index = ptr_to_shadow(s, ptr); - if (index < 0) - return NULL; /* Cannot be addressed with shadow */ - - return (cfi_check_fn)shadow_to_check_fn(s, index); -} - -static inline cfi_check_fn find_shadow_check_fn(unsigned long ptr) -{ - cfi_check_fn fn; - - rcu_read_lock_sched_notrace(); - fn = ptr_to_check_fn(rcu_dereference_sched(cfi_shadow), ptr); - rcu_read_unlock_sched_notrace(); - - return fn; -} - -#else /* !CONFIG_CFI_CLANG_SHADOW */ - -static inline cfi_check_fn find_shadow_check_fn(unsigned long ptr) -{ - return NULL; -} - -#endif /* CONFIG_CFI_CLANG_SHADOW */ static inline cfi_check_fn find_module_check_fn(unsigned long ptr) { @@ -298,10 +67,7 @@ static inline cfi_check_fn find_check_fn(unsigned long ptr) ct_irq_enter(); } - if (IS_ENABLED(CONFIG_CFI_CLANG_SHADOW)) - fn = find_shadow_check_fn(ptr); - if (!fn) - fn = find_module_check_fn(ptr); + fn = find_module_check_fn(ptr); if (rcu_idle) { ct_irq_exit(); diff --git a/kernel/module/main.c b/kernel/module/main.c index a4e4d84b6f4e..0228f44b58e5 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1144,8 +1144,6 @@ void __weak module_arch_freeing_init(struct module *mod) { } -static void cfi_cleanup(struct module *mod); - /* Free a module, remove from lists, etc. */ static void free_module(struct module *mod) { @@ -1190,9 +1188,6 @@ static void free_module(struct module *mod) mod->name); mutex_unlock(&module_mutex); - /* Clean up CFI for the module. */ - cfi_cleanup(mod); - /* This may be empty, but that's OK */ module_arch_freeing_init(mod); module_memfree(mod->init_layout.base); @@ -2875,7 +2870,6 @@ static int load_module(struct load_info *info, const char __user *uargs, synchronize_rcu(); kfree(mod->args); free_arch_cleanup: - cfi_cleanup(mod); module_arch_cleanup(mod); free_modinfo: free_modinfo(mod); @@ -2984,15 +2978,6 @@ static void cfi_init(struct module *mod) mod->exit = *exit; #endif rcu_read_unlock_sched(); - - cfi_module_add(mod, mod_tree.addr_min); -#endif -} - -static void cfi_cleanup(struct module *mod) -{ -#ifdef CONFIG_CFI_CLANG - cfi_module_remove(mod, mod_tree.addr_min); #endif } -- cgit v1.2.3 From 89245600941e4e0f87d77f60ee269b5e61ef4e49 Mon Sep 17 00:00:00 2001 From: Sami Tolvanen Date: Thu, 8 Sep 2022 14:54:47 -0700 Subject: cfi: Switch to -fsanitize=kcfi Switch from Clang's original forward-edge control-flow integrity implementation to -fsanitize=kcfi, which is better suited for the kernel, as it doesn't require LTO, doesn't use a jump table that requires altering function references, and won't break cross-module function address equality. Signed-off-by: Sami Tolvanen Reviewed-by: Kees Cook Tested-by: Kees Cook Tested-by: Nathan Chancellor Acked-by: Peter Zijlstra (Intel) Tested-by: Peter Zijlstra (Intel) Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220908215504.3686827-6-samitolvanen@google.com --- kernel/cfi.c | 144 +++++++++++++++++++++++++-------------------------- kernel/module/main.c | 35 ++----------- 2 files changed, 73 insertions(+), 106 deletions(-) (limited to 'kernel') diff --git a/kernel/cfi.c b/kernel/cfi.c index e8bc1b370edc..08caad776717 100644 --- a/kernel/cfi.c +++ b/kernel/cfi.c @@ -1,105 +1,101 @@ // SPDX-License-Identifier: GPL-2.0 /* - * Clang Control Flow Integrity (CFI) error and slowpath handling. + * Clang Control Flow Integrity (CFI) error handling. * - * Copyright (C) 2021 Google LLC + * Copyright (C) 2022 Google LLC */ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -/* Compiler-defined handler names */ -#ifdef CONFIG_CFI_PERMISSIVE -#define cfi_failure_handler __ubsan_handle_cfi_check_fail -#else -#define cfi_failure_handler __ubsan_handle_cfi_check_fail_abort -#endif - -static inline void handle_cfi_failure(void *ptr) +#include + +enum bug_trap_type report_cfi_failure(struct pt_regs *regs, unsigned long addr, + unsigned long *target, u32 type) { - if (IS_ENABLED(CONFIG_CFI_PERMISSIVE)) - WARN_RATELIMIT(1, "CFI failure (target: %pS):\n", ptr); + if (target) + pr_err("CFI failure at %pS (target: %pS; expected type: 0x%08x)\n", + (void *)addr, (void *)*target, type); else - panic("CFI failure (target: %pS)\n", ptr); + pr_err("CFI failure at %pS (no target information)\n", + (void *)addr); + + if (IS_ENABLED(CONFIG_CFI_PERMISSIVE)) { + __warn(NULL, 0, (void *)addr, 0, regs, NULL); + return BUG_TRAP_TYPE_WARN; + } + + return BUG_TRAP_TYPE_BUG; } -#ifdef CONFIG_MODULES +#ifdef CONFIG_ARCH_USES_CFI_TRAPS +static inline unsigned long trap_address(s32 *p) +{ + return (unsigned long)((long)p + (long)*p); +} -static inline cfi_check_fn find_module_check_fn(unsigned long ptr) +static bool is_trap(unsigned long addr, s32 *start, s32 *end) { - cfi_check_fn fn = NULL; - struct module *mod; + s32 *p; - rcu_read_lock_sched_notrace(); - mod = __module_address(ptr); - if (mod) - fn = mod->cfi_check; - rcu_read_unlock_sched_notrace(); + for (p = start; p < end; ++p) { + if (trap_address(p) == addr) + return true; + } - return fn; + return false; } -static inline cfi_check_fn find_check_fn(unsigned long ptr) +#ifdef CONFIG_MODULES +/* Populates `kcfi_trap(_end)?` fields in `struct module`. */ +void module_cfi_finalize(const Elf_Ehdr *hdr, const Elf_Shdr *sechdrs, + struct module *mod) { - cfi_check_fn fn = NULL; - unsigned long flags; - bool rcu_idle; - - if (is_kernel_text(ptr)) - return __cfi_check; - - /* - * Indirect call checks can happen when RCU is not watching. Both - * the shadow and __module_address use RCU, so we need to wake it - * up if necessary. - */ - rcu_idle = !rcu_is_watching(); - if (rcu_idle) { - local_irq_save(flags); - ct_irq_enter(); - } + char *secstrings; + unsigned int i; - fn = find_module_check_fn(ptr); + mod->kcfi_traps = NULL; + mod->kcfi_traps_end = NULL; - if (rcu_idle) { - ct_irq_exit(); - local_irq_restore(flags); - } + secstrings = (char *)hdr + sechdrs[hdr->e_shstrndx].sh_offset; + + for (i = 1; i < hdr->e_shnum; i++) { + if (strcmp(secstrings + sechdrs[i].sh_name, "__kcfi_traps")) + continue; - return fn; + mod->kcfi_traps = (s32 *)sechdrs[i].sh_addr; + mod->kcfi_traps_end = (s32 *)(sechdrs[i].sh_addr + sechdrs[i].sh_size); + break; + } } -void __cfi_slowpath_diag(uint64_t id, void *ptr, void *diag) +static bool is_module_cfi_trap(unsigned long addr) { - cfi_check_fn fn = find_check_fn((unsigned long)ptr); + struct module *mod; + bool found = false; - if (likely(fn)) - fn(id, ptr, diag); - else /* Don't allow unchecked modules */ - handle_cfi_failure(ptr); -} -EXPORT_SYMBOL(__cfi_slowpath_diag); + rcu_read_lock_sched_notrace(); -#else /* !CONFIG_MODULES */ + mod = __module_address(addr); + if (mod) + found = is_trap(addr, mod->kcfi_traps, mod->kcfi_traps_end); -void __cfi_slowpath_diag(uint64_t id, void *ptr, void *diag) + rcu_read_unlock_sched_notrace(); + + return found; +} +#else /* CONFIG_MODULES */ +static inline bool is_module_cfi_trap(unsigned long addr) { - handle_cfi_failure(ptr); /* No modules */ + return false; } -EXPORT_SYMBOL(__cfi_slowpath_diag); - #endif /* CONFIG_MODULES */ -void cfi_failure_handler(void *data, void *ptr, void *vtable) +extern s32 __start___kcfi_traps[]; +extern s32 __stop___kcfi_traps[]; + +bool is_cfi_trap(unsigned long addr) { - handle_cfi_failure(ptr); + if (is_trap(addr, __start___kcfi_traps, __stop___kcfi_traps)) + return true; + + return is_module_cfi_trap(addr); } -EXPORT_SYMBOL(cfi_failure_handler); +#endif /* CONFIG_ARCH_USES_CFI_TRAPS */ diff --git a/kernel/module/main.c b/kernel/module/main.c index 0228f44b58e5..70c0b2c6fef8 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -53,6 +53,7 @@ #include #include #include +#include #include #include "internal.h" @@ -2597,8 +2598,9 @@ static int complete_formation(struct module *mod, struct load_info *info) if (err < 0) goto out; - /* This relies on module_mutex for list integrity. */ + /* These rely on module_mutex for list integrity. */ module_bug_finalize(info->hdr, info->sechdrs, mod); + module_cfi_finalize(info->hdr, info->sechdrs, mod); if (module_check_misalignment(mod)) goto out_misaligned; @@ -2660,8 +2662,6 @@ static int unknown_module_param_cb(char *param, char *val, const char *modname, return 0; } -static void cfi_init(struct module *mod); - /* * Allocate and load the module: note that size of section 0 is always * zero, and we rely on this for optional sections. @@ -2791,9 +2791,6 @@ static int load_module(struct load_info *info, const char __user *uargs, flush_module_icache(mod); - /* Setup CFI for the module. */ - cfi_init(mod); - /* Now copy in args */ mod->args = strndup_user(uargs, ~0UL >> 1); if (IS_ERR(mod->args)) { @@ -2955,32 +2952,6 @@ static inline int within(unsigned long addr, void *start, unsigned long size) return ((void *)addr >= start && (void *)addr < start + size); } -static void cfi_init(struct module *mod) -{ -#ifdef CONFIG_CFI_CLANG - initcall_t *init; -#ifdef CONFIG_MODULE_UNLOAD - exitcall_t *exit; -#endif - - rcu_read_lock_sched(); - mod->cfi_check = (cfi_check_fn) - find_kallsyms_symbol_value(mod, "__cfi_check"); - init = (initcall_t *) - find_kallsyms_symbol_value(mod, "__cfi_jt_init_module"); - /* Fix init/exit functions to point to the CFI jump table */ - if (init) - mod->init = *init; -#ifdef CONFIG_MODULE_UNLOAD - exit = (exitcall_t *) - find_kallsyms_symbol_value(mod, "__cfi_jt_cleanup_module"); - if (exit) - mod->exit = *exit; -#endif - rcu_read_unlock_sched(); -#endif -} - /* Keep in sync with MODULE_FLAGS_BUF_SIZE !!! */ char *module_flags(struct module *mod, char *buf, bool show_state) { -- cgit v1.2.3 From 4b24356312fbe1bace72f9905d529b14fc34c1c3 Mon Sep 17 00:00:00 2001 From: Sami Tolvanen Date: Thu, 8 Sep 2022 14:54:56 -0700 Subject: treewide: Drop WARN_ON_FUNCTION_MISMATCH CONFIG_CFI_CLANG no longer breaks cross-module function address equality, which makes WARN_ON_FUNCTION_MISMATCH unnecessary. Remove the definition and switch back to WARN_ON_ONCE. Signed-off-by: Sami Tolvanen Reviewed-by: Kees Cook Tested-by: Kees Cook Tested-by: Nathan Chancellor Acked-by: Peter Zijlstra (Intel) Tested-by: Peter Zijlstra (Intel) Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220908215504.3686827-15-samitolvanen@google.com --- kernel/kthread.c | 3 +-- kernel/workqueue.c | 2 +- 2 files changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/kthread.c b/kernel/kthread.c index 3c677918d8f2..28a6b7ab4a0f 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1050,8 +1050,7 @@ static void __kthread_queue_delayed_work(struct kthread_worker *worker, struct timer_list *timer = &dwork->timer; struct kthread_work *work = &dwork->work; - WARN_ON_FUNCTION_MISMATCH(timer->function, - kthread_delayed_work_timer_fn); + WARN_ON_ONCE(timer->function != kthread_delayed_work_timer_fn); /* * If @delay is 0, queue @dwork->work immediately. This is for diff --git a/kernel/workqueue.c b/kernel/workqueue.c index aeea9731ef80..16df315d2a3d 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1651,7 +1651,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq, struct work_struct *work = &dwork->work; WARN_ON_ONCE(!wq); - WARN_ON_FUNCTION_MISMATCH(timer->function, delayed_work_timer_fn); + WARN_ON_ONCE(timer->function != delayed_work_timer_fn); WARN_ON_ONCE(timer_pending(timer)); WARN_ON_ONCE(!list_empty(&work->entry)); -- cgit v1.2.3 From dfb352ab1162f73b8c6dc98150fa32cf5aa2f623 Mon Sep 17 00:00:00 2001 From: Sami Tolvanen Date: Thu, 8 Sep 2022 14:55:00 -0700 Subject: kallsyms: Drop CONFIG_CFI_CLANG workarounds With -fsanitize=kcfi, the compiler no longer renames static functions with CONFIG_CFI_CLANG + ThinLTO. Drop the code that cleans up the ThinLTO hash from the function names. Signed-off-by: Sami Tolvanen Reviewed-by: Nick Desaulniers Reviewed-by: Kees Cook Tested-by: Kees Cook Tested-by: Nathan Chancellor Acked-by: Peter Zijlstra (Intel) Tested-by: Peter Zijlstra (Intel) Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220908215504.3686827-19-samitolvanen@google.com --- kernel/kallsyms.c | 17 ----------------- 1 file changed, 17 deletions(-) (limited to 'kernel') diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index 3e7e2c2ad2f7..b27e6ea31f8b 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -159,7 +159,6 @@ static bool cleanup_symbol_name(char *s) * character in an identifier in C. Suffixes observed: * - foo.llvm.[0-9a-f]+ * - foo.[0-9a-f]+ - * - foo.[0-9a-f]+.cfi_jt */ res = strchr(s, '.'); if (res) { @@ -167,22 +166,6 @@ static bool cleanup_symbol_name(char *s) return true; } - if (!IS_ENABLED(CONFIG_CFI_CLANG) || - !IS_ENABLED(CONFIG_LTO_CLANG_THIN) || - CONFIG_CLANG_VERSION >= 130000) - return false; - - /* - * Prior to LLVM 13, the following suffixes were observed when thinLTO - * and CFI are both enabled: - * - foo$[0-9]+ - */ - res = strrchr(s, '$'); - if (res) { - *res = '\0'; - return true; - } - return false; } -- cgit v1.2.3 From bf7a87f1075f67c286f794519f0fedfa8b0b18cc Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Mon, 26 Sep 2022 17:33:35 +0200 Subject: kprobes: Add new KPROBE_FLAG_ON_FUNC_ENTRY kprobe flag Adding KPROBE_FLAG_ON_FUNC_ENTRY kprobe flag to indicate that attach address is on function entry. This is used in following changes in get_func_ip helper to return correct function address. Acked-by: Masami Hiramatsu (Google) Signed-off-by: Jiri Olsa Link: https://lore.kernel.org/r/20220926153340.1621984-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/kprobes.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 08350e35aba2..51adc3c94503 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1606,9 +1606,10 @@ int register_kprobe(struct kprobe *p) struct kprobe *old_p; struct module *probed_mod; kprobe_opcode_t *addr; + bool on_func_entry; /* Adjust probe address from symbol */ - addr = kprobe_addr(p); + addr = _kprobe_addr(p->addr, p->symbol_name, p->offset, &on_func_entry); if (IS_ERR(addr)) return PTR_ERR(addr); p->addr = addr; @@ -1628,6 +1629,9 @@ int register_kprobe(struct kprobe *p) mutex_lock(&kprobe_mutex); + if (on_func_entry) + p->flags |= KPROBE_FLAG_ON_FUNC_ENTRY; + old_p = get_kprobe(p->addr); if (old_p) { /* Since this may unoptimize 'old_p', locking 'text_mutex'. */ -- cgit v1.2.3 From 9d68c19c57d690547cde977bb3d9ccd3ceb6afe9 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Mon, 26 Sep 2022 17:33:36 +0200 Subject: ftrace: Keep the resolved addr in kallsyms_callback Keeping the resolved 'addr' in kallsyms_callback, instead of taking ftrace_location value, because we depend on symbol address in the cookie related code. With CONFIG_X86_KERNEL_IBT option the ftrace_location value differs from symbol address, which screwes the symbol address cookies matching. There are 2 users of this function: - bpf_kprobe_multi_link_attach for which this fix is for - get_ftrace_locations which is used by register_fprobe_syms this function needs to get symbols resolved to addresses, but does not need 'ftrace location addresses' at this point there's another ftrace location translation in the path done by ftrace_set_filter_ips call: register_fprobe_syms addrs = get_ftrace_locations register_fprobe_ips(addrs) ... ftrace_set_filter_ips ... __ftrace_match_addr ip = ftrace_location(ip); ... Reviewed-by: Masami Hiramatsu (Google) Signed-off-by: Jiri Olsa Link: https://lore.kernel.org/r/20220926153340.1621984-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/trace/ftrace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 439e2ab6905e..447d2e2a8549 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -8265,8 +8265,7 @@ static int kallsyms_callback(void *data, const char *name, if (args->addrs[idx]) return 0; - addr = ftrace_location(addr); - if (!addr) + if (!ftrace_location(addr)) return 0; args->addrs[idx] = addr; -- cgit v1.2.3 From c09eb2e578eb1668bbc84dc07e8d8bd6f04b9a02 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Mon, 26 Sep 2022 17:33:38 +0200 Subject: bpf: Adjust kprobe_multi entry_ip for CONFIG_X86_KERNEL_IBT Martynas reported bpf_get_func_ip returning +4 address when CONFIG_X86_KERNEL_IBT option is enabled. When CONFIG_X86_KERNEL_IBT is enabled we'll have endbr instruction at the function entry, which screws return value of bpf_get_func_ip() helper that should return the function address. There's short term workaround for kprobe_multi bpf program made by Alexei [1], but we need this fixup also for bpf_get_attach_cookie, that returns cookie based on the entry_ip value. Moving the fixup in the fprobe handler, so both bpf_get_func_ip and bpf_get_attach_cookie get expected function address when CONFIG_X86_KERNEL_IBT option is enabled. Also renaming kprobe_multi_link_handler entry_ip argument to fentry_ip so it's clearer this is an ftrace __fentry__ ip. [1] commit 7f0059b58f02 ("selftests/bpf: Fix kprobe_multi test.") Cc: Peter Zijlstra Reported-by: Martynas Pumputis Acked-by: Andrii Nakryiko Signed-off-by: Jiri Olsa Link: https://lore.kernel.org/r/20220926153340.1621984-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/trace/bpf_trace.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b05f0310dbd3..ebd1b348beb3 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1028,6 +1028,22 @@ static const struct bpf_func_proto bpf_get_func_ip_proto_tracing = { .arg1_type = ARG_PTR_TO_CTX, }; +#ifdef CONFIG_X86_KERNEL_IBT +static unsigned long get_entry_ip(unsigned long fentry_ip) +{ + u32 instr; + + /* Being extra safe in here in case entry ip is on the page-edge. */ + if (get_kernel_nofault(instr, (u32 *) fentry_ip - 1)) + return fentry_ip; + if (is_endbr(instr)) + fentry_ip -= ENDBR_INSN_SIZE; + return fentry_ip; +} +#else +#define get_entry_ip(fentry_ip) fentry_ip +#endif + BPF_CALL_1(bpf_get_func_ip_kprobe, struct pt_regs *, regs) { struct kprobe *kp = kprobe_running(); @@ -2600,13 +2616,13 @@ kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link, } static void -kprobe_multi_link_handler(struct fprobe *fp, unsigned long entry_ip, +kprobe_multi_link_handler(struct fprobe *fp, unsigned long fentry_ip, struct pt_regs *regs) { struct bpf_kprobe_multi_link *link; link = container_of(fp, struct bpf_kprobe_multi_link, fp); - kprobe_multi_link_prog_run(link, entry_ip, regs); + kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs); } static int symbols_cmp_r(const void *a, const void *b, const void *priv) -- cgit v1.2.3 From 0e253f7e558a3e250902ba2034091e0185448836 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Mon, 26 Sep 2022 17:33:39 +0200 Subject: bpf: Return value in kprobe get_func_ip only for entry address Changing return value of kprobe's version of bpf_get_func_ip to return zero if the attach address is not on the function's entry point. For kprobes attached in the middle of the function we can't easily get to the function address especially now with the CONFIG_X86_KERNEL_IBT support. If user cares about current IP for kprobes attached within the function body, they can get it with PT_REGS_IP(ctx). Suggested-by: Andrii Nakryiko Acked-by: Andrii Nakryiko Acked-by: Martynas Pumputis Signed-off-by: Jiri Olsa Link: https://lore.kernel.org/r/20220926153340.1621984-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/trace/bpf_trace.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ebd1b348beb3..688552df95ca 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1048,7 +1048,10 @@ BPF_CALL_1(bpf_get_func_ip_kprobe, struct pt_regs *, regs) { struct kprobe *kp = kprobe_running(); - return kp ? (uintptr_t)kp->addr : 0; + if (!kp || !(kp->flags & KPROBE_FLAG_ON_FUNC_ENTRY)) + return 0; + + return get_entry_ip((uintptr_t)kp->addr); } static const struct bpf_func_proto bpf_get_func_ip_proto_kprobe = { -- cgit v1.2.3 From 19c02415da2345d0dda2b5c4495bc17cc14b18b5 Mon Sep 17 00:00:00 2001 From: Song Liu Date: Mon, 26 Sep 2022 11:47:38 -0700 Subject: bpf: use bpf_prog_pack for bpf_dispatcher Allocate bpf_dispatcher with bpf_prog_pack_alloc so that bpf_dispatcher can share pages with bpf programs. arch_prepare_bpf_dispatcher() is updated to provide a RW buffer as working area for arch code to write to. This also fixes CPA W^X warnning like: CPA refuse W^X violation: 8000000000000163 -> 0000000000000163 range: ... Signed-off-by: Song Liu Link: https://lore.kernel.org/r/20220926184739.3512547-2-song@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/bpf/core.c | 9 +++++++-- kernel/bpf/dispatcher.c | 27 +++++++++++++++++++++------ 2 files changed, 28 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index d1be78c28619..711fd293b6de 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -825,6 +825,11 @@ struct bpf_prog_pack { unsigned long bitmap[]; }; +void bpf_jit_fill_hole_with_zero(void *area, unsigned int size) +{ + memset(area, 0, size); +} + #define BPF_PROG_SIZE_TO_NBITS(size) (round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE) static DEFINE_MUTEX(pack_mutex); @@ -864,7 +869,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins return pack; } -static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns) +void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns) { unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size); struct bpf_prog_pack *pack; @@ -905,7 +910,7 @@ out: return ptr; } -static void bpf_prog_pack_free(struct bpf_binary_header *hdr) +void bpf_prog_pack_free(struct bpf_binary_header *hdr) { struct bpf_prog_pack *pack = NULL, *tmp; unsigned int nbits; diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c index 2444bd15cc2d..fa64b80b8bca 100644 --- a/kernel/bpf/dispatcher.c +++ b/kernel/bpf/dispatcher.c @@ -85,12 +85,12 @@ static bool bpf_dispatcher_remove_prog(struct bpf_dispatcher *d, return false; } -int __weak arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs) +int __weak arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_funcs) { return -ENOTSUPP; } -static int bpf_dispatcher_prepare(struct bpf_dispatcher *d, void *image) +static int bpf_dispatcher_prepare(struct bpf_dispatcher *d, void *image, void *buf) { s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0]; int i; @@ -99,12 +99,12 @@ static int bpf_dispatcher_prepare(struct bpf_dispatcher *d, void *image) if (d->progs[i].prog) *ipsp++ = (s64)(uintptr_t)d->progs[i].prog->bpf_func; } - return arch_prepare_bpf_dispatcher(image, &ips[0], d->num_progs); + return arch_prepare_bpf_dispatcher(image, buf, &ips[0], d->num_progs); } static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs) { - void *old, *new; + void *old, *new, *tmp; u32 noff; int err; @@ -117,8 +117,14 @@ static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs) } new = d->num_progs ? d->image + noff : NULL; + tmp = d->num_progs ? d->rw_image + noff : NULL; if (new) { - if (bpf_dispatcher_prepare(d, new)) + /* Prepare the dispatcher in d->rw_image. Then use + * bpf_arch_text_copy to update d->image, which is RO+X. + */ + if (bpf_dispatcher_prepare(d, new, tmp)) + return; + if (IS_ERR(bpf_arch_text_copy(new, tmp, PAGE_SIZE / 2))) return; } @@ -140,9 +146,18 @@ void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, mutex_lock(&d->mutex); if (!d->image) { - d->image = bpf_jit_alloc_exec_page(); + d->image = bpf_prog_pack_alloc(PAGE_SIZE, bpf_jit_fill_hole_with_zero); if (!d->image) goto out; + d->rw_image = bpf_jit_alloc_exec(PAGE_SIZE); + if (!d->rw_image) { + u32 size = PAGE_SIZE; + + bpf_arch_text_copy(d->image, &size, sizeof(size)); + bpf_prog_pack_free((struct bpf_binary_header *)d->image); + d->image = NULL; + goto out; + } bpf_image_ksym_add(d->image, &d->ksym); } -- cgit v1.2.3 From 5b0d1c7bd5722467960829af51d523f5a6ffd848 Mon Sep 17 00:00:00 2001 From: Song Liu Date: Mon, 26 Sep 2022 11:47:39 -0700 Subject: bpf: Enforce W^X for bpf trampoline Mark the trampoline as RO+X after arch_prepare_bpf_trampoline, so that the trampoine follows W^X rule strictly. This will turn off warnings like CPA refuse W^X violation: 8000000000000163 -> 0000000000000163 range: ... Also remove bpf_jit_alloc_exec_page(), since it is not used any more. Signed-off-by: Song Liu Link: https://lore.kernel.org/r/20220926184739.3512547-3-song@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/bpf/trampoline.c | 22 +++++----------------- 1 file changed, 5 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 41b67eb83ab3..6f7b939321d6 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -116,22 +116,6 @@ bool bpf_prog_has_trampoline(const struct bpf_prog *prog) (ptype == BPF_PROG_TYPE_LSM && eatype == BPF_LSM_MAC); } -void *bpf_jit_alloc_exec_page(void) -{ - void *image; - - image = bpf_jit_alloc_exec(PAGE_SIZE); - if (!image) - return NULL; - - set_vm_flush_reset_perms(image); - /* Keep image as writeable. The alternative is to keep flipping ro/rw - * every time new program is attached or detached. - */ - set_memory_x((long)image, 1); - return image; -} - void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym) { ksym->start = (unsigned long) data; @@ -404,9 +388,10 @@ static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key, u32 idx) goto out_free_im; err = -ENOMEM; - im->image = image = bpf_jit_alloc_exec_page(); + im->image = image = bpf_jit_alloc_exec(PAGE_SIZE); if (!image) goto out_uncharge; + set_vm_flush_reset_perms(image); err = percpu_ref_init(&im->pcref, __bpf_tramp_image_release, 0, GFP_KERNEL); if (err) @@ -483,6 +468,9 @@ again: if (err < 0) goto out; + set_memory_ro((long)im->image, 1); + set_memory_x((long)im->image, 1); + WARN_ON(tr->cur_image && tr->selector == 0); WARN_ON(!tr->cur_image && tr->selector); if (tr->cur_image) -- cgit v1.2.3 From 73bbb94466fd3f8b313eeb0b0467314a262dddb3 Mon Sep 17 00:00:00 2001 From: Miguel Ojeda Date: Mon, 5 Apr 2021 04:58:39 +0200 Subject: kallsyms: support "big" kernel symbols Rust symbols can become quite long due to namespacing introduced by modules, types, traits, generics, etc. Increasing to 255 is not enough in some cases, therefore introduce longer lengths to the symbol table. In order to avoid increasing all lengths to 2 bytes (since most of them are small, including many Rust ones), use ULEB128 to keep smaller symbols in 1 byte, with the rest in 2 bytes. Reviewed-by: Kees Cook Reviewed-by: Greg Kroah-Hartman Co-developed-by: Alex Gaynor Signed-off-by: Alex Gaynor Co-developed-by: Wedson Almeida Filho Signed-off-by: Wedson Almeida Filho Co-developed-by: Gary Guo Signed-off-by: Gary Guo Co-developed-by: Boqun Feng Signed-off-by: Boqun Feng Co-developed-by: Matthew Wilcox Signed-off-by: Matthew Wilcox Signed-off-by: Miguel Ojeda --- kernel/kallsyms.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index 3e7e2c2ad2f7..fc5e26348d25 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -50,12 +50,20 @@ static unsigned int kallsyms_expand_symbol(unsigned int off, data = &kallsyms_names[off]; len = *data; data++; + off++; + + /* If MSB is 1, it is a "big" symbol, so needs an additional byte. */ + if ((len & 0x80) != 0) { + len = (len & 0x7F) | (*data << 7); + data++; + off++; + } /* * Update the offset to return the offset for the next symbol on * the compressed stream. */ - off += len + 1; + off += len; /* * For every byte on the compressed symbol data, copy the table @@ -108,7 +116,7 @@ static char kallsyms_get_symbol_type(unsigned int off) static unsigned int get_symbol_offset(unsigned long pos) { const u8 *name; - int i; + int i, len; /* * Use the closest marker we have. We have markers every 256 positions, @@ -122,8 +130,18 @@ static unsigned int get_symbol_offset(unsigned long pos) * so we just need to add the len to the current pointer for every * symbol we wish to skip. */ - for (i = 0; i < (pos & 0xFF); i++) - name = name + (*name) + 1; + for (i = 0; i < (pos & 0xFF); i++) { + len = *name; + + /* + * If MSB is 1, it is a "big" symbol, so we need to look into + * the next byte (and skip it, too). + */ + if ((len & 0x80) != 0) + len = ((len & 0x7F) | (name[1] << 7)) + 1; + + name = name + len + 1; + } return name - kallsyms_names; } -- cgit v1.2.3 From b8a94bfb33952bb17fbc65f8903d242a721c533d Mon Sep 17 00:00:00 2001 From: Miguel Ojeda Date: Mon, 5 Apr 2021 05:03:50 +0200 Subject: kallsyms: increase maximum kernel symbol length to 512 Rust symbols can become quite long due to namespacing introduced by modules, types, traits, generics, etc. For instance, the following code: pub mod my_module { pub struct MyType; pub struct MyGenericType(T); pub trait MyTrait { fn my_method() -> u32; } impl MyTrait for MyGenericType { fn my_method() -> u32 { 42 } } } generates a symbol of length 96 when using the upcoming v0 mangling scheme: _RNvXNtCshGpAVYOtgW1_7example9my_moduleINtB2_13MyGenericTypeNtB2_6MyTypeENtB2_7MyTrait9my_method At the moment, Rust symbols may reach up to 300 in length. Setting 512 as the maximum seems like a reasonable choice to keep some headroom. Reviewed-by: Kees Cook Reviewed-by: Petr Mladek Reviewed-by: Greg Kroah-Hartman Co-developed-by: Alex Gaynor Signed-off-by: Alex Gaynor Co-developed-by: Wedson Almeida Filho Signed-off-by: Wedson Almeida Filho Co-developed-by: Gary Guo Signed-off-by: Gary Guo Co-developed-by: Boqun Feng Signed-off-by: Boqun Feng Signed-off-by: Miguel Ojeda --- kernel/livepatch/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c index bc475e62279d..ec06ce59d728 100644 --- a/kernel/livepatch/core.c +++ b/kernel/livepatch/core.c @@ -213,7 +213,7 @@ static int klp_resolve_symbols(Elf_Shdr *sechdrs, const char *strtab, * we use the smallest/strictest upper bound possible (56, based on * the current definition of MODULE_NAME_LEN) to prevent overflows. */ - BUILD_BUG_ON(MODULE_NAME_LEN < 56 || KSYM_NAME_LEN != 128); + BUILD_BUG_ON(MODULE_NAME_LEN < 56 || KSYM_NAME_LEN != 512); relas = (Elf_Rela *) relasec->sh_addr; /* For each rela in this klp relocation section */ @@ -227,7 +227,7 @@ static int klp_resolve_symbols(Elf_Shdr *sechdrs, const char *strtab, /* Format: .klp.sym.sym_objname.sym_name,sympos */ cnt = sscanf(strtab + sym->st_name, - ".klp.sym.%55[^.].%127[^,],%lu", + ".klp.sym.%55[^.].%511[^,],%lu", sym_objname, sym_name, &sympos); if (cnt != 3) { pr_err("symbol %s has an incorrectly formatted name\n", -- cgit v1.2.3 From 2f7ab1267dc9b2d1f29695aff3211c87483480f3 Mon Sep 17 00:00:00 2001 From: Miguel Ojeda Date: Sat, 3 Jul 2021 16:42:57 +0200 Subject: Kbuild: add Rust support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Having most of the new files in place, we now enable Rust support in the build system, including `Kconfig` entries related to Rust, the Rust configuration printer and a few other bits. Reviewed-by: Kees Cook Reviewed-by: Nick Desaulniers Tested-by: Nick Desaulniers Reviewed-by: Greg Kroah-Hartman Co-developed-by: Alex Gaynor Signed-off-by: Alex Gaynor Co-developed-by: Finn Behrens Signed-off-by: Finn Behrens Co-developed-by: Adam Bratschi-Kaye Signed-off-by: Adam Bratschi-Kaye Co-developed-by: Wedson Almeida Filho Signed-off-by: Wedson Almeida Filho Co-developed-by: Michael Ellerman Signed-off-by: Michael Ellerman Co-developed-by: Sven Van Asbroeck Signed-off-by: Sven Van Asbroeck Co-developed-by: Gary Guo Signed-off-by: Gary Guo Co-developed-by: Boris-Chengbiao Zhou Signed-off-by: Boris-Chengbiao Zhou Co-developed-by: Boqun Feng Signed-off-by: Boqun Feng Co-developed-by: Douglas Su Signed-off-by: Douglas Su Co-developed-by: Dariusz Sosnowski Signed-off-by: Dariusz Sosnowski Co-developed-by: Antonio Terceiro Signed-off-by: Antonio Terceiro Co-developed-by: Daniel Xu Signed-off-by: Daniel Xu Co-developed-by: Björn Roy Baron Signed-off-by: Björn Roy Baron Co-developed-by: Martin Rodriguez Reboredo Signed-off-by: Martin Rodriguez Reboredo Signed-off-by: Miguel Ojeda --- kernel/configs/rust.config | 1 + 1 file changed, 1 insertion(+) create mode 100644 kernel/configs/rust.config (limited to 'kernel') diff --git a/kernel/configs/rust.config b/kernel/configs/rust.config new file mode 100644 index 000000000000..38a7c5362c9c --- /dev/null +++ b/kernel/configs/rust.config @@ -0,0 +1 @@ +CONFIG_RUST=y -- cgit v1.2.3 From f0d74c4da1f060d2a66976193712a5e6abd361f5 Mon Sep 17 00:00:00 2001 From: Kui-Feng Lee Date: Mon, 26 Sep 2022 11:49:53 -0700 Subject: bpf: Parameterize task iterators. Allow creating an iterator that loops through resources of one thread/process. People could only create iterators to loop through all resources of files, vma, and tasks in the system, even though they were interested in only the resources of a specific task or process. Passing the additional parameters, people can now create an iterator to go through all resources or only the resources of a task. Signed-off-by: Kui-Feng Lee Signed-off-by: Andrii Nakryiko Acked-by: Yonghong Song Acked-by: Martin KaFai Lau Link: https://lore.kernel.org/bpf/20220926184957.208194-2-kuifeng@fb.com --- kernel/bpf/task_iter.c | 188 +++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 166 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index 8c921799def4..8b2f47e7139d 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -12,6 +12,9 @@ struct bpf_iter_seq_task_common { struct pid_namespace *ns; + enum bpf_iter_task_type type; + u32 pid; + u32 pid_visiting; }; struct bpf_iter_seq_task_info { @@ -22,18 +25,115 @@ struct bpf_iter_seq_task_info { u32 tid; }; -static struct task_struct *task_seq_get_next(struct pid_namespace *ns, +static struct task_struct *task_group_seq_get_next(struct bpf_iter_seq_task_common *common, + u32 *tid, + bool skip_if_dup_files) +{ + struct task_struct *task, *next_task; + struct pid *pid; + u32 saved_tid; + + if (!*tid) { + /* The first time, the iterator calls this function. */ + pid = find_pid_ns(common->pid, common->ns); + if (!pid) + return NULL; + + task = get_pid_task(pid, PIDTYPE_TGID); + if (!task) + return NULL; + + *tid = common->pid; + common->pid_visiting = common->pid; + + return task; + } + + /* If the control returns to user space and comes back to the + * kernel again, *tid and common->pid_visiting should be the + * same for task_seq_start() to pick up the correct task. + */ + if (*tid == common->pid_visiting) { + pid = find_pid_ns(common->pid_visiting, common->ns); + task = get_pid_task(pid, PIDTYPE_PID); + + return task; + } + + pid = find_pid_ns(common->pid_visiting, common->ns); + if (!pid) + return NULL; + + task = get_pid_task(pid, PIDTYPE_PID); + if (!task) + return NULL; + +retry: + if (!pid_alive(task)) { + put_task_struct(task); + return NULL; + } + + next_task = next_thread(task); + put_task_struct(task); + if (!next_task) + return NULL; + + saved_tid = *tid; + *tid = __task_pid_nr_ns(next_task, PIDTYPE_PID, common->ns); + if (!*tid || *tid == common->pid) { + /* Run out of tasks of a process. The tasks of a + * thread_group are linked as circular linked list. + */ + *tid = saved_tid; + return NULL; + } + + get_task_struct(next_task); + common->pid_visiting = *tid; + + if (skip_if_dup_files && task->files == task->group_leader->files) { + task = next_task; + goto retry; + } + + return next_task; +} + +static struct task_struct *task_seq_get_next(struct bpf_iter_seq_task_common *common, u32 *tid, bool skip_if_dup_files) { struct task_struct *task = NULL; struct pid *pid; + if (common->type == BPF_TASK_ITER_TID) { + if (*tid && *tid != common->pid) + return NULL; + rcu_read_lock(); + pid = find_pid_ns(common->pid, common->ns); + if (pid) { + task = get_pid_task(pid, PIDTYPE_TGID); + *tid = common->pid; + } + rcu_read_unlock(); + + return task; + } + + if (common->type == BPF_TASK_ITER_TGID) { + rcu_read_lock(); + task = task_group_seq_get_next(common, tid, skip_if_dup_files); + rcu_read_unlock(); + + return task; + } + rcu_read_lock(); retry: - pid = find_ge_pid(*tid, ns); + pid = find_ge_pid(*tid, common->ns); if (pid) { - *tid = pid_nr_ns(pid, ns); + *tid = pid_nr_ns(pid, common->ns); task = get_pid_task(pid, PIDTYPE_PID); if (!task) { ++*tid; @@ -56,7 +156,7 @@ static void *task_seq_start(struct seq_file *seq, loff_t *pos) struct bpf_iter_seq_task_info *info = seq->private; struct task_struct *task; - task = task_seq_get_next(info->common.ns, &info->tid, false); + task = task_seq_get_next(&info->common, &info->tid, false); if (!task) return NULL; @@ -73,7 +173,7 @@ static void *task_seq_next(struct seq_file *seq, void *v, loff_t *pos) ++*pos; ++info->tid; put_task_struct((struct task_struct *)v); - task = task_seq_get_next(info->common.ns, &info->tid, false); + task = task_seq_get_next(&info->common, &info->tid, false); if (!task) return NULL; @@ -117,6 +217,41 @@ static void task_seq_stop(struct seq_file *seq, void *v) put_task_struct((struct task_struct *)v); } +static int bpf_iter_attach_task(struct bpf_prog *prog, + union bpf_iter_link_info *linfo, + struct bpf_iter_aux_info *aux) +{ + unsigned int flags; + struct pid *pid; + pid_t tgid; + + if ((!!linfo->task.tid + !!linfo->task.pid + !!linfo->task.pid_fd) > 1) + return -EINVAL; + + aux->task.type = BPF_TASK_ITER_ALL; + if (linfo->task.tid != 0) { + aux->task.type = BPF_TASK_ITER_TID; + aux->task.pid = linfo->task.tid; + } + if (linfo->task.pid != 0) { + aux->task.type = BPF_TASK_ITER_TGID; + aux->task.pid = linfo->task.pid; + } + if (linfo->task.pid_fd != 0) { + aux->task.type = BPF_TASK_ITER_TGID; + + pid = pidfd_get_pid(linfo->task.pid_fd, &flags); + if (IS_ERR(pid)) + return PTR_ERR(pid); + + tgid = pid_nr_ns(pid, task_active_pid_ns(current)); + aux->task.pid = tgid; + put_pid(pid); + } + + return 0; +} + static const struct seq_operations task_seq_ops = { .start = task_seq_start, .next = task_seq_next, @@ -137,8 +272,7 @@ struct bpf_iter_seq_task_file_info { static struct file * task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info) { - struct pid_namespace *ns = info->common.ns; - u32 curr_tid = info->tid; + u32 saved_tid = info->tid; struct task_struct *curr_task; unsigned int curr_fd = info->fd; @@ -151,21 +285,18 @@ again: curr_task = info->task; curr_fd = info->fd; } else { - curr_task = task_seq_get_next(ns, &curr_tid, true); + curr_task = task_seq_get_next(&info->common, &info->tid, true); if (!curr_task) { info->task = NULL; - info->tid = curr_tid; return NULL; } - /* set info->task and info->tid */ + /* set info->task */ info->task = curr_task; - if (curr_tid == info->tid) { + if (saved_tid == info->tid) curr_fd = info->fd; - } else { - info->tid = curr_tid; + else curr_fd = 0; - } } rcu_read_lock(); @@ -186,9 +317,15 @@ again: /* the current task is done, go to the next task */ rcu_read_unlock(); put_task_struct(curr_task); + + if (info->common.type == BPF_TASK_ITER_TID) { + info->task = NULL; + return NULL; + } + info->task = NULL; info->fd = 0; - curr_tid = ++(info->tid); + saved_tid = ++(info->tid); goto again; } @@ -269,6 +406,9 @@ static int init_seq_pidns(void *priv_data, struct bpf_iter_aux_info *aux) struct bpf_iter_seq_task_common *common = priv_data; common->ns = get_pid_ns(task_active_pid_ns(current)); + common->type = aux->task.type; + common->pid = aux->task.pid; + return 0; } @@ -307,11 +447,10 @@ enum bpf_task_vma_iter_find_op { static struct vm_area_struct * task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info) { - struct pid_namespace *ns = info->common.ns; enum bpf_task_vma_iter_find_op op; struct vm_area_struct *curr_vma; struct task_struct *curr_task; - u32 curr_tid = info->tid; + u32 saved_tid = info->tid; /* If this function returns a non-NULL vma, it holds a reference to * the task_struct, and holds read lock on vma->mm->mmap_lock. @@ -371,14 +510,13 @@ task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info) } } else { again: - curr_task = task_seq_get_next(ns, &curr_tid, true); + curr_task = task_seq_get_next(&info->common, &info->tid, true); if (!curr_task) { - info->tid = curr_tid + 1; + info->tid++; goto finish; } - if (curr_tid != info->tid) { - info->tid = curr_tid; + if (saved_tid != info->tid) { /* new task, process the first vma */ op = task_vma_iter_first_vma; } else { @@ -430,9 +568,12 @@ again: return curr_vma; next_task: + if (info->common.type == BPF_TASK_ITER_TID) + goto finish; + put_task_struct(curr_task); info->task = NULL; - curr_tid++; + info->tid++; goto again; finish: @@ -533,6 +674,7 @@ static const struct bpf_iter_seq_info task_seq_info = { static struct bpf_iter_reg task_reg_info = { .target = "task", + .attach_target = bpf_iter_attach_task, .feature = BPF_ITER_RESCHED, .ctx_arg_info_size = 1, .ctx_arg_info = { @@ -551,6 +693,7 @@ static const struct bpf_iter_seq_info task_file_seq_info = { static struct bpf_iter_reg task_file_reg_info = { .target = "task_file", + .attach_target = bpf_iter_attach_task, .feature = BPF_ITER_RESCHED, .ctx_arg_info_size = 2, .ctx_arg_info = { @@ -571,6 +714,7 @@ static const struct bpf_iter_seq_info task_vma_seq_info = { static struct bpf_iter_reg task_vma_reg_info = { .target = "task_vma", + .attach_target = bpf_iter_attach_task, .feature = BPF_ITER_RESCHED, .ctx_arg_info_size = 2, .ctx_arg_info = { -- cgit v1.2.3 From 21fb6f2aa3890b0d0abf88b7756d0098e9367a7c Mon Sep 17 00:00:00 2001 From: Kui-Feng Lee Date: Mon, 26 Sep 2022 11:49:54 -0700 Subject: bpf: Handle bpf_link_info for the parameterized task BPF iterators. Add new fields to bpf_link_info that users can query it through bpf_obj_get_info_by_fd(). Signed-off-by: Kui-Feng Lee Signed-off-by: Andrii Nakryiko Acked-by: Yonghong Song Acked-by: Martin KaFai Lau Link: https://lore.kernel.org/bpf/20220926184957.208194-3-kuifeng@fb.com --- kernel/bpf/task_iter.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index 8b2f47e7139d..46f836be22e2 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -672,6 +672,21 @@ static const struct bpf_iter_seq_info task_seq_info = { .seq_priv_size = sizeof(struct bpf_iter_seq_task_info), }; +static int bpf_iter_fill_link_info(const struct bpf_iter_aux_info *aux, struct bpf_link_info *info) +{ + switch (aux->task.type) { + case BPF_TASK_ITER_TID: + info->iter.task.tid = aux->task.pid; + break; + case BPF_TASK_ITER_TGID: + info->iter.task.pid = aux->task.pid; + break; + default: + break; + } + return 0; +} + static struct bpf_iter_reg task_reg_info = { .target = "task", .attach_target = bpf_iter_attach_task, @@ -682,6 +697,7 @@ static struct bpf_iter_reg task_reg_info = { PTR_TO_BTF_ID_OR_NULL }, }, .seq_info = &task_seq_info, + .fill_link_info = bpf_iter_fill_link_info, }; static const struct bpf_iter_seq_info task_file_seq_info = { @@ -703,6 +719,7 @@ static struct bpf_iter_reg task_file_reg_info = { PTR_TO_BTF_ID_OR_NULL }, }, .seq_info = &task_file_seq_info, + .fill_link_info = bpf_iter_fill_link_info, }; static const struct bpf_iter_seq_info task_vma_seq_info = { @@ -724,6 +741,7 @@ static struct bpf_iter_reg task_vma_reg_info = { PTR_TO_BTF_ID_OR_NULL }, }, .seq_info = &task_vma_seq_info, + .fill_link_info = bpf_iter_fill_link_info, }; BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start, -- cgit v1.2.3 From 2c4fe44fb020f3cce904da2ba9e42bb1c118e8a3 Mon Sep 17 00:00:00 2001 From: Kui-Feng Lee Date: Mon, 26 Sep 2022 11:49:55 -0700 Subject: bpf: Handle show_fdinfo for the parameterized task BPF iterators Show information of iterators in the respective files under /proc//fdinfo/. For example, for a task file iterator with 1723 as the value of tid parameter, its fdinfo would look like the following lines. pos: 0 flags: 02000000 mnt_id: 14 ino: 38 link_type: iter link_id: 51 prog_tag: a590ac96db22b825 prog_id: 299 target_name: task_file task_type: TID tid: 1723 This patch add the last three fields. task_type is the type of the task parameter. TID means the iterator visit only the thread specified by tid. The value of tid in the above example is 1723. For the case of PID task_type, it means the iterator visits only threads of a process and will show the pid value of the process instead of a tid. Signed-off-by: Kui-Feng Lee Signed-off-by: Andrii Nakryiko Acked-by: Yonghong Song Acked-by: Martin KaFai Lau Link: https://lore.kernel.org/bpf/20220926184957.208194-4-kuifeng@fb.com --- kernel/bpf/task_iter.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index 46f836be22e2..67e03e1833ba 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -10,6 +10,12 @@ #include #include "mmap_unlock_work.h" +static const char * const iter_task_type_names[] = { + "ALL", + "TID", + "PID", +}; + struct bpf_iter_seq_task_common { struct pid_namespace *ns; enum bpf_iter_task_type type; @@ -687,6 +693,15 @@ static int bpf_iter_fill_link_info(const struct bpf_iter_aux_info *aux, struct b return 0; } +static void bpf_iter_task_show_fdinfo(const struct bpf_iter_aux_info *aux, struct seq_file *seq) +{ + seq_printf(seq, "task_type:\t%s\n", iter_task_type_names[aux->task.type]); + if (aux->task.type == BPF_TASK_ITER_TID) + seq_printf(seq, "tid:\t%u\n", aux->task.pid); + else if (aux->task.type == BPF_TASK_ITER_TGID) + seq_printf(seq, "pid:\t%u\n", aux->task.pid); +} + static struct bpf_iter_reg task_reg_info = { .target = "task", .attach_target = bpf_iter_attach_task, @@ -698,6 +713,7 @@ static struct bpf_iter_reg task_reg_info = { }, .seq_info = &task_seq_info, .fill_link_info = bpf_iter_fill_link_info, + .show_fdinfo = bpf_iter_task_show_fdinfo, }; static const struct bpf_iter_seq_info task_file_seq_info = { @@ -720,6 +736,7 @@ static struct bpf_iter_reg task_file_reg_info = { }, .seq_info = &task_file_seq_info, .fill_link_info = bpf_iter_fill_link_info, + .show_fdinfo = bpf_iter_task_show_fdinfo, }; static const struct bpf_iter_seq_info task_vma_seq_info = { @@ -742,6 +759,7 @@ static struct bpf_iter_reg task_vma_reg_info = { }, .seq_info = &task_vma_seq_info, .fill_link_info = bpf_iter_fill_link_info, + .show_fdinfo = bpf_iter_task_show_fdinfo, }; BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start, -- cgit v1.2.3 From 64696c40d03c01e0ea2e3e9aa1c490a7b6a1b6be Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Thu, 29 Sep 2022 00:04:03 -0700 Subject: bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline The struct_ops prog is to allow using bpf to implement the functions in a struct (eg. kernel module). The current usage is to implement the tcp_congestion. The kernel does not call the tcp-cc's ops (ie. the bpf prog) in a recursive way. The struct_ops is sharing the tracing-trampoline's enter/exit function which tracks prog->active to avoid recursion. It is needed for tracing prog. However, it turns out the struct_ops bpf prog will hit this prog->active and unnecessarily skipped running the struct_ops prog. eg. The '.ssthresh' may run in_task() and then interrupted by softirq that runs the same '.ssthresh'. Skip running the '.ssthresh' will end up returning random value to the caller. The patch adds __bpf_prog_{enter,exit}_struct_ops for the struct_ops trampoline. They do not track the prog->active to detect recursion. One exception is when the tcp_congestion's '.init' ops is doing bpf_setsockopt(TCP_CONGESTION) and then recurs to the same '.init' ops. This will be addressed in the following patches. Fixes: ca06f55b9002 ("bpf: Add per-program recursion prevention mechanism") Signed-off-by: Martin KaFai Lau Link: https://lore.kernel.org/r/20220929070407.965581-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov --- kernel/bpf/trampoline.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 6f7b939321d6..bf0906e1e2b9 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -964,6 +964,29 @@ void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start, rcu_read_unlock_trace(); } +u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog, + struct bpf_tramp_run_ctx *run_ctx) + __acquires(RCU) +{ + rcu_read_lock(); + migrate_disable(); + + run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx); + + return bpf_prog_start_time(); +} + +void notrace __bpf_prog_exit_struct_ops(struct bpf_prog *prog, u64 start, + struct bpf_tramp_run_ctx *run_ctx) + __releases(RCU) +{ + bpf_reset_run_ctx(run_ctx->saved_run_ctx); + + update_prog_stats(prog, start); + migrate_enable(); + rcu_read_unlock(); +} + void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr) { percpu_ref_get(&tr->pcref); -- cgit v1.2.3