From e37a06f10994c2ba86f54d8f96734f2483a869b8 Mon Sep 17 00:00:00 2001 From: Li Zefan Date: Thu, 17 Apr 2014 13:53:08 +0800 Subject: cgroup: fix the retry path of cgroup_mount() If we hit the retry path, we'll call parse_cgroupfs_options() again, but the string we pass to it has been modified by the previous call to this function. This bug can be observed by: # mount -t cgroup -o name=foo,cpuset xxx /mnt && umount /mnt && \ mount -t cgroup -o name=foo,cpuset xxx /mnt mount: wrong fs type, bad option, bad superblock on xxx, missing codepage or helper program, or other error ... The second mount passed "name=foo,cpuset" to the parser, and then it hit the retry path and call the parser again, but this time the string passed to the parser is "name=foo". To fix this, we avoid calling parse_cgroupfs_options() again in this case. Signed-off-by: Li Zefan Signed-off-by: Tejun Heo --- kernel/cgroup.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 9fcdaa705b6c..11a03d67635a 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1495,7 +1495,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, */ if (!use_task_css_set_links) cgroup_enable_task_cg_lists(); -retry: + mutex_lock(&cgroup_tree_mutex); mutex_lock(&cgroup_mutex); @@ -1503,7 +1503,7 @@ retry: ret = parse_cgroupfs_options(data, &opts); if (ret) goto out_unlock; - +retry: /* look for a matching existing root */ if (!opts.subsys_mask && !opts.none && !opts.name) { cgrp_dfl_root_visible = true; @@ -1562,9 +1562,9 @@ retry: if (!atomic_inc_not_zero(&root->cgrp.refcnt)) { mutex_unlock(&cgroup_mutex); mutex_unlock(&cgroup_tree_mutex); - kfree(opts.release_agent); - kfree(opts.name); msleep(10); + mutex_lock(&cgroup_tree_mutex); + mutex_lock(&cgroup_mutex); goto retry; } -- cgit v1.2.3 From 79d719749d23234e9b725098aa49133f3ef7299d Mon Sep 17 00:00:00 2001 From: Aristeu Rozanski Date: Mon, 21 Apr 2014 12:13:03 -0400 Subject: device_cgroup: rework device access check and exception checking Whenever a device file is opened and checked against current device cgroup rules, it uses the same function (may_access()) as when a new exception rule is added by writing devices.{allow,deny}. And in both cases, the algorithm is the same, doesn't matter the behavior. First problem is having device access to be considered the same as rule checking. Consider the following structure: A (default behavior: allow, exceptions disallow access) \ B (default behavior: allow, exceptions disallow access) A new exception is added to B by writing devices.deny: c 12:34 rw When checking if that exception is allowed in may_access(): if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) { if (behavior == DEVCG_DEFAULT_ALLOW) { /* the exception will deny access to certain devices */ return true; Which is ok, since B is not getting more privileges than A, it doesn't matter and the rule is accepted Now, consider it's a device file open check and the process belongs to cgroup B. The access will be generated as: behavior: allow exception: c 12:34 rw The very same chunk of code will allow it, even if there's an explicit exception telling to do otherwise. A simple test case: # mkdir new_group # cd new_group # echo $$ >tasks # echo "c 1:3 w" >devices.deny # echo >/dev/null # echo $? 0 This is a serious bug and was introduced on c39a2a3018f8 devcg: prepare may_access() for hierarchy support To solve this problem, the device file open function was split from the new exception check. Second problem is how exceptions are processed by may_access(). The first part of the said function tries to match fully with an existing exception: list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) { if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK)) continue; if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR)) continue; if (ex->major != ~0 && ex->major != refex->major) continue; if (ex->minor != ~0 && ex->minor != refex->minor) continue; if (refex->access & (~ex->access)) continue; match = true; break; } That means the new exception should be contained into an existing one to be considered a match: New exception Existing match? notes b 12:34 rwm b 12:34 rwm yes b 12:34 r b *:34 rw yes b 12:34 rw b 12:34 w no extra "r" b *:34 rw b 12:34 rw no too broad "*" b *:34 rw b *:34 rwm yes Which is fine in some cases. Consider: A (default behavior: deny, exceptions allow access) \ B (default behavior: deny, exceptions allow access) In this case the full match makes sense, the new exception cannot add more access than the parent allows But this doesn't always work, consider: A (default behavior: allow, exceptions disallow access) \ B (default behavior: deny, exceptions allow access) In this case, a new exception in B shouldn't match any of the exceptions in A, after all you can't allow something that was forbidden by A. But consider this scenario: New exception Existing in A match? outcome b 12:34 rw b 12:34 r no exception is accepted Because the new exception has "w" as extra, it doesn't match, so it'll be added to B's exception list. The same problem can happen during a file access check. Consider a cgroup with allow as default behavior: Access Exception match? b 12:34 rw b 12:34 r no In this case, the access didn't match any of the exceptions in the cgroup, which is required since exceptions will disallow access. To solve this problem, two new functions were created to match an exception either fully or partially. In the example above, a partial check will be performed and it'll produce a match since at least "b 12:34 r" from "b 12:34 rw" access matches. Cc: cgroups@vger.kernel.org Cc: Tejun Heo Cc: Serge Hallyn Cc: Li Zefan Cc: stable@vger.kernel.org Signed-off-by: Aristeu Rozanski Signed-off-by: Tejun Heo --- security/device_cgroup.c | 162 +++++++++++++++++++++++++++++++++++------------ 1 file changed, 122 insertions(+), 40 deletions(-) diff --git a/security/device_cgroup.c b/security/device_cgroup.c index 8365909f5f8c..b9048dc46b1a 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -306,57 +306,139 @@ static int devcgroup_seq_show(struct seq_file *m, void *v) } /** - * may_access - verifies if a new exception is part of what is allowed - * by a dev cgroup based on the default policy + - * exceptions. This is used to make sure a child cgroup - * won't have more privileges than its parent or to - * verify if a certain access is allowed. - * @dev_cgroup: dev cgroup to be tested against - * @refex: new exception - * @behavior: behavior of the exception + * match_exception - iterates the exception list trying to match a rule + * based on type, major, minor and access type. It is + * considered a match if an exception is found that + * will contain the entire range of provided parameters. + * @exceptions: list of exceptions + * @type: device type (DEV_BLOCK or DEV_CHAR) + * @major: device file major number, ~0 to match all + * @minor: device file minor number, ~0 to match all + * @access: permission mask (ACC_READ, ACC_WRITE, ACC_MKNOD) + * + * returns: true in case it matches an exception completely */ -static bool may_access(struct dev_cgroup *dev_cgroup, - struct dev_exception_item *refex, - enum devcg_behavior behavior) +static bool match_exception(struct list_head *exceptions, short type, + u32 major, u32 minor, short access) { struct dev_exception_item *ex; - bool match = false; - rcu_lockdep_assert(rcu_read_lock_held() || - lockdep_is_held(&devcgroup_mutex), - "device_cgroup::may_access() called without proper synchronization"); + list_for_each_entry_rcu(ex, exceptions, list) { + if ((type & DEV_BLOCK) && !(ex->type & DEV_BLOCK)) + continue; + if ((type & DEV_CHAR) && !(ex->type & DEV_CHAR)) + continue; + if (ex->major != ~0 && ex->major != major) + continue; + if (ex->minor != ~0 && ex->minor != minor) + continue; + /* provided access cannot have more than the exception rule */ + if (access & (~ex->access)) + continue; + return true; + } + return false; +} + +/** + * match_exception_partial - iterates the exception list trying to match a rule + * based on type, major, minor and access type. It is + * considered a match if an exception's range is + * found to contain *any* of the devices specified by + * provided parameters. This is used to make sure no + * extra access is being granted that is forbidden by + * any of the exception list. + * @exceptions: list of exceptions + * @type: device type (DEV_BLOCK or DEV_CHAR) + * @major: device file major number, ~0 to match all + * @minor: device file minor number, ~0 to match all + * @access: permission mask (ACC_READ, ACC_WRITE, ACC_MKNOD) + * + * returns: true in case the provided range mat matches an exception completely + */ +static bool match_exception_partial(struct list_head *exceptions, short type, + u32 major, u32 minor, short access) +{ + struct dev_exception_item *ex; - list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) { - if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK)) + list_for_each_entry_rcu(ex, exceptions, list) { + if ((type & DEV_BLOCK) && !(ex->type & DEV_BLOCK)) continue; - if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR)) + if ((type & DEV_CHAR) && !(ex->type & DEV_CHAR)) continue; - if (ex->major != ~0 && ex->major != refex->major) + /* + * We must be sure that both the exception and the provided + * range aren't masking all devices + */ + if (ex->major != ~0 && major != ~0 && ex->major != major) continue; - if (ex->minor != ~0 && ex->minor != refex->minor) + if (ex->minor != ~0 && minor != ~0 && ex->minor != minor) continue; - if (refex->access & (~ex->access)) + /* + * In order to make sure the provided range isn't matching + * an exception, all its access bits shouldn't match the + * exception's access bits + */ + if (!(access & ex->access)) continue; - match = true; - break; + return true; } + return false; +} + +/** + * verify_new_ex - verifies if a new exception is part of what is allowed + * by a dev cgroup based on the default policy + + * exceptions. This is used to make sure a child cgroup + * won't have more privileges than its parent + * @dev_cgroup: dev cgroup to be tested against + * @refex: new exception + * @behavior: behavior of the exception's dev_cgroup + */ +static bool verify_new_ex(struct dev_cgroup *dev_cgroup, + struct dev_exception_item *refex, + enum devcg_behavior behavior) +{ + bool match = false; + + rcu_lockdep_assert(rcu_read_lock_held() || + lockdep_is_held(&devcgroup_mutex), + "device_cgroup:verify_new_ex called without proper synchronization"); if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) { if (behavior == DEVCG_DEFAULT_ALLOW) { - /* the exception will deny access to certain devices */ + /* + * new exception in the child doesn't matter, only + * adding extra restrictions + */ return true; } else { - /* the exception will allow access to certain devices */ + /* + * new exception in the child will add more devices + * that can be acessed, so it can't match any of + * parent's exceptions, even slightly + */ + match = match_exception_partial(&dev_cgroup->exceptions, + refex->type, + refex->major, + refex->minor, + refex->access); + if (match) - /* - * a new exception allowing access shouldn't - * match an parent's exception - */ return false; return true; } } else { - /* only behavior == DEVCG_DEFAULT_DENY allowed here */ + /* + * Only behavior == DEVCG_DEFAULT_DENY allowed here, therefore + * the new exception will add access to more devices and must + * be contained completely in an parent's exception to be + * allowed + */ + match = match_exception(&dev_cgroup->exceptions, refex->type, + refex->major, refex->minor, + refex->access); + if (match) /* parent has an exception that matches the proposed */ return true; @@ -378,7 +460,7 @@ static int parent_has_perm(struct dev_cgroup *childcg, if (!parent) return 1; - return may_access(parent, ex, childcg->behavior); + return verify_new_ex(parent, ex, childcg->behavior); } /** @@ -704,18 +786,18 @@ static int __devcgroup_check_permission(short type, u32 major, u32 minor, short access) { struct dev_cgroup *dev_cgroup; - struct dev_exception_item ex; - int rc; - - memset(&ex, 0, sizeof(ex)); - ex.type = type; - ex.major = major; - ex.minor = minor; - ex.access = access; + bool rc; rcu_read_lock(); dev_cgroup = task_devcgroup(current); - rc = may_access(dev_cgroup, &ex, dev_cgroup->behavior); + if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) + /* Can't match any of the exceptions, even partially */ + rc = !match_exception_partial(&dev_cgroup->exceptions, + type, major, minor, access); + else + /* Need to match completely one exception to be allowed */ + rc = match_exception(&dev_cgroup->exceptions, type, major, + minor, access); rcu_read_unlock(); if (!rc) -- cgit v1.2.3 From f5f3cf6f7e49b9529fc00a2c4629fa92cf2755fe Mon Sep 17 00:00:00 2001 From: Aristeu Rozanski Date: Thu, 24 Apr 2014 15:33:21 -0400 Subject: device_cgroup: fix the comment format for recently added functions Moving more extensive explanations to the end of the comment. Cc: Li Zefan Signed-off-by: Aristeu Rozanski Acked-by: Serge Hallyn Signed-off-by: Tejun Heo --- security/device_cgroup.c | 33 ++++++++++++++++----------------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/security/device_cgroup.c b/security/device_cgroup.c index b9048dc46b1a..6b1266dd92bb 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -306,17 +306,17 @@ static int devcgroup_seq_show(struct seq_file *m, void *v) } /** - * match_exception - iterates the exception list trying to match a rule - * based on type, major, minor and access type. It is - * considered a match if an exception is found that - * will contain the entire range of provided parameters. + * match_exception - iterates the exception list trying to find a complete match * @exceptions: list of exceptions * @type: device type (DEV_BLOCK or DEV_CHAR) * @major: device file major number, ~0 to match all * @minor: device file minor number, ~0 to match all * @access: permission mask (ACC_READ, ACC_WRITE, ACC_MKNOD) * - * returns: true in case it matches an exception completely + * It is considered a complete match if an exception is found that will + * contain the entire range of provided parameters. + * + * Return: true in case it matches an exception completely */ static bool match_exception(struct list_head *exceptions, short type, u32 major, u32 minor, short access) @@ -341,20 +341,19 @@ static bool match_exception(struct list_head *exceptions, short type, } /** - * match_exception_partial - iterates the exception list trying to match a rule - * based on type, major, minor and access type. It is - * considered a match if an exception's range is - * found to contain *any* of the devices specified by - * provided parameters. This is used to make sure no - * extra access is being granted that is forbidden by - * any of the exception list. + * match_exception_partial - iterates the exception list trying to find a partial match * @exceptions: list of exceptions * @type: device type (DEV_BLOCK or DEV_CHAR) * @major: device file major number, ~0 to match all * @minor: device file minor number, ~0 to match all * @access: permission mask (ACC_READ, ACC_WRITE, ACC_MKNOD) * - * returns: true in case the provided range mat matches an exception completely + * It is considered a partial match if an exception's range is found to + * contain *any* of the devices specified by provided parameters. This is + * used to make sure no extra access is being granted that is forbidden by + * any of the exception list. + * + * Return: true in case the provided range mat matches an exception completely */ static bool match_exception_partial(struct list_head *exceptions, short type, u32 major, u32 minor, short access) @@ -387,13 +386,13 @@ static bool match_exception_partial(struct list_head *exceptions, short type, } /** - * verify_new_ex - verifies if a new exception is part of what is allowed - * by a dev cgroup based on the default policy + - * exceptions. This is used to make sure a child cgroup - * won't have more privileges than its parent + * verify_new_ex - verifies if a new exception is allowed by parent cgroup's permissions * @dev_cgroup: dev cgroup to be tested against * @refex: new exception * @behavior: behavior of the exception's dev_cgroup + * + * This is used to make sure a child cgroup won't have more privileges + * than its parent */ static bool verify_new_ex(struct dev_cgroup *dev_cgroup, struct dev_exception_item *refex, -- cgit v1.2.3 From d2c2b11cfa134f4fbdcc34088824da26a084d8de Mon Sep 17 00:00:00 2001 From: Aristeu Rozanski Date: Mon, 5 May 2014 11:18:59 -0400 Subject: device_cgroup: check if exception removal is allowed [PATCH v3 1/2] device_cgroup: check if exception removal is allowed When the device cgroup hierarchy was introduced in bd2953ebbb53 - devcg: propagate local changes down the hierarchy a specific case was overlooked. Consider the hierarchy bellow: A default policy: ALLOW, exceptions will deny access \ B default policy: ALLOW, exceptions will deny access There's no need to verify when an new exception is added to B because in this case exceptions will deny access to further devices, which is always fine. Hierarchy in device cgroup only makes sure B won't have more access than A. But when an exception is removed (by writing devices.allow), it isn't checked if the user is in fact removing an inherited exception from A, thus giving more access to B. Example: # echo 'a' >A/devices.allow # echo 'c 1:3 rw' >A/devices.deny # echo $$ >A/B/tasks # echo >/dev/null -bash: /dev/null: Operation not permitted # echo 'c 1:3 w' >A/B/devices.allow # echo >/dev/null # This shouldn't be allowed and this patch fixes it by making sure to never allow exceptions in this case to be removed if the exception is partially or fully present on the parent. v3: missing '*' in function description v2: improved log message and formatting fixes Cc: cgroups@vger.kernel.org Cc: Li Zefan Cc: stable@vger.kernel.org Signed-off-by: Aristeu Rozanski Acked-by: Serge Hallyn Signed-off-by: Tejun Heo --- security/device_cgroup.c | 41 ++++++++++++++++++++++++++++++++++++++--- 1 file changed, 38 insertions(+), 3 deletions(-) diff --git a/security/device_cgroup.c b/security/device_cgroup.c index 6b1266dd92bb..9134dbf70d3e 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -462,6 +462,37 @@ static int parent_has_perm(struct dev_cgroup *childcg, return verify_new_ex(parent, ex, childcg->behavior); } +/** + * parent_allows_removal - verify if it's ok to remove an exception + * @childcg: child cgroup from where the exception will be removed + * @ex: exception being removed + * + * When removing an exception in cgroups with default ALLOW policy, it must + * be checked if removing it will give the child cgroup more access than the + * parent. + * + * Return: true if it's ok to remove exception, false otherwise + */ +static bool parent_allows_removal(struct dev_cgroup *childcg, + struct dev_exception_item *ex) +{ + struct dev_cgroup *parent = css_to_devcgroup(css_parent(&childcg->css)); + + if (!parent) + return true; + + /* It's always allowed to remove access to devices */ + if (childcg->behavior == DEVCG_DEFAULT_DENY) + return true; + + /* + * Make sure you're not removing part or a whole exception existing in + * the parent cgroup + */ + return !match_exception_partial(&parent->exceptions, ex->type, + ex->major, ex->minor, ex->access); +} + /** * may_allow_all - checks if it's possible to change the behavior to * allow based on parent's rules. @@ -697,17 +728,21 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup, switch (filetype) { case DEVCG_ALLOW: - if (!parent_has_perm(devcgroup, &ex)) - return -EPERM; /* * If the default policy is to allow by default, try to remove * an matching exception instead. And be silent about it: we * don't want to break compatibility */ if (devcgroup->behavior == DEVCG_DEFAULT_ALLOW) { + /* Check if the parent allows removing it first */ + if (!parent_allows_removal(devcgroup, &ex)) + return -EPERM; dev_exception_rm(devcgroup, &ex); - return 0; + break; } + + if (!parent_has_perm(devcgroup, &ex)) + return -EPERM; rc = dev_exception_add(devcgroup, &ex); break; case DEVCG_DENY: -- cgit v1.2.3 From 36c38fb7144aa941dc072ba8f58b2dbe509c0345 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 May 2014 12:37:30 -0400 Subject: blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() During the recent conversion of cgroup to kernfs, cgroup_tree_mutex which nests above both the kernfs s_active protection and cgroup_mutex is added to synchronize cgroup file type operations as cgroup_mutex needed to be grabbed from some file operations and thus can't be put above s_active protection. While this arrangement mostly worked for cgroup, this triggered the following lockdep warning. ====================================================== [ INFO: possible circular locking dependency detected ] 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G W ------------------------------------------------------- trinity-c173/9024 is trying to acquire lock: (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) but task is already holding lock: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283) which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (s_active#89){++++.+}: lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024) kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219) cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899) cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2)) rebind_subsystems (kernel/cgroup.c:1144) cgroup_setup_root (kernel/cgroup.c:1568) cgroup_mount (kernel/cgroup.c:1716) mount_fs (fs/super.c:1094) vfs_kern_mount (fs/namespace.c:899) do_mount (fs/namespace.c:2238 fs/namespace.c:2561) SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729) tracesys (arch/x86/kernel/entry_64.S:746) -> #1 (cgroup_tree_mutex){+.+.+.}: lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040) blkcg_policy_register (block/blk-cgroup.c:1106) throtl_init (block/blk-throttle.c:1694) do_one_initcall (init/main.c:789) kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003) kernel_init (init/main.c:935) ret_from_fork (arch/x86/kernel/entry_64.S:552) -> #0 (blkcg_pol_mutex){+.+.+.}: __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182) lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) cgroup_file_write (kernel/cgroup.c:2714) kernfs_fop_write (fs/kernfs/file.c:295) vfs_write (fs/read_write.c:532) SyS_write (fs/read_write.c:584 fs/read_write.c:576) tracesys (arch/x86/kernel/entry_64.S:746) other info that might help us debug this: Chain exists of: blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#89); lock(cgroup_tree_mutex); lock(s_active#89); lock(blkcg_pol_mutex); *** DEADLOCK *** 4 locks held by trinity-c173/9024: #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714) #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530) #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283) #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283) stack backtrace: CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G W 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002 ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004 ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0 Call Trace: dump_stack (lib/dump_stack.c:52) print_circular_bug (kernel/locking/lockdep.c:1216) __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182) lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587) blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455) cgroup_file_write (kernel/cgroup.c:2714) kernfs_fop_write (fs/kernfs/file.c:295) vfs_write (fs/read_write.c:532) SyS_write (fs/read_write.c:584 fs/read_write.c:576) This is a highly unlikely but valid circular dependency between "echo 1 > blkcg.reset_stats" and cfq module [un]loading. cgroup is going through further locking update which will remove this complication but for now let's use trylock on blkcg_pol_mutex and retry the file operation if the trylock fails. Signed-off-by: Tejun Heo Reported-by: Sasha Levin References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com --- block/blk-cgroup.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index e4a4145926f6..1039fb9ff5f5 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -451,7 +451,20 @@ static int blkcg_reset_stats(struct cgroup_subsys_state *css, struct blkcg_gq *blkg; int i; - mutex_lock(&blkcg_pol_mutex); + /* + * XXX: We invoke cgroup_add/rm_cftypes() under blkcg_pol_mutex + * which ends up putting cgroup's internal cgroup_tree_mutex under + * it; however, cgroup_tree_mutex is nested above cgroup file + * active protection and grabbing blkcg_pol_mutex from a cgroup + * file operation creates a possible circular dependency. cgroup + * internal locking is planned to go through further simplification + * and this issue should go away soon. For now, let's trylock + * blkcg_pol_mutex and restart the write on failure. + * + * http://lkml.kernel.org/g/5363C04B.4010400@oracle.com + */ + if (!mutex_trylock(&blkcg_pol_mutex)) + return restart_syscall(); spin_lock_irq(&blkcg->lock); /* -- cgit v1.2.3