From e37a06f10994c2ba86f54d8f96734f2483a869b8 Mon Sep 17 00:00:00 2001
From: Li Zefan <lizefan@huawei.com>
Date: Thu, 17 Apr 2014 13:53:08 +0800
Subject: cgroup: fix the retry path of cgroup_mount()

If we hit the retry path, we'll call parse_cgroupfs_options() again,
but the string we pass to it has been modified by the previous call
to this function.

This bug can be observed by:

  # mount -t cgroup -o name=foo,cpuset xxx /mnt && umount /mnt && \
    mount -t cgroup -o name=foo,cpuset xxx /mnt
  mount: wrong fs type, bad option, bad superblock on xxx,
         missing codepage or helper program, or other error
  ...

The second mount passed "name=foo,cpuset" to the parser, and then it
hit the retry path and call the parser again, but this time the string
passed to the parser is "name=foo".

To fix this, we avoid calling parse_cgroupfs_options() again in this
case.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9fcdaa705b6c..11a03d67635a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1495,7 +1495,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	 */
 	if (!use_task_css_set_links)
 		cgroup_enable_task_cg_lists();
-retry:
+
 	mutex_lock(&cgroup_tree_mutex);
 	mutex_lock(&cgroup_mutex);
 
@@ -1503,7 +1503,7 @@ retry:
 	ret = parse_cgroupfs_options(data, &opts);
 	if (ret)
 		goto out_unlock;
-
+retry:
 	/* look for a matching existing root */
 	if (!opts.subsys_mask && !opts.none && !opts.name) {
 		cgrp_dfl_root_visible = true;
@@ -1562,9 +1562,9 @@ retry:
 		if (!atomic_inc_not_zero(&root->cgrp.refcnt)) {
 			mutex_unlock(&cgroup_mutex);
 			mutex_unlock(&cgroup_tree_mutex);
-			kfree(opts.release_agent);
-			kfree(opts.name);
 			msleep(10);
+			mutex_lock(&cgroup_tree_mutex);
+			mutex_lock(&cgroup_mutex);
 			goto retry;
 		}
 
-- 
cgit v1.2.3


From 79d719749d23234e9b725098aa49133f3ef7299d Mon Sep 17 00:00:00 2001
From: Aristeu Rozanski <aris@redhat.com>
Date: Mon, 21 Apr 2014 12:13:03 -0400
Subject: device_cgroup: rework device access check and exception checking

Whenever a device file is opened and checked against current device
cgroup rules, it uses the same function (may_access()) as when a new
exception rule is added by writing devices.{allow,deny}. And in both
cases, the algorithm is the same, doesn't matter the behavior.

First problem is having device access to be considered the same as rule
checking. Consider the following structure:

	A	(default behavior: allow, exceptions disallow access)
	 \
	  B	(default behavior: allow, exceptions disallow access)

A new exception is added to B by writing devices.deny:

	c 12:34 rw

When checking if that exception is allowed in may_access():

	if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
		if (behavior == DEVCG_DEFAULT_ALLOW) {
			/* the exception will deny access to certain devices */
			return true;

Which is ok, since B is not getting more privileges than A, it doesn't
matter and the rule is accepted

Now, consider it's a device file open check and the process belongs to
cgroup B. The access will be generated as:

	behavior: allow
	exception: c 12:34 rw

The very same chunk of code will allow it, even if there's an explicit
exception telling to do otherwise.

A simple test case:

	# mkdir new_group
	# cd new_group
	# echo $$ >tasks
	# echo "c 1:3 w" >devices.deny
	# echo >/dev/null
	# echo $?
	0

This is a serious bug and was introduced on

	c39a2a3018f8 devcg: prepare may_access() for hierarchy support

To solve this problem, the device file open function was split from the
new exception check.

Second problem is how exceptions are processed by may_access(). The
first part of the said function tries to match fully with an existing
exception:

	list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) {
		if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
			continue;
		if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR))
			continue;
		if (ex->major != ~0 && ex->major != refex->major)
			continue;
		if (ex->minor != ~0 && ex->minor != refex->minor)
			continue;
		if (refex->access & (~ex->access))
			continue;
		match = true;
		break;
	}

That means the new exception should be contained into an existing one to
be considered a match:

	New exception		Existing	match?	notes
	b 12:34 rwm		b 12:34 rwm	yes
	b 12:34 r		b *:34 rw	yes
	b 12:34 rw		b 12:34 w	no	extra "r"
	b *:34 rw		b 12:34 rw	no	too broad "*"
	b *:34 rw		b *:34 rwm	yes

Which is fine in some cases. Consider:

	A	(default behavior: deny, exceptions allow access)
	 \
	  B	(default behavior: deny, exceptions allow access)

In this case the full match makes sense, the new exception cannot add
more access than the parent allows

But this doesn't always work, consider:

	A	(default behavior: allow, exceptions disallow access)
	 \
	  B	(default behavior: deny, exceptions allow access)

In this case, a new exception in B shouldn't match any of the exceptions
in A, after all you can't allow something that was forbidden by A. But
consider this scenario:

	New exception	Existing in A	match?	outcome
	b 12:34 rw	b 12:34 r	no	exception is accepted

Because the new exception has "w" as extra, it doesn't match, so it'll
be added to B's exception list.

The same problem can happen during a file access check. Consider a
cgroup with allow as default behavior:

	Access		Exception	match?
	b 12:34 rw	b 12:34 r	no

In this case, the access didn't match any of the exceptions in the
cgroup, which is required since exceptions will disallow access.

To solve this problem, two new functions were created to match an
exception either fully or partially. In the example above, a partial
check will be performed and it'll produce a match since at least
"b 12:34 r" from "b 12:34 rw" access matches.

Cc: cgroups@vger.kernel.org
Cc: Tejun Heo <tj@kernel.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 security/device_cgroup.c | 162 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 122 insertions(+), 40 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 8365909f5f8c..b9048dc46b1a 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -306,57 +306,139 @@ static int devcgroup_seq_show(struct seq_file *m, void *v)
 }
 
 /**
- * may_access - verifies if a new exception is part of what is allowed
- *		by a dev cgroup based on the default policy +
- *		exceptions. This is used to make sure a child cgroup
- *		won't have more privileges than its parent or to
- *		verify if a certain access is allowed.
- * @dev_cgroup: dev cgroup to be tested against
- * @refex: new exception
- * @behavior: behavior of the exception
+ * match_exception	- iterates the exception list trying to match a rule
+ * 			  based on type, major, minor and access type. It is
+ * 			  considered a match if an exception is found that
+ * 			  will contain the entire range of provided parameters.
+ * @exceptions: list of exceptions
+ * @type: device type (DEV_BLOCK or DEV_CHAR)
+ * @major: device file major number, ~0 to match all
+ * @minor: device file minor number, ~0 to match all
+ * @access: permission mask (ACC_READ, ACC_WRITE, ACC_MKNOD)
+ *
+ * returns: true in case it matches an exception completely
  */
-static bool may_access(struct dev_cgroup *dev_cgroup,
-		       struct dev_exception_item *refex,
-		       enum devcg_behavior behavior)
+static bool match_exception(struct list_head *exceptions, short type,
+			    u32 major, u32 minor, short access)
 {
 	struct dev_exception_item *ex;
-	bool match = false;
 
-	rcu_lockdep_assert(rcu_read_lock_held() ||
-			   lockdep_is_held(&devcgroup_mutex),
-			   "device_cgroup::may_access() called without proper synchronization");
+	list_for_each_entry_rcu(ex, exceptions, list) {
+		if ((type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
+			continue;
+		if ((type & DEV_CHAR) && !(ex->type & DEV_CHAR))
+			continue;
+		if (ex->major != ~0 && ex->major != major)
+			continue;
+		if (ex->minor != ~0 && ex->minor != minor)
+			continue;
+		/* provided access cannot have more than the exception rule */
+		if (access & (~ex->access))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+/**
+ * match_exception_partial - iterates the exception list trying to match a rule
+ * 			     based on type, major, minor and access type. It is
+ * 			     considered a match if an exception's range is
+ * 			     found to contain *any* of the devices specified by
+ * 			     provided parameters. This is used to make sure no
+ * 			     extra access is being granted that is forbidden by
+ * 			     any of the exception list.
+ * @exceptions: list of exceptions
+ * @type: device type (DEV_BLOCK or DEV_CHAR)
+ * @major: device file major number, ~0 to match all
+ * @minor: device file minor number, ~0 to match all
+ * @access: permission mask (ACC_READ, ACC_WRITE, ACC_MKNOD)
+ *
+ * returns: true in case the provided range mat matches an exception completely
+ */
+static bool match_exception_partial(struct list_head *exceptions, short type,
+				    u32 major, u32 minor, short access)
+{
+	struct dev_exception_item *ex;
 
-	list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) {
-		if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
+	list_for_each_entry_rcu(ex, exceptions, list) {
+		if ((type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
 			continue;
-		if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR))
+		if ((type & DEV_CHAR) && !(ex->type & DEV_CHAR))
 			continue;
-		if (ex->major != ~0 && ex->major != refex->major)
+		/*
+		 * We must be sure that both the exception and the provided
+		 * range aren't masking all devices
+		 */
+		if (ex->major != ~0 && major != ~0 && ex->major != major)
 			continue;
-		if (ex->minor != ~0 && ex->minor != refex->minor)
+		if (ex->minor != ~0 && minor != ~0 && ex->minor != minor)
 			continue;
-		if (refex->access & (~ex->access))
+		/*
+		 * In order to make sure the provided range isn't matching
+		 * an exception, all its access bits shouldn't match the
+		 * exception's access bits
+		 */
+		if (!(access & ex->access))
 			continue;
-		match = true;
-		break;
+		return true;
 	}
+	return false;
+}
+
+/**
+ * verify_new_ex - verifies if a new exception is part of what is allowed
+ *		   by a dev cgroup based on the default policy +
+ *		   exceptions. This is used to make sure a child cgroup
+ *		   won't have more privileges than its parent
+ * @dev_cgroup: dev cgroup to be tested against
+ * @refex: new exception
+ * @behavior: behavior of the exception's dev_cgroup
+ */
+static bool verify_new_ex(struct dev_cgroup *dev_cgroup,
+		          struct dev_exception_item *refex,
+		          enum devcg_behavior behavior)
+{
+	bool match = false;
+
+	rcu_lockdep_assert(rcu_read_lock_held() ||
+			   lockdep_is_held(&devcgroup_mutex),
+			   "device_cgroup:verify_new_ex called without proper synchronization");
 
 	if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
 		if (behavior == DEVCG_DEFAULT_ALLOW) {
-			/* the exception will deny access to certain devices */
+			/*
+			 * new exception in the child doesn't matter, only
+			 * adding extra restrictions
+			 */ 
 			return true;
 		} else {
-			/* the exception will allow access to certain devices */
+			/*
+			 * new exception in the child will add more devices
+			 * that can be acessed, so it can't match any of
+			 * parent's exceptions, even slightly
+			 */ 
+			match = match_exception_partial(&dev_cgroup->exceptions,
+							refex->type,
+							refex->major,
+							refex->minor,
+							refex->access);
+
 			if (match)
-				/*
-				 * a new exception allowing access shouldn't
-				 * match an parent's exception
-				 */
 				return false;
 			return true;
 		}
 	} else {
-		/* only behavior == DEVCG_DEFAULT_DENY allowed here */
+		/*
+		 * Only behavior == DEVCG_DEFAULT_DENY allowed here, therefore
+		 * the new exception will add access to more devices and must
+		 * be contained completely in an parent's exception to be
+		 * allowed
+		 */
+		match = match_exception(&dev_cgroup->exceptions, refex->type,
+					refex->major, refex->minor,
+					refex->access);
+
 		if (match)
 			/* parent has an exception that matches the proposed */
 			return true;
@@ -378,7 +460,7 @@ static int parent_has_perm(struct dev_cgroup *childcg,
 
 	if (!parent)
 		return 1;
-	return may_access(parent, ex, childcg->behavior);
+	return verify_new_ex(parent, ex, childcg->behavior);
 }
 
 /**
@@ -704,18 +786,18 @@ static int __devcgroup_check_permission(short type, u32 major, u32 minor,
 				        short access)
 {
 	struct dev_cgroup *dev_cgroup;
-	struct dev_exception_item ex;
-	int rc;
-
-	memset(&ex, 0, sizeof(ex));
-	ex.type = type;
-	ex.major = major;
-	ex.minor = minor;
-	ex.access = access;
+	bool rc;
 
 	rcu_read_lock();
 	dev_cgroup = task_devcgroup(current);
-	rc = may_access(dev_cgroup, &ex, dev_cgroup->behavior);
+	if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW)
+		/* Can't match any of the exceptions, even partially */
+		rc = !match_exception_partial(&dev_cgroup->exceptions,
+					      type, major, minor, access);
+	else
+		/* Need to match completely one exception to be allowed */
+		rc = match_exception(&dev_cgroup->exceptions, type, major,
+				     minor, access);
 	rcu_read_unlock();
 
 	if (!rc)
-- 
cgit v1.2.3


From f5f3cf6f7e49b9529fc00a2c4629fa92cf2755fe Mon Sep 17 00:00:00 2001
From: Aristeu Rozanski <aris@redhat.com>
Date: Thu, 24 Apr 2014 15:33:21 -0400
Subject: device_cgroup: fix the comment format for recently added functions

Moving more extensive explanations to the end of the comment.

Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 security/device_cgroup.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index b9048dc46b1a..6b1266dd92bb 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -306,17 +306,17 @@ static int devcgroup_seq_show(struct seq_file *m, void *v)
 }
 
 /**
- * match_exception	- iterates the exception list trying to match a rule
- * 			  based on type, major, minor and access type. It is
- * 			  considered a match if an exception is found that
- * 			  will contain the entire range of provided parameters.
+ * match_exception	- iterates the exception list trying to find a complete match
  * @exceptions: list of exceptions
  * @type: device type (DEV_BLOCK or DEV_CHAR)
  * @major: device file major number, ~0 to match all
  * @minor: device file minor number, ~0 to match all
  * @access: permission mask (ACC_READ, ACC_WRITE, ACC_MKNOD)
  *
- * returns: true in case it matches an exception completely
+ * It is considered a complete match if an exception is found that will
+ * contain the entire range of provided parameters.
+ *
+ * Return: true in case it matches an exception completely
  */
 static bool match_exception(struct list_head *exceptions, short type,
 			    u32 major, u32 minor, short access)
@@ -341,20 +341,19 @@ static bool match_exception(struct list_head *exceptions, short type,
 }
 
 /**
- * match_exception_partial - iterates the exception list trying to match a rule
- * 			     based on type, major, minor and access type. It is
- * 			     considered a match if an exception's range is
- * 			     found to contain *any* of the devices specified by
- * 			     provided parameters. This is used to make sure no
- * 			     extra access is being granted that is forbidden by
- * 			     any of the exception list.
+ * match_exception_partial - iterates the exception list trying to find a partial match
  * @exceptions: list of exceptions
  * @type: device type (DEV_BLOCK or DEV_CHAR)
  * @major: device file major number, ~0 to match all
  * @minor: device file minor number, ~0 to match all
  * @access: permission mask (ACC_READ, ACC_WRITE, ACC_MKNOD)
  *
- * returns: true in case the provided range mat matches an exception completely
+ * It is considered a partial match if an exception's range is found to
+ * contain *any* of the devices specified by provided parameters. This is
+ * used to make sure no extra access is being granted that is forbidden by
+ * any of the exception list.
+ *
+ * Return: true in case the provided range mat matches an exception completely
  */
 static bool match_exception_partial(struct list_head *exceptions, short type,
 				    u32 major, u32 minor, short access)
@@ -387,13 +386,13 @@ static bool match_exception_partial(struct list_head *exceptions, short type,
 }
 
 /**
- * verify_new_ex - verifies if a new exception is part of what is allowed
- *		   by a dev cgroup based on the default policy +
- *		   exceptions. This is used to make sure a child cgroup
- *		   won't have more privileges than its parent
+ * verify_new_ex - verifies if a new exception is allowed by parent cgroup's permissions
  * @dev_cgroup: dev cgroup to be tested against
  * @refex: new exception
  * @behavior: behavior of the exception's dev_cgroup
+ *
+ * This is used to make sure a child cgroup won't have more privileges
+ * than its parent
  */
 static bool verify_new_ex(struct dev_cgroup *dev_cgroup,
 		          struct dev_exception_item *refex,
-- 
cgit v1.2.3


From d2c2b11cfa134f4fbdcc34088824da26a084d8de Mon Sep 17 00:00:00 2001
From: Aristeu Rozanski <aris@redhat.com>
Date: Mon, 5 May 2014 11:18:59 -0400
Subject: device_cgroup: check if exception removal is allowed

[PATCH v3 1/2] device_cgroup: check if exception removal is allowed

When the device cgroup hierarchy was introduced in
	bd2953ebbb53 - devcg: propagate local changes down the hierarchy

a specific case was overlooked. Consider the hierarchy bellow:

	A	default policy: ALLOW, exceptions will deny access
	 \
	  B	default policy: ALLOW, exceptions will deny access

There's no need to verify when an new exception is added to B because
in this case exceptions will deny access to further devices, which is
always fine. Hierarchy in device cgroup only makes sure B won't have
more access than A.

But when an exception is removed (by writing devices.allow), it isn't
checked if the user is in fact removing an inherited exception from A,
thus giving more access to B.

Example:

	# echo 'a' >A/devices.allow
	# echo 'c 1:3 rw' >A/devices.deny
	# echo $$ >A/B/tasks
	# echo >/dev/null
	-bash: /dev/null: Operation not permitted
	# echo 'c 1:3 w' >A/B/devices.allow
	# echo >/dev/null
	#

This shouldn't be allowed and this patch fixes it by making sure to never allow
exceptions in this case to be removed if the exception is partially or fully
present on the parent.

v3: missing '*' in function description
v2: improved log message and formatting fixes

Cc: cgroups@vger.kernel.org
Cc: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 security/device_cgroup.c | 41 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 6b1266dd92bb..9134dbf70d3e 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -462,6 +462,37 @@ static int parent_has_perm(struct dev_cgroup *childcg,
 	return verify_new_ex(parent, ex, childcg->behavior);
 }
 
+/**
+ * parent_allows_removal - verify if it's ok to remove an exception
+ * @childcg: child cgroup from where the exception will be removed
+ * @ex: exception being removed
+ *
+ * When removing an exception in cgroups with default ALLOW policy, it must
+ * be checked if removing it will give the child cgroup more access than the
+ * parent.
+ *
+ * Return: true if it's ok to remove exception, false otherwise
+ */
+static bool parent_allows_removal(struct dev_cgroup *childcg,
+				  struct dev_exception_item *ex)
+{
+	struct dev_cgroup *parent = css_to_devcgroup(css_parent(&childcg->css));
+
+	if (!parent)
+		return true;
+
+	/* It's always allowed to remove access to devices */
+	if (childcg->behavior == DEVCG_DEFAULT_DENY)
+		return true;
+
+	/*
+	 * Make sure you're not removing part or a whole exception existing in
+	 * the parent cgroup
+	 */
+	return !match_exception_partial(&parent->exceptions, ex->type,
+					ex->major, ex->minor, ex->access);
+}
+
 /**
  * may_allow_all - checks if it's possible to change the behavior to
  *		   allow based on parent's rules.
@@ -697,17 +728,21 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
 
 	switch (filetype) {
 	case DEVCG_ALLOW:
-		if (!parent_has_perm(devcgroup, &ex))
-			return -EPERM;
 		/*
 		 * If the default policy is to allow by default, try to remove
 		 * an matching exception instead. And be silent about it: we
 		 * don't want to break compatibility
 		 */
 		if (devcgroup->behavior == DEVCG_DEFAULT_ALLOW) {
+			/* Check if the parent allows removing it first */
+			if (!parent_allows_removal(devcgroup, &ex))
+				return -EPERM;
 			dev_exception_rm(devcgroup, &ex);
-			return 0;
+			break;
 		}
+
+		if (!parent_has_perm(devcgroup, &ex))
+			return -EPERM;
 		rc = dev_exception_add(devcgroup, &ex);
 		break;
 	case DEVCG_DENY:
-- 
cgit v1.2.3


From 36c38fb7144aa941dc072ba8f58b2dbe509c0345 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 5 May 2014 12:37:30 -0400
Subject: blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()

During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
which nests above both the kernfs s_active protection and cgroup_mutex
is added to synchronize cgroup file type operations as cgroup_mutex
needed to be grabbed from some file operations and thus can't be put
above s_active protection.

While this arrangement mostly worked for cgroup, this triggered the
following lockdep warning.

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G        W
  -------------------------------------------------------
  trinity-c173/9024 is trying to acquire lock:
  (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)

  but task is already holding lock:
  (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (s_active#89){++++.+}:
  lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
  __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
  kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
  cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
  cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
  rebind_subsystems (kernel/cgroup.c:1144)
  cgroup_setup_root (kernel/cgroup.c:1568)
  cgroup_mount (kernel/cgroup.c:1716)
  mount_fs (fs/super.c:1094)
  vfs_kern_mount (fs/namespace.c:899)
  do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
  SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
  tracesys (arch/x86/kernel/entry_64.S:746)

  -> #1 (cgroup_tree_mutex){+.+.+.}:
  lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
  mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
  cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
  blkcg_policy_register (block/blk-cgroup.c:1106)
  throtl_init (block/blk-throttle.c:1694)
  do_one_initcall (init/main.c:789)
  kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
  kernel_init (init/main.c:935)
  ret_from_fork (arch/x86/kernel/entry_64.S:552)

  -> #0 (blkcg_pol_mutex){+.+.+.}:
  __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
  lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
  mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
  blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
  cgroup_file_write (kernel/cgroup.c:2714)
  kernfs_fop_write (fs/kernfs/file.c:295)
  vfs_write (fs/read_write.c:532)
  SyS_write (fs/read_write.c:584 fs/read_write.c:576)
  tracesys (arch/x86/kernel/entry_64.S:746)

  other info that might help us debug this:

  Chain exists of:
  blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(s_active#89);
				 lock(cgroup_tree_mutex);
				 lock(s_active#89);
    lock(blkcg_pol_mutex);

   *** DEADLOCK ***

  4 locks held by trinity-c173/9024:
  #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
  #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
  #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
  #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

  stack backtrace:
  CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G        W     3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
   ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
   ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
   ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
  Call Trace:
  dump_stack (lib/dump_stack.c:52)
  print_circular_bug (kernel/locking/lockdep.c:1216)
  __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
  lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
  mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
  blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
  cgroup_file_write (kernel/cgroup.c:2714)
  kernfs_fop_write (fs/kernfs/file.c:295)
  vfs_write (fs/read_write.c:532)
  SyS_write (fs/read_write.c:584 fs/read_write.c:576)

This is a highly unlikely but valid circular dependency between "echo
1 > blkcg.reset_stats" and cfq module [un]loading.  cgroup is going
through further locking update which will remove this complication but
for now let's use trylock on blkcg_pol_mutex and retry the file
operation if the trylock fails.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com
---
 block/blk-cgroup.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index e4a4145926f6..1039fb9ff5f5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -451,7 +451,20 @@ static int blkcg_reset_stats(struct cgroup_subsys_state *css,
 	struct blkcg_gq *blkg;
 	int i;
 
-	mutex_lock(&blkcg_pol_mutex);
+	/*
+	 * XXX: We invoke cgroup_add/rm_cftypes() under blkcg_pol_mutex
+	 * which ends up putting cgroup's internal cgroup_tree_mutex under
+	 * it; however, cgroup_tree_mutex is nested above cgroup file
+	 * active protection and grabbing blkcg_pol_mutex from a cgroup
+	 * file operation creates a possible circular dependency.  cgroup
+	 * internal locking is planned to go through further simplification
+	 * and this issue should go away soon.  For now, let's trylock
+	 * blkcg_pol_mutex and restart the write on failure.
+	 *
+	 * http://lkml.kernel.org/g/5363C04B.4010400@oracle.com
+	 */
+	if (!mutex_trylock(&blkcg_pol_mutex))
+		return restart_syscall();
 	spin_lock_irq(&blkcg->lock);
 
 	/*
-- 
cgit v1.2.3