linux-toradex.git/kernel, branch tegra-10.9.9

[kernel-cgroups] Fix cgroups soft lockup issue.

2010-10-05T21:37:14+00:00

In kernel-2.6.32, soft lockups are sometimes seen with cgroup file locking.
Some of these locking issues are fixed in kernel-2.6.34. This change pulls
the cgroup-related changes from kernel-2.6.34 and integrates them with
2.6.32. With this change, soft lockups with cgroup file locking are no
longer observed.

Bug 714293
Bug 703146

Change-Id: I8debb33d1edb34abdea3169e37a7b4f0ec302f40

cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput()

The LTP cgroup test suite generates a "kernel BUG at kernel/cgroup.c:790!"
here in cgroup_diput():

                 /*
                  * if we're getting rid of the cgroup, refcount should ensure
                  * that there are no pidlists left.
                  */
                 BUG_ON(!list_empty(&cgrp->pidlists));

The cgroup pidlist rework in 2.6.32 generates the BUG_ON, which is caused
when pidlist_array_load() calls cgroup_pidlist_find():

(1) if a matching cgroup_pidlist is found, it down_write's the mutex of the
     pre-existing cgroup_pidlist, and increments its use_count.
(2) if no matching cgroup_pidlist is found, then a new one is allocated, it
     down_write's its mutex, and the use_count is set to 0.
(3) the matching, or new, cgroup_pidlist gets returned back to pidlist_array_load(),
     which increments its use_count -- regardless whether new or pre-existing --
     and up_write's the mutex.

So if a matching list is ever encountered by cgroup_pidlist_find() during
the life of a cgroup directory, it results in an inflated use_count value,
preventing it from ever getting released by cgroup_release_pid_array().
Then if the directory is subsequently removed, cgroup_diput() hits the
BUG_ON() when it finds that the directory's cgroup is still populated with
a pidlist.

The patch simply removes the use_count increment when a matching pidlist
is found by cgroup_pidlist_find(), because it gets bumped by the calling
pidlist_array_load() function while still protected by the list's mutex.
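
The double increment described above can be modelled in user space. This is only a sketch: struct pidlist_sketch, pidlist_find(), pidlist_load() and pidlist_release() are hypothetical stand-ins for the kernel structures and functions, locking is omitted, and the "buggy" flag exists purely to compare the old and new behaviour.

```c
#include <stddef.h>

#define MAX_LISTS 4

/* Hypothetical, simplified stand-in for struct cgroup_pidlist. */
struct pidlist_sketch {
    int in_use;     /* slot is allocated */
    int key;        /* stands in for the lookup key of the list */
    int use_count;  /* number of open handles on this list */
};

/* Find a matching list or allocate a new one with use_count == 0.
 * With buggy != 0 the found path also bumps use_count (old behaviour). */
static struct pidlist_sketch *pidlist_find(struct pidlist_sketch *lists,
                                           int key, int buggy)
{
    size_t i;

    for (i = 0; i < MAX_LISTS; i++) {
        if (lists[i].in_use && lists[i].key == key) {
            if (buggy)
                lists[i].use_count++;
            return &lists[i];
        }
    }
    for (i = 0; i < MAX_LISTS; i++) {
        if (!lists[i].in_use) {
            lists[i].in_use = 1;
            lists[i].key = key;
            lists[i].use_count = 0;
            return &lists[i];
        }
    }
    return NULL;
}

/* The caller takes its reference regardless of new or pre-existing. */
static struct pidlist_sketch *pidlist_load(struct pidlist_sketch *lists,
                                           int key, int buggy)
{
    struct pidlist_sketch *l = pidlist_find(lists, key, buggy);

    if (l)
        l->use_count++;
    return l;
}

/* Drop one reference; the list is freed when the count reaches zero. */
static void pidlist_release(struct pidlist_sketch *l)
{
    if (--l->use_count == 0)
        l->in_use = 0;
}

/* Open the same list twice, close it twice, count what is left behind. */
static int lists_left_after_two_opens(int buggy)
{
    struct pidlist_sketch lists[MAX_LISTS] = { { 0, 0, 0 } };
    struct pidlist_sketch *a = pidlist_load(lists, 1, buggy);
    struct pidlist_sketch *b = pidlist_load(lists, 1, buggy);
    int i, left = 0;

    if (a)
        pidlist_release(a);
    if (b)
        pidlist_release(b);
    for (i = 0; i < MAX_LISTS; i++)
        left += lists[i].in_use;
    return left;
}
```

In this model, lists_left_after_two_opens(1) leaves one list allocated (the situation that trips the BUG_ON()), while lists_left_after_two_opens(0) leaves none.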

Signed-off-by: Dave Anderson 
Reviewed-by: Li Zefan 
Acked-by: Ben Blum 
Cc: Paul Menage 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

cgroups: fix to return errno in a failure path

In cgroup_create(), if alloc_css_id() returns failure, the errno is not
propagated to userspace, so mkdir will fail silently.

To trigger this bug, we mount blkio (or the memory subsystem), and create
more than 65534 cgroups.  (The number of cgroups is limited to 65535 if a
subsystem has use_id == 1)

 # mount -t cgroup -o blkio xxx /mnt
 # for ((i = 0; i < 65534; i++)); do mkdir /mnt/$i; done
 # mkdir /mnt/65534
 (should return ENOSPC)
 #
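
A minimal sketch of this kind of failure path, assuming a simplified model in which the allocation result is tested but never stored in the returned error variable. alloc_id_sketch(), create_buggy() and create_fixed() are hypothetical names, not the kernel functions.

```c
#include <errno.h>

/* Hypothetical stand-in for alloc_css_id(): fails once the id space is full. */
static int alloc_id_sketch(int used, int limit)
{
    return used >= limit ? -ENOSPC : 0;
}

/* Old shape: the failure is detected, but err still holds 0. */
static int create_buggy(int used, int limit)
{
    int err = 0;

    if (alloc_id_sketch(used, limit))
        goto err_destroy;
    return 0;

err_destroy:
    return err;         /* returns 0, so the caller sees "success" */
}

/* Fixed shape: keep the errno so it can be propagated to the caller. */
static int create_fixed(int used, int limit)
{
    int err;

    err = alloc_id_sketch(used, limit);
    if (err)
        goto err_destroy;
    return 0;

err_destroy:
    return err;         /* -ENOSPC reaches the caller */
}
```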

Signed-off-by: Li Zefan 
Acked-by: Serge Hallyn 
Acked-by: Paul Menage 
Acked-by: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

sched, cgroups: Fix module export

I have exported it in d11c563 - but cgroups.c did not have module.h included ...

Cc: Paul E. McKenney 
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar 

cgroup: introduce cancel_attach()

Add cancel_attach() operation to struct cgroup_subsys.  cancel_attach()
can be used when can_attach() operation prepares something for the subsys,
but we should rollback what can_attach() operation has prepared if attach
task fails after we've succeeded in can_attach().
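
The rollback idea can be sketched as follows. This is an illustration only: struct subsys_sketch and the three functions are hypothetical, and the real callbacks take cgroup and task arguments that are left out here.

```c
#include <stddef.h>

/* Hypothetical subsystem state: "reserved" models whatever can_attach()
 * prepares, "fail_can_attach" forces a failure for demonstration. */
struct subsys_sketch {
    int reserved;
    int fail_can_attach;
};

static int can_attach_sketch(struct subsys_sketch *ss)
{
    if (ss->fail_can_attach)
        return -1;
    ss->reserved++;
    return 0;
}

/* Undo what can_attach_sketch() prepared. */
static void cancel_attach_sketch(struct subsys_sketch *ss)
{
    ss->reserved--;
}

/* Run can_attach on every subsystem, then the attach step itself.
 * On any failure, cancel only the subsystems that already succeeded. */
static int attach_task_sketch(struct subsys_sketch *ss, size_t n,
                              int attach_fails)
{
    size_t i, prepared = 0;
    int ret = 0;

    for (i = 0; i < n; i++) {
        ret = can_attach_sketch(&ss[i]);
        if (ret)
            goto rollback;
        prepared++;
    }
    if (attach_fails) {
        ret = -1;
        goto rollback;
    }
    return 0;

rollback:
    for (i = 0; i < prepared; i++)
        cancel_attach_sketch(&ss[i]);
    return ret;
}
```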

Change-Id: I04a834952591179843f925e7db719df4d82a69bf
Signed-off-by: Daisuke Nishimura 
Acked-by: Li Zefan 
Reviewed-by: Paul Menage 
Cc: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Daisuke Nishimura 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

cgroup: introduce coalesce css_get() and css_put()

Current css_get() and css_put() increment/decrement css->refcnt one by
one.

This patch adds a new function __css_get(), which takes "count" as an arg
and increments the css->refcnt by "count".  This patch also adds a new
arg ("count") to __css_put() and changes the function to decrement the
css->refcnt by "count".

These coalesced versions of __css_get()/__css_put() will be used to improve
the performance of memcg's moving charge feature later, where these new
functions will be used instead of calling css_get()/css_put() repeatedly.

No change is needed for current users of css_get()/css_put().
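
A user-space sketch of the coalescing idea, using C11 atomics in place of the kernel's atomic_t. struct css_sketch and the function names are hypothetical, and the release handling of the real __css_put() is reduced to a "last reference dropped" return value.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for struct cgroup_subsys_state. */
struct css_sketch {
    atomic_int refcnt;
};

/* Take "count" references with a single atomic operation. */
static void css_get_many(struct css_sketch *css, int count)
{
    atomic_fetch_add(&css->refcnt, count);
}

/* Drop "count" references at once; returns 1 if the counter hit zero. */
static int css_put_many(struct css_sketch *css, int count)
{
    return atomic_fetch_sub(&css->refcnt, count) == count;
}

/* Existing single-reference users keep their interface unchanged. */
static void css_get_one(struct css_sketch *css)
{
    css_get_many(css, 1);
}

static int css_put_one(struct css_sketch *css)
{
    return css_put_many(css, 1);
}
```

A caller that would otherwise loop over css_get_one() N times can call css_get_many(css, N) once, which is the saving the commit is after.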

Signed-off-by: Daisuke Nishimura 
Acked-by: Paul Menage 
Cc: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Li Zefan 
Cc: Daisuke Nishimura 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

cgroups: revamp subsys array

This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree.  This is
mainly useful for classifiers and subsystems that hook into components
that are already modules.  cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.

It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.

Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear.  Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.

Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.

Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.

Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.

Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves.  In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy).  The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.

This patch:

Make subsys[] able to be dynamically populated to support modular
subsystems

This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.
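
One way to picture a subsys[] array with a fixed built-in part and a dynamic modular part is sketched below. The slot counts, struct subsys_entry and all function names are assumptions for illustration, and the cgroup_mutex protection mentioned above is omitted.

```c
#include <stddef.h>

#define BUILTIN_COUNT 2   /* hypothetical number of built-in subsystems */
#define SUBSYS_SLOTS  4   /* hypothetical total number of slots */

struct subsys_entry {
    const char *name;
};

/* Iterations must tolerate empty (NULL) slots in the array. */
static int subsys_count_present(struct subsys_entry *const *slots)
{
    int i, n = 0;

    for (i = 0; i < SUBSYS_SLOTS; i++) {
        if (!slots[i])
            continue;       /* modular subsystem not loaded */
        n++;
    }
    return n;
}

/* Register a modular subsystem in the first free dynamic slot.
 * Returns the slot index, or -1 if the dynamic section is full. */
static int subsys_load_sketch(struct subsys_entry **slots,
                              struct subsys_entry *ss)
{
    int i;

    for (i = BUILTIN_COUNT; i < SUBSYS_SLOTS; i++) {
        if (!slots[i]) {
            slots[i] = ss;
            return i;
        }
    }
    return -1;
}

/* Remove a modular subsystem; built-in slots are never cleared. */
static void subsys_unload_sketch(struct subsys_entry **slots, int id)
{
    if (id >= BUILTIN_COUNT && id < SUBSYS_SLOTS)
        slots[id] = NULL;
}
```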

Signed-off-by: Ben Blum 
Acked-by: Li Zefan 
Cc: Paul Menage 
Cc: "David S. Miller" 
Cc: KAMEZAWA Hiroyuki 
Cc: Lai Jiangshan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

cgroups: subsystem module loading interface

Add interface between cgroups subsystem management and module loading

This patch implements rudimentary module-loading support for cgroups -
namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
module initcall, and a struct module pointer in struct cgroup_subsys.

Several functions that might be wanted by modules have had EXPORT_SYMBOL
added to them, but it's unclear exactly which functions want it and which
won't.

Signed-off-by: Ben Blum 
Acked-by: Li Zefan 
Cc: Paul Menage 
Cc: "David S. Miller" 
Cc: KAMEZAWA Hiroyuki 
Cc: Lai Jiangshan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

cgroups: subsystem module unloading

Provides support for unloading modular subsystems.

This patch adds a new function cgroup_unload_subsys which is to be used
for removing a loaded subsystem during module deletion.  Reference
counting of the subsystems' modules is moved from once (at load time) to
once per attached hierarchy (in parse_cgroupfs_options and
rebind_subsystems) (i.e., 0 or 1).

Signed-off-by: Ben Blum 
Acked-by: Li Zefan 
Cc: Paul Menage 
Cc: "David S. Miller" 
Cc: KAMEZAWA Hiroyuki 
Cc: Lai Jiangshan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

cgroups: clean up cgroup_pidlist_find() a bit

Don't call get_pid_ns() before we locate/alloc the ns.

Signed-off-by: Li Zefan 
Cc: Serge Hallyn 
Acked-by: Paul Menage 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

cgroup: implement eventfd-based generic API for notifications

This patchset introduces eventfd-based API for notifications in cgroups
and implements memory notifications on top of it.

It uses statistics in the memory controller to track memory usage.

Output of time(1) on building kernel on tmpfs:

Root cgroup before changes:
	make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
Non-root cgroup before changes:
	make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
Root cgroup after changes (0 thresholds):
	make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
Non-root cgroup after changes (0 thresholds):
	make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
Root cgroup after changes (1 thresholds, never crossed):
	make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
Non-root cgroup after changes (1 thresholds, never crossed):
	make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total

This patch:

Introduce the write-only file "cgroup.event_control" in every cgroup.

To register a new notification handler you need to:
- create an eventfd;
- open a control file to be monitored. Callbacks register_event() and
  unregister_event() must be defined for the control file;
- write "<event_fd> <control_fd> <args>" to cgroup.event_control.
  Interpretation of args is defined by the control file implementation.

The eventfd will be woken up by the control file implementation or when the
cgroup is removed.

To unregister a notification handler, just close the eventfd.

If you need notification functionality for a control file you have to
implement callbacks register_event() and unregister_event() in the
struct cftype.
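
From user space, the registration step boils down to writing one line to cgroup.event_control. The helper below is a sketch that only builds that line, assuming the format is the eventfd descriptor, the control file descriptor and the arguments separated by single spaces. format_event_control() is a hypothetical helper, not part of any library.

```c
#include <stdio.h>

/* Build the registration string for cgroup.event_control.
 * Returns the string length, or -1 if the buffer is too small. */
static int format_event_control(char *buf, size_t len,
                                int event_fd, int control_fd,
                                const char *args)
{
    int n = snprintf(buf, len, "%d %d %s", event_fd, control_fd, args);

    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

A caller would typically obtain event_fd from eventfd(2) and control_fd from open(2) on the control file, write(2) the resulting string to cgroup.event_control, and then block in read(2) on the eventfd until a notification arrives.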

[kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
Signed-off-by: Kirill A. Shutemov 
Reviewed-by: KAMEZAWA Hiroyuki 
Paul Menage 
Cc: Li Zefan 
Cc: Balbir Singh 
Cc: Pavel Emelyanov 
Cc: Dan Malek 
Cc: Vladislav Buzov 
Cc: Daisuke Nishimura 
Cc: Alexander Shishkin 
Cc: Davide Libenzi 
Signed-off-by: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Change-Id: I73060a6a951a86a6e2b5d03d36e1777aaf4f9fe2

cgroups: fix race between userspace and kernelspace

Notify userspace about cgroup removal only after rmdir of the cgroup
directory to avoid a race between userspace and kernelspace.

eventfds are used to notify about two types of event:
 - control file-specific, like crossing a memory threshold;
 - cgroup removal.

To understand what really happened, userspace can check if the cgroup still
exists.  To avoid a race between userspace and kernelspace we have to
notify userspace about cgroup removal only after rmdir of the cgroup
directory.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: KAMEZAWA Hiroyuki 
Cc: Paul Menage 
Acked-by: Li Zefan 
Cc: Balbir Singh 
Cc: Pavel Emelyanov 
Cc: Dan Malek 
Cc: Daisuke Nishimura 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

cgroups: remove duplicate include

commit e6a1105b ("cgroups: subsystem module loading interface") and commit
c50cc752 ("sched, cgroups: Fix module export") result in a duplicate
inclusion of module.h

Signed-off-by: Li Zefan 
Acked-by: Paul Menage 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

cgroup: Fix an RCU warning in alloc_css_id()

With CONFIG_PROVE_RCU=y, a warning can be triggered:

  # mount -t cgroup -o memory xxx /mnt
  # mkdir /mnt/0

...
kernel/cgroup.c:4442 invoked rcu_dereference_check() without protection!
...

This is a false-positive. It's safe to directly access parent_css->id.

Signed-off-by: Li Zefan 
Signed-off-by: Paul E. McKenney 

(cherry picked from commit cf6e8b67221b96d73c0a9dfb71856627ab4bd1b7)

Change-Id: Ibc9f6adcb55e37834ce030036c83b0405afc13f4
Reviewed-on: http://git-master/r/7823
Reviewed-by: Hanumanth Venkateswa Moganty 
Tested-by: Hanumanth Venkateswa Moganty 
Reviewed-by: Yu-Huan Hsu

mutex: Don't spin when the owner CPU is offline or other weird cases

2010-07-28T17:02:56+00:00

Due to recent load-balancer changes that delay the task migration to
the next wakeup, the adaptive mutex spinning ends up in a live lock
when the owner's CPU gets offlined because the cpu_online() check
lives before the owner running check.

This patch changes mutex_spin_on_owner() to return 0 (don't spin) in
any case where we aren't sure about the owner struct validity or CPU
number, and if the said CPU is offline. There is no point in going back
and re-evaluating spinning in corner cases like that; let's just go to
sleep.
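
The "don't spin unless everything checks out" rule can be sketched as a predicate. This is a simplification under stated assumptions: struct owner_sketch and should_spin() are hypothetical, the online map is a plain array, and the real function spins in a loop rather than returning a single decision.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the mutex owner as seen by the spinner. */
struct owner_sketch {
    bool valid;     /* owner struct can be trusted */
    int cpu;        /* CPU the owner was last seen on */
};

/* Return false (go to sleep) in every doubtful case, and only spin
 * when the owner is known to be running on an online CPU. */
static bool should_spin(const struct owner_sketch *owner,
                        const bool *cpu_online, int nr_cpus,
                        bool owner_running)
{
    if (owner == NULL || !owner->valid)
        return false;               /* unsure about the owner struct */
    if (owner->cpu < 0 || owner->cpu >= nr_cpus)
        return false;               /* bogus CPU number */
    if (!cpu_online[owner->cpu])
        return false;               /* owner's CPU went offline */
    return owner_running;
}
```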

Cherry-picked commit: 4b402210486c6414fe5fbfd85934a0a22da56b04
URL: http://android.git.kernel.org/?p=kernel/linux-2.6.git;a=summary
Kernel version picked from: v2.6.34

Re-enable HAVE_DEFAULT_NO_SPIN_MUTEXES as root-cause of spin-lock is
now fixed in sched.c

For bug 713808

Change-Id: I06a7c85aa46be3cdd27da0a4e62ffa442a9805b4
Reviewed-on: http://git-master/r/4500
Tested-by: Bharat Nihalani 
Reviewed-by: Gary King

sched: Add a generic notifier when a task struct is about to be freed

2010-05-23T21:43:11+00:00

This patch adds a notifier which can be used by subsystems that may
be interested in when a task has completely died and is about to
have its last resource freed.

The Android lowmemory killer uses this to determine when a task
it has killed has finally given up its goods.

Signed-off-by: San Mehat

nohz: Allow 32-bit machines to sleep for more than 2.15 seconds

2010-05-18T02:41:40+00:00

In the dynamic tick code, "max_delta_ns" (member of the
"clock_event_device" structure) represents the maximum sleep time
that can occur between timer events in nanoseconds.

The variable, "max_delta_ns", is defined as an unsigned long
which is a 32-bit integer for 32-bit machines and a 64-bit
integer for 64-bit machines (if -m64 option is used for gcc).
The value of max_delta_ns is set by calling the function
"clockevent_delta2ns()" which returns a maximum value of LONG_MAX.
For a 32-bit machine LONG_MAX is equal to 0x7fffffff and in
nanoseconds this equates to ~2.15 seconds. Hence, the maximum
sleep time for a 32-bit machine is ~2.15 seconds, whereas for
a 64-bit machine it will be many years.

This patch changes the type of max_delta_ns to be "u64" instead of
"unsigned long" so that this variable is a 64-bit type for both 32-bit
and 64-bit machines. It also changes the maximum value returned by
clockevent_delta2ns() to KTIME_MAX.  Hence this allows a 32-bit
machine to sleep for longer than ~2.15 seconds. Please note that this
patch also changes "min_delta_ns" to be "u64" too and although this is
unnecessary, it makes the patch simpler as it avoids having to fix up all
callers of clockevent_delta2ns().
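
The arithmetic behind the ~2.15 second limit can be checked with a few lines of C. This is a sketch using fixed-width types to stand in for a 32-bit "unsigned long" bound and a signed 64-bit nanosecond bound; the function names are made up for illustration.

```c
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000.0

/* Longest sleep when the nanosecond delta is capped at 0x7fffffff. */
static double max_sleep_seconds_32bit(void)
{
    return (double)INT32_MAX / NSEC_PER_SEC_SKETCH;
}

/* Longest sleep, in years, when the cap is a signed 64-bit value. */
static double max_sleep_years_64bit(void)
{
    double seconds = (double)INT64_MAX / NSEC_PER_SEC_SKETCH;

    return seconds / (365.25 * 24.0 * 3600.0);
}
```

With these definitions the 32-bit cap works out to roughly 2.147 seconds and the 64-bit cap to roughly 292 years.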

[ tglx: changed "unsigned long long" to u64 as we use this data type
  throughout the time code ]

Signed-off-by: Jon Hunter 
Cc: John Stultz 
LKML-Reference: <1250617512-23567-3-git-send-email-jon-hunter@ti.com>
Signed-off-by: Thomas Gleixner

clockevents: Use u32 for mult and shift factors

2010-05-18T02:41:39+00:00

The mult and shift factors of clock events differ in their data type
from those of clock sources for no reason. u32 is sufficient for
both. shift is always <= 32 and mult is limited to 2^32-1 to avoid
64bit multiplication overflows in the conversion.
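
A sketch of why 32-bit factors are enough: with a 32-bit nanosecond delta and a 32-bit mult, the product always fits in 64 bits before the shift. The helper names are hypothetical, and the mult calculation assumes a device frequency below 1 GHz so that the result fits in 32 bits.

```c
#include <stdint.h>

/* Derive mult for a given frequency and shift: (freq << shift) / 1e9.
 * Assumes freq_hz < 1000000000 and shift <= 32. */
static uint32_t calc_mult_sketch(uint32_t freq_hz, uint32_t shift)
{
    return (uint32_t)(((uint64_t)freq_hz << shift) / 1000000000ULL);
}

/* Convert a nanosecond delta to device cycles: (ns * mult) >> shift.
 * A 32-bit ns times a 32-bit mult cannot overflow 64 bits. */
static uint64_t ns_to_cycles_sketch(uint32_t ns, uint32_t mult,
                                    uint32_t shift)
{
    return ((uint64_t)ns * mult) >> shift;
}
```

For a 1 MHz device with shift = 32, calc_mult_sketch() gives 4294967, and one millisecond converts to 999 cycles because the truncated mult rounds the result down.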

Preparatory patch for a generic mult/shift factor calculation
function.

Signed-off-by: Thomas Gleixner 
Tested-by: Mikael Pettersson 
Acked-by: Ralf Baechle 
Acked-by: Linus Walleij 
Cc: John Stultz 
LKML-Reference: <20091111134229.725664788@linutronix.de>

kernel: Mapped irq chip default_disable to chip->mask.

2010-04-09T03:07:35+00:00

Despite the claim in the struct irq_chip header, "disable: disable the
interrupt (defaults to chip->mask if NULL)", this does not happen because
default_disable is empty. Fixed it. Should also fix bug 667376.
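
The intended fallback can be sketched like this. struct irq_chip_sketch, demo_mask() and chip_disable_sketch() are hypothetical and model only the "use mask when disable is NULL" behaviour, not the kernel's actual irq_chip layout.

```c
#include <stddef.h>

/* Hypothetical, reduced version of an irq chip with two callbacks. */
struct irq_chip_sketch {
    void (*mask)(unsigned int irq);
    void (*disable)(unsigned int irq);
};

/* Demo state: one bit per masked interrupt line (irq < 32 assumed). */
static unsigned long masked_bits;

static void demo_mask(unsigned int irq)
{
    masked_bits |= 1UL << irq;
}

/* Disable an interrupt: use the chip's own disable callback if it has
 * one, otherwise fall back to masking instead of doing nothing. */
static void chip_disable_sketch(const struct irq_chip_sketch *chip,
                                unsigned int irq)
{
    if (chip->disable)
        chip->disable(irq);
    else if (chip->mask)
        chip->mask(irq);
}
```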

Change-Id: If0c39e3b4344701bbf235201c180d9c8ce56c489
Reviewed-on: http://git-master/r/947
Tested-by: Aleksandr Frid 
Reviewed-by: Gary King 
Tested-by: Gary King

sched: Fix set_cpu_active() in cpu_down()

2010-04-09T03:07:14+00:00

Sachin found cpu hotplug test failures on powerpc, which made
the kernel hang on his POWER box.

The problem is that we fail to re-activate a cpu when a
hot-unplug fails. Fix this by moving the de-activation into
_cpu_down after doing the initial checks.

Remove the synchronize_sched() calls and rely on those implied
by rebuilding the sched domains using the new mask.

Reported-by: Sachin Sant 
Signed-off-by: Xiaotian Feng 
Tested-by: Sachin Sant 
Signed-off-by: Peter Zijlstra 
Cc: Mike Galbraith 
LKML-Reference: <20091216170517.500272612@chello.nl>
Signed-off-by: Ingo Molnar

Merge commit 'v2.6.32.9' into android-2.6.32

2010-03-11T00:38:33+00:00

Export the symbol of getboottime and monotonic_to_bootbased

2010-02-23T15:37:52+00:00

commit c93d89f3dbf0202bf19c07960ca8602b48c2f9a0 upstream.

Export getboottime and monotonic_to_bootbased so that they can be used
by a following patch.

Signed-off-by: Jason Wang 
Signed-off-by: Marcelo Tosatti 
Signed-off-by: Greg Kroah-Hartman

futex: Handle futex value corruption gracefully

2010-02-23T15:37:43+00:00

commit 59647b6ac3050dd964bc556fe6ef22f4db5b935c upstream.

The WARN_ON in lookup_pi_state which complains about a mismatch
between pi_state->owner->pid and the pid which we retrieved from the
user space futex is completely bogus.

The code just emits the warning and then continues despite the fact
that it detected an inconsistent state of the futex. A convenient way
for user space to spam the syslog.

Replace the WARN_ON by a consistency check. If the values do not match
return -EINVAL and let user space deal with the mess it created.

This also fixes the missing task_pid_vnr() when we compare the
pi_state->owner pid with the futex value.
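
The consistency check can be sketched as a small function. This is a user-space illustration: check_pi_owner_sketch() is a hypothetical name, owner_vpid stands for the result of task_pid_vnr() on the pi_state owner, and the mask value matches the TID field of the futex word.

```c
#include <errno.h>
#include <stdint.h>

/* Low 30 bits of the futex word hold the owner TID. */
#define FUTEX_TID_MASK_SKETCH 0x3fffffffu

/* Compare the TID stored in the user space futex value with the pid of
 * the recorded owner. On mismatch, report -EINVAL instead of warning. */
static int check_pi_owner_sketch(uint32_t uval, int owner_vpid)
{
    uint32_t tid = uval & FUTEX_TID_MASK_SKETCH;

    if (tid != (uint32_t)owner_vpid)
        return -EINVAL;
    return 0;
}
```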

Reported-by: Jermome Marchand 
Signed-off-by: Thomas Gleixner 
Acked-by: Darren Hart 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman