linux-toradex.git/kernel, branch v2.6.25.17

posix-timers: do_schedule_next_timer: fix the setting of ->si_overrun

2008-08-20T18:15:24+00:00

commit 54da1174922cddd4be83d5a364b2e0fdd693f513 upstream

do_schedule_next_timer() sets info->si_overrun = timr->it_overrun_last,
this discards the already accumulated overruns.

Signed-off-by: Oleg Nesterov 
Cc: Mark McLoughlin 
Cc: Oliver Pinter 
Cc: Roland McGrath 
Cc: Andrew Morton 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

posix-timers: fix posix_timer_event() vs dequeue_signal() race

2008-08-20T18:15:24+00:00

commit ba661292a2bc6ddd305a212b0526e5dc22195fe7 upstream

The bug was reported and analysed by Mark McLoughlin ,
the patch is based on his and Roland's suggestions.

posix_timer_event() always rewrites the pre-allocated siginfo before sending
the signal. Most of the written info is the same all the time, but memset(0)
is very wrong. If ->sigq is queued we can race with collect_signal() which
can fail to find this siginfo looking at .si_signo, or copy_siginfo() can
copy the wrong .si_code/si_tid/etc.

In short, sys_timer_settime() can in fact stop the active timer, or the user
can receive the siginfo with the wrong .si_xxx values.

Move "memset(->info, 0)" from posix_timer_event() to alloc_posix_timer(),
change send_sigqueue() to set .si_overrun = 0 when ->sigq is not queued.
It would be nice to move the whole sigq->info initialization from send to
create path, but this is not easy to do without uglifying timer_create()
further.

As Roland rightly pointed out, we need more cleanups/fixes here, see the
"FIXME" comment in the patch. Hopefully this patch makes sense anyway, and
it can mask the most bad implications.

Reported-by: Mark McLoughlin 
Signed-off-by: Oleg Nesterov 
Cc: Mark McLoughlin 
Cc: Oliver Pinter 
Cc: Roland McGrath 
Cc: Andrew Morton 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

relay: fix "full buffer with exactly full last subbuffer" accounting problem

2008-08-20T18:15:23+00:00

commit 32194450330be327f3b25bf6b66298bd122599e9 upstream

In relay's current read implementation, if the buffer is completely full
but hasn't triggered the buffer-full condition (i.e. the last write
didn't cross the subbuffer boundary) and the last subbuffer is exactly
full, the subbuffer accounting code erroneously finds nothing available.
This patch fixes the problem.

Signed-off-by: Tom Zanussi 
Cc: Eduard - Gabriel Munteanu 
Cc: Pekka Enberg 
Cc: Jens Axboe 
Cc: Mathieu Desnoyers 
Cc: Andrea Righi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

markers: fix markers read barrier for multiple probes

2008-08-01T18:50:55+00:00

commit 5def9a3a22e09c99717f41ab7f07ec9e1a1f3ec8 upstream

Paul pointed out two incorrect read barriers in the marker handler code in
the path where multiple probes are connected.  Those are ordering reads of
"ptype" (single or multi probe marker), "multi" array pointer, and "multi"
array data access.

It should be ordered like this :

read ptype
smp_rmb()
read multi array pointer
smp_read_barrier_depends()
access data referenced by multi array pointer

The code with a single probe connected (optimized case, does not have to
allocate an array) has correct memory ordering.

It applies to kernel 2.6.26.x, 2.6.25.x and linux-next.

Signed-off-by: Mathieu Desnoyers 
Cc: Paul E. McKenney 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

Fix build on COMPAT platforms when CONFIG_EPOLL is disabled

2008-08-01T18:50:47+00:00

commit 5f17156fc55abac476d180e480bedb0f07f01b14 upstream

Add missing cond_syscall() entry for compat_sys_epoll_pwait.

Signed-off-by: Atsushi Nemoto 
Cc: Davide Libenzi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

hrtimer: prevent migration for raising softirq

2008-07-24T16:14:12+00:00

commit ee3ece830f6db9837f7ac67008f532a8c1e755f4 upstream.

Due to a possible deadlock, the waking of the softirq was pushed outside
of the hrtimer base locks. See commit 0c96c5979a522c3323c30a078a70120e29b5bdbc

Unfortunately this allows the task to migrate after setting up the softirq
and raising it. Since softirqs run a queue that is per-cpu we may raise the
softirq on the wrong CPU and this will keep the queued softirq task from
running.

To solve this issue, this patch disables preemption around the releasing
of the hrtimer lock and raising of the softirq.

Signed-off-by: Steven Rostedt 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

sched: fix cpu hotplug

2008-07-03T03:46:15+00:00

Commit 79c537998d143b127c8c662a403c3356cb885f1c upstream

the CPU hotplug problems (crashes under high-volume unplug+replug
tests) seem to be related to migrate_dead_tasks().

Firstly I added traces to see all tasks being migrated with
migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
pops up (the one with "se == NULL" in the loop of
pick_next_task_fair()) shortly after the traces indicate that some has
been migrated with migrate_dead_tasks()). btw., I can reproduce it
much faster now with just a plain cpu down/up loop.

[disclaimer] Well, unless I'm really missing something important in
this late hour [/desclaimer] pick_next_task() is not something
appropriate for migrate_dead_tasks() :-)

the following change seems to eliminate the problem on my setup
(although, I kept it running only for a few minutes to get a few
messages indicating migrate_dead_tasks() does move tasks and the
system is still ok)

Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

futexes: fix fault handling in futex_lock_pi

2008-07-03T03:46:14+00:00

commit 1b7558e457ed0de61023cfc913d2c342c7c3d9f2 upstream

This patch addresses a very sporadic pi-futex related failure in
highly threaded java apps on large SMP systems.

David Holmes reported that the pi_state consistency check in
lookup_pi_state triggered with his test application. This means that
the kernel internal pi_state and the user space futex variable are out
of sync. First we assumed that this is a user space data corruption,
but deeper investigation revieled that the problem happend because the
pi-futex code is not handling a fault in the futex_lock_pi path when
the user space variable needs to be fixed up.

The fault happens when a fork mapped the anon memory which contains
the futex readonly for COW or the page got swapped out exactly between
the unlock of the futex and the return of either the new futex owner
or the task which was the expected owner but failed to acquire the
kernel internal rtmutex. The current futex_lock_pi() code drops out
with an inconsistent in case it faults and returns -EFAULT to user
space. User space has no way to fixup that state.

When we wrote this code we thought that we could not drop the hash
bucket lock at this point to handle the fault.

After analysing the code again it turned out to be wrong because there
are only two tasks involved which might modify the pi_state and the
user space variable:

 - the task which acquired the rtmutex
 - the pending owner of the pi_state which did not get the rtmutex

Both tasks drop into the fixup_pi_state() function before returning to
user space. The first task which acquired the hash bucket lock faults
in the fixup of the user space variable, drops the spinlock and calls
futex_handle_fault() to fault in the page. Now the second task could
acquire the hash bucket lock and tries to fixup the user space
variable as well. It either faults as well or it succeeds because the
first task already faulted the page in.

One caveat is to avoid a double fixup. After returning from the fault
handling we reacquire the hash bucket lock and check whether the
pi_state owner has been modified already.

Reported-by: David Holmes 
Signed-off-by: Thomas Gleixner 
Cc: Andrew Morton 
Cc: David Holmes 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

capabilities: remain source compatible with 32-bit raw legacy capability support.

2008-06-09T18:27:06+00:00

upstream commit: ca05a99a54db1db5bca72eccb5866d2a86f8517f

Source code out there hard-codes a notion of what the
_LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the
raw capability system calls capget() and capset().  Its unfortunate, but
true.

Since the confusing header file has been in a released kernel, there is
software that is erroneously using 64-bit capabilities with the semantics
of 32-bit compatibilities.  These recently compiled programs may suffer
corruption of their memory when sys_getcap() overwrites more memory than
they are coded to expect, and the raising of added capabilities when using
sys_capset().

As such, this patch does a number of things to clean up the situation
for all. It

  1. forces the _LINUX_CAPABILITY_VERSION define to always retain its
     legacy value.

  2. adopts a new #define strategy for the kernel's internal
     implementation of the preferred magic.

  3. deprecates v2 capability magic in favor of a new (v3) magic
     number. The functionality of v3 is entirely equivalent to v2,
     the only difference being that the v2 magic causes the kernel
     to log a "deprecated" warning so the admin can find applications
     that may be using v2 inappropriately.

[User space code continues to be encouraged to use the libcap API which
protects the application from details like this.  libcap-2.10 is the first
to support v3 capabilities.]

Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518.
Thanks to Bojan Smojver for the report.

[akpm@linux-foundation.org: s/depreciate/deprecate/g]
[akpm@linux-foundation.org: be robust about put_user size]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan 
Cc: Serge E. Hallyn 
Cc: Bojan Smojver 
Cc: stable@kernel.org
Signed-off-by: Andrew Morton 
Signed-off-by: Chris Wright

cgroups: remove node_ prefix_from ns subsystem

2008-06-09T18:27:01+00:00

upstream commit: 5c02b575780d0d785815a1e7b79a98edddee895a

This is a slight change in the namespace cgroup subsystem api.

The change is that previously when cgroup_clone() was called (currently
only from the unshare path in ns_proxy cgroup, you'd get a new group named
"node_$pid" whereas now you'll get a group named after just your pid.)

The only users who would notice it are those who are using the ns_proxy
cgroup subsystem to auto-create cgroups when namespaces are unshared -
something of an experimental feature, which I think really needs more
complete container/namespace support in order to be useful.  I suspect the
only users are Cedric and Serge, or maybe a few others on
containers@lists.linux-foundation.org.  And in fact it would only be
noticed by the users who make the assumption about how the name is
generated, rather than getting it from the /proc//cgroups file for
the process in question.

Whether the change is actually needed or not I'm fairly agnostic on, but I
guess it is more elegant to just use the pid as the new group name rather
than adding a fairly arbitrary "node_" prefix on the front.

[menage@google.com: provided changelog]
Signed-off-by: Cedric Le Goater 
Cc: "Paul Menage" 
Cc: "Serge E. Hallyn" 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright