<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/kernel, branch v2.6.33.2</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<id>https://git.toradex.cn/cgit/linux-toradex.git/</id>
<updated>2010-04-01T23:01:51+00:00</updated>
<entry>
<title>softlockup: Stop spurious softlockup messages due to overflow</title>
<updated>2010-04-01T23:01:51+00:00</updated>
<author>
<name>Colin Ian King</name>
<email>colin.king@canonical.com</email>
</author>
<published>2010-03-19T10:28:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f6146e2078dfefc3518cf424aaa4ace9dcb1465a'/>
<id>f6146e2078dfefc3518cf424aaa4ace9dcb1465a</id>
<content type='text'>
commit 8c2eb4805d422bdbf60ba00ff233c794d23c3c00 upstream.

Ensure additions on touch_ts do not overflow.  This can occur
when the top 32 bits of the TSC reach 0xffffffff: additions to
touch_ts then wrap around, and this in turn generates spurious
softlockup warnings.
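
As a hedged illustration of the overflow-safe idiom (not the verbatim
upstream diff), the comparison can be done with unsigned subtraction,
which stays correct across a wrap:

  /* sketch: now and touch_ts are unsigned timestamps in seconds */
  if (now - touch_ts &lt;= softlockup_thresh)
          return;         /* within threshold, no lockup */
  /* rather than: if (now &lt;= touch_ts + softlockup_thresh),
   * whose addition can overflow when touch_ts is very large */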

Signed-off-by: Colin Ian King &lt;colin.king@canonical.com&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
LKML-Reference: &lt;1268994482.1798.6.camel@lenovo&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>genirq: Prevent oneshot irq thread race</title>
<updated>2010-04-01T23:01:50+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2010-03-09T18:45:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=933c45323f9767493143829debff08f4d6988562'/>
<id>933c45323f9767493143829debff08f4d6988562</id>
<content type='text'>
commit 0b1adaa031a55e44f5dd942f234bf09d28e8a0d6 upstream.

Lars-Peter pointed out that the oneshot threaded interrupt handler
code has the following race:

 CPU0                            CPU1
 handle_level_irq(irq X)
   mask_ack_irq(irq X)
   handle_IRQ_event(irq X)
     wake_up(thread_handler)
                                 thread handler(irq X) runs
                                 finalize_oneshot(irq X)
                                 does not unmask due to
                                 !(desc-&gt;status &amp; IRQ_MASKED)

 return from irq
 does not unmask due to
 (desc-&gt;status &amp; IRQ_ONESHOT)

This leaves the interrupt line masked forever.

The reason for this is the inconsistent handling of the IRQ_MASKED
flag. Instead of setting it in the mask function the oneshot support
sets the flag after waking up the irq thread.

The solution for this is to set/clear the IRQ_MASKED status whenever
we mask/unmask an interrupt line. That's the easy part, but that
cleanup opens another race:

 CPU0                            CPU1
 handle_level_irq(irq)
   mask_ack_irq(irq)
   handle_IRQ_event(irq)
     wake_up(thread_handler)
                                 thread handler(irq) runs
                                 finalize_oneshot_irq(irq)
                                 unmask(irq)
     irq triggers again
     handle_level_irq(irq)
       mask_ack_irq(irq)
     return from irq due to IRQ_INPROGRESS

 return from irq
 does not unmask due to
 (desc-&gt;status &amp; IRQ_ONESHOT)

This requires that we synchronize finalize_oneshot_irq() with the
primary handler. If IRQ_INPROGRESS is set, we wait until the primary
handler on the other CPU has returned before unmasking the interrupt
line again.
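
A sketch of that synchronization, using the 2.6.33 genirq names (a
hedged illustration, not the verbatim upstream diff):

 static void irq_finalize_oneshot(unsigned int irq, struct irq_desc *desc)
 {
 again:
         raw_spin_lock_irq(&amp;desc-&gt;lock);
         /* primary handler still running on another CPU: wait */
         if (desc-&gt;status &amp; IRQ_INPROGRESS) {
                 raw_spin_unlock_irq(&amp;desc-&gt;lock);
                 cpu_relax();
                 goto again;
         }
         /* only unmask if nothing disabled the line meanwhile */
         if (!(desc-&gt;status &amp; IRQ_DISABLED) &amp;&amp; (desc-&gt;status &amp; IRQ_MASKED)) {
                 desc-&gt;status &amp;= ~IRQ_MASKED;
                 desc-&gt;chip-&gt;unmask(irq);
         }
         raw_spin_unlock_irq(&amp;desc-&gt;lock);
 }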

We probably have never seen that problem because it does not happen on
UP and on SMP the irqbalancer protects us by pinning the primary
handler and the thread to the same CPU.

Reported-by: Lars-Peter Clausen &lt;lars@metafoo.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node</title>
<updated>2010-04-01T23:01:50+00:00</updated>
<author>
<name>Miao Xie</name>
<email>miaox@cn.fujitsu.com</email>
</author>
<published>2010-03-23T20:35:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8147bcf0da9a74b7dbec29bb3e489967f91f3c94'/>
<id>8147bcf0da9a74b7dbec29bb3e489967f91f3c94</id>
<content type='text'>
commit 5ab116c9349ef52d6fbd2e2917a53f13194b048e upstream.

cpuset_mem_spread_node() can return an offline node, which causes an oops.

This patch fixes it by initializing task-&gt;mems_allowed to
node_states[N_HIGH_MEMORY], and updating task-&gt;mems_allowed when doing
memory hotplug.
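
A minimal sketch of the initialization side of the change (hedged;
the memory-hotplug update path is omitted):

  /* seed mems_allowed with nodes that actually have memory,
   * not with every possible node */
  tsk-&gt;mems_allowed = node_states[N_HIGH_MEMORY];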

Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Reported-by: Nick Piggin &lt;npiggin@suse.de&gt;
Tested-by: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Paul Menage &lt;menage@google.com&gt;
Cc: Li Zefan &lt;lizf@cn.fujitsu.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>perf_event: Fix oops triggered by cpu offline/online</title>
<updated>2010-04-01T23:01:30+00:00</updated>
<author>
<name>Paul Mackerras</name>
<email>paulus@samba.org</email>
</author>
<published>2010-03-10T09:45:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=118c29f02194a7874cbbff605b2ec544b97d5610'/>
<id>118c29f02194a7874cbbff605b2ec544b97d5610</id>
<content type='text'>
commit 220b140b52ab6cc133f674a7ffec8fa792054f25 upstream.

Anton Blanchard found that he could reliably make the kernel hit a
BUG_ON in the slab allocator by taking a cpu offline and then online
while a system-wide perf record session was running.

The reason is that when the cpu comes up, we completely reinitialize
the ctx field of the struct perf_cpu_context for the cpu.  If there is
a system-wide perf record session running, then there will be a struct
perf_event that has a reference to the context, so its refcount will
be 2.  (The perf_event has been removed from the context's group_entry
and event_entry lists by perf_event_exit_cpu(), but that doesn't
remove the perf_event's reference to the context and doesn't decrement
the context's refcount.)

When the cpu comes up, perf_event_init_cpu() gets called, and it calls
__perf_event_init_context() on the cpu's context.  That resets the
refcount to 1.  Then when the perf record session finishes and the
perf_event is closed, the refcount gets decremented to 0 and the
context gets kfreed after an RCU grace period.  Since the context
wasn't kmalloced -- it's part of a per-cpu variable -- bad things
happen.

In fact we don't need to completely reinitialize the context when the
cpu comes up.  It's sufficient to initialize the context once at boot,
but we need to do it for all possible cpus.

This moves the context initialization to happen at boot time.  With
this, we don't trash the refcount and the context never gets kfreed,
and we don't hit the BUG_ON.
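
A hedged sketch of the boot-time loop (not the verbatim diff; the
point is that it walks the possible cpus, not just the online ones):

  void __init perf_event_init(void)
  {
          int cpu;

          /* initialize each per-cpu context exactly once, at boot */
          for_each_possible_cpu(cpu)
                  __perf_event_init_context(&amp;per_cpu(perf_cpu_context, cpu).ctx,
                                            NULL);
          /* remaining boot-time setup unchanged */
  }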

Reported-by: Anton Blanchard &lt;anton@samba.org&gt;
Signed-off-by: Paul Mackerras &lt;paulus@samba.org&gt;
Tested-by: Anton Blanchard &lt;anton@samba.org&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>perf: Provide generic perf_sample_data initialization</title>
<updated>2010-04-01T23:01:29+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2010-03-19T01:00:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=06432c9591a2607d70580333a50aba19344e2e40'/>
<id>06432c9591a2607d70580333a50aba19344e2e40</id>
<content type='text'>
This makes it easier to extend perf_sample_data and fixes a bug on arm
and sparc, which failed to set -&gt;raw to NULL; that can cause crashes
when combined with PERF_SAMPLE_RAW.

It also optimizes the PowerPC and tracepoint paths, where the struct
initialization had been forced to zero out the whole structure.
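
The helper would look along these lines (a hedged sketch; only the
fields every caller must set are touched):

  static inline void
  perf_sample_data_init(struct perf_sample_data *data, u64 addr)
  {
          data-&gt;addr = addr;
          data-&gt;raw  = NULL;   /* the arm/sparc bug: this was left unset */
  }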

Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Acked-by: Jean Pihet &lt;jpihet@mvista.com&gt;
Reviewed-by: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Acked-by: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Jamie Iles &lt;jamie.iles@picochip.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
LKML-Reference: &lt;20100304140100.315416040@chello.nl&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>sched: Fix SCHED_MC regression caused by change in sched cpu_power</title>
<updated>2010-04-01T23:01:23+00:00</updated>
<author>
<name>Suresh Siddha</name>
<email>suresh.b.siddha@intel.com</email>
</author>
<published>2010-03-11T08:45:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f140b5c7d133655ae9c541f51218a278d24bba60'/>
<id>f140b5c7d133655ae9c541f51218a278d24bba60</id>
<content type='text'>
commit dd5feea14a7de4edbd9f36db1a2db785de91b88d upstream.

On platforms like a dual-socket quad-core system, the scheduler load
balancer fails to detect load imbalances in certain scenarios. This
leads to situations where one socket is completely busy (all 4 cores
running 4 tasks) while the other socket is left completely idle. That
hurts performance, as those 4 tasks share the memory controller and
last-level cache bandwidth, and it also keeps us from taking advantage
of turbo mode as much as we would like.

Some of the comparisons in the scheduler load balancing code compare
the "weighted cpu load that is scaled wrt sched_group's cpu_power"
with the "weighted average load per task that is not scaled wrt
sched_group's cpu_power". While this has probably been broken for a
longer time (for multi-socket NUMA nodes etc.), the problem got
aggravated by this recent change:

|
|  commit f93e65c186ab3c05ce2068733ca10e34fd00125e
|  Author: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
|  Date:   Tue Sep 1 10:34:32 2009 +0200
|
|  sched: Restore __cpu_power to a straight sum of power
|

Also with this change, the sched group cpu power alone no longer reflects
the group capacity that is needed to implement MC, MT performance
(default) and power-savings (user-selectable) policies.

We need to use the computed group capacity (sgs.group_capacity, that is
computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
find out if the group with the max load is above its capacity and how
much load to move etc.
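
As a hypothetical illustration of the scaling issue (not the upstream
diff), both sides of such a comparison have to live in the same
cpu_power-scaled domain, e.g.:

  /* scale the per-task average the same way the group load is scaled */
  avg_load_per_task = (sgs-&gt;sum_weighted_load * SCHED_LOAD_SCALE) /
                      (group-&gt;cpu_power * sgs-&gt;sum_nr_running);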

Reported-by: Ma Ling &lt;ling.ma@intel.com&gt;
Initial-Analysis-by: Zhang, Yanmin &lt;yanmin_zhang@linux.intel.com&gt;
Signed-off-by: Suresh Siddha &lt;suresh.b.siddha@intel.com&gt;
[ -v2: build fix ]
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
LKML-Reference: &lt;1266970432.11588.22.camel@sbs-t61.sc.intel.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;

</content>
</entry>
<entry>
<title>tracing: Do not record user stack trace from NMI context</title>
<updated>2010-04-01T23:01:17+00:00</updated>
<author>
<name>Steven Rostedt</name>
<email>srostedt@redhat.com</email>
</author>
<published>2010-03-13T01:03:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a94b3267924327acf606ba22f4a5fb480e354da6'/>
<id>a94b3267924327acf606ba22f4a5fb480e354da6</id>
<content type='text'>
commit b6345879ccbd9b92864fbd7eb8ac48acdb4d6b15 upstream.

A bug was found with Li Zefan's ftrace_stress_test that caused applications
to segfault during the test.

Placing a tracing_off() in the segfault code, and examining several
traces, I found that the following was always the case. The lock tracer
was enabled (lockdep being required) and userstack was enabled. Testing
this out, I just enabled the two, but that was not good enough. I needed
to run something else that could trigger it. Running a load like hackbench
did not work, but executing a new program would. The following would
trigger the segfault within seconds:

  # echo 1 &gt; /debug/tracing/options/userstacktrace
  # echo 1 &gt; /debug/tracing/events/lock/enable
  # while :; do ls &gt; /dev/null ; done

Enabling the function graph tracer and looking at what was happening,
I finally noticed that all crashes happened just after an NMI.

 1)               |    copy_user_handle_tail() {
 1)               |      bad_area_nosemaphore() {
 1)               |        __bad_area_nosemaphore() {
 1)               |          no_context() {
 1)               |            fixup_exception() {
 1)   0.319 us    |              search_exception_tables();
 1)   0.873 us    |            }
[...]
 1)   0.314 us    |  __rcu_read_unlock();
 1)   0.325 us    |    native_apic_mem_write();
 1)   0.943 us    |  }
 1)   0.304 us    |  rcu_nmi_exit();
[...]
 1)   0.479 us    |  find_vma();
 1)               |  bad_area() {
 1)               |    __bad_area() {

After capturing several traces of failures, all of them happened
after an NMI. Curious about this, I added a trace_printk() to the NMI
handler to read regs-&gt;ip and see where the NMI happened, and found
that it was here:

ffffffff8135b660 &lt;page_fault&gt;:
ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 &lt;error_entry&gt;

What was happening is that the NMI would hit at the point where a page
fault had just occurred. It would call rcu_read_lock(), which was traced
by the lock events, and the user_stack_trace would run. This would
trigger a page fault inside the NMI. I do not see where the CR2 register
is saved or restored in NMI handling. This means that it would corrupt
the page fault handling that the NMI interrupted.

The reason the while loop of ls helped trigger the bug was that each
execution of ls would cause lots of pages to be faulted in, increasing
the chances of the race happening.

The simple solution is to not allow user stack traces in NMI context.
After this patch, I ran the above "ls" test for a couple of hours
without any issues. Without this patch, the bug would trigger in less
than a minute.
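
A sketch of that guard (hedged; the in_nmi() test is the heart of the
fix):

  /* in the user stack trace path: saving a user stack can fault,
   * and an NMI cannot take a page fault, so bail out here */
  if (unlikely(in_nmi()))
          return;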

Reported-by: Li Zefan &lt;lizf@cn.fujitsu.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>tracing: Disable buffer switching when starting or stopping trace</title>
<updated>2010-04-01T23:01:17+00:00</updated>
<author>
<name>Steven Rostedt</name>
<email>srostedt@redhat.com</email>
</author>
<published>2010-03-13T00:56:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8c3492dc6b9ee58b025ffddc83431c5a5d466282'/>
<id>8c3492dc6b9ee58b025ffddc83431c5a5d466282</id>
<content type='text'>
commit a2f8071428ed9a0f06865f417c962421c9a6b488 upstream.

When the trace iterator is read, tracing_start() and tracing_stop()
are called to stop tracing while the iterator is processing the trace
output.

These functions disable both the standard buffer and the max latency
buffer. But if the wakeup tracer is running, it can switch these
buffers between the two disables:

  buffer = global_trace.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

      &lt;&lt;&lt;--------- swap happens here

  buffer = max_tr.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

What happens is that we disabled the same buffer twice. On tracing_start()
we can likewise enable the same buffer twice. Every ring_buffer_record_disable()
must be matched with a ring_buffer_record_enable(), or the buffer
can be disabled permanently, or enabled prematurely, and cause a bug
where a reset happens while a trace is committing.

This patch protects these two by taking the ftrace_max_lock to prevent
a switch from occurring.
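
A sketch of the locking in tracing_start()/tracing_stop() (hedged;
ftrace_max_lock is the arch spinlock that update_max_tr() holds while
swapping the buffers):

  arch_spin_lock(&amp;ftrace_max_lock);

  buffer = global_trace.buffer;
  if (buffer)
          ring_buffer_record_disable(buffer);

  /* no swap can slip in between the two disables now */
  buffer = max_tr.buffer;
  if (buffer)
          ring_buffer_record_disable(buffer);

  arch_spin_unlock(&amp;ftrace_max_lock);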

Found with Li Zefan's ftrace_stress_test.

Reported-by: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>tracing: Use same local variable when resetting the ring buffer</title>
<updated>2010-04-01T23:01:17+00:00</updated>
<author>
<name>Steven Rostedt</name>
<email>srostedt@redhat.com</email>
</author>
<published>2010-03-13T00:48:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=70e27dedd11fdbb3a9dec79af59d8ba61a19365b'/>
<id>70e27dedd11fdbb3a9dec79af59d8ba61a19365b</id>
<content type='text'>
commit 283740c619d211e34572cc93c8cdba92ccbdb9cc upstream.

In the ftrace code that resets the ring buffer, the buffer is referenced
through a local variable, but tr-&gt;buffer is then used as the parameter
to the reset. If the wakeup tracer is running, which can swap tr-&gt;buffer
with the max saved buffer, this can break the requirement of disabling
the buffer before the reset.

   buffer = tr-&gt;buffer;
   ring_buffer_record_disable(buffer);
   synchronize_sched();
   __tracing_reset(tr-&gt;buffer, cpu);

If the tr-&gt;buffer is swapped, then the reset is not happening to the
buffer that was disabled. This will cause the ring buffer to fail.
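
The fix, per the title, is to hand the same local to the reset
(sketch):

   buffer = tr-&gt;buffer;
   ring_buffer_record_disable(buffer);
   synchronize_sched();
   __tracing_reset(buffer, cpu);   /* the local, not tr-&gt;buffer */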

Found with Li Zefan's ftrace_stress_test.

Reported-by: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>tracing: Fix warning in s_next of trace file ops</title>
<updated>2010-04-01T23:01:16+00:00</updated>
<author>
<name>Lai Jiangshan</name>
<email>laijs@cn.fujitsu.com</email>
</author>
<published>2010-03-02T09:54:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fe030f50c598ffd9476da37d35a7dd7301487795'/>
<id>fe030f50c598ffd9476da37d35a7dd7301487795</id>
<content type='text'>
commit ac91d85456372a90af5b85eb6620fd2efb1e431b upstream.

This warning in s_next() can be triggered by lseek():
 [&lt;c018b3f7&gt;] ? s_next+0x77/0x80
 [&lt;c013e3c1&gt;] warn_slowpath_common+0x81/0xa0
 [&lt;c018b3f7&gt;] ? s_next+0x77/0x80
 [&lt;c013e3fa&gt;] warn_slowpath_null+0x1a/0x20
 [&lt;c018b3f7&gt;] s_next+0x77/0x80
 [&lt;c01efa77&gt;] traverse+0x117/0x200
 [&lt;c01eff13&gt;] seq_lseek+0xa3/0x120
 [&lt;c01efe70&gt;] ? seq_lseek+0x0/0x120
 [&lt;c01d7081&gt;] vfs_llseek+0x41/0x50
 [&lt;c01d8116&gt;] sys_llseek+0x66/0xa0
 [&lt;c0102bd0&gt;] sysenter_do_call+0x12/0x26

The iterator "leftover" variable is zeroed when the trace file is
opened. But lseek can call s_start(), which will call s_next() without
resetting the "leftover" variable back to zero, which might trigger
the WARN_ON_ONCE(iter-&gt;leftover) that is in s_next().
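
A hedged sketch of the corresponding fix in s_start() (field names as
in the warning above):

  if (*pos != iter-&gt;pos) {
          iter-&gt;ent = NULL;
          iter-&gt;cpu = 0;
          iter-&gt;idx = -1;
          /* lseek restarts iteration without reopening the file;
           * clear the stale state so s_next() does not warn */
          iter-&gt;leftover = 0;
  }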

Signed-off-by: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
LKML-Reference: &lt;4B8CE06A.9090207@cn.fujitsu.com&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
</feed>
