linux-toradex.git/kernel, branch v3.2.51

workqueue: cond_resched() after processing each work item

2013-09-10T00:57:36+00:00

commit b22ce2785d97423846206cceec4efee0c4afd980 upstream.

If !PREEMPT, a kworker running work items back to back can hog CPU.
This becomes dangerous when a self-requeueing work item which is
waiting for something to happen races against stop_machine.  Such a
self-requeueing work item would requeue itself indefinitely, hogging
the kworker and the CPU it's running on, while stop_machine would wait
for that CPU to enter stop_machine while preventing anything else from
happening on all other CPUs.  The two would deadlock.

Jamie Liu reports that this deadlock scenario exists around
scsi_requeue_run_queue() and libata port multiplier support, where one
port may exclude command processing from other ports.  With the right
timing, scsi_requeue_run_queue() can end up requeueing itself trying
to execute an IO which is asked to be retried while another device has
an exclusive access, which in turn can't make forward progress due to
stop_machine.

Fix it by invoking cond_resched() after executing each work item.
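
The idea can be sketched in user space as follows. This is only an
illustration, not the kernel change: struct work_item, yield_point(),
run_worker() and requeue_self() are hypothetical names, yield_point()
merely counts calls where the kernel would call cond_resched(), and
max_runs exists only so the simulation terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a work item; not the kernel's work_struct. */
struct work_item {
    bool (*fn)(struct work_item *w);   /* returns true to requeue itself */
    int runs;                          /* how many times it has executed */
};

static int yield_count;

/* Stand-in for cond_resched(): here it only counts the yield points. */
static void yield_point(void)
{
    yield_count++;
}

/* Run the item back to back, offering a yield after every execution. */
static void run_worker(struct work_item *w, int max_runs)
{
    bool again = true;

    assert(w != NULL && w->fn != NULL);
    while (again && w->runs < max_runs) {
        w->runs++;
        again = w->fn(w);
        yield_point();   /* one reschedule opportunity per work item */
    }
}

/* A work item that always asks to be requeued. */
static bool requeue_self(struct work_item *w)
{
    (void)w;
    return true;
}
```

With a yield point after every item, even an item that requeues itself
forever gives other tasks on that CPU a chance to run between
executions.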

Signed-off-by: Tejun Heo 
Reported-by: Jamie Liu 
References: http://thread.gmane.org/gmane.linux.kernel/1552567
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

tracing: Fix fields of struct trace_iterator that are zeroed by mistake

2013-09-10T00:57:20+00:00

commit ed5467da0e369e65b247b99eb6403cb79172bcda upstream.

tracing_read_pipe zeros all fields below "seq". The declaration contains
a comment about that, but it doesn't help.

The first field is "snapshot"; it's true when the currently open file is
the snapshot. Obviously it should not be zeroed.

The second field is "started". It was converted from cpumask_t to
cpumask_var_t (v2.6.28-4983-g4462344), in other words it was
converted from a cpumask to a pointer to a cpumask.

Currently the reference to the "started" memory is lost after the first read
from tracing_read_pipe and the allocated object will never be freed.

The "started" is never dereferenced for trace_pipe, because trace_pipe
can't have the TRACE_FILE_ANNOTATE options.
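
A minimal user-space sketch of the "zero everything from seq down"
pattern, assuming the reset is a memset() that starts at the offset of
"seq". struct iter_sketch and reset_iter() are hypothetical and do not
reproduce the real struct trace_iterator layout; the point is only that
a field which must survive the reset has to sit above "seq".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout, not the real struct trace_iterator. */
struct iter_sketch {
    void *started;   /* must survive the reset, so it sits above seq */
    int   seq;       /* everything from here down is zeroed */
    long  pos;
    int   idx;
};

/* Zero the tail of the structure, starting at the seq member. */
static void reset_iter(struct iter_sketch *it)
{
    size_t off = offsetof(struct iter_sketch, seq);

    assert(it != NULL);
    memset((char *)it + off, 0, sizeof(*it) - off);
}
```

Had "started" been declared below "seq", reset_iter() would have
cleared the pointer and the memory it referred to could no longer be
freed.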

Link: http://lkml.kernel.org/r/1375463803-3085183-1-git-send-email-avagin@openvz.org

Signed-off-by: Andrew Vagin 
Signed-off-by: Steven Rostedt 
[bwh: Backported to 3.2: there's no snapshot field]
Signed-off-by: Ben Hutchings

perf: Fix event group context move

2013-09-10T00:57:05+00:00

commit 0231bb5336758426b44ccd798ccd3c5419c95d58 upstream.

When we have a group with mixed events (hw/sw) we want to end up
with the group leader being in the hw context. So if the group leader
is initially a sw event, we move all the events under the hw context.

The move is done for each event by removing it from its context
and adding it back into proper one. As a part of the removal the
event is automatically disabled, which is not what we want at
this stage of creating groups.

The fix is to initialize event state after removal from sw
context.

This fix resulted from the following discussion:

  http://thread.gmane.org/gmane.linux.kernel.perf.user/1144

Reported-by: Andreas Hollmann 
Signed-off-by: Jiri Olsa 
Cc: Arnaldo Carvalho de Melo 
Cc: Namhyung Kim 
Cc: Corey Ashford 
Cc: Frederic Weisbecker 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Vince Weaver 
Link: http://lkml.kernel.org/r/1359714225-4231-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings

sched: Fix the broken sched_rr_get_interval()

2013-09-10T00:57:04+00:00

commit a59f4e079d19464eebb9b06513a1d4f55fdae5ba upstream.

The caller of sched_slice() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.
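
Why the wrong queue matters can be shown with a much simplified
sketch. It assumes a slice is the entity's share of a period in
proportion to the load of the queue it is on; the real computation is
more involved. All names here (cfs_rq_sketch, se_sketch, rq_sketch,
cfs_rq_of_sketch(), slice_sketch()) are hypothetical.

```c
#include <assert.h>

/* Hypothetical, much reduced stand-ins for the scheduler structures. */
struct cfs_rq_sketch { unsigned long load_weight; };
struct se_sketch {
    unsigned long weight;
    struct cfs_rq_sketch *cfs_rq;   /* the queue this entity is on */
};
struct rq_sketch { struct cfs_rq_sketch cfs; };   /* root queue of a CPU */

/* The queue the entity actually belongs to. */
static struct cfs_rq_sketch *cfs_rq_of_sketch(const struct se_sketch *se)
{
    return se->cfs_rq;
}

/* Share of the period in proportion to the entity's weight on queue q. */
static unsigned long slice_sketch(const struct cfs_rq_sketch *q,
                                  const struct se_sketch *se,
                                  unsigned long period)
{
    assert(q->load_weight > 0);
    return period * se->weight / q->load_weight;
}
```

When the entity sits on a group queue whose load differs from the root
queue's, passing the root queue yields a different and incorrect slice.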

Signed-off-by: Zhu Yanhai 
Cc: Peter Zijlstra 
Cc: Paul Turner 
Cc: Thomas Gleixner 
Cc: Steven Rostedt 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1357621012-15039-1-git-send-email-gaoyang.zyh@taobao.com
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings

tracing: Use current_uid() for critical time tracing

2013-08-02T20:14:53+00:00

commit f17a5194859a82afe4164e938b92035b86c55794 upstream.

The irqsoff tracer records the max time that interrupts are disabled.
There are hooks in the assembly code that call back into the tracer when
interrupts are disabled or enabled.

When they are enabled, the tracer checks if the amount of time they
were disabled is larger than the previous recorded max interrupts off
time. If it is, it creates a snapshot of the currently running trace
to store where the last largest interrupts off time was held and how
it happened.

During testing, this RCU lockdep dump appeared:

[ 1257.829021] ===============================
[ 1257.829021] [ INFO: suspicious RCU usage. ]
[ 1257.829021] 3.10.0-rc1-test+ #171 Tainted: G        W
[ 1257.829021] -------------------------------
[ 1257.829021] /home/rostedt/work/git/linux-trace.git/include/linux/rcupdate.h:780 rcu_read_lock() used illegally while idle!
[ 1257.829021]
[ 1257.829021] other info that might help us debug this:
[ 1257.829021]
[ 1257.829021]
[ 1257.829021] RCU used illegally from idle CPU!
[ 1257.829021] rcu_scheduler_active = 1, debug_locks = 0
[ 1257.829021] RCU used illegally from extended quiescent state!
[ 1257.829021] 2 locks held by trace-cmd/4831:
[ 1257.829021]  #0:  (max_trace_lock){......}, at: [] stop_critical_timing+0x1a3/0x209
[ 1257.829021]  #1:  (rcu_read_lock){.+.+..}, at: [] __update_max_tr+0x88/0x1ee
[ 1257.829021]
[ 1257.829021] stack backtrace:
[ 1257.829021] CPU: 3 PID: 4831 Comm: trace-cmd Tainted: G        W    3.10.0-rc1-test+ #171
[ 1257.829021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[ 1257.829021]  0000000000000001 ffff880065f49da8 ffffffff8153dd2b ffff880065f49dd8
[ 1257.829021]  ffffffff81092a00 ffff88006bd78680 ffff88007add7500 0000000000000003
[ 1257.829021]  ffff88006bd78680 ffff880065f49e18 ffffffff810daebf ffffffff810dae5a
[ 1257.829021] Call Trace:
[ 1257.829021]  [] dump_stack+0x19/0x1b
[ 1257.829021]  [] lockdep_rcu_suspicious+0x109/0x112
[ 1257.829021]  [] __update_max_tr+0xed/0x1ee
[ 1257.829021]  [] ? __update_max_tr+0x88/0x1ee
[ 1257.829021]  [] ? user_enter+0xfd/0x107
[ 1257.829021]  [] update_max_tr_single+0x11d/0x12d
[ 1257.829021]  [] ? user_enter+0xfd/0x107
[ 1257.829021]  [] stop_critical_timing+0x141/0x209
[ 1257.829021]  [] ? trace_hardirqs_on+0xd/0xf
[ 1257.829021]  [] ? user_enter+0xfd/0x107
[ 1257.829021]  [] time_hardirqs_on+0x2a/0x2f
[ 1257.829021]  [] ? user_enter+0xfd/0x107
[ 1257.829021]  [] trace_hardirqs_on_caller+0x16/0x197
[ 1257.829021]  [] trace_hardirqs_on+0xd/0xf
[ 1257.829021]  [] user_enter+0xfd/0x107
[ 1257.829021]  [] do_notify_resume+0x92/0x97
[ 1257.829021]  [] int_signal+0x12/0x17

What happened was that, on entry into user code, interrupts were enabled
and a max interrupts-off time was recorded. The trace buffer was saved along with
various information about the task: comm, pid, uid, priority, etc.

The uid is recorded with task_uid(tsk). But this is a macro that uses rcu_read_lock()
to retrieve the data, and this happened at a point where RCU is blind (user_enter).

As only the preempt and irqs off tracers can have this happen, and they both
only have the tsk == current, if tsk == current, use current_uid() instead of
task_uid(), as current_uid() does not use RCU as only current can change its uid.
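
The selection logic can be sketched in user space as below. The names
task_sketch, current_task, task_uid_sketch(), current_uid_sketch() and
record_uid() are hypothetical stand-ins, and rcu_reads only counts how
often the RCU-protected path would have been taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct task_sketch { unsigned int uid; };

static struct task_sketch *current_task;   /* stand-in for "current" */
static int rcu_reads;                      /* counts RCU-protected lookups */

/* Stand-in for task_uid(): would need an RCU read-side section. */
static unsigned int task_uid_sketch(const struct task_sketch *t)
{
    rcu_reads++;
    return t->uid;
}

/* Stand-in for current_uid(): no RCU needed for the current task. */
static unsigned int current_uid_sketch(void)
{
    return current_task->uid;
}

/* Use the RCU-free accessor whenever the task is the current one. */
static unsigned int record_uid(const struct task_sketch *tsk)
{
    assert(tsk != NULL);
    if (tsk == current_task)
        return current_uid_sketch();
    return task_uid_sketch(tsk);
}
```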

This fixes the RCU suspicious splat.

Signed-off-by: Steven Rostedt 
Signed-off-by: Ben Hutchings

perf: Fix mmap() accounting hole

2013-07-27T04:34:32+00:00

commit 9bb5d40cd93c9dd4be74834b1dcb1ba03629716b upstream.

Vince's fuzzer once again found holes. This time it spotted a leak in
the locked page accounting.

When an event had redirected output and its close() was the last
reference to the buffer we didn't have a vm context to undo accounting.

Change the code to destroy the buffer on the last munmap() and detach
all redirected events at that time. This provides us the right context
to undo the vm accounting.

[Backporting for 3.4-stable.
VM_RESERVED flag was replaced with pair 'VM_DONTEXPAND | VM_DONTDUMP' in
314e51b9 since 3.7.0-rc1, and 314e51b9 comes from a big patchset, we didn't
backport the patchset, so I restored VM_RESERVED in place of 'VM_DONTEXPAND | VM_DONTDUMP':
-	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_flags |= VM_DONTCOPY | VM_RESERVED;
 -- zliu]

Reported-and-tested-by: Vince Weaver 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/20130604084421.GI8923@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Zhouping Liu 
Signed-off-by: Greg Kroah-Hartman 
[bwh: Backported to 3.2: drop unrelated addition of braces in free_event()]
Signed-off-by: Ben Hutchings

perf: Fix perf mmap bugs

2013-07-27T04:34:32+00:00

commit 26cb63ad11e04047a64309362674bcbbd6a6f246 upstream.

Vince reported a problem found by his perf specific trinity
fuzzer.

Al noticed 2 problems with perf's mmap():

 - it has issues against fork() since we use vma->vm_mm for accounting.
 - it has an rb refcount leak on double mmap().

We fix the issues against fork() by using VM_DONTCOPY; I don't
think there's code out there that uses this; we didn't hear
about weird accounting problems/crashes. If we do need this to
work, the previously proposed VM_PINNED could make this work.

Aside from the rb reference leak spotted by Al, Vince's example
prog was indeed doing a double mmap() through the use of
perf_event_set_output().

This exposes another problem, since we now have 2 events with
one buffer, the accounting gets screwy because we account per
event. Fix this by making the buffer responsible for its own
accounting.
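
A rough sketch of what "the buffer is responsible for its own
accounting" can look like, under the assumption that locked pages are
charged once when the buffer is first mapped and released when the last
mapping goes away. user_sketch, rb_sketch, rb_mmap() and rb_munmap()
are hypothetical names, not the perf implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-user locked-page counter. */
struct user_sketch { long locked_pages; };

/* Hypothetical buffer that remembers what it charged and to whom. */
struct rb_sketch {
    int mmap_count;             /* number of live mappings */
    long nr_pages;              /* pages charged on the first mapping */
    struct user_sketch *user;   /* who was charged */
};

/* Charge only on the first mapping, however many events share the buffer. */
static void rb_mmap(struct rb_sketch *rb, struct user_sketch *u, long pages)
{
    if (rb->mmap_count++ == 0) {
        rb->nr_pages = pages;
        rb->user = u;
        u->locked_pages += pages;
    }
}

/* Undo the charge only when the last mapping goes away. */
static void rb_munmap(struct rb_sketch *rb)
{
    assert(rb->mmap_count > 0);
    if (--rb->mmap_count == 0) {
        rb->user->locked_pages -= rb->nr_pages;
        rb->user = NULL;
    }
}
```

Because the charge is recorded in the buffer, two events sharing one
buffer cannot charge or release the same pages twice.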

[Backporting for 3.4-stable.
VM_RESERVED flag was replaced with pair 'VM_DONTEXPAND | VM_DONTDUMP' in
314e51b9 since 3.7.0-rc1, and 314e51b9 comes from a big patchset, we didn't
backport the patchset, so I restored VM_RESERVED in place of 'VM_DONTEXPAND | VM_DONTDUMP':
-       vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
+       vma->vm_flags |= VM_DONTCOPY | VM_RESERVED;
 -- zliu]

Reported-by: Vince Weaver 
Signed-off-by: Peter Zijlstra 
Cc: Al Viro 
Cc: Paul Mackerras 
Cc: Arnaldo Carvalho de Melo 
Link: http://lkml.kernel.org/r/20130528085548.GA12193@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Zhouping Liu 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Ben Hutchings

perf: Fix perf_lock_task_context() vs RCU

2013-07-27T04:34:31+00:00

commit 058ebd0eba3aff16b144eabf4510ed9510e1416e upstream.

Jiri managed to trigger this warning:

 [] ======================================================
 [] [ INFO: possible circular locking dependency detected ]
 [] 3.10.0+ #228 Tainted: G        W
 [] -------------------------------------------------------
 [] p/6613 is trying to acquire lock:
 []  (rcu_node_0){..-...}, at: [] rcu_read_unlock_special+0xa7/0x250
 []
 [] but task is already holding lock:
 []  (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0xd9/0x2c0
 []
 [] which lock already depends on the new lock.
 []
 [] the existing dependency chain (in reverse order) is:
 []
 [] -> #4 (&ctx->lock){-.-...}:
 [] -> #3 (&rq->lock){-.-.-.}:
 [] -> #2 (&p->pi_lock){-.-.-.}:
 [] -> #1 (&rnp->nocb_gp_wq[1]){......}:
 [] -> #0 (rcu_node_0){..-...}:

Paul was quick to explain that due to preemptible RCU we cannot call
rcu_read_unlock() while holding scheduler (or nested) locks when part
of the read side critical section was preemptible.

Therefore solve it by making the entire RCU read side non-preemptible.

Also pull out the retry from under the non-preempt to play nice with RT.

Reported-by: Jiri Olsa 
Helped-out-by: Paul E. McKenney 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings

perf: Remove WARN_ON_ONCE() check in __perf_event_enable() for valid scenario

2013-07-27T04:34:31+00:00

commit 06f417968beac6e6b614e17b37d347aa6a6b1d30 upstream.

The '!ctx->is_active' check has a valid scenario, so
there's no need for the warning.

The reason is that there's a time window between the
'ctx->is_active' check in the perf_event_enable() function
and the __perf_event_enable() function having:

  - IRQs on
  - ctx->lock unlocked

where the task could be killed and 'ctx' deactivated by
perf_event_exit_task(), ending up with the warning below.

So remove the WARN_ON_ONCE() check and add comments to
explain it all.

This addresses the following warning reported by Vince Weaver:

[  324.983534] ------------[ cut here ]------------
[  324.984420] WARNING: at kernel/events/core.c:1953 __perf_event_enable+0x187/0x190()
[  324.984420] Modules linked in:
[  324.984420] CPU: 19 PID: 2715 Comm: nmi_bug_snb Not tainted 3.10.0+ #246
[  324.984420] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
[  324.984420]  0000000000000009 ffff88043fce3ec8 ffffffff8160ea0b ffff88043fce3f00
[  324.984420]  ffffffff81080ff0 ffff8802314fdc00 ffff880231a8f800 ffff88043fcf7860
[  324.984420]  0000000000000286 ffff880231a8f800 ffff88043fce3f10 ffffffff8108103a
[  324.984420] Call Trace:
[  324.984420]    [] dump_stack+0x19/0x1b
[  324.984420]  [] warn_slowpath_common+0x70/0xa0
[  324.984420]  [] warn_slowpath_null+0x1a/0x20
[  324.984420]  [] __perf_event_enable+0x187/0x190
[  324.984420]  [] remote_function+0x40/0x50
[  324.984420]  [] generic_smp_call_function_single_interrupt+0xbe/0x130
[  324.984420]  [] smp_call_function_single_interrupt+0x27/0x40
[  324.984420]  [] call_function_single_interrupt+0x6f/0x80
[  324.984420]    [] ? _raw_spin_unlock_irqrestore+0x41/0x70
[  324.984420]  [] perf_event_exit_task+0x14d/0x210
[  324.984420]  [] ? switch_task_namespaces+0x24/0x60
[  324.984420]  [] do_exit+0x2b6/0xa40
[  324.984420]  [] ? _raw_spin_unlock_irq+0x2c/0x30
[  324.984420]  [] do_group_exit+0x49/0xc0
[  324.984420]  [] get_signal_to_deliver+0x254/0x620
[  324.984420]  [] do_signal+0x57/0x5a0
[  324.984420]  [] ? __do_page_fault+0x2a4/0x4e0
[  324.984420]  [] ? retint_restore_args+0xe/0xe
[  324.984420]  [] ? retint_signal+0x11/0x84
[  324.984420]  [] do_notify_resume+0x65/0x80
[  324.984420]  [] retint_signal+0x46/0x84
[  324.984420] ---[ end trace 442ec2f04db3771a ]---

Reported-by: Vince Weaver 
Signed-off-by: Jiri Olsa 
Suggested-by: Peter Zijlstra 
Cc: Corey Ashford 
Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Arnaldo Carvalho de Melo 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1373384651-6109-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings

perf: Clone child context from parent context pmu

2013-07-27T04:34:31+00:00

commit 734df5ab549ca44f40de0f07af1c8803856dfb18 upstream.

Currently when the child context for inherited events is
created, it's based on the pmu object of the first event
of the parent context.

This is wrong for the following scenario:

  - HW context having HW and SW event
  - HW event got removed (closed)
  - SW event stays in HW context as the only event
    and its pmu is used to clone the child context

The issue starts when the cpu context object is touched
based on the pmu context object (__get_cpu_context). In
this case the HW context will work with the SW cpu context,
ending up with the WARN below.

Fix this by using the parent context's pmu object to clone
the child context.
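
The scenario above can be modelled with a small sketch. The types
pmu_sketch, event_sketch and ctx_sketch and the two helpers are
hypothetical; they only contrast taking the pmu from the first event
with taking it from the parent context itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the perf structures. */
struct pmu_sketch   { int id; };
struct event_sketch { struct pmu_sketch *pmu; };
struct ctx_sketch {
    struct pmu_sketch   *pmu;           /* pmu the context belongs to */
    struct event_sketch *first_event;   /* first event still in the context */
};

/* Old behaviour: derive the child's pmu from the parent's first event. */
static struct pmu_sketch *child_pmu_old(const struct ctx_sketch *parent)
{
    assert(parent->first_event != NULL);
    return parent->first_event->pmu;
}

/* Fixed behaviour: derive the child's pmu from the parent context. */
static struct pmu_sketch *child_pmu_fixed(const struct ctx_sketch *parent)
{
    return parent->pmu;
}
```

With a HW context whose only remaining event is a SW event, the old
helper returns the SW pmu while the fixed helper still returns the HW
pmu.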

Addresses the following warning reported by Vince Weaver:

[ 2716.472065] ------------[ cut here ]------------
[ 2716.476035] WARNING: at kernel/events/core.c:2122 task_ctx_sched_out+0x3c/0x)
[ 2716.476035] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
[ 2716.476035] CPU: 0 PID: 3164 Comm: perf_fuzzer Not tainted 3.10.0-rc4 #2
[ 2716.476035] Hardware name: AOpen   DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
[ 2716.476035]  0000000000000000 ffffffff8102e215 0000000000000000 ffff88011fc18
[ 2716.476035]  ffff8801175557f0 0000000000000000 ffff880119fda88c ffffffff810ad
[ 2716.476035]  ffff880119fda880 ffffffff810af02a 0000000000000009 ffff880117550
[ 2716.476035] Call Trace:
[ 2716.476035]  [] ? warn_slowpath_common+0x5b/0x70
[ 2716.476035]  [] ? task_ctx_sched_out+0x3c/0x5f
[ 2716.476035]  [] ? perf_event_exit_task+0xbf/0x194
[ 2716.476035]  [] ? do_exit+0x3e7/0x90c
[ 2716.476035]  [] ? __do_fault+0x359/0x394
[ 2716.476035]  [] ? do_group_exit+0x66/0x98
[ 2716.476035]  [] ? get_signal_to_deliver+0x479/0x4ad
[ 2716.476035]  [] ? __perf_event_task_sched_out+0x230/0x2d1
[ 2716.476035]  [] ? do_signal+0x3c/0x432
[ 2716.476035]  [] ? ctx_sched_in+0x43/0x141
[ 2716.476035]  [] ? perf_event_context_sched_in+0x7a/0x90
[ 2716.476035]  [] ? __perf_event_task_sched_in+0x31/0x118
[ 2716.476035]  [] ? mmdrop+0xd/0x1c
[ 2716.476035]  [] ? finish_task_switch+0x7d/0xa6
[ 2716.476035]  [] ? do_notify_resume+0x20/0x5d
[ 2716.476035]  [] ? retint_signal+0x3d/0x78
[ 2716.476035] ---[ end trace 827178d8a5966c3d ]---

Reported-by: Vince Weaver 
Signed-off-by: Jiri Olsa 
Cc: Corey Ashford 
Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Arnaldo Carvalho de Melo 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1373384651-6109-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings