linux-toradex.git/kernel, branch v3.4.57

perf: Use css_tryget() to avoid propping up css refcount

2013-08-11T22:38:43+00:00

commit 9c5da09d266ca9b32eb16cf940f8161d949c2fe5 upstream.

An rmdir pushes css's ref count to zero.  However, if the associated
directory is open at the time, the dentry ref count is non-zero.  If
the fd for this directory is then passed into perf_event_open, it
does a css_get().  This bounces the ref count back up from zero.  This
is a problem by itself.  But what makes it turn into a crash is the
fact that we end up doing an extra dput, since we perform a dput
when css_put sees the ref count go down to zero.

css_tryget() does not fall into that trap. So, we use that instead.

Reproduction test-case for the bug:

 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 

 #define PERF_FLAG_PID_CGROUP    (1U << 2)

 int perf_event_open(struct perf_event_attr *hw_event_uptr,
                     pid_t pid, int cpu, int group_fd, unsigned long flags) {
         return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu,
                 group_fd, flags);
 }

 /*
  * Directly poke at the perf_event bug, since it's proving hard to repro
  * depending on where in the kernel tree.  what moved?
  */
 int main(int argc, char **argv)
 {
        int fd;
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.exclude_kernel = 1;
        attr.size = sizeof(attr);
        mkdir("/dev/cgroup/perf_event/blah", 0777);
        fd = open("/dev/cgroup/perf_event/blah", O_RDONLY);
        perror("open");
        rmdir("/dev/cgroup/perf_event/blah");
        sleep(2);
        perf_event_open(&attr, fd, 0, -1,  PERF_FLAG_PID_CGROUP);
        perror("perf_event_open");
        close(fd);
        return 0;
 }

Signed-off-by: Salman Qazi 
Signed-off-by: Peter Zijlstra 
Acked-by: Tejun Heo 
Link: http://lkml.kernel.org/r/20120614223108.1025.2503.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar 
Cc: Li Zefan 
Signed-off-by: Greg Kroah-Hartman

perf: Fix event group context move

2013-08-11T22:38:43+00:00

commit 0231bb5336758426b44ccd798ccd3c5419c95d58 upstream.

When we have group with mixed events (hw/sw) we want to end up
with group leader being in hw context. So if group leader is
initialy sw event, we move all the events under hw context.

The move is done for each event by removing it from its context
and adding it back into proper one. As a part of the removal the
event is automatically disabled, which is not what we want at
this stage of creating groups.

The fix is to initialize event state after removal from sw
context.

This fix resulted from the following discussion:

  http://thread.gmane.org/gmane.linux.kernel.perf.user/1144

Reported-by: Andreas Hollmann 
Signed-off-by: Jiri Olsa 
Cc: Arnaldo Carvalho de Melo 
Cc: Namhyung Kim 
Cc: Corey Ashford 
Cc: Frederic Weisbecker 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Vince Weaver 
Link: http://lkml.kernel.org/r/1359714225-4231-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar 
Cc: Li Zefan 
Signed-off-by: Greg Kroah-Hartman

sched: Fix the broken sched_rr_get_interval()

2013-08-11T22:38:43+00:00

commit a59f4e079d19464eebb9b06513a1d4f55fdae5ba upstream.

The caller of sched_sliced() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.

The change was introduced by commit:

  77034937dc45 sched: fix crash in sys_sched_rr_get_interval()

... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.

Signed-off-by: Zhu Yanhai 
Cc: Peter Zijlstra 
Cc: Paul Turner 
Cc: Thomas Gleixner 
Cc: Steven Rostedt 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1357621012-15039-1-git-send-email-gaoyang.zyh@taobao.com
[ Since this is an ABI and an old bug, we'll test this via a
  slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

tracing: Fix irqs-off tag display in syscall tracing

2013-08-04T08:25:45+00:00

commit 11034ae9c20f4057a6127fc965906417978e69b2 upstream

Initialization of variable irq_flags and pc was missed when backport
11034ae9c to linux-3.0.y and linux-3.4.y, my fault.

Signed-off-by: zhangwei(Jovi) 
Signed-off-by: Greg Kroah-Hartman

hrtimers: Move SMP function call to thread context

2013-07-28T23:26:47+00:00

commit 5ec2481b7b47a4005bb446d176e5d0257400c77d upstream.

smp_call_function_* must not be called from softirq context.

But clock_was_set() which calls on_each_cpu() is called from softirq
context to implement a delayed clock_was_set() for the timer interrupt
handler. Though that almost never gets invoked. A recent change in the
resume code uses the softirq based delayed clock_was_set to support
Xens resume mechanism.

linux-next contains a new warning which warns if smp_call_function_*
is called from softirq context which gets triggered by that Xen
change.

Fix this by moving the delayed clock_was_set() call to a work context.

Reported-and-tested-by: Artem Savkov 
Reported-by: Sasha Levin 
Cc: David Vrabel 
Cc: Ingo Molnar 
Cc: H. Peter Anvin ,
Cc: Konrad Wilk 
Cc: John Stultz 
Cc: xen-devel@lists.xen.org
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

tracing: Fix irqs-off tag display in syscall tracing

2013-07-28T23:26:46+00:00

commit 11034ae9c20f4057a6127fc965906417978e69b2 upstream.

All syscall tracing irqs-off tags are wrong, the syscall enter entry doesn't
disable irqs.

 [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event
 [root@jovi tracing]# cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 13/13   #P:2
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
       irqbalance-513   [000] d... 56115.496766: sys_open(filename: 804e1a6, flags: 0, mode: 1b6)
       irqbalance-513   [000] d... 56115.497008: sys_open(filename: 804e1bb, flags: 0, mode: 1b6)
         sendmail-771   [000] d... 56115.827982: sys_open(filename: b770e6d1, flags: 0, mode: 1b6)

The reason is syscall tracing doesn't record irq_flags into buffer.
The proper display is:

 [root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event
 [root@jovi tracing]# cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 14/14   #P:2
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
       irqbalance-514   [001] ....    46.213921: sys_open(filename: 804e1a6, flags: 0, mode: 1b6)
       irqbalance-514   [001] ....    46.214160: sys_open(filename: 804e1bb, flags: 0, mode: 1b6)
            <...>-920   [001] ....    47.307260: sys_open(filename: 4e82a0c5, flags: 80000, mode: 0)

Link: http://lkml.kernel.org/r/1365564393-10972-3-git-send-email-jovi.zhangwei@huawei.com

Cc: stable@vger.kernel.org # 2.6.35
Signed-off-by: zhangwei(Jovi) 
Signed-off-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman

perf: Fix perf_lock_task_context() vs RCU

2013-07-28T23:25:48+00:00

commit 058ebd0eba3aff16b144eabf4510ed9510e1416e upstream.

Jiri managed to trigger this warning:

 [] ======================================================
 [] [ INFO: possible circular locking dependency detected ]
 [] 3.10.0+ #228 Tainted: G        W
 [] -------------------------------------------------------
 [] p/6613 is trying to acquire lock:
 []  (rcu_node_0){..-...}, at: [] rcu_read_unlock_special+0xa7/0x250
 []
 [] but task is already holding lock:
 []  (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0xd9/0x2c0
 []
 [] which lock already depends on the new lock.
 []
 [] the existing dependency chain (in reverse order) is:
 []
 [] -> #4 (&ctx->lock){-.-...}:
 [] -> #3 (&rq->lock){-.-.-.}:
 [] -> #2 (&p->pi_lock){-.-.-.}:
 [] -> #1 (&rnp->nocb_gp_wq[1]){......}:
 [] -> #0 (rcu_node_0){..-...}:

Paul was quick to explain that due to preemptible RCU we cannot call
rcu_read_unlock() while holding scheduler (or nested) locks when part
of the read side critical section was preemptible.

Therefore solve it by making the entire RCU read side non-preemptible.

Also pull out the retry from under the non-preempt to play nice with RT.

Reported-by: Jiri Olsa 
Helped-out-by: Paul E. McKenney 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

perf: Remove WARN_ON_ONCE() check in __perf_event_enable() for valid scenario

2013-07-28T23:25:48+00:00

commit 06f417968beac6e6b614e17b37d347aa6a6b1d30 upstream.

The '!ctx->is_active' check has a valid scenario, so
there's no need for the warning.

The reason is that there's a time window between the
'ctx->is_active' check in the perf_event_enable() function
and the __perf_event_enable() function having:

  - IRQs on
  - ctx->lock unlocked

where the task could be killed and 'ctx' deactivated by
perf_event_exit_task(), ending up with the warning below.

So remove the WARN_ON_ONCE() check and add comments to
explain it all.

This addresses the following warning reported by Vince Weaver:

[  324.983534] ------------[ cut here ]------------
[  324.984420] WARNING: at kernel/events/core.c:1953 __perf_event_enable+0x187/0x190()
[  324.984420] Modules linked in:
[  324.984420] CPU: 19 PID: 2715 Comm: nmi_bug_snb Not tainted 3.10.0+ #246
[  324.984420] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
[  324.984420]  0000000000000009 ffff88043fce3ec8 ffffffff8160ea0b ffff88043fce3f00
[  324.984420]  ffffffff81080ff0 ffff8802314fdc00 ffff880231a8f800 ffff88043fcf7860
[  324.984420]  0000000000000286 ffff880231a8f800 ffff88043fce3f10 ffffffff8108103a
[  324.984420] Call Trace:
[  324.984420]    [] dump_stack+0x19/0x1b
[  324.984420]  [] warn_slowpath_common+0x70/0xa0
[  324.984420]  [] warn_slowpath_null+0x1a/0x20
[  324.984420]  [] __perf_event_enable+0x187/0x190
[  324.984420]  [] remote_function+0x40/0x50
[  324.984420]  [] generic_smp_call_function_single_interrupt+0xbe/0x130
[  324.984420]  [] smp_call_function_single_interrupt+0x27/0x40
[  324.984420]  [] call_function_single_interrupt+0x6f/0x80
[  324.984420]    [] ? _raw_spin_unlock_irqrestore+0x41/0x70
[  324.984420]  [] perf_event_exit_task+0x14d/0x210
[  324.984420]  [] ? switch_task_namespaces+0x24/0x60
[  324.984420]  [] do_exit+0x2b6/0xa40
[  324.984420]  [] ? _raw_spin_unlock_irq+0x2c/0x30
[  324.984420]  [] do_group_exit+0x49/0xc0
[  324.984420]  [] get_signal_to_deliver+0x254/0x620
[  324.984420]  [] do_signal+0x57/0x5a0
[  324.984420]  [] ? __do_page_fault+0x2a4/0x4e0
[  324.984420]  [] ? retint_restore_args+0xe/0xe
[  324.984420]  [] ? retint_signal+0x11/0x84
[  324.984420]  [] do_notify_resume+0x65/0x80
[  324.984420]  [] retint_signal+0x46/0x84
[  324.984420] ---[ end trace 442ec2f04db3771a ]---

Reported-by: Vince Weaver 
Signed-off-by: Jiri Olsa 
Suggested-by: Peter Zijlstra 
Cc: Corey Ashford 
Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Arnaldo Carvalho de Melo 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1373384651-6109-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

perf: Clone child context from parent context pmu

2013-07-28T23:25:48+00:00

commit 734df5ab549ca44f40de0f07af1c8803856dfb18 upstream.

Currently when the child context for inherited events is
created, it's based on the pmu object of the first event
of the parent context.

This is wrong for the following scenario:

  - HW context having HW and SW event
  - HW event got removed (closed)
  - SW event stays in HW context as the only event
    and its pmu is used to clone the child context

The issue starts when the cpu context object is touched
based on the pmu context object (__get_cpu_context). In
this case the HW context will work with SW cpu context
ending up with following WARN below.

Fixing this by using parent context pmu object to clone
from child context.

Addresses the following warning reported by Vince Weaver:

[ 2716.472065] ------------[ cut here ]------------
[ 2716.476035] WARNING: at kernel/events/core.c:2122 task_ctx_sched_out+0x3c/0x)
[ 2716.476035] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
[ 2716.476035] CPU: 0 PID: 3164 Comm: perf_fuzzer Not tainted 3.10.0-rc4 #2
[ 2716.476035] Hardware name: AOpen   DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
[ 2716.476035]  0000000000000000 ffffffff8102e215 0000000000000000 ffff88011fc18
[ 2716.476035]  ffff8801175557f0 0000000000000000 ffff880119fda88c ffffffff810ad
[ 2716.476035]  ffff880119fda880 ffffffff810af02a 0000000000000009 ffff880117550
[ 2716.476035] Call Trace:
[ 2716.476035]  [] ? warn_slowpath_common+0x5b/0x70
[ 2716.476035]  [] ? task_ctx_sched_out+0x3c/0x5f
[ 2716.476035]  [] ? perf_event_exit_task+0xbf/0x194
[ 2716.476035]  [] ? do_exit+0x3e7/0x90c
[ 2716.476035]  [] ? __do_fault+0x359/0x394
[ 2716.476035]  [] ? do_group_exit+0x66/0x98
[ 2716.476035]  [] ? get_signal_to_deliver+0x479/0x4ad
[ 2716.476035]  [] ? __perf_event_task_sched_out+0x230/0x2d1
[ 2716.476035]  [] ? do_signal+0x3c/0x432
[ 2716.476035]  [] ? ctx_sched_in+0x43/0x141
[ 2716.476035]  [] ? perf_event_context_sched_in+0x7a/0x90
[ 2716.476035]  [] ? __perf_event_task_sched_in+0x31/0x118
[ 2716.476035]  [] ? mmdrop+0xd/0x1c
[ 2716.476035]  [] ? finish_task_switch+0x7d/0xa6
[ 2716.476035]  [] ? do_notify_resume+0x20/0x5d
[ 2716.476035]  [] ? retint_signal+0x3d/0x78
[ 2716.476035] ---[ end trace 827178d8a5966c3d ]---

Reported-by: Vince Weaver 
Signed-off-by: Jiri Olsa 
Cc: Corey Ashford 
Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Arnaldo Carvalho de Melo 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1373384651-6109-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

tracing: Use current_uid() for critical time tracing

2013-07-28T23:25:47+00:00

commit f17a5194859a82afe4164e938b92035b86c55794 upstream.

The irqsoff tracer records the max time that interrupts are disabled.
There are hooks in the assembly code that calls back into the tracer when
interrupts are disabled or enabled.

When they are enabled, the tracer checks if the amount of time they
were disabled is larger than the previous recorded max interrupts off
time. If it is, it creates a snapshot of the currently running trace
to store where the last largest interrupts off time was held and how
it happened.

During testing, this RCU lockdep dump appeared:

[ 1257.829021] ===============================
[ 1257.829021] [ INFO: suspicious RCU usage. ]
[ 1257.829021] 3.10.0-rc1-test+ #171 Tainted: G        W
[ 1257.829021] -------------------------------
[ 1257.829021] /home/rostedt/work/git/linux-trace.git/include/linux/rcupdate.h:780 rcu_read_lock() used illegally while idle!
[ 1257.829021]
[ 1257.829021] other info that might help us debug this:
[ 1257.829021]
[ 1257.829021]
[ 1257.829021] RCU used illegally from idle CPU!
[ 1257.829021] rcu_scheduler_active = 1, debug_locks = 0
[ 1257.829021] RCU used illegally from extended quiescent state!
[ 1257.829021] 2 locks held by trace-cmd/4831:
[ 1257.829021]  #0:  (max_trace_lock){......}, at: [] stop_critical_timing+0x1a3/0x209
[ 1257.829021]  #1:  (rcu_read_lock){.+.+..}, at: [] __update_max_tr+0x88/0x1ee
[ 1257.829021]
[ 1257.829021] stack backtrace:
[ 1257.829021] CPU: 3 PID: 4831 Comm: trace-cmd Tainted: G        W    3.10.0-rc1-test+ #171
[ 1257.829021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[ 1257.829021]  0000000000000001 ffff880065f49da8 ffffffff8153dd2b ffff880065f49dd8
[ 1257.829021]  ffffffff81092a00 ffff88006bd78680 ffff88007add7500 0000000000000003
[ 1257.829021]  ffff88006bd78680 ffff880065f49e18 ffffffff810daebf ffffffff810dae5a
[ 1257.829021] Call Trace:
[ 1257.829021]  [] dump_stack+0x19/0x1b
[ 1257.829021]  [] lockdep_rcu_suspicious+0x109/0x112
[ 1257.829021]  [] __update_max_tr+0xed/0x1ee
[ 1257.829021]  [] ? __update_max_tr+0x88/0x1ee
[ 1257.829021]  [] ? user_enter+0xfd/0x107
[ 1257.829021]  [] update_max_tr_single+0x11d/0x12d
[ 1257.829021]  [] ? user_enter+0xfd/0x107
[ 1257.829021]  [] stop_critical_timing+0x141/0x209
[ 1257.829021]  [] ? trace_hardirqs_on+0xd/0xf
[ 1257.829021]  [] ? user_enter+0xfd/0x107
[ 1257.829021]  [] time_hardirqs_on+0x2a/0x2f
[ 1257.829021]  [] ? user_enter+0xfd/0x107
[ 1257.829021]  [] trace_hardirqs_on_caller+0x16/0x197
[ 1257.829021]  [] trace_hardirqs_on+0xd/0xf
[ 1257.829021]  [] user_enter+0xfd/0x107
[ 1257.829021]  [] do_notify_resume+0x92/0x97
[ 1257.829021]  [] int_signal+0x12/0x17

What happened was entering into the user code, the interrupts were enabled
and a max interrupts off was recorded. The trace buffer was saved along with
various information about the task: comm, pid, uid, priority, etc.

The uid is recorded with task_uid(tsk). But this is a macro that uses rcu_read_lock()
to retrieve the data, and this happened to happen where RCU is blind (user_enter).

As only the preempt and irqs off tracers can have this happen, and they both
only have the tsk == current, if tsk == current, use current_uid() instead of
task_uid(), as current_uid() does not use RCU as only current can change its uid.

This fixes the RCU suspicious splat.

Signed-off-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman