linux-toradex.git/kernel/trace/trace_syscalls.c, branch v6.13-rc7

tracing/perf: Add might_fault check to syscall probes

2024-10-09T21:09:46+00:00

Add a might_fault() check to validate that the perf sys_enter/sys_exit
probe callbacks are indeed called from a context where page faults can
be handled.

Cc: Michael Jeanson 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Alexei Starovoitov 
Cc: Yonghong Song 
Cc: Paul E. McKenney 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Namhyung Kim 
Cc: Andrii Nakryiko 
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes 
Link: https://lore.kernel.org/20241009010718.2050182-8-mathieu.desnoyers@efficios.com
Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google)

tracing/ftrace: Add might_fault check to syscall probes

2024-10-09T21:09:36+00:00

Add a might_fault() check to validate that the ftrace sys_enter/sys_exit
probe callbacks are indeed called from a context where page faults can
be handled.

Cc: Michael Jeanson 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Alexei Starovoitov 
Cc: Yonghong Song 
Cc: Paul E. McKenney 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Namhyung Kim 
Cc: Andrii Nakryiko 
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes 
Link: https://lore.kernel.org/20241009010718.2050182-7-mathieu.desnoyers@efficios.com
Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google)

tracing/perf: disable preemption in syscall probe

2024-10-09T21:07:25+00:00

In preparation for allowing system call enter/exit instrumentation to
handle page faults, make sure that perf can handle this change by
explicitly disabling preemption within the perf system call tracepoint
probes to respect the current expectations within perf ring buffer code.

This change does not yet allow perf to take page faults per se within
its probe, but allows its existing probes to adapt to the upcoming
change.

Cc: Michael Jeanson 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Alexei Starovoitov 
Cc: Yonghong Song 
Cc: Paul E. McKenney 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Namhyung Kim 
Cc: Andrii Nakryiko 
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes 
Link: https://lore.kernel.org/20241009010718.2050182-4-mathieu.desnoyers@efficios.com
Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google)

tracing/ftrace: disable preemption in syscall probe

2024-10-09T21:07:00+00:00

In preparation for allowing system call enter/exit instrumentation to
handle page faults, make sure that ftrace can handle this change by
explicitly disabling preemption within the ftrace system call tracepoint
probes to respect the current expectations within ftrace ring buffer
code.

This change does not yet allow ftrace to take page faults per se within
its probe, but allows its existing probes to adapt to the upcoming
change.
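
The ordering this series is working towards (a might_fault() check on entry, then preemption disabled for the rest of the probe) can be modelled in user space. The following is only a sketch: every model_* name, the preempt_depth counter and the might_fault_splats counter are hypothetical stand-ins, not kernel APIs, and the scope guard relies on the GCC/Clang cleanup attribute.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-task preempt count. */
int preempt_depth;
/* Counts how often the modelled might_fault() would have complained. */
int might_fault_splats;

void model_preempt_disable(void)
{
	preempt_depth++;
}

void model_preempt_enable(void)
{
	assert(preempt_depth > 0);
	preempt_depth--;
}

/* Models the check: a page fault can only be handled while preemptible. */
void model_might_fault(void)
{
	if (preempt_depth != 0)
		might_fault_splats++;
}

/* Cleanup handler: runs whenever the guard variable goes out of scope. */
void model_guard_exit(int *guard)
{
	(void)guard;
	model_preempt_enable();
}

/* Scope guard: disable preemption now, re-enable on every exit path. */
#define MODEL_GUARD_PREEMPT()						\
	int model_guard __attribute__((cleanup(model_guard_exit), unused)) = \
		(model_preempt_disable(), 0)

/* Returns the preempt depth seen inside the probe, or -1 on early exit. */
int model_syscall_probe(long nr)
{
	model_might_fault();	/* checked while still preemptible */
	MODEL_GUARD_PREEMPT();	/* ring buffer code expects this */

	if (nr < 0)
		return -1;	/* the guard still re-enables preemption */

	return preempt_depth;
}
```

Calling model_syscall_probe() from a preemptible context leaves preempt_depth balanced on both return paths. Calling it with preemption already disabled bumps might_fault_splats, which is the kind of misuse the check is meant to flag.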

Cc: Michael Jeanson 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Alexei Starovoitov 
Cc: Yonghong Song 
Cc: Paul E. McKenney 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Namhyung Kim 
Cc: Andrii Nakryiko 
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes 
Link: https://lore.kernel.org/20241009010718.2050182-3-mathieu.desnoyers@efficios.com
Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google)

bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing

2024-09-11T20:27:27+00:00

Salvatore Benedetto reported that the kernel stack is empty when doing
syscall tracepoint tracing. For example, with the following command
lines
  bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
  bpftrace -e 'tracepoint:syscalls:sys_exit_read { print("Kernel Stack\n"); print(kstack()); }'
the output of both commands is
===
  Kernel Stack
===

Further analysis shows that the pt_regs used for bpf syscall tracepoint
tracing is the one constructed during the user->kernel transition.
The call stack looks like
  perf_syscall_enter+0x88/0x7c0
  trace_sys_enter+0x41/0x80
  syscall_trace_enter+0x100/0x160
  do_syscall_64+0x38/0xf0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The ip stored in that pt_regs is a user space address, hence no kernel
stack is printed.

To fix the issue, a pt_regs holding a kernel address is required.
The kernel already has a few cases like this. For example, in
kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
instances are used to supply the ip address or to construct the call
stack from it.

Instead of allocating fake_regs on the stack, which may consume
a lot of bytes, the function perf_trace_buf_alloc() in
perf_syscall_{enter,exit}() is leveraged to create fake_regs,
which is then passed to perf_call_bpf_{enter,exit}().

For the above bpftrace script, I got the following output with this patch:
for tracepoint:syscalls:sys_enter_read
===
  Kernel Stack

        syscall_trace_enter+407
        syscall_trace_enter+407
        do_syscall_64+74
        entry_SYSCALL_64_after_hwframe+75
===
and for tracepoint:syscalls:sys_exit_read
===
  Kernel Stack

        syscall_exit_work+185
        syscall_exit_work+185
        syscall_exit_to_user_mode+305
        do_syscall_64+118
        entry_SYSCALL_64_after_hwframe+75
===

Reported-by: Salvatore Benedetto 
Suggested-by: Andrii Nakryiko 
Signed-off-by: Yonghong Song 
Signed-off-by: Andrii Nakryiko 
Acked-by: Andrii Nakryiko 
Link: https://lore.kernel.org/bpf/20240910214037.3663272-1-yonghong.song@linux.dev

bpf: Change syscall_nr type to int in struct syscall_tp_t

2023-10-13T19:39:36+00:00

The linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add
support for lazy preemption")) that adds an extra member to struct
trace_entry. This causes the offset of the args field in struct
trace_event_raw_sys_enter to differ from the one in struct
syscall_trace_enter:

struct trace_event_raw_sys_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */
        /* XXX 4 bytes hole, try to pack */

        long int                   id;                   /*    16     8 */
        long unsigned int          args[6];              /*    24    48 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        char                       __data[];             /*    72     0 */

        /* size: 72, cachelines: 2, members: 4 */
        /* sum members: 68, holes: 1, sum holes: 4 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 8 bytes */
};

struct syscall_trace_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */

        int                        nr;                   /*    12     4 */
        long unsigned int          args[];               /*    16     0 */

        /* size: 16, cachelines: 1, members: 3 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 16 bytes */
};

This, in turn, causes perf_event_set_bpf_prog() to fail while running
the bpf test_profiler testcase, because max_ctx_offset is calculated
based on the former struct, while off is based on the latter:

        if (is_tracepoint || is_syscall_tp) {
                int off = trace_event_get_offsets(event->tp_event);

                if (prog->aux->max_ctx_offset > off)
                        return -EACCES;
        }

What the bpf program actually gets is a pointer to struct
syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
the problem by aligning struct syscall_tp_t with struct
syscall_trace_(enter|exit) and changing the tests to use these structs
to dereference the context.
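
The mismatch can be reproduced in user space with offsetof(). This is a sketch under stated assumptions: the model_* structs are hypothetical stand-ins whose member layout is chosen only to match the sizes and offsets in the pahole output above (12-byte trace_entry, LP64 target), and model_check_ctx_access() mirrors just the comparison quoted above, not how the kernel derives either number.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

static_assert(sizeof(long) == 8, "sketch assumes an LP64 target such as x86_64");

/* Hypothetical 12-byte header: 9 bytes of members plus 3 bytes of padding. */
struct model_trace_entry {
	unsigned short type;
	unsigned char flags;
	unsigned char preempt_count;
	int pid;
	unsigned char preempt_lazy_count;
};

/* Layout the bpf program was compiled against. */
struct model_raw_sys_enter {
	struct model_trace_entry ent;
	long id;			/* 8-byte aligned, so a 4-byte hole precedes it */
	unsigned long args[6];
	char data[];
};

/* Layout the event offsets are derived from. */
struct model_syscall_trace_enter {
	struct model_trace_entry ent;
	int nr;				/* fits right after the 12-byte header */
	unsigned long args[];
};

static_assert(sizeof(struct model_trace_entry) == 12, "header size from the dump");
static_assert(offsetof(struct model_raw_sys_enter, args) == 24, "raw args offset");
static_assert(offsetof(struct model_syscall_trace_enter, args) == 16, "trace args offset");

/* End offset of the first n syscall arguments in each layout. */
size_t model_raw_args_end(size_t n)
{
	return offsetof(struct model_raw_sys_enter, args) + n * sizeof(unsigned long);
}

size_t model_trace_args_end(size_t n)
{
	return offsetof(struct model_syscall_trace_enter, args) + n * sizeof(unsigned long);
}

/* Same comparison as the snippet quoted above. */
int model_check_ctx_access(size_t max_ctx_offset, size_t off)
{
	return max_ctx_offset > off ? -EACCES : 0;
}
```

For a three-argument syscall such as read, a program that dereferences the third argument through the raw layout reaches byte 48, while the event only extends to byte 40, so the check rejects it. Dereferencing through the syscall_trace_enter layout stays within bounds.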

Signed-off-by: Artem Savkov 
Signed-off-by: Andrii Nakryiko 
Acked-by: Steven Rostedt (Google) 
Link: https://lore.kernel.org/bpf/20231013054219.172920-1-asavkov@redhat.com

tracing: bpf: use struct trace_entry in struct syscall_tp_t

2023-08-01T17:53:28+00:00

A bpf tracepoint program takes struct trace_event_raw_sys_enter as its
argument, where trace_entry is the first field. Use the same type in
struct syscall_tp_t instead of unsigned long long, since if trace_entry
is amended (for example by the RT patch) the program accesses data at
the wrong offset.

Signed-off-by: Yauheni Kaliuta 
Acked-by: Yonghong Song 
Link: https://lore.kernel.org/r/20230801075222.7717-1-ykaliuta@redhat.com
Signed-off-by: Alexei Starovoitov

tracing: Remove unused __bad_type_size() method

2022-11-18T01:21:06+00:00

__bad_type_size() is unused after commit 04ae87a52074 ("ftrace: Rework
event_create_dir()"), so remove it.

Link: https://lkml.kernel.org/r/D062EC2E-7DB7-4402-A67E-33C3577F551E@gmail.com

Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Qiujun Huang 
Signed-off-by: Steven Rostedt (Google)

tracing: Make tp_printk work on syscall tracepoints

2022-04-26T21:58:52+00:00

Currently the tp_printk option has no effect on syscall tracepoints.
When booting with the tp_printk kernel parameter and then enabling the
syscall events:

echo 1 > /sys/kernel/debug/tracing/events/syscalls/enable

no trace information is printed on the terminal when running any
application.

Add printk output for syscall tracepoints.

Link: https://lkml.kernel.org/r/20220410145025.681144-1-xiehuan09@gmail.com

Signed-off-by: Jeff Xie 
Signed-off-by: Steven Rostedt (Google)

tracing: Have syscall trace events use trace_event_buffer_lock_reserve()

2022-01-13T21:23:05+00:00

Currently, the syscall trace events call trace_buffer_lock_reserve()
directly, which means that they miss out on some of the filtering
optimizations provided by the helper function
trace_event_buffer_lock_reserve(). Have the syscall trace events call that
instead, as it was missed when adding the update to use the temp buffer
when filtering.
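
As a rough illustration of why going through the helper matters, the sketch below contrasts reserving ring buffer space before filtering with filtering a scratch copy first. It is a user-space model under assumptions: all model_* names, the fixed 8-slot ring and the "match one syscall number" filter are hypothetical, and the real helper's logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RING_SLOTS 8

struct model_event {
	int nr;
	long ret;
};

struct model_ring {
	struct model_event slots[MODEL_RING_SLOTS];
	size_t used;		/* committed events */
	unsigned int reserves;	/* reservations made in the ring */
	unsigned int discards;	/* reservations thrown away by the filter */
};

/* Hypothetical filter: keep only events for one syscall number. */
bool model_filter_match(const struct model_event *ev, int wanted_nr)
{
	return ev->nr == wanted_nr;
}

/* Direct path: reserve in the ring first, filter afterwards. */
bool model_record_direct(struct model_ring *rb, struct model_event ev, int wanted_nr)
{
	assert(rb != NULL);
	if (rb->used >= MODEL_RING_SLOTS)
		return false;

	rb->reserves++;
	rb->slots[rb->used] = ev;
	if (!model_filter_match(&rb->slots[rb->used], wanted_nr)) {
		rb->discards++;	/* reserved space has to be given back */
		return false;
	}
	rb->used++;
	return true;
}

/* Temp-buffer path: filter a scratch copy, touch the ring only on a match. */
bool model_record_via_temp(struct model_ring *rb, struct model_event ev, int wanted_nr)
{
	struct model_event scratch = ev;

	assert(rb != NULL);
	if (!model_filter_match(&scratch, wanted_nr))
		return false;
	if (rb->used >= MODEL_RING_SLOTS)
		return false;

	rb->reserves++;
	rb->slots[rb->used++] = scratch;
	return true;
}
```

With one filtered-out event and one matching event, the direct path performs two reservations and one discard, while the temp-buffer path performs a single reservation and no discard.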

Link: https://lkml.kernel.org/r/20220107225839.823118570@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar 
Cc: Andrew Morton 
Cc: Tom Zanussi 
Reviewed-by: Masami Hiramatsu 
Fixes: 0fc1b09ff1ff4 ("tracing: Use temp buffer when filtering events")
Signed-off-by: Steven Rostedt