path: root/kernel
Age  Commit message  Author
2018-01-17  bpf: prevent out-of-bounds speculation  [Alexei Starovoitov]
commit b2157399cc9898260d6031c5bfe45fe137c1fbe7 upstream.

Under speculation, CPUs may mis-predict branches in bounds checks. Thus, memory accesses under a bounds check may be speculated even if the bounds check fails, providing a primitive for building a side channel.

To avoid leaking kernel data round up array-based maps and mask the index after bounds check, so speculated load with out of bounds index will load either valid value from the array or zero from the padded area.

Unconditionally mask index for all array types even when max_entries are not rounded to power of 2 for root user. When map is created by unpriv user generate a sequence of bpf insns that includes AND operation to make sure that JITed code includes the same 'index & index_mask' operation.

If prog_array map is created by unpriv user replace

  bpf_tail_call(ctx, map, index);

with

  if (index >= max_entries) {
    index &= map->index_mask;
    bpf_tail_call(ctx, map, index);
  }

(along with roundup to power 2) to prevent out-of-bounds speculation. There is secondary redundant 'if (index >= max_entries)' in the interpreter and in all JITs, but they can be optimized later if necessary.

Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array) cannot be used by unpriv, so no changes there.

That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on all architectures with and without JIT.

v2->v3: Daniel noticed that attack potentially can be crafted via syscall commands without loading the program, so add masking to those paths as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Slaby <jslaby@suse.cz>
[ Backported to 4.9 - gregkh ]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
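Illustrative aside: the masking idea can be modelled in a few lines of ordinary C. This is only a sketch of the concept; the helper, array and variable names below are invented for illustration and are not the kernel's implementation.

  #include <stdint.h>
  #include <stdio.h>

  /* Round n up to the next power of two so that (size - 1) is a full bit mask. */
  static uint32_t round_up_pow2(uint32_t n)
  {
      uint32_t p = 1;

      while (p < n)
          p <<= 1;
      return p;
  }

  int main(void)
  {
      uint32_t max_entries = 5;                      /* requested map size */
      uint32_t padded = round_up_pow2(max_entries);  /* 8 */
      uint32_t index_mask = padded - 1;              /* 0x7 */
      uint64_t values[8] = { 10, 20, 30, 40, 50 };   /* padding reads as 0 */
      uint32_t index = 3;                            /* untrusted input */

      if (index >= max_entries)
          return 0;                /* architectural bounds check */

      /* Even if the branch above is mispredicted, a masked index can only
       * reach the zero-filled padding, never memory outside the array. */
      index &= index_mask;
      printf("%llu\n", (unsigned long long)values[index]);
      return 0;
  }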
2018-01-17  bpf: refactor fixup_bpf_calls()  [Alexei Starovoitov]
commit 79741b3bdec01a8628368fbcfccc7d189ed606cb upstream. reduce indent and make it iterate over instructions similar to convert_ctx_accesses(). Also convert hard BUG_ON into soft verifier error. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Jiri Slaby <jslaby@suse.cz> [Backported to 4.9.y - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17  bpf: move fixup_bpf_calls() function  [Alexei Starovoitov]
commit e245c5c6a5656e4d61aa7bb08e9694fd6e5b2b9d upstream. no functional change. move fixup_bpf_calls() to verifier.c it's being refactored in the next patch Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Jiri Slaby <jslaby@suse.cz> [backported to 4.9 - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10  kernel/signal.c: remove the no longer needed SIGNAL_UNKILLABLE check in complete_signal()  [Oleg Nesterov]
commit 426915796ccaf9c2bd9bb06dc5702225957bc2e5 upstream.

complete_signal() checks SIGNAL_UNKILLABLE before it starts to destroy the thread group, today this is wrong in many ways.

If nothing else, fatal_signal_pending() should always imply that the whole thread group (except ->group_exit_task if it is not NULL) is killed, this check breaks the rule.

After the previous changes we can rely on sig_task_ignored(); sig_fatal(sig) && SIGNAL_UNKILLABLE can only be true if we actually want to kill this task and sig == SIGKILL OR it is traced and debugger can intercept the signal.

This should hopefully fix the problem reported by Dmitry. This test-case

  static int init(void *arg)
  {
      for (;;)
          pause();
  }

  int main(void)
  {
      char stack[16 * 1024];

      for (;;) {
          int pid = clone(init, stack + sizeof(stack)/2,
                          CLONE_NEWPID | SIGCHLD, NULL);
          assert(pid > 0);

          assert(ptrace(PTRACE_ATTACH, pid, 0, 0) == 0);
          assert(waitpid(-1, NULL, WSTOPPED) == pid);

          assert(ptrace(PTRACE_DETACH, pid, 0, SIGSTOP) == 0);
          assert(syscall(__NR_tkill, pid, SIGKILL) == 0);
          assert(pid == wait(NULL));
      }
  }

triggers the WARN_ON_ONCE(!(task->jobctl & JOBCTL_STOP_PENDING)) in task_participate_group_stop(). do_signal_stop()->signal_group_exit() checks SIGNAL_GROUP_EXIT and returns false, but task_set_jobctl_pending() checks fatal_signal_pending() and does not set JOBCTL_STOP_PENDING.

And this should fix the minor security problem reported by Kyle: SECCOMP_RET_TRACE can miss fatal_signal_pending() the same way if the task is the root of a pid namespace.

Link: http://lkml.kernel.org/r/20171103184246.GD21036@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Kyle Huey <me@kylehuey.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10  kernel/signal.c: protect the SIGNAL_UNKILLABLE tasks from !sig_kernel_only() signals  [Oleg Nesterov]
commit ac25385089f673560867eb5179228a44ade0cfc1 upstream.

Change sig_task_ignored() to drop the SIG_DFL && !sig_kernel_only() signals even if force == T. This simplifies the next change and this matches the same check in get_signal() which will drop these signals anyway.

Link: http://lkml.kernel.org/r/20171103184227.GC21036@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10  kernel/signal.c: protect the traced SIGNAL_UNKILLABLE tasks from SIGKILL  [Oleg Nesterov]
commit 628c1bcba204052d19b686b5bac149a644cdb72e upstream.

The comment in sig_ignored() says "Tracers may want to know about even ignored signals" but SIGKILL can not be reported to debugger and it is just wrong to return 0 in this case: SIGKILL should only kill the SIGNAL_UNKILLABLE task if it comes from the parent ns.

Change sig_ignored() to ignore ->ptrace if sig == SIGKILL and rely on sig_task_ignored().

SIGSTOP coming from within the namespace is not really right too but at least debugger can intercept it, and we can't drop it here because this will break "gdb -p 1": ptrace_attach() won't work. Perhaps we will add another ->ptrace check later, we will see.

Link: http://lkml.kernel.org/r/20171103184206.GB21036@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10  kernel: make groups_sort calling a responsibility of group_info allocators  [Thiago Rafael Becker]
commit bdcf0a423ea1c40bbb40e7ee483b50fc8aa3d758 upstream.

In testing, we found that nfsd threads may call set_groups in parallel for the same entry cached in auth.unix.gid, racing in the call of groups_sort, corrupting the groups for that entry and leading to permission denials for the client.

This patch:
 - Make groups_sort globally visible.
 - Move the call to groups_sort to the modifiers of group_info
 - Remove the call to groups_sort from set_groups

Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10  kernel/acct.c: fix the acct->needcheck check in check_free_space()  [Oleg Nesterov]
commit 4d9570158b6260f449e317a5f9ed030c2504a615 upstream.

As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check is very wrong, we need time_is_after_jiffies() to make sys_acct() work.

Ignoring the overflows, the code should "goto out" if needcheck > jiffies, while currently it checks "needcheck < jiffies" and thus in the likely case check_free_space() does nothing until jiffies overflow.

In particular this means that sys_acct() is simply broken, acct_on() sets acct->needcheck = jiffies and expects that check_free_space() should set acct->active = 1 after the free-space check, but this won't happen if jiffies increments in between.

This was broken by commit 32dc73086015 ("get rid of timer in kern/acct.c") in 2011, then another (correct) commit 795a2f22a8ea ("acct() should honour the limits from the very beginning") made the problem more visible.

Link: http://lkml.kernel.org/r/20171213133940.GA6554@redhat.com
Fixes: 32dc73086015 ("get rid of timer in kern/acct.c")
Reported-by: TSUKADA Koutaro <tsukada@ascade.co.jp>
Suggested-by: TSUKADA Koutaro <tsukada@ascade.co.jp>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
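Illustrative aside: jiffies comparisons of this kind are written so that counter wraparound is handled by a signed subtraction. A minimal user-space model of the intended "skip until needcheck has passed" logic (the helper name and values here are invented, modelled on the kernel's time_after() trick):

  #include <stdbool.h>
  #include <stdio.h>

  /* "a is after b", tolerant of wraparound: the signed difference is
   * negative whenever a lies ahead of b on the circular counter. */
  static bool time_is_after(unsigned long a, unsigned long b)
  {
      return (long)(b - a) < 0;
  }

  int main(void)
  {
      unsigned long jiffies   = 1000;
      unsigned long needcheck = jiffies + 250;   /* re-check in 250 ticks */

      /* The corrected check: do nothing while needcheck is still in the future. */
      if (time_is_after(needcheck, jiffies)) {
          printf("too early, skip the free-space check\n");
          return 0;
      }
      printf("time to re-check free space\n");
      return 0;
  }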
2018-01-05  kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE  [Hugh Dickins]
Kaiser only needs to map one page of the stack; and kernel/fork.c did not build on powerpc (no __PAGE_KERNEL). It's all cleaner if linux/kaiser.h provides kaiser_map_thread_stack() and kaiser_unmap_thread_stack() wrappers around asm/kaiser.h's kaiser_add_mapping() and kaiser_remove_mapping(). And use linux/kaiser.h in init/main.c to avoid the #ifdefs there. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05  kaiser: merged update  [Dave Hansen]
Merged fixes and cleanups, rebased to 4.9.51 tree (no 5-level paging). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05  KAISER: Kernel Address Isolation  [Richard Fellner]
This patch introduces our implementation of KAISER (Kernel Address Isolation to have Side-channels Efficiently Removed), a kernel isolation technique to close hardware side channels on kernel address information. More information about the patch can be found on: https://github.com/IAIK/KAISER

From: Richard Fellner <richard.fellner@student.tugraz.at>
From: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode
Date: Thu, 4 May 2017 14:26:50 +0200
Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2
Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5
To: <linux-kernel@vger.kernel.org>
To: <kernel-hardening@lists.openwall.com>
Cc: <clementine.maurice@iaik.tugraz.at>
Cc: <moritz.lipp@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <kirill.shutemov@linux.intel.com>
Cc: <anders.fogh@gdata-adan.de>

After several recent works [1,2,3] KASLR on x86_64 was basically considered dead by many researchers. We have been working on an efficient but effective fix for this problem and found that not mapping the kernel space when running in user mode is the solution to this problem [4] (the corresponding paper [5] will be presented at ESSoS17).

With this RFC patch we allow anybody to configure their kernel with the flag CONFIG_KAISER to add our defense mechanism.

If there are any questions we would love to answer them. We also appreciate any comments!

Cheers,
Daniel (+ the KAISER team from Graz University of Technology)

[1] http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
[2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-CPU-Behaviour-To-See-Into-Kernel-Mode-And-Break-KASLR-In-The-Process.pdf
[3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX.pdf
[4] https://github.com/IAIK/KAISER
[5] https://gruss.cc/files/kaiser.pdf

[patch based also on https://raw.githubusercontent.com/IAIK/KAISER/master/KAISER/0001-KAISER-Kernel-Address-Isolation.patch]

Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02  nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()  [Thomas Gleixner]
commit 5d62c183f9e9df1deeea0906d099a94e8a43047a upstream.

The conditions in irq_exit() to invoke tick_nohz_irq_exit() which subsequently invokes tick_nohz_stop_sched_tick() are:

  if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu))

If need_resched() is not set, but a timer softirq is pending then this is an indication that the softirq code punted and delegated the execution to softirqd. need_resched() is not true because the current interrupted task takes precedence over softirqd.

Invoking tick_nohz_irq_exit() in this case can cause an endless loop of timer interrupts because the timer wheel contains an expired timer, but softirqs are not yet executed. So it returns an immediate expiry request, which causes the timer to fire immediately again. Lather, rinse and repeat....

Prevent that by adding a check for a pending timer soft interrupt to the conditions in tick_nohz_stop_sched_tick() which avoid calling get_next_timer_interrupt(). That keeps the tick sched timer on the tick and prevents a repetitive programming of an already expired timer.

Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02  timers: Reinitialize per cpu bases on hotplug  [Thomas Gleixner]
commit 26456f87aca7157c057de65c9414b37f1ab881d1 upstream.

The timer wheel bases are not (re)initialized on CPU hotplug. That leaves them with a potentially stale clk and next_expiry value, which can cause trouble when the CPU is plugged.

Add a prepare callback which forwards the clock, sets next_expiry to far in the future and resets the control flags to a known state.

Set base->must_forward_clk so the first timer which is queued will try to forward the clock to current jiffies.

Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272152200.2431@nanos
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02  timers: Invoke timer_start_debug() where it makes sense  [Thomas Gleixner]
commit fd45bb77ad682be728d1002431d77b8c73342836 upstream. The timer start debug function is called before the proper timer base is set. As a consequence the trace data contains the stale CPU and flags values. Call the debug function after setting the new base and flags. Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: rt@linutronix.de Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Link: https://lkml.kernel.org/r/20171222145337.792907137@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02  timers: Use deferrable base independent of base::nohz_active  [Anna-Maria Gleixner]
commit ced6d5c11d3e7b342f1a80f908e6756ebd4b8ddd upstream.

During boot and before base::nohz_active is set in the timer bases, deferrable timers are enqueued into the standard timer base. This works correctly as long as base::nohz_active is false.

Once base::nohz_active is set and a timer which was enqueued before that is accessed, the lock selector code chooses the lock of the deferred base. This causes unlocked access to the standard base and in case the timer is removed it does not clear the pending flag in the standard base bitmap which causes get_next_timer_interrupt() to return bogus values.

To prevent that, the deferrable timers must be enqueued in the deferrable base, even when base::nohz_active is not set. Those deferrable timers also need to be expired unconditionally.

Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: rt@linutronix.de
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20171222145337.633328378@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02  ring-buffer: Mask out the info bits when returning buffer page length  [Steven Rostedt (VMware)]
commit 45d8b80c2ac5d21cd1e2954431fb676bc2b1e099 upstream. Two info bits were added to the "commit" part of the ring buffer data page when returned to be consumed. This was to inform the user space readers that events have been missed, and that the count may be stored at the end of the page. What wasn't handled, was the splice code that actually called a function to return the length of the data in order to zero out the rest of the page before sending it up to user space. These data bits were returned with the length making the value negative, and that negative value was not checked. It was compared to PAGE_SIZE, and only used if the size was less than PAGE_SIZE. Luckily PAGE_SIZE is unsigned long which made the compare an unsigned compare, meaning the negative size value did not end up causing a large portion of memory to be randomly zeroed out. Fixes: 66a8cb95ed040 ("ring-buffer: Add place holder recording of dropped events") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02  tracing: Fix crash when it fails to alloc ring buffer  [Jing Xia]
commit 24f2aaf952ee0b59f31c3a18b8b36c9e3d3c2cf5 upstream.

Double free of the ring buffer happens when it fails to alloc new ring buffer instance for max_buffer if TRACER_MAX_TRACE is configured. The root cause is that the pointer is not set to NULL after the buffer is freed in allocate_trace_buffers(), and the freeing of the ring buffer is invoked again later if the pointer is not equal to Null, as:

  instance_mkdir()
  |-allocate_trace_buffers()
    |-allocate_trace_buffer(tr, &tr->trace_buffer...)
    |-allocate_trace_buffer(tr, &tr->max_buffer...)
      // allocate fail(-ENOMEM), first free
      // and the buffer pointer is not set to null
    |-ring_buffer_free(tr->trace_buffer.buffer)
  // out_free_tr
  |-free_trace_buffers()
    |-free_trace_buffer(&tr->trace_buffer);
      // if trace_buffer is not null, free again
      |-ring_buffer_free(buf->buffer)
        |-rb_free_cpu_buffer(buffer->buffers[cpu])
          // ring_buffer_per_cpu is null, and
          // crash in ring_buffer_per_cpu->pages

Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com
Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code")
Signed-off-by: Jing Xia <jing.xia@spreadtrum.com>
Signed-off-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02  tracing: Fix possible double free on failure of allocating trace buffer  [Steven Rostedt (VMware)]
commit 4397f04575c44e1440ec2e49b6302785c95fd2f8 upstream. Jing Xia and Chunyan Zhang reported that on failing to allocate part of the tracing buffer, memory is freed, but the pointers that point to them are not initialized back to NULL, and later paths may try to free the freed memory again. Jing and Chunyan fixed one of the locations that does this, but missed a spot. Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code") Reported-by: Jing Xia <jing.xia@spreadtrum.com> Reported-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02  tracing: Remove extra zeroing out of the ring buffer page  [Steven Rostedt (VMware)]
commit 6b7e633fe9c24682df550e5311f47fb524701586 upstream. The ring_buffer_read_page() takes care of zeroing out any extra data in the page that it returns. There's no need to zero it out again from the consumer. It was removed from one consumer of this function, but read_buffers_splice_read() did not remove it, and worse, it contained a nasty bug because of it. Fixes: 2711ca237a084 ("ring-buffer: Move zeroing out excess in page to ring buffer code") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-29  bpf/verifier: Fix states_equal() comparison of pointer and UNKNOWN  [Ben Hutchings]
An UNKNOWN_VALUE is not supposed to be derived from a pointer, unless pointer leaks are allowed. Therefore, states_equal() must not treat a state with a pointer in a register as "equal" to a state with an UNKNOWN_VALUE in that register. This was fixed differently upstream, but the code around here was largely rewritten in 4.14 by commit f1174f77b50c "bpf/verifier: rework value tracking". The bug can be detected by the bpf/verifier sub-test "pointer/scalar confusion in state equality check (way 1)". Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Edward Cree <ecree@solarflare.com> Cc: Jann Horn <jannh@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net>
2017-12-25  bpf: fix incorrect sign extension in check_alu_op()  [Daniel Borkmann]
From: Jann Horn <jannh@google.com> [ Upstream commit 95a762e2c8c942780948091f8f2a4f32fce1ac6f ] Distinguish between BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit) and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit); only perform sign extension in the first case. Starting with v4.14, this is exploitable by unprivileged users as long as the unprivileged_bpf_disabled sysctl isn't set. Debian assigned CVE-2017-16995 for this issue. v3: - add CVE number (Ben Hutchings) Fixes: 484611357c19 ("bpf: allow access into map value arrays") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
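Illustrative aside: the distinction can be shown with a stand-alone C model (this is not the verifier code; the variable names are invented). BPF_ALU64|BPF_MOV|BPF_K sign-extends the 32-bit immediate, while BPF_ALU|BPF_MOV|BPF_K zero-extends it:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      int32_t imm = -1;          /* 32-bit immediate encoded in the instruction */
      uint64_t reg;

      /* BPF_ALU64 | BPF_MOV | BPF_K: sign-extend the immediate. */
      reg = (int64_t)imm;
      printf("alu64 mov: 0x%016llx\n", (unsigned long long)reg);  /* ffffffffffffffff */

      /* BPF_ALU | BPF_MOV | BPF_K: zero-extend, upper 32 bits cleared. */
      reg = (uint32_t)imm;
      printf("alu32 mov: 0x%016llx\n", (unsigned long long)reg);  /* 00000000ffffffff */

      return 0;
  }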
2017-12-25  bpf: reject out-of-bounds stack pointer calculation  [Daniel Borkmann]
From: Jann Horn <jannh@google.com> Reject programs that compute wildly out-of-bounds stack pointers. Otherwise, pointers can be computed with an offset that doesn't fit into an `int`, causing security issues in the stack memory access check (as well as signed integer overflow during offset addition). This is a fix specifically for the v4.9 stable tree because the mainline code looks very different at this point. Fixes: 7bca0a9702edf ("bpf: enhance verifier to understand stack pointer arithmetic") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25  bpf: fix branch pruning logic  [Daniel Borkmann]
From: Alexei Starovoitov <ast@fb.com> [ Upstream commit c131187db2d3fa2f8bf32fdf4e9a4ef805168467 ] when the verifier detects that register contains a runtime constant and it's compared with another constant it will prune exploration of the branch that is guaranteed not to be taken at runtime. This is all correct, but malicious program may be constructed in such a way that it always has a constant comparison and the other branch is never taken under any conditions. In this case such path through the program will not be explored by the verifier. It won't be taken at run-time either, but since all instructions are JITed the malicious program may cause JITs to complain about using reserved fields, etc. To fix the issue we have to track the instructions explored by the verifier and sanitize instructions that are dead at run time with NOPs. We cannot reject such dead code, since llvm generates it for valid C code, since it doesn't do as much data flow analysis as the verifier does. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
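Illustrative aside: the sanitization idea is "mark every instruction reached during analysis, then overwrite the unreached ones with a harmless no-op so the JIT never sees them". A toy sketch of that pass (the insn/seen structures and the textual "nop" marker are invented for illustration, not the verifier's data structures):

  #include <stdbool.h>
  #include <stdio.h>

  struct insn {
      const char *text;   /* stand-in for a real BPF instruction */
      bool seen;          /* set while the analysis walks this insn */
  };

  int main(void)
  {
      struct insn prog[] = {
          { "r0 = 1",                true  },
          { "if r0 != 1 goto +1",    true  },
          { "r0 = *(u64 *)(r9 + 0)", false },  /* provably dead branch */
          { "exit",                  true  },
      };
      int i, n = sizeof(prog) / sizeof(prog[0]);

      /* Replace every instruction the analysis never reached with a no-op,
       * rather than rejecting the program (compilers emit such dead code). */
      for (i = 0; i < n; i++)
          if (!prog[i].seen)
              prog[i].text = "nop";

      for (i = 0; i < n; i++)
          printf("%d: %s\n", i, prog[i].text);
      return 0;
  }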
2017-12-25  bpf: adjust insn_aux_data when patching insns  [Daniel Borkmann]
From: Alexei Starovoitov <ast@fb.com> [ Upstream commit 8041902dae5299c1f194ba42d14383f734631009 ] convert_ctx_accesses() replaces single bpf instruction with a set of instructions. Adjust corresponding insn_aux_data while patching. It's needed to make sure subsequent 'for(all insn)' loops have matching insn and insn_aux_data. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25  tracing: Exclude 'generic fields' from histograms  [Tom Zanussi]
[ Upstream commit a15f7fc20389a8827d5859907568b201234d4b79 ]

There are a small number of 'generic fields' (comm/COMM/cpu/CPU) that are found by trace_find_event_field() but are only meant for filtering. Specifically, unlike normal fields, they have a size of 0 and thus wreak havoc when used as a histogram key.

Exclude these (return -EINVAL) when used as histogram keys.

Link: http://lkml.kernel.org/r/956154cbc3e8a4f0633d619b886c97f0f0edf7b4.1506105045.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20  sched/deadline: Use deadline instead of period when calculating overflow  [Steven Rostedt (VMware)]
[ Upstream commit 2317d5f1c34913bac5971d93d69fb6c31bb74670 ]

I was testing Daniel's changes with his test case, and tweaked it a little. Instead of having the runtime equal to the deadline, I increased the deadline ten fold.

Daniel's test case had:

  attr.sched_runtime = 2 * 1000 * 1000;           /* 2 ms */
  attr.sched_deadline = 2 * 1000 * 1000;          /* 2 ms */
  attr.sched_period = 2 * 1000 * 1000 * 1000;     /* 2 s */

To make it more interesting, I changed it to:

  attr.sched_runtime = 2 * 1000 * 1000;           /* 2 ms */
  attr.sched_deadline = 20 * 1000 * 1000;         /* 20 ms */
  attr.sched_period = 2 * 1000 * 1000 * 1000;     /* 2 s */

The results were rather surprising. The behavior that Daniel's patch was fixing came back. The task started using much more than .1% of the CPU. More like 20%.

Looking into this I found that it was due to the dl_entity_overflow() constantly returning true. That's because it uses the relative period against relative runtime vs the absolute deadline against absolute runtime.

  runtime / (deadline - t) > dl_runtime / dl_period

There's even a comment mentioning this, and saying that when relative deadline equals relative period, that the equation is the same as using deadline instead of period. That comment is backwards! What we really want is:

  runtime / (deadline - t) > dl_runtime / dl_deadline

We care about if the runtime can make its deadline, not its period. And then we can say "when the deadline equals the period, the equation is the same as using dl_period instead of dl_deadline".

After correcting this, now when the task gets enqueued, it can throttle correctly, and Daniel's fix to the throttling of sleeping deadline tasks works even when the runtime and deadline are not the same.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
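Illustrative aside: the corrected admission test compares runtime density against dl_deadline rather than dl_period. A plain-C sketch of that comparison (not the scheduler code; it ignores the scaling the kernel applies to keep the 64-bit multiplications from overflowing, and the numbers are made up):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Cross-multiplied form of:  runtime / (deadline - t) > dl_runtime / dl_deadline
   * All values in nanoseconds; dl_* are the task's static parameters. */
  static bool dl_overflow(uint64_t runtime, uint64_t deadline, uint64_t t,
                          uint64_t dl_runtime, uint64_t dl_deadline)
  {
      if (deadline <= t)
          return true;            /* absolute deadline already passed */
      return runtime * dl_deadline > (deadline - t) * dl_runtime;
  }

  int main(void)
  {
      /* 2 ms runtime per 20 ms relative deadline; woken with 1 ms of runtime
       * left and 15 ms until the absolute deadline: fits, prints 0. */
      printf("%d\n", dl_overflow(1000000, 20000000, 5000000,
                                 2000000, 20000000));
      return 0;
  }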
2017-12-20  sched/deadline: Throttle a constrained deadline task activated after the deadline  [Daniel Bristot de Oliveira]
[ Upstream commit df8eac8cafce7d086be3bd5cf5a838fa37594dfb ]

During the activation, CBS checks if it can reuse the current task's runtime and period. If the deadline of the task is in the past, CBS cannot use the runtime, and so it replenishes the task. This rule works fine for implicit deadline tasks (deadline == period), and the CBS was designed for implicit deadline tasks.

However, a task with constrained deadline (deadline < period) might be awakened after the deadline, but before the next period. In this case, replenishing the task would allow it to run for runtime / deadline. As in this case deadline < period, CBS enables a task to run for more than the runtime / period. In a very loaded system, this can cause a domino effect, making other tasks miss their deadlines.

To avoid this problem, in the activation of a constrained deadline task after the deadline but before the next period, throttle the task and set the replenishing timer to the beginning of the next period, unless it is boosted.

Reproducer:

 --------------- %< ---------------
  int main (int argc, char **argv)
  {
      int ret;
      int flags = 0;
      unsigned long l = 0;
      struct timespec ts;
      struct sched_attr attr;

      memset(&attr, 0, sizeof(attr));
      attr.size = sizeof(attr);

      attr.sched_policy = SCHED_DEADLINE;
      attr.sched_runtime = 2 * 1000 * 1000;           /* 2 ms */
      attr.sched_deadline = 2 * 1000 * 1000;          /* 2 ms */
      attr.sched_period = 2 * 1000 * 1000 * 1000;     /* 2 s */

      ts.tv_sec = 0;
      ts.tv_nsec = 2000 * 1000;                       /* 2 ms */

      ret = sched_setattr(0, &attr, flags);

      if (ret < 0) {
          perror("sched_setattr");
          exit(-1);
      }

      for(;;) {
          /* XXX: you may need to adjust the loop */
          for (l = 0; l < 150000; l++);
          /*
           * The idea is to go to sleep right before the deadline
           * and then wake up before the next period to receive
           * a new replenishment.
           */
          nanosleep(&ts, NULL);
      }

      exit(0);
  }
 --------------- >% ---------------

On my box, this reproducer uses almost 50% of the CPU time, which is obviously wrong for a task with 2/2000 reservation.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20  sched/deadline: Make sure the replenishment timer fires in the next period  [Daniel Bristot de Oliveira]
[ Upstream commit 5ac69d37784b237707a7b15d199cdb6c6fdb6780 ]

Currently, the replenishment timer is set to fire at the deadline of a task. Although that works for implicit deadline tasks because the deadline is equal to the beginning of the next period, that is not correct for constrained deadline tasks (deadline < period).

For instance:

f.c:
 --------------- %< ---------------
 int main (void)
 {
     for(;;);
 }
 --------------- >% ---------------

 # gcc -o f f.c
 # trace-cmd record -e sched:sched_switch \
                    -e syscalls:sys_exit_sched_setattr \
                    chrt -d --sched-runtime 490000000 \
                            --sched-deadline 500000000 \
                            --sched-period 1000000000 0 ./f
 # trace-cmd report | grep "{pid of ./f}"

After setting parameters, the task is replenished and continues running until being throttled:

  f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0

The task is throttled after running 492318 ms, as expected:

  f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0]

But then, the task is replenished 500719 ms after the first replenishment:

  <idle>-0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1]

Running for 490277 ms:

  f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120]

Hence, in the first period, the task runs 2 * runtime, and that is a bug.

During the first replenishment, the next deadline is set one period away. So the runtime / period starts to be respected. However, as the second replenishment took place in the wrong instant, the next replenishment will also be held in a wrong instant of time. Rather than occurring in the nth period away from the first activation, it is taking place in the (nth period - relative deadline).

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20  sched/deadline: Add missing update_rq_clock() in dl_task_timer()  [Wanpeng Li]
[ Upstream commit dcc3b5ffe1b32771c9a22e2c916fb94c4fcf5b79 ]

The following warning can be triggered by hot-unplugging the CPU on which an active SCHED_DEADLINE task is running on:

 ------------[ cut here ]------------
 WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
 Call Trace:
  <IRQ>
  dump_stack+0x85/0xc4
  __warn+0x172/0x1b0
  warn_slowpath_fmt+0xb4/0xf0
  ? __warn+0x1b0/0x1b0
  ? debug_check_no_locks_freed+0x2c0/0x2c0
  ? cpudl_set+0x3d/0x2b0
  replenish_dl_entity+0x71e/0xc40
  enqueue_task_dl+0x2ea/0x12e0
  ? dl_task_timer+0x777/0x990
  ? __hrtimer_run_queues+0x270/0xa50
  dl_task_timer+0x316/0x990
  ? enqueue_task_dl+0x12e0/0x12e0
  ? enqueue_task_dl+0x12e0/0x12e0
  __hrtimer_run_queues+0x270/0xa50
  ? hrtimer_cancel+0x20/0x20
  ? hrtimer_interrupt+0x119/0x600
  hrtimer_interrupt+0x19c/0x600
  ? trace_hardirqs_off+0xd/0x10
  local_apic_timer_interrupt+0x74/0xe0
  smp_apic_timer_interrupt+0x76/0xa0
  apic_timer_interrupt+0x93/0xa0

The DL task will be migrated to a suitable later deadline rq once the DL timer fires and current rq is offline. The rq clock of the new rq should be updated. This patch fixes it by updating the rq clock after holding the new rq's rq lock.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20  sched/rt: Do not pull from current CPU if only one CPU to pull  [Steven Rostedt]
commit f73c52a5bcd1710994e53fbccc378c42b97a06b6 upstream. Daniel Wagner reported a crash on the BeagleBone Black SoC. This is a single CPU architecture, and does not have a functional arch_send_call_function_single_ipi() implementation which can crash the kernel if that is called. As it only has one CPU, it shouldn't be called, but if the kernel is compiled for SMP, the push/pull RT scheduling logic now calls it for irq_work if the one CPU is overloaded, it can use that function to call itself and crash the kernel. Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system only has a single CPU. But SCHED_FEAT is a constant if sched debugging is turned off. Another fix can also be used, and this should also help with normal SMP machines. That is, do not initiate the pull code if there's only one RT overloaded CPU, and that CPU happens to be the current CPU that is scheduling in a lower priority task. Even on a system with many CPUs, if there's many RT tasks waiting to run on a single CPU, and that CPU schedules in another RT task of lower priority, it will initiate the PULL logic in case there's a higher priority RT task on another CPU that is waiting to run. But if there is no other CPU with waiting RT tasks, it will initiate the RT pull logic on itself (as it still has RT tasks waiting to run). This is a wasted effort. Not only does this help with SMP code where the current CPU is the only one with RT overloaded tasks, it should also solve the issue that Daniel encountered, because it will prevent the PULL logic from executing, as there's only one CPU on the system, and the check added here will cause it to exit the RT pull code. Reported-by: Daniel Wagner <wagi@monom.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic") Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20  tracing: Allocate mask_str buffer dynamically  [Changbin Du]
commit 90e406f96f630c07d631a021fd4af10aac913e77 upstream.

The default NR_CPUS can be very large, but actual possible nr_cpu_ids usually is very small. For my x86 distribution, the NR_CPUS is 8192 and nr_cpu_ids is 4. About 2 pages are wasted. Most machines don't have so many CPUs, so defining an array with NR_CPUS just wastes memory. So let's allocate the buffer dynamically when needed.

With this change, the mutex tracing_cpumask_update_lock also can be removed now, which was used to protect mask_str.

Link: http://lkml.kernel.org/r/1512013183-19107-1-git-send-email-changbin.du@intel.com
Fixes: 36dfe9252bd4c ("ftrace: make use of tracing_cpumask")
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-16  audit: ensure that 'audit=1' actually enables audit for PID 1  [Paul Moore]
[ Upstream commit 173743dd99a49c956b124a74c8aacb0384739a4c ] Prior to this patch we enabled audit in audit_init(), which is too late for PID 1 as the standard initcalls are run after the PID 1 task is forked. This means that we never allocate an audit_context (see audit_alloc()) for PID 1 and therefore miss a lot of audit events generated by PID 1. This patch enables audit as early as possible to help ensure that when PID 1 is forked it can allocate an audit_context if required. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14  jump_label: Invoke jump_label_test() via early_initcall()  [Jason Baron]
[ Upstream commit 92ee46efeb505ead3ab06d3c5ce695637ed5f152 ]

Fengguang Wu reported that running the rcuperf test during boot can cause the jump_label_test() to hit a WARN_ON(). The issue is that the core jump label code relies on kernel_text_address() to detect when it can no longer update branches that may be contained in __init sections. The kernel_text_address() in turn assumes that if the system_state variable is greater than or equal to SYSTEM_RUNNING then __init sections are no longer valid (since the assumption is that they have been freed). However, when rcuperf is setup to run in early boot it can call kernel_power_off() which sets the system_state to SYSTEM_POWER_OFF.

Since rcuperf initialization is invoked via a module_init(), we can make the dependency of jump_label_test() needing to complete before rcuperf explicit by calling it via early_initcall().

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1510609727-2238-1-git-send-email-jbaron@akamai.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14  bpf: fix lockdep splat  [Eric Dumazet]
[ Upstream commit 89ad2fa3f043a1e8daae193bcb5fe34d5f8caf28 ]

pcpu_freelist_pop() needs the same lockdep awareness as pcpu_freelist_populate() to avoid a false positive:

 [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]

 switchto-defaul/12508 [HC0[0]:SC0[6]:HE0:SE0] is trying to acquire:
  (&htab->buckets[i].lock){......}, at: [<ffffffff9dc099cb>] __htab_percpu_map_update_elem+0x1cb/0x300

 and this task is already holding:
  (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}, at: [<ffffffff9e135848>] __dev_queue_xmit+0x868/0x1240
 which would create a new lock dependency:
  (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} -> (&htab->buckets[i].lock){......}

 but this new dependency connects a SOFTIRQ-irq-safe lock:
  (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}
 ... which became SOFTIRQ-irq-safe at:
   [<ffffffff9db5931b>] __lock_acquire+0x42b/0x1f10
   [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
   [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
   [<ffffffff9e135848>] __dev_queue_xmit+0x868/0x1240
   [<ffffffff9e136240>] dev_queue_xmit+0x10/0x20
   [<ffffffff9e1965d9>] ip_finish_output2+0x439/0x590
   [<ffffffff9e197410>] ip_finish_output+0x150/0x2f0
   [<ffffffff9e19886d>] ip_output+0x7d/0x260
   [<ffffffff9e19789e>] ip_local_out+0x5e/0xe0
   [<ffffffff9e197b25>] ip_queue_xmit+0x205/0x620
   [<ffffffff9e1b8398>] tcp_transmit_skb+0x5a8/0xcb0
   [<ffffffff9e1ba152>] tcp_write_xmit+0x242/0x1070
   [<ffffffff9e1baffc>] __tcp_push_pending_frames+0x3c/0xf0
   [<ffffffff9e1b3472>] tcp_rcv_established+0x312/0x700
   [<ffffffff9e1c1acc>] tcp_v4_do_rcv+0x11c/0x200
   [<ffffffff9e1c3dc2>] tcp_v4_rcv+0xaa2/0xc30
   [<ffffffff9e191107>] ip_local_deliver_finish+0xa7/0x240
   [<ffffffff9e191a36>] ip_local_deliver+0x66/0x200
   [<ffffffff9e19137d>] ip_rcv_finish+0xdd/0x560
   [<ffffffff9e191e65>] ip_rcv+0x295/0x510
   [<ffffffff9e12ff88>] __netif_receive_skb_core+0x988/0x1020
   [<ffffffff9e130641>] __netif_receive_skb+0x21/0x70
   [<ffffffff9e1306ff>] process_backlog+0x6f/0x230
   [<ffffffff9e132129>] net_rx_action+0x229/0x420
   [<ffffffff9da07ee8>] __do_softirq+0xd8/0x43d
   [<ffffffff9e282bcc>] do_softirq_own_stack+0x1c/0x30
   [<ffffffff9dafc2f5>] do_softirq+0x55/0x60
   [<ffffffff9dafc3a8>] __local_bh_enable_ip+0xa8/0xb0
   [<ffffffff9db4c727>] cpu_startup_entry+0x1c7/0x500
   [<ffffffff9daab333>] start_secondary+0x113/0x140

 to a SOFTIRQ-irq-unsafe lock:
  (&head->lock){+.+...}
 ... which became SOFTIRQ-irq-unsafe at:
 ...
   [<ffffffff9db5971f>] __lock_acquire+0x82f/0x1f10
   [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
   [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
   [<ffffffff9dc0b7fa>] pcpu_freelist_pop+0x7a/0xb0
   [<ffffffff9dc08b2c>] htab_map_alloc+0x50c/0x5f0
   [<ffffffff9dc00dc5>] SyS_bpf+0x265/0x1200
   [<ffffffff9e28195f>] entry_SYSCALL_64_fastpath+0x12/0x17

 other info that might help us debug this:

 Chain exists of:
   dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2 --> &htab->buckets[i].lock --> &head->lock

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&head->lock);
                               local_irq_disable();
                               lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
                               lock(&htab->buckets[i].lock);
  <Interrupt>
    lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);

 *** DEADLOCK ***

Fixes: e19494edab82 ("bpf: introduce percpu_freelist")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14  workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq  [Tejun Heo]
[ Upstream commit 637fdbae60d6cb9f6e963c1079d7e0445c86ff7d ] If queue_delayed_work() gets called with NULL @wq, the kernel will oops asynchronuosly on timer expiration which isn't too helpful in tracking down the offender. This actually happened with smc. __queue_delayed_work() already does several input sanity checks synchronously. Add NULL @wq check. Reported-by: Dave Jones <davej@codemonkey.org.uk> Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
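Illustrative aside: the pattern being added is a synchronous sanity check, so the failure is attributed to the caller instead of a timer firing much later. A generic user-space sketch of the same idea (the warn_on macro and queue type below are invented, not the workqueue API):

  #include <stdbool.h>
  #include <stdio.h>

  struct workqueue;   /* opaque handle; the details do not matter here */

  #define warn_on(cond) \
      ((cond) ? (fprintf(stderr, "WARNING: %s at %s:%d\n", #cond, \
                         __FILE__, __LINE__), true) : false)

  static bool queue_delayed(struct workqueue *wq, void (*fn)(void), long delay_ms)
  {
      /* Reject bad input immediately, in the caller's context, rather than
       * letting a later timer dereference a NULL queue asynchronously. */
      if (warn_on(wq == NULL) || warn_on(fn == NULL))
          return false;

      printf("queued work to run after %ld ms\n", delay_ms);
      return true;
  }

  static void my_work(void) { }

  int main(void)
  {
      queue_delayed(NULL, my_work, 100);   /* warns and returns false */
      return 0;
  }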
2017-12-14  sched/fair: Make select_idle_cpu() more aggressive  [Peter Zijlstra]
[ Upstream commit 4c77b18cf8b7ab37c7d5737b4609010d2ceec5f0 ] Kitsunyan reported desktop latency issues on his Celeron 887 because of commit: 1b568f0aabf2 ("sched/core: Optimize SCHED_SMT") ... even though his CPU doesn't do SMT. The effect of running the SMT code on a !SMT part is basically a more aggressive select_idle_cpu(). Removing the avg condition fixed things for him. I also know FB likes this test gone, even though other workloads like having it. For now, take it out by default, until we get a better idea. Reported-by: kitsunyan <kitsunyan@inbox.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14  kdb: Fix handling of kallsyms_symbol_next() return value  [Daniel Thompson]
commit c07d35338081d107e57cf37572d8cc931a8e32e2 upstream. kallsyms_symbol_next() returns a boolean (true on success). Currently kdb_read() tests the return value with an inequality that unconditionally evaluates to true. This is fixed in the obvious way and, since the conditional branch is supposed to be unreachable, we also add a WARN_ON(). Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
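Illustrative aside: a small stand-alone model of the bug class and the fix (not the kdb code; the helper and strings are invented). The lookup helper returns a boolean, so the caller has to test it directly; a comparison like "result >= 0" is always true for a boolean and silently hides the failure branch, which the fix also makes loud:

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  /* Toy stand-in for the symbol completion helper: returns true on success. */
  static bool symbol_next(const char *prefix, char *buf, size_t len)
  {
      if (strncmp("printk", prefix, strlen(prefix)) == 0) {
          snprintf(buf, len, "printk");
          return true;
      }
      return false;
  }

  int main(void)
  {
      char buf[64];
      bool ok = symbol_next("pri", buf, sizeof(buf));

      /* Buggy pattern: "if (ok >= 0)" is unconditionally true for a bool.
       * Fixed pattern: test the boolean itself and warn on the branch that
       * is supposed to be unreachable. */
      if (!ok) {
          fprintf(stderr, "WARNING: completion failed unexpectedly\n");
          return 1;
      }
      printf("completed to %s\n", buf);
      return 0;
  }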
2017-12-14  smp/hotplug: Move step CPUHP_AP_SMPCFD_DYING to the correct place  [Lai Jiangshan]
commit 46febd37f9c758b05cd25feae8512f22584742fe upstream. Commit 31487f8328f2 ("smp/cfd: Convert core to hotplug state machine") accidently put this step on the wrong place. The step should be at the cpuhp_ap_states[] rather than the cpuhp_bp_states[]. grep smpcfd /sys/devices/system/cpu/hotplug/states 40: smpcfd:prepare 129: smpcfd:dying "smpcfd:dying" was missing before. So was the invocation of the function smpcfd_dying_cpu(). Fixes: 31487f8328f2 ("smp/cfd: Convert core to hotplug state machine") Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lkml.kernel.org/r/20171128131954.81229-1-jiangshanlai@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-09  perf/x86/intel: Account interrupts for PEBS errors  [Jiri Olsa]
[ Upstream commit 475113d937adfd150eb82b5e2c5507125a68e7af ]

It's possible to set up PEBS events to get only errors and not any data, like on SNB-X (model 45) and IVB-EP (model 62) via 2 perf commands running simultaneously:

  taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

This leads to a soft lock up, because the error path of the intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt for error PEBS interrupts, so in case you're getting ONLY errors you don't have a way to stop the event when it's over the max_samples_per_tick limit:

  NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
  ...
  RIP: 0010:[<ffffffff81159232>] [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
  ...
  Call Trace:
   ? trace_hardirqs_on_caller+0xf5/0x1b0
   ? perf_cgroup_attach+0x70/0x70
   perf_install_in_context+0x199/0x1b0
   ? ctx_resched+0x90/0x90
   SYSC_perf_event_open+0x641/0xf90
   SyS_perf_event_open+0x9/0x10
   do_syscall_64+0x6c/0x1f0
   entry_SYSCALL64_slow_path+0x25/0x25

Add perf_event_account_interrupt() which does the interrupt and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s error path.

We keep the pending_kill and pending_wakeup logic only in the __perf_event_overflow() path, because they make sense only if there's any data to deliver.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30  sched/rt: Simplify the IPI based RT balancing logic  [Steven Rostedt (Red Hat)]
commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

When a CPU lowers its priority (schedules out a high priority task for a lower priority one), a check is made to see if any other CPU has overloaded RT tasks (more than one). It checks the rto_mask to determine this and if so it will request to pull one of those tasks to itself if the non running RT task is of higher priority than the new priority of the next task to run on the current CPU.

When we deal with a large number of CPUs, the original pull logic suffered from large lock contention on a single CPU run queue, which caused a huge latency across all CPUs. This was caused by only having one CPU having overloaded RT tasks and a bunch of other CPUs lowering their priority. To solve this issue, commit:

  b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

changed the way to request a pull. Instead of grabbing the lock of the overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build up, it still could suffer from a large number of IPIs being sent to a single CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet, when I tested this on a 120 CPU box, with a stress test that had lots of RT tasks scheduling on all CPUs, it actually triggered the hard lockup detector! One CPU had so many IPIs sent to it, and due to the restart mechanism that is triggered when the source run queue has a priority status change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue to send its own IPI. As all CPUs with overloaded tasks must be scanned regardless if there's one or many CPUs lowering their priority, because there's no current way to find the CPU with the highest priority task that can schedule to one of these CPUs, there really only needs to be one IPI being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the rto_mask is per root domain. The root domain has the following fields attached to it:

  rto_push_work  - the irq work to process each CPU set in rto_mask
  rto_lock       - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock; the
                   first CPU scheduling in a lower priority task is the
                   one to kick off the process.
  rto_loop_next  - an atomic that gets incremented for each CPU that
                   schedules in a lower priority task.
  rto_loop       - a variable protected by rto_lock that is used to
                   compare against rto_loop_next
  rto_cpu        - the CPU to send the next IPI to, also protected by
                   the rto_lock.

When a CPU schedules in a lower priority task and wants to make sure overloaded CPUs know about it, it increments rto_loop_next. Then it atomically sets rto_loop_start with a cmpxchg. If the old value is not "0", then it is done, as another CPU is kicking off the IPI loop. If the old value is "0", then it will take the rto_lock to synchronize with a possible IPI being sent around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no IPI being sent around, or one is about to finish. Then rto_cpu is set to the first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks that are queued on the CPU but can't run because a higher priority RT task is currently running on that CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it finds one, it simply sends an IPI to that CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with rto_loop_next. If they match, everything is done and the process is over. If they do not match, then a CPU scheduled in a lower priority task as the IPI was being passed around, and the process needs to start again. The first CPU in rto_mask is sent the IPI.

This change removes this duplication of work in the IPI logic, and greatly lowers the latency caused by the IPIs. This removed the lockup happening on the 120 CPU machine. It also simplifies the code tremendously. What else could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and supplying me with the rto_start_trylock() and rto_start_unlock() helper functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30sched: Make resched_cpu() unconditionalPaul E. McKenney
commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.

The current implementation of synchronize_sched_expedited() incorrectly assumes that resched_cpu() is unconditional, which it is not. This means that synchronize_sched_expedited() can hang when resched_cpu()'s trylock fails as follows (analysis by Neeraj Upadhyay):

o   CPU1 is waiting for expedited wait to complete:

        sync_rcu_exp_select_cpus
            rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
            IPI sent to CPU5

        synchronize_sched_expedited_wait
            ret = swait_event_timeout(rsp->expedited_wq,
                                      sync_rcu_preempt_exp_done(rnp_root),
                                      jiffies_stall);

        expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

o   CPU5 handles IPI and fails to acquire rq lock.

        Handles IPI
            sync_sched_exp_handler
                resched_cpu
                    returns while failing to try lock acquire rq->lock
            need_resched is not set

o   CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to idle (schedule() is not called).

o   CPU 1 reports RCU stall.

Given that resched_cpu() is now used only by RCU, this commit fixes the assumption by making resched_cpu() unconditional.

Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
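For illustration, the shape of the change modelled with pthreads; rq_lock, set_need_resched() and the two function names are stand-ins of this sketch, not the kernel's resched_cpu():

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
    static bool need_resched;

    static void set_need_resched(void)
    {
        need_resched = true;
    }

    /* Old behaviour: silently does nothing if the lock is contended. */
    static void resched_cpu_conditional(void)
    {
        if (pthread_mutex_trylock(&rq_lock) != 0)
            return;          /* lost the race: the target CPU may idle forever */
        set_need_resched();
        pthread_mutex_unlock(&rq_lock);
    }

    /* New behaviour: always waits for the lock, so the flag is always set. */
    static void resched_cpu_unconditional(void)
    {
        pthread_mutex_lock(&rq_lock);
        set_need_resched();
        pthread_mutex_unlock(&rq_lock);
    }

    int main(void)
    {
        resched_cpu_conditional();      /* succeeds here, but would bail out under contention */
        resched_cpu_unconditional();    /* always sets the flag */
        return need_resched ? 0 : 1;
    }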
2017-11-15workqueue: Fix NULL pointer dereferenceLi Bin
commit cef572ad9bd7f85035ba8272e5352040e8be0152 upstream.

When queue_work() is used in irq context (not in task context), there is a potential case that can trigger a NULL pointer dereference.

----------------------------------------------------------------
worker_thread()
|-spin_lock_irq()
|-process_one_work()
    |-worker->current_pwq = pwq
    |-spin_unlock_irq()
    |-worker->current_func(work)
    |-spin_lock_irq()
    |-worker->current_pwq = NULL
|-spin_unlock_irq()
                //interrupt here
                |-irq_handler
                    |-__queue_work()
                        //assuming that the wq is draining
                        |-is_chained_work(wq)
                            |-current_wq_worker()
                            //Here, 'current' is the interrupted worker!
                            |-current->current_pwq is NULL here!
|-schedule()
----------------------------------------------------------------

Avoid this by checking for task context in current_wq_worker(): if not in task context, 'current' should not be used to check the condition.

Reported-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Li Bin <huawei.libin@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8d03ecfe4718 ("workqueue: reimplement is_chained_work() using current_wq_worker()")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
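For illustration, a self-contained sketch of the guard the fix adds: only trust 'current' to identify a workqueue worker when running in task context. in_task(), current, PF_WQ_WORKER and the worker lookup are mocked here so the sketch compiles on its own; only the shape of the check mirrors the fix:

    #include <stdbool.h>
    #include <stddef.h>

    #define PF_WQ_WORKER 0x00000020   /* flag value chosen for this model */

    struct worker { int id; };

    struct task_struct {
        unsigned int flags;
        struct worker *worker_data;   /* what kthread_data() would return */
    };

    static struct task_struct *current;   /* set by the "scheduler" below */
    static bool in_irq_context;           /* set by the "interrupt entry" */

    static bool in_task(void)
    {
        return !in_irq_context;
    }

    static struct worker *current_wq_worker(void)
    {
        /*
         * In irq context 'current' is whoever happened to be interrupted,
         * so its worker state (e.g. current_pwq) cannot be trusted: bail.
         */
        if (in_task() && current && (current->flags & PF_WQ_WORKER))
            return current->worker_data;
        return NULL;
    }

    int main(void)
    {
        struct worker w = { .id = 1 };
        struct task_struct worker_task = { .flags = PF_WQ_WORKER, .worker_data = &w };

        current = &worker_task;

        in_irq_context = true;                            /* queue_work() from an irq */
        struct worker *from_irq = current_wq_worker();    /* NULL */

        in_irq_context = false;                           /* ordinary task context */
        struct worker *from_task = current_wq_worker();   /* &w */

        return (from_irq == NULL && from_task == &w) ? 0 : 1;
    }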
2017-11-15sched/core: Add missing update_rq_clock() call in sched_move_task()Peter Zijlstra
[ Upstream commit 1b1d62254df0fe42a711eb71948f915918987790 ]

Bug was noticed via this warning:

  WARNING: CPU: 6 PID: 1 at kernel/sched/sched.h:804 detach_task_cfs_rq+0x8e8/0xb80
  rq->clock_update_flags < RQCF_ACT_SKIP
  Modules linked in:
  CPU: 6 PID: 1 Comm: systemd Not tainted 4.10.0-rc5-00140-g0874170baf55-dirty #1
  Hardware name: Supermicro SYS-4048B-TRFT/X10QBi, BIOS 1.0 04/11/2014
  Call Trace:
   dump_stack+0x4d/0x65
   __warn+0xcb/0xf0
   warn_slowpath_fmt+0x5f/0x80
   detach_task_cfs_rq+0x8e8/0xb80
   ? allocate_cgrp_cset_links+0x59/0x80
   task_change_group_fair+0x27/0x150
   sched_change_group+0x48/0xf0
   sched_move_task+0x53/0x150
   cpu_cgroup_attach+0x36/0x70
   cgroup_taskset_migrate+0x175/0x300
   cgroup_migrate+0xab/0xd0
   cgroup_attach_task+0xf0/0x190
   __cgroup_procs_write+0x1ed/0x2f0
   cgroup_procs_write+0x14/0x20
   cgroup_file_write+0x3f/0x100
   kernfs_fop_write+0x104/0x180
   __vfs_write+0x37/0x140
   vfs_write+0xb8/0x1b0
   SyS_write+0x55/0xc0
   do_syscall_64+0x61/0x170
   entry_SYSCALL64_slow_path+0x25/0x25

Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
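For illustration, a small stand-alone model of the ordering the patch enforces: refresh the rq clock right after taking rq->lock and before any dequeue/attach/enqueue work that reads it. The lock, clock and task helpers below are stubs of this sketch, not the scheduler's:

    #include <pthread.h>
    #include <time.h>

    struct rq {
        pthread_mutex_t lock;     /* models rq->lock */
        long long clock_ns;       /* models rq->clock */
    };

    static void update_rq_clock(struct rq *rq)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        rq->clock_ns = ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Stubs: the real helpers read rq->clock while detaching/attaching. */
    static void dequeue_task(struct rq *rq)  { (void)rq; }
    static void change_group(void)           { }
    static void enqueue_task(struct rq *rq)  { (void)rq; }

    static void sched_move_task_model(struct rq *rq)
    {
        pthread_mutex_lock(&rq->lock);
        update_rq_clock(rq);      /* the line the patch adds: refresh the
                                   * clock before anything that consumes it */
        dequeue_task(rq);
        change_group();
        enqueue_task(rq);
        pthread_mutex_unlock(&rq->lock);
    }

    int main(void)
    {
        struct rq rq = { .lock = PTHREAD_MUTEX_INITIALIZER };

        sched_move_task_model(&rq);
        return 0;
    }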
2017-11-02workqueue: replace pool->manager_arb mutex with a flagTejun Heo
commit 692b48258dda7c302e777d7d5f4217244478f1f6 upstream.

Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by lockdep:

 [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
 [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
 [ 1270.473240] -----------------------------------------------------
 [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
 [ 1270.474994]
 [ 1270.474994] and this task is already holding:
 [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
 [ 1270.476046] which would create a new lock dependency:
 [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
 [ 1270.476949]
 [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
 [ 1270.477553]  (&pool->lock/1){-.-.}
 ...
 [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
 [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
 ...
 [ 1270.494735]  Possible interrupt unsafe locking scenario:
 [ 1270.494735]
 [ 1270.495250]        CPU0                    CPU1
 [ 1270.495600]        ----                    ----
 [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
 [ 1270.496295]                                local_irq_disable();
 [ 1270.496753]                                lock(&pool->lock/1);
 [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
 [ 1270.497744]   <Interrupt>
 [ 1270.497948]     lock(&pool->lock/1);

, which will cause an irq inversion deadlock if the above lock scenario happens.

The root cause of this safe -> unsafe lock order is the mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock held. Unlocking a mutex while holding an irq spinlock was never safe, and this problem has been around forever, but it never got noticed because the mutex is usually only trylocked while holding the irq spinlock, making actual failures very unlikely, and the lockdep annotation missed the condition until the recent b9c16a0e1f73 ("locking/mutex: Fix lockdep_assert_held() fail").

Using a mutex for pool->manager_arb has always been a bit of a stretch. It primarily is a mechanism to arbitrate managership between workers, which can easily be done with a pool flag. The only reason it became a mutex is that the pool destruction path wants to exclude parallel managing operations.

This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE and makes the destruction path wait for the current manager on a wait queue.

v2: Drop unnecessary flag clearing before pool destruction as suggested by Boqun.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
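For illustration, a user-space model of the new arbitration: managership is claimed by testing and setting a pool flag under pool->lock, and the destruction path sleeps until the flag is clear. The pthread mutex and condition variable stand in for the spinlock and wait queue; this is a sketch of the scheme, not the kernel code:

    #include <pthread.h>
    #include <stdbool.h>

    #define POOL_MANAGER_ACTIVE 0x1

    struct worker_pool {
        pthread_mutex_t lock;          /* models pool->lock */
        pthread_cond_t  manager_wait;  /* models the manager wait queue */
        unsigned int flags;
    };

    /* Worker side: grab managership only if nobody else has it. */
    static bool maybe_become_manager(struct worker_pool *pool)
    {
        bool ok = false;

        pthread_mutex_lock(&pool->lock);
        if (!(pool->flags & POOL_MANAGER_ACTIVE)) {
            pool->flags |= POOL_MANAGER_ACTIVE;
            ok = true;
        }
        pthread_mutex_unlock(&pool->lock);
        return ok;
    }

    static void drop_managership(struct worker_pool *pool)
    {
        pthread_mutex_lock(&pool->lock);
        pool->flags &= ~POOL_MANAGER_ACTIVE;
        pthread_cond_broadcast(&pool->manager_wait);
        pthread_mutex_unlock(&pool->lock);
    }

    /*
     * Destruction side: wait until no manager is active, then keep the
     * flag set so no new manager can start while the pool is torn down.
     */
    static void put_unbound_pool_model(struct worker_pool *pool)
    {
        pthread_mutex_lock(&pool->lock);
        while (pool->flags & POOL_MANAGER_ACTIVE)
            pthread_cond_wait(&pool->manager_wait, &pool->lock);
        pool->flags |= POOL_MANAGER_ACTIVE;
        pthread_mutex_unlock(&pool->lock);
        /* ... free workers, destroy the pool ... */
    }

    int main(void)
    {
        struct worker_pool pool = {
            .lock = PTHREAD_MUTEX_INITIALIZER,
            .manager_wait = PTHREAD_COND_INITIALIZER,
        };

        if (maybe_become_manager(&pool))
            drop_managership(&pool);
        put_unbound_pool_model(&pool);
        return 0;
    }

Because the flag is only ever tested and changed under pool->lock, no separate mutex has to be released from inside the irq-safe region, which is exactly the inversion the lockdep splat complained about.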
2017-10-21hrtimer: Catch invalid clockids againMarc Zyngier
[ Upstream commit 336a9cde10d641e70bac67d90ae91b3190c3edca ]

commit 82e88ff1ea94 ("hrtimer: Revert CLOCK_MONOTONIC_RAW support") unfortunately removed a sanity check in the hrtimer code which was part of that MONOTONIC_RAW patch series. It would have caught the bogus usage of CLOCK_MONOTONIC_RAW in the wireless code. So bring it back.

It is way too easy to take any random clockid and feed it to the hrtimer subsystem. At best, it gets mapped to a monotonic base, but it would be better to just catch illegal values as early as possible.

Detect invalid clockids, map them to CLOCK_MONOTONIC and emit a warning.

[ tglx: Replaced the BUG by a WARN and gracefully map to CLOCK_MONOTONIC ]

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Link: http://lkml.kernel.org/r/1452879670-16133-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
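For illustration, a stand-alone sketch of the restored validation: map a clockid to an hrtimer base, and warn plus fall back to the monotonic base for anything the table does not know. The table contents, the model's MAX_CLOCKS and the fprintf standing in for WARN() are assumptions of this sketch:

    #include <stdio.h>
    #include <time.h>

    enum hrtimer_base_model { BASE_MONOTONIC, BASE_REALTIME, MAX_BASES };

    #define MAX_CLOCKS_MODEL 16

    /* Only the clockids this model knows about get a valid mapping. */
    static const unsigned char clock_valid[MAX_CLOCKS_MODEL] = {
        [CLOCK_REALTIME]  = 1,
        [CLOCK_MONOTONIC] = 1,
    };

    static const int clock_to_base[MAX_CLOCKS_MODEL] = {
        [CLOCK_REALTIME]  = BASE_REALTIME,
        [CLOCK_MONOTONIC] = BASE_MONOTONIC,
    };

    static int clockid_to_base_model(clockid_t clock_id)
    {
        if (clock_id >= 0 && clock_id < MAX_CLOCKS_MODEL && clock_valid[clock_id])
            return clock_to_base[clock_id];

        /* The kernel WARN()s here; this model just reports and degrades. */
        fprintf(stderr, "invalid clockid %d, using CLOCK_MONOTONIC\n", (int)clock_id);
        return BASE_MONOTONIC;
    }

    int main(void)
    {
        /* CLOCK_MONOTONIC_RAW has no hrtimer base: warn and fall back. */
        return clockid_to_base_model(CLOCK_MONOTONIC_RAW) == BASE_MONOTONIC ? 0 : 1;
    }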
2017-10-21sched/fair: Update rq clock before changing a task's CPU affinityWanpeng Li
[ Upstream commit a499c3ead88ccf147fc50689e85a530ad923ce36 ]

This is triggered during boot when CONFIG_SCHED_DEBUG is enabled:

 ------------[ cut here ]------------
 WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
 Call Trace:
  dump_stack+0x85/0xc2
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  set_next_entity+0x11d/0x380
  set_curr_task_fair+0x2b/0x60
  do_set_cpus_allowed+0x139/0x180
  __set_cpus_allowed_ptr+0x113/0x260
  set_cpus_allowed_ptr+0x10/0x20
  torture_shuffle+0xfd/0x180
  kthread+0x10f/0x150
  ? torture_shutdown_init+0x60/0x60
  ? kthread_create_on_node+0x60/0x60
  ret_from_fork+0x31/0x40
 ---[ end trace dd94d92344cea9c6 ]---

The task is running && !queued, so there is no rq clock update before calling set_curr_task().

This patch fixes it by updating the rq clock after taking rq->lock/pi_lock, just as the other dequeue + put_prev + enqueue + set_curr sequences do.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-21locking/lockdep: Add nest_lock integrity testPeter Zijlstra
[ Upstream commit 7fb4a2cea6b18dab56d609530d077f168169ed6b ] Boqun reported that hlock->references can overflow. Add a debug test for that to generate a clear error when this happens. Without this, lockdep is likely to report a mysterious failure on unlock. Reported-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
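For illustration, a sketch of the overflow guard: hlock->references is a narrow bitfield (12 bits assumed here for the model), so saturation is detected before the increment instead of silently wrapping. The struct and warn_on() helper are stand-ins, not lockdep's held_lock:

    #include <stdio.h>

    struct held_lock_model {
        unsigned int references : 12;   /* wraps at 4095 if unchecked */
    };

    static int warn_on(int cond, const char *msg)
    {
        if (cond)
            fprintf(stderr, "lockdep model: %s\n", msg);
        return cond;
    }

    /* Returns 0 on failure, mirroring __lock_acquire()'s convention. */
    static int reacquire_nested(struct held_lock_model *hlock)
    {
        if (hlock->references) {
            /* Catch the overflow before it corrupts the count. */
            if (warn_on(hlock->references == (1u << 12) - 1,
                        "hlock->references overflow"))
                return 0;
            hlock->references++;
        } else {
            hlock->references = 2;   /* first re-acquisition: owner + this one */
        }
        return 1;
    }

    int main(void)
    {
        struct held_lock_model hlock = { .references = 0 };

        reacquire_nested(&hlock);              /* becomes 2 */
        hlock.references = (1u << 12) - 1;     /* force the saturated case */
        return reacquire_nested(&hlock) ? 1 : 0;
    }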
2017-10-18rcu: Allow for page faults in NMI handlersPaul E. McKenney
commit 28585a832602747cbfa88ad8934013177a3aae38 upstream.

A number of architectures invoke rcu_irq_enter() on exception entry in order to allow RCU read-side critical sections in the exception handler when the exception is from an idle or nohz_full CPU. This works, at least unless the exception happens in an NMI handler. In that case, rcu_nmi_enter() would already have exited the extended quiescent state, which would mean that rcu_irq_enter() would (incorrectly) cause RCU to think that it is again in an extended quiescent state. This will in turn result in lockdep splats in response to later RCU read-side critical sections.

This commit therefore causes rcu_irq_enter() and rcu_irq_exit() to take no action if there is an rcu_nmi_enter() in effect, thus avoiding the unscheduled return to RCU quiescent state. This in turn should make the kernel safe for on-demand RCU voyeurism.

Link: http://lkml.kernel.org/r/20170922211022.GA18084@linux.vnet.ibm.com
Fixes: 0be964be0 ("module: Sanitize RCU usage and locking")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
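For illustration, a stand-alone model of the rule this commit adds: when an NMI handler has already left the extended quiescent state, the irq entry/exit hooks must do nothing. The counters and helpers below are a simplified model, not the kernel's rcu_dynticks machinery:

    #include <assert.h>

    struct rcu_dynticks_model {
        int dynticks_nesting;       /* process/irq nesting */
        int dynticks_nmi_nesting;   /* NMI nesting */
        int in_eqs;                 /* 1 = extended quiescent state */
    };

    static void rcu_nmi_enter(struct rcu_dynticks_model *d)
    {
        d->dynticks_nmi_nesting++;
        d->in_eqs = 0;              /* the NMI path already exits the EQS */
    }

    static void rcu_nmi_exit(struct rcu_dynticks_model *d)
    {
        if (--d->dynticks_nmi_nesting == 0 && d->dynticks_nesting == 0)
            d->in_eqs = 1;
    }

    static void rcu_irq_enter(struct rcu_dynticks_model *d)
    {
        /* The fix: do nothing if an NMI handler already did the work. */
        if (d->dynticks_nmi_nesting)
            return;
        d->dynticks_nesting++;
        d->in_eqs = 0;
    }

    static void rcu_irq_exit(struct rcu_dynticks_model *d)
    {
        if (d->dynticks_nmi_nesting)
            return;
        if (--d->dynticks_nesting == 0)
            d->in_eqs = 1;
    }

    int main(void)
    {
        struct rcu_dynticks_model d = { .in_eqs = 1 };   /* idle CPU */

        rcu_nmi_enter(&d);          /* NMI arrives from idle */
        rcu_irq_enter(&d);          /* page fault inside the NMI handler */
        rcu_irq_exit(&d);
        assert(!d.in_eqs);          /* still out of EQS: the NMI is not done */
        rcu_nmi_exit(&d);
        assert(d.in_eqs);           /* back to idle cleanly */
        return 0;
    }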
2017-10-12sched/cpuset/pm: Fix cpuset vs. suspend-resume bugsPeter Zijlstra
commit 50e76632339d4655859523a39249dd95ee5e93e7 upstream.

Cpusets vs. suspend-resume is _completely_ broken. And it got noticed because it now resulted in non-cpuset usage breaking too.

On suspend cpuset_cpu_inactive() doesn't call into cpuset_update_active_cpus() because it doesn't want to move tasks about; there is no need, all tasks are frozen and won't run again until after we've resumed everything.

But this means that when we finally do call into cpuset_update_active_cpus() after resuming the last frozen cpu in cpuset_cpu_active(), the top_cpuset will not have any difference with the cpu_active_mask and thus it will not in fact do _anything_.

So the cpuset configuration will not be restored. This was largely hidden because we would unconditionally create identity domains and mobile users would not in fact use cpusets much. And servers that do use cpusets tend to not suspend-resume much.

An additional problem is that we'd not in fact wait for the cpuset work to finish before resuming the tasks, allowing spurious migrations outside of the specified domains.

Fix the rebuild by introducing cpuset_force_rebuild() and fix the ordering with cpuset_wait_for_hotplug().

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
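For illustration, a compact model of the two fixes: a sticky flag set on resume forces a sched-domain rebuild even when top_cpuset already matches cpu_active_mask, and the resume path flushes the cpuset hotplug work before tasks are thawed. The flag, masks and "work" below are simplified stand-ins, not the cpuset implementation:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NR_CPUS 8

    static bool force_rebuild;                /* set by cpuset_force_rebuild() */
    static bool hotplug_work_pending;
    static bool top_cpuset[NR_CPUS];
    static bool cpu_active_mask[NR_CPUS];

    static void cpuset_force_rebuild(void)
    {
        force_rebuild = true;
    }

    static void rebuild_sched_domains(void)
    {
        puts("rebuilding sched domains");
    }

    /* The hotplug work: without the fix it only acts when the masks differ. */
    static void cpuset_hotplug_work(void)
    {
        bool changed = memcmp(top_cpuset, cpu_active_mask, sizeof(top_cpuset)) != 0;

        if (changed)
            memcpy(top_cpuset, cpu_active_mask, sizeof(top_cpuset));

        if (changed || force_rebuild) {       /* the first fix */
            force_rebuild = false;
            rebuild_sched_domains();
        }
        hotplug_work_pending = false;
    }

    /* The second fix: resume flushes the work before thawing tasks. */
    static void cpuset_wait_for_hotplug(void)
    {
        if (hotplug_work_pending)
            cpuset_hotplug_work();            /* flush_work() in the kernel */
    }

    int main(void)
    {
        /* Resume of the last frozen CPU: the masks already match again ... */
        cpuset_force_rebuild();
        hotplug_work_pending = true;

        /* ... yet the domains still get rebuilt before tasks are thawed. */
        cpuset_wait_for_hotplug();
        return 0;
    }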
2017-10-12ftrace: Fix kmemleak in unregister_ftrace_graphShu Wang
commit 2b0b8499ae75df91455bbeb7491d45affc384fb0 upstream.

The trampoline allocated by the function tracer was overwritten by the function_graph tracer, which caused a memory leak. save_global_trampoline should have saved the previous trampoline in register_ftrace_graph() and restored it in unregister_ftrace_graph(). But as implemented, save_global_trampoline was only used in unregister_ftrace_graph() with its default value of 0, and it overwrote the previous trampoline's value, causing the previously allocated trampoline to be lost.

kmemleak backtrace:

    kmemleak_vmalloc+0x77/0xc0
    __vmalloc_node_range+0x1b5/0x2c0
    module_alloc+0x7c/0xd0
    arch_ftrace_update_trampoline+0xb5/0x290
    ftrace_startup+0x78/0x210
    register_ftrace_function+0x8b/0xd0
    function_trace_init+0x4f/0x80
    tracing_set_tracer+0xe6/0x170
    tracing_set_trace_write+0x90/0xd0
    __vfs_write+0x37/0x170
    vfs_write+0xb2/0x1b0
    SyS_write+0x55/0xc0
    do_syscall_64+0x67/0x180
    return_from_SYSCALL_64+0x0/0x6a

[ Looking further into this, I found that this was left over from when the function and function graph tracers shared the same ftrace_ops. But in commit 5f151b2401 ("ftrace: Fix function_profiler and function tracer together"), the two were separated, and the save_global_trampoline no longer was necessary (and it may have been broken back then too). -- Steven Rostedt ]

Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com
Fixes: 5f151b2401 ("ftrace: Fix function_profiler and function tracer together")
Signed-off-by: Shu Wang <shuwang@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
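For illustration, a minimal model of the leak: a restore path writes back a pointer that was never actually saved, clobbering the only reference to a live allocation so it can never be freed. The names mirror the commit text; the malloc()ed buffer stands in for the arch trampoline:

    #include <stdlib.h>

    struct ftrace_ops_model {
        void *trampoline;            /* owns an allocation when non-NULL */
    };

    static void *save_global_trampoline;   /* never written by "register" */

    static void register_graph_model(struct ftrace_ops_model *ops)
    {
        /* Bug: the old ops->trampoline is NOT saved here. */
        (void)ops;
    }

    static void unregister_graph_model_buggy(struct ftrace_ops_model *ops)
    {
        /* "Restores" a value that was never saved: the old buffer leaks. */
        ops->trampoline = save_global_trampoline;   /* NULL */
    }

    static void unregister_graph_model_fixed(struct ftrace_ops_model *ops)
    {
        /* Fix: drop the bogus save/restore and release via the normal path. */
        free(ops->trampoline);
        ops->trampoline = NULL;
    }

    int main(int argc, char **argv)
    {
        struct ftrace_ops_model ops = { .trampoline = malloc(64) };

        register_graph_model(&ops);

        if (argc > 1)                           /* run with any argument ... */
            unregister_graph_model_buggy(&ops); /* ... to reproduce the leak */
        else
            unregister_graph_model_fixed(&ops);
        return 0;
    }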