Age | Commit message (Collapse) | Author |
|
|
|
commit 0d19ea866562e46989412a0676412fa0983c9ce7 upstream.
If we mount a hierarchy with a specified name, the name is unique,
and we can use it to mount the hierarchy without specifying its
set of subsystem names. This feature is documented is
Documentation/cgroups/cgroups.txt section 2.3
Here's an example:
# mount -t cgroup -o cpuset,name=myhier xxx /cgroup1
# mount -t cgroup -o name=myhier xxx /cgroup2
But it was broken by commit 32a8cf235e2f192eb002755076994525cdbaa35a
(cgroup: make the mount options parsing more accurate)
This fixes the regression.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
|
|
This is the temporary simple fix for 3.2, we need more changes in this
area.
1. do_signal_stop() assumes that the running untraced thread in the
stopped thread group is not possible. This was our goal but it is
not yet achieved: a stopped-but-resumed tracee can clone the running
thread which can initiate another group-stop.
Remove WARN_ON_ONCE(!current->ptrace).
2. A new thread always starts with ->jobctl = 0. If it is auto-attached
and this group is stopped, __ptrace_unlink() sets JOBCTL_STOP_PENDING
but JOBCTL_STOP_SIGMASK part is zero, this triggers WANR_ON(!signr)
in do_jobctl_trap() if another debugger attaches.
Change __ptrace_unlink() to set the artificial SIGSTOP for report.
Alternatively we could change ptrace_init_task() to copy signr from
current, but this means we can copy it for no reason and hide the
possible similar problems.
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org> [3.1]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Test-case:
int main(void)
{
int pid, status;
pid = fork();
if (!pid) {
for (;;) {
if (!fork())
return 0;
if (waitpid(-1, &status, 0) < 0) {
printf("ERR!! wait: %m\n");
return 0;
}
}
}
assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
assert(waitpid(-1, NULL, 0) == pid);
assert(ptrace(PTRACE_SETOPTIONS, pid, 0,
PTRACE_O_TRACEFORK) == 0);
do {
ptrace(PTRACE_CONT, pid, 0, 0);
pid = waitpid(-1, NULL, 0);
} while (pid > 0);
return 1;
}
It fails because ->real_parent sees its child in EXIT_DEAD state
while the tracer is going to change the state back to EXIT_ZOMBIE
in wait_task_zombie().
The offending commit is 823b018e which moved the EXIT_DEAD check,
but in fact we should not blame it. The original code was not
correct as well because it didn't take ptrace_reparented() into
account and because we can't really trust ->ptrace.
This patch adds the additional check to close this particular
race but it doesn't solve the whole problem. We simply can't
rely on ->ptrace in this case, it can be cleared if the tracer
is multithreaded by the exiting ->parent.
I think we should kill EXIT_DEAD altogether, we should always
remove the soon-to-be-reaped child from ->children or at least
we should never do the DEAD->ZOMBIE transition. But this is too
complex for 3.2.
Reported-and-tested-by: Denys Vlasenko <vda.linux@googlemail.com>
Tested-by: Lukasz Michalik <lmi@ift.uni.wroc.pl>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org> [3.0+]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
vfork parent uninterruptibly and unkillably waits for its child to
exec/exit. This wait is of unbounded length. Ignore such waits
in the hung_task detector.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
LKML-Reference: <1325344394.28904.43.camel@lappy>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
It was found (by Sasha) that if you use a futex located in the gate
area we get stuck in an uninterruptible infinite loop, much like the
ZERO_PAGE issue.
While looking at this problem, PeterZ realized you'll get into similar
trouble when hitting any install_special_pages() mapping. And are there
still drivers setting up their own special mmaps without page->mapping,
and without special VM or pte flags to make get_user_pages fail?
In most cases, if page->mapping is NULL, we do not need to retry at all:
Linus points out that even /proc/sys/vm/drop_caches poses no problem,
because it ends up using remove_mapping(), which takes care not to
interfere when the page reference count is raised.
But there is still one case which does need a retry: if memory pressure
called shmem_writepage in between get_user_pages_fast dropping page
table lock and our acquiring page lock, then the page gets switched from
filecache to swapcache (and ->mapping set to NULL) whatever the refcount.
Fault it back in to get the page->mapping needed for key->shared.inode.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit de28f25e8244c7353abed8de0c7792f5f883588c.
It results in resume problems for various people. See for example
http://thread.gmane.org/gmane.linux.kernel/1233033
http://thread.gmane.org/gmane.linux.kernel/1233389
http://thread.gmane.org/gmane.linux.kernel/1233159
http://thread.gmane.org/gmane.linux.kernel/1227868/focus=1230877
and the fedora and ubuntu bug reports
https://bugzilla.redhat.com/show_bug.cgi?id=767248
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/904569
which got bisected down to the stable version of this commit.
Reported-by: Jonathan Nieder <jrnieder@gmail.com>
Reported-by: Phil Miller <mille121@illinois.edu>
Reported-by: Philip Langdale <philipl@overt.org>
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <gregkh@suse.de>
Cc: stable@kernel.org # for stable kernels that applied the original
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
wait_queue is a swiss army knife and in most of the cases the
complexity is not needed. For RT waitqueues are a constant source of
trouble as we can't convert the head lock to a raw spinlock due to
fancy and long lasting callbacks.
Provide a slim version, which allows RT to replace wait queues. This
should go mainline as well, as it lowers memory consumption and
runtime overhead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Add a CONFIG option to allow the output from Magic SysRq to be output
immediately, even if this causes large latencies.
If PREEMPT_RT_FULL, printk() will not try to acquire the console lock
when interrupts or preemption are disabled. If the console lock is
not acquired the printk() output will be buffered, but will not be
output immediately. Some drivers call into the Magic SysRq code
with interrupts or preemption disabled, so the output of Magic SysRq
will be buffered instead of printing immediately if this option is
not selected.
Even with this option selected, Magic SysRq output will be delayed
if the attempt to acquire the console lock fails.
Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
Link: http://lkml.kernel.org/r/4E7CEF60.5020508@am.sony.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Add a /sys/kernel entry to indicate that the kernel is a
realtime kernel.
Clark says that he needs this for udev rules, udev needs to evaluate
if its a PREEMPT_RT kernel a few thousand times and parsing uname
output is too slow or so.
Are there better solutions? Should it exist and return 0 on !-rt?
Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
|
|
On 07/27/2011 04:37 PM, Thomas Gleixner wrote:
> - KGDB (not yet disabled) is reportedly unusable on -rt right now due
> to missing hacks in the console locking which I dropped on purpose.
>
To work around this in the short term you can use this patch, in
addition to the clocksource watchdog patch that Thomas brewed up.
Comments are welcome of course. Ultimately the right solution is to
change separation between the console and the HW to have a polled mode
+ work queue so as not to introduce any kind of latency.
Thanks,
Jason.
|
|
There is no need do disable preemption in vprintk(), disable_migrate()
is sufficient. This fixes the following bug in -rt:
[ 14.759233] BUG: sleeping function called from invalid context
at /home/rw/linux-rt/kernel/rtmutex.c:645
[ 14.759235] in_atomic(): 1, irqs_disabled(): 0, pid: 547, name: bash
[ 14.759244] Pid: 547, comm: bash Not tainted 3.0.12-rt29+ #3
[ 14.759246] Call Trace:
[ 14.759301] [<ffffffff8106fade>] __might_sleep+0xeb/0xf0
[ 14.759318] [<ffffffff810ad784>] rt_spin_lock_fastlock.constprop.9+0x21/0x43
[ 14.759336] [<ffffffff8161fef0>] rt_spin_lock+0xe/0x10
[ 14.759354] [<ffffffff81347ad1>] serial8250_console_write+0x81/0x121
[ 14.759366] [<ffffffff8107ecd3>] __call_console_drivers+0x7c/0x93
[ 14.759369] [<ffffffff8107ef31>] _call_console_drivers+0x5c/0x60
[ 14.759372] [<ffffffff8107f7e5>] console_unlock+0x147/0x1a2
[ 14.759374] [<ffffffff8107fd33>] vprintk+0x3ea/0x462
[ 14.759383] [<ffffffff816160e0>] printk+0x51/0x53
[ 14.759399] [<ffffffff811974e4>] ? proc_reg_poll+0x9a/0x9a
[ 14.759403] [<ffffffff81335b42>] __handle_sysrq+0x50/0x14d
[ 14.759406] [<ffffffff81335c8a>] write_sysrq_trigger+0x4b/0x53
[ 14.759408] [<ffffffff81335c3f>] ? __handle_sysrq+0x14d/0x14d
[ 14.759410] [<ffffffff81197583>] proc_reg_write+0x9f/0xbe
[ 14.759426] [<ffffffff811497ec>] vfs_write+0xac/0xf3
[ 14.759429] [<ffffffff8114a9b3>] ? fget_light+0x3a/0x9b
[ 14.759431] [<ffffffff811499db>] sys_write+0x4a/0x6e
[ 14.759438] [<ffffffff81625d52>] system_call_fastpath+0x16/0x1b
Signed-off-by: Richard Weinberger <rw@linutronix.de>
Link: http://lkml.kernel.org/r/1323696956-11445-1-git-send-email-rw@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Do not take lock for non handled cases (might be atomic context)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
BUG: sleeping function called from invalid context at kernel/rtmutex.c:645
in_atomic(): 1, irqs_disabled(): 0, pid: 1739, name: bash
Pid: 1739, comm: bash Not tainted 3.0.6-rt17-00284-gb76d419 #3
Call Trace:
[<c06e3b5d>] ? printk+0x1d/0x20
[<c01390b6>] __might_sleep+0xe6/0x110
[<c06e633c>] rt_spin_lock+0x1c/0x30
[<c01655a6>] flush_gcwq+0x236/0x320
[<c021c651>] ? kfree+0xe1/0x1a0
[<c05b7178>] ? __cpufreq_remove_dev+0xf8/0x260
[<c0183fad>] ? rt_down_write+0xd/0x10
[<c06cd91e>] workqueue_cpu_down_callback+0x26/0x2d
[<c06e9d65>] notifier_call_chain+0x45/0x60
[<c0171cfe>] __raw_notifier_call_chain+0x1e/0x30
[<c014c9b4>] __cpu_notify+0x24/0x40
[<c06cbc6f>] _cpu_down+0xdf/0x330
[<c06cbef0>] cpu_down+0x30/0x50
[<c06cd6b0>] store_online+0x50/0xa7
[<c06cd660>] ? acpi_os_map_memory+0xec/0xec
[<c04f2faa>] sysdev_store+0x2a/0x40
[<c02887a4>] sysfs_write_file+0xa4/0x100
[<c0229ab2>] vfs_write+0xa2/0x170
[<c0288700>] ? sysfs_poll+0x90/0x90
[<c0229d92>] sys_write+0x42/0x70
[<c06ecedf>] sysenter_do_call+0x12/0x2d
CPU 1 is now offline
SMP alternatives: switching to UP code
SMP alternatives: switching to SMP code
Booting Node 0 Processor 1 APIC 0x1
smpboot cpu 1: start_ip = 9b000
Initializing CPU#1
BUG: sleeping function called from invalid context at kernel/rtmutex.c:645
in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: kworker/0:0
Pid: 0, comm: kworker/0:0 Not tainted 3.0.6-rt17-00284-gb76d419 #3
Call Trace:
[<c06e3b5d>] ? printk+0x1d/0x20
[<c01390b6>] __might_sleep+0xe6/0x110
[<c06e633c>] rt_spin_lock+0x1c/0x30
[<c06cd85b>] workqueue_cpu_up_callback+0x56/0xf3
[<c06e9d65>] notifier_call_chain+0x45/0x60
[<c0171cfe>] __raw_notifier_call_chain+0x1e/0x30
[<c014c9b4>] __cpu_notify+0x24/0x40
[<c014c9ec>] cpu_notify+0x1c/0x20
[<c06e1d43>] notify_cpu_starting+0x1e/0x20
[<c06e0aad>] smp_callin+0xfb/0x10e
[<c06e0ad9>] start_secondary+0x19/0xd7
NMI watchdog enabled, takes one hw-pmu counter.
Switched to NOHz mode on CPU #1
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Link: http://lkml.kernel.org/r/1318762607-2261-5-git-send-email-yong.zhang0@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
PF_THREAD_BOUND is set by kthread_bind() and means the thread is bound
to a particular cpu for correctness. The workqueue code abuses this
flag and blindly sets it for all created threads, including those that
are free to migrate.
Restore the original semantics now that the worst abuse in the
cpu-hotplug path are gone. The only icky bit is the rescue thread for
per-cpu workqueues, this cannot use kthread_bind() but will use
set_cpus_allowed_ptr() to migrate itself to the desired cpu.
Set and clear PF_THREAD_BOUND manually here.
XXX: I think worker_maybe_bind_and_lock()/worker_unbind_and_unlock()
should also do a get_online_cpus(), this would likely allow us to
remove the while loop.
XXX: should probably repurpose GCWQ_DISASSOCIATED to warn on adding
works after CPU_DOWN_PREPARE -- its dual use to mark unbound gcwqs is
a tad annoying though.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The current workqueue code does crazy stuff on cpu unplug, it relies on
forced affine breakage, thereby violating per-cpu expectations. Worse,
it tries to re-attach to a cpu if the thing comes up again before all
previously queued works are finished. This breaks (admittedly bonkers)
cpu-hotplug use that relies on a down-up cycle to push all usage away.
Introduce a new WQ_NON_AFFINE flag that indicates a per-cpu workqueue
will not respect cpu affinity and use this to migrate all its pending
works to whatever cpu is doing cpu-down.
This also adds a warning for queue_on_cpu() users which warns when its
used on WQ_NON_AFFINE workqueues for the API implies you care about
what cpu things are ran on when such workqueues cannot guarantee this.
For the rest, simply flush all per-cpu works and don't mess about.
This also means that currently all workqueues that are manually
flushing things on cpu-down in order to provide the per-cpu guarantee
no longer need to do so.
In short, we tell the WQ what we want it to do, provide validation for
this and loose ~250 lines of code.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Fix this warning on x86 defconfig:
kernel/rcutree.h:433:13: warning: ‘rcu_preempt_qs’ declared ‘static’ but never defined [-Wunused-function]
The #ifdefs and prototypes here are a maze, move it closer to the
usage site that needs it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Implementing RCU-bh in terms of RCU-preempt makes the system vulnerable
to network-based denial-of-service attacks. This patch therefore
makes __do_softirq() invoke rcu_bh_qs(), but only when __do_softirq()
is running in ksoftirqd context. A wrapper layer in interposed so that
other calls to __do_softirq() avoid invoking rcu_bh_qs(). The underlying
function __do_softirq_common() does the actual work.
The reason that rcu_bh_qs() is bad in these non-ksoftirqd contexts is
that there might be a local_bh_enable() inside an RCU-preempt read-side
critical section. This local_bh_enable() can invoke __do_softirq()
directly, so if __do_softirq() were to invoke rcu_bh_qs() (which just
calls rcu_preempt_qs() in the PREEMPT_RT_FULL case), there would be
an illegal RCU-preempt quiescent state in the middle of an RCU-preempt
read-side critical section. Therefore, quiescent states can only happen
in cases where __do_softirq() is invoked directly from ksoftirqd.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20111005184518.GA21601@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The Linux kernel has long RCU-bh read-side critical sections that
intolerably increase scheduling latency under mainline's RCU-bh rules,
which include RCU-bh read-side critical sections being non-preemptible.
This patch therefore arranges for RCU-bh to be implemented in terms of
RCU-preempt for CONFIG_PREEMPT_RT_FULL=y.
This has the downside of defeating the purpose of RCU-bh, namely,
handling the case where the system is subjected to a network-based
denial-of-service attack that keeps at least one CPU doing full-time
softirq processing. This issue will be fixed by a later commit.
The current commit will need some work to make it appropriate for
mainline use, for example, it needs to be extended to cover Tiny RCU.
[ paulmck: Added a useful changelog ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20111005185938.GA20403@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
With RT_FULL we get the below wreckage:
[ 126.060484] =======================================================
[ 126.060486] [ INFO: possible circular locking dependency detected ]
[ 126.060489] 3.0.1-rt10+ #30
[ 126.060490] -------------------------------------------------------
[ 126.060492] irq/24-eth0/1235 is trying to acquire lock:
[ 126.060495] (&(lock)->wait_lock#2){+.+...}, at: [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[ 126.060503]
[ 126.060504] but task is already holding lock:
[ 126.060506] (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[ 126.060511]
[ 126.060511] which lock already depends on the new lock.
[ 126.060513]
[ 126.060514]
[ 126.060514] the existing dependency chain (in reverse order) is:
[ 126.060516]
[ 126.060516] -> #1 (&p->pi_lock){-...-.}:
[ 126.060519] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[ 126.060524] [<ffffffff8150291e>] _raw_spin_lock_irqsave+0x4b/0x85
[ 126.060527] [<ffffffff810b5aa4>] task_blocks_on_rt_mutex+0x36/0x20f
[ 126.060531] [<ffffffff815019bb>] rt_mutex_slowlock+0xd1/0x15a
[ 126.060534] [<ffffffff81501ae3>] rt_mutex_lock+0x2d/0x2f
[ 126.060537] [<ffffffff810d9020>] rcu_boost+0xad/0xde
[ 126.060541] [<ffffffff810d90ce>] rcu_boost_kthread+0x7d/0x9b
[ 126.060544] [<ffffffff8109a760>] kthread+0x99/0xa1
[ 126.060547] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[ 126.060551]
[ 126.060552] -> #0 (&(lock)->wait_lock#2){+.+...}:
[ 126.060555] [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[ 126.060558] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[ 126.060561] [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[ 126.060564] [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[ 126.060566] [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[ 126.060569] [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[ 126.060573] [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[ 126.060576] [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[ 126.060580] [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[ 126.060583] [<ffffffff81075425>] wake_up_process+0x15/0x17
[ 126.060585] [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[ 126.060590] [<ffffffff81081df9>] irq_exit+0x49/0x55
[ 126.060593] [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[ 126.060597] [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[ 126.060600] [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[ 126.060603] [<ffffffff810d582c>] irq_thread+0xde/0x1af
[ 126.060606] [<ffffffff8109a760>] kthread+0x99/0xa1
[ 126.060608] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[ 126.060611]
[ 126.060612] other info that might help us debug this:
[ 126.060614]
[ 126.060615] Possible unsafe locking scenario:
[ 126.060616]
[ 126.060617] CPU0 CPU1
[ 126.060619] ---- ----
[ 126.060620] lock(&p->pi_lock);
[ 126.060623] lock(&(lock)->wait_lock);
[ 126.060625] lock(&p->pi_lock);
[ 126.060627] lock(&(lock)->wait_lock);
[ 126.060629]
[ 126.060629] *** DEADLOCK ***
[ 126.060630]
[ 126.060632] 1 lock held by irq/24-eth0/1235:
[ 126.060633] #0: (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[ 126.060638]
[ 126.060638] stack backtrace:
[ 126.060641] Pid: 1235, comm: irq/24-eth0 Not tainted 3.0.1-rt10+ #30
[ 126.060643] Call Trace:
[ 126.060644] <IRQ> [<ffffffff810acbde>] print_circular_bug+0x289/0x29a
[ 126.060651] [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[ 126.060655] [<ffffffff810ab3aa>] ? trace_hardirqs_off_caller+0x1f/0x99
[ 126.060658] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[ 126.060661] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[ 126.060664] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[ 126.060668] [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[ 126.060671] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[ 126.060674] [<ffffffff810d9655>] ? rcu_report_qs_rsp+0x87/0x8c
[ 126.060677] [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[ 126.060680] [<ffffffff810d9ea3>] ? rcu_read_unlock_special+0x9b/0x1c4
[ 126.060683] [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[ 126.060687] [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[ 126.060690] [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[ 126.060693] [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[ 126.060696] [<ffffffff810683da>] ? select_task_rq_rt+0x27/0xd5
[ 126.060701] [<ffffffff810a852a>] ? clockevents_program_event+0x8e/0x90
[ 126.060704] [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[ 126.060708] [<ffffffff810a95dc>] ? tick_program_event+0x1f/0x21
[ 126.060711] [<ffffffff81075425>] wake_up_process+0x15/0x17
[ 126.060715] [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[ 126.060718] [<ffffffff81081df9>] irq_exit+0x49/0x55
[ 126.060721] [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[ 126.060724] [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[ 126.060726] <EOI> [<ffffffff81072855>] ? migrate_disable+0x75/0x12d
[ 126.060733] [<ffffffff81080a61>] ? local_bh_disable+0xe/0x1f
[ 126.060736] [<ffffffff81080a70>] ? local_bh_disable+0x1d/0x1f
[ 126.060739] [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[ 126.060742] [<ffffffff81502ac0>] ? _raw_spin_unlock_irq+0x3b/0x59
[ 126.060745] [<ffffffff810d582c>] irq_thread+0xde/0x1af
[ 126.060748] [<ffffffff810d5937>] ? irq_thread_fn+0x3a/0x3a
[ 126.060751] [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[ 126.060754] [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[ 126.060757] [<ffffffff8109a760>] kthread+0x99/0xa1
[ 126.060761] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[ 126.060764] [<ffffffff81069ed7>] ? finish_task_switch+0x87/0x10a
[ 126.060768] [<ffffffff81502ec4>] ? retint_restore_args+0xe/0xe
[ 126.060771] [<ffffffff8109a6c7>] ? __init_kthread_worker+0x8c/0x8c
[ 126.060774] [<ffffffff81509b10>] ? gs_change+0xb/0xb
Because irq_exit() does:
void irq_exit(void)
{
account_system_vtime(current);
trace_hardirq_exit();
sub_preempt_count(IRQ_EXIT_OFFSET);
if (!in_interrupt() && local_softirq_pending())
invoke_softirq();
...
}
Which triggers a wakeup, which uses RCU, now if the interrupted task has
t->rcu_read_unlock_special set, the rcu usage from the wakeup will end
up in rcu_read_unlock_special(). rcu_read_unlock_special() will test
for in_irq(), which will fail as we just decremented preempt_count
with IRQ_EXIT_OFFSET, and in_sering_softirq(), which for
PREEMPT_RT_FULL reads:
int in_serving_softirq(void)
{
int res;
preempt_disable();
res = __get_cpu_var(local_softirq_runner) == current;
preempt_enable();
return res;
}
Which will thus also fail, resulting in the above wreckage.
The 'somewhat' ugly solution is to open-code the preempt_count() test
in rcu_read_unlock_special().
Also, we're not at all sure how ->rcu_read_unlock_special gets set
here... so this is very likely a bandaid and more thought is required.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This fixes the following build error for the preempt-rt kernel.
make kernel/fork.o
CC kernel/fork.o
kernel/fork.c:90: error: section of ¡tasklist_lock¢ conflicts with previous declaration
make[2]: *** [kernel/fork.o] Error 1
make[1]: *** [kernel/fork.o] Error 2
The rt kernel cache aligns the RWLOCK in DEFINE_RWLOCK by default.
The non-rt kernels explicitly cache align only the tasklist_lock in
kernel/fork.c
That can create a build conflict. This fixes the build problem by making the
non-rt kernels cache align RWLOCKs by default. The side effect is that
the other RWLOCKs are also cache aligned for non-rt.
This is a short term solution for rt only.
The longer term solution would be to push the cache aligned DEFINE_RWLOCK
to mainline. If there are objections, then we could create a
DEFINE_RWLOCK_CACHE_ALIGNED or something of that nature.
Comments? Objections?
Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/alpine.LFD.2.00.1109191104010.23118@localhost6.localdomain6
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Map spinlocks, rwlocks, rw_semaphores and semaphores to the rt_mutex
based locking functions for preempt-rt.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
When CONFIG_PREEMPT_RT_FULL is enabled, tasklets run as threads,
and spinlocks turn are mutexes. But this can cause issues with
tasks disabling tasklets. A tasklet runs under ksoftirqd, and
if a tasklets are disabled with tasklet_disable(), the tasklet
count is increased. When a tasklet runs, it checks this counter
and if it is set, it adds itself back on the softirq queue and
returns.
The problem arises in RT because ksoftirq will see that a softirq
is ready to run (the tasklet softirq just re-armed itself), and will
not sleep, but instead run the softirqs again. The tasklet softirq
will still see that the count is non-zero and will not execute
the tasklet and requeue itself on the softirq again, which will
cause ksoftirqd to run it again and again and again.
It gets worse because ksoftirqd runs as a real-time thread.
If it preempted the task that disabled tasklets, and that task
has migration disabled, or can't run for other reasons, the tasklet
softirq will never run because the count will never be zero, and
ksoftirqd will go into an infinite loop. As an RT task, it this
becomes a big problem.
This is a hack solution to have tasklet_disable stop tasklets, and
when a tasklet runs, instead of requeueing the tasklet softirqd
it delays it. When tasklet_enable() is called, and tasklets are
waiting, then the tasklet_enable() will kick the tasklets to continue.
This prevents the lock up from ksoftirq going into an infinite loop.
[ rostedt@goodmis.org: ported to 3.0-rt ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
If ksoftirqd gets woken during hot-unplug, __thread_do_softirq() will
call pin_current_cpu() which will block on the held cpu_hotplug.lock.
Moving the offline check in __thread_do_softirq() before the
pin_current_cpu() call doesn't work, since the wakeup can happen
before we mark the cpu offline.
So here we have the ksoftirq thread stuck until hotplug finishes, but
then the ksoftirq CPU_DOWN notifier issues kthread_stop() which will
wait for the ksoftirq thread to go away -- while holding the hotplug
lock.
Sort this by delaying the kthread_stop() until CPU_POST_DEAD, which is
outside of the cpu_hotplug.lock, but still serialized by the
cpu_add_remove_lock.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
Link: http://lkml.kernel.org/r/1317391156.12973.3.camel@twins
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
ERROR: "in_serving_softirq" [net/sched/cls_cgroup.ko] undefined!
The above can be fixed by exporting in_serving_softirq
Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: stable-rt@vger.kernel.org
Link: http://lkml.kernel.org/r/1321235083-21756-2-git-send-email-jkacur@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The reader_lock is mostly taken in normal context with interrupts enabled.
But because ftrace_dump() can happen anywhere, it is used as a spin lock
and in some cases a check to in_nmi() is performed to determine if the
ftrace_dump() was initiated from an NMI and if it is, the lock is not taken.
But having the lock as a raw_spin_lock() causes issues with the real-time
kernel as the lock is held during allocation and freeing of the buffer.
As memory locks convert into mutexes, keeping the reader_lock as a spin_lock
causes problems.
Converting the reader_lock is not straight forward as we must still deal
with the ftrace_dump() happening not only from an NMI but also from
true interrupt context in PREEPMT_RT.
Two wrapper functions are created to take and release the reader lock:
int read_buffer_lock(cpu_buffer, unsigned long *flags)
void read_buffer_unlock(cpu_buffer, unsigned long flags, int locked)
The read_buffer_lock() returns 1 if it actually took the lock, disables
interrupts and updates the flags. The only time it returns 0 is in the
case of a ftrace_dump() happening in an unsafe context.
The read_buffer_unlock() checks the return of locked and will simply
unlock the spin lock if it was successfully taken.
Instead of just having this in specific cases that the NMI might call
into, all instances of the reader_lock is converted to the wrapper
functions to make this a bit simpler to read and less error prone.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <clark@redhat.com>
Link: http://lkml.kernel.org/r/1317146210.26514.33.camel@gandalf.stny.rr.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Link: http://lkml.kernel.org/r/20110927124423.567944215@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Link: http://lkml.kernel.org/r/20110927124423.128129033@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
The migrate_disable() can cause a bit of a overhead to the RT kernel,
as changing the affinity is expensive to do at every lock encountered.
As a running task can not migrate, the actual disabling of migration
does not need to occur until the task is about to schedule out.
In most cases, a task that disables migration will enable it before
it schedules making this change improve performance tremendously.
[ Frank Rowand: UP compile fix ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Link: http://lkml.kernel.org/r/20110927124422.779693167@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
<NMI> [<ffffffff812dafd8>] spin_bug+0x94/0xa8
[<ffffffff812db07f>] do_raw_spin_lock+0x43/0xea
[<ffffffff814fa9be>] _raw_spin_lock_irqsave+0x6b/0x85
[<ffffffff8106ff9e>] ? migrate_disable+0x75/0x12d
[<ffffffff81078aaf>] ? pin_current_cpu+0x36/0xb0
[<ffffffff8106ff9e>] migrate_disable+0x75/0x12d
[<ffffffff81115b9d>] pagefault_disable+0xe/0x1f
[<ffffffff81047027>] copy_from_user_nmi+0x74/0xe6
[<ffffffff810489d7>] perf_callchain_user+0xf3/0x135
Now clearly we can't go around taking locks from NMI context, cure
this by short-circuiting migrate_disable() when we're in an atomic
context already.
Add some extra debugging to avoid things like:
preempt_disable()
migrate_disable();
preempt_enable();
migrate_enable();
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1314967297.1301.14.camel@twins
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-wbot4vsmwhi8vmbf83hsclk6@git.kernel.org
|
|
Assigning mask = tsk_cpus_allowed(p) after p->migrate_disable = 0 ensures
that we won't see a mask change.. no push/pull, we stack tasks on one CPU.
Also add a couple fields to sched_debug for the next guy.
[ Build fix from Stratos Psomadakis <psomas@gentoo.org> ]
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1314108763.6689.4.camel@marge.simson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Make migrate_disable() be a preempt_disable() for !rt kernels. This
allows generic code to use it but still enforces that these code
sections stay relatively small.
A preemptible migrate_disable() accessible for general use would allow
people growing arbitrary per-cpu crap instead of clean these things
up.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-275i87sl8e1jcamtchmehonm@git.kernel.org
|