<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/kernel/watchdog.c, branch v4.8-rc6</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>watchdog: don't run proc_watchdog_update if new value is same as old</title>
<updated>2016-03-17T22:09:34+00:00</updated>
<author>
<name>Joshua Hunt</name>
<email>johunt@akamai.com</email>
</author>
<published>2016-03-17T21:17:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a1ee1932aa6bea0bb074f5e3ced112664e4637ed'/>
<id>a1ee1932aa6bea0bb074f5e3ced112664e4637ed</id>
<content type='text'>
While working on a script to restore all sysctl params before a series of
tests I found that writing any value into the
/proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
causes them to call proc_watchdog_update().

  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.

There doesn't appear to be a reason for doing this work every time a write
occurs, so only do it when the values change.

Signed-off-by: Josh Hunt &lt;johunt@akamai.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.1.x+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
While working on a script to restore all sysctl params before a series of
tests I found that writing any value into the
/proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
causes them to call proc_watchdog_update().

  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
  NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.

There doesn't appear to be a reason for doing this work every time a write
occurs, so only do it when the values change.

Signed-off-by: Josh Hunt &lt;johunt@akamai.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.1.x+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq</title>
<updated>2016-01-12T02:53:13+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-01-12T02:53:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0f8c7901039f8b1366ae364462743c8f4125822e'/>
<id>0f8c7901039f8b1366ae364462743c8f4125822e</id>
<content type='text'>
Pull workqueue update from Tejun Heo:
 "Workqueue changes for v4.5.  One cleanup patch and three to improve
  the debuggability.

  Workqueue now has a stall detector which dumps workqueue state if any
  worker pool hasn't made forward progress over a certain amount of time
  (30s by default) and also triggers a warning if a workqueue which can
  be used in memory reclaim path tries to wait on something which can't
  be.

  These should make workqueue hangs a lot easier to debug."

* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: simplify the apply_workqueue_attrs_locked()
  workqueue: implement lockup detector
  watchdog: introduce touch_softlockup_watchdog_sched()
  workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull workqueue update from Tejun Heo:
 "Workqueue changes for v4.5.  One cleanup patch and three to improve
  the debuggability.

  Workqueue now has a stall detector which dumps workqueue state if any
  worker pool hasn't made forward progress over a certain amount of time
  (30s by default) and also triggers a warning if a workqueue which can
  be used in memory reclaim path tries to wait on something which can't
  be.

  These should make workqueue hangs a lot easier to debug."

* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: simplify the apply_workqueue_attrs_locked()
  workqueue: implement lockup detector
  watchdog: introduce touch_softlockup_watchdog_sched()
  workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue
</pre>
</div>
</content>
</entry>
<entry>
<title>panic, x86: Allow CPUs to save registers even if looping in NMI context</title>
<updated>2015-12-19T10:07:01+00:00</updated>
<author>
<name>Hidehiro Kawai</name>
<email>hidehiro.kawai.ez@hitachi.com</email>
</author>
<published>2015-12-14T10:19:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=58c5661f2144c089bbc2e5d87c9ec1dc1d2964fe'/>
<id>58c5661f2144c089bbc2e5d87c9ec1dc1d2964fe</id>
<content type='text'>
Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(),
sends an NMI IPI to CPUs which haven't called panic() to stop them,
save their register information and do some cleanups for crash dumping.
However, if such a CPU is infinitely looping in NMI context, we fail to
save its register information into the crash dump.

For example, this can happen when unknown NMIs are broadcast to all
CPUs as follows:

  CPU 0                             CPU 1
  ===========================       ==========================
  receive an unknown NMI
  unknown_nmi_error()
    panic()                         receive an unknown NMI
      spin_trylock(&amp;panic_lock)     unknown_nmi_error()
      crash_kexec()                   panic()
                                        spin_trylock(&amp;panic_lock)
                                        panic_smp_self_stop()
                                          infinite loop
        kdump_nmi_shootdown_cpus()
          issue NMI IPI -----------&gt; blocked until IRET
                                          infinite loop...

Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is
blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET,
so the NMI is not handled and the callback function to save registers is
never called.

In practice, this can happen on some servers which broadcast NMIs to all
CPUs when the NMI button is pushed.

To save registers in this case, we need to:

  a) Return from NMI handler instead of looping infinitely
  or
  b) Call the callback function directly from the infinite loop

Inherently, a) is risky because NMI is also used to prevent corrupted
data from being propagated to devices.  So, we chose b).

This patch does the following:

1. Move the infinite looping of CPUs which haven't called panic() in NMI
   context (actually done by panic_smp_self_stop()) outside of panic() to
   enable us to refer pt_regs. Please note that panic_smp_self_stop() is
   still used for normal context.

2. Call a callback of kdump_nmi_shootdown_cpus() directly to save
   registers and do some cleanups after setting waiting_for_crash_ipi which
   is used for counting down the number of CPUs which handled the callback

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Chris Metcalf &lt;cmetcalf@ezchip.com&gt;
Cc: Dave Young &lt;dyoung@redhat.com&gt;
Cc: David Hildenbrand &lt;dahi@linux.vnet.ibm.com&gt;
Cc: Don Zickus &lt;dzickus@redhat.com&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Gobinda Charan Maji &lt;gobinda.cemk07@gmail.com&gt;
Cc: HATAYAMA Daisuke &lt;d.hatayama@jp.fujitsu.com&gt;
Cc: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Javi Merino &lt;javi.merino@arm.com&gt;
Cc: Jiang Liu &lt;jiang.liu@linux.intel.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: kexec@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: lkml &lt;linux-kernel@vger.kernel.org&gt;
Cc: Masami Hiramatsu &lt;masami.hiramatsu.pt@hitachi.com&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Nicolas Iooss &lt;nicolas.iooss_linux@m4x.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Cc: Seth Jennings &lt;sjenning@redhat.com&gt;
Cc: Stefan Lippers-Hollmann &lt;s.l-h@gmx.de&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Vitaly Kuznetsov &lt;vkuznets@redhat.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(),
sends an NMI IPI to CPUs which haven't called panic() to stop them,
save their register information and do some cleanups for crash dumping.
However, if such a CPU is infinitely looping in NMI context, we fail to
save its register information into the crash dump.

For example, this can happen when unknown NMIs are broadcast to all
CPUs as follows:

  CPU 0                             CPU 1
  ===========================       ==========================
  receive an unknown NMI
  unknown_nmi_error()
    panic()                         receive an unknown NMI
      spin_trylock(&amp;panic_lock)     unknown_nmi_error()
      crash_kexec()                   panic()
                                        spin_trylock(&amp;panic_lock)
                                        panic_smp_self_stop()
                                          infinite loop
        kdump_nmi_shootdown_cpus()
          issue NMI IPI -----------&gt; blocked until IRET
                                          infinite loop...

Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is
blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET,
so the NMI is not handled and the callback function to save registers is
never called.

In practice, this can happen on some servers which broadcast NMIs to all
CPUs when the NMI button is pushed.

To save registers in this case, we need to:

  a) Return from NMI handler instead of looping infinitely
  or
  b) Call the callback function directly from the infinite loop

Inherently, a) is risky because NMI is also used to prevent corrupted
data from being propagated to devices.  So, we chose b).

This patch does the following:

1. Move the infinite looping of CPUs which haven't called panic() in NMI
   context (actually done by panic_smp_self_stop()) outside of panic() to
   enable us to refer pt_regs. Please note that panic_smp_self_stop() is
   still used for normal context.

2. Call a callback of kdump_nmi_shootdown_cpus() directly to save
   registers and do some cleanups after setting waiting_for_crash_ipi which
   is used for counting down the number of CPUs which handled the callback

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Chris Metcalf &lt;cmetcalf@ezchip.com&gt;
Cc: Dave Young &lt;dyoung@redhat.com&gt;
Cc: David Hildenbrand &lt;dahi@linux.vnet.ibm.com&gt;
Cc: Don Zickus &lt;dzickus@redhat.com&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Gobinda Charan Maji &lt;gobinda.cemk07@gmail.com&gt;
Cc: HATAYAMA Daisuke &lt;d.hatayama@jp.fujitsu.com&gt;
Cc: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Javi Merino &lt;javi.merino@arm.com&gt;
Cc: Jiang Liu &lt;jiang.liu@linux.intel.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: kexec@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: lkml &lt;linux-kernel@vger.kernel.org&gt;
Cc: Masami Hiramatsu &lt;masami.hiramatsu.pt@hitachi.com&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Nicolas Iooss &lt;nicolas.iooss_linux@m4x.org&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Cc: Seth Jennings &lt;sjenning@redhat.com&gt;
Cc: Stefan Lippers-Hollmann &lt;s.l-h@gmx.de&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Vitaly Kuznetsov &lt;vkuznets@redhat.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Cc: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>panic, x86: Fix re-entrance problem due to panic on NMI</title>
<updated>2015-12-19T10:07:00+00:00</updated>
<author>
<name>Hidehiro Kawai</name>
<email>hidehiro.kawai.ez@hitachi.com</email>
</author>
<published>2015-12-14T10:19:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1717f2096b543cede7a380c858c765c41936bc35'/>
<id>1717f2096b543cede7a380c858c765c41936bc35</id>
<content type='text'>
If panic on NMI happens just after panic() on the same CPU, panic() is
recursively called. Kernel stalls, as a result, after failing to acquire
panic_lock.

To avoid this problem, don't call panic() in NMI context if we've
already entered panic().

For that, introduce nmi_panic() macro to reduce code duplication. In
the case of panic on NMI, don't return from NMI handlers if another CPU
already panicked.

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Chris Metcalf &lt;cmetcalf@ezchip.com&gt;
Cc: David Hildenbrand &lt;dahi@linux.vnet.ibm.com&gt;
Cc: Don Zickus &lt;dzickus@redhat.com&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Gobinda Charan Maji &lt;gobinda.cemk07@gmail.com&gt;
Cc: HATAYAMA Daisuke &lt;d.hatayama@jp.fujitsu.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Javi Merino &lt;javi.merino@arm.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: kexec@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: lkml &lt;linux-kernel@vger.kernel.org&gt;
Cc: Masami Hiramatsu &lt;masami.hiramatsu.pt@hitachi.com&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Nicolas Iooss &lt;nicolas.iooss_linux@m4x.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Cc: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
Cc: Seth Jennings &lt;sjenning@redhat.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Vitaly Kuznetsov &lt;vkuznets@redhat.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If panic on NMI happens just after panic() on the same CPU, panic() is
recursively called. Kernel stalls, as a result, after failing to acquire
panic_lock.

To avoid this problem, don't call panic() in NMI context if we've
already entered panic().

For that, introduce nmi_panic() macro to reduce code duplication. In
the case of panic on NMI, don't return from NMI handlers if another CPU
already panicked.

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Chris Metcalf &lt;cmetcalf@ezchip.com&gt;
Cc: David Hildenbrand &lt;dahi@linux.vnet.ibm.com&gt;
Cc: Don Zickus &lt;dzickus@redhat.com&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Gobinda Charan Maji &lt;gobinda.cemk07@gmail.com&gt;
Cc: HATAYAMA Daisuke &lt;d.hatayama@jp.fujitsu.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Javi Merino &lt;javi.merino@arm.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: kexec@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: lkml &lt;linux-kernel@vger.kernel.org&gt;
Cc: Masami Hiramatsu &lt;masami.hiramatsu.pt@hitachi.com&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Nicolas Iooss &lt;nicolas.iooss_linux@m4x.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Cc: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
Cc: Seth Jennings &lt;sjenning@redhat.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Vitaly Kuznetsov &lt;vkuznets@redhat.com&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>workqueue: implement lockup detector</title>
<updated>2015-12-08T16:29:47+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-12-08T16:28:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=82607adcf9cdf40fb7b5331269780c8f70ec6e35'/>
<id>82607adcf9cdf40fb7b5331269780c8f70ec6e35</id>
<content type='text'>
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING.  These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.

To alleviate the situation, this patch implements workqueue lockup
detector.  It periodically monitors all worker_pools periodically and,
if any pool failed to make forward progress longer than the threshold
duration, triggers warning and dumps workqueue state as follows.

 BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
 Showing busy workqueues and worker pools:
 workqueue events: flags=0x0
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
     pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
 workqueue events_power_efficient: flags=0x80
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
     pending: check_lifetime, neigh_periodic_work
 workqueue cgroup_pidlist_destroy: flags=0x0
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
     pending: cgroup_pidlist_destroy_work_fn
 ...

The detection mechanism is controller through kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.

v2: Decoupled from softlockup control knobs.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Chris Mason &lt;clm@fb.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING.  These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.

To alleviate the situation, this patch implements workqueue lockup
detector.  It periodically monitors all worker_pools periodically and,
if any pool failed to make forward progress longer than the threshold
duration, triggers warning and dumps workqueue state as follows.

 BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
 Showing busy workqueues and worker pools:
 workqueue events: flags=0x0
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
     pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
 workqueue events_power_efficient: flags=0x80
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
     pending: check_lifetime, neigh_periodic_work
 workqueue cgroup_pidlist_destroy: flags=0x0
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
     pending: cgroup_pidlist_destroy_work_fn
 ...

The detection mechanism is controller through kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.

v2: Decoupled from softlockup control knobs.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Chris Mason &lt;clm@fb.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>watchdog: introduce touch_softlockup_watchdog_sched()</title>
<updated>2015-12-08T16:29:42+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-12-08T16:28:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=03e0d4610bf4d4a93bfa16b2474ed4fd5243aa71'/>
<id>03e0d4610bf4d4a93bfa16b2474ed4fd5243aa71</id>
<content type='text'>
touch_softlockup_watchdog() is used to tell watchdog that scheduler
stall is expected.  One group of usage is from paths where the task
may not be able to yield for a long time such as performing slow PIO
to finicky device and coming out of suspend.  The other is to account
for scheduler and timer going idle.

For scheduler softlockup detection, there's no reason to distinguish
the two cases; however, workqueue lockup detector is planned and it
can use the same signals from the former group while the latter would
spuriously prevent detection.  This patch introduces a new function
touch_softlockup_watchdog_sched() and convert the latter group to call
it instead.  For now, it just calls touch_softlockup_watchdog() and
there's no functional difference.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
touch_softlockup_watchdog() is used to tell watchdog that scheduler
stall is expected.  One group of usage is from paths where the task
may not be able to yield for a long time such as performing slow PIO
to finicky device and coming out of suspend.  The other is to account
for scheduler and timer going idle.

For scheduler softlockup detection, there's no reason to distinguish
the two cases; however, workqueue lockup detector is planned and it
can use the same signals from the former group while the latter would
spuriously prevent detection.  This patch introduces a new function
touch_softlockup_watchdog_sched() and convert the latter group to call
it instead.  For now, it just calls touch_softlockup_watchdog() and
there's no functional difference.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kernel/watchdog.c: fix race between proc_watchdog_thresh() and watchdog_timer_fn()</title>
<updated>2015-11-06T03:34:48+00:00</updated>
<author>
<name>Ulrich Obergfell</name>
<email>uobergfe@redhat.com</email>
</author>
<published>2015-11-06T02:44:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=39d2da2161d35de301ec5397ce9103c68b883054'/>
<id>39d2da2161d35de301ec5397ce9103c68b883054</id>
<content type='text'>
Theoretically it is possible that the watchdog timer expires right at the
time when a user sets 'watchdog_thresh' to zero (note: this disables the
lockup detectors).  In this scenario, the is_softlockup() function - which
is called by the timer - could produce a false positive.

Fix this by checking the current value of 'watchdog_thresh'.

Signed-off-by: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Theoretically it is possible that the watchdog timer expires right at the
time when a user sets 'watchdog_thresh' to zero (note: this disables the
lockup detectors).  In this scenario, the is_softlockup() function - which
is called by the timer - could produce a false positive.

Fix this by checking the current value of 'watchdog_thresh'.

Signed-off-by: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kernel/watchdog.c: remove {get|put}_online_cpus() from watchdog_{park|unpark}_threads()</title>
<updated>2015-11-06T03:34:48+00:00</updated>
<author>
<name>Ulrich Obergfell</name>
<email>uobergfe@redhat.com</email>
</author>
<published>2015-11-06T02:44:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a2a45b85ec45db4b041ea5d93b21033dbc3cc0fc'/>
<id>a2a45b85ec45db4b041ea5d93b21033dbc3cc0fc</id>
<content type='text'>
watchdog_{park|unpark}_threads() are now called in code paths that protect
themselves against CPU hotplug, so {get|put}_online_cpus() calls are
redundant and can be removed.

Signed-off-by: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
watchdog_{park|unpark}_threads() are now called in code paths that protect
themselves against CPU hotplug, so {get|put}_online_cpus() calls are
redundant and can be removed.

Signed-off-by: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kernel/watchdog.c: avoid races between /proc handlers and CPU hotplug</title>
<updated>2015-11-06T03:34:48+00:00</updated>
<author>
<name>Ulrich Obergfell</name>
<email>uobergfe@redhat.com</email>
</author>
<published>2015-11-06T02:44:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8614ddef82139d08234dbf681188f9bcddae9f03'/>
<id>8614ddef82139d08234dbf681188f9bcddae9f03</id>
<content type='text'>
The handler functions for watchdog parameters in /proc/sys/kernel do not
protect themselves against races with CPU hotplug.  Hence, theoretically
it is possible that a new watchdog thread is started on a hotplugged CPU
while a parameter is being modified, and the thread could thus use a
parameter value that is 'in transition'.

For example, if 'watchdog_thresh' is being set to zero (note: this
disables the lockup detectors) the thread would erroneously use the value
zero as the sample period.

To avoid such races and to keep the /proc handler code consistent,
call
     {get|put}_online_cpus() in proc_watchdog_common()
     {get|put}_online_cpus() in proc_watchdog_thresh()
     {get|put}_online_cpus() in proc_watchdog_cpumask()

Signed-off-by: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The handler functions for watchdog parameters in /proc/sys/kernel do not
protect themselves against races with CPU hotplug.  Hence, theoretically
it is possible that a new watchdog thread is started on a hotplugged CPU
while a parameter is being modified, and the thread could thus use a
parameter value that is 'in transition'.

For example, if 'watchdog_thresh' is being set to zero (note: this
disables the lockup detectors) the thread would erroneously use the value
zero as the sample period.

To avoid such races and to keep the /proc handler code consistent,
call
     {get|put}_online_cpus() in proc_watchdog_common()
     {get|put}_online_cpus() in proc_watchdog_thresh()
     {get|put}_online_cpus() in proc_watchdog_cpumask()

Signed-off-by: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>kernel/watchdog.c: avoid race between lockup detector suspend/resume and CPU hotplug</title>
<updated>2015-11-06T03:34:48+00:00</updated>
<author>
<name>Ulrich Obergfell</name>
<email>uobergfe@redhat.com</email>
</author>
<published>2015-11-06T02:44:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=ee89e71eb091d3ef8ca2be8bd4ec77ccfa91334c'/>
<id>ee89e71eb091d3ef8ca2be8bd4ec77ccfa91334c</id>
<content type='text'>
The lockup detector suspend/resume interface that was introduced by
commit 8c073d27d7ad ("watchdog: introduce watchdog_suspend() and
watchdog_resume()") does not protect itself against races with CPU
hotplug.  Hence, theoretically it is possible that a new watchdog thread
is started on a hotplugged CPU while the lockup detector is suspended,
and the thread could thus interfere unexpectedly with the code that
requested to suspend the lockup detector.

Avoid the race by calling

  get_online_cpus() in lockup_detector_suspend()
  put_online_cpus() in lockup_detector_resume()

Signed-off-by: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The lockup detector suspend/resume interface that was introduced by
commit 8c073d27d7ad ("watchdog: introduce watchdog_suspend() and
watchdog_resume()") does not protect itself against races with CPU
hotplug.  Hence, theoretically it is possible that a new watchdog thread
is started on a hotplugged CPU while the lockup detector is suspended,
and the thread could thus interfere unexpectedly with the code that
requested to suspend the lockup detector.

Avoid the race by calling

  get_online_cpus() in lockup_detector_suspend()
  put_online_cpus() in lockup_detector_resume()

Signed-off-by: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Reviewed-by: Aaron Tomlin &lt;atomlin@redhat.com&gt;
Cc: Ulrich Obergfell &lt;uobergfe@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
