<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/arch/powerpc/kernel/watchdog.c, branch v6.7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>nmi_backtrace: allow excluding an arbitrary CPU</title>
<updated>2023-08-18T17:19:00+00:00</updated>
<author>
<name>Douglas Anderson</name>
<email>dianders@chromium.org</email>
</author>
<published>2023-08-04T14:00:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8d539b84f1e3478436f978ceaf55a0b6cab497b5'/>
<id>8d539b84f1e3478436f978ceaf55a0b6cab497b5</id>
<content type='text'>
The APIs that allow backtracing across CPUs have always had a way to
exclude the current CPU.  This convenience means callers didn't need to
find a place to allocate a CPU mask just to handle the common case.

Let's extend the API to take a CPU ID to exclude instead of just a
boolean.  This isn't any more complex for the API to handle and allows the
hardlockup detector to exclude a different CPU (the one it already did a
trace for) without needing to find space for a CPU mask.

Arguably, this new API also encourages safer behavior.  Specifically if
the caller wants to avoid tracing the current CPU (maybe because they
already traced the current CPU) this makes it more obvious to the caller
that they need to make sure that the current CPU ID can't change.

[akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub]
Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid
Signed-off-by: Douglas Anderson &lt;dianders@chromium.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: kernel test robot &lt;lkp@intel.com&gt;
Cc: Lecopzer Chen &lt;lecopzer.chen@mediatek.com&gt;
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Cc: Pingfan Liu &lt;kernelfans@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The APIs that allow backtracing across CPUs have always had a way to
exclude the current CPU.  This convenience means callers didn't need to
find a place to allocate a CPU mask just to handle the common case.

Let's extend the API to take a CPU ID to exclude instead of just a
boolean.  This isn't any more complex for the API to handle and allows the
hardlockup detector to exclude a different CPU (the one it already did a
trace for) without needing to find space for a CPU mask.

Arguably, this new API also encourages safer behavior.  Specifically if
the caller wants to avoid tracing the current CPU (maybe because they
already traced the current CPU) this makes it more obvious to the caller
that they need to make sure that the current CPU ID can't change.

[akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub]
Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid
Signed-off-by: Douglas Anderson &lt;dianders@chromium.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: kernel test robot &lt;lkp@intel.com&gt;
Cc: Lecopzer Chen &lt;lecopzer.chen@mediatek.com&gt;
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Cc: Pingfan Liu &lt;kernelfans@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>watchdog/hardlockup: rename some "NMI watchdog" constants/function</title>
<updated>2023-06-10T00:44:20+00:00</updated>
<author>
<name>Douglas Anderson</name>
<email>dianders@chromium.org</email>
</author>
<published>2023-05-19T17:18:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=df95d3085caa5b99a60eb033d7ad6c2ff2b43dbf'/>
<id>df95d3085caa5b99a60eb033d7ad6c2ff2b43dbf</id>
<content type='text'>
Do a search and replace of:
- NMI_WATCHDOG_ENABLED =&gt; WATCHDOG_HARDLOCKUP_ENABLED
- SOFT_WATCHDOG_ENABLED =&gt; WATCHDOG_SOFTOCKUP_ENABLED
- watchdog_nmi_ =&gt; watchdog_hardlockup_
- nmi_watchdog_available =&gt; watchdog_hardlockup_available
- nmi_watchdog_user_enabled =&gt; watchdog_hardlockup_user_enabled
- soft_watchdog_user_enabled =&gt; watchdog_softlockup_user_enabled
- NMI_WATCHDOG_DEFAULT =&gt; WATCHDOG_HARDLOCKUP_DEFAULT

Then update a few comments near where names were changed.

This is specifically to make it less confusing when we want to introduce
the buddy hardlockup detector, which isn't using NMIs.  As part of this,
we sanitized a few names for consistency.

[trix@redhat.com: make variables static]
  Link: https://lkml.kernel.org/r/20230525162822.1.I0fb41d138d158c9230573eaa37dc56afa2fb14ee@changeid
Link: https://lkml.kernel.org/r/20230519101840.v5.12.I91f7277bab4bf8c0cb238732ed92e7ce7bbd71a6@changeid
Signed-off-by: Douglas Anderson &lt;dianders@chromium.org&gt;
Signed-off-by: Tom Rix &lt;trix@redhat.com&gt;
Reviewed-by: Tom Rix &lt;trix@redhat.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Chen-Yu Tsai &lt;wens@csie.org&gt;
Cc: Christophe Leroy &lt;christophe.leroy@csgroup.eu&gt;
Cc: Colin Cross &lt;ccross@android.com&gt;
Cc: Daniel Thompson &lt;daniel.thompson@linaro.org&gt;
Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Guenter Roeck &lt;groeck@chromium.org&gt;
Cc: Ian Rogers &lt;irogers@google.com&gt;
Cc: Lecopzer Chen &lt;lecopzer.chen@mediatek.com&gt;
Cc: Marc Zyngier &lt;maz@kernel.org&gt;
Cc: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Masayoshi Mizuma &lt;msys.mizuma@gmail.com&gt;
Cc: Matthias Kaehlcke &lt;mka@chromium.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Cc: Pingfan Liu &lt;kernelfans@gmail.com&gt;
Cc: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: "Ravi V. Shankar" &lt;ravi.v.shankar@intel.com&gt;
Cc: Ricardo Neri &lt;ricardo.neri@intel.com&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Stephen Boyd &lt;swboyd@chromium.org&gt;
Cc: Sumit Garg &lt;sumit.garg@linaro.org&gt;
Cc: Tzung-Bi Shih &lt;tzungbi@chromium.org&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Do a search and replace of:
- NMI_WATCHDOG_ENABLED =&gt; WATCHDOG_HARDLOCKUP_ENABLED
- SOFT_WATCHDOG_ENABLED =&gt; WATCHDOG_SOFTOCKUP_ENABLED
- watchdog_nmi_ =&gt; watchdog_hardlockup_
- nmi_watchdog_available =&gt; watchdog_hardlockup_available
- nmi_watchdog_user_enabled =&gt; watchdog_hardlockup_user_enabled
- soft_watchdog_user_enabled =&gt; watchdog_softlockup_user_enabled
- NMI_WATCHDOG_DEFAULT =&gt; WATCHDOG_HARDLOCKUP_DEFAULT

Then update a few comments near where names were changed.

This is specifically to make it less confusing when we want to introduce
the buddy hardlockup detector, which isn't using NMIs.  As part of this,
we sanitized a few names for consistency.

[trix@redhat.com: make variables static]
  Link: https://lkml.kernel.org/r/20230525162822.1.I0fb41d138d158c9230573eaa37dc56afa2fb14ee@changeid
Link: https://lkml.kernel.org/r/20230519101840.v5.12.I91f7277bab4bf8c0cb238732ed92e7ce7bbd71a6@changeid
Signed-off-by: Douglas Anderson &lt;dianders@chromium.org&gt;
Signed-off-by: Tom Rix &lt;trix@redhat.com&gt;
Reviewed-by: Tom Rix &lt;trix@redhat.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Chen-Yu Tsai &lt;wens@csie.org&gt;
Cc: Christophe Leroy &lt;christophe.leroy@csgroup.eu&gt;
Cc: Colin Cross &lt;ccross@android.com&gt;
Cc: Daniel Thompson &lt;daniel.thompson@linaro.org&gt;
Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Guenter Roeck &lt;groeck@chromium.org&gt;
Cc: Ian Rogers &lt;irogers@google.com&gt;
Cc: Lecopzer Chen &lt;lecopzer.chen@mediatek.com&gt;
Cc: Marc Zyngier &lt;maz@kernel.org&gt;
Cc: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Masayoshi Mizuma &lt;msys.mizuma@gmail.com&gt;
Cc: Matthias Kaehlcke &lt;mka@chromium.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Cc: Pingfan Liu &lt;kernelfans@gmail.com&gt;
Cc: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Cc: "Ravi V. Shankar" &lt;ravi.v.shankar@intel.com&gt;
Cc: Ricardo Neri &lt;ricardo.neri@intel.com&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Stephen Boyd &lt;swboyd@chromium.org&gt;
Cc: Sumit Garg &lt;sumit.garg@linaro.org&gt;
Cc: Tzung-Bi Shih &lt;tzungbi@chromium.org&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: introduce a NMI watchdog's factor</title>
<updated>2022-07-27T11:36:02+00:00</updated>
<author>
<name>Laurent Dufour</name>
<email>ldufour@linux.ibm.com</email>
</author>
<published>2022-07-13T15:47:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f5e74e836097d1004077390717d4bd95d4a2c27a'/>
<id>f5e74e836097d1004077390717d4bd95d4a2c27a</id>
<content type='text'>
Introduce a factor which would apply to the NMI watchdog timeout.

This factor is a percentage added to the watchdog_tresh value. The value is
set under the watchdog_mutex protection and lockup_detector_reconfigure()
is called to recompute wd_panic_timeout_tb.

Once the factor is set, it remains until it is set back to 0, which means
no impact.

Signed-off-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Reviewed-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20220713154729.80789-4-ldufour@linux.ibm.com

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Introduce a factor which would apply to the NMI watchdog timeout.

This factor is a percentage added to the watchdog_tresh value. The value is
set under the watchdog_mutex protection and lockup_detector_reconfigure()
is called to recompute wd_panic_timeout_tb.

Once the factor is set, it remains until it is set back to 0, which means
no impact.

Signed-off-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Reviewed-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20220713154729.80789-4-ldufour@linux.ibm.com

</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc: Fix all occurences of duplicate words</title>
<updated>2022-07-25T02:05:15+00:00</updated>
<author>
<name>Michael Ellerman</name>
<email>mpe@ellerman.id.au</email>
</author>
<published>2022-07-18T09:51:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2b461880c20777d317b4ad24ef040918860133ca'/>
<id>2b461880c20777d317b4ad24ef040918860133ca</id>
<content type='text'>
Since commit 87c78b612f4f ("powerpc: Fix all occurences of "the the"")
fixed "the the", there's now a steady stream of patches fixing other
duplicate words.

Just fix them all at once, to save the overhead of dealing with
individual patches for each case.

This leaves a few cases of "that that", which in some contexts is
correct.

Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20220718095158.326606-1-mpe@ellerman.id.au

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Since commit 87c78b612f4f ("powerpc: Fix all occurences of "the the"")
fixed "the the", there's now a steady stream of patches fixing other
duplicate words.

Just fix them all at once, to save the overhead of dealing with
individual patches for each case.

This leaves a few cases of "that that", which in some contexts is
correct.

Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20220718095158.326606-1-mpe@ellerman.id.au

</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc: fix typos in comments</title>
<updated>2022-05-05T12:12:44+00:00</updated>
<author>
<name>Julia Lawall</name>
<email>Julia.Lawall@inria.fr</email>
</author>
<published>2022-04-30T18:56:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1fd02f6605b855b4af2883f29a2abc88bdf17857'/>
<id>1fd02f6605b855b4af2883f29a2abc88bdf17857</id>
<content type='text'>
Various spelling mistakes in comments.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall &lt;Julia.Lawall@inria.fr&gt;
Reviewed-by: Joel Stanley &lt;joel@jms.id.au&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20220430185654.5855-1-Julia.Lawall@inria.fr

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Various spelling mistakes in comments.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall &lt;Julia.Lawall@inria.fr&gt;
Reviewed-by: Joel Stanley &lt;joel@jms.id.au&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20220430185654.5855-1-Julia.Lawall@inria.fr

</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: help remote CPUs to flush NMI printk output</title>
<updated>2021-11-29T12:08:43+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-19T11:31:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e012c499985c608c936410d8bab29d9596d62859'/>
<id>e012c499985c608c936410d8bab29d9596d62859</id>
<content type='text'>
The printk layer at the moment does not seem to have a good way to force
flush printk messages that are created in NMI context, except in the
panic path.

NMI-context printk messages normally get to the console with irq_work,
but that won't help if the CPU is stuck with irqs disabled, as can be
the case for hard lockup watchdog messages.

The watchdog currently flushes the printk buffers after detecting a
lockup on remote CPUs, but they may not have processed their NMI IPI
yet by that stage, or they may have self-detected a lockup in which
case they won't go via this NMI IPI path.

Improve the situation by having NMI-context mark a flag if it called
printk, and have watchdog timer interrupts check if that flag was set
and try to flush if it was. Latency is not a big problem because we
were already stuck for a while, just need to try to make sure the
messages eventually make it out.

Depends-on: 5d5e4522a7f4 ("printk: restore flushing of NMI buffers on remote CPUs after NMI backtraces")
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211119113146.752759-6-npiggin@gmail.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The printk layer at the moment does not seem to have a good way to force
flush printk messages that are created in NMI context, except in the
panic path.

NMI-context printk messages normally get to the console with irq_work,
but that won't help if the CPU is stuck with irqs disabled, as can be
the case for hard lockup watchdog messages.

The watchdog currently flushes the printk buffers after detecting a
lockup on remote CPUs, but they may not have processed their NMI IPI
yet by that stage, or they may have self-detected a lockup in which
case they won't go via this NMI IPI path.

Improve the situation by having NMI-context mark a flag if it called
printk, and have watchdog timer interrupts check if that flag was set
and try to flush if it was. Latency is not a big problem because we
were already stuck for a while, just need to try to make sure the
messages eventually make it out.

Depends-on: 5d5e4522a7f4 ("printk: restore flushing of NMI buffers on remote CPUs after NMI backtraces")
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211119113146.752759-6-npiggin@gmail.com
</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: Fix wd_smp_last_reset_tb reporting</title>
<updated>2021-11-25T10:48:39+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-25T10:33:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3d030e301856da366380b3865fce6c03037b08a6'/>
<id>3d030e301856da366380b3865fce6c03037b08a6</id>
<content type='text'>
wd_smp_last_reset_tb now gets reset by watchdog_smp_panic() as part of
marking CPUs stuck and removing them from the pending mask before it
begins any printing. This causes last reset times reported to be off.

Fix this by reading it into a local variable before it gets reset.

Fixes: 76521c4b0291 ("powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi")
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211125103346.1188958-1-npiggin@gmail.com

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
wd_smp_last_reset_tb now gets reset by watchdog_smp_panic() as part of
marking CPUs stuck and removing them from the pending mask before it
begins any printing. This causes last reset times reported to be off.

Fix this by reading it into a local variable before it gets reset.

Fixes: 76521c4b0291 ("powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi")
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211125103346.1188958-1-npiggin@gmail.com

</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: read TB close to where it is used</title>
<updated>2021-11-25T00:25:34+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-10T02:50:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1f01bf90765fa5f88fbae452c131c1edf5cda7ba'/>
<id>1f01bf90765fa5f88fbae452c131c1edf5cda7ba</id>
<content type='text'>
When taking watchdog actions, printing messages, comparing and
re-setting wd_smp_last_reset_tb, etc., read TB close to the point of use
and under wd_smp_lock or printing lock (if applicable).

This should keep timebase mostly monotonic with kernel log messages, and
could prevent (in theory) a laggy CPU updating wd_smp_last_reset_tb to
something a long way in the past, and causing other CPUs to appear to be
stuck.

These additional TB reads are all slowpath (lockup has been detected),
so performance does not matter.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-5-npiggin@gmail.com

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When taking watchdog actions, printing messages, comparing and
re-setting wd_smp_last_reset_tb, etc., read TB close to the point of use
and under wd_smp_lock or printing lock (if applicable).

This should keep timebase mostly monotonic with kernel log messages, and
could prevent (in theory) a laggy CPU updating wd_smp_last_reset_tb to
something a long way in the past, and causing other CPUs to appear to be
stuck.

These additional TB reads are all slowpath (lockup has been detected),
so performance does not matter.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-5-npiggin@gmail.com

</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi</title>
<updated>2021-11-25T00:25:33+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-10T02:50:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=76521c4b0291ad25723638ade5a0ff4d5f659771'/>
<id>76521c4b0291ad25723638ade5a0ff4d5f659771</id>
<content type='text'>
There is a deadlock with the console_owner lock and the wd_smp_lock:

CPU x takes the console_owner lock
CPU y takes a watchdog timer interrupt and takes __wd_smp_lock
CPU x takes a soft-NMI interrupt, detects deadlock, spins on __wd_smp_lock
CPU y detects deadlock, tries to print something and spins on console_owner
-&gt; deadlock

Change the watchdog locking scheme so wd_smp_lock protects the watchdog
internal data, but "reporting" (printing, issuing NMI IPIs, taking any
action outside of watchdog) uses a non-waiting exclusion. If a CPU detects
a problem but can not take the reporting lock, it just returns because
something else is already reporting. It will try again at some point.

Typically hard lockup watchdog report usefulness is not impacted due to
failure to spewing a large enough amount of data in as short a time as
possible, but by messages getting garbled.

Laurent debugged this and found the deadlock, and this patch is based on
his general approach to avoid expensive operations while holding the lock.
With the addition of the reporting exclusion.

Signed-off-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
[np: rework to add reporting exclusion update changelog]
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-4-npiggin@gmail.com

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There is a deadlock with the console_owner lock and the wd_smp_lock:

CPU x takes the console_owner lock
CPU y takes a watchdog timer interrupt and takes __wd_smp_lock
CPU x takes a soft-NMI interrupt, detects deadlock, spins on __wd_smp_lock
CPU y detects deadlock, tries to print something and spins on console_owner
-&gt; deadlock

Change the watchdog locking scheme so wd_smp_lock protects the watchdog
internal data, but "reporting" (printing, issuing NMI IPIs, taking any
action outside of watchdog) uses a non-waiting exclusion. If a CPU detects
a problem but can not take the reporting lock, it just returns because
something else is already reporting. It will try again at some point.

Typically hard lockup watchdog report usefulness is not impacted due to
failure to spewing a large enough amount of data in as short a time as
possible, but by messages getting garbled.

Laurent debugged this and found the deadlock, and this patch is based on
his general approach to avoid expensive operations while holding the lock.
With the addition of the reporting exclusion.

Signed-off-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
[np: rework to add reporting exclusion update changelog]
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-4-npiggin@gmail.com

</pre>
</div>
</content>
</entry>
<entry>
<title>powerpc/watchdog: tighten non-atomic read-modify-write access</title>
<updated>2021-11-25T00:25:33+00:00</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2021-11-10T02:50:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=858c93c31504ac1507084493d7eafbe7e2302dc2'/>
<id>858c93c31504ac1507084493d7eafbe7e2302dc2</id>
<content type='text'>
Most updates to wd_smp_cpus_pending are under lock except the watchdog
interrupt bit clear.

This can race with non-atomic RMW updates to the mask under lock, which
can happen in two instances:

Firstly, if another CPU detects this one is stuck, removes it from the
mask, mask becomes empty and is re-filled with non-atomic stores. This
is okay because it would re-fill the mask with this CPU's bit clear
anyway (because this CPU is now stuck), so it doesn't matter that the
bit clear update got "lost". Add a comment for this.

Secondly, if another CPU detects a different CPU is stuck and removes it
from the pending mask with a non-atomic store to bytes which also
include the bit of this CPU. This case can result in the bit clear being
lost and the end result being the bit is set. This should be so rare it
hardly matters, but to make things simpler to reason about just avoid
the non-atomic access for that case.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-3-npiggin@gmail.com

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Most updates to wd_smp_cpus_pending are under lock except the watchdog
interrupt bit clear.

This can race with non-atomic RMW updates to the mask under lock, which
can happen in two instances:

Firstly, if another CPU detects this one is stuck, removes it from the
mask, mask becomes empty and is re-filled with non-atomic stores. This
is okay because it would re-fill the mask with this CPU's bit clear
anyway (because this CPU is now stuck), so it doesn't matter that the
bit clear update got "lost". Add a comment for this.

Secondly, if another CPU detects a different CPU is stuck and removes it
from the pending mask with a non-atomic store to bytes which also
include the bit of this CPU. This case can result in the bit clear being
lost and the end result being the bit is set. This should be so rare it
hardly matters, but to make things simpler to reason about just avoid
the non-atomic access for that case.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Reviewed-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20211110025056.2084347-3-npiggin@gmail.com

</pre>
</div>
</content>
</entry>
</feed>
