<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/linux, branch v3.4.8</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>posix_types.h: Cleanup stale __NFDBITS and related definitions</title>
<updated>2012-08-09T15:31:39+00:00</updated>
<author>
<name>Josh Boyer</name>
<email>jwboyer@redhat.com</email>
</author>
<published>2012-07-25T14:40:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=27cd8f51344dcf4799c7a092c1797402b833126a'/>
<id>27cd8f51344dcf4799c7a092c1797402b833126a</id>
<content type='text'>
commit 8ded2bbc1845e19c771eb55209aab166ef011243 upstream.

Recently, glibc made a change to suppress sign-conversion warnings in
FD_SET (glibc commit ceb9e56b3d1).  This uncovered an issue with the
kernel's definition of __NFDBITS if applications #include
&lt;linux/types.h&gt; after including &lt;sys/select.h&gt;.  A build failure would
be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
flags to gcc.

It was suggested that the kernel should either match the glibc
definition of __NFDBITS or remove that entirely.  The current in-kernel
uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
uses of the related __FDELT and __FDMASK defines.  Given that, we'll
continue the cleanup that was started with commit 8b3d1cda4f5f
("posix_types: Remove fd_set macros") and drop the remaining unused
macros.

Additionally, linux/time.h has similar macros defined that expand to
nothing so we'll remove those at the same time.

Reported-by: Jeff Law &lt;law@redhat.com&gt;
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Josh Boyer &lt;jwboyer@redhat.com&gt;
[ .. and fix up whitespace as per akpm ]
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 8ded2bbc1845e19c771eb55209aab166ef011243 upstream.

Recently, glibc made a change to suppress sign-conversion warnings in
FD_SET (glibc commit ceb9e56b3d1).  This uncovered an issue with the
kernel's definition of __NFDBITS if applications #include
&lt;linux/types.h&gt; after including &lt;sys/select.h&gt;.  A build failure would
be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
flags to gcc.

It was suggested that the kernel should either match the glibc
definition of __NFDBITS or remove that entirely.  The current in-kernel
uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
uses of the related __FDELT and __FDMASK defines.  Given that, we'll
continue the cleanup that was started with commit 8b3d1cda4f5f
("posix_types: Remove fd_set macros") and drop the remaining unused
macros.

Additionally, linux/time.h has similar macros defined that expand to
nothing so we'll remove those at the same time.

Reported-by: Jeff Law &lt;law@redhat.com&gt;
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Josh Boyer &lt;jwboyer@redhat.com&gt;
[ .. and fix up whitespace as per akpm ]
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>workqueue: perform cpu down operations from low priority cpu_notifier()</title>
<updated>2012-08-09T15:31:37+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2012-07-17T19:39:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d3b42543cf269243ccf2add49008db879ff7f146'/>
<id>d3b42543cf269243ccf2add49008db879ff7f146</id>
<content type='text'>
commit 6575820221f7a4dd6eadecf7bf83cdd154335eda upstream.

Currently, all workqueue cpu hotplug operations run off
CPU_PRI_WORKQUEUE which is higher than normal notifiers.  This is to
ensure that workqueue is up and running while bringing up a CPU before
other notifiers try to use workqueue on the CPU.

Per-cpu workqueues are supposed to remain working and bound to the CPU
for normal CPU_DOWN_PREPARE notifiers.  This holds mostly true even
with workqueue offlining running with higher priority because
workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which
runs the per-cpu workqueue without concurrency management without
explicitly detaching the existing workers.

However, if the trustee needs to create new workers, it creates
unbound workers which may wander off to other CPUs while
CPU_DOWN_PREPARE notifiers are in progress.  Furthermore, if the CPU
down is cancelled, the per-CPU workqueue may end up with workers which
aren't bound to the CPU.

While reliably reproducible with a convoluted artificial test-case
involving scheduling and flushing CPU burning work items from CPU down
notifiers, this isn't very likely to happen in the wild, and, even
when it happens, the effects are likely to be hidden by the following
successful CPU down.

Fix it by using different priorities for up and down notifiers - high
priority for up operations and low priority for down operations.

Workqueue cpu hotplug operations will soon go through further cleanup.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: "Rafael J. Wysocki" &lt;rjw@sisk.pl&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 6575820221f7a4dd6eadecf7bf83cdd154335eda upstream.

Currently, all workqueue cpu hotplug operations run off
CPU_PRI_WORKQUEUE which is higher than normal notifiers.  This is to
ensure that workqueue is up and running while bringing up a CPU before
other notifiers try to use workqueue on the CPU.

Per-cpu workqueues are supposed to remain working and bound to the CPU
for normal CPU_DOWN_PREPARE notifiers.  This holds mostly true even
with workqueue offlining running with higher priority because
workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which
runs the per-cpu workqueue without concurrency management without
explicitly detaching the existing workers.

However, if the trustee needs to create new workers, it creates
unbound workers which may wander off to other CPUs while
CPU_DOWN_PREPARE notifiers are in progress.  Furthermore, if the CPU
down is cancelled, the per-CPU workqueue may end up with workers which
aren't bound to the CPU.

While reliably reproducible with a convoluted artificial test-case
involving scheduling and flushing CPU burning work items from CPU down
notifiers, this isn't very likely to happen in the wild, and, even
when it happens, the effects are likely to be hidden by the following
successful CPU down.

Fix it by using different priorities for up and down notifiers - high
priority for up operations and low priority for down operations.

Workqueue cpu hotplug operations will soon go through further cleanup.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: "Rafael J. Wysocki" &lt;rjw@sisk.pl&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>tun: fix a crash bug and a memory leak</title>
<updated>2012-08-09T15:31:30+00:00</updated>
<author>
<name>Mikulas Patocka</name>
<email>mikulas@artax.karlin.mff.cuni.cz</email>
</author>
<published>2012-07-19T06:13:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=20855fe2097ccfde927c6997101ae35340f1d278'/>
<id>20855fe2097ccfde927c6997101ae35340f1d278</id>
<content type='text'>
commit b09e786bd1dd66418b69348cb110f3a64764626a upstream.

This patch fixes a crash
tun_chr_close -&gt; netdev_run_todo -&gt; tun_free_netdev -&gt; sk_release_kernel -&gt;
sock_release -&gt; iput(SOCK_INODE(sock))
introduced by commit 1ab5ecb90cb6a3df1476e052f76a6e8f6511cb3d

The problem is that this socket is embedded in struct tun_struct, it has
no inode, iput is called on invalid inode, which modifies invalid memory
and optionally causes a crash.

sock_release also decrements sockets_in_use, this causes a bug that
"sockets: used" field in /proc/*/net/sockstat keeps on decreasing when
creating and closing tun devices.

This patch introduces a flag SOCK_EXTERNALLY_ALLOCATED that instructs
sock_release to not free the inode and not decrement sockets_in_use,
fixing both memory corruption and sockets_in_use underflow.

It should be backported to 3.3 an 3.4 stabke.

Signed-off-by: Mikulas Patocka &lt;mikulas@artax.karlin.mff.cuni.cz&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit b09e786bd1dd66418b69348cb110f3a64764626a upstream.

This patch fixes a crash
tun_chr_close -&gt; netdev_run_todo -&gt; tun_free_netdev -&gt; sk_release_kernel -&gt;
sock_release -&gt; iput(SOCK_INODE(sock))
introduced by commit 1ab5ecb90cb6a3df1476e052f76a6e8f6511cb3d

The problem is that this socket is embedded in struct tun_struct, it has
no inode, iput is called on invalid inode, which modifies invalid memory
and optionally causes a crash.

sock_release also decrements sockets_in_use, this causes a bug that
"sockets: used" field in /proc/*/net/sockstat keeps on decreasing when
creating and closing tun devices.

This patch introduces a flag SOCK_EXTERNALLY_ALLOCATED that instructs
sock_release to not free the inode and not decrement sockets_in_use,
fixing both memory corruption and sockets_in_use underflow.

It should be backported to 3.3 an 3.4 stabke.

Signed-off-by: Mikulas Patocka &lt;mikulas@artax.karlin.mff.cuni.cz&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>x86/mce: Fix siginfo_t-&gt;si_addr value for non-recoverable memory faults</title>
<updated>2012-08-09T15:31:29+00:00</updated>
<author>
<name>Tony Luck</name>
<email>tony.luck@intel.com</email>
</author>
<published>2012-07-11T17:20:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7b689c5d930f281e417597af9f817ba03dc9d898'/>
<id>7b689c5d930f281e417597af9f817ba03dc9d898</id>
<content type='text'>
commit 6751ed65dc6642af64f7b8a440a75563c8aab7ae upstream.

In commit dad1743e5993f1 ("x86/mce: Only restart instruction after machine
check recovery if it is safe") we fixed mce_notify_process() to force a
signal to the current process if it was not restartable (RIPV bit not
set in MCG_STATUS). But doing it here means that the process doesn't
get told the virtual address of the fault via siginfo_t-&gt;si_addr. This
would prevent application level recovery from the fault.

Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
that we will provide the right information with the signal.

Signed-off-by: Tony Luck &lt;tony.luck@intel.com&gt;
Acked-by: Borislav Petkov &lt;borislav.petkov@amd.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 6751ed65dc6642af64f7b8a440a75563c8aab7ae upstream.

In commit dad1743e5993f1 ("x86/mce: Only restart instruction after machine
check recovery if it is safe") we fixed mce_notify_process() to force a
signal to the current process if it was not restartable (RIPV bit not
set in MCG_STATUS). But doing it here means that the process doesn't
get told the virtual address of the fault via siginfo_t-&gt;si_addr. This
would prevent application level recovery from the fault.

Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
that we will provide the right information with the signal.

Signed-off-by: Tony Luck &lt;tony.luck@intel.com&gt;
Acked-by: Borislav Petkov &lt;borislav.petkov@amd.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>NFC: Export nfc.h to userland</title>
<updated>2012-07-19T15:59:00+00:00</updated>
<author>
<name>Samuel Ortiz</name>
<email>sameo@linux.intel.com</email>
</author>
<published>2012-05-10T17:45:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=016e7d822a72c4cbcbf7e4bc6d590cf50879ec26'/>
<id>016e7d822a72c4cbcbf7e4bc6d590cf50879ec26</id>
<content type='text'>
commit dbd4fcaf8d664fab4163b1f8682e41ad8bff3444 upstream.

The netlink commands and attributes, along with the socket structure
definitions need to be exported.

Signed-off-by: Samuel Ortiz &lt;sameo@linux.intel.com&gt;
Signed-off-by: John W. Linville &lt;linville@tuxdriver.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit dbd4fcaf8d664fab4163b1f8682e41ad8bff3444 upstream.

The netlink commands and attributes, along with the socket structure
definitions need to be exported.

Signed-off-by: Samuel Ortiz &lt;sameo@linux.intel.com&gt;
Signed-off-by: John W. Linville &lt;linville@tuxdriver.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>timekeeping: Provide hrtimer update function</title>
<updated>2012-07-19T15:59:00+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2012-07-17T06:39:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=765bdc4d82fadcddfec19222a545e904633c7816'/>
<id>765bdc4d82fadcddfec19222a545e904633c7816</id>
<content type='text'>
This is a backport of f6c06abfb3972ad4914cef57d8348fcb2932bc3b

To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Acked-by: Prarit Bhargava &lt;prarit@redhat.com&gt;
Signed-off-by: John Stultz &lt;johnstul@us.ibm.com&gt;
Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: John Stultz &lt;johnstul@us.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This is a backport of f6c06abfb3972ad4914cef57d8348fcb2932bc3b

To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Acked-by: Prarit Bhargava &lt;prarit@redhat.com&gt;
Signed-off-by: John Stultz &lt;johnstul@us.ibm.com&gt;
Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: John Stultz &lt;johnstul@us.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>hrtimer: Provide clock_was_set_delayed()</title>
<updated>2012-07-19T15:58:59+00:00</updated>
<author>
<name>John Stultz</name>
<email>johnstul@us.ibm.com</email>
</author>
<published>2012-07-17T06:39:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5e5006e64cae9603841405af9febb67064869d83'/>
<id>5e5006e64cae9603841405af9febb67064869d83</id>
<content type='text'>
This is a backport of f55a6faa384304c89cfef162768e88374d3312cb

clock_was_set() cannot be called from hard interrupt context because
it calls on_each_cpu().

For fixing the widely reported leap seconds issue it is necessary to
call it from hard interrupt context, i.e. the timer tick code, which
does the timekeeping updates.

Provide a new function which denotes it in the hrtimer cpu base
structure of the cpu on which it is called and raise the hrtimer
softirq. We then execute the clock_was_set() notificiation from
softirq context in run_hrtimer_softirq(). The hrtimer softirq is
rarely used, so polling the flag there is not a performance issue.

[ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get
  rid of all this ifdeffery ASAP ]

Signed-off-by: John Stultz &lt;johnstul@us.ibm.com&gt;
Reported-by: Jan Engelhardt &lt;jengelh@inai.de&gt;
Reviewed-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Acked-by: Prarit Bhargava &lt;prarit@redhat.com&gt;
Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: John Stultz &lt;johnstul@us.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This is a backport of f55a6faa384304c89cfef162768e88374d3312cb

clock_was_set() cannot be called from hard interrupt context because
it calls on_each_cpu().

For fixing the widely reported leap seconds issue it is necessary to
call it from hard interrupt context, i.e. the timer tick code, which
does the timekeeping updates.

Provide a new function which denotes it in the hrtimer cpu base
structure of the cpu on which it is called and raise the hrtimer
softirq. We then execute the clock_was_set() notificiation from
softirq context in run_hrtimer_softirq(). The hrtimer softirq is
rarely used, so polling the flag there is not a performance issue.

[ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get
  rid of all this ifdeffery ASAP ]

Signed-off-by: John Stultz &lt;johnstul@us.ibm.com&gt;
Reported-by: Jan Engelhardt &lt;jengelh@inai.de&gt;
Reviewed-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Acked-by: Prarit Bhargava &lt;prarit@redhat.com&gt;
Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: John Stultz &lt;johnstul@us.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/nohz: Rewrite and fix load-avg computation -- again</title>
<updated>2012-07-19T15:58:56+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>a.p.zijlstra@chello.nl</email>
</author>
<published>2012-06-22T13:52:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7490d0a4cfefa16f9d8ce636eb5b2e13d2432db3'/>
<id>7490d0a4cfefa16f9d8ce636eb5b2e13d2432db3</id>
<content type='text'>
commit 5167e8d5417bf5c322a703d2927daec727ea40dd upstream.

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies &lt;dsmythies@telus.net&gt;
Reported-and-tested-by: Charles Wang &lt;muming.wq@gmail.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 5167e8d5417bf5c322a703d2927daec727ea40dd upstream.

Thanks to Charles Wang for spotting the defects in the current code:

 - If we go idle during the sample window -- after sampling, we get a
   negative bias because we can negate our own sample.

 - If we wake up during the sample window we get a positive bias
   because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Reported-and-tested-by: Doug Smythies &lt;dsmythies@telus.net&gt;
Reported-and-tested-by: Charles Wang &lt;muming.wq@gmail.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>memblock: free allocated memblock_reserved_regions later</title>
<updated>2012-07-16T16:04:45+00:00</updated>
<author>
<name>Yinghai Lu</name>
<email>yinghai@kernel.org</email>
</author>
<published>2012-07-11T21:02:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7ad71f960f0f6e06cbded278809674afc515036a'/>
<id>7ad71f960f0f6e06cbded278809674afc515036a</id>
<content type='text'>
commit 29f6738609e40227dabcc63bfb3b84b3726a75bd upstream.

memblock_free_reserved_regions() calls memblock_free(), but
memblock_free() would double reserved.regions too, so we could free the
old range for reserved.regions.

Also tj said there is another bug which could be related to this.

| I don't think we're saving any noticeable
| amount by doing this "free - give it to page allocator - reserve
| again" dancing.  We should just allocate regions aligned to page
| boundaries and free them later when memblock is no longer in use.

in that case, when DEBUG_PAGEALLOC, will get panic:

     memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
  BUG: unable to handle kernel paging request at ffff88102febd948
  IP: [&lt;ffffffff836a5774&gt;] __next_free_mem_range+0x9b/0x155
  PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  CPU 0
  Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
  RIP: 0010:[&lt;ffffffff836a5774&gt;]  [&lt;ffffffff836a5774&gt;] __next_free_mem_range+0x9b/0x155

See the discussion at https://lkml.org/lkml/2012/6/13/469

So try to allocate with PAGE_SIZE alignment and free it later.

Reported-by: Sasha Levin &lt;levinsasha928@gmail.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Signed-off-by: Yinghai Lu &lt;yinghai@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 29f6738609e40227dabcc63bfb3b84b3726a75bd upstream.

memblock_free_reserved_regions() calls memblock_free(), but
memblock_free() would double reserved.regions too, so we could free the
old range for reserved.regions.

Also tj said there is another bug which could be related to this.

| I don't think we're saving any noticeable
| amount by doing this "free - give it to page allocator - reserve
| again" dancing.  We should just allocate regions aligned to page
| boundaries and free them later when memblock is no longer in use.

in that case, when DEBUG_PAGEALLOC, will get panic:

     memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
  BUG: unable to handle kernel paging request at ffff88102febd948
  IP: [&lt;ffffffff836a5774&gt;] __next_free_mem_range+0x9b/0x155
  PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  CPU 0
  Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
  RIP: 0010:[&lt;ffffffff836a5774&gt;]  [&lt;ffffffff836a5774&gt;] __next_free_mem_range+0x9b/0x155

See the discussion at https://lkml.org/lkml/2012/6/13/469

So try to allocate with PAGE_SIZE alignment and free it later.

Reported-by: Sasha Levin &lt;levinsasha928@gmail.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Signed-off-by: Yinghai Lu &lt;yinghai@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>memory hotplug: fix invalid memory access caused by stale kswapd pointer</title>
<updated>2012-07-16T16:04:44+00:00</updated>
<author>
<name>Jiang Liu</name>
<email>jiang.liu@huawei.com</email>
</author>
<published>2012-07-11T21:01:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0e343dbe08acb440f7914d989bcc32c1d1576735'/>
<id>0e343dbe08acb440f7914d989bcc32c1d1576735</id>
<content type='text'>
commit d8adde17e5f858427504725218c56aef90e90fc7 upstream.

kswapd_stop() is called to destroy the kswapd work thread when all memory
of a NUMA node has been offlined.  But kswapd_stop() only terminates the
work thread without resetting NODE_DATA(nid)-&gt;kswapd to NULL.  The stale
pointer will prevent kswapd_run() from creating a new work thread when
adding memory to the memory-less NUMA node again.  Eventually the stale
pointer may cause invalid memory access.

An example stack dump as below. It's reproduced with 2.6.32, but latest
kernel has the same issue.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [&lt;ffffffff81051a94&gt;] exit_creds+0x12/0x78
  PGD 0
  Oops: 0000 [#1] SMP
  last sysfs file: /sys/devices/system/memory/memory391/state
  CPU 11
  Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
  Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
  RIP: 0010:exit_creds+0x12/0x78
  RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
  RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
  RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
  RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
  R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
  R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
  FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
  Stack:
   ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
   ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
   0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
  Call Trace:
    __put_task_struct+0x5d/0x97
    kthread_stop+0x50/0x58
    offline_pages+0x324/0x3da
    memory_block_change_state+0x179/0x1db
    store_mem_state+0x9e/0xbb
    sysfs_write_file+0xd0/0x107
    vfs_write+0xad/0x169
    sys_write+0x45/0x6e
    system_call_fastpath+0x16/0x1b
  Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 &lt;8b&gt; 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
  RIP  exit_creds+0x12/0x78
   RSP &lt;ffff8806044f1d78&gt;
  CR2: 0000000000000000

[akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
Signed-off-by: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Signed-off-by: Jiang Liu &lt;jiang.liu@huawei.com&gt;
Acked-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit d8adde17e5f858427504725218c56aef90e90fc7 upstream.

kswapd_stop() is called to destroy the kswapd work thread when all memory
of a NUMA node has been offlined.  But kswapd_stop() only terminates the
work thread without resetting NODE_DATA(nid)-&gt;kswapd to NULL.  The stale
pointer will prevent kswapd_run() from creating a new work thread when
adding memory to the memory-less NUMA node again.  Eventually the stale
pointer may cause invalid memory access.

An example stack dump as below. It's reproduced with 2.6.32, but latest
kernel has the same issue.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [&lt;ffffffff81051a94&gt;] exit_creds+0x12/0x78
  PGD 0
  Oops: 0000 [#1] SMP
  last sysfs file: /sys/devices/system/memory/memory391/state
  CPU 11
  Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
  Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
  RIP: 0010:exit_creds+0x12/0x78
  RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
  RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
  RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
  RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
  R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
  R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
  FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
  Stack:
   ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
   ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
   0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
  Call Trace:
    __put_task_struct+0x5d/0x97
    kthread_stop+0x50/0x58
    offline_pages+0x324/0x3da
    memory_block_change_state+0x179/0x1db
    store_mem_state+0x9e/0xbb
    sysfs_write_file+0xd0/0x107
    vfs_write+0xad/0x169
    sys_write+0x45/0x6e
    system_call_fastpath+0x16/0x1b
  Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 &lt;8b&gt; 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
  RIP  exit_creds+0x12/0x78
   RSP &lt;ffff8806044f1d78&gt;
  CR2: 0000000000000000

[akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
Signed-off-by: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Signed-off-by: Jiang Liu &lt;jiang.liu@huawei.com&gt;
Acked-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
</feed>
