<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/mm/memory_hotplug.c, branch v4.10</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>base/memory, hotplug: fix a kernel oops in show_valid_zones()</title>
<updated>2017-02-03T22:13:19+00:00</updated>
<author>
<name>Toshi Kani</name>
<email>toshi.kani@hpe.com</email>
</author>
<published>2017-02-03T21:13:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a96dfddbcc04336bbed50dc2b24823e45e09e80c'/>
<id>a96dfddbcc04336bbed50dc2b24823e45e09e80c</id>
<content type='text'>
Reading a sysfs "memoryN/valid_zones" file leads to the following oops
when the first page of a range is not backed by struct page.
show_valid_zones() assumes that 'start_pfn' is always valid for
page_zone().

 BUG: unable to handle kernel paging request at ffffea017a000000
 IP: show_valid_zones+0x6f/0x160

This issue may happen on x86-64 systems with 64GiB or more memory since
their memory block size is bumped up to 2GiB.  [1] An example of such
systems is desribed below.  0x3240000000 is only aligned by 1GiB and
this memory block starts from 0x3200000000, which is not backed by
struct page.

 BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

Since test_pages_in_a_zone() already checks holes, fix this issue by
extending this function to return 'valid_start' and 'valid_end' for a
given range.  show_valid_zones() then proceeds with the valid range.

[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
Signed-off-by: Toshi Kani &lt;toshi.kani@hpe.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Zhang Zhen &lt;zhenzhang.zhang@huawei.com&gt;
Cc: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.4+]

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Reading a sysfs "memoryN/valid_zones" file leads to the following oops
when the first page of a range is not backed by struct page.
show_valid_zones() assumes that 'start_pfn' is always valid for
page_zone().

 BUG: unable to handle kernel paging request at ffffea017a000000
 IP: show_valid_zones+0x6f/0x160

This issue may happen on x86-64 systems with 64GiB or more memory since
their memory block size is bumped up to 2GiB.  [1] An example of such
systems is desribed below.  0x3240000000 is only aligned by 1GiB and
this memory block starts from 0x3200000000, which is not backed by
struct page.

 BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable

Since test_pages_in_a_zone() already checks holes, fix this issue by
extending this function to return 'valid_start' and 'valid_end' for a
given range.  show_valid_zones() then proceeds with the valid range.

[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
Signed-off-by: Toshi Kani &lt;toshi.kani@hpe.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Zhang Zhen &lt;zhenzhang.zhang@huawei.com&gt;
Cc: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.4+]

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()</title>
<updated>2017-02-03T22:13:19+00:00</updated>
<author>
<name>Toshi Kani</name>
<email>toshi.kani@hpe.com</email>
</author>
<published>2017-02-03T21:13:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=deb88a2a19e85842d79ba96b05031739ec327ff4'/>
<id>deb88a2a19e85842d79ba96b05031739ec327ff4</id>
<content type='text'>
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.

A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory.  [1] When the start address of a
memory block is not backed by struct page, i.e.  a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops.  This issue was observed on multiple x86-64 systems with
more than 64GiB of memory.  This patch-set fixes this issue.

Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.

Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).

Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9.  However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.

So, I recommend that we backport it up to 4.4.

[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

This patch (of 2):

test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'.  Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.

Fix it by properly setting 'sec_end_pfn' to the next section pfn.

Also make sure that this function returns 1 only when the range belongs
to a zone.

Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani &lt;toshi.kani@hpe.com&gt;
Cc: Andrew Banman &lt;abanman@sgi.com&gt;
Cc: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.4+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.

A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory.  [1] When the start address of a
memory block is not backed by struct page, i.e.  a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops.  This issue was observed on multiple x86-64 systems with
more than 64GiB of memory.  This patch-set fixes this issue.

Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.

Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).

Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9.  However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.

So, I recommend that we backport it up to 4.4.

[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

This patch (of 2):

test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'.  Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.

Fix it by properly setting 'sec_end_pfn' to the next section pfn.

Also make sure that this function returns 1 only when the range belongs
to a zone.

Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani &lt;toshi.kani@hpe.com&gt;
Cc: Andrew Banman &lt;abanman@sgi.com&gt;
Cc: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Cc: Greg KH &lt;greg@kroah.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.4+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>memory_hotplug: make zone_can_shift() return a boolean value</title>
<updated>2017-01-25T00:26:14+00:00</updated>
<author>
<name>Yasuaki Ishimatsu</name>
<email>yasu.isimatu@gmail.com</email>
</author>
<published>2017-01-24T23:17:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8a1f780e7f28c7c1d640118242cf68d528c456cd'/>
<id>8a1f780e7f28c7c1d640118242cf68d528c456cd</id>
<content type='text'>
online_{kernel|movable} is used to change the memory zone to
ZONE_{NORMAL|MOVABLE} and online the memory.

To check that memory zone can be changed, zone_can_shift() is used.
Currently the function returns minus integer value, plus integer
value and 0. When the function returns minus or plus integer value,
it means that the memory zone can be changed to ZONE_{NORNAL|MOVABLE}.

But when the function returns 0, there are two meanings.

One of the meanings is that the memory zone does not need to be changed.
For example, when memory is in ZONE_NORMAL and onlined by online_kernel
the memory zone does not need to be changed.

Another meaning is that the memory zone cannot be changed. When memory
is in ZONE_NORMAL and onlined by online_movable, the memory zone may
not be changed to ZONE_MOVALBE due to memory online limitation(see
Documentation/memory-hotplug.txt). In this case, memory must not be
onlined.

The patch changes the return type of zone_can_shift() so that memory
online operation fails when memory zone cannot be changed as follows:

Before applying patch:
   # grep -A 35 "Node 2" /proc/zoneinfo
   Node 2, zone   Normal
   &lt;snip&gt;
      node_scanned  0
           spanned  8388608
           present  7864320
           managed  7864320
   # echo online_movable &gt; memory4097/state
   # grep -A 35 "Node 2" /proc/zoneinfo
   Node 2, zone   Normal
   &lt;snip&gt;
      node_scanned  0
           spanned  8388608
           present  8388608
           managed  8388608

   online_movable operation succeeded. But memory is onlined as
   ZONE_NORMAL, not ZONE_MOVABLE.

After applying patch:
   # grep -A 35 "Node 2" /proc/zoneinfo
   Node 2, zone   Normal
   &lt;snip&gt;
      node_scanned  0
           spanned  8388608
           present  7864320
           managed  7864320
   # echo online_movable &gt; memory4097/state
   bash: echo: write error: Invalid argument
   # grep -A 35 "Node 2" /proc/zoneinfo
   Node 2, zone   Normal
   &lt;snip&gt;
      node_scanned  0
           spanned  8388608
           present  7864320
           managed  7864320

   online_movable operation failed because of failure of changing
   the memory zone from ZONE_NORMAL to ZONE_MOVABLE

Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online")
Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
Signed-off-by: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Reviewed-by: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
online_{kernel|movable} is used to change the memory zone to
ZONE_{NORMAL|MOVABLE} and online the memory.

To check that memory zone can be changed, zone_can_shift() is used.
Currently the function returns minus integer value, plus integer
value and 0. When the function returns minus or plus integer value,
it means that the memory zone can be changed to ZONE_{NORNAL|MOVABLE}.

But when the function returns 0, there are two meanings.

One of the meanings is that the memory zone does not need to be changed.
For example, when memory is in ZONE_NORMAL and onlined by online_kernel
the memory zone does not need to be changed.

Another meaning is that the memory zone cannot be changed. When memory
is in ZONE_NORMAL and onlined by online_movable, the memory zone may
not be changed to ZONE_MOVALBE due to memory online limitation(see
Documentation/memory-hotplug.txt). In this case, memory must not be
onlined.

The patch changes the return type of zone_can_shift() so that memory
online operation fails when memory zone cannot be changed as follows:

Before applying patch:
   # grep -A 35 "Node 2" /proc/zoneinfo
   Node 2, zone   Normal
   &lt;snip&gt;
      node_scanned  0
           spanned  8388608
           present  7864320
           managed  7864320
   # echo online_movable &gt; memory4097/state
   # grep -A 35 "Node 2" /proc/zoneinfo
   Node 2, zone   Normal
   &lt;snip&gt;
      node_scanned  0
           spanned  8388608
           present  8388608
           managed  8388608

   online_movable operation succeeded. But memory is onlined as
   ZONE_NORMAL, not ZONE_MOVABLE.

After applying patch:
   # grep -A 35 "Node 2" /proc/zoneinfo
   Node 2, zone   Normal
   &lt;snip&gt;
      node_scanned  0
           spanned  8388608
           present  7864320
           managed  7864320
   # echo online_movable &gt; memory4097/state
   bash: echo: write error: Invalid argument
   # grep -A 35 "Node 2" /proc/zoneinfo
   Node 2, zone   Normal
   &lt;snip&gt;
      node_scanned  0
           spanned  8388608
           present  7864320
           managed  7864320

   online_movable operation failed because of failure of changing
   the memory zone from ZONE_NORMAL to ZONE_MOVABLE

Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online")
Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
Signed-off-by: Yasuaki Ishimatsu &lt;isimatu.yasuaki@jp.fujitsu.com&gt;
Reviewed-by: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: remove x86-only restriction of movable_node</title>
<updated>2016-12-13T02:55:07+00:00</updated>
<author>
<name>Reza Arbab</name>
<email>arbab@linux.vnet.ibm.com</email>
</author>
<published>2016-12-13T00:42:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=39fa104d5b87655c1c19d4b1990ea63d190c4817'/>
<id>39fa104d5b87655c1c19d4b1990ea63d190c4817</id>
<content type='text'>
In commit c5320926e370 ("mem-hotplug: introduce movable_node boot
option"), the memblock allocation direction is changed to bottom-up and
then back to top-down like this:

1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
2. memblock_set_bottom_up(false), called by x86's numa_init().

Even though (1) occurs in generic mm code, it is wrapped by #ifdef
CONFIG_MOVABLE_NODE, which depends on X86_64.

This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
things will be unbalanced.  (1) will happen for them, but (2) will not.

This toggle was added in the first place because x86 has a delay between
adding memblocks and marking them as hotpluggable.  Since other arches
do this marking either immediately or not at all, they do not require
the bottom-up toggle.

So, resolve things by moving (1) from cmdline_parse_movable_node() to
x86's setup_arch(), immediately after the movable_node parameter has
been parsed.

Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Acked-by: Balbir Singh &lt;bsingharora@gmail.com&gt;
Cc: "Aneesh Kumar K.V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Alistair Popple &lt;apopple@au1.ibm.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Bharata B Rao &lt;bharata@linux.vnet.ibm.com&gt;
Cc: Frank Rowand &lt;frowand.list@gmail.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nathan Fontenot &lt;nfont@linux.vnet.ibm.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Rob Herring &lt;robh+dt@kernel.org&gt;
Cc: Stewart Smith &lt;stewart@linux.vnet.ibm.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In commit c5320926e370 ("mem-hotplug: introduce movable_node boot
option"), the memblock allocation direction is changed to bottom-up and
then back to top-down like this:

1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
2. memblock_set_bottom_up(false), called by x86's numa_init().

Even though (1) occurs in generic mm code, it is wrapped by #ifdef
CONFIG_MOVABLE_NODE, which depends on X86_64.

This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
things will be unbalanced.  (1) will happen for them, but (2) will not.

This toggle was added in the first place because x86 has a delay between
adding memblocks and marking them as hotpluggable.  Since other arches
do this marking either immediately or not at all, they do not require
the bottom-up toggle.

So, resolve things by moving (1) from cmdline_parse_movable_node() to
x86's setup_arch(), immediately after the movable_node parameter has
been parsed.

Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Acked-by: Balbir Singh &lt;bsingharora@gmail.com&gt;
Cc: "Aneesh Kumar K.V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: Alistair Popple &lt;apopple@au1.ibm.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Bharata B Rao &lt;bharata@linux.vnet.ibm.com&gt;
Cc: Frank Rowand &lt;frowand.list@gmail.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nathan Fontenot &lt;nfont@linux.vnet.ibm.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Rob Herring &lt;robh+dt@kernel.org&gt;
Cc: Stewart Smith &lt;stewart@linux.vnet.ibm.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: remove unused variable in memory hotplug</title>
<updated>2016-10-27T22:49:12+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-10-27T22:49:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9db4f36e82c2394c958d8e42a498fb664684bc22'/>
<id>9db4f36e82c2394c958d8e42a498fb664684bc22</id>
<content type='text'>
When I removed the per-zone bitlock hashed waitqueues in commit
9dcb8b685fc3 ("mm: remove per-zone hashtable of bitlock waitqueues"), I
removed all the magic hotplug memory initialization of said waitqueues
too.

But when I actually _tested_ the resulting build, I stupidly assumed
that "allmodconfig" would enable memory hotplug.  And it doesn't,
because it enables KASAN instead, which then disables hotplug memory
support.

As a result, my build test of the per-zone waitqueues was totally
broken, and I didn't notice that the compiler warns about the now unused
iterator variable 'i'.

I guess I should be happy that that seems to be the worst breakage from
my clearly horribly failed test coverage.

Reported-by: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When I removed the per-zone bitlock hashed waitqueues in commit
9dcb8b685fc3 ("mm: remove per-zone hashtable of bitlock waitqueues"), I
removed all the magic hotplug memory initialization of said waitqueues
too.

But when I actually _tested_ the resulting build, I stupidly assumed
that "allmodconfig" would enable memory hotplug.  And it doesn't,
because it enables KASAN instead, which then disables hotplug memory
support.

As a result, my build test of the per-zone waitqueues was totally
broken, and I didn't notice that the compiler warns about the now unused
iterator variable 'i'.

I guess I should be happy that that seems to be the worst breakage from
my clearly horribly failed test coverage.

Reported-by: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: remove per-zone hashtable of bitlock waitqueues</title>
<updated>2016-10-27T16:27:57+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-10-26T17:15:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9dcb8b685fc30813b35ab4b4bf39244430753190'/>
<id>9dcb8b685fc30813b35ab4b4bf39244430753190</id>
<content type='text'>
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:

     wait_on_bit(&amp;gh-&gt;gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().

The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).

It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.

As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.

Peter Zijlstra already has a patch for that, but let's see if anybody
even notices.  In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.

Reported-by: Andreas Gruenbacher &lt;agruenba@redhat.com&gt;
Tested-by: Bob Peterson &lt;rpeterso@redhat.com&gt;
Acked-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:

     wait_on_bit(&amp;gh-&gt;gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().

The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).

It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.

As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.

Peter Zijlstra already has a patch for that, but let's see if anybody
even notices.  In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.

Reported-by: Andreas Gruenbacher &lt;agruenba@redhat.com&gt;
Tested-by: Bob Peterson &lt;rpeterso@redhat.com&gt;
Acked-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/hugetlb: check for reserved hugepages during memory offline</title>
<updated>2016-10-08T01:46:29+00:00</updated>
<author>
<name>Gerald Schaefer</name>
<email>gerald.schaefer@de.ibm.com</email>
</author>
<published>2016-10-08T00:01:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=082d5b6b60e9f25e1511557fcfcb21eedd267446'/>
<id>082d5b6b60e9f25e1511557fcfcb21eedd267446</id>
<content type='text'>
In dissolve_free_huge_pages(), free hugepages will be dissolved without
making sure that there are enough of them left to satisfy hugepage
reservations.

Fix this by adding a return value to dissolve_free_huge_pages() and
checking h-&gt;free_huge_pages vs.  h-&gt;resv_huge_pages.  Note that this may
lead to the situation where dissolve_free_huge_page() returns an error
and all free hugepages that were dissolved before that error are lost,
while the memory block still cannot be set offline.

Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer &lt;gerald.schaefer@de.ibm.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: "Kirill A . Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: "Aneesh Kumar K . V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Martin Schwidefsky &lt;schwidefsky@de.ibm.com&gt;
Cc: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Rui Teng &lt;rui.teng@linux.vnet.ibm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In dissolve_free_huge_pages(), free hugepages will be dissolved without
making sure that there are enough of them left to satisfy hugepage
reservations.

Fix this by adding a return value to dissolve_free_huge_pages() and
checking h-&gt;free_huge_pages vs.  h-&gt;resv_huge_pages.  Note that this may
lead to the situation where dissolve_free_huge_page() returns an error
and all free hugepages that were dissolved before that error are lost,
while the memory block still cannot be set offline.

Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer &lt;gerald.schaefer@de.ibm.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: "Kirill A . Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: "Aneesh Kumar K . V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Martin Schwidefsky &lt;schwidefsky@de.ibm.com&gt;
Cc: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Rui Teng &lt;rui.teng@linux.vnet.ibm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mem-hotplug: use nodes that contain memory as mask in new_node_page()</title>
<updated>2016-09-28T23:19:02+00:00</updated>
<author>
<name>Li Zhong</name>
<email>zhong@linux.vnet.ibm.com</email>
</author>
<published>2016-09-28T22:22:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=231e97e2b8ec9a1556ced5d8a89cda03a480b179'/>
<id>231e97e2b8ec9a1556ced5d8a89cda03a480b179</id>
<content type='text'>
9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
prevents allocating from an empty nodemask, but as David points out, it is
still wrong.  As node_online_map may include memoryless nodes, only
allocating from these nodes is meaningless.

This patch uses node_states[N_MEMORY] mask to prevent the above case.

Fixes: 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420
Signed-off-by: Li Zhong &lt;zhong@linux.vnet.ibm.com&gt;
Suggested-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: John Allen &lt;jallen@linux.vnet.ibm.com&gt;
Cc: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
prevents allocating from an empty nodemask, but as David points out, it is
still wrong.  As node_online_map may include memoryless nodes, only
allocating from these nodes is meaningless.

This patch uses node_states[N_MEMORY] mask to prevent the above case.

Fixes: 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420
Signed-off-by: Li Zhong &lt;zhong@linux.vnet.ibm.com&gt;
Suggested-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: John Allen &lt;jallen@linux.vnet.ibm.com&gt;
Cc: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@i-love.sakura.ne.jp&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mem-hotplug: don't clear the only node in new_node_page()</title>
<updated>2016-09-19T22:36:16+00:00</updated>
<author>
<name>Li Zhong</name>
<email>zhong@linux.vnet.ibm.com</email>
</author>
<published>2016-09-19T21:43:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9bb627be47a574b764e162e8513d5db78d49e7f5'/>
<id>9bb627be47a574b764e162e8513d5db78d49e7f5</id>
<content type='text'>
Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") introduced new_node_page() for memory
hotplug.

In new_node_page(), the nid is cleared before calling
__alloc_pages_nodemask().  But if it is the only node of the system, and
the first round allocation fails, it will not be able to get memory from
an empty nodemask, and will trigger oom.

The patch checks whether it is the last node on the system, and if it
is, then don't clear the nid in the nodemask.

Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
Signed-off-by: Li Zhong &lt;zhong@linux.vnet.ibm.com&gt;
Reported-by: John Allen &lt;jallen@linux.vnet.ibm.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") introduced new_node_page() for memory
hotplug.

In new_node_page(), the nid is cleared before calling
__alloc_pages_nodemask().  But if it is the only node of the system, and
the first round allocation fails, it will not be able to get memory from
an empty nodemask, and will trigger oom.

The patch checks whether it is the last node on the system, and if it
is, then don't clear the nid in the nodemask.

Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
Signed-off-by: Li Zhong &lt;zhong@linux.vnet.ibm.com&gt;
Reported-by: John Allen &lt;jallen@linux.vnet.ibm.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Xishi Qiu &lt;qiuxishi@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats</title>
<updated>2016-08-11T23:58:14+00:00</updated>
<author>
<name>Reza Arbab</name>
<email>arbab@linux.vnet.ibm.com</email>
</author>
<published>2016-08-11T22:33:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5830169f47269f78f6624bd70165eb571270da82'/>
<id>5830169f47269f78f6624bd70165eb571270da82</id>
<content type='text'>
The following oops occurs after a pgdat is hotadded:

  Unable to handle kernel paging request for data at address 0x00c30001
  Faulting instruction address: 0xc00000000022f8f4
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
  CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W 4.8.0-rc1-device #110
  task: c000000000ef3080 task.stack: c000000000f6c000
  NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
  REGS: c000000000f6fa50 TRAP: 0300   Tainted: G        W (4.8.0-rc1-device)
  MSR: 800000010280b033 &lt;SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]&gt;  CR: 84002028  XER: 20000000
  CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
  NIP refresh_cpu_vm_stats+0x1a4/0x2f0
  LR refresh_cpu_vm_stats+0x1f8/0x2f0
  Call Trace:
    refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)

Add per_cpu_nodestats initialization to the hotplug codepath.

Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Paul Mackerras &lt;paulus@ozlabs.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The following oops occurs after a pgdat is hotadded:

  Unable to handle kernel paging request for data at address 0x00c30001
  Faulting instruction address: 0xc00000000022f8f4
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
  CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W 4.8.0-rc1-device #110
  task: c000000000ef3080 task.stack: c000000000f6c000
  NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
  REGS: c000000000f6fa50 TRAP: 0300   Tainted: G        W (4.8.0-rc1-device)
  MSR: 800000010280b033 &lt;SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]&gt;  CR: 84002028  XER: 20000000
  CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
  NIP refresh_cpu_vm_stats+0x1a4/0x2f0
  LR refresh_cpu_vm_stats+0x1f8/0x2f0
  Call Trace:
    refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)

Add per_cpu_nodestats initialization to the hotplug codepath.

Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab &lt;arbab@linux.vnet.ibm.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Paul Mackerras &lt;paulus@ozlabs.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
