<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/mm, branch v4.16-rc6</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>Revert "mm/page_alloc: fix memmap_init_zone pageblock alignment"</title>
<updated>2018-03-14T23:33:28+00:00</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ard.biesheuvel@linaro.org</email>
</author>
<published>2018-03-14T19:29:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3e04040df6d4613a8af5a80882d5f7f298f49810'/>
<id>3e04040df6d4613a8af5a80882d5f7f298f49810</id>
<content type='text'>
This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.

Commit 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock
alignment") modified the logic in memmap_init_zone() to initialize
struct pages associated with invalid PFNs, to appease a VM_BUG_ON()
in move_freepages(), which is redundant by its own admission, and
dereferences struct page fields to obtain the zone without checking
whether the struct pages in question are valid to begin with.

Commit 864b75f9d6b0 only makes it worse, since the rounding it does
may cause pfn assume the same value it had in a prior iteration of
the loop, resulting in an infinite loop and a hang very early in the
boot. Also, since it doesn't perform the same rounding on start_pfn
itself but only on intermediate values following an invalid PFN, we
may still hit the same VM_BUG_ON() as before.

So instead, let's fix this at the core, and ensure that the BUG
check doesn't dereference struct page fields of invalid pages.

Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
Tested-by: Jan Glauber &lt;jglauber@cavium.com&gt;
Tested-by: Shanker Donthineni &lt;shankerd@codeaurora.org&gt;
Cc: Daniel Vacek &lt;neelx@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@oracle.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Ard Biesheuvel &lt;ard.biesheuvel@linaro.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.

Commit 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock
alignment") modified the logic in memmap_init_zone() to initialize
struct pages associated with invalid PFNs, to appease a VM_BUG_ON()
in move_freepages(), which is redundant by its own admission, and
dereferences struct page fields to obtain the zone without checking
whether the struct pages in question are valid to begin with.

Commit 864b75f9d6b0 only makes it worse, since the rounding it does
may cause pfn assume the same value it had in a prior iteration of
the loop, resulting in an infinite loop and a hang very early in the
boot. Also, since it doesn't perform the same rounding on start_pfn
itself but only on intermediate values following an invalid PFN, we
may still hit the same VM_BUG_ON() as before.

So instead, let's fix this at the core, and ensure that the BUG
check doesn't dereference struct page fields of invalid pages.

Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
Tested-by: Jan Glauber &lt;jglauber@cavium.com&gt;
Tested-by: Shanker Donthineni &lt;shankerd@codeaurora.org&gt;
Cc: Daniel Vacek &lt;neelx@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@oracle.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Ard Biesheuvel &lt;ard.biesheuvel@linaro.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/page_alloc: fix memmap_init_zone pageblock alignment</title>
<updated>2018-03-10T00:40:01+00:00</updated>
<author>
<name>Daniel Vacek</name>
<email>neelx@redhat.com</email>
</author>
<published>2018-03-09T23:51:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae'/>
<id>864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae</id>
<content type='text'>
Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") introduced a bug where move_freepages() triggers a
VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
To fix this, simply align the skipped pfns in memmap_init_zone() the
same way as in move_freepages_block().

Seen in one of the RHEL reports:

  crash&gt; log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
  kernel BUG at mm/page_alloc.c:1389!
  invalid opcode: 0000 [#1] SMP
  --
  RIP: 0010:[&lt;ffffffff8118833e&gt;]  [&lt;ffffffff8118833e&gt;] move_freepages+0x15e/0x160
  RSP: 0018:ffff88054d727688  EFLAGS: 00010087
  --
  Call Trace:
   [&lt;ffffffff811883b3&gt;] move_freepages_block+0x73/0x80
   [&lt;ffffffff81189e63&gt;] __rmqueue+0x263/0x460
   [&lt;ffffffff8118c781&gt;] get_page_from_freelist+0x7e1/0x9e0
   [&lt;ffffffff8118caf6&gt;] __alloc_pages_nodemask+0x176/0x420
  --
  RIP  [&lt;ffffffff8118833e&gt;] move_freepages+0x15e/0x160
   RSP &lt;ffff88054d727688&gt;

  crash&gt; page_init_bug -v | grep RAM
  &lt;struct resource 0xffff88067fffd2f8&gt;          1000 -        9bfff	System RAM (620.00 KiB)
  &lt;struct resource 0xffff88067fffd3a0&gt;        100000 -     430bffff	System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
  &lt;struct resource 0xffff88067fffd410&gt;      4b0c8000 -     4bf9cfff	System RAM ( 14.83 MiB = 15188.00 KiB)
  &lt;struct resource 0xffff88067fffd480&gt;      4bfac000 -     646b1fff	System RAM (391.02 MiB = 400408.00 KiB)
  &lt;struct resource 0xffff88067fffd560&gt;      7b788000 -     7b7fffff	System RAM (480.00 KiB)
  &lt;struct resource 0xffff88067fffd640&gt;     100000000 -    67fffffff	System RAM ( 22.00 GiB)

  crash&gt; page_init_bug | head -6
  &lt;struct resource 0xffff88067fffd560&gt;      7b788000 -     7b7fffff	System RAM (480.00 KiB)
  &lt;struct page 0xffffea0001ede200&gt;   1fffff00000000  0 &lt;struct pglist_data 0xffff88047ffd9000&gt; 1 &lt;struct zone 0xffff88047ffd9800&gt; DMA32          4096    1048575
  &lt;struct page 0xffffea0001ede200&gt; 505736 505344 &lt;struct page 0xffffea0001ed8000&gt; 505855 &lt;struct page 0xffffea0001edffc0&gt;
  &lt;struct page 0xffffea0001ed8000&gt;                0  0 &lt;struct pglist_data 0xffff88047ffd9000&gt; 0 &lt;struct zone 0xffff88047ffd9000&gt; DMA               1       4095
  &lt;struct page 0xffffea0001edffc0&gt;   1fffff00000400  0 &lt;struct pglist_data 0xffff88047ffd9000&gt; 1 &lt;struct zone 0xffff88047ffd9800&gt; DMA32          4096    1048575
  BUG, zones differ!

Note that this range follows two not populated sections
68000000-77ffffff in this zone.  7b788000-7b7fffff is the first one
after a gap.  This makes memmap_init_zone() skip all the pfns up to the
beginning of this range.  But this range is not pageblock (2M) aligned.
In fact no range has to be.

  crash&gt; kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  ffffea0001e00000  78000000                0        0  0 0
  ffffea0001ed7fc0  7b5ff000                0        0  0 0
  ffffea0001ed8000  7b600000                0        0  0 0	&lt;&lt;&lt;&lt;
  ffffea0001ede1c0  7b787000                0        0  0 0
  ffffea0001ede200  7b788000                0        0  1 1fffff00000000

Top part of page flags should contain nodeid and zonenr, which is not
the case for page ffffea0001ed8000 here (&lt;&lt;&lt;&lt;).

  crash&gt; log | grep -o fffea0001ed[^\ ]* | sort -u
  fffea0001ed8000
  fffea0001eded20
  fffea0001edffc0

  crash&gt; bt -r | grep -o fffea0001ed[^\ ]* | sort -u
  fffea0001ed8000
  fffea0001eded00
  fffea0001eded20
  fffea0001edffc0

Initialization of the whole beginning of the section is skipped up to
the start of the range due to the commit b92df1de5d28.  Now any code
calling move_freepages_block() (like reusing the page from a freelist as
in this example) with a page from the beginning of the range will get
the page rounded down to start_page ffffea0001ed8000 and passed to
move_freepages() which crashes on assertion getting wrong zonenr.

  &gt;         VM_BUG_ON(page_zone(start_page) != page_zone(end_page));

Note, page_zone() derives the zone from page flags here.

From similar machine before commit b92df1de5d28:

  crash&gt; kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  fffff73941e00000  78000000                0        0  1 1fffff00000000
  fffff73941ed7fc0  7b5ff000                0        0  1 1fffff00000000
  fffff73941ed8000  7b600000                0        0  1 1fffff00000000
  fffff73941edff80  7b7fe000                0        0  1 1fffff00000000
  fffff73941edffc0  7b7ff000 ffff8e67e04d3ae0     ad84  1 1fffff00020068 uptodate,lru,active,mappedtodisk

All the pages since the beginning of the section are initialized.
move_freepages()' not gonna blow up.

The same machine with this fix applied:

  crash&gt; kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  ffffea0001e00000  78000000                0        0  0 0
  ffffea0001e00000  7b5ff000                0        0  0 0
  ffffea0001ed8000  7b600000                0        0  1 1fffff00000000
  ffffea0001edff80  7b7fe000                0        0  1 1fffff00000000
  ffffea0001edffc0  7b7ff000 ffff88017fb13720        8  2 1fffff00020068 uptodate,lru,active,mappedtodisk

At least the bare minimum of pages is initialized preventing the crash
as well.

Customers started to report this as soon as 7.4 (where b92df1de5d28 was
merged in RHEL) was released.  I remember reports from
September/October-ish times.  It's not easily reproduced and happens on
a handful of machines only.  I guess that's why.  But that does not make
it less serious, I think.

Though there actually is a report here:
  https://bugzilla.kernel.org/show_bug.cgi?id=196443

And there are reports for Fedora from July:
  https://bugzilla.redhat.com/show_bug.cgi?id=1473242
and CentOS:
  https://bugs.centos.org/view.php?id=13964
and we internally track several dozens reports for RHEL bug
  https://bugzilla.redhat.com/show_bug.cgi?id=1525121

Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
Signed-off-by: Daniel Vacek &lt;neelx@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@oracle.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") introduced a bug where move_freepages() triggers a
VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
To fix this, simply align the skipped pfns in memmap_init_zone() the
same way as in move_freepages_block().

Seen in one of the RHEL reports:

  crash&gt; log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
  kernel BUG at mm/page_alloc.c:1389!
  invalid opcode: 0000 [#1] SMP
  --
  RIP: 0010:[&lt;ffffffff8118833e&gt;]  [&lt;ffffffff8118833e&gt;] move_freepages+0x15e/0x160
  RSP: 0018:ffff88054d727688  EFLAGS: 00010087
  --
  Call Trace:
   [&lt;ffffffff811883b3&gt;] move_freepages_block+0x73/0x80
   [&lt;ffffffff81189e63&gt;] __rmqueue+0x263/0x460
   [&lt;ffffffff8118c781&gt;] get_page_from_freelist+0x7e1/0x9e0
   [&lt;ffffffff8118caf6&gt;] __alloc_pages_nodemask+0x176/0x420
  --
  RIP  [&lt;ffffffff8118833e&gt;] move_freepages+0x15e/0x160
   RSP &lt;ffff88054d727688&gt;

  crash&gt; page_init_bug -v | grep RAM
  &lt;struct resource 0xffff88067fffd2f8&gt;          1000 -        9bfff	System RAM (620.00 KiB)
  &lt;struct resource 0xffff88067fffd3a0&gt;        100000 -     430bffff	System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
  &lt;struct resource 0xffff88067fffd410&gt;      4b0c8000 -     4bf9cfff	System RAM ( 14.83 MiB = 15188.00 KiB)
  &lt;struct resource 0xffff88067fffd480&gt;      4bfac000 -     646b1fff	System RAM (391.02 MiB = 400408.00 KiB)
  &lt;struct resource 0xffff88067fffd560&gt;      7b788000 -     7b7fffff	System RAM (480.00 KiB)
  &lt;struct resource 0xffff88067fffd640&gt;     100000000 -    67fffffff	System RAM ( 22.00 GiB)

  crash&gt; page_init_bug | head -6
  &lt;struct resource 0xffff88067fffd560&gt;      7b788000 -     7b7fffff	System RAM (480.00 KiB)
  &lt;struct page 0xffffea0001ede200&gt;   1fffff00000000  0 &lt;struct pglist_data 0xffff88047ffd9000&gt; 1 &lt;struct zone 0xffff88047ffd9800&gt; DMA32          4096    1048575
  &lt;struct page 0xffffea0001ede200&gt; 505736 505344 &lt;struct page 0xffffea0001ed8000&gt; 505855 &lt;struct page 0xffffea0001edffc0&gt;
  &lt;struct page 0xffffea0001ed8000&gt;                0  0 &lt;struct pglist_data 0xffff88047ffd9000&gt; 0 &lt;struct zone 0xffff88047ffd9000&gt; DMA               1       4095
  &lt;struct page 0xffffea0001edffc0&gt;   1fffff00000400  0 &lt;struct pglist_data 0xffff88047ffd9000&gt; 1 &lt;struct zone 0xffff88047ffd9800&gt; DMA32          4096    1048575
  BUG, zones differ!

Note that this range follows two not populated sections
68000000-77ffffff in this zone.  7b788000-7b7fffff is the first one
after a gap.  This makes memmap_init_zone() skip all the pfns up to the
beginning of this range.  But this range is not pageblock (2M) aligned.
In fact no range has to be.

  crash&gt; kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  ffffea0001e00000  78000000                0        0  0 0
  ffffea0001ed7fc0  7b5ff000                0        0  0 0
  ffffea0001ed8000  7b600000                0        0  0 0	&lt;&lt;&lt;&lt;
  ffffea0001ede1c0  7b787000                0        0  0 0
  ffffea0001ede200  7b788000                0        0  1 1fffff00000000

Top part of page flags should contain nodeid and zonenr, which is not
the case for page ffffea0001ed8000 here (&lt;&lt;&lt;&lt;).

  crash&gt; log | grep -o fffea0001ed[^\ ]* | sort -u
  fffea0001ed8000
  fffea0001eded20
  fffea0001edffc0

  crash&gt; bt -r | grep -o fffea0001ed[^\ ]* | sort -u
  fffea0001ed8000
  fffea0001eded00
  fffea0001eded20
  fffea0001edffc0

Initialization of the whole beginning of the section is skipped up to
the start of the range due to the commit b92df1de5d28.  Now any code
calling move_freepages_block() (like reusing the page from a freelist as
in this example) with a page from the beginning of the range will get
the page rounded down to start_page ffffea0001ed8000 and passed to
move_freepages() which crashes on assertion getting wrong zonenr.

  &gt;         VM_BUG_ON(page_zone(start_page) != page_zone(end_page));

Note, page_zone() derives the zone from page flags here.

From similar machine before commit b92df1de5d28:

  crash&gt; kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  fffff73941e00000  78000000                0        0  1 1fffff00000000
  fffff73941ed7fc0  7b5ff000                0        0  1 1fffff00000000
  fffff73941ed8000  7b600000                0        0  1 1fffff00000000
  fffff73941edff80  7b7fe000                0        0  1 1fffff00000000
  fffff73941edffc0  7b7ff000 ffff8e67e04d3ae0     ad84  1 1fffff00020068 uptodate,lru,active,mappedtodisk

All the pages since the beginning of the section are initialized.
move_freepages()' not gonna blow up.

The same machine with this fix applied:

  crash&gt; kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
  ffffea0001e00000  78000000                0        0  0 0
  ffffea0001e00000  7b5ff000                0        0  0 0
  ffffea0001ed8000  7b600000                0        0  1 1fffff00000000
  ffffea0001edff80  7b7fe000                0        0  1 1fffff00000000
  ffffea0001edffc0  7b7ff000 ffff88017fb13720        8  2 1fffff00020068 uptodate,lru,active,mappedtodisk

At least the bare minimum of pages is initialized preventing the crash
as well.

Customers started to report this as soon as 7.4 (where b92df1de5d28 was
merged in RHEL) was released.  I remember reports from
September/October-ish times.  It's not easily reproduced and happens on
a handful of machines only.  I guess that's why.  But that does not make
it less serious, I think.

Though there actually is a report here:
  https://bugzilla.kernel.org/show_bug.cgi?id=196443

And there are reports for Fedora from July:
  https://bugzilla.redhat.com/show_bug.cgi?id=1473242
and CentOS:
  https://bugs.centos.org/view.php?id=13964
and we internally track several dozens reports for RHEL bug
  https://bugzilla.redhat.com/show_bug.cgi?id=1525121

Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
Signed-off-by: Daniel Vacek &lt;neelx@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@oracle.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memblock.c: hardcode the end_pfn being -1</title>
<updated>2018-03-10T00:40:01+00:00</updated>
<author>
<name>Daniel Vacek</name>
<email>neelx@redhat.com</email>
</author>
<published>2018-03-09T23:51:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=379b03b7fa05f7db521b7732a52692448a3c34fe'/>
<id>379b03b7fa05f7db521b7732a52692448a3c34fe</id>
<content type='text'>
This is just a cleanup.  It aids handling the special end case in the
next commit.

[akpm@linux-foundation.org: make it work against current -linus, not against -mm]
[akpm@linux-foundation.org: make it work against current -linus, not against -mm some more]
Link: http://lkml.kernel.org/r/1ca478d4269125a99bcfb1ca04d7b88ac1aee924.1520011944.git.neelx@redhat.com
Signed-off-by: Daniel Vacek &lt;neelx@redhat.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@oracle.com&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This is just a cleanup.  It aids handling the special end case in the
next commit.

[akpm@linux-foundation.org: make it work against current -linus, not against -mm]
[akpm@linux-foundation.org: make it work against current -linus, not against -mm some more]
Link: http://lkml.kernel.org/r/1ca478d4269125a99bcfb1ca04d7b88ac1aee924.1520011944.git.neelx@redhat.com
Signed-off-by: Daniel Vacek &lt;neelx@redhat.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@oracle.com&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT</title>
<updated>2018-03-10T00:40:01+00:00</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2018-03-09T23:51:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=96312e61282ae3f6537a562625706498cbc75594'/>
<id>96312e61282ae3f6537a562625706498cbc75594</id>
<content type='text'>
KVM is hanging during postcopy live migration with userfaultfd because
get_user_pages_unlocked is not capable to handle FOLL_NOWAIT.

Earlier FOLL_NOWAIT was only ever passed to get_user_pages.

Specifically faultin_page (the callee of get_user_pages_unlocked caller)
doesn't know that if FAULT_FLAG_RETRY_NOWAIT was set in the page fault
flags, when VM_FAULT_RETRY is returned, the mmap_sem wasn't actually
released (even if nonblocking is not NULL).  So it sets *nonblocking to
zero and the caller won't release the mmap_sem thinking it was already
released, but it wasn't because of FOLL_NOWAIT.

Link: http://lkml.kernel.org/r/20180302174343.5421-2-aarcange@redhat.com
Fixes: ce53053ce378c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Reported-by: Dr. David Alan Gilbert &lt;dgilbert@redhat.com&gt;
Tested-by: Dr. David Alan Gilbert &lt;dgilbert@redhat.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
KVM is hanging during postcopy live migration with userfaultfd because
get_user_pages_unlocked is not capable to handle FOLL_NOWAIT.

Earlier FOLL_NOWAIT was only ever passed to get_user_pages.

Specifically faultin_page (the callee of get_user_pages_unlocked caller)
doesn't know that if FAULT_FLAG_RETRY_NOWAIT was set in the page fault
flags, when VM_FAULT_RETRY is returned, the mmap_sem wasn't actually
released (even if nonblocking is not NULL).  So it sets *nonblocking to
zero and the caller won't release the mmap_sem thinking it was already
released, but it wasn't because of FOLL_NOWAIT.

Link: http://lkml.kernel.org/r/20180302174343.5421-2-aarcange@redhat.com
Fixes: ce53053ce378c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Reported-by: Dr. David Alan Gilbert &lt;dgilbert@redhat.com&gt;
Tested-by: Dr. David Alan Gilbert &lt;dgilbert@redhat.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>hugetlb: fix surplus pages accounting</title>
<updated>2018-03-10T00:40:01+00:00</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2018-03-09T23:50:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4704dea36dd9e5b4bf37ed20f7f15e70632ccdd0'/>
<id>4704dea36dd9e5b4bf37ed20f7f15e70632ccdd0</id>
<content type='text'>
Dan Rue has noticed that libhugetlbfs test suite fails counter test:

  # mount_point="/mnt/hugetlb/"
  # echo 200 &gt; /proc/sys/vm/nr_hugepages
  # mkdir -p "${mount_point}"
  # mount -t hugetlbfs hugetlbfs "${mount_point}"
  # export LD_LIBRARY_PATH=/root/libhugetlbfs/libhugetlbfs-2.20/obj64
  # /root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters
  Starting testcase "/root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters", pid 3319
  Base pool size: 0
  Clean...
  FAIL    Line 326: Bad HugePages_Total: expected 0, actual 1

The bug was bisected to 0c397daea1d4 ("mm, hugetlb: further simplify
hugetlb allocation API").

The reason is that alloc_surplus_huge_page() misaccounts per node
surplus pages.  We should increase surplus_huge_pages_node rather than
nr_huge_pages_node which is already handled by alloc_fresh_huge_page.

Link: http://lkml.kernel.org/r/20180221191439.GM2231@dhcp22.suse.cz
Fixes: 0c397daea1d4 ("mm, hugetlb: further simplify hugetlb allocation API")
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reported-by: Dan Rue &lt;dan.rue@linaro.org&gt;
Tested-by: Dan Rue &lt;dan.rue@linaro.org&gt;
Reviewed-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Dan Rue has noticed that libhugetlbfs test suite fails counter test:

  # mount_point="/mnt/hugetlb/"
  # echo 200 &gt; /proc/sys/vm/nr_hugepages
  # mkdir -p "${mount_point}"
  # mount -t hugetlbfs hugetlbfs "${mount_point}"
  # export LD_LIBRARY_PATH=/root/libhugetlbfs/libhugetlbfs-2.20/obj64
  # /root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters
  Starting testcase "/root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters", pid 3319
  Base pool size: 0
  Clean...
  FAIL    Line 326: Bad HugePages_Total: expected 0, actual 1

The bug was bisected to 0c397daea1d4 ("mm, hugetlb: further simplify
hugetlb allocation API").

The reason is that alloc_surplus_huge_page() misaccounts per node
surplus pages.  We should increase surplus_huge_pages_node rather than
nr_huge_pages_node which is already handled by alloc_fresh_huge_page.

Link: http://lkml.kernel.org/r/20180221191439.GM2231@dhcp22.suse.cz
Fixes: 0c397daea1d4 ("mm, hugetlb: further simplify hugetlb allocation API")
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reported-by: Dan Rue &lt;dan.rue@linaro.org&gt;
Tested-by: Dan Rue &lt;dan.rue@linaro.org&gt;
Reviewed-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: don't defer struct page initialization for Xen pv guests</title>
<updated>2018-02-21T23:35:43+00:00</updated>
<author>
<name>Juergen Gross</name>
<email>jgross@suse.com</email>
</author>
<published>2018-02-21T22:46:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=895f7b8e90200cf1a5dc313329369adf30e51f9a'/>
<id>895f7b8e90200cf1a5dc313329369adf30e51f9a</id>
<content type='text'>
Commit f7f99100d8d9 ("mm: stop zeroing memory during allocation in
vmemmap") broke Xen pv domains in some configurations, as the "Pinned"
information in struct page of early page tables could get lost.

This will lead to the kernel trying to write directly into the page
tables instead of asking the hypervisor to do so.  The result is a crash
like the following:

  BUG: unable to handle kernel paging request at ffff8801ead19008
  IP: xen_set_pud+0x4e/0xd0
  PGD 1c0a067 P4D 1c0a067 PUD 23a0067 PMD 1e9de0067 PTE 80100001ead19065
  Oops: 0003 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-default+ #271
  Hardware name: Dell Inc. Latitude E6440/0159N7, BIOS A07 06/26/2014
  task: ffffffff81c10480 task.stack: ffffffff81c00000
  RIP: e030:xen_set_pud+0x4e/0xd0
  Call Trace:
   __pmd_alloc+0x128/0x140
   ioremap_page_range+0x3f4/0x410
   __ioremap_caller+0x1c3/0x2e0
   acpi_os_map_iomem+0x175/0x1b0
   acpi_tb_acquire_table+0x39/0x66
   acpi_tb_validate_table+0x44/0x7c
   acpi_tb_verify_temp_table+0x45/0x304
   acpi_reallocate_root_table+0x12d/0x141
   acpi_early_init+0x4d/0x10a
   start_kernel+0x3eb/0x4a1
   xen_start_kernel+0x528/0x532
  Code: 48 01 e8 48 0f 42 15 a2 fd be 00 48 01 d0 48 ba 00 00 00 00 00 ea ff ff 48 c1 e8 0c 48 c1 e0 06 48 01 d0 48 8b 00 f6 c4 02 75 5d &lt;4c&gt; 89 65 00 5b 5d 41 5c c3 65 8b 05 52 9f fe 7e 89 c0 48 0f a3
  RIP: xen_set_pud+0x4e/0xd0 RSP: ffffffff81c03cd8
  CR2: ffff8801ead19008
  ---[ end trace 38eca2e56f1b642e ]---

Avoid this problem by not deferring struct page initialization when
running as Xen pv guest.

Pavel said:

: This is unique for Xen, so this particular issue won't effect other
: configurations.  I am going to investigate if there is a way to
: re-enable deferred page initialization on xen guests.

[akpm@linux-foundation.org: explicitly include xen.h]
Link: http://lkml.kernel.org/r/20180216154101.22865-1-jgross@suse.com
Fixes: f7f99100d8d95d ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Juergen Gross &lt;jgross@suse.com&gt;
Reviewed-by: Pavel Tatashin &lt;pasha.tatashin@oracle.com&gt;
Cc: Steven Sistare &lt;steven.sistare@oracle.com&gt;
Cc: Daniel Jordan &lt;daniel.m.jordan@oracle.com&gt;
Cc: Bob Picco &lt;bob.picco@oracle.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.15.x]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit f7f99100d8d9 ("mm: stop zeroing memory during allocation in
vmemmap") broke Xen pv domains in some configurations, as the "Pinned"
information in struct page of early page tables could get lost.

This will lead to the kernel trying to write directly into the page
tables instead of asking the hypervisor to do so.  The result is a crash
like the following:

  BUG: unable to handle kernel paging request at ffff8801ead19008
  IP: xen_set_pud+0x4e/0xd0
  PGD 1c0a067 P4D 1c0a067 PUD 23a0067 PMD 1e9de0067 PTE 80100001ead19065
  Oops: 0003 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-default+ #271
  Hardware name: Dell Inc. Latitude E6440/0159N7, BIOS A07 06/26/2014
  task: ffffffff81c10480 task.stack: ffffffff81c00000
  RIP: e030:xen_set_pud+0x4e/0xd0
  Call Trace:
   __pmd_alloc+0x128/0x140
   ioremap_page_range+0x3f4/0x410
   __ioremap_caller+0x1c3/0x2e0
   acpi_os_map_iomem+0x175/0x1b0
   acpi_tb_acquire_table+0x39/0x66
   acpi_tb_validate_table+0x44/0x7c
   acpi_tb_verify_temp_table+0x45/0x304
   acpi_reallocate_root_table+0x12d/0x141
   acpi_early_init+0x4d/0x10a
   start_kernel+0x3eb/0x4a1
   xen_start_kernel+0x528/0x532
  Code: 48 01 e8 48 0f 42 15 a2 fd be 00 48 01 d0 48 ba 00 00 00 00 00 ea ff ff 48 c1 e8 0c 48 c1 e0 06 48 01 d0 48 8b 00 f6 c4 02 75 5d &lt;4c&gt; 89 65 00 5b 5d 41 5c c3 65 8b 05 52 9f fe 7e 89 c0 48 0f a3
  RIP: xen_set_pud+0x4e/0xd0 RSP: ffffffff81c03cd8
  CR2: ffff8801ead19008
  ---[ end trace 38eca2e56f1b642e ]---

Avoid this problem by not deferring struct page initialization when
running as Xen pv guest.

Pavel said:

: This is unique for Xen, so this particular issue won't effect other
: configurations.  I am going to investigate if there is a way to
: re-enable deferred page initialization on xen guests.

[akpm@linux-foundation.org: explicitly include xen.h]
Link: http://lkml.kernel.org/r/20180216154101.22865-1-jgross@suse.com
Fixes: f7f99100d8d95d ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Juergen Gross &lt;jgross@suse.com&gt;
Reviewed-by: Pavel Tatashin &lt;pasha.tatashin@oracle.com&gt;
Cc: Steven Sistare &lt;steven.sistare@oracle.com&gt;
Cc: Daniel Jordan &lt;daniel.m.jordan@oracle.com&gt;
Cc: Bob Picco &lt;bob.picco@oracle.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.15.x]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems</title>
<updated>2018-02-21T23:35:43+00:00</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2018-02-21T22:46:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=698d0831ba87b92ae10b15e8203cfd59f5a59a35'/>
<id>698d0831ba87b92ae10b15e8203cfd59f5a59a35</id>
<content type='text'>
Kai Heng Feng has noticed that BUG_ON(PageHighMem(pg)) triggers in
drivers/media/common/saa7146/saa7146_core.c since 19809c2da28a ("mm,
vmalloc: use __GFP_HIGHMEM implicitly").

saa7146_vmalloc_build_pgtable uses vmalloc_32 and it is reasonable to
expect that the resulting page is not in highmem.  The above commit
aimed to add __GFP_HIGHMEM only for those requests which do not specify
any zone modifier gfp flag.  vmalloc_32 relies on GFP_VMALLOC32 which
should do the right thing.  Except it has been missed that GFP_VMALLOC32
is an alias for GFP_KERNEL on 32b architectures.  Thanks to Matthew to
notice this.

Fix the problem by unconditionally setting GFP_DMA32 in GFP_VMALLOC32
for !64b arches (as a bailout).  This should do the right thing and use
ZONE_NORMAL which should be always below 4G on 32b systems.

Debugged by Matthew Wilcox.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180212095019.GX21609@dhcp22.suse.cz
Fixes: 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”)
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reported-by: Kai Heng Feng &lt;kai.heng.feng@canonical.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Laura Abbott &lt;labbott@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Kai Heng Feng has noticed that BUG_ON(PageHighMem(pg)) triggers in
drivers/media/common/saa7146/saa7146_core.c since 19809c2da28a ("mm,
vmalloc: use __GFP_HIGHMEM implicitly").

saa7146_vmalloc_build_pgtable uses vmalloc_32 and it is reasonable to
expect that the resulting page is not in highmem.  The above commit
aimed to add __GFP_HIGHMEM only for those requests which do not specify
any zone modifier gfp flag.  vmalloc_32 relies on GFP_VMALLOC32 which
should do the right thing.  Except it has been missed that GFP_VMALLOC32
is an alias for GFP_KERNEL on 32b architectures.  Thanks to Matthew to
notice this.

Fix the problem by unconditionally setting GFP_DMA32 in GFP_VMALLOC32
for !64b arches (as a bailout).  This should do the right thing and use
ZONE_NORMAL which should be always below 4G on 32b systems.

Debugged by Matthew Wilcox.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180212095019.GX21609@dhcp22.suse.cz
Fixes: 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”)
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Reported-by: Kai Heng Feng &lt;kai.heng.feng@canonical.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Laura Abbott &lt;labbott@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/swap.c: make functions and their kernel-doc agree (again)</title>
<updated>2018-02-21T23:35:43+00:00</updated>
<author>
<name>Mike Rapoport</name>
<email>rppt@linux.vnet.ibm.com</email>
</author>
<published>2018-02-21T22:45:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=cb6f0f34802dd7148d930f4f8d1cce991b8c23be'/>
<id>cb6f0f34802dd7148d930f4f8d1cce991b8c23be</id>
<content type='text'>
There was a conflict between the commit e02a9f048ef7 ("mm/swap.c: make
functions and their kernel-doc agree") and the commit f144c390f905 ("mm:
docs: fix parameter names mismatch") that both tried to fix mismatch
betweeen pagevec_lookup_entries() parameter names and their description.

Since nr_entries is a better name for the parameter, fix the description
again.

Link: http://lkml.kernel.org/r/1518116946-20947-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Acked-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There was a conflict between the commit e02a9f048ef7 ("mm/swap.c: make
functions and their kernel-doc agree") and the commit f144c390f905 ("mm:
docs: fix parameter names mismatch") that both tried to fix mismatch
betweeen pagevec_lookup_entries() parameter names and their description.

Since nr_entries is a better name for the parameter, fix the description
again.

Link: http://lkml.kernel.org/r/1518116946-20947-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Acked-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc</title>
<updated>2018-02-21T23:35:43+00:00</updated>
<author>
<name>Mike Rapoport</name>
<email>rppt@linux.vnet.ibm.com</email>
</author>
<published>2018-02-21T22:45:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=14fec9eba43b05d39825128e4354a2dc50fb59ea'/>
<id>14fec9eba43b05d39825128e4354a2dc50fb59ea</id>
<content type='text'>
[akpm@linux-foundation.org: add colon, per Randy]
Link: http://lkml.kernel.org/r/1518116984-21141-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[akpm@linux-foundation.org: add colon, per Randy]
Link: http://lkml.kernel.org/r/1518116984-21141-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm, swap, frontswap: fix THP swap if frontswap enabled</title>
<updated>2018-02-21T23:35:43+00:00</updated>
<author>
<name>Huang Ying</name>
<email>huang.ying.caritas@gmail.com</email>
</author>
<published>2018-02-21T22:45:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7ba716698cc53f8d5367766c93c538c7da6c68ce'/>
<id>7ba716698cc53f8d5367766c93c538c7da6c68ce</id>
<content type='text'>
It was reported by Sergey Senozhatsky that if THP (Transparent Huge
Page) and frontswap (via zswap) are both enabled, when memory goes low
so that swap is triggered, segfault and memory corruption will occur in
random user space applications as follow,

kernel: urxvt[338]: segfault at 20 ip 00007fc08889ae0d sp 00007ffc73a7fc40 error 6 in libc-2.26.so[7fc08881a000+1ae000]
 #0  0x00007fc08889ae0d _int_malloc (libc.so.6)
 #1  0x00007fc08889c2f3 malloc (libc.so.6)
 #2  0x0000560e6004bff7 _Z14rxvt_wcstoutf8PKwi (urxvt)
 #3  0x0000560e6005e75c n/a (urxvt)
 #4  0x0000560e6007d9f1 _ZN16rxvt_perl_interp6invokeEP9rxvt_term9hook_typez (urxvt)
 #5  0x0000560e6003d988 _ZN9rxvt_term9cmd_parseEv (urxvt)
 #6  0x0000560e60042804 _ZN9rxvt_term6pty_cbERN2ev2ioEi (urxvt)
 #7  0x0000560e6005c10f _Z17ev_invoke_pendingv (urxvt)
 #8  0x0000560e6005cb55 ev_run (urxvt)
 #9  0x0000560e6003b9b9 main (urxvt)
 #10 0x00007fc08883af4a __libc_start_main (libc.so.6)
 #11 0x0000560e6003f9da _start (urxvt)

After bisection, it was found the first bad commit is bd4c82c22c36 ("mm,
THP, swap: delay splitting THP after swapped out").

The root cause is as follows:

When the pages are written to swap device during swapping out in
swap_writepage(), zswap (fontswap) is tried to compress the pages to
improve performance.  But zswap (frontswap) will treat THP as a normal
page, so only the head page is saved.  After swapping in, tail pages
will not be restored to their original contents, causing memory
corruption in the applications.

This is fixed by refusing to save page in the frontswap store functions
if the page is a THP.  So that the THP will be swapped out to swap
device.

Another choice is to split THP if frontswap is enabled.  But it is found
that the frontswap enabling isn't flexible.  For example, if
CONFIG_ZSWAP=y (cannot be module), frontswap will be enabled even if
zswap itself isn't enabled.

Frontswap has multiple backends, to make it easy for one backend to
enable THP support, the THP checking is put in backend frontswap store
functions instead of the general interfaces.

Link: http://lkml.kernel.org/r/20180209084947.22749-1-ying.huang@intel.com
Fixes: bd4c82c22c367e068 ("mm, THP, swap: delay splitting THP after swapped out")
Signed-off-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Reported-by: Sergey Senozhatsky &lt;sergey.senozhatsky@gmail.com&gt;
Tested-by: Sergey Senozhatsky &lt;sergey.senozhatsky@gmail.com&gt;
Suggested-by: Minchan Kim &lt;minchan@kernel.org&gt;	[put THP checking in backend]
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Dan Streetman &lt;ddstreet@ieee.org&gt;
Cc: Seth Jennings &lt;sjenning@redhat.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Boris Ostrovsky &lt;boris.ostrovsky@oracle.com&gt;
Cc: Juergen Gross &lt;jgross@suse.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.14]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
It was reported by Sergey Senozhatsky that if THP (Transparent Huge
Page) and frontswap (via zswap) are both enabled, when memory goes low
so that swap is triggered, segfault and memory corruption will occur in
random user space applications as follow,

kernel: urxvt[338]: segfault at 20 ip 00007fc08889ae0d sp 00007ffc73a7fc40 error 6 in libc-2.26.so[7fc08881a000+1ae000]
 #0  0x00007fc08889ae0d _int_malloc (libc.so.6)
 #1  0x00007fc08889c2f3 malloc (libc.so.6)
 #2  0x0000560e6004bff7 _Z14rxvt_wcstoutf8PKwi (urxvt)
 #3  0x0000560e6005e75c n/a (urxvt)
 #4  0x0000560e6007d9f1 _ZN16rxvt_perl_interp6invokeEP9rxvt_term9hook_typez (urxvt)
 #5  0x0000560e6003d988 _ZN9rxvt_term9cmd_parseEv (urxvt)
 #6  0x0000560e60042804 _ZN9rxvt_term6pty_cbERN2ev2ioEi (urxvt)
 #7  0x0000560e6005c10f _Z17ev_invoke_pendingv (urxvt)
 #8  0x0000560e6005cb55 ev_run (urxvt)
 #9  0x0000560e6003b9b9 main (urxvt)
 #10 0x00007fc08883af4a __libc_start_main (libc.so.6)
 #11 0x0000560e6003f9da _start (urxvt)

After bisection, it was found the first bad commit is bd4c82c22c36 ("mm,
THP, swap: delay splitting THP after swapped out").

The root cause is as follows:

When the pages are written to swap device during swapping out in
swap_writepage(), zswap (fontswap) is tried to compress the pages to
improve performance.  But zswap (frontswap) will treat THP as a normal
page, so only the head page is saved.  After swapping in, tail pages
will not be restored to their original contents, causing memory
corruption in the applications.

This is fixed by refusing to save page in the frontswap store functions
if the page is a THP.  So that the THP will be swapped out to swap
device.

Another choice is to split THP if frontswap is enabled.  But it is found
that the frontswap enabling isn't flexible.  For example, if
CONFIG_ZSWAP=y (cannot be module), frontswap will be enabled even if
zswap itself isn't enabled.

Frontswap has multiple backends, to make it easy for one backend to
enable THP support, the THP checking is put in backend frontswap store
functions instead of the general interfaces.

Link: http://lkml.kernel.org/r/20180209084947.22749-1-ying.huang@intel.com
Fixes: bd4c82c22c367e068 ("mm, THP, swap: delay splitting THP after swapped out")
Signed-off-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Reported-by: Sergey Senozhatsky &lt;sergey.senozhatsky@gmail.com&gt;
Tested-by: Sergey Senozhatsky &lt;sergey.senozhatsky@gmail.com&gt;
Suggested-by: Minchan Kim &lt;minchan@kernel.org&gt;	[put THP checking in backend]
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Dan Streetman &lt;ddstreet@ieee.org&gt;
Cc: Seth Jennings &lt;sjenning@redhat.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Boris Ostrovsky &lt;boris.ostrovsky@oracle.com&gt;
Cc: Juergen Gross &lt;jgross@suse.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.14]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
