linux-toradex.git/mm/rmap.c, branch v5.0-rc3

mm/mmu_notifier: mm/rmap.c: Fix a mmu_notifier range bug in try_to_unmap_one

2019-01-10T10:58:21+00:00

The conversion to use a structure for mmu_notifier_invalidate_range_*()
unintentionally changed the usage in try_to_unmap_one() to init the
'struct mmu_notifier_range' with vma->vm_start instead of @address,
i.e. it invalidates the wrong address range.  Revert to the correct
address range.

Manifests as KVM use-after-free WARNINGs and subsequent "BUG: Bad page
state in process X" errors when reclaiming from a KVM guest due to KVM
removing the wrong pages from its own mappings.

Reported-by: leozinho29_eu@hotmail.com
Reported-by: Mike Galbraith 
Reported-and-tested-by: Adam Borowski 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Pankaj gupta 
Cc: Christian König 
Cc: Jan Kara 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Cc: Dan Williams 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Michal Hocko 
Cc: Felix Kuehling 
Cc: Ralph Campbell 
Cc: John Hubbard 
Cc: Andrew Morton 
Fixes: ac46d4f3c432 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
Signed-off-by: Sean Christopherson 
Signed-off-by: Linus Torvalds

hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"

2019-01-09T01:15:11+00:00

This reverts b43a9990055958e70347c56f90ea2ae32c67334c

The reverted commit caused issues with migration and poisoning of anon
huge pages.  The LTP move_pages12 test will cause an "unable to handle
kernel NULL pointer" BUG would occur with stack similar to:

  RIP: 0010:down_write+0x1b/0x40
  Call Trace:
    migrate_pages+0x81f/0xb90
    __ia32_compat_sys_migrate_pages+0x190/0x190
    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
    kernel_move_pages+0x566/0x7b0
    __x64_sys_move_pages+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

The purpose of the reverted patch was to fix some long existing races
with huge pmd sharing.  It used i_mmap_rwsem for this purpose with the
idea that this could also be used to address truncate/page fault races
with another patch.  Further analysis has determined that i_mmap_rwsem
can not be used to address all these hugetlbfs synchronization issues.
Therefore, revert this patch while working an another approach to the
underlying issues.

Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz 
Reported-by: Jan Stancek 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Naoya Horiguchi 
Cc: "Aneesh Kumar K . V" 
Cc: Andrea Arcangeli 
Cc: "Kirill A . Shutemov" 
Cc: Davidlohr Bueso 
Cc: Prakash Sangappa 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

2018-12-28T20:11:51+00:00

While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table.  Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file.  As part of truncation, it
unmaps everyone who has the file mapped.  If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd.  If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse.  This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:

- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with the
  ptep.

- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
  called.

[mike.kravetz@oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz 
Acked-by: Kirill A. Shutemov 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Naoya Horiguchi 
Cc: "Aneesh Kumar K . V" 
Cc: Andrea Arcangeli 
Cc: Davidlohr Bueso 
Cc: Prakash Sangappa 
Cc: Colin Ian King 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: remove __hugepage_set_anon_rmap()

2018-12-28T20:11:51+00:00

This function is identical to __page_set_anon_rmap() since the time, when
it was introduced (8 years ago).  The patch removes the function, and
makes its users to use __page_set_anon_rmap() instead.

Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Kirill A. Shutemov 
Reviewed-by: Andrew Morton 
Reviewed-by: Mike Kravetz 
Cc: Jerome Glisse 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/mmu_notifier: use structure for invalidate_range_start/end calls v2

2018-12-28T20:11:50+00:00

To avoid having to change many call sites everytime we want to add a
parameter use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end cakks.  No functional changes with this patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse 
Acked-by: Christian König 
Acked-by: Jan Kara 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Cc: Dan Williams 
Cc: Paolo Bonzini 
Cc: Radim Krcmar 
Cc: Michal Hocko 
Cc: Felix Kuehling 
Cc: Ralph Campbell 
Cc: John Hubbard 
From: Jérôme Glisse 
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/huge_memory: rename freeze_page() to unmap_page()

2018-11-30T22:56:14+00:00

The term "freeze" is used in several ways in the kernel, and in mm it
has the particular meaning of forcing page refcount temporarily to 0.
freeze_page() is just too confusing a name for a function that unmaps a
page: rename it unmap_page(), and rename unfreeze_page() remap_page().

Went to change the mention of freeze_page() added later in mm/rmap.c,
but found it to be incorrect: ordinary page reclaim reaches there too;
but the substance of the comment still seems correct, so edit it down.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
Signed-off-by: Hugh Dickins 
Acked-by: Kirill A. Shutemov 
Cc: Jerome Glisse 
Cc: Konstantin Khlebnikov 
Cc: Matthew Wilcox 
Cc: 	[4.8+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: migration: fix migration of huge PMD shared pages

2018-10-05T23:32:04+00:00

The page migration code employs try_to_unmap() to try and unmap the source
page.  This is accomplished by using rmap_walk to find all vmas where the
page is mapped.  This search stops when page mapcount is zero.  For shared
PMD huge pages, the page map count is always 1 no matter the number of
mappings.  Shared mappings are tracked via the reference count of the PMD
page.  Therefore, try_to_unmap stops prematurely and does not completely
unmap all mappings of the source page.

This problem can result is data corruption as writes to the original
source page can happen after contents of the page are copied to the target
page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global areas
after a huge page was soft offlined due to ECC memory errors.  DB
developers noticed they could reproduce the issue by (hotplug) offlining
memory used to back huge pages.  A simple testcase can reproduce the
problem by creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes while continually
writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing by
calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
mapping it will be 'unshared' which removes the page table entry and drops
the reference on the PMD page.  After this, flush caches and TLB.

mmu notifiers are called before locking page tables, but we can not be
sure of PMD sharing until page tables are locked.  Therefore, check for
the possibility of PMD sharing before locking so that notifiers can
prepare for the worst possible case.

Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: make _range_in_vma() a static inline]
  Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz 
Acked-by: Kirill A. Shutemov 
Reviewed-by: Naoya Horiguchi 
Acked-by: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Davidlohr Bueso 
Cc: Jerome Glisse 
Cc: Mike Kravetz 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

mm: do not drop unused pages when userfaultd is running

2018-07-14T18:11:09+00:00

KVM guests on s390 can notify the host of unused pages.  This can result
in pte_unused callbacks to be true for KVM guest memory.

If a page is unused (checked with pte_unused) we might drop this page
instead of paging it.  This can have side-effects on userfaultd, when
the page in question was already migrated:

The next access of that page will trigger a fault and a user fault
instead of faulting in a new and empty zero page.  As QEMU does not
expect a userfault on an already migrated page this migration will fail.

The most straightforward solution is to ignore the pte_unused hint if a
userfault context is active for this VMA.

Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
Signed-off-by: Christian Borntraeger 
Cc: Martin Schwidefsky 
Cc: Andrea Arcangeli 
Cc: Mike Rapoport 
Cc: Janosch Frank 
Cc: David Hildenbrand 
Cc: Cornelia Huck 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge tag 'v4.17-rc2' into docs-next

2018-04-27T23:13:20+00:00

  Merge -rc2 to pick up the changes to
  Documentation/core-api/kernel-api.rst that hit mainline via the
  networking tree.  In their absence, subsequent patches cannot be
  applied.

mm: enable thp migration for shmem thp

2018-04-21T00:18:35+00:00

My testing for the latest kernel supporting thp migration showed an
infinite loop in offlining the memory block that is filled with shmem
thps.  We can get out of the loop with a signal, but kernel should return
with failure in this case.

What happens in the loop is that scan_movable_pages() repeats returning
the same pfn without any progress.  That's because page migration always
fails for shmem thps.

In memory offline code, memory blocks containing unmovable pages should be
prevented from being offline targets by has_unmovable_pages() inside
start_isolate_page_range().  So it's possible to change migratability for
non-anonymous thps to avoid the issue, but it introduces more complex and
thp-specific handling in migration code, so it might not good.

So this patch is suggesting to fix the issue by enabling thp migration for
shmem thp.  Both of anon/shmem thp are migratable so we don't need
precheck about the type of thps.

Link: http://lkml.kernel.org/r/20180406030706.GA2434@hori1.linux.bs1.fc.nec.co.jp
Fixes: commit 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
Signed-off-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Cc: Zi Yan 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds