linux-toradex.git/include/linux/pgtable.h, branch v6.0-rc1

mm/mmap: define DECLARE_VM_GET_PAGE_PROT

2022-07-18T00:14:37+00:00

This just converts the generic vm_get_page_prot() implementation into a
new macro i.e DECLARE_VM_GET_PAGE_PROT which later can be used across
platforms when enabling them with ARCH_HAS_VM_GET_PAGE_PROT.  This does
not create any functional change.

Link: https://lkml.kernel.org/r/20220711070600.2378316-3-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual 
Reviewed-by: Christophe Leroy 
Suggested-by: Christoph Hellwig 
Cc: Arnd Bergmann 
Cc: Brian Cain 
Cc: Catalin Marinas 
Cc: Christoph Hellwig 
Cc: Chris Zankel 
Cc: "David S. Miller" 
Cc: Dinh Nguyen 
Cc: Geert Uytterhoeven 
Cc: Guo Ren 
Cc: Heiko Carstens 
Cc: Huacai Chen 
Cc: Ingo Molnar 
Cc: "James E.J. Bottomley" 
Cc: Jeff Dike 
Cc: Jonas Bonn 
Cc: Michael Ellerman 
Cc: Michal Simek 
Cc: Nicholas Piggin 
Cc: Palmer Dabbelt 
Cc: Paul Mackerras 
Cc: Paul Walmsley 
Cc: Richard Henderson 
Cc: Rich Felker 
Cc: Russell King 
Cc: Sam Ravnborg 
Cc: Stafford Horne 
Cc: Thomas Bogendoerfer 
Cc: Thomas Gleixner 
Cc: Vasily Gorbik 
Cc: Vineet Gupta 
Cc: WANG Xuerui 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton

mm: functions may simplify the use of return values

2022-05-13T14:20:18+00:00

p4d_clear_huge may be optimized for void return type and function usage. 
vunmap_p4d_range function saves a few steps here.

Link: https://lkml.kernel.org/r/20220507150630.90399-1-kunyu@nfschina.com
Signed-off-by: Li kunyu 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Signed-off-by: Andrew Morton

mm: remove __HAVE_ARCH_PTEP_CLEAR in pgtable.h

2022-05-13T14:20:17+00:00

Currently, there is no architecture definition __HAVE_ARCH_PTEP_CLEAR,
Generic ptep_clear() is the only definition for all architecture, So drop
the "#ifndef __HAVE_ARCH_PTEP_CLEAR".

Link: https://lkml.kernel.org/r/20220507110114.4128854-5-tongtiangen@huawei.com
Signed-off-by: Tong Tiangen 
Suggested-by: Anshuman Khandual 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Kefeng Wang 
Cc: Palmer Dabbelt 
Cc: Pasha Tatashin 
Cc: Paul Walmsley 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

mm: page_table_check: add hooks to public helpers

2022-05-13T14:20:17+00:00

Move ptep_clear() to the include/linux/pgtable.h and add page table check
relate hooks to some helpers, it's prepare for support page table check
feature on new architecture.

Optimize the implementation of ptep_clear(), page table hooks added page
table check stubs, the interface control should be at stubs, there is no
rationale for doing a IS_ENABLED() check here.

For architectures that do not enable CONFIG_PAGE_TABLE_CHECK, they will
call a fallback page table check stubs[1] when getting their page table
helpers[2] in include/linux/pgtable.h.

[1] page table check stubs defined in include/linux/page_table_check.h
[2] ptep_clear() ptep_get_and_clear()  pmdp_huge_get_and_clear()
pudp_huge_get_and_clear()

Link: https://lkml.kernel.org/r/20220507110114.4128854-4-tongtiangen@huawei.com
Signed-off-by: Tong Tiangen 
Acked-by: Pasha Tatashin 
Cc: Anshuman Khandual 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Kefeng Wang 
Cc: Palmer Dabbelt 
Cc: Paul Walmsley 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

mm/shmem: convert shmem_swapin_page() to shmem_swapin_folio()

2022-05-13T14:20:17+00:00

shmem_swapin_page() only brings in order-0 pages, which are folios
by definition.

Link: https://lkml.kernel.org/r/20220504182857.4013401-24-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Andrew Morton

mm: avoid unnecessary flush on change_huge_pmd()

2022-05-13T14:20:05+00:00

Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.

The first TLB flush is only necessary to prevent the dirty bit (and with a
lesser importance the access bit) from changing while the PTE is modified.
However, this is not necessary as the x86 CPUs set the dirty-bit
atomically with an additional check that the PTE is (still) present.  One
caveat is Intel's Knights Landing that has a bug and does not do so.

Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd().  Introduce a new arch specific pmdp_invalidate_ad()
that only invalidates the access and dirty bit from further changes.

Link: https://lkml.kernel.org/r/20220401180821.1986781-4-namit@vmware.com
Signed-off-by: Nadav Amit 
Cc: Andrea Arcangeli 
Cc: Andrew Cooper 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Peter Xu 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: Yu Zhao 
Cc: Nick Piggin 
Signed-off-by: Andrew Morton

mm/swap: remember PG_anon_exclusive via a swp pte bit

2022-05-10T01:20:45+00:00

Patch series "mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages", v2.

This series fixes memory corruptions when a GUP R/W reference (FOLL_WRITE
| FOLL_GET) was taken on an anonymous page and COW logic fails to detect
exclusivity of the page to then replacing the anonymous page by a copy in
the page table: The GUP reference lost synchronicity with the pages mapped
into the page tables.  This series focuses on x86, arm64, s390x and
ppc64/book3s -- other architectures are fairly easy to support by
implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.

This primarily fixes the O_DIRECT memory corruptions that can happen on
concurrent swapout, whereby we lose DMA reads to a page (modifying the
user page by writing to it).

O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM) DMA
from/to a user page.  In the long run, we want to convert it to properly
use FOLL_PIN, and John is working on it, but that might take a while and
might not be easy to backport.  In the meantime, let's restore what used
to work before we started modifying our COW logic: make R/W FOLL_GET
references reliable as long as there is no fork() after GUP involved.

This is just the natural follow-up of part 2, that will also further
reduce "wrong COW" on the swapin path, for example, when we cannot remove
a page from the swapcache due to concurrent writeback, or if we have two
threads faulting on the same swapped-out page.  Fixing O_DIRECT is just a
nice side-product

This issue, including other related COW issues, has been summarized in [3]
under 2):
"
  2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)

  It was discovered that we can create a memory corruption by reading a
  file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
  concurrently writing to an unrelated part (e.g., last byte) of the same
  page, and concurrently write-protecting the page via clear_refs
  SOFTDIRTY tracking [6].

  For the reproducer, the issue is that O_DIRECT grabs a reference of the
  target page (via FOLL_GET) and clear_refs write-protects the relevant
  page table entry. On successive write access to the page from the
  process itself, we wrongly COW the page when resolving the write fault,
  resulting in a loss of synchronicity and consequently a memory corruption.

  While some people might think that using clear_refs in this combination
  is a corner cases, it turns out to be a more generic problem unfortunately.

  For example, it was just recently discovered that we can similarly
  create a memory corruption without clear_refs, simply by concurrently
  swapping out the buffer pages [7]. Note that we nowadays even use the
  swap infrastructure in Linux without an actual swap disk/partition: the
  prime example is zram which is enabled as default under Fedora [10].

  The root issue is that a write-fault on a page that has additional
  references results in a COW and thereby a loss of synchronicity
  and consequently a memory corruption if two parties believe they are
  referencing the same page.
"

We don't particularly care about R/O FOLL_GET references: they were never
reliable and O_DIRECT doesn't expect to observe modifications from a page
after DMA was started.

Note that:
* this only fixes the issue on x86, arm64, s390x and ppc64/book3s
  ("enterprise architectures"). Other architectures have to implement
  __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
* this does *not * consider any kind of fork() after taking the reference:
  fork() after GUP never worked reliably with FOLL_GET.
* Not losing PG_anon_exclusive during swapout was the last remaining
  piece. KSM already makes sure that there are no other references on
  a page before considering it for sharing. Page migration maintains
  PG_anon_exclusive and simply fails when there are additional references
  (freezing the refcount fails). Only swapout code dropped the
  PG_anon_exclusive flag because it requires more work to remember +
  restore it.

With this series in place, most COW issues of [3] are fixed on said
architectures. Other architectures can implement
__HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.

[1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
[2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
[3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com


This patch (of 8):

Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
it.  We do this, to keep fork() logic on swap entries easy and efficient:
for example, if we wouldn't clear it when unmapping, we'd have to lookup
the page in the swapcache for each and every swap entry during fork() and
clear PG_anon_exclusive if set.

Instead, we want to store that information directly in the swap pte,
protected by the page table lock, similarly to how we handle
SWP_MIGRATION_READ_EXCLUSIVE for migration entries.  However, for actual
swap entries, we don't want to mess with the swap type (e.g., still one
bit) because it overcomplicates swap code.

In try_to_unmap(), we already reject to unmap in case the page might be
pinned, because we must not lose PG_anon_exclusive on pinned pages ever. 
Checking if there are other unexpected references reliably *before*
completely unmapping a page is unfortunately not really possible: THP
heavily overcomplicate the situation.  Once fully unmapped it's easier --
we, for example, make sure that there are no unexpected references *after*
unmapping a page before starting writeback on that page.

So, we currently might end up unmapping a page and clearing
PG_anon_exclusive if that page has additional references, for example, due
to a FOLL_GET.

do_swap_page() has to re-determine if a page is exclusive, which will
easily fail if there are other references on a page, most prominently GUP
references via FOLL_GET.  This can currently result in memory corruptions
when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork()
is never involved: try_to_unmap() will succeed, and when refaulting the
page, it cannot be marked exclusive and will get replaced by a copy in the
page tables on the next write access, resulting in writes via the GUP
reference to the page being lost.

In an ideal world, everybody that uses GUP and wants to modify page
content, such as O_DIRECT, would properly use FOLL_PIN.  However, that
conversion will take a while.  It's easier to fix what used to work in the
past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive.  In addition,
by remembering PG_anon_exclusive we can further reduce unnecessary COW in
some cases, so it's the natural thing to do.

So let's transfer the PG_anon_exclusive information to the swap pte and
store it via an architecture-dependant pte bit; use that information when
restoring the swap pte in do_swap_page() and unuse_pte().  During fork(),
we simply have to clear the pte bit and are done.

Of course, there is one corner case to handle: swap backends that don't
support concurrent page modifications while the page is under writeback. 
Special case these, and drop the exclusive marker.  Add a comment why that
is just fine (also, reuse_swap_page() would have done the same in the
past).

In the future, we'll hopefully have all architectures support
__HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs
and the define completely.  Then, we can also convert
SWP_MIGRATION_READ_EXCLUSIVE.  For architectures it's fairly easy to
support: either simply use a yet unused pte bit that can be used for swap
entries, steal one from the arch type bits if they exceed 5, or steal one
from the offset bits.

Note: R/O FOLL_GET references were never really reliable, especially when
taking one on a shared page and then writing to the page (e.g., GUP after
fork()).  FOLL_GET, including R/W references, were never really reliable
once fork was involved (e.g., GUP before fork(), GUP during fork()).  KSM
steps back in case it stumbles over unexpected references and is,
therefore, fine.

[david@redhat.com: fix SWP_STABLE_WRITES test]
  Link: https://lkml.kernel.org/r/ac725bcb-313a-4fff-250a-68ba9a8f85fb@redhat.comLink: https://lkml.kernel.org/r/20220329164329.208407-1-david@redhat.com
Link: https://lkml.kernel.org/r/20220329164329.208407-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Vlastimil Babka 
Cc: Hugh Dickins 
Cc: Shakeel Butt 
Cc: John Hubbard 
Cc: Jason Gunthorpe 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: "Kirill A. Shutemov" 
Cc: Matthew Wilcox (Oracle) 
Cc: Jann Horn 
Cc: Michal Hocko 
Cc: Nadav Amit 
Cc: Rik van Riel 
Cc: Roman Gushchin 
Cc: Andrea Arcangeli 
Cc: Peter Xu 
Cc: Don Dutile 
Cc: Christoph Hellwig 
Cc: Oleg Nesterov 
Cc: Jan Kara 
Cc: Liang Zhang 
Cc: Pedro Demarchi Gomes 
Cc: Oded Gabbay 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: Gerald Schaefer 
Signed-off-by: Andrew Morton

mm/pgtable: define pte_index so that preprocessor could recognize it

2022-02-04T17:25:05+00:00

Since commit 974b9b2c68f3 ("mm: consolidate pte_index() and
pte_offset_*() definitions") pte_index is a static inline and there is
no define for it that can be recognized by the preprocessor.  As a
result, vm_insert_pages() uses slower loop over vm_insert_page() instead
of insert_pages() that amortizes the cost of spinlock operations when
inserting multiple pages.

Link: https://lkml.kernel.org/r/20220111145457.20748-1-rppt@kernel.org
Fixes: 974b9b2c68f3 ("mm: consolidate pte_index() and pte_offset_*() definitions")
Signed-off-by: Mike Rapoport 
Reported-by: Christian Dietrich 
Reviewed-by: Khalid Aziz 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: ptep_clear() page table helper

2022-01-15T14:30:28+00:00

We have ptep_get_and_clear() and ptep_get_and_clear_full() helpers to
clear PTE from user page tables, but there is no variant for simple
clear of a present PTE from user page tables without using a low level
pte_clear() which can be either native or para-virtualised.

Add a new ptep_clear() that can be used in common code to clear PTEs
from page table.  We will need this call later in order to add a hook
for page table check.

Link: https://lkml.kernel.org/r/20211221154650.1047963-3-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin 
Cc: Aneesh Kumar K.V 
Cc: Dave Hansen 
Cc: David Rientjes 
Cc: Frederic Weisbecker 
Cc: Greg Thelen 
Cc: "H. Peter Anvin" 
Cc: Hugh Dickins 
Cc: Ingo Molnar 
Cc: Jiri Slaby 
Cc: Jonathan Corbet 
Cc: Kees Cook 
Cc: Masahiro Yamada 
Cc: Mike Rapoport 
Cc: Muchun Song 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Sami Tolvanen 
Cc: Thomas Gleixner 
Cc: Wei Xu 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Revert "mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge"

2021-07-21T10:28:09+00:00

This reverts commit c742199a014de23ee92055c2473d91fe5561ffdf.

c742199a014d ("mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge")
breaks arm64 in at least two ways for configurations where PUD or PMD
folding occur:

  1. We no longer install huge-vmap mappings and silently fall back to
     page-granular entries, despite being able to install block entries
     at what is effectively the PGD level.

  2. If the linear map is backed with block mappings, these will now
     silently fail to be created in alloc_init_pud(), causing a panic
     early during boot.

The pgtable selftests caught this, although a fix has not been
forthcoming and Christophe is AWOL at the moment, so just revert the
change for now to get a working -rc3 on which we can queue patches for
5.15.

A simple revert breaks the build for 32-bit PowerPC 8xx machines, which
rely on the default function definitions when the corresponding
page-table levels are folded, since commit a6a8f7c4aa7e ("powerpc/8xx:
add support for huge pages on VMAP and VMALLOC"), eg:

  powerpc64-linux-ld: mm/vmalloc.o: in function `vunmap_pud_range':
  linux/mm/vmalloc.c:362: undefined reference to `pud_clear_huge'

To avoid that, add stubs for pud_clear_huge() and pmd_clear_huge() in
arch/powerpc/mm/nohash/8xx.c as suggested by Christophe.

Cc: Christophe Leroy 
Cc: Catalin Marinas 
Cc: Andrew Morton 
Cc: Nicholas Piggin 
Cc: Mike Rapoport 
Cc: Mark Rutland 
Cc: Geert Uytterhoeven 
Fixes: c742199a014d ("mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge")
Signed-off-by: Jonathan Marek 
Reviewed-by: Ard Biesheuvel 
Acked-by: Marc Zyngier 
[mpe: Fold in 8xx.c changes from Christophe and mention in change log]
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/linux-arm-kernel/CAMuHMdXShORDox-xxaeUfDW3wx2PeggFSqhVSHVZNKCGK-y_vQ@mail.gmail.com/
Link: https://lore.kernel.org/r/20210717160118.9855-1-jonathan@marek.ca
Link: https://lore.kernel.org/r/87r1fs1762.fsf@mpe.ellerman.id.au
Signed-off-by: Will Deacon