<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/linux/pgtable.h, branch v6.12-rc4</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>mm: always define pxx_pgprot()</title>
<updated>2024-09-17T08:06:59+00:00</updated>
<author>
<name>Peter Xu</name>
<email>peterx@redhat.com</email>
</author>
<published>2024-08-26T20:43:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0515e022e167cfacf1fee092eb93aa9514e23c0a'/>
<id>0515e022e167cfacf1fee092eb93aa9514e23c0a</id>
<content type='text'>
There are:

  - 8 places in the tree (arc, arm64, include, mips, powerpc, s390, sh,
  x86) that support pte_pgprot().

  - 2 architectures (x86, sparc) that support pmd_pgprot().

  - 1 architecture (x86) that supports pud_pgprot().

Always define them so they can be used in generic code, and then we don't
need to fiddle with "#ifdef"s when doing so.
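
A minimal sketch of the generic fallbacks this adds to
include/linux/pgtable.h (the exact form in the tree may differ): an empty
pgprot for any level whose helper the architecture does not provide:

  #ifndef pte_pgprot
  #define pte_pgprot(x) ((pgprot_t) {0})
  #endif

  #ifndef pmd_pgprot
  #define pmd_pgprot(x) ((pgprot_t) {0})
  #endif

  #ifndef pud_pgprot
  #define pud_pgprot(x) ((pgprot_t) {0})
  #endif

Architectures that implement a helper keep their own definition; generic
code can then call pte/pmd/pud_pgprot() unconditionally.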

Link: https://lkml.kernel.org/r/20240826204353.2228736-9-peterx@redhat.com
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Cc: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;
Cc: Alex Williamson &lt;alex.williamson@redhat.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Christian Borntraeger &lt;borntraeger@linux.ibm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Gavin Shan &lt;gshan@redhat.com&gt;
Cc: Gerald Schaefer &lt;gerald.schaefer@linux.ibm.com&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Niklas Schnelle &lt;schnelle@linux.ibm.com&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Sean Christopherson &lt;seanjc@google.com&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/x86: implement arch_check_zapped_pud()</title>
<updated>2024-09-02T03:26:09+00:00</updated>
<author>
<name>Peter Xu</name>
<email>peterx@redhat.com</email>
</author>
<published>2024-08-12T18:12:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1c399e74a97ca733d0780cc51819aa81fb292c22'/>
<id>1c399e74a97ca733d0780cc51819aa81fb292c22</id>
<content type='text'>
Introduce arch_check_zapped_pud() to sanity-check shadow-stack state on
PUD zaps.  It uses the same logic as the PMD helper.
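
As a rough sketch (the guard style in the tree may differ), the generic
version is an empty stub that an architecture overrides; the x86
implementation warns when a shadow-stack PUD is zapped in a VMA that is
not a shadow-stack VMA:

  #ifndef arch_check_zapped_pud
  static inline void arch_check_zapped_pud(struct vm_area_struct *vma,
                                           pud_t pud)
  {
          /* Default: no shadow-stack sanity checking. */
  }
  #endif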

One thing to mention: it might be a good idea to use page_table_check in
the future for trapping wrong setups of shadow-stack pgtable entries [1].
That is left as a separate effort.

[1] https://lore.kernel.org/all/59d518698f664e07c036a5098833d7b56b953305.camel@intel.com

Link: https://lkml.kernel.org/r/20240812181225.1360970-6-peterx@redhat.com
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: "Edgecombe, Rick P" &lt;rick.p.edgecombe@intel.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Cc: Christophe Leroy &lt;christophe.leroy@csgroup.eu&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Dave Jiang &lt;dave.jiang@intel.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Sean Christopherson &lt;seanjc@google.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: define __pte_leaf_size() to also take a PMD entry</title>
<updated>2024-07-12T22:52:15+00:00</updated>
<author>
<name>Christophe Leroy</name>
<email>christophe.leroy@csgroup.eu</email>
</author>
<published>2024-07-02T13:51:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=18d095b2556e5e1292003c8e9f5d845ed42ef89b'/>
<id>18d095b2556e5e1292003c8e9f5d845ed42ef89b</id>
<content type='text'>
On powerpc 8xx, when a page is 8M in size, that information is kept in
the PMD entry.  So allow architectures to provide __pte_leaf_size()
instead of pte_leaf_size(), and pass the PMD entry to that function.

When __pte_leaf_size() is not defined, define it as pte_leaf_size() so
that architectures not interested in the PMD argument are not impacted.

Only define the default pte_leaf_size() when __pte_leaf_size() is not
defined, to make sure nobody adds new calls to pte_leaf_size() in core
code.
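
A minimal sketch of the resulting fallback chain (assumed form):

  #ifndef __pte_leaf_size
  #ifndef pte_leaf_size
  #define pte_leaf_size(x) PAGE_SIZE
  #endif
  #define __pte_leaf_size(pmd, pte) pte_leaf_size(pte)
  #endif

An architecture such as powerpc 8xx can define __pte_leaf_size() itself
and inspect the PMD entry; everyone else transparently falls back to
pte_leaf_size().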

Link: https://lkml.kernel.org/r/c7c008f0a314bf8029ad7288fdc908db1ec7e449.1719928057.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy &lt;christophe.leroy@csgroup.eu&gt;
Reviewed-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages</title>
<updated>2024-07-04T02:30:01+00:00</updated>
<author>
<name>Barry Song</name>
<email>v-songbaohua@oppo.com</email>
</author>
<published>2024-05-29T08:28:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=29f252cdc293f4a50b5d3dcbed53701d8444614d'/>
<id>29f252cdc293f4a50b5d3dcbed53701d8444614d</id>
<content type='text'>
Should do_swap_page() gain the capability to directly map a large folio,
metadata restoration becomes necessary for a given number of pages,
denoted as nr.  Note that metadata restoration is only required by the
SPARC platform, which, however, does not enable THP_SWAP.  Consequently,
with present kernel configurations, there is no practical scenario where
more than one page of metadata needs to be restored.  Platforms
implementing THP_SWAP might invoke this function with nr values
exceeding 1 once do_swap_page() successfully maps an entire large folio,
but their arch_do_swap_page_nr() functions remain empty.
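
A sketch of the assumed shape of the hook in include/linux/pgtable.h,
under the assumption that the batched variant simply replays the existing
single-page arch_do_swap_page() hook for each page (details in the tree
may differ):

  #ifndef __HAVE_ARCH_DO_SWAP_PAGE
  static inline void arch_do_swap_page_nr(struct mm_struct *mm,
                                          struct vm_area_struct *vma,
                                          unsigned long addr,
                                          pte_t pte, pte_t oldpte, int nr)
  {
          /* No architecture hook: nothing to restore. */
  }
  #else
  static inline void arch_do_swap_page_nr(struct mm_struct *mm,
                                          struct vm_area_struct *vma,
                                          unsigned long addr,
                                          pte_t pte, pte_t oldpte, int nr)
  {
          /* Replay the single-page hook once per page in the batch. */
          for (;;) {
                  arch_do_swap_page(mm, vma, addr, pte, oldpte);
                  if (--nr == 0)
                          break;
                  addr += PAGE_SIZE;
                  pte = pte_advance_pfn(pte, 1);
                  oldpte = pte_advance_pfn(oldpte, 1);
          }
  }
  #endif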

Link: https://lkml.kernel.org/r/20240529082824.150954-5-21cnbao@gmail.com
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Reviewed-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Reviewed-by: Khalid Aziz &lt;khalid.aziz@oracle.com&gt;
Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Andreas Larsson &lt;andreas@gaisler.com&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Len Brown &lt;len.brown@intel.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Pavel Machek &lt;pavel@ucw.cz&gt;
Cc: "Rafael J. Wysocki" &lt;rafael@kernel.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: implement update_mmu_tlb() using update_mmu_tlb_range()</title>
<updated>2024-07-04T02:29:57+00:00</updated>
<author>
<name>Bang Li</name>
<email>libang.li@antgroup.com</email>
</author>
<published>2024-05-22T06:12:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8f65aa32239f1c3f11b7a25bd5921223bafc5fed'/>
<id>8f65aa32239f1c3f11b7a25bd5921223bafc5fed</id>
<content type='text'>
Let's make update_mmu_tlb() simply a generic wrapper around
update_mmu_tlb_range().  Only the latter can now be overridden by the
architecture.  We can now remove __HAVE_ARCH_UPDATE_MMU_TLB as well.
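
With that, the generic wrapper is tiny; roughly:

  static inline void update_mmu_tlb(struct vm_area_struct *vma,
                                    unsigned long address, pte_t *ptep)
  {
          /* A one-entry batch: the range helper does the real work. */
          update_mmu_tlb_range(vma, address, ptep, 1);
  }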

Link: https://lkml.kernel.org/r/20240522061204.117421-3-libang.li@antgroup.com
Signed-off-by: Bang Li &lt;libang.li@antgroup.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Chris Zankel &lt;chris@zankel.net&gt;
Cc: Huacai Chen &lt;chenhuacai@kernel.org&gt;
Cc: Lance Yang &lt;ioworker0@gmail.com&gt;
Cc: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
Cc: Palmer Dabbelt &lt;palmer@dabbelt.com&gt;
Cc: Paul Walmsley &lt;paul.walmsley@sifive.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Thomas Bogendoerfer &lt;tsbogend@alpha.franken.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: add update_mmu_tlb_range()</title>
<updated>2024-07-04T02:29:57+00:00</updated>
<author>
<name>Bang Li</name>
<email>libang.li@antgroup.com</email>
</author>
<published>2024-05-22T06:12:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=23b1b44e6c61295084284aa7d87db863a7802b92'/>
<id>23b1b44e6c61295084284aa7d87db863a7802b92</id>
<content type='text'>
Patch series "Add update_mmu_tlb_range() to simplify code", v4.

This series adds update_mmu_tlb_range(), which batch-updates the TLB for
an address range, and reimplements update_mmu_tlb() using
update_mmu_tlb_range().

After commit 19eaf44954df ("mm: thp: support allocation of anonymous
multi-size THP"), we may need to batch-update the TLB for an address
range by calling update_mmu_tlb() in a loop.  Using
update_mmu_tlb_range(), we can simplify the code and possibly avoid
executing some unnecessary code on some architectures.


This patch (of 3):

Add update_mmu_tlb_range(), so that we can batch-update the TLB for an
address range.
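
A minimal sketch of the generic default, a no-op that an architecture
overrides by providing its own definition first (exact form in the tree
may differ):

  #ifndef update_mmu_tlb_range
  static inline void update_mmu_tlb_range(struct vm_area_struct *vma,
                          unsigned long address, pte_t *ptep,
                          unsigned int nr)
  {
          /* Default: nothing to do; arch overrides where needed. */
  }
  #endif

Callers that previously invoked update_mmu_tlb() once per page in a loop
can then issue a single update_mmu_tlb_range(vma, addr, ptep, nr) for the
whole batch.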

Link: https://lkml.kernel.org/r/20240522061204.117421-1-libang.li@antgroup.com
Link: https://lkml.kernel.org/r/20240522061204.117421-2-libang.li@antgroup.com
Signed-off-by: Bang Li &lt;libang.li@antgroup.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Chris Zankel &lt;chris@zankel.net&gt;
Cc: Huacai Chen &lt;chenhuacai@kernel.org&gt;
Cc: Lance Yang &lt;ioworker0@gmail.com&gt;
Cc: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
Cc: Palmer Dabbelt &lt;palmer@dabbelt.com&gt;
Cc: Paul Walmsley &lt;paul.walmsley@sifive.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Thomas Bogendoerfer &lt;tsbogend@alpha.franken.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/madvise: introduce clear_young_dirty_ptes() batch helper</title>
<updated>2024-05-06T00:53:42+00:00</updated>
<author>
<name>Lance Yang</name>
<email>ioworker0@gmail.com</email>
</author>
<published>2024-04-18T13:44:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1b68112c40395b3b0fed3c8bb648e2d9d0b37ec2'/>
<id>1b68112c40395b3b0fed3c8bb648e2d9d0b37ec2</id>
<content type='text'>
Patch series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free",
v10.

This patchset adds support for lazyfreeing multi-size THP (mTHP) without
needing to first split the large folio via split_folio().  However, we
still need to split a large folio that is not fully mapped within the
target range.

If a large folio is locked or shared, or if we fail to split it, we just
leave it in place and advance to the next PTE in the range.  But note that
the behavior is changed; previously, any failure of this sort would cause
the entire operation to give up.  As large folios become more common,
sticking to the old way could result in wasted opportunities.

Performance Testing
===================

On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of
the same size results in the following runtimes for madvise(MADV_FREE) in
seconds (shorter is better):

Folio Size |   Old    |   New    | Change
------------------------------------------
      4KiB | 0.590251 | 0.590259 |    0%
     16KiB | 2.990447 | 0.185655 |  -94%
     32KiB | 2.547831 | 0.104870 |  -95%
     64KiB | 2.457796 | 0.052812 |  -97%
    128KiB | 2.281034 | 0.032777 |  -99%
    256KiB | 2.230387 | 0.017496 |  -99%
    512KiB | 2.189106 | 0.010781 |  -99%
   1024KiB | 2.183949 | 0.007753 |  -99%
   2048KiB | 0.002799 | 0.002804 |    0%


This patch (of 4):

This commit introduces clear_young_dirty_ptes() to replace mkold_ptes(). 
By doing so, we can use the same function for both use cases
(madvise_pageout and madvise_free), and it also provides the flexibility
to only clear the dirty flag in the future if needed.
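
A simplified sketch of the generic batch helper; the cydp_t flag names
follow this series, and details in the tree may differ:

  #ifndef clear_young_dirty_ptes
  static inline void clear_young_dirty_ptes(struct vm_area_struct *vma,
                                            unsigned long addr, pte_t *ptep,
                                            unsigned int nr, cydp_t flags)
  {
          pte_t pte;

          for (;;) {
                  if (flags == CYDP_CLEAR_YOUNG) {
                          /* Only clearing young: cheap dedicated helper. */
                          ptep_test_and_clear_young(vma, addr, ptep);
                  } else {
                          pte = ptep_get_and_clear(vma->vm_mm, addr, ptep);
                          if (flags &amp; CYDP_CLEAR_YOUNG)
                                  pte = pte_mkold(pte);
                          if (flags &amp; CYDP_CLEAR_DIRTY)
                                  pte = pte_mkclean(pte);
                          set_pte_at(vma->vm_mm, addr, ptep, pte);
                  }

                  if (--nr == 0)
                          break;
                  ptep++;
                  addr += PAGE_SIZE;
          }
  }
  #endif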

Link: https://lkml.kernel.org/r/20240418134435.6092-1-ioworker0@gmail.com
Link: https://lkml.kernel.org/r/20240418134435.6092-2-ioworker0@gmail.com
Signed-off-by: Lance Yang &lt;ioworker0@gmail.com&gt;
Suggested-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Barry Song &lt;21cnbao@gmail.com&gt;
Cc: Jeff Xie &lt;xiehuan09@gmail.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Yin Fengwei &lt;fengwei.yin@intel.com&gt;
Cc: Zach O'Keefe &lt;zokeefe@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLD</title>
<updated>2024-04-26T03:56:38+00:00</updated>
<author>
<name>Ryan Roberts</name>
<email>ryan.roberts@arm.com</email>
</author>
<published>2024-04-08T18:39:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3931b871c4936c00c4e27c469056d8da47a3493f'/>
<id>3931b871c4936c00c4e27c469056d8da47a3493f</id>
<content type='text'>
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
folio that is fully and contiguously mapped in the pageout/cold vm range. 
This change means that large folios will be maintained all the way to swap
storage.  This both improves performance during swap-out, by eliding the
cost of splitting the folio, and sets us up nicely for maintaining the
large folio when it is swapped back in (to be covered in a separate
series).

Folios that are not fully mapped in the target range are still split, but
note that the behavior is changed so that if the split fails for any
reason (folio locked, shared, etc.) we now leave the folio as is and move
to the next pte in the range, continuing work on the remaining folios.
Previously any failure of this sort would cause the entire operation to
give up, and no folios mapped at higher addresses were paged out or made
cold.  Given that large folios are becoming more common, the old behavior
would likely have led to wasted opportunities.

While we are at it, change the code that clears young from the ptes to
use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
function.  This is more efficient than get_and_clear/modify/set,
especially for contpte mappings on arm64, where the old approach would
require unfolding/refolding and the new approach can be done in place.
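
A sketch of the mkold_ptes() batch helper, assuming a generic fallback
that simply loops over ptep_test_and_clear_young() (an architecture such
as arm64 can provide a contpte-aware version):

  #ifndef mkold_ptes
  static inline void mkold_ptes(struct vm_area_struct *vma,
                                unsigned long addr, pte_t *ptep,
                                unsigned int nr)
  {
          for (;;) {
                  /* Clear the accessed bit without tearing down the pte. */
                  ptep_test_and_clear_young(vma, addr, ptep);
                  if (--nr == 0)
                          break;
                  ptep++;
                  addr += PAGE_SIZE;
          }
  }
  #endif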

Link: https://lkml.kernel.org/r/20240408183946.2991168-8-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Reviewed-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Barry Song &lt;21cnbao@gmail.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Lance Yang &lt;ioworker0@gmail.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()</title>
<updated>2024-04-26T03:56:37+00:00</updated>
<author>
<name>Ryan Roberts</name>
<email>ryan.roberts@arm.com</email>
</author>
<published>2024-04-08T18:39:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a62fb92ac12ed39df4930dca599a3b427552882a'/>
<id>a62fb92ac12ed39df4930dca599a3b427552882a</id>
<content type='text'>
Now that we no longer have a convenient flag in the cluster to determine
if a folio is large, free_swap_and_cache() will take a reference and lock
a large folio much more often, which could lead to contention and (e.g.)
failure to split large folios, etc.

Let's solve that problem by batch freeing swap and cache with a new
function, free_swap_and_cache_nr(), to free a contiguous range of swap
entries together.  This allows us to first drop a reference to each swap
slot before we try to release the cache folio.  This means we only try to
release the folio once, only taking the reference and lock once - much
better than the previous 512 times for the 2M THP case.

Contiguous swap entries are gathered in zap_pte_range() and
madvise_free_pte_range() in a similar way to how present ptes are already
gathered in zap_pte_range().

While we are at it, let's simplify by converting the return type of both
functions to void.  The return value was used only by zap_pte_range() to
print a bad pte, and was ignored by everyone else, so the extra reporting
wasn't exactly guaranteed.  We will still get the warning with most of the
information from get_swap_device().  With the batch version, we wouldn't
know which pte was bad anyway, so we might print the wrong one.
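
With the batched variant in place, the single-entry API can become a thin
wrapper; roughly:

  void free_swap_and_cache_nr(swp_entry_t entry, int nr);

  static inline void free_swap_and_cache(swp_entry_t entry)
  {
          free_swap_and_cache_nr(entry, 1);
  }

zap_pte_range() and madvise_free_pte_range() then pass the number of
contiguous swap entries they gathered instead of calling the helper once
per pte.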

[ryan.roberts@arm.com: fix a build warning on parisc]
  Link: https://lkml.kernel.org/r/20240409111840.3173122-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20240408183946.2991168-3-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Barry Song &lt;21cnbao@gmail.com&gt;
Cc: Barry Song &lt;v-songbaohua@oppo.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Lance Yang &lt;ioworker0@gmail.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove "prot" parameter from move_pte()</title>
<updated>2024-04-26T03:56:24+00:00</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2024-03-27T14:33:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=82a616d0f33b77b1cadd0652efbe11874771320f'/>
<id>82a616d0f33b77b1cadd0652efbe11874771320f</id>
<content type='text'>
The "prot" parameter is unused, and using it instead of what's stored in
that particular PTE would very likely be wrong.  Let's simply remove it.
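
For the generic fallback in include/linux/pgtable.h (guarded by
__HAVE_ARCH_MOVE_PTE; sparc is the only architecture providing its own
move_pte()), the change amounts to dropping the unused parameter;
roughly:

  /* before */
  #define move_pte(pte, prot, old_addr, new_addr)  (pte)

  /* after */
  #define move_pte(pte, old_addr, new_addr)        (pte)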

Link: https://lkml.kernel.org/r/20240327143301.741807-1-david@redhat.com
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Vishal Moola (Oracle) &lt;vishal.moola@gmail.com&gt;
Cc: "David S. Miller" &lt;davem@davemloft.net&gt;
Cc: Andreas Larsson &lt;andreas@gaisler.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
