linux-toradex.git/mm/huge_memory.c, branch v5.12

mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument

2021-03-13T19:27:31+00:00

Rename mem_cgroup_split_huge_fixup() to split_page_memcg() and explicitly
pass in the number of pages as an argument.

This makes the interface name more generic, so that other potential users
can call it.  In addition, the complete memcg info (memcg pointer and
flags) needs to be set on the tail pages.
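
As a rough illustration of what "set the complete info on the tail pages"
means, here is a minimal user-space sketch.  It is not the kernel
implementation: struct toy_page and toy_split_page_memcg() are
hypothetical names, and the single memcg_data word standing in for the
memcg pointer plus flag bits is an assumption made for the example.

```c
/* Toy stand-in for struct page: only the field this sketch needs. */
struct toy_page {
	unsigned long memcg_data;	/* assumed: memcg pointer and flag bits, packed */
};

/*
 * Hypothetical model of split_page_memcg(head, nr): copy the head page's
 * complete memcg_data word (pointer and flags) to each of the nr - 1
 * tail pages.  Returns the number of extra references a caller would
 * need to take on the memcg, one per tail page.
 */
static unsigned int toy_split_page_memcg(struct toy_page *head, unsigned int nr)
{
	unsigned int i;

	/* Uncharged head page, or nothing to split: leave the tails alone. */
	if (!head->memcg_data || nr < 2)
		return 0;

	for (i = 1; i < nr; i++)
		head[i].memcg_data = head->memcg_data;

	return nr - 1;
}
```

Passing nr explicitly is what lets the same helper serve callers that
split something other than a PMD-sized huge page.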

Link: https://lkml.kernel.org/r/20210304074053.65527-2-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui 
Acked-by: Johannes Weiner 
Reviewed-by: Zi Yan 
Reviewed-by: Shakeel Butt 
Acked-by: Michal Hocko 
Cc: Hugh Dickins 
Cc: "Kirill A. Shutemov" 
Cc: Nicholas Piggin 
Cc: Kefeng Wang 
Cc: Hanjun Guo 
Cc: Tianhong Ding 
Cc: Weilong Chen 
Cc: Rui Xiang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: introduce page_needs_cow_for_dma() for deciding whether cow

2021-03-13T19:27:30+00:00

Quite a few places (pte, pmd, pud) explicitly check whether COW should be
broken right away during fork().  It's easier to provide a helper,
especially before doing the same thing for hugetlbfs.

Since we'll reference is_cow_mapping() in mm.h, move it there too.  It
actually suits mm.h better, since internal.h is private to mm/ while mm.h
is exported to the whole kernel.  A follow-up patch could then use
is_cow_mapping() wherever possible across the kernel, since the check is
common but is currently open-coded against VM_* flags.
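
For readers unfamiliar with the check, the following user-space sketch
shows the shape of the two helpers.  The flag values and the
is_cow_mapping() expression mirror include/linux/mm.h as far as I recall
it; toy_page_needs_cow_for_dma() is a hypothetical simplification that
takes plain booleans where the kernel helper inspects the vma, the mm and
the page.

```c
#include <stdbool.h>

/* Redefined here so the sketch builds outside the kernel tree. */
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

typedef unsigned long vm_flags_t;

/* A mapping is COW if it is private (not VM_SHARED) and may become writable. */
static inline bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/*
 * Hypothetical model of the fork()-time decision: copy the page right
 * away only if the mapping is COW, the mm has ever pinned pages, and
 * this page may be pinned for DMA.
 */
static inline bool toy_page_needs_cow_for_dma(vm_flags_t flags,
					      bool mm_has_pinned,
					      bool page_maybe_dma_pinned)
{
	if (!is_cow_mapping(flags))
		return false;
	if (!mm_has_pinned)
		return false;
	return page_maybe_dma_pinned;
}
```

The cheap checks come first, so the common case (no pinned pages in the
mm) returns before the page itself is examined.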

Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
Signed-off-by: Peter Xu 
Reviewed-by: Jason Gunthorpe 
Cc: Alexey Dobriyan 
Cc: Andrea Arcangeli 
Cc: Christoph Hellwig 
Cc: Daniel Vetter 
Cc: David Airlie 
Cc: David Gibson 
Cc: Gal Pressman 
Cc: Jan Kara 
Cc: Jann Horn 
Cc: Kirill Shutemov 
Cc: Kirill Tkhai 
Cc: Matthew Wilcox 
Cc: Miaohe Lin 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Roland Scheidegger 
Cc: VMware Graphics 
Cc: Wei Zhang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm,thp,shmem: limit shmem THP alloc gfp_mask

2021-02-26T17:40:59+00:00

Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.

The allocation flags of anonymous transparent huge pages can be controlled
through the /sys/kernel/mm/transparent_hugepage/defrag file, which can
help keep the system from getting bogged down in the page reclaim and
compaction code when many THPs are getting allocated simultaneously.

However, the gfp_mask for shmem THP allocations was not limited by those
configuration settings, and some workloads ended up with all CPUs stuck on
the LRU lock in the page reclaim code, trying to allocate dozens of THPs
simultaneously.

This patch applies the same configured limitation of THPs to shmem
hugepage allocations, to prevent that from happening.

This way a THP defrag setting of "never" or "defer+madvise" will result in
quick allocation failures without direct reclaim when no 2MB free pages
are available.

With this patch applied, THP allocations for tmpfs will be a little more
aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
less aggressive for files that are not mmapped or mapped without that
flag.

This patch (of 4):

The allocation flags of anonymous transparent huge pages can be controlled
through the /sys/kernel/mm/transparent_hugepage/defrag file, which can
help keep the system from getting bogged down in the page reclaim and
compaction code when many THPs are getting allocated simultaneously.

However, the gfp_mask for shmem THP allocations was not limited by those
configuration settings, and some workloads ended up with all CPUs stuck on
the LRU lock in the page reclaim code, trying to allocate dozens of THPs
simultaneously.

This patch applies the same configured limitation of THPs to shmem
hugepage allocations, to prevent that from happening.

Controlling the gfp_mask of THP allocations through the knobs in sysfs
allows users to determine the balance between how aggressively the system
tries to allocate THPs at fault time, and how much the application may end
up stalling attempting those allocations.

This way a THP defrag setting of "never" or "defer+madvise" will result in
quick allocation failures without direct reclaim when no 2MB free pages
are available.

With this patch applied, THP allocations for tmpfs will be a little more
aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
less aggressive for files that are not mmapped or mapped without that
flag.
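
The interaction between the defrag setting and MADV_HUGEPAGE can be
modelled roughly as follows.  This is a simplified sketch based on the
documented semantics of the defrag knob, not the kernel's gfp_mask code:
the enum and thp_may_direct_reclaim() are hypothetical names, and the
model only answers whether a fault-time THP allocation may enter direct
reclaim, ignoring every other gfp bit.

```c
#include <stdbool.h>

/* The five values accepted by /sys/kernel/mm/transparent_hugepage/defrag. */
enum thp_defrag {
	DEFRAG_ALWAYS,
	DEFRAG_DEFER,
	DEFRAG_DEFER_MADVISE,
	DEFRAG_MADVISE,
	DEFRAG_NEVER,
};

/*
 * Hypothetical model: may a THP allocation stall in direct reclaim?
 * "madvised" means the mapping was marked with MADV_HUGEPAGE.
 */
static bool thp_may_direct_reclaim(enum thp_defrag mode, bool madvised)
{
	switch (mode) {
	case DEFRAG_ALWAYS:
		return true;		/* always stall and try hard */
	case DEFRAG_DEFER:
		return false;		/* only wake background reclaim */
	case DEFRAG_DEFER_MADVISE:
	case DEFRAG_MADVISE:
		return madvised;	/* stall only for madvised regions */
	case DEFRAG_NEVER:
	default:
		return false;		/* fail quickly */
	}
}
```

In this model an unmapped tmpfs file, or one mapped without
MADV_HUGEPAGE, never stalls under "defer+madvise", while a madvised
mapping does, which matches the "a little more / a little less
aggressive" description above.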

Link: https://lkml.kernel.org/r/20201124194925.623931-1-riel@surriel.com
Link: https://lkml.kernel.org/r/20201124194925.623931-2-riel@surriel.com
Signed-off-by: Rik van Riel 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Xu Yu 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Matthew Wilcox (Oracle) 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/pmem: avoid inserting hugepage PTE entry with fsdax if hugepage support is disabled

2021-02-24T21:38:32+00:00

Differentiate between hardware not supporting hugepages and the user
disabling THP via 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'.

For the devdax namespace, the kernel handles the above via the
supported_alignment attribute and by failing to initialize the namespace
if the namespace align value is not supported on the platform.

For the fsdax namespace, the kernel will continue to initialize the
namespace.  This can result in the kernel creating a huge pte entry even
though the hardware doesn't support it.

We do want hugepage support with pmem even if the end-user disabled THP
via sysfs file (/sys/kernel/mm/transparent_hugepage/enabled).  Hence
differentiate between hardware/firmware lacking support vs user-controlled
disable of THP and prevent a huge fault if the hardware lacks hugepage
support.
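
The distinction can be stated as two small predicates.  This is only a
sketch of the policy described above, with hypothetical function names
and boolean inputs; it does not reproduce how the kernel actually tracks
either condition.

```c
#include <stdbool.h>

/*
 * Hypothetical model: an ordinary THP fault needs both hardware support
 * and the user's sysfs setting to allow it.
 */
static bool toy_thp_fault_allowed(bool hw_supports_hugepage,
				  bool user_enabled_thp)
{
	return hw_supports_hugepage && user_enabled_thp;
}

/*
 * Hypothetical model: a DAX (pmem) huge fault ignores the user's sysfs
 * setting but must still be refused when the hardware or firmware
 * cannot map huge pages at all.
 */
static bool toy_dax_huge_fault_allowed(bool hw_supports_hugepage,
				       bool user_enabled_thp)
{
	(void)user_enabled_thp;	/* deliberately not consulted for DAX */
	return hw_supports_hugepage;
}
```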

Link: https://lkml.kernel.org/r/20210205023956.417587-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Reviewed-by: Dan Williams 
Cc: "Kirill A . Shutemov" 
Cc: Jan Kara 
Cc: David Hildenbrand 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/huge_memory.c: remove unused return value of set_huge_zero_page()

2021-02-24T21:38:32+00:00

The return value of set_huge_zero_page() is always ignored, so drop it.

Link: https://lkml.kernel.org/r/20210203084816.46307-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin 
Cc: Mike Kravetz 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/huge_memory.c: update tlb entry if pmd is changed

2021-02-24T21:38:32+00:00

When set_pmd_at() is called in do_huge_pmd_anonymous_page(), a new TLB
entry can be added by software on the MIPS platform.

Add a call to update_mmu_cache_pmd() when the pmd entry is set.
update_mmu_cache_pmd() is defined as empty on all platforms except arc
and mips, so this patch has no effect on other platforms.

Link: http://lkml.kernel.org/r/1592990792-1923-2-git-send-email-maobibo@loongson.cn
Signed-off-by: Bibo Mao 
Cc: Anshuman Khandual 
Cc: Daniel Silsby 
Cc: "Kirill A. Shutemov" 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Thomas Bogendoerfer 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memcontrol: convert NR_SHMEM_THPS account to pages

2021-02-24T21:38:29+00:00

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount to
GBs of memory.  For example, on a 96-CPU system the threshold reaches its
maximum of 125, so the per-cpu counters can cache 23.4375 GB in total.
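
The 23.4375 GB figure can be reproduced with a short calculation.  The
sketch below assumes 4 kB base pages and 512 base pages per THP (2 MB),
and that each CPU can hold up to "threshold" THPs in its per-cpu counter
before flushing; percpu_cached_gib() is a hypothetical helper name.

```c
/*
 * Worst-case memory hidden in per-cpu counters, in GB (2^30 bytes),
 * when the counter unit is one THP:
 *   cpus * threshold * pages_per_thp * page_kib  kB in total.
 */
static double percpu_cached_gib(unsigned int cpus, unsigned int threshold,
				unsigned int pages_per_thp,
				unsigned int page_kib)
{
	double kib = (double)cpus * threshold * pages_per_thp * page_kib;

	return kib / (1024.0 * 1024.0);
}
```

With the numbers from the example, percpu_cached_gib(96, 125, 512, 4)
gives 96 * 125 = 12000 THPs of 2 MB each, that is 24000 MB or
23.4375 GB.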

The THP page is already a form of batched addition (it will add 512 pages'
worth of memory in one go), so skipping the batching seems sensible.  As a
result, every THP stats update overflows the per-cpu counter and resorts
to an atomic global update, but this makes the THP vmstat counters more
accurate.

So we convert the NR_SHMEM_THPS account to pages.  This patch is
consistent with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also makes the units of the vmstat counters more
uniform: they are now pages, kB and bytes.  A B/KB suffix indicates that
the unit is bytes or kB; counters without a suffix are in pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Cc: Alexey Dobriyan 
Cc: Feng Tang 
Cc: Greg Kroah-Hartman 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Joonsoo Kim 
Cc: Michal Hocko 
Cc: NeilBrown 
Cc: Pankaj Gupta 
Cc: Rafael. J. Wysocki 
Cc: Randy Dunlap 
Cc: Roman Gushchin 
Cc: Sami Tolvanen 
Cc: Shakeel Butt 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memcontrol: convert NR_FILE_THPS account to pages

2021-02-24T21:38:29+00:00

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount to
GBs of memory.  For example, on a 96-CPU system the threshold reaches its
maximum of 125, so the per-cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 pages'
worth of memory in one go), so skipping the batching seems sensible.  As a
result, every THP stats update overflows the per-cpu counter and resorts
to an atomic global update, but this makes the THP vmstat counters more
accurate.

So we convert the NR_FILE_THPS account to pages.  This patch is consistent
with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival").
Doing this also makes the units of the vmstat counters more uniform: they
are now pages, kB and bytes.  A B/KB suffix indicates that the unit is
bytes or kB; counters without a suffix are in pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Cc: Alexey Dobriyan 
Cc: Feng Tang 
Cc: Greg Kroah-Hartman 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Joonsoo Kim 
Cc: Michal Hocko 
Cc: NeilBrown 
Cc: Pankaj Gupta 
Cc: Rafael. J. Wysocki 
Cc: Randy Dunlap 
Cc: Roman Gushchin 
Cc: Sami Tolvanen 
Cc: Shakeel Butt 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memcontrol: convert NR_ANON_THPS account to pages

2021-02-24T21:38:29+00:00

Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics, especially for the THP vmstat
counters.  On systems with hundreds of processors the error can amount to
GBs of memory.  For example, on a 96-CPU system the threshold reaches its
maximum of 125, so the per-cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 pages'
worth of memory in one go), so skipping the batching seems sensible.  As a
result, every THP stats update overflows the per-cpu counter and resorts
to an atomic global update, but this makes the THP vmstat counters more
accurate.

So we convert the NR_ANON_THPS account to pages.  This patch is consistent
with 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page arrival").
Doing this also makes the units of the vmstat counters more uniform: they
are now pages, kB and bytes.  A B/KB suffix indicates that the unit is
bytes or kB; counters without a suffix are in pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Cc: Greg Kroah-Hartman 
Cc: Rafael. J. Wysocki 
Cc: Alexey Dobriyan 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Hugh Dickins 
Cc: Shakeel Butt 
Cc: Roman Gushchin 
Cc: Sami Tolvanen 
Cc: Feng Tang 
Cc: NeilBrown 
Cc: Joonsoo Kim 
Cc: Randy Dunlap 
Cc: Michal Hocko 
Cc: Pankaj Gupta 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/filemap: pass a sleep state to put_and_wait_on_page_locked

2021-02-24T21:38:28+00:00

This is prep work for the next patch, but I think at least one of the
current callers would prefer a killable sleep to an uninterruptible one.

Link: https://lkml.kernel.org/r/20210122160140.223228-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Kent Overstreet 
Reviewed-by: Christoph Hellwig 
Cc: Miaohe Lin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds