summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2025-04-01 09:29:18 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2025-04-01 09:29:18 -0700
commiteb0ece16027f8223d5dc9aaf90124f70577bd22a (patch)
tree1e2214cacd123b940ceca684322203643d5e9bc7 /Documentation/admin-guide
parent08733088b566b58283f0f12fb73f5db6a9a9de30 (diff)
parent0a1e082b64ccce165e7307a7b49d22b2504f9d1f (diff)
Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...
Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r--Documentation/admin-guide/blockdev/zram.rst36
-rw-r--r--Documentation/admin-guide/cgroup-v1/memory.rst4
-rw-r--r--Documentation/admin-guide/cgroup-v2.rst25
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt30
-rw-r--r--Documentation/admin-guide/mm/cma_debugfs.rst10
-rw-r--r--Documentation/admin-guide/mm/damon/usage.rst87
-rw-r--r--Documentation/admin-guide/mm/hugetlbpage.rst10
-rw-r--r--Documentation/admin-guide/mm/pagemap.rst21
-rw-r--r--Documentation/admin-guide/mm/zswap.rst10
-rw-r--r--Documentation/admin-guide/sysctl/vm.rst9
10 files changed, 172 insertions, 70 deletions
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 1576fb93f06c..9bdb30901a93 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -54,7 +54,7 @@ The list of possible return codes:
If you use 'echo', the returned value is set by the 'echo' utility,
and, in general case, something like::
- echo 3 > /sys/block/zram0/max_comp_streams
+ echo foo > /sys/block/zram0/comp_algorithm
if [ $? -ne 0 ]; then
handle_error
fi
@@ -73,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3}
num_devices parameter is optional and tells zram how many devices should be
pre-created. Default: 1.
-2) Set max number of compression streams
-========================================
-
-Regardless of the value passed to this attribute, ZRAM will always
-allocate multiple compression streams - one per online CPU - thus
-allowing several concurrent compression operations. The number of
-allocated compression streams goes down when some of the CPUs
-become offline. There is no single-compression-stream mode anymore,
-unless you are running a UP system or have only 1 CPU online.
-
-To find out how many streams are currently available::
-
- cat /sys/block/zram0/max_comp_streams
-
-3) Select compression algorithm
+2) Select compression algorithm
===============================
Using comp_algorithm device attribute one can see available and
@@ -107,7 +93,7 @@ Examples::
For the time being, the `comp_algorithm` content shows only compression
algorithms that are supported by zram.
-4) Set compression algorithm parameters: Optional
+3) Set compression algorithm parameters: Optional
=================================================
Compression algorithms may support specific parameters which can be
@@ -138,7 +124,7 @@ better the compression ratio, it even can take negatives values for some
algorithms), for other algorithms `level` is acceleration level (the higher
the value the lower the compression ratio).
-5) Set Disksize
+4) Set Disksize
===============
Set disk size by writing the value to sysfs node 'disksize'.
@@ -158,7 +144,7 @@ There is little point creating a zram of greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
-6) Set memory limit: Optional
+5) Set memory limit: Optional
=============================
Set memory limit by writing the value to sysfs node 'mem_limit'.
@@ -177,7 +163,7 @@ Examples::
# To disable memory limit
echo 0 > /sys/block/zram0/mem_limit
-7) Activate
+6) Activate
===========
::
@@ -188,7 +174,7 @@ Examples::
mkfs.ext4 /dev/zram1
mount /dev/zram1 /tmp
-8) Add/remove zram devices
+7) Add/remove zram devices
==========================
zram provides a control interface, which enables dynamic (on-demand) device
@@ -208,7 +194,7 @@ execute::
echo X > /sys/class/zram-control/hot_remove
-9) Stats
+8) Stats
========
Per-device statistics are exported as various nodes under /sys/block/zram<id>/
@@ -228,8 +214,6 @@ mem_limit WO specifies the maximum amount of memory ZRAM can
writeback_limit WO specifies the maximum amount of write IO zram
can write out to backing device as 4KB unit
writeback_limit_enable RW show and set writeback_limit feature
-max_comp_streams RW the number of possible concurrent compress
- operations
comp_algorithm RW show and change the compression algorithm
algorithm_params WO setup compression algorithm parameters
compact WO trigger memory compaction
@@ -310,7 +294,7 @@ a single line of text and contains the following stats separated by whitespace:
Unit: 4K bytes
============== =============================================================
-10) Deactivate
+9) Deactivate
==============
::
@@ -318,7 +302,7 @@ a single line of text and contains the following stats separated by whitespace:
swapoff /dev/zram0
umount /dev/zram1
-11) Reset
+10) Reset
=========
Write any positive value to 'reset' sysfs node::
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 02b8206a3594..d6b1db8cc7eb 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -610,6 +610,10 @@ memory.stat file includes following statistics:
'rss + mapped_file" will give you resident set size of cgroup.
+ Note that some kernel configurations might account complete larger
+ allocations (e.g., THP) towards 'rss' and 'mapped_file', even if
+ only some, but not all that memory is mapped.
+
(Note: file and shmem may be shared among other cgroups. In that case,
mapped_file is accounted only when the memory cgroup is owner of page
cache.)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f293a13b42ed..1a16ce68a4d7 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1445,7 +1445,10 @@ The following nested keys are defined.
anon
Amount of memory used in anonymous mappings such as
- brk(), sbrk(), and mmap(MAP_ANONYMOUS)
+ brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that
+ some kernel configurations might account complete larger
+ allocations (e.g., THP) if only some, but not all the
+ memory of such an allocation is mapped anymore.
file
Amount of memory used to cache filesystem data,
@@ -1488,7 +1491,10 @@ The following nested keys are defined.
Amount of application memory swapped out to zswap.
file_mapped
- Amount of cached filesystem data mapped with mmap()
+ Amount of cached filesystem data mapped with mmap(). Note
+ that some kernel configurations might account complete
+ larger allocations (e.g., THP) if only some, but not
+ not all the memory of such an allocation is mapped.
file_dirty
Amount of cached filesystem data that was modified but
@@ -1560,6 +1566,12 @@ The following nested keys are defined.
workingset_nodereclaim
Number of times a shadow node has been reclaimed
+ pswpin (npn)
+ Number of pages swapped into memory
+
+ pswpout (npn)
+ Number of pages swapped out of memory
+
pgscan (npn)
Amount of scanned pages (in an inactive LRU list)
@@ -1575,6 +1587,9 @@ The following nested keys are defined.
pgscan_khugepaged (npn)
Amount of scanned pages by khugepaged (in an inactive LRU list)
+ pgscan_proactive (npn)
+ Amount of scanned pages proactively (in an inactive LRU list)
+
pgsteal_kswapd (npn)
Amount of reclaimed pages by kswapd
@@ -1584,6 +1599,9 @@ The following nested keys are defined.
pgsteal_khugepaged (npn)
Amount of reclaimed pages by khugepaged
+ pgsteal_proactive (npn)
+ Amount of reclaimed pages proactively
+
pgfault (npn)
Total number of page faults incurred
@@ -1661,6 +1679,9 @@ The following nested keys are defined.
pgdemote_khugepaged
Number of pages demoted by khugepaged.
+ pgdemote_proactive
+ Number of pages demoted by proactively.
+
hugetlb
Amount of memory used by hugetlb pages. This metric only shows
up if hugetlb usage is accounted for in memory.current (i.e.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3435a062a208..559f4fe51824 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1866,7 +1866,7 @@
hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
- hugepages= [HW] Number of HugeTLB pages to allocate at boot.
+ hugepages= [HW,EARLY] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
@@ -1878,15 +1878,24 @@
<node>:<integer>[,<node>:<integer>]
hugepagesz=
- [HW] The size of the HugeTLB pages. This is used in
- conjunction with hugepages (above) to allocate huge
- pages of a specific size at boot. The pair
- hugepagesz=X hugepages=Y can be specified once for
- each supported huge page size. Huge page sizes are
- architecture dependent. See also
+ [HW,EARLY] The size of the HugeTLB pages. This is
+ used in conjunction with hugepages (above) to
+ allocate huge pages of a specific size at boot. The
+ pair hugepagesz=X hugepages=Y can be specified once
+ for each supported huge page size. Huge page sizes
+ are architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
+ hugepage_alloc_threads=
+ [HW] The number of threads that should be used to
+ allocate hugepages during boot. This option can be
+ used to improve system bootup time when allocating
+ a large amount of huge pages.
+ The default value is 25% of the available hardware threads.
+
+ Note that this parameter only applies to non-gigantic huge pages.
+
hugetlb_cma= [HW,CMA,EARLY] The size of a CMA area used for allocation
of gigantic hugepages. Or using node format, the size
of a CMA area per node can be specified.
@@ -1897,6 +1906,13 @@
hugepages using the CMA allocator. If enabled, the
boot-time allocation of gigantic hugepages is skipped.
+ hugetlb_cma_only=
+ [HW,CMA,EARLY] When allocating new HugeTLB pages, only
+ try to allocate from the CMA areas.
+
+ This option does nothing if hugetlb_cma= is not also
+ specified.
+
hugetlb_free_vmemmap=
[KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
enabled.
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
index 7367e6294ef6..4120e9cb0cd5 100644
--- a/Documentation/admin-guide/mm/cma_debugfs.rst
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -12,10 +12,16 @@ its CMA name like below:
The structure of the files created under that directory is as follows:
- - [RO] base_pfn: The base PFN (Page Frame Number) of the zone.
+ - [RO] base_pfn: The base PFN (Page Frame Number) of the CMA area.
+ This is the same as ranges/0/base_pfn.
- [RO] count: Amount of memory in the CMA area.
- [RO] order_per_bit: Order of pages represented by one bit.
- - [RO] bitmap: The bitmap of page states in the zone.
+ - [RO] bitmap: The bitmap of allocated pages in the area.
+ This is the same as ranges/0/base_pfn.
+ - [RO] ranges/N/base_pfn: The base PFN of contiguous range N
+ in the CMA area.
+ - [RO] ranges/N/bitmap: The bit map of allocated pages in
+ range N in the CMA area.
- [WO] alloc: Allocate N pages from that CMA area. For example::
echo 5 > <debugfs>/cma/<cma_name>/alloc
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 47a44bd348ab..ced2013db3df 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -64,6 +64,7 @@ comma (",").
│ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
│ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
│ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
+ │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us
│ │ │ │ │ │ nr_regions/min,max
│ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
│ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target
@@ -82,8 +83,8 @@ comma (",").
│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
│ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
- │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
- │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx
+ │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters
+ │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max
│ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds
│ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed
@@ -132,6 +133,11 @@ Users can write below commands for the kdamond to the ``state`` file.
- ``off``: Stop running.
- ``commit``: Read the user inputs in the sysfs files except ``state`` file
again.
+- ``update_tuned_intervals``: Update the contents of ``sample_us`` and
+ ``aggr_us`` files of the kdamond with the auto-tuning applied ``sampling
+ interval`` and ``aggregation interval`` for the files. Please refer to
+ :ref:`intervals_goal section <damon_usage_sysfs_monitoring_intervals_goal>`
+ for more details.
- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes'
:ref:`quota goals <sysfs_schemes_quota_goals>`.
- ``update_schemes_stats``: Update the contents of stats files for each
@@ -213,6 +219,25 @@ writing to and rading from the files.
For more details about the intervals and monitoring regions range, please refer
to the Design document (:doc:`/mm/damon/design`).
+.. _damon_usage_sysfs_monitoring_intervals_goal:
+
+contexts/<N>/monitoring_attrs/intervals/intervals_goal/
+-------------------------------------------------------
+
+Under the ``intervals`` directory, one directory for automated tuning of
+``sample_us`` and ``aggr_us``, namely ``intervals_goal`` directory also exists.
+Under the directory, four files for the auto-tuning control, namely
+``access_bp``, ``aggrs``, ``min_sample_us`` and ``max_sample_us`` exist.
+Please refer to the :ref:`design document of the feature
+<damon_design_monitoring_intervals_autotuning>` for the internal of the tuning
+mechanism. Reading and writing the four files under ``intervals_goal``
+directory shows and updates the tuning parameters that described in the
+:ref:design doc <damon_design_monitoring_intervals_autotuning>` with the same
+names. The tuning starts with the user-set ``sample_us`` and ``aggr_us``. The
+tuning-applied current values of the two intervals can be read from the
+``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to
+the ``state`` file.
+
.. _sysfs_targets:
contexts/<N>/targets/
@@ -282,9 +307,10 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme.
schemes/<N>/
------------
-In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files
-(``action``, ``target_nid`` and ``apply_interval``) exist.
+In each scheme directory, seven directories (``access_pattern``, ``quotas``,
+``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``stats``, and
+``tried_regions``) and three files (``action``, ``target_nid`` and
+``apply_interval``) exist.
The ``action`` file is for setting and getting the scheme's :ref:`action
<damon_design_damos_action>`. The keywords that can be written to and read
@@ -395,33 +421,43 @@ The ``interval`` should written in microseconds unit.
.. _sysfs_filters:
-schemes/<N>/filters/
---------------------
+schemes/<N>/{core\_,ops\_,}filters/
+-----------------------------------
-The directory for the :ref:`filters <damon_design_damos_filters>` of the given
+Directories for :ref:`filters <damon_design_damos_filters>` of the given
DAMON-based operation scheme.
-In the beginning, this directory has only one file, ``nr_filters``. Writing a
+``core_filters`` and ``ops_filters`` directories are for the filters handled by
+the DAMON core layer and operations set layer, respectively. ``filters``
+directory can be used for installing filters regardless of their handled
+layers. Filters that requested by ``core_filters`` and ``ops_filters`` will be
+installed before those of ``filters``. All three directories have same files.
+
+Use of ``filters`` directory can make expecting evaluation orders of given
+filters with the files under directory bit confusing. Users are hence
+recommended to use ``core_filters`` and ``ops_filters`` directories. The
+``filters`` directory could be deprecated in future.
+
+In the beginning, the directory has only one file, ``nr_filters``. Writing a
number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each filter. The filters are evaluated
in the numeric order.
-Each filter directory contains seven files, namely ``type``, ``matching``,
-``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``.
-To ``type`` file, you can write one of five special keywords: ``anon`` for
-anonymous pages, ``memcg`` for specific memory cgroup, ``young`` for young
-pages, ``addr`` for specific address range (an open-ended interval), or
-``target`` for specific DAMON monitoring target filtering. Meaning of the
-types are same to the description on the :ref:`design doc
-<damon_design_damos_filters>`.
-
-In case of the memory cgroup filtering, you can specify the memory cgroup of
-the interest by writing the path of the memory cgroup from the cgroups mount
-point to ``memcg_path`` file. In case of the address range filtering, you can
-specify the start and end address of the range to ``addr_start`` and
-``addr_end`` files, respectively. For the DAMON monitoring target filtering,
-you can specify the index of the target between the list of the DAMON context's
-monitoring targets list to ``target_idx`` file.
+Each filter directory contains nine files, namely ``type``, ``matching``,
+``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max``
+and ``target_idx``. To ``type`` file, you can write the type of the filter.
+Refer to :ref:`the design doc <damon_design_damos_filters>` for available type
+names, their meaning and on what layer those are handled.
+
+For ``memcg`` type, you can specify the memory cgroup of the interest by
+writing the path of the memory cgroup from the cgroups mount point to
+``memcg_path`` file. For ``addr`` type, you can specify the start and end
+address of the range (open-ended interval) to ``addr_start`` and ``addr_end``
+files, respectively. For ``hugepage_size`` type, you can specify the minimum
+and maximum size of the range (closed interval) to ``min`` and ``max`` files,
+respectively. For ``target`` type, you can specify the index of the target
+between the list of the DAMON context's monitoring targets list to
+``target_idx`` file.
You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter
is for memory that matches the ``type``. You can write ``Y`` or ``N`` to
@@ -431,6 +467,7 @@ the ``type`` and ``matching`` should be allowed or not.
For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``.::
+ # cd ops_filters/0/
# echo 2 > nr_filters
# # disallow anonymous pages
echo anon > 0/type
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index f34a0d798d5b..67a941903fd2 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -145,7 +145,17 @@ hugepages
It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
If the node number is invalid, the parameter will be ignored.
+hugepage_alloc_threads
+ Specify the number of threads that should be used to allocate hugepages
+ during boot. This parameter can be used to improve system bootup time
+ when allocating a large amount of huge pages.
+ The default value is 25% of the available hardware threads.
+ Example to use 8 allocation threads::
+
+ hugepage_alloc_threads=8
+
+ Note that this parameter only applies to non-gigantic huge pages.
default_hugepagesz
Specify the default huge page size. This parameter can
only be specified once on the command line. default_hugepagesz can
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index caba0f52dd36..afce291649dd 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -21,7 +21,8 @@ There are four components to pagemap:
* Bit 56 page exclusively mapped (since 4.2)
* Bit 57 pte is uffd-wp write-protected (since 5.13) (see
Documentation/admin-guide/mm/userfaultfd.rst)
- * Bits 58-60 zero
+ * Bit 58 pte is a guard region (since 6.15) (see madvise (2) man page)
+ * Bits 59-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
* Bit 62 page swapped
* Bit 63 page present
@@ -37,12 +38,28 @@ There are four components to pagemap:
precisely which pages are mapped (or in swap) and comparing mapped
pages between processes.
+ Traditionally, bit 56 indicates that a page is mapped exactly once and bit
+ 56 is clear when a page is mapped multiple times, even when mapped in the
+ same process multiple times. In some kernel configurations, the semantics
+ for pages part of a larger allocation (e.g., THP) can differ: bit 56 is set
+ if all pages part of the corresponding large allocation are *certainly*
+ mapped in the same process, even if the page is mapped multiple times in that
+ process. Bit 56 is clear when any page page of the larger allocation
+ is *maybe* mapped in a different process. In some cases, a large allocation
+ might be treated as "maybe mapped by multiple processes" even though this
+ is no longer the case.
+
Efficient users of this interface will use ``/proc/pid/maps`` to
determine which areas of memory are actually mapped and llseek to
skip over unmapped regions.
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
- times each page is mapped, indexed by PFN.
+ times each page is mapped, indexed by PFN. Some kernel configurations do
+ not track the precise number of times a page part of a larger allocation
+ (e.g., THP) is mapped. In these configurations, the average number of
+ mappings per page in this larger allocation is returned instead. However,
+ if any page of the large allocation is mapped, the returned value will
+ be at least 1.
The page-types tool in the tools/mm directory can be used to query the
number of times a page is mapped.
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 3598dcd7dbe7..fd3370aa43fe 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -60,15 +60,13 @@ accessed. The compressed memory pool grows on demand and shrinks as compressed
pages are freed. The pool is not preallocated. By default, a zpool
of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
but it can be overridden at boot time by setting the ``zpool`` attribute,
-e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
+e.g. ``zswap.zpool=zsmalloc``. It can also be changed at runtime using the sysfs
``zpool`` attribute, e.g.::
- echo zbud > /sys/module/zswap/parameters/zpool
+ echo zsmalloc > /sys/module/zswap/parameters/zpool
-The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which
-means the compression ratio will always be 2:1 or worse (because of half-full
-zbud pages). The zsmalloc type zpool has a more complex compressed page
-storage method, and it can achieve greater storage densities.
+The zsmalloc type zpool has a complex compressed page storage method, and it
+can achieve great storage densities.
When a swap page is passed from swapout to zswap, zswap maintains a mapping
of the swap entry, a combination of the swap type and swap offset, to the zpool
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f48eaa98d22d..8290177b4f75 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
+- defrag_mode
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
@@ -145,6 +146,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.
+defrag_mode
+===========
+
+When set to 1, the page allocator tries harder to avoid fragmentation
+and maintain the ability to produce huge pages / higher-order pages.
+
+It is recommended to enable this right after boot, as fragmentation,
+once it occurred, can be long-lasting or even permanent.
dirty_background_bytes
======================