| author | Linus Torvalds <torvalds@linux-foundation.org> | 2025-04-01 09:29:18 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2025-04-01 09:29:18 -0700 |
| commit | eb0ece16027f8223d5dc9aaf90124f70577bd22a (patch) | |
| tree | 1e2214cacd123b940ceca684322203643d5e9bc7 /Documentation/admin-guide | |
| parent | 08733088b566b58283f0f12fb73f5db6a9a9de30 (diff) | |
| parent | 0a1e082b64ccce165e7307a7b49d22b2504f9d1f (diff) | |
Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively minor cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found when using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes things better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
removes the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DAMON/DAMOS to filter by the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while running our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatory cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements feedback-driven automatic tuning of DAMON's
aggregation interval.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallelization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documentation
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permitting the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatory refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
Diffstat (limited to 'Documentation/admin-guide')
| -rw-r--r-- | Documentation/admin-guide/blockdev/zram.rst | 36 | ||||
| -rw-r--r-- | Documentation/admin-guide/cgroup-v1/memory.rst | 4 | ||||
| -rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 25 | ||||
| -rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 30 | ||||
| -rw-r--r-- | Documentation/admin-guide/mm/cma_debugfs.rst | 10 | ||||
| -rw-r--r-- | Documentation/admin-guide/mm/damon/usage.rst | 87 | ||||
| -rw-r--r-- | Documentation/admin-guide/mm/hugetlbpage.rst | 10 | ||||
| -rw-r--r-- | Documentation/admin-guide/mm/pagemap.rst | 21 | ||||
| -rw-r--r-- | Documentation/admin-guide/mm/zswap.rst | 10 | ||||
| -rw-r--r-- | Documentation/admin-guide/sysctl/vm.rst | 9 |
10 files changed, 172 insertions, 70 deletions
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 1576fb93f06c..9bdb30901a93 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -54,7 +54,7 @@ The list of possible return codes:
 If you use 'echo', the returned value is set by the 'echo' utility,
 and, in general case, something like::

-	echo 3 > /sys/block/zram0/max_comp_streams
+	echo foo > /sys/block/zram0/comp_algorithm
 	if [ $? -ne 0 ]; then
 		handle_error
 	fi
@@ -73,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3}
 num_devices parameter is optional and tells zram how many devices should be
 pre-created. Default: 1.

-2) Set max number of compression streams
-========================================
-
-Regardless of the value passed to this attribute, ZRAM will always
-allocate multiple compression streams - one per online CPU - thus
-allowing several concurrent compression operations. The number of
-allocated compression streams goes down when some of the CPUs
-become offline. There is no single-compression-stream mode anymore,
-unless you are running a UP system or have only 1 CPU online.
-
-To find out how many streams are currently available::
-
-	cat /sys/block/zram0/max_comp_streams
-
-3) Select compression algorithm
+2) Select compression algorithm
 ===============================

 Using comp_algorithm device attribute one can see available and
@@ -107,7 +93,7 @@ Examples::
 For the time being, the `comp_algorithm` content shows only compression
 algorithms that are supported by zram.

-4) Set compression algorithm parameters: Optional
+3) Set compression algorithm parameters: Optional
 =================================================

 Compression algorithms may support specific parameters which can be
@@ -138,7 +124,7 @@ better the compression ratio, it even can take negatives values for some
 algorithms), for other algorithms `level` is acceleration level (the higher
 the value the lower the compression ratio).

-5) Set Disksize
+4) Set Disksize
 ===============

 Set disk size by writing the value to sysfs node 'disksize'.
@@ -158,7 +144,7 @@ There is little point creating a zram of greater than twice the size of memory
 since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
 size of the disk when not in use so a huge zram is wasteful.

-6) Set memory limit: Optional
+5) Set memory limit: Optional
 =============================

 Set memory limit by writing the value to sysfs node 'mem_limit'.
@@ -177,7 +163,7 @@ Examples::
 	# To disable memory limit
 	echo 0 > /sys/block/zram0/mem_limit

-7) Activate
+6) Activate
 ===========

 ::
@@ -188,7 +174,7 @@ Examples::
 	mkfs.ext4 /dev/zram1
 	mount /dev/zram1 /tmp

-8) Add/remove zram devices
+7) Add/remove zram devices
 ==========================

 zram provides a control interface, which enables dynamic (on-demand) device
@@ -208,7 +194,7 @@ execute::

 	echo X > /sys/class/zram-control/hot_remove

-9) Stats
+8) Stats
 ========

 Per-device statistics are exported as various nodes under /sys/block/zram<id>/
@@ -228,8 +214,6 @@ mem_limit WO specifies the maximum amount of memory ZRAM can
 writeback_limit         WO  specifies the maximum amount of write IO zram
                             can write out to backing device as 4KB unit
 writeback_limit_enable  RW  show and set writeback_limit feature
-max_comp_streams        RW  the number of possible concurrent compress
-                            operations
 comp_algorithm          RW  show and change the compression algorithm
 algorithm_params        WO  setup compression algorithm parameters
 compact                 WO  trigger memory compaction
@@ -310,7 +294,7 @@ a single line of text and contains the following stats separated by whitespace:
                 Unit: 4K bytes
 ============== =============================================================

-10) Deactivate
+9) Deactivate
 ==============

 ::
@@ -318,7 +302,7 @@ a single line of text and contains the following stats separated by whitespace:
 	swapoff /dev/zram0
 	umount /dev/zram1

-11) Reset
+10) Reset
 =========

 Write any positive value to 'reset' sysfs node::
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 02b8206a3594..d6b1db8cc7eb 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -610,6 +610,10 @@ memory.stat file includes following statistics:

 	'rss + mapped_file" will give you resident set size of cgroup.

+	Note that some kernel configurations might account complete larger
+	allocations (e.g., THP) towards 'rss' and 'mapped_file', even if
+	only some, but not all that memory is mapped.
+
 	(Note: file and shmem may be shared among other cgroups. In that case,
 	mapped_file is accounted only when the memory cgroup is owner of page
 	cache.)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f293a13b42ed..1a16ce68a4d7 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1445,7 +1445,10 @@ The following nested keys are defined.

 	  anon
 		Amount of memory used in anonymous mappings such as
-		brk(), sbrk(), and mmap(MAP_ANONYMOUS)
+		brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that
+		some kernel configurations might account complete larger
+		allocations (e.g., THP) if only some, but not all the
+		memory of such an allocation is mapped anymore.

 	  file
 		Amount of memory used to cache filesystem data,
@@ -1488,7 +1491,10 @@ The following nested keys are defined.
 		Amount of application memory swapped out to zswap.

 	  file_mapped
-		Amount of cached filesystem data mapped with mmap()
+		Amount of cached filesystem data mapped with mmap(). Note
+		that some kernel configurations might account complete
+		larger allocations (e.g., THP) if only some, but not
+		not all the memory of such an allocation is mapped.

 	  file_dirty
 		Amount of cached filesystem data that was modified but
@@ -1560,6 +1566,12 @@ The following nested keys are defined.
 	  workingset_nodereclaim
 		Number of times a shadow node has been reclaimed

+	  pswpin (npn)
+		Number of pages swapped into memory
+
+	  pswpout (npn)
+		Number of pages swapped out of memory
+
 	  pgscan (npn)
 		Amount of scanned pages (in an inactive LRU list)

@@ -1575,6 +1587,9 @@ The following nested keys are defined.
 	  pgscan_khugepaged (npn)
 		Amount of scanned pages by khugepaged (in an inactive LRU list)

+	  pgscan_proactive (npn)
+		Amount of scanned pages proactively (in an inactive LRU list)
+
 	  pgsteal_kswapd (npn)
 		Amount of reclaimed pages by kswapd

@@ -1584,6 +1599,9 @@ The following nested keys are defined.
 	  pgsteal_khugepaged (npn)
 		Amount of reclaimed pages by khugepaged

+	  pgsteal_proactive (npn)
+		Amount of reclaimed pages proactively
+
 	  pgfault (npn)
 		Total number of page faults incurred

@@ -1661,6 +1679,9 @@ The following nested keys are defined.
 	  pgdemote_khugepaged
 		Number of pages demoted by khugepaged.

+	  pgdemote_proactive
+		Number of pages demoted by proactively.
+
 	  hugetlb
 		Amount of memory used by hugetlb pages. This metric only
 		shows up if hugetlb usage is accounted for in memory.current (i.e.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3435a062a208..559f4fe51824 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1866,7 +1866,7 @@
 	hpet_mmap=	[X86, HPET_MMAP] Allow userspace to mmap HPET
 			registers. Default set by CONFIG_HPET_MMAP_DEFAULT.

-	hugepages=	[HW] Number of HugeTLB pages to allocate at boot.
+	hugepages=	[HW,EARLY] Number of HugeTLB pages to allocate at boot.
 			If this follows hugepagesz (below), it specifies
 			the number of pages of hugepagesz to be allocated.
 			If this is the first HugeTLB parameter on the command
@@ -1878,15 +1878,24 @@
 				<node>:<integer>[,<node>:<integer>]

 	hugepagesz=
-			[HW] The size of the HugeTLB pages. This is used in
-			conjunction with hugepages (above) to allocate huge
-			pages of a specific size at boot. The pair
-			hugepagesz=X hugepages=Y can be specified once for
-			each supported huge page size. Huge page sizes are
-			architecture dependent. See also
+			[HW,EARLY] The size of the HugeTLB pages. This is
+			used in conjunction with hugepages (above) to
+			allocate huge pages of a specific size at boot. The
+			pair hugepagesz=X hugepages=Y can be specified once
+			for each supported huge page size. Huge page sizes
+			are architecture dependent. See also
 			Documentation/admin-guide/mm/hugetlbpage.rst.
 			Format: size[KMG]

+	hugepage_alloc_threads=
+			[HW] The number of threads that should be used to
+			allocate hugepages during boot. This option can be
+			used to improve system bootup time when allocating
+			a large amount of huge pages.
+			The default value is 25% of the available hardware threads.
+
+			Note that this parameter only applies to non-gigantic huge pages.
+
 	hugetlb_cma=	[HW,CMA,EARLY] The size of a CMA area used for allocation
 			of gigantic hugepages. Or using node format, the size
 			of a CMA area per node can be specified.
@@ -1897,6 +1906,13 @@
 			hugepages using the CMA allocator. If enabled, the
 			boot-time allocation of gigantic hugepages is skipped.

+	hugetlb_cma_only=
+			[HW,CMA,EARLY] When allocating new HugeTLB pages, only
+			try to allocate from the CMA areas.
+
+			This option does nothing if hugetlb_cma= is not also
+			specified.
+
 	hugetlb_free_vmemmap=
 			[KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 			enabled.
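The cgroup-v2.rst hunks above add several flat keys to memory.stat (pswpin, pswpout, pgscan_proactive, pgsteal_proactive). As a rough illustration of how a monitoring script might pick them up, here is a minimal Python sketch. The helper names and the sample values are made up for this example and are not part of the patch; the only assumption about the file is that it consists of `key value` lines, as the documentation describes.

```python
def parse_memory_stat(text):
    """Parse the flat 'key value' lines of a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that is not a simple "name integer" pair.
        if len(fields) == 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats


def reclaim_counters(stats):
    """Return only the counters documented by this merge, if present.

    Kernels that predate the change will not report these keys, so
    missing entries are left out rather than reported as zero.
    """
    wanted = ("pswpin", "pswpout", "pgscan_proactive", "pgsteal_proactive")
    return {key: stats[key] for key in wanted if key in stats}


# Hypothetical file contents; the numbers are invented for illustration.
SAMPLE = """\
anon 4096
file 8192
pswpin 3
pswpout 7
pgscan_proactive 120
pgsteal_proactive 95
"""

if __name__ == "__main__":
    print(reclaim_counters(parse_memory_stat(SAMPLE)))
```

On a real system the text would come from the memory.stat file of the cgroup of interest; the exact path depends on where cgroup v2 is mounted.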
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
index 7367e6294ef6..4120e9cb0cd5 100644
--- a/Documentation/admin-guide/mm/cma_debugfs.rst
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -12,10 +12,16 @@ its CMA name like below:

 The structure of the files created under that directory is as follows:

- - [RO] base_pfn: The base PFN (Page Frame Number) of the zone.
+ - [RO] base_pfn: The base PFN (Page Frame Number) of the CMA area.
+        This is the same as ranges/0/base_pfn.
  - [RO] count: Amount of memory in the CMA area.
  - [RO] order_per_bit: Order of pages represented by one bit.
- - [RO] bitmap: The bitmap of page states in the zone.
+ - [RO] bitmap: The bitmap of allocated pages in the area.
+        This is the same as ranges/0/base_pfn.
+ - [RO] ranges/N/base_pfn: The base PFN of contiguous range N
+        in the CMA area.
+ - [RO] ranges/N/bitmap: The bit map of allocated pages in
+        range N in the CMA area.
  - [WO] alloc: Allocate N pages from that CMA area. For example::

 	echo 5 > <debugfs>/cma/<cma_name>/alloc
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 47a44bd348ab..ced2013db3df 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -64,6 +64,7 @@ comma (",").
     │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
     │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
     │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
+    │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us
     │ │ │ │ │ │ nr_regions/min,max
     │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
     │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target
@@ -82,8 +83,8 @@ comma (",").
     │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
     │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
     │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
-    │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
-    │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx
+    │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters
+    │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max
     │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds
     │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
     │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed
@@ -132,6 +133,11 @@ Users can write below commands for the kdamond to the ``state`` file.
 - ``off``: Stop running.
 - ``commit``: Read the user inputs in the sysfs files except ``state`` file
   again.
+- ``update_tuned_intervals``: Update the contents of ``sample_us`` and
+  ``aggr_us`` files of the kdamond with the auto-tuning applied ``sampling
+  interval`` and ``aggregation interval`` for the files. Please refer to
+  :ref:`intervals_goal section <damon_usage_sysfs_monitoring_intervals_goal>`
+  for more details.
 - ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes'
   :ref:`quota goals <sysfs_schemes_quota_goals>`.
 - ``update_schemes_stats``: Update the contents of stats files for each
@@ -213,6 +219,25 @@ writing to and rading from the files.
 For more details about the intervals and monitoring regions range, please refer
 to the Design document (:doc:`/mm/damon/design`).

+.. _damon_usage_sysfs_monitoring_intervals_goal:
+
+contexts/<N>/monitoring_attrs/intervals/intervals_goal/
+-------------------------------------------------------
+
+Under the ``intervals`` directory, one directory for automated tuning of
+``sample_us`` and ``aggr_us``, namely ``intervals_goal`` directory also exists.
+Under the directory, four files for the auto-tuning control, namely
+``access_bp``, ``aggrs``, ``min_sample_us`` and ``max_sample_us`` exist.
+Please refer to the :ref:`design document of the feature
+<damon_design_monitoring_intervals_autotuning>` for the internal of the tuning
+mechanism. Reading and writing the four files under ``intervals_goal``
+directory shows and updates the tuning parameters that described in the
+:ref:design doc <damon_design_monitoring_intervals_autotuning>` with the same
+names. The tuning starts with the user-set ``sample_us`` and ``aggr_us``. The
+tuning-applied current values of the two intervals can be read from the
+``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to
+the ``state`` file.
+
 .. _sysfs_targets:

 contexts/<N>/targets/
@@ -282,9 +307,10 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme.
 schemes/<N>/
 ------------

-In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files
-(``action``, ``target_nid`` and ``apply_interval``) exist.
+In each scheme directory, seven directories (``access_pattern``, ``quotas``,
+``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``stats``, and
+``tried_regions``) and three files (``action``, ``target_nid`` and
+``apply_interval``) exist.

 The ``action`` file is for setting and getting the scheme's :ref:`action
 <damon_design_damos_action>`. The keywords that can be written to and read
@@ -395,33 +421,43 @@ The ``interval`` should written in microseconds unit.

 .. _sysfs_filters:

-schemes/<N>/filters/
---------------------
+schemes/<N>/{core\_,ops\_,}filters/
+-----------------------------------

-The directory for the :ref:`filters <damon_design_damos_filters>` of the given
+Directories for :ref:`filters <damon_design_damos_filters>` of the given
 DAMON-based operation scheme.
-In the beginning, this directory has only one file, ``nr_filters``. Writing a +``core_filters`` and ``ops_filters`` directories are for the filters handled by +the DAMON core layer and operations set layer, respectively. ``filters`` +directory can be used for installing filters regardless of their handled +layers. Filters that requested by ``core_filters`` and ``ops_filters`` will be +installed before those of ``filters``. All three directories have same files. + +Use of ``filters`` directory can make expecting evaluation orders of given +filters with the files under directory bit confusing. Users are hence +recommended to use ``core_filters`` and ``ops_filters`` directories. The +``filters`` directory could be deprecated in future. + +In the beginning, the directory has only one file, ``nr_filters``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each filter. The filters are evaluated in the numeric order. -Each filter directory contains seven files, namely ``type``, ``matching``, -``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. -To ``type`` file, you can write one of five special keywords: ``anon`` for -anonymous pages, ``memcg`` for specific memory cgroup, ``young`` for young -pages, ``addr`` for specific address range (an open-ended interval), or -``target`` for specific DAMON monitoring target filtering. Meaning of the -types are same to the description on the :ref:`design doc -<damon_design_damos_filters>`. - -In case of the memory cgroup filtering, you can specify the memory cgroup of -the interest by writing the path of the memory cgroup from the cgroups mount -point to ``memcg_path`` file. In case of the address range filtering, you can -specify the start and end address of the range to ``addr_start`` and -``addr_end`` files, respectively. 
For the DAMON monitoring target filtering, -you can specify the index of the target between the list of the DAMON context's -monitoring targets list to ``target_idx`` file. +Each filter directory contains nine files, namely ``type``, ``matching``, +``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max`` +and ``target_idx``. To ``type`` file, you can write the type of the filter. +Refer to :ref:`the design doc <damon_design_damos_filters>` for available type +names, their meaning and on what layer those are handled. + +For ``memcg`` type, you can specify the memory cgroup of the interest by +writing the path of the memory cgroup from the cgroups mount point to +``memcg_path`` file. For ``addr`` type, you can specify the start and end +address of the range (open-ended interval) to ``addr_start`` and ``addr_end`` +files, respectively. For ``hugepage_size`` type, you can specify the minimum +and maximum size of the range (closed interval) to ``min`` and ``max`` files, +respectively. For ``target`` type, you can specify the index of the target +between the list of the DAMON context's monitoring targets list to +``target_idx`` file. You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter is for memory that matches the ``type``. You can write ``Y`` or ``N`` to @@ -431,6 +467,7 @@ the ``type`` and ``matching`` should be allowed or not. For example, below restricts a DAMOS action to be applied to only non-anonymous pages of all memory cgroups except ``/having_care_already``.:: + # cd ops_filters/0/ # echo 2 > nr_filters # # disallow anonymous pages echo anon > 0/type diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index f34a0d798d5b..67a941903fd2 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -145,7 +145,17 @@ hugepages It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1. 
If the node number is invalid, the parameter will be ignored. +hugepage_alloc_threads + Specify the number of threads that should be used to allocate hugepages + during boot. This parameter can be used to improve system bootup time + when allocating a large amount of huge pages. + The default value is 25% of the available hardware threads. + Example to use 8 allocation threads:: + + hugepage_alloc_threads=8 + + Note that this parameter only applies to non-gigantic huge pages. default_hugepagesz Specify the default huge page size. This parameter can only be specified once on the command line. default_hugepagesz can diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index caba0f52dd36..afce291649dd 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -21,7 +21,8 @@ There are four components to pagemap: * Bit 56 page exclusively mapped (since 4.2) * Bit 57 pte is uffd-wp write-protected (since 5.13) (see Documentation/admin-guide/mm/userfaultfd.rst) - * Bits 58-60 zero + * Bit 58 pte is a guard region (since 6.15) (see madvise (2) man page) + * Bits 59-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped * Bit 63 page present @@ -37,12 +38,28 @@ There are four components to pagemap: precisely which pages are mapped (or in swap) and comparing mapped pages between processes. + Traditionally, bit 56 indicates that a page is mapped exactly once and bit + 56 is clear when a page is mapped multiple times, even when mapped in the + same process multiple times. In some kernel configurations, the semantics + for pages part of a larger allocation (e.g., THP) can differ: bit 56 is set + if all pages part of the corresponding large allocation are *certainly* + mapped in the same process, even if the page is mapped multiple times in that + process. Bit 56 is clear when any page page of the larger allocation + is *maybe* mapped in a different process. 
In some cases, a large allocation
+   might be treated as "maybe mapped by multiple processes" even though this
+   is no longer the case.
+
    Efficient users of this interface will use ``/proc/pid/maps`` to
    determine which areas of memory are actually mapped and llseek to
    skip over unmapped regions.

  * ``/proc/kpagecount``. This file contains a 64-bit count of the number of
-   times each page is mapped, indexed by PFN.
+   times each page is mapped, indexed by PFN. Some kernel configurations do
+   not track the precise number of times a page part of a larger allocation
+   (e.g., THP) is mapped. In these configurations, the average number of
+   mappings per page in this larger allocation is returned instead. However,
+   if any page of the large allocation is mapped, the returned value will
+   be at least 1.

 The page-types tool in the tools/mm directory can be used to query the
 number of times a page is mapped.
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 3598dcd7dbe7..fd3370aa43fe 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -60,15 +60,13 @@ accessed. The compressed memory pool grows on demand and shrinks as compressed
 pages are freed. The pool is not preallocated. By default, a zpool
 of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
 but it can be overridden at boot time by setting the ``zpool`` attribute,
-e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
+e.g. ``zswap.zpool=zsmalloc``. It can also be changed at runtime using the sysfs
 ``zpool`` attribute, e.g.::

-        echo zbud > /sys/module/zswap/parameters/zpool
+        echo zsmalloc > /sys/module/zswap/parameters/zpool

-The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which
-means the compression ratio will always be 2:1 or worse (because of half-full
-zbud pages).
The zsmalloc type zpool has a more complex compressed page
-storage method, and it can achieve greater storage densities.
+The zsmalloc type zpool has a complex compressed page storage method, and it
+can achieve great storage densities.

 When a swap page is passed from swapout to zswap, zswap maintains a mapping
 of the swap entry, a combination of the swap type and swap offset, to the zpool
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f48eaa98d22d..8290177b4f75 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
 - compact_memory
 - compaction_proactiveness
 - compact_unevictable_allowed
+- defrag_mode
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -145,6 +146,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
 to compaction, which would block the task from becoming active until the fault
 is resolved.

+defrag_mode
+===========
+
+When set to 1, the page allocator tries harder to avoid fragmentation
+and maintain the ability to produce huge pages / higher-order pages.
+
+It is recommended to enable this right after boot, as fragmentation,
+once it occurred, can be long-lasting or even permanent.

 dirty_background_bytes
 ====================== |
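
As an illustration of the pagemap bit layout touched by the pagemap.rst hunk above (bit 58 now marks a guard region, bits 59-60 remain zero), the following is a minimal sketch of decoding the flag bits of one 64-bit pagemap entry. The names ``PAGEMAP_FLAGS`` and ``decode_pagemap_flags`` are hypothetical helpers invented for this example and are not part of the kernel or of any tool; only the bit positions are taken from the documentation text in the diff.

```python
# Hypothetical helper, not part of the kernel tree or tools/mm.
# Bit positions follow the list in the pagemap.rst hunk:
#   56 exclusively mapped, 57 uffd-wp, 58 guard region (since 6.15),
#   61 file-page or shared-anon, 62 swapped, 63 present.
PAGEMAP_FLAGS = {
    56: "exclusively_mapped",
    57: "uffd_wp",
    58: "guard_region",
    61: "file_or_shared_anon",
    62: "swapped",
    63: "present",
}


def decode_pagemap_flags(entry: int) -> dict:
    """Return {flag_name: bool} for the flag bits of one 64-bit pagemap entry."""
    return {
        name: bool((entry >> bit) & 1)
        for bit, name in PAGEMAP_FLAGS.items()
    }


if __name__ == "__main__":
    # Synthetic entry with only bits 63 and 56 set; not read from a real process.
    example = (1 << 63) | (1 << 56)
    for name, is_set in decode_pagemap_flags(example).items():
        print(f"{name}: {is_set}")
```

The sketch decodes flags only. Whether bit 56 means "mapped exactly once" or the relaxed large-allocation semantics described in the hunk depends on the kernel configuration, which this code cannot detect.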
