diff options
Diffstat (limited to 'Documentation')
23 files changed, 341 insertions, 137 deletions
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index f2af2ddedd32..2424237ebb10 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -316,6 +316,12 @@ Contact: SeongJae Park <sj@kernel.org> Description: Writing to and reading from this file sets and gets the path parameter of the goal. +What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goal_tuner +Date: Mar 2026 +Contact: SeongJae Park <sj@kernel.org> +Description: Writing to and reading from this file sets and gets the + goal-based effective quota auto-tuning algorithm to use. + What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil Date: Mar 2022 Contact: SeongJae Park <sj@kernel.org> diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 451fa00d3004..60b07a7e30cd 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -462,7 +462,7 @@ know it via /sys/block/zram0/bd_stat's 3rd column. recompression ------------- -With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative +With `CONFIG_ZRAM_MULTI_COMP`, zram can recompress pages using alternative (secondary) compression algorithms. The basic idea is that alternative compression algorithm can provide better compression ratio at a price of (potentially) slower compression/decompression speeds. Alternative compression @@ -471,7 +471,7 @@ that default algorithm failed to compress). Another application is idle pages recompression - pages that are cold and sit in the memory can be recompressed using more effective algorithm and, hence, reduce zsmalloc memory usage. -With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms: +With `CONFIG_ZRAM_MULTI_COMP`, zram supports up to 4 compression algorithms: one primary and up to 3 secondary ones. Primary zram compressor is explained in "3) Select compression algorithm", secondary algorithms are configured using recomp_algorithm device attribute. @@ -495,56 +495,43 @@ configuration::: #select deflate recompression algorithm, priority 2 echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm -Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress, +Another device attribute that `CONFIG_ZRAM_MULTI_COMP` enables is `recompress`, which controls recompression. Examples::: #IDLE pages recompression is activated by `idle` mode - echo "type=idle" > /sys/block/zramX/recompress + echo "type=idle priority=1" > /sys/block/zramX/recompress #HUGE pages recompression is activated by `huge` mode - echo "type=huge" > /sys/block/zram0/recompress + echo "type=huge priority=2" > /sys/block/zram0/recompress #HUGE_IDLE pages recompression is activated by `huge_idle` mode - echo "type=huge_idle" > /sys/block/zramX/recompress + echo "type=huge_idle priority=1" > /sys/block/zramX/recompress The number of idle pages can be significant, so user-space can pass a size threshold (in bytes) to the recompress knob: zram will recompress only pages of equal or greater size::: #recompress all pages larger than 3000 bytes - echo "threshold=3000" > /sys/block/zramX/recompress + echo "threshold=3000 priority=1" > /sys/block/zramX/recompress #recompress idle pages larger than 2000 bytes - echo "type=idle threshold=2000" > /sys/block/zramX/recompress + echo "type=idle threshold=2000 priority=1" > \ + /sys/block/zramX/recompress It is also possible to limit the number of pages zram re-compression will attempt to recompress::: - echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress - -During re-compression for every page, that matches re-compression criteria, -ZRAM iterates the list of registered alternative compression algorithms in -order of their priorities. ZRAM stops either when re-compression was -successful (re-compressed object is smaller in size than the original one) -and matches re-compression criteria (e.g. size threshold) or when there are -no secondary algorithms left to try. If none of the secondary algorithms can -successfully re-compressed the page such a page is marked as incompressible, -so ZRAM will not attempt to re-compress it in the future. - -This re-compression behaviour, when it iterates through the list of -registered compression algorithms, increases our chances of finding the -algorithm that successfully compresses a particular page. Sometimes, however, -it is convenient (and sometimes even necessary) to limit recompression to -only one particular algorithm so that it will not try any other algorithms. -This can be achieved by providing a `algo` or `priority` parameter::: - - #use zstd algorithm only (if registered) - echo "type=huge algo=zstd" > /sys/block/zramX/recompress - - #use zstd algorithm only (if zstd was registered under priority 1) - echo "type=huge priority=1" > /sys/block/zramX/recompress + echo "type=huge_idle priority=1 max_pages=42" > \ + /sys/block/zramX/recompress + +It is advised to always specify `priority` parameter. While it is also +possible to specify `algo` parameter, so that `zram` will use algorithm's +name to determine the priority, it is not recommended, since it can lead to +unexpected results when the same algorithm is configured with different +priorities (e.g. different parameters). `priority` is the only way to +guarantee that the expected algorithm will be used. memory tracking =============== diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 91beaa6798ce..8ad0b2781317 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1734,6 +1734,11 @@ The following nested keys are defined. zswpwb Number of pages written from zswap to swap. + zswap_incomp + Number of incompressible pages currently stored in zswap + without compression. These pages could not be compressed to + a size smaller than PAGE_SIZE, so they are stored as-is. + thp_fault_alloc (npn) Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 404a15f6782c..7663c610fe90 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -141,7 +141,7 @@ nodemask_t The size of a nodemask_t type. Used to compute the number of online nodes. -(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head) +(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_info) ---------------------------------------------------------------------------------- User-space tools compute their values based on the offset of these diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b9a2c649e411..4510b4b3c416 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2970,6 +2970,12 @@ Kernel parameters Format: <bool> Default: CONFIG_KFENCE_DEFERRABLE + kfence.fault= [MM,KFENCE] Controls the behavior when a KFENCE + error is detected. + report - print the error report and continue (default). + oops - print the error report and oops. + panic - print the error report and panic. + kfence.sample_interval= [MM,KFENCE] KFENCE's sample interval in milliseconds. Format: <unsigned integer> diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst index 20a8378d5a94..a7dea7c75a9b 100644 --- a/Documentation/admin-guide/mm/damon/lru_sort.rst +++ b/Documentation/admin-guide/mm/damon/lru_sort.rst @@ -91,8 +91,8 @@ increases and decreases the effective level of the quota aiming the LRU Disabled by default. -Auto-tune monitoring intervals ------------------------------- +autotune_monitoring_intervals +----------------------------- If this parameter is set as ``Y``, DAMON_LRU_SORT automatically tunes DAMON's sampling and aggregation intervals. The auto-tuning aims to capture meaningful @@ -221,6 +221,10 @@ But, setting this too high could result in increased monitoring overhead. Please refer to the DAMON documentation (:doc:`usage`) for more detail. 10 by default. +Note that this must be 3 or higher. Please refer to the :ref:`Monitoring +<damon_design_monitoring>` section of the design document for the rationale +behind this lower bound. + max_nr_regions -------------- @@ -351,3 +355,8 @@ the LRU-list based page granularity reclamation. :: # echo 400 > wmarks_mid # echo 200 > wmarks_low # echo Y > enabled + +Note that this module (damon_lru_sort) cannot run simultaneously with other +DAMON-based special-purpose modules. Refer to :ref:`DAMON design special +purpose modules exclusivity <damon_design_special_purpose_modules_exclusivity>` +for more details. diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst index 8eba3da8dcee..47854c461706 100644 --- a/Documentation/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -204,6 +204,10 @@ monitoring. This can be used to set lower-bound of the monitoring quality. But, setting this too high could result in increased monitoring overhead. Please refer to the DAMON documentation (:doc:`usage`) for more detail. +Note that this must be 3 or higher. Please refer to the :ref:`Monitoring +<damon_design_monitoring>` section of the design document for the rationale +behind this lower bound. + max_nr_regions -------------- @@ -318,6 +322,11 @@ granularity reclamation. :: # echo 200 > wmarks_low # echo Y > enabled +Note that this module (damon_reclaim) cannot run simultaneously with other +DAMON-based special-purpose modules. Refer to :ref:`DAMON design special +purpose modules exclusivity <damon_design_special_purpose_modules_exclusivity>` +for more details. + .. [1] https://research.google/pubs/pub48551/ .. [2] https://lwn.net/Articles/787611/ .. [3] https://www.kernel.org/doc/html/latest/mm/free_page_reporting.html diff --git a/Documentation/admin-guide/mm/damon/stat.rst b/Documentation/admin-guide/mm/damon/stat.rst index e5a5a2c4f803..c4b14daeb2dd 100644 --- a/Documentation/admin-guide/mm/damon/stat.rst +++ b/Documentation/admin-guide/mm/damon/stat.rst @@ -45,6 +45,11 @@ You can enable DAMON_STAT by setting the value of this parameter as ``Y``. Setting it as ``N`` disables DAMON_STAT. The default value is set by ``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` build config option. +Note that this module (damon_stat) cannot run simultaneously with other +DAMON-based special-purpose modules. Refer to :ref:`DAMON design special +purpose modules exclusivity <damon_design_special_purpose_modules_exclusivity>` +for more details. + .. _damon_stat_aggr_interval_us: aggr_interval_us diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index b0f3969b6b3b..534e1199cf09 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -83,7 +83,7 @@ comma (","). │ │ │ │ │ │ │ │ sz/min,max │ │ │ │ │ │ │ │ nr_accesses/min,max │ │ │ │ │ │ │ │ age/min,max - │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes + │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes,goal_tuner │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid,path @@ -377,9 +377,9 @@ schemes/<N>/quotas/ The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given DAMON-based operation scheme. -Under ``quotas`` directory, four files (``ms``, ``bytes``, -``reset_interval_ms``, ``effective_bytes``) and two directories (``weights`` and -``goals``) exist. +Under ``quotas`` directory, five files (``ms``, ``bytes``, +``reset_interval_ms``, ``effective_bytes`` and ``goal_tuner``) and two +directories (``weights`` and ``goals``) exist. You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and ``reset interval`` in milliseconds by writing the values to the three files, @@ -390,6 +390,14 @@ apply the action to only up to ``bytes`` bytes of memory regions within the quota limits unless at least one :ref:`goal <sysfs_schemes_quota_goals>` is set. +You can set the goal-based effective quota auto-tuning algorithm to use, by +writing the algorithm name to ``goal_tuner`` file. Reading the file returns +the currently selected tuner algorithm. Refer to the design documentation of +:ref:`automatic quota tuning goals <damon_design_damos_quotas_auto_tuning>` for +the background design of the feature and the name of the selectable algorithms. +Refer to :ref:`goals directory <sysfs_schemes_quota_goals>` for the goals +setup. + The time quota is internally transformed to a size quota. Between the transformed size quota and user-specified size quota, smaller one is applied. Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`, the diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst index 6dc18ed4b886..cb9a20f64920 100644 --- a/Documentation/admin-guide/mm/kho.rst +++ b/Documentation/admin-guide/mm/kho.rst @@ -28,20 +28,10 @@ per NUMA node scratch regions on boot. Perform a KHO kexec =================== -First, before you perform a KHO kexec, you need to move the system into -the :ref:`KHO finalization phase <kho-finalization-phase>` :: - - $ echo 1 > /sys/kernel/debug/kho/out/finalize - -After this command, the KHO FDT is available in -``/sys/kernel/debug/kho/out/fdt``. Other subsystems may also register -their own preserved sub FDTs under -``/sys/kernel/debug/kho/out/sub_fdts/``. - -Next, load the target payload and kexec into it. It is important that you -use the ``-s`` parameter to use the in-kernel kexec file loader, as user -space kexec tooling currently has no support for KHO with the user space -based file loader :: +To perform a KHO kexec, load the target payload and kexec into it. It +is important that you use the ``-s`` parameter to use the in-kernel +kexec file loader, as user space kexec tooling currently has no +support for KHO with the user space based file loader :: # kexec -l /path/to/bzImage --initrd /path/to/initrd -s # kexec -e @@ -52,40 +42,19 @@ For example, if you used ``reserve_mem`` command line parameter to create an early memory reservation, the new kernel will have that memory at the same physical address as the old kernel. -Abort a KHO exec -================ - -You can move the system out of KHO finalization phase again by calling :: - - $ echo 0 > /sys/kernel/debug/kho/out/active - -After this command, the KHO FDT is no longer available in -``/sys/kernel/debug/kho/out/fdt``. - debugfs Interfaces ================== +These debugfs interfaces are available when the kernel is compiled with +``CONFIG_KEXEC_HANDOVER_DEBUGFS`` enabled. + Currently KHO creates the following debugfs interfaces. Notice that these interfaces may change in the future. They will be moved to sysfs once KHO is stabilized. -``/sys/kernel/debug/kho/out/finalize`` - Kexec HandOver (KHO) allows Linux to transition the state of - compatible drivers into the next kexec'ed kernel. To do so, - device drivers will instruct KHO to preserve memory regions, - which could contain serialized kernel state. - While the state is serialized, they are unable to perform - any modifications to state that was serialized, such as - handed over memory allocations. - - When this file contains "1", the system is in the transition - state. When contains "0", it is not. To switch between the - two states, echo the respective number into this file. - ``/sys/kernel/debug/kho/out/fdt`` - When KHO state tree is finalized, the kernel exposes the - flattened device tree blob that carries its current KHO - state in this file. Kexec user space tooling can use this + The kernel exposes the flattened device tree blob that carries its + current KHO state in this file. Kexec user space tooling can use this as input file for the KHO payload image. ``/sys/kernel/debug/kho/out/scratch_len`` @@ -100,8 +69,8 @@ stabilized. it should place its payload images. ``/sys/kernel/debug/kho/out/sub_fdts/`` - In the KHO finalization phase, KHO producers register their own - FDT blob under this directory. + KHO producers can register their own FDT or another binary blob under + this directory. ``/sys/kernel/debug/kho/in/fdt`` When the kernel was booted with Kexec HandOver (KHO), diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index a70f20ce1ffb..90ab26e805a9 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -217,7 +217,7 @@ MPOL_PREFERRED the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described below. -MPOL_INTERLEAVED +MPOL_INTERLEAVE This mode specifies that page allocations be interleaved, on a page granularity, across the nodes specified in the policy. This mode also behaves slightly differently, based on the diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index 2e63be3486cf..799d743105a6 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -22,6 +22,12 @@ memblock preservation ABI .. kernel-doc:: include/linux/kho/abi/memblock.h :doc: memblock kexec handover ABI +KHO persistent memory tracker ABI +================================= + +.. kernel-doc:: include/linux/kho/abi/kexec_handover.h + :doc: KHO persistent memory tracker + See Also ======== diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index dcc6a36cc134..0a2dee4f8e7d 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -71,17 +71,17 @@ for boot memory allocations and as target memory for kexec blobs, some parts of that memory region may be reserved. These reservations are irrelevant for the next KHO, because kexec can overwrite even the original kernel. -.. _kho-finalization-phase: +Kexec Handover Radix Tree +========================= -KHO finalization phase -====================== +.. kernel-doc:: include/linux/kho_radix_tree.h + :doc: Kexec Handover Radix Tree -To enable user space based kexec file loader, the kernel needs to be able to -provide the FDT that describes the current kernel's state before -performing the actual kexec. The process of generating that FDT is -called serialization. When the FDT is generated, some properties -of the system may become immutable because they are already written down -in the FDT. That state is called the KHO finalization phase. +Public API +========== + +.. kernel-doc:: kernel/liveupdate/kexec_handover.c + :export: See Also ======== diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index a034700da7c4..4968b2aa60c8 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -75,9 +75,6 @@ Software Tag-Based KASAN supports slab, page_alloc, vmalloc, and stack memory. Hardware Tag-Based KASAN supports slab, page_alloc, and non-executable vmalloc memory. -For slab, both software KASAN modes support SLUB and SLAB allocators, while -Hardware Tag-Based KASAN only supports SLUB. - Usage ----- diff --git a/Documentation/dev-tools/kfence.rst b/Documentation/dev-tools/kfence.rst index 541899353865..b03d1201ddae 100644 --- a/Documentation/dev-tools/kfence.rst +++ b/Documentation/dev-tools/kfence.rst @@ -81,6 +81,13 @@ tables being allocated. Error reports ~~~~~~~~~~~~~ +The boot parameter ``kfence.fault`` can be used to control the behavior when a +KFENCE error is detected: + +- ``kfence.fault=report``: Print the error report and continue (default). +- ``kfence.fault=oops``: Print the error report and oops. +- ``kfence.fault=panic``: Print the error report and panic. + A typical out-of-bounds access looks like this:: ================================================================== diff --git a/Documentation/driver-api/vme.rst b/Documentation/driver-api/vme.rst index c0b475369de0..7111999abc14 100644 --- a/Documentation/driver-api/vme.rst +++ b/Documentation/driver-api/vme.rst @@ -107,7 +107,7 @@ The function :c:func:`vme_master_read` can be used to read from and In addition to simple reads and writes, :c:func:`vme_master_rmw` is provided to do a read-modify-write transaction. Parts of a VME window can also be mapped -into user space memory using :c:func:`vme_master_mmap`. +into user space memory using :c:func:`vme_master_mmap_prepare`. Slave windows diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index f4873197587d..6cbc3e0292ae 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -29,6 +29,7 @@ algorithms work. fiemap files locks + mmap_prepare multigrain-ts mount_api quota diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst new file mode 100644 index 000000000000..82c99c95ad85 --- /dev/null +++ b/Documentation/filesystems/mmap_prepare.rst @@ -0,0 +1,168 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +mmap_prepare callback HOWTO +=========================== + +Introduction +============ + +The ``struct file->f_op->mmap()`` callback has been deprecated as it is both a +stability and security risk, and doesn't always permit the merging of adjacent +mappings resulting in unnecessary memory fragmentation. + +It has been replaced with the ``file->f_op->mmap_prepare()`` callback which +solves these problems. + +This hook is called right at the beginning of setting up the mapping, and +importantly it is invoked *before* any merging of adjacent mappings has taken +place. + +If an error arises upon mapping, it might arise after this callback has been +invoked, therefore it should be treated as effectively stateless. + +That is - no resources should be allocated nor state updated to reflect that a +mapping has been established, as the mapping may either be merged, or fail to be +mapped after the callback is complete. + +Mapped callback +--------------- + +If resources need to be allocated per-mapping, or state such as a reference +count needs to be manipulated, this should be done using the ``vm_ops->mapped`` +hook, which itself should be set by the >mmap_prepare hook. + +This callback is only invoked if a new mapping has been established and was not +merged with any other, and is invoked at a point where no error may occur before +the mapping is established. + +You may return an error to the callback itself, which will cause the mapping to +become unmapped and an error returned to the mmap() caller. This is useful if +resources need to be allocated, and that allocation might fail. + +How To Use +========== + +In your driver's struct file_operations struct, specify an ``mmap_prepare`` +callback rather than an ``mmap`` one, e.g. for ext4: + +.. code-block:: C + + const struct file_operations ext4_file_operations = { + ... + .mmap_prepare = ext4_file_mmap_prepare, + }; + +This has a signature of ``int (*mmap_prepare)(struct vm_area_desc *)``. + +Examining the struct vm_area_desc type: + +.. code-block:: C + + struct vm_area_desc { + /* Immutable state. */ + const struct mm_struct *const mm; + struct file *const file; /* May vary from vm_file in stacked callers. */ + unsigned long start; + unsigned long end; + + /* Mutable fields. Populated with initial state. */ + pgoff_t pgoff; + struct file *vm_file; + vma_flags_t vma_flags; + pgprot_t page_prot; + + /* Write-only fields. */ + const struct vm_operations_struct *vm_ops; + void *private_data; + + /* Take further action? */ + struct mmap_action action; + }; + +This is straightforward - you have all the fields you need to set up the +mapping, and you can update the mutable and writable fields, for instance: + +.. code-block:: C + + static int ext4_file_mmap_prepare(struct vm_area_desc *desc) + { + int ret; + struct file *file = desc->file; + struct inode *inode = file->f_mapping->host; + + ... + + file_accessed(file); + if (IS_DAX(file_inode(file))) { + desc->vm_ops = &ext4_dax_vm_ops; + vma_desc_set_flags(desc, VMA_HUGEPAGE_BIT); + } else { + desc->vm_ops = &ext4_file_vm_ops; + } + return 0; + } + +Importantly, you no longer have to dance around with reference counts or locks +when updating these fields - **you can simply go ahead and change them**. + +Everything is taken care of by the mapping code. + +VMA Flags +--------- + +Along with ``mmap_prepare``, VMA flags have undergone an overhaul. Where before +you would invoke one of vm_flags_init(), vm_flags_reset(), vm_flags_set(), +vm_flags_clear(), and vm_flags_mod() to modify flags (and to have the +locking done correctly for you, this is no longer necessary. + +Also, the legacy approach of specifying VMA flags via ``VM_READ``, ``VM_WRITE``, +etc. - i.e. using a ``-VM_xxx``- macro has changed too. + +When implementing mmap_prepare(), reference flags by their bit number, defined +as a ``VMA_xxx_BIT`` macro, e.g. ``VMA_READ_BIT``, ``VMA_WRITE_BIT`` etc., +and use one of (where ``desc`` is a pointer to struct vm_area_desc): + +* ``vma_desc_test_any(desc, ...)`` - Specify a comma-separated list of flags + you wish to test for (whether _any_ are set), e.g. - ``vma_desc_test_any( + desc, VMA_WRITE_BIT, VMA_MAYWRITE_BIT)`` - returns ``true`` if either are set, + otherwise ``false``. +* ``vma_desc_set_flags(desc, ...)`` - Update the VMA descriptor flags to set + additional flags specified by a comma-separated list, + e.g. - ``vma_desc_set_flags(desc, VMA_PFNMAP_BIT, VMA_IO_BIT)``. +* ``vma_desc_clear_flags(desc, ...)`` - Update the VMA descriptor flags to clear + flags specified by a comma-separated list, e.g. - ``vma_desc_clear_flags( + desc, VMA_WRITE_BIT, VMA_MAYWRITE_BIT)``. + +Actions +======= + +You can now very easily have actions be performed upon a mapping once set up by +utilising simple helper functions invoked upon the struct vm_area_desc +pointer. These are: + +* mmap_action_remap() - Remaps a range consisting only of PFNs for a specific + range starting a virtual address and PFN number of a set size. + +* mmap_action_remap_full() - Same as mmap_action_remap(), only remaps the + entire mapping from ``start_pfn`` onward. + +* mmap_action_ioremap() - Same as mmap_action_remap(), only performs an I/O + remap. + +* mmap_action_ioremap_full() - Same as mmap_action_ioremap(), only remaps + the entire mapping from ``start_pfn`` onward. + +* mmap_action_simple_ioremap() - Sets up an I/O remap from a specified + physical address and over a specified length. + +* mmap_action_map_kernel_pages() - Maps a specified array of `struct page` + pointers in the VMA from a specific offset. + +* mmap_action_map_kernel_pages_full() - Maps a specified array of `struct + page` pointers over the entire VMA. The caller must ensure there are + sufficient entries in the page array to cover the entire range of the + described VMA. + +**NOTE:** The ``action`` field should never normally be manipulated directly, +rather you ought to use one of these helpers. diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index dd64f5d7f319..afc7d52bda2f 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -150,6 +150,8 @@ address on the given address space. Support of ``address unit`` parameter is up to each operations set implementation. ``paddr`` is the only operations set implementation that supports the parameter. +If the value is smaller than ``PAGE_SIZE``, only a power of two should be used. + .. _damon_core_logic: Core Logics @@ -165,6 +167,13 @@ monitoring attributes, ``sampling interval``, ``aggregation interval``, ``update interval``, ``minimum number of regions``, and ``maximum number of regions``. +Note that ``minimum number of regions`` must be 3 or higher. This is because the +virtual address space monitoring is designed to handle at least three regions to +accommodate two large unmapped areas commonly found in normal virtual address +spaces. While this restriction might not be strictly necessary for other +operation sets like ``paddr``, it is currently enforced across all DAMON +operations for consistency. + To know how user-space can set the attributes via :ref:`DAMON sysfs interface <sysfs_interface>`, refer to :ref:`monitoring_attrs <sysfs_monitoring_attrs>` part of the documentation. @@ -458,9 +467,13 @@ that supports each action are as below. - ``pageout``: Reclaim the region. Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set. - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. - Supported by ``vaddr`` and ``fvaddr`` operations set. + Supported by ``vaddr`` and ``fvaddr`` operations set. When + TRANSPARENT_HUGEPAGE is disabled, the application of the action will just + fail. - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. - Supported by ``vaddr`` and ``fvaddr`` operations set. + Supported by ``vaddr`` and ``fvaddr`` operations set. When + TRANSPARENT_HUGEPAGE is disabled, the application of the action will just + fail. - ``lru_prio``: Prioritize the region on its LRU lists. Supported by ``paddr`` operations set. - ``lru_deprio``: Deprioritize the region on its LRU lists. @@ -564,6 +577,18 @@ aggressiveness (the quota) of the corresponding scheme. For example, if DAMOS is under achieving the goal, DAMOS automatically increases the quota. If DAMOS is over achieving the goal, it decreases the quota. +There are two such tuning algorithms that users can select as they need. + +- ``consist``: A proportional feedback loop based algorithm. Tries to find an + optimum quota that should be consistently kept, to keep achieving the goal. + Useful for kernel-only operation on dynamic and long-running environments. + This is the default selection. If unsure, use this. +- ``temporal``: More straightforward algorithm. Tries to achieve the goal as + fast as possible, using maximum allowed quota, but only for a temporal short + time. When the quota is under-achieved, this algorithm keeps tuning quota to + a maximum allowed one. Once the quota is [over]-achieved, this sets the + quota zero. Useful for deterministic control required environments. + The goal can be specified with five parameters, namely ``target_metric``, ``target_value``, ``current_value``, ``nid`` and ``path``. The auto-tuning mechanism tries to make ``current_value`` of ``target_metric`` be same to @@ -839,6 +864,10 @@ more detail, please read the usage documents for those (:doc:`/admin-guide/mm/damon/stat`, :doc:`/admin-guide/mm/damon/reclaim` and :doc:`/admin-guide/mm/damon/lru_sort`). +.. _damon_design_special_purpose_modules_exclusivity: + +Note that these modules currently run in an exclusive manner. If one of those +is already running, others will return ``-EBUSY`` upon start requests. Sample DAMON Modules -------------------- diff --git a/Documentation/mm/damon/index.rst b/Documentation/mm/damon/index.rst index 82f6c5eea49a..318f6a7bfea4 100644 --- a/Documentation/mm/damon/index.rst +++ b/Documentation/mm/damon/index.rst @@ -12,7 +12,7 @@ DAMON is a Linux kernel subsystem for efficient :ref:`data access monitoring - *light-weight* (for production online usages), - *scalable* (in terms of memory size), - *tunable* (for flexible usages), and - - *autoamted* (for production operation without manual tunings). + - *automated* (for production operation without manual tunings). .. toctree:: :maxdepth: 2 diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst index 41b1d73b9bd7..bcb9798a27a8 100644 --- a/Documentation/mm/damon/maintainer-profile.rst +++ b/Documentation/mm/damon/maintainer-profile.rst @@ -63,10 +63,10 @@ management subsystem maintainer. Review cadence -------------- -The DAMON maintainer does the work on the usual work hour (09:00 to 17:00, -Mon-Fri) in PT (Pacific Time). The response to patches will occasionally be -slow. Do not hesitate to send a ping if you have not heard back within a week -of sending a patch. +The DAMON maintainer usually work in a flexible way, except early morning in PT +(Pacific Time). The response to patches will occasionally be slow. Do not +hesitate to send a ping if you have not heard back within a week of sending a +patch. Mailing tool ------------ diff --git a/Documentation/mm/hugetlbfs_reserv.rst b/Documentation/mm/hugetlbfs_reserv.rst index 4914fbf07966..a49115db18c7 100644 --- a/Documentation/mm/hugetlbfs_reserv.rst +++ b/Documentation/mm/hugetlbfs_reserv.rst @@ -155,7 +155,7 @@ are enough free huge pages to accommodate the reservation. If there are, the global reservation count resv_huge_pages is adjusted something like the following:: - if (resv_needed <= (resv_huge_pages - free_huge_pages)) + if (resv_needed <= (free_huge_pages - resv_huge_pages) resv_huge_pages += resv_needed; Note that the global lock hugetlb_lock is held when checking and adjusting diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst index b4a55b6569fa..9fa8642ded48 100644 --- a/Documentation/mm/vmemmap_dedup.rst +++ b/Documentation/mm/vmemmap_dedup.rst @@ -24,7 +24,7 @@ For each base page, there is a corresponding ``struct page``. Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides this upper limit. The only 'useful' information in the remaining ``struct page`` -is the compound_head field, and this field is the same for all tail pages. +is the compound_info field, and this field is the same for all tail pages. By removing redundant ``struct page`` for HugeTLB pages, memory can be returned to the buddy allocator for other uses. @@ -124,33 +124,35 @@ Here is how things look before optimization:: | | +-----------+ -The value of page->compound_head is the same for all tail pages. The first -page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4 -``struct page`` necessary to describe the HugeTLB. The only use of the remaining -pages of ``struct page`` (page 1 to page 7) is to point to page->compound_head. -Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of ``struct page`` -will be used for each HugeTLB page. This will allow us to free the remaining -7 pages to the buddy allocator. +The first page of ``struct page`` (page 0) associated with the HugeTLB page +contains the 4 ``struct page`` necessary to describe the HugeTLB. The remaining +pages of ``struct page`` (page 1 to page 7) are tail pages. + +The optimization is only applied when the size of the struct page is a power +of 2. In this case, all tail pages of the same order are identical. See +compound_head(). This allows us to remap the tail pages of the vmemmap to a +shared, read-only page. The head page is also remapped to a new page. This +allows the original vmemmap pages to be freed. Here is how things look after remapping:: - HugeTLB struct pages(8 pages) page frame(8 pages) - +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ - | | | 0 | -------------> | 0 | - | | +-----------+ +-----------+ - | | | 1 | ---------------^ ^ ^ ^ ^ ^ ^ - | | +-----------+ | | | | | | - | | | 2 | -----------------+ | | | | | - | | +-----------+ | | | | | - | | | 3 | -------------------+ | | | | - | | +-----------+ | | | | - | | | 4 | ---------------------+ | | | - | PMD | +-----------+ | | | - | level | | 5 | -----------------------+ | | - | mapping | +-----------+ | | - | | | 6 | -------------------------+ | - | | +-----------+ | - | | | 7 | ---------------------------+ + HugeTLB struct pages(8 pages) page frame (new) + +-----------+ ---virt_to_page---> +-----------+ mapping to +----------------+ + | | | 0 | -------------> | 0 | + | | +-----------+ +----------------+ + | | | 1 | ------┐ + | | +-----------+ | + | | | 2 | ------┼ +----------------------------+ + | | +-----------+ | | A single, per-zone page | + | | | 3 | ------┼------> | frame shared among all | + | | +-----------+ | | hugepages of the same size | + | | | 4 | ------┼ +----------------------------+ + | | +-----------+ | + | | | 5 | ------┼ + | PMD | +-----------+ | + | level | | 6 | ------┼ + | mapping | +-----------+ | + | | | 7 | ------┘ | | +-----------+ | | | | @@ -172,16 +174,6 @@ The contiguous bit is used to increase the mapping size at the pmd and pte (last) level. So this type of HugeTLB page can be optimized only when its size of the ``struct page`` structs is greater than **1** page. -Notice: The head vmemmap page is not freed to the buddy allocator and all -tail vmemmap pages are mapped to the head vmemmap page frame. So we can see -more than one ``struct page`` struct with ``PG_head`` (e.g. 8 per 2 MB HugeTLB -page) associated with each HugeTLB page. The ``compound_head()`` can handle -this correctly. There is only **one** head ``struct page``, the tail -``struct page`` with ``PG_head`` are fake head ``struct page``. We need an -approach to distinguish between those two different types of ``struct page`` so -that ``compound_head()`` can return the real head ``struct page`` when the -parameter is the tail ``struct page`` but with ``PG_head``. - Device DAX ========== |
