diff options
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2025-05-31 15:44:16 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2025-05-31 15:44:16 -0700 |
| commit | 00c010e130e58301db2ea0cec1eadc931e1cb8cf (patch) | |
| tree | 885eca54cb733ca2b91fc563f09a23f8c0123fe1 /Documentation | |
| parent | b42966552bb8d3027b66782fc1b53ce570e4d356 (diff) | |
| parent | c544a952ba61b1a025455098033c17e0573ab085 (diff) | |
Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
Diffstat (limited to 'Documentation')
27 files changed, 656 insertions, 102 deletions
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index 293197f180ad..5697ab154c1f 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -283,6 +283,12 @@ Contact: SeongJae Park <sj@kernel.org> Description: Writing to and reading from this file sets and gets the current value of the goal metric. +What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/nid +Date: Apr 2025 +Contact: SeongJae Park <sj@kernel.org> +Description: Writing to and reading from this file sets and gets the nid + parameter of the goal. + What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil Date: Mar 2022 Contact: SeongJae Park <sj@kernel.org> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave index 0b7972de04e9..649c0e9b895c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave @@ -20,6 +20,35 @@ Description: Weight configuration interface for nodeN Minimum weight: 1 Maximum weight: 255 - Writing an empty string or `0` will reset the weight to the - system default. The system default may be set by the kernel - or drivers at boot or during hotplug events. + Writing invalid values (i.e. any values not in [1,255], + empty string, ...) will return -EINVAL. + + Changing the weight to a valid value will automatically + switch the system to manual mode as well. + +What: /sys/kernel/mm/mempolicy/weighted_interleave/auto +Date: May 2025 +Contact: Linux memory management mailing list <linux-mm@kvack.org> +Description: Auto-weighting configuration interface + + Configuration mode for weighted interleave. 'true' indicates + that the system is in auto mode, and a 'false' indicates that + the system is in manual mode. + + In auto mode, all node weights are re-calculated and overwritten + (visible via the nodeN interfaces) whenever new bandwidth data + is made available during either boot or hotplug events. + + In manual mode, node weights can only be updated by the user. + Note that nodes that are onlined with previously set weights + will reuse those weights. If they were not previously set or + are onlined with missing bandwidth data, the weights will use + a default weight of 1. + + Writing any true value string (e.g. Y or 1) will enable auto + mode, while writing any false value string (e.g. N or 0) will + enable manual mode. All other strings are ignored and will + return -EINVAL. + + Writing a new weight to a node directly via the nodeN interface + will also automatically switch the system to manual mode. diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa index 77e559d4ed80..90e375ff54cb 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-numa +++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa @@ -16,9 +16,13 @@ Description: Enable/disable demoting pages during reclaim Allowing page migration during reclaim enables these systems to migrate pages from fast tiers to slow tiers when the fast tier is under pressure. This migration - is performed before swap. It may move data to a NUMA - node that does not fall into the cpuset of the - allocating process which might be construed to violate - the guarantees of cpusets. This should not be enabled - on systems which need strict cpuset location - guarantees. + is performed before swap if an eligible numa node is + present in cpuset.mems for the cgroup (or if cpuset v1 + is being used). If cpusets.mems changes at runtime, it + may move data to a NUMA node that does not fall into the + cpuset of the new cpusets.mems, which might be construed + to violate the guarantees of cpusets. Shared memory, + such as libraries, owned by another cgroup may still be + demoted and result in memory use on a node not present + in cpusets.mem. This should not be enabled on systems + which need strict cpuset location guarantees. diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab index cd5fb8fa3ddf..658999be5164 100644 --- a/Documentation/ABI/testing/sysfs-kernel-slab +++ b/Documentation/ABI/testing/sysfs-kernel-slab @@ -2,7 +2,7 @@ What: /sys/kernel/slab Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The /sys/kernel/slab directory contains a snapshot of the internal state of the SLUB allocator for each cache. Certain @@ -14,7 +14,7 @@ What: /sys/kernel/slab/<cache>/aliases Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The aliases file is read-only and specifies how many caches have merged into this cache. @@ -23,7 +23,7 @@ What: /sys/kernel/slab/<cache>/align Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The align file is read-only and specifies the cache's object alignment in bytes. @@ -32,7 +32,7 @@ What: /sys/kernel/slab/<cache>/alloc_calls Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The alloc_calls file is read-only and lists the kernel code locations from which allocations for this cache were performed. @@ -43,7 +43,7 @@ What: /sys/kernel/slab/<cache>/alloc_fastpath Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The alloc_fastpath file shows how many objects have been allocated using the fast path. It can be written to clear the @@ -54,7 +54,7 @@ What: /sys/kernel/slab/<cache>/alloc_from_partial Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The alloc_from_partial file shows how many times a cpu slab has been full and it has been refilled by using a slab from the list @@ -66,7 +66,7 @@ What: /sys/kernel/slab/<cache>/alloc_refill Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The alloc_refill file shows how many times the per-cpu freelist was empty but there were objects available as the result of @@ -77,7 +77,7 @@ What: /sys/kernel/slab/<cache>/alloc_slab Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The alloc_slab file is shows how many times a new slab had to be allocated from the page allocator. It can be written to @@ -88,7 +88,7 @@ What: /sys/kernel/slab/<cache>/alloc_slowpath Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The alloc_slowpath file shows how many objects have been allocated using the slow path because of a refill or @@ -100,7 +100,7 @@ What: /sys/kernel/slab/<cache>/cache_dma Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The cache_dma file is read-only and specifies whether objects are from ZONE_DMA. @@ -110,7 +110,7 @@ What: /sys/kernel/slab/<cache>/cpu_slabs Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The cpu_slabs file is read-only and displays how many cpu slabs are active and their NUMA locality. @@ -119,7 +119,7 @@ What: /sys/kernel/slab/<cache>/cpuslab_flush Date: April 2009 KernelVersion: 2.6.31 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The file cpuslab_flush shows how many times a cache's cpu slabs have been flushed as the result of destroying or shrinking a @@ -132,7 +132,7 @@ What: /sys/kernel/slab/<cache>/ctor Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The ctor file is read-only and specifies the cache's object constructor function, which is invoked for each object when a @@ -142,7 +142,7 @@ What: /sys/kernel/slab/<cache>/deactivate_empty Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The deactivate_empty file shows how many times an empty cpu slab was deactivated. It can be written to clear the current count. @@ -152,7 +152,7 @@ What: /sys/kernel/slab/<cache>/deactivate_full Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The deactivate_full file shows how many times a full cpu slab was deactivated. It can be written to clear the current count. @@ -162,7 +162,7 @@ What: /sys/kernel/slab/<cache>/deactivate_remote_frees Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The deactivate_remote_frees file shows how many times a cpu slab has been deactivated and contained free objects that were freed @@ -173,7 +173,7 @@ What: /sys/kernel/slab/<cache>/deactivate_to_head Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The deactivate_to_head file shows how many times a partial cpu slab was deactivated and added to the head of its node's partial @@ -184,7 +184,7 @@ What: /sys/kernel/slab/<cache>/deactivate_to_tail Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The deactivate_to_tail file shows how many times a partial cpu slab was deactivated and added to the tail of its node's partial @@ -195,7 +195,7 @@ What: /sys/kernel/slab/<cache>/destroy_by_rcu Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The destroy_by_rcu file is read-only and specifies whether slabs (not objects) are freed by rcu. @@ -204,7 +204,7 @@ What: /sys/kernel/slab/<cache>/free_add_partial Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The free_add_partial file shows how many times an object has been freed in a full slab so that it had to added to its node's @@ -215,7 +215,7 @@ What: /sys/kernel/slab/<cache>/free_calls Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The free_calls file is read-only and lists the locations of object frees if slab debugging is enabled (see @@ -225,7 +225,7 @@ What: /sys/kernel/slab/<cache>/free_fastpath Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The free_fastpath file shows how many objects have been freed using the fast path because it was an object from the cpu slab. @@ -236,7 +236,7 @@ What: /sys/kernel/slab/<cache>/free_frozen Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The free_frozen file shows how many objects have been freed to a frozen slab (i.e. a remote cpu slab). It can be written to @@ -247,7 +247,7 @@ What: /sys/kernel/slab/<cache>/free_remove_partial Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The free_remove_partial file shows how many times an object has been freed to a now-empty slab so that it had to be removed from @@ -259,7 +259,7 @@ What: /sys/kernel/slab/<cache>/free_slab Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The free_slab file shows how many times an empty slab has been freed back to the page allocator. It can be written to clear @@ -270,7 +270,7 @@ What: /sys/kernel/slab/<cache>/free_slowpath Date: February 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The free_slowpath file shows how many objects have been freed using the slow path (i.e. to a full or partial slab). It can @@ -281,7 +281,7 @@ What: /sys/kernel/slab/<cache>/hwcache_align Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The hwcache_align file is read-only and specifies whether objects are aligned on cachelines. @@ -301,7 +301,7 @@ What: /sys/kernel/slab/<cache>/object_size Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The object_size file is read-only and specifies the cache's object size. @@ -310,7 +310,7 @@ What: /sys/kernel/slab/<cache>/objects Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The objects file is read-only and displays how many objects are active and from which nodes they are from. @@ -319,7 +319,7 @@ What: /sys/kernel/slab/<cache>/objects_partial Date: April 2008 KernelVersion: 2.6.26 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The objects_partial file is read-only and displays how many objects are on partial slabs and from which nodes they are @@ -329,7 +329,7 @@ What: /sys/kernel/slab/<cache>/objs_per_slab Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The file objs_per_slab is read-only and specifies how many objects may be allocated from a single slab of the order @@ -339,7 +339,7 @@ What: /sys/kernel/slab/<cache>/order Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The order file specifies the page order at which new slabs are allocated. It is writable and can be changed to increase the @@ -356,7 +356,7 @@ What: /sys/kernel/slab/<cache>/order_fallback Date: April 2008 KernelVersion: 2.6.26 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The order_fallback file shows how many times an allocation of a new slab has not been possible at the cache's order and instead @@ -369,7 +369,7 @@ What: /sys/kernel/slab/<cache>/partial Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The partial file is read-only and displays how long many partial slabs there are and how long each node's list is. @@ -378,7 +378,7 @@ What: /sys/kernel/slab/<cache>/poison Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The poison file specifies whether objects should be poisoned when a new slab is allocated. @@ -387,7 +387,7 @@ What: /sys/kernel/slab/<cache>/reclaim_account Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The reclaim_account file specifies whether the cache's objects are reclaimable (and grouped by their mobility). @@ -396,7 +396,7 @@ What: /sys/kernel/slab/<cache>/red_zone Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The red_zone file specifies whether the cache's objects are red zoned. @@ -405,7 +405,7 @@ What: /sys/kernel/slab/<cache>/remote_node_defrag_ratio Date: January 2008 KernelVersion: 2.6.25 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The file remote_node_defrag_ratio specifies the percentage of times SLUB will attempt to refill the cpu slab with a partial @@ -419,7 +419,7 @@ What: /sys/kernel/slab/<cache>/sanity_checks Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The sanity_checks file specifies whether expensive checks should be performed on free and, at minimum, enables double free @@ -430,7 +430,7 @@ What: /sys/kernel/slab/<cache>/shrink Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The shrink file is used to reclaim unused slab cache memory from a cache. Empty per-cpu or partial slabs @@ -446,7 +446,7 @@ What: /sys/kernel/slab/<cache>/slab_size Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The slab_size file is read-only and specifies the object size with metadata (debugging information and alignment) in bytes. @@ -455,7 +455,7 @@ What: /sys/kernel/slab/<cache>/slabs Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The slabs file is read-only and displays how long many slabs there are (both cpu and partial) and from which nodes they are @@ -465,7 +465,7 @@ What: /sys/kernel/slab/<cache>/store_user Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The store_user file specifies whether the location of allocation or free should be tracked for a cache. @@ -474,7 +474,7 @@ What: /sys/kernel/slab/<cache>/total_objects Date: April 2008 KernelVersion: 2.6.26 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The total_objects file is read-only and displays how many total objects a cache has and from which nodes they are from. @@ -483,7 +483,7 @@ What: /sys/kernel/slab/<cache>/trace Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: The trace file specifies whether object allocations and frees should be traced. @@ -492,7 +492,7 @@ What: /sys/kernel/slab/<cache>/validate Date: May 2007 KernelVersion: 2.6.22 Contact: Pekka Enberg <penberg@cs.helsinki.fi>, - Christoph Lameter <cl@linux-foundation.org> + Christoph Lameter <cl@gentwo.org> Description: Writing to the validate file causes SLUB to traverse all of its cache's objects and check the validity of metadata. @@ -506,14 +506,14 @@ Description: What: /sys/kernel/slab/<cache>/slabs_cpu_partial Date: Aug 2011 -Contact: Christoph Lameter <cl@linux.com> +Contact: Christoph Lameter <cl@gentwo.org> Description: This read-only file shows the number of partialli allocated frozen slabs. What: /sys/kernel/slab/<cache>/cpu_partial Date: Aug 2011 -Contact: Christoph Lameter <cl@linux.com> +Contact: Christoph Lameter <cl@gentwo.org> Description: This read-only file shows the number of per cpu partial pages to keep around. diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 9bdb30901a93..3e273c1bb749 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -317,6 +317,26 @@ a single line of text and contains the following stats separated by whitespace: Optional Feature ================ +IDLE pages tracking +------------------- + +zram has built-in support for idle pages tracking (that is, allocated but +not used pages). This feature is useful for e.g. zram writeback and +recompression. In order to mark pages as idle, execute the following command:: + + echo all > /sys/block/zramX/idle + +This will mark all allocated zram pages as idle. The idle mark will be +removed only when the page (block) is accessed (e.g. overwritten or freed). +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be +marked as idle based on how many seconds have passed since the last access to +a particular zram page:: + + echo 86400 > /sys/block/zramX/idle + +In this example, all pages which haven't been accessed in more than 86400 +seconds (one day) will be marked idle. + writeback --------- @@ -331,24 +351,7 @@ If admin wants to use incompressible page writeback, they could do it via:: echo huge > /sys/block/zramX/writeback -To use idle page writeback, first, user need to declare zram pages -as idle:: - - echo all > /sys/block/zramX/idle - -From now on, any pages on zram are idle pages. The idle mark -will be removed until someone requests access of the block. -IOW, unless there is access request, those pages are still idle pages. -Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be -marked as idle based on how long (in seconds) it's been since they were -last accessed:: - - echo 86400 > /sys/block/zramX/idle - -In this example all pages which haven't been accessed in more than 86400 -seconds (one day) will be marked idle. - -Admin can request writeback of those idle pages at right timing via:: +Admin can request writeback of idle pages at right timing via:: echo idle > /sys/block/zramX/writeback @@ -369,6 +372,23 @@ they could write a page index into the interface:: echo "page_index=1251" > /sys/block/zramX/writeback +In Linux 6.16 this interface underwent some rework. First, the interface +now supports `key=value` format for all of its parameters (`type=huge_idle`, +etc.) Second, the support for `page_indexes` was introduced, which specify +`LOW-HIGH` range (or ranges) of pages to be written-back. This reduces the +number of syscalls, but more importantly this enables optimal post-processing +target selection strategy. Usage example:: + + echo "type=idle" > /sys/block/zramX/writeback + echo "page_indexes=1-100 page_indexes=200-300" > \ + /sys/block/zramX/writeback + +We also now permit multiple page_index params per call and a mix of +single pages and page ranges:: + + echo page_index=42 page_index=99 page_indexes=100-200 \ + page_indexes=500-700 > /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life. @@ -482,8 +502,6 @@ attempt to recompress::: echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress -Recompression of idle pages requires memory tracking. - During re-compression for every page, that matches re-compression criteria, ZRAM iterates the list of registered alternative compression algorithms in order of their priorities. ZRAM stops either when re-compression was diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst index a3e2edb3d274..463f98453323 100644 --- a/Documentation/admin-guide/cgroup-v1/cgroups.rst +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -13,7 +13,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. Modified by Paul Jackson <pj@sgi.com> -Modified by Christoph Lameter <cl@linux.com> +Modified by Christoph Lameter <cl@gentwo.org> .. CONTENTS: diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst index f401af5e2f09..c7909e5ac136 100644 --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -10,7 +10,7 @@ Written by Simon.Derr@bull.net - Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. - Modified by Paul Jackson <pj@sgi.com> -- Modified by Christoph Lameter <cl@linux.com> +- Modified by Christoph Lameter <cl@gentwo.org> - Modified by Paul Menage <menage@google.com> - Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 1edc26622594..bd98ea3175ec 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1334,6 +1334,18 @@ PAGE_SIZE multiple when read back. monitors the limited cgroup to alleviate heavy reclaim pressure. + If memory.high is opened with O_NONBLOCK then the synchronous + reclaim is bypassed. This is useful for admin processes that + need to dynamically adjust the job's memory limits without + expending their own CPU resources on memory reclamation. The + job will trigger the reclaim and/or get throttled on its + next charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.max A read-write single value file which exists on non-root cgroups. The default is "max". @@ -1351,6 +1363,18 @@ PAGE_SIZE multiple when read back. Caller could retry them differently, return into userspace as -ENOMEM or silently ignore in cases like disk readahead. + If memory.max is opened with O_NONBLOCK, then the synchronous + reclaim and oom-kill are bypassed. This is useful for admin + processes that need to dynamically adjust the job's memory limits + without expending their own CPU resources on memory reclamation. + The job will trigger the reclaim and/or oom-kill on its next + charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.reclaim A write-only nested-keyed file which exists for all cgroups. @@ -1383,6 +1407,9 @@ The following nested keys are defined. same semantics as vm.swappiness applied to memcg reclaim with all the existing limitations and potential future extensions. + The valid range for swappiness is [0-200, max], setting + swappiness=max exclusively reclaims anonymous memory. + memory.peak A read-write single value file which exists on non-root cgroups. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ea81784be981..a3ea40b22fb9 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2749,6 +2749,31 @@ kgdbwait [KGDB,EARLY] Stop kernel execution and enter the kernel debugger at the earliest opportunity. + kho= [KEXEC,EARLY] + Format: { "0" | "1" | "off" | "on" | "y" | "n" } + Enables or disables Kexec HandOver. + "0" | "off" | "n" - kexec handover is disabled + "1" | "on" | "y" - kexec handover is enabled + + kho_scratch= [KEXEC,EARLY] + Format: ll[KMG],mm[KMG],nn[KMG] | nn% + Defines the size of the KHO scratch region. The KHO + scratch regions are physically contiguous memory + ranges that can only be used for non-kernel + allocations. That way, even when memory is heavily + fragmented with handed over memory, the kexeced + kernel will always have enough contiguous ranges to + bootstrap itself. + + It is possible to specify the exact amount of + memory in the form of "ll[KMG],mm[KMG],nn[KMG]" + where the first parameter defines the size of a low + memory scratch area, the second parameter defines + the size of a global scratch area and the third + parameter defines the size of additional per-node + scratch areas. The form "nn%" defines scale factor + (in percents) of memory that was used during boot. + kmac= [MIPS] Korina ethernet MAC address. Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 33d37bb2fb4e..bc7e976120e0 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,12 +1,11 @@ .. SPDX-License-Identifier: GPL-2.0 -========================== -DAMON: Data Access MONitor -========================== +================================================================ +DAMON: Data Access MONitoring and Access-aware System Operations +================================================================ -:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring. -Using DAMON, users can analyze the memory access patterns of their systems and -optimize those. +:doc:`DAMON </mm/damon/index>` is a Linux kernel subsystem for efficient data +access monitoring and access-aware system operations. .. toctree:: :maxdepth: 2 diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index ced2013db3df..d960aba72b82 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -81,7 +81,7 @@ comma (","). │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals - │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value + │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max @@ -390,11 +390,11 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each goal and current achievement. Among the multiple feedback, the best one is used. -Each goal directory contains three files, namely ``target_metric``, -``target_value`` and ``current_value``. Users can set and get the three -parameters for the quota auto-tuning goals that specified on the :ref:`design -doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each -of the files. Note that users should further write +Each goal directory contains four files, namely ``target_metric``, +``target_value``, ``current_value`` and ``nid``. Users can set and get the +four parameters for the quota auto-tuning goals that specified on the +:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and +reading from each of the files. Note that users should further write ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond directory <sysfs_kdamond>` to pass the feedback to DAMON. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..2d2f6c222308 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -42,3 +42,4 @@ the Linux memory management. transhuge userfaultfd zswap + kho diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst new file mode 100644 index 000000000000..6dc18ed4b886 --- /dev/null +++ b/Documentation/admin-guide/mm/kho.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Kexec Handover Usage +==================== + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +This document expects that you are familiar with the base KHO +:ref:`concepts <kho-concepts>`. If you have not read +them yet, please do so now. + +Prerequisites +============= + +KHO is available when the kernel is compiled with ``CONFIG_KEXEC_HANDOVER`` +set to y. Every KHO producer may have its own config option that you +need to enable if you would like to preserve their respective state across +kexec. + +To use KHO, please boot the kernel with the ``kho=on`` command line +parameter. You may use ``kho_scratch`` parameter to define size of the +scratch regions. For example ``kho_scratch=16M,512M,256M`` will reserve a +16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB +per NUMA node scratch regions on boot. + +Perform a KHO kexec +=================== + +First, before you perform a KHO kexec, you need to move the system into +the :ref:`KHO finalization phase <kho-finalization-phase>` :: + + $ echo 1 > /sys/kernel/debug/kho/out/finalize + +After this command, the KHO FDT is available in +``/sys/kernel/debug/kho/out/fdt``. Other subsystems may also register +their own preserved sub FDTs under +``/sys/kernel/debug/kho/out/sub_fdts/``. + +Next, load the target payload and kexec into it. It is important that you +use the ``-s`` parameter to use the in-kernel kexec file loader, as user +space kexec tooling currently has no support for KHO with the user space +based file loader :: + + # kexec -l /path/to/bzImage --initrd /path/to/initrd -s + # kexec -e + +The new kernel will boot up and contain some of the previous kernel's state. + +For example, if you used ``reserve_mem`` command line parameter to create +an early memory reservation, the new kernel will have that memory at the +same physical address as the old kernel. + +Abort a KHO exec +================ + +You can move the system out of KHO finalization phase again by calling :: + + $ echo 0 > /sys/kernel/debug/kho/out/active + +After this command, the KHO FDT is no longer available in +``/sys/kernel/debug/kho/out/fdt``. + +debugfs Interfaces +================== + +Currently KHO creates the following debugfs interfaces. Notice that these +interfaces may change in the future. They will be moved to sysfs once KHO is +stabilized. + +``/sys/kernel/debug/kho/out/finalize`` + Kexec HandOver (KHO) allows Linux to transition the state of + compatible drivers into the next kexec'ed kernel. To do so, + device drivers will instruct KHO to preserve memory regions, + which could contain serialized kernel state. + While the state is serialized, they are unable to perform + any modifications to state that was serialized, such as + handed over memory allocations. + + When this file contains "1", the system is in the transition + state. When contains "0", it is not. To switch between the + two states, echo the respective number into this file. + +``/sys/kernel/debug/kho/out/fdt`` + When KHO state tree is finalized, the kernel exposes the + flattened device tree blob that carries its current KHO + state in this file. Kexec user space tooling can use this + as input file for the KHO payload image. + +``/sys/kernel/debug/kho/out/scratch_len`` + Lengths of KHO scratch regions, which are physically contiguous + memory regions that will always stay available for future kexec + allocations. Kexec user space tools can use this file to determine + where it should place its payload images. + +``/sys/kernel/debug/kho/out/scratch_phys`` + Physical locations of KHO scratch regions. Kexec user space tools + can use this file in conjunction to scratch_phys to determine where + it should place its payload images. + +``/sys/kernel/debug/kho/out/sub_fdts/`` + In the KHO finalization phase, KHO producers register their own + FDT blob under this directory. + +``/sys/kernel/debug/kho/in/fdt`` + When the kernel was booted with Kexec HandOver (KHO), + the state tree that carries metadata about the previous + kernel's state is in this file in the format of flattened + device tree. This file may disappear when all consumers of + it finished to interpret their metadata. + +``/sys/kernel/debug/kho/in/sub_fdts/`` + Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs + of KHO producers passed from the old kernel. diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..9cb54b4ff5d9 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -151,8 +151,9 @@ generations less than or equal to ``min_gen_nr``. ``min_gen_nr`` should be less than ``max_gen_nr-1``, since ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to the active list) and therefore cannot be evicted. ``swappiness`` -overrides the default value in ``/proc/sys/vm/swappiness``. -``nr_to_reclaim`` limits the number of pages to evict. +overrides the default value in ``/proc/sys/vm/swappiness`` and the valid +range is [0-200, max], with max being exclusively used for the reclamation +of anonymous memory. ``nr_to_reclaim`` limits the number of pages to evict. A typical use case is that a job scheduler runs this command before it tries to land a new job on a server. If it fails to materialize enough diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index afce291649dd..e60e9211fd9b 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -250,6 +250,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_PFNZERO`` - Page has zero PFN - ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty +- ``PAGE_IS_GUARD`` - Page is a part of a guard region The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index d385985b305f..9bef46151d53 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -132,6 +132,12 @@ to latency spikes in unsuspecting applications. The kernel employs various heuristics to avoid wasting CPU cycles if it detects that proactive compaction is not being effective. +Setting the value above 80 will, in addition to lowering the acceptable level +of fragmentation, make the compaction code more sensitive to increases in +fragmentation, i.e. compaction will trigger more often, but reduce +fragmentation by a smaller amount. +This makes the fragmentation level more stable over time. + Be careful when setting it to extreme values like 100, as that may cause excessive background compaction activity. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index e9789bd381d8..7a4ca18ca6e2 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -115,6 +115,7 @@ more memory-management documentation in Documentation/mm/index.rst. pin_user_pages boot-time-mm gfp_mask-from-fs-io + kho/index Interfaces for kernel debugging =============================== diff --git a/Documentation/core-api/kho/bindings/kho.yaml b/Documentation/core-api/kho/bindings/kho.yaml new file mode 100644 index 000000000000..11e8ab7b219d --- /dev/null +++ b/Documentation/core-api/kho/bindings/kho.yaml @@ -0,0 +1,43 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Kexec HandOver (KHO) root tree + +maintainers: + - Mike Rapoport <rppt@kernel.org> + - Changyuan Lyu <changyuanl@google.com> + +description: | + System memory preserved by KHO across kexec. + +properties: + compatible: + enum: + - kho-v1 + + preserved-memory-map: + description: | + physical address (u64) of an in-memory structure describing all preserved + folios and memory ranges. + +patternProperties: + "$[0-9a-f_]+^": + $ref: sub-fdt.yaml# + description: physical address of a KHO user's own FDT. + +required: + - compatible + - preserved-memory-map + +additionalProperties: false + +examples: + - | + kho { + compatible = "kho-v1"; + preserved-memory-map = <0xf0be16 0x1000000>; + + memblock { + fdt = <0x80cc16 0x1000000>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/memblock/memblock.yaml b/Documentation/core-api/kho/bindings/memblock/memblock.yaml new file mode 100644 index 000000000000..d388c28eb91d --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/memblock.yaml @@ -0,0 +1,39 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory + +maintainers: + - Mike Rapoport <rppt@kernel.org> + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + The post-KHO kernel can then consume these reservations and they are + guaranteed to have the same physical address. + +properties: + compatible: + enum: + - reserve-mem-v1 + +patternProperties: + "$[0-9a-f_]+^": + $ref: reserve-mem.yaml# + description: reserved memory regions + +required: + - compatible + +additionalProperties: false + +examples: + - | + memblock { + compatible = "memblock-v1"; + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml new file mode 100644 index 000000000000..10282d3d1bcd --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml @@ -0,0 +1,40 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory regions + +maintainers: + - Mike Rapoport <rppt@kernel.org> + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + This object describes each such region. + +properties: + compatible: + enum: + - reserve-mem-v1 + + start: + description: | + physical address (u64) of the reserved memory region. + + size: + description: | + size (u64) of the reserved memory region. + +required: + - compatible + - start + - size + +additionalProperties: false + +examples: + - | + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; diff --git a/Documentation/core-api/kho/bindings/sub-fdt.yaml b/Documentation/core-api/kho/bindings/sub-fdt.yaml new file mode 100644 index 000000000000..b9a3d2d24850 --- /dev/null +++ b/Documentation/core-api/kho/bindings/sub-fdt.yaml @@ -0,0 +1,27 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: KHO users' FDT address + +maintainers: + - Mike Rapoport <rppt@kernel.org> + - Changyuan Lyu <changyuanl@google.com> + +description: | + Physical address of an FDT blob registered by a KHO user. + +properties: + fdt: + description: | + physical address (u64) of an FDT blob. + +required: + - fdt + +additionalProperties: false + +examples: + - | + memblock { + fdt = <0x80cc16 0x1000000>; + }; diff --git a/Documentation/core-api/kho/concepts.rst b/Documentation/core-api/kho/concepts.rst new file mode 100644 index 000000000000..36d5c05cfb30 --- /dev/null +++ b/Documentation/core-api/kho/concepts.rst @@ -0,0 +1,74 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later +.. _kho-concepts: + +======================= +Kexec Handover Concepts +======================= + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +It introduces multiple concepts: + +KHO FDT +======= + +Every KHO kexec carries a KHO specific flattened device tree (FDT) blob +that describes preserved memory regions. These regions contain either +serialized subsystem states, or in-memory data that shall not be touched +across kexec. After KHO, subsystems can retrieve and restore preserved +memory regions from KHO FDT. + +KHO only uses the FDT container format and libfdt library, but does not +adhere to the same property semantics that normal device trees do: Properties +are passed in native endianness and standardized properties like ``regs`` and +``ranges`` do not exist, hence there are no ``#...-cells`` properties. + +KHO is still under development. The FDT schema is unstable and would change +in the future. + +Scratch Regions +=============== + +To boot into kexec, we need to have a physically contiguous memory range that +contains no handed over memory. Kexec then places the target kernel and initrd +into that region. The new kernel exclusively uses this region for memory +allocations before during boot up to the initialization of the page allocator. + +We guarantee that we always have such regions through the scratch regions: On +first boot KHO allocates several physically contiguous memory regions. Since +after kexec these regions will be used by early memory allocations, there is a +scratch region per NUMA node plus a scratch region to satisfy allocations +requests that do not require particular NUMA node assignment. +By default, size of the scratch region is calculated based on amount of memory +allocated during boot. The ``kho_scratch`` kernel command line option may be +used to explicitly define size of the scratch regions. +The scratch regions are declared as CMA when page allocator is initialized so +that their memory can be used during system lifetime. CMA gives us the +guarantee that no handover pages land in that region, because handover pages +must be at a static physical memory location and CMA enforces that only +movable pages can be located inside. + +After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and +instead reuse the exact same region that was originally allocated. This allows +us to recursively execute any amount of KHO kexecs. Because we used this region +for boot memory allocations and as target memory for kexec blobs, some parts +of that memory region may be reserved. These reservations are irrelevant for +the next KHO, because kexec can overwrite even the original kernel. + +.. _kho-finalization-phase: + +KHO finalization phase +====================== + +To enable user space based kexec file loader, the kernel needs to be able to +provide the FDT that describes the current kernel's state before +performing the actual kexec. The process of generating that FDT is +called serialization. When the FDT is generated, some properties +of the system may become immutable because they are already written down +in the FDT. That state is called the KHO finalization phase. + +Public API +========== +.. kernel-doc:: kernel/kexec_handover.c + :export: diff --git a/Documentation/core-api/kho/fdt.rst b/Documentation/core-api/kho/fdt.rst new file mode 100644 index 000000000000..62505285d60d --- /dev/null +++ b/Documentation/core-api/kho/fdt.rst @@ -0,0 +1,80 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======= +KHO FDT +======= + +KHO uses the flattened device tree (FDT) container format and libfdt +library to create and parse the data that is passed between the +kernels. The properties in KHO FDT are stored in native format. +It includes the physical address of an in-memory structure describing +all preserved memory regions, as well as physical addresses of KHO users' +own FDTs. Interpreting those sub FDTs is the responsibility of KHO users. + +KHO nodes and properties +======================== + +Property ``preserved-memory-map`` +--------------------------------- + +KHO saves a special property named ``preserved-memory-map`` under the root node. +This node contains the physical address of an in-memory structure for KHO to +preserve memory regions across kexec. + +Property ``compatible`` +----------------------- + +The ``compatible`` property determines compatibility between the kernel +that created the KHO FDT and the kernel that attempts to load it. +If the kernel that loads the KHO FDT is not compatible with it, the entire +KHO process will be bypassed. + +Property ``fdt`` +---------------- + +Generally, a KHO user serialize its state into its own FDT and instructs +KHO to preserve the underlying memory, such that after kexec, the new kernel +can recover its state from the preserved FDT. + +A KHO user thus can create a node in KHO root tree and save the physical address +of its own FDT in that node's property ``fdt`` . + +Examples +======== + +The following example demonstrates KHO FDT that preserves two memory +regions created with ``reserve_mem`` kernel command line parameter:: + + /dts-v1/; + + / { + compatible = "kho-v1"; + + preserved-memory-map = <0x40be16 0x1000000>; + + memblock { + fdt = <0x1517 0x1000000>; + }; + }; + +where the ``memblock`` node contains an FDT that is requested by the +subsystem memblock for preservation. The FDT contains the following +serialized data:: + + /dts-v1/; + + / { + compatible = "memblock-v1"; + + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + + n2 { + compatible = "reserve-mem-v1"; + start = <0xc067 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst new file mode 100644 index 000000000000..0c63b0c5c143 --- /dev/null +++ b/Documentation/core-api/kho/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======================== +Kexec Handover Subsystem +======================== + +.. toctree:: + :maxdepth: 1 + + concepts + fdt + +.. only:: subproject and html diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index f12d33749329..ddc50db3afa4 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -54,7 +54,7 @@ monitoring are address-space dependent. DAMON consolidates these implementations in a layer called DAMON Operations Set, and defines the interface between it and the upper layer. The upper layer is dedicated for DAMON's core logics including the mechanism for control of the -monitoring accruracy and the overhead. +monitoring accuracy and the overhead. Hence, DAMON can easily be extended for any address space and/or available hardware features by configuring the core logic to use the appropriate @@ -550,10 +550,10 @@ aggressiveness (the quota) of the corresponding scheme. For example, if DAMOS is under achieving the goal, DAMOS automatically increases the quota. If DAMOS is over achieving the goal, it decreases the quota. -The goal can be specified with three parameters, namely ``target_metric``, -``target_value``, and ``current_value``. The auto-tuning mechanism tries to -make ``current_value`` of ``target_metric`` be same to ``target_value``. -Currently, two ``target_metric`` are provided. +The goal can be specified with four parameters, namely ``target_metric``, +``target_value``, ``current_value`` and ``nid``. The auto-tuning mechanism +tries to make ``current_value`` of ``target_metric`` be same to +``target_value``. - ``user_input``: User-provided value. Users could use any metric that they has interest in for the value. Use space main workload's latency or @@ -565,6 +565,11 @@ Currently, two ``target_metric`` are provided. in microseconds that measured from last quota reset to next quota reset. DAMOS does the measurement on its own, so only ``target_value`` need to be set by users at the initial time. In other words, DAMOS does self-feedback. +- ``node_mem_used_bp``: Specific NUMA node's used memory ratio in bp (1/10,000). +- ``node_mem_free_bp``: Specific NUMA node's free memory ratio in bp (1/10,000). + +``nid`` is optionally required for only ``node_mem_used_bp`` and +``node_mem_free_bp`` to point the specific NUMA node. To know how user-space can set the tuning goal metric, the target value, and/or the current value via :ref:`DAMON sysfs interface <sysfs_interface>`, refer to diff --git a/Documentation/mm/damon/index.rst b/Documentation/mm/damon/index.rst index 5a3359704cce..31c1fa955b3d 100644 --- a/Documentation/mm/damon/index.rst +++ b/Documentation/mm/damon/index.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -========================== -DAMON: Data Access MONitor -========================== +================================================================ +DAMON: Data Access MONitoring and Access-aware System Operations +================================================================ DAMON is a Linux kernel subsystem that provides a framework for data access monitoring and the monitoring results based system operations. The core diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst index 982215723582..3bf7f99cd7bb 100644 --- a/Documentation/networking/arcnet-hardware.rst +++ b/Documentation/networking/arcnet-hardware.rst @@ -3152,7 +3152,7 @@ Tiara (model unknown) --------------- - - from Christoph Lameter <christoph@lameter.com> + - from Christoph Lameter <cl@gentwo.org> Here is information about my card as far as I could figure it out:: |
