linux-toradex.git/include/linux/userfaultfd_k.h, branch v6.17-rc6

mm/mremap: use an explicit uffd failure path for mremap

2025-07-25T02:12:29+00:00

Right now it appears that the code is relying upon the returned
destination address having bits outside PAGE_MASK to indicate whether an
error value is specified, and decrementing the increased refcount on the
uffd ctx if so.

This is not a safe means of determining an error value, so instead, be
specific.  It makes far more sense to do so in a dedicated error path, so
add mremap_userfaultfd_fail() for this purpose and use this when an error
arises.

A vm_userfaultfd_ctx is not established until we are at the point where
mremap_userfaultfd_prep() is invoked in copy_vma_and_data(), so this is a
no-op until this happens.

That is - uffd remap notification only occurs if the VMA is actually moved
- at which point a UFFD_EVENT_REMAP event is raised.

No errors can occur after this point currently, though it's certainly not
guaranteed this will always remain the case, and we mustn't rely on this.

However, the reason for needing to handle this case is that, when an error
arises on a VMA move at the point of adjusting page tables, we revert this
operation, and propagate the error.

At this point, it is not correct to raise a uffd remap event, and we must
handle it.

This refactoring makes it abundantly clear what we are doing.

We assume vrm->new_addr is always valid, which a prior change made the
case even for mremap() invocations which don't move the VMA, however given
no uffd context would be set up in this case it's immaterial to this
change anyway.

No functional change intended.

Link: https://lkml.kernel.org/r/a70e8a1f7bce9f43d1431065b414e0f212297297.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes 
Reviewed-by: Vlastimil Babka 
Cc: Al Viro 
Cc: Christian Brauner 
Cc: Jan Kara 
Cc: Jann Horn 
Cc: Liam Howlett 
Cc: Peter Xu 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton

mm: update core kernel code to use vm_flags_t consistently

2025-07-10T05:42:13+00:00

The core kernel code is currently very inconsistent in its use of
vm_flags_t vs.  unsigned long.  This prevents us from changing the type of
vm_flags_t in the future and is simply not correct, so correct this.

While this results in rather a lot of churn, it is a critical
pre-requisite for a future planned change to VMA flag type.

Additionally, update VMA userland tests to account for the changes.

To make review easier and to break things into smaller parts, driver and
architecture-specific changes is left for a subsequent commit.

The code has been adjusted to cascade the changes across all calling code
as far as is needed.

We will adjust architecture-specific and driver code in a subsequent patch.

Overall, this patch does not introduce any functional change.

Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes 
Acked-by: Kees Cook 
Acked-by: Mike Rapoport (Microsoft) 
Acked-by: Jan Kara 
Acked-by: Christian Brauner 
Reviewed-by: Vlastimil Babka 
Acked-by: Oscar Salvador 
Reviewed-by: Pedro Falcato 
Acked-by: Zi Yan 
Acked-by: David Hildenbrand 
Reviewed-by: Anshuman Khandual 
Cc: Jann Horn 
Cc: Liam R. Howlett 
Cc: Catalin Marinas 
Cc: Jarkko Sakkinen 
Signed-off-by: Andrew Morton

userfaultfd: remove UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET

2025-07-10T05:42:01+00:00

UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET have been unused since
they were added in commit 932b18e0aec6 ("userfaultfd:
linux/userfaultfd_k.h").  Remove them and the associated BUILD_BUG_ON()
checks.

Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-4-a7274d3bd5e4@columbia.edu
Signed-off-by: Tal Zussman 
Acked-by: David Hildenbrand 
Acked-by: Peter Xu 
Cc: Al Viro 
Cc: Andrea Arcangeli 
Cc: Christian Brauner 
Cc: Jan Kara 
Cc: Jason A. Donenfeld 
Signed-off-by: Andrew Morton

userfaultfd: correctly prevent registering VM_DROPPABLE regions

2025-07-10T05:42:00+00:00

Patch series "mm: userfaultfd: assorted fixes and cleanups", v3.

Two fixes and two cleanups for userfaultfd.

Note that the third patch yields a small change in the ABI, but we seem to
have concluded that it is acceptable in this case.


This patch (of 4):

vma_can_userfault() masks off non-userfaultfd VM flags from vm_flags.  The
vm_flags & VM_DROPPABLE test will then always be false, incorrectly
allowing VM_DROPPABLE regions to be registered with userfaultfd.

Additionally, vm_flags is not guaranteed to correspond to the actual VMA's
flags.  Fix this test by checking the VMA's flags directly.

Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-0-a7274d3bd5e4@columbia.edu
Link: https://lore.kernel.org/linux-mm/5a875a3a-2243-4eab-856f-bc53ccfec3ea@redhat.com/
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-1-a7274d3bd5e4@columbia.edu
Fixes: 9651fcedf7b9 ("mm: add MAP_DROPPABLE for designating always lazily freeable mappings")
Signed-off-by: Tal Zussman 
Acked-by: David Hildenbrand 
Acked-by: Peter Xu 
Acked-by: Jason A. Donenfeld 
Cc: Al Viro 
Cc: Andrea Arcangeli 
Cc: Christian Brauner 
Cc: Jan Kara 
Signed-off-by: Andrew Morton

mm: clear uffd-wp PTE/PMD state on mremap()

2025-01-13T03:03:37+00:00

When mremap()ing a memory region previously registered with userfaultfd as
write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in
flag clearing leads to a mismatch between the vma flags (which have
uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp
cleared).  This mismatch causes a subsequent mprotect(PROT_WRITE) to
trigger a warning in page_table_check_pte_flags() due to setting the pte
to writable while uffd-wp is still set.

Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any
such mremap() so that the values are consistent with the existing clearing
of VM_UFFD_WP.  Be careful to clear the logical flag regardless of its
physical form; a PTE bit, a swap PTE bit, or a PTE marker.  Cover PTE,
huge PMD and hugetlb paths.

Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com
Co-developed-by: Mikołaj Lenczewski 
Signed-off-by: Mikołaj Lenczewski 
Signed-off-by: Ryan Roberts 
Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/
Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Cc: David Hildenbrand 
Cc: Jann Horn 
Cc: Liam R. Howlett 
Cc: Lorenzo Stoakes 
Cc: Mark Rutland 
Cc: Muchun Song 
Cc: Peter Xu 
Cc: Shuah Khan 
Cc: Vlastimil Babka 
Cc: 
Signed-off-by: Andrew Morton

fork: do not invoke uffd on fork if error occurs

2024-10-29T04:40:38+00:00

Patch series "fork: do not expose incomplete mm on fork".

During fork we may place the virtual memory address space into an
inconsistent state before the fork operation is complete.

In addition, we may encounter an error during the fork operation that
indicates that the virtual memory address space is invalidated.

As a result, we should not be exposing it in any way to external machinery
that might interact with the mm or VMAs, machinery that is not designed to
deal with incomplete state.

We specifically update the fork logic to defer khugepaged and ksm to the
end of the operation and only to be invoked if no error arose, and
disallow uffd from observing fork events should an error have occurred.


This patch (of 2):

Currently on fork we expose the virtual address space of a process to
userland unconditionally if uffd is registered in VMAs, regardless of
whether an error arose in the fork.

This is performed in dup_userfaultfd_complete() which is invoked
unconditionally, and performs two duties - invoking registered handlers
for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down
userfaultfd_fork_ctx objects established in dup_userfaultfd().

This is problematic, because the virtual address space may not yet be
correctly initialised if an error arose.

The change in commit d24062914837 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.

We address this by, on fork error, ensuring that we roll back state that
we would otherwise expect to clean up through the event being handled by
userland and perform the memory freeing duty otherwise performed by
dup_userfaultfd_complete().

We do this by implementing a new function, dup_userfaultfd_fail(), which
performs the same loop, only decrementing reference counts.

Note that we perform mmgrab() on the parent and child mm's, however
userfaultfd_ctx_put() will mmdrop() this once the reference count drops to
zero, so we will avoid memory leaks correctly here.

Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes 
Reported-by: Jann Horn 
Reviewed-by: Jann Horn 
Reviewed-by: Liam R. Howlett 
Cc: Alexander Viro 
Cc: Christian Brauner 
Cc: Jan Kara 
Cc: Linus Torvalds 
Cc: Vlastimil Babka 
Cc: 
Signed-off-by: Andrew Morton

userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c

2024-09-02T03:25:53+00:00

Patch series "Make core VMA operations internal and testable", v4.

There are a number of "core" VMA manipulation functions implemented in
mm/mmap.c, notably those concerning VMA merging, splitting, modifying,
expanding and shrinking, which logically don't belong there.

More importantly this functionality represents an internal implementation
detail of memory management and should not be exposed outside of mm/
itself.

This patch series isolates core VMA manipulation functionality into its
own file, mm/vma.c, and provides an API to the rest of the mm code in
mm/vma.h.

Importantly, it also carefully implements mm/vma_internal.h, which
specifies which headers need to be imported by vma.c, leading to the very
useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h.

This means we can then re-implement vma_internal.h in userland, adding
shims for kernel mechanisms as required, allowing us to unit test internal
VMA functionality.

This testing is useful as opposed to an e.g.  kunit implementation as this
way we can avoid all external kernel side-effects while testing, run tests
VERY quickly, and iterate on and debug problems quickly.

Excitingly this opens the door to, in the future, recreating precise
problems observed in production in userland and very quickly debugging
problems that might otherwise be very difficult to reproduce.

This patch series takes advantage of existing shim logic and full userland
maple tree support contained in tools/testing/radix-tree/ and
tools/include/linux/, separating out shared components of the radix tree
implementation to provide this testing.

Kernel functionality is stubbed and shimmed as needed in
tools/testing/vma/ which contains a fully functional userland
vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly
tested from userland.

A simple, skeleton testing implementation is provided in
tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA
merge, modify (testing split), expand and shrink functionality work
correctly.


This patch (of 4):

This patch forms part of a patch series intending to separate out VMA
logic and render it testable from userspace, which requires that core
manipulation functions be exposed in an mm/-internal header file.

In order to do this, we must abstract APIs we wish to test, in this
instance functions which ultimately invoke vma_modify().

This patch therefore moves all logic which ultimately invokes vma_modify()
to mm/userfaultfd.c, trying to transfer code at a functional granularity
where possible.

[lorenzo.stoakes@oracle.com: fix user-after-free in userfaultfd_clear_vma()]
  Link: https://lkml.kernel.org/r/3c947ddc-b804-49b7-8fe9-3ea3ca13def5@lucifer.local
Link: https://lkml.kernel.org/r/cover.1722251717.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/50c3ed995fd81c45876c86304c8a00bf3e396cfd.1722251717.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Liam R. Howlett 
Cc: Alexander Viro 
Cc: Brendan Higgins 
Cc: Christian Brauner 
Cc: David Gow 
Cc: Eric W. Biederman 
Cc: Jan Kara 
Cc: Kees Cook 
Cc: Matthew Wilcox (Oracle) 
Cc: Rae Moar 
Cc: SeongJae Park 
Cc: Shuah Khan 
Cc: Suren Baghdasaryan 
Cc: Pengfei Xu 
Signed-off-by: Andrew Morton

mm: add MAP_DROPPABLE for designating always lazily freeable mappings

2024-07-19T18:22:12+00:00

The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:

- It shouldn't be written to core dumps.
  * Easy: VM_DONTDUMP.
- It should be zeroed on fork.
  * Easy: VM_WIPEONFORK.

- It shouldn't be written to swap.
  * Uh-oh: mlock is rlimited.
  * Uh-oh: mlock isn't inherited by forks.

- It shouldn't reserve actual memory, but it also shouldn't crash when
  page faulting in memory if none is available
  * Uh-oh: VM_NORESERVE means segfaults.

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
   having the contents of the pages it reads zeroed out midway through
   the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
   we have the option to fallback to the getrandom() syscall, and
   everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
   60 seconds -- a sort of cache, rather than a long term allocation.

These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:

a) It never is written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
   zero when read back again).
c) It is inherited by fork.
d) It doesn't count against the mlock budget, since nothing is locked.
e) If there's not enough memory to service a page fault, it's not fatal,
   and no signal is sent.

This way, allocations used by vDSO getrandom() can use:

    VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problem with OOMing, crashing on overcommitment,
using memory when not in use, not wiping on fork(), coredumps, or
writing out to swap.

In order to let vDSO getrandom() use this, expose these via mmap(2) as
MAP_DROPPABLE.

Note that this involves removing the MADV_FREE special case from
sort_folio(), which according to Yu Zhao is unnecessary and will simply
result in an extra call to shrink_folio_list() in the worst case. The
chunk removed reenables the swapbacked flag, which we don't want for
VM_DROPPABLE, and we can't conditionalize it here because there isn't a
vma reference available.

Finally, the provided self test ensures that this is working as desired.

Cc: linux-mm@kvack.org
Acked-by: David Hildenbrand 
Signed-off-by: Jason A. Donenfeld

userfaultfd: use per-vma locks in userfaultfd operations

2024-02-22T23:27:20+00:00

All userfaultfd operations, except write-protect, opportunistically use
per-vma locks to lock vmas.  On failure, attempt again inside mmap_lock
critical section.

Write-protect operation requires mmap_lock as it iterates over multiple
vmas.

Link: https://lkml.kernel.org/r/20240215182756.3448972-5-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra 
Reviewed-by: Liam R. Howlett 
Cc: Andrea Arcangeli 
Cc: Axel Rasmussen 
Cc: Brian Geffon 
Cc: David Hildenbrand 
Cc: Jann Horn 
Cc: Kalesh Singh 
Cc: Matthew Wilcox (Oracle) 
Cc: Mike Rapoport (IBM) 
Cc: Nicolas Geoffray 
Cc: Peter Xu 
Cc: Ryan Roberts 
Cc: Suren Baghdasaryan 
Cc: Tim Murray 
Signed-off-by: Andrew Morton

userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx

2024-02-22T23:27:20+00:00

Increments and loads to mmap_changing are always in mmap_lock critical
section.  This ensures that if userspace requests event notification for
non-cooperative operations (e.g.  mremap), userfaultfd operations don't
occur concurrently.

This can be achieved by using a separate read-write semaphore in
userfaultfd_ctx such that increments are done in write-mode and loads in
read-mode, thereby eliminating the dependency on mmap_lock for this
purpose.

This is a preparatory step before we replace mmap_lock usage with per-vma
locks in fill/move ioctls.

Link: https://lkml.kernel.org/r/20240215182756.3448972-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra 
Reviewed-by: Mike Rapoport (IBM) 
Reviewed-by: Liam R. Howlett 
Cc: Andrea Arcangeli 
Cc: Axel Rasmussen 
Cc: Brian Geffon 
Cc: David Hildenbrand 
Cc: Jann Horn 
Cc: Kalesh Singh 
Cc: Matthew Wilcox (Oracle) 
Cc: Nicolas Geoffray 
Cc: Peter Xu 
Cc: Ryan Roberts 
Cc: Suren Baghdasaryan 
Cc: Tim Murray 
Signed-off-by: Andrew Morton