<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/uapi/linux/userfaultfd.h, branch v6.0-rc6</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>mm/uffd: enable write protection for shmem &amp; hugetlbfs</title>
<updated>2022-05-13T14:20:11+00:00</updated>
<author>
<name>Peter Xu</name>
<email>peterx@redhat.com</email>
</author>
<published>2022-05-13T03:22:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b1f9e876862d8f7176299ec4fb2108bc1045cbc8'/>
<id>b1f9e876862d8f7176299ec4fb2108bc1045cbc8</id>
<content type='text'>
We've had all the necessary changes ready for both shmem and hugetlbfs. 
Turn on all the shmem/hugetlbfs switches for userfaultfd-wp.

We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too
because all existing types now support write protection mode.

Since vma_can_userfault() will be used elsewhere, move into userfaultfd_k.h.

Link: https://lkml.kernel.org/r/20220405014926.15101-1-peterx@redhat.com
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Alistair Popple &lt;apopple@nvidia.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: "Kirill A . Shutemov" &lt;kirill@shutemov.name&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Nadav Amit &lt;nadav.amit@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We've had all the necessary changes ready for both shmem and hugetlbfs. 
Turn on all the shmem/hugetlbfs switches for userfaultfd-wp.

We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too
because all existing types now support write protection mode.

Since vma_can_userfault() will be used elsewhere, move into userfaultfd_k.h.

Link: https://lkml.kernel.org/r/20220405014926.15101-1-peterx@redhat.com
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Alistair Popple &lt;apopple@nvidia.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: "Kirill A . Shutemov" &lt;kirill@shutemov.name&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Nadav Amit &lt;nadav.amit@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>userfaultfd: provide unmasked address on page-fault</title>
<updated>2022-03-22T22:57:08+00:00</updated>
<author>
<name>Nadav Amit</name>
<email>namit@vmware.com</email>
</author>
<published>2022-03-22T21:45:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=824ddc601adc2cc48efb7f58b57997986c1c1276'/>
<id>824ddc601adc2cc48efb7f58b57997986c1c1276</id>
<content type='text'>
Userfaultfd is supposed to provide the full address (i.e., unmasked) of
the faulting access back to userspace.  However, that is not the case for
quite some time.

Even running "userfaultfd_demo" from the userfaultfd man page provides the
wrong output (and contradicts the man page).  Notice that
"UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and
not the first read address (0x7fc5e30b300f).

	Address returned by mmap() = 0x7fc5e30b3000

	fault_handler_thread():
	    poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
	    UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000
		(uffdio_copy.copy returned 4096)
	Read address 0x7fc5e30b300f in main(): A
	Read address 0x7fc5e30b340f in main(): A
	Read address 0x7fc5e30b380f in main(): A
	Read address 0x7fc5e30b3c0f in main(): A

The exact address is useful for various reasons and specifically for
prefetching decisions.  If it is known that the memory is populated by
certain objects whose size is not page-aligned, then based on the faulting
address, the uffd-monitor can decide whether to prefetch and prefault the
adjacent page.

This bug has been for quite some time in the kernel: since commit
1a29d85eb0f1 ("mm: use vmf-&gt;address instead of of vmf-&gt;virtual_address")
vmf-&gt;virtual_address"), which dates back to 2016.  A concern has been
raised that existing userspace application might rely on the old/wrong
behavior in which the address is masked.  Therefore, it was suggested to
provide the masked address unless the user explicitly asks for the exact
address.

Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct
userfaultfd to provide the exact address.  Add a new "real_address" field
to vmf to hold the unmasked address.  Provide the address to userspace
accordingly.

Initialize real_address in various code-paths to be consistent with
address, even when it is not used, to be on the safe side.

[namit@vmware.com: initialize real_address on all code paths, per Jan]
  Link: https://lkml.kernel.org/r/20220226022655.350562-1-namit@vmware.com
[akpm@linux-foundation.org: fix typo in comment, per Jan]

Link: https://lkml.kernel.org/r/20220218041003.3508-1-namit@vmware.com
Signed-off-by: Nadav Amit &lt;namit@vmware.com&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Userfaultfd is supposed to provide the full address (i.e., unmasked) of
the faulting access back to userspace.  However, that is not the case for
quite some time.

Even running "userfaultfd_demo" from the userfaultfd man page provides the
wrong output (and contradicts the man page).  Notice that
"UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and
not the first read address (0x7fc5e30b300f).

	Address returned by mmap() = 0x7fc5e30b3000

	fault_handler_thread():
	    poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
	    UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000
		(uffdio_copy.copy returned 4096)
	Read address 0x7fc5e30b300f in main(): A
	Read address 0x7fc5e30b340f in main(): A
	Read address 0x7fc5e30b380f in main(): A
	Read address 0x7fc5e30b3c0f in main(): A

The exact address is useful for various reasons and specifically for
prefetching decisions.  If it is known that the memory is populated by
certain objects whose size is not page-aligned, then based on the faulting
address, the uffd-monitor can decide whether to prefetch and prefault the
adjacent page.

This bug has been for quite some time in the kernel: since commit
1a29d85eb0f1 ("mm: use vmf-&gt;address instead of of vmf-&gt;virtual_address")
vmf-&gt;virtual_address"), which dates back to 2016.  A concern has been
raised that existing userspace application might rely on the old/wrong
behavior in which the address is masked.  Therefore, it was suggested to
provide the masked address unless the user explicitly asks for the exact
address.

Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct
userfaultfd to provide the exact address.  Add a new "real_address" field
to vmf to hold the unmasked address.  Provide the address to userspace
accordingly.

Initialize real_address in various code-paths to be consistent with
address, even when it is not used, to be on the safe side.

[namit@vmware.com: initialize real_address on all code paths, per Jan]
  Link: https://lkml.kernel.org/r/20220226022655.350562-1-namit@vmware.com
[akpm@linux-foundation.org: fix typo in comment, per Jan]

Link: https://lkml.kernel.org/r/20220218041003.3508-1-namit@vmware.com
Signed-off-by: Nadav Amit &lt;namit@vmware.com&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>userfaultfd/shmem: advertise shmem minor fault support</title>
<updated>2021-07-01T03:47:27+00:00</updated>
<author>
<name>Axel Rasmussen</name>
<email>axelrasmussen@google.com</email>
</author>
<published>2021-07-01T01:49:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=964ab0040ff9598783bf37776b5e31b27b50e293'/>
<id>964ab0040ff9598783bf37776b5e31b27b50e293</id>
<content type='text'>
Now that the feature is fully implemented (the faulting path hooks exist
so userspace is notified, and the ioctl to resolve such faults is
available), advertise this as a supported feature.

Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: Joe Perches &lt;joe@perches.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Oliver Upton &lt;oupton@google.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Shuah Khan &lt;shuah@kernel.org&gt;
Cc: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Cc: Wang Qing &lt;wangqing@vivo.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now that the feature is fully implemented (the faulting path hooks exist
so userspace is notified, and the ioctl to resolve such faults is
available), advertise this as a supported feature.

Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: Joe Perches &lt;joe@perches.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Oliver Upton &lt;oupton@google.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Shuah Khan &lt;shuah@kernel.org&gt;
Cc: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Cc: Wang Qing &lt;wangqing@vivo.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>userfaultfd: uapi: fix UFFDIO_CONTINUE ioctl request definition</title>
<updated>2021-06-25T17:53:26+00:00</updated>
<author>
<name>Gleb Fotengauer-Malinovskiy</name>
<email>glebfm@altlinux.org</email>
</author>
<published>2021-06-25T17:36:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=808e9df477757955a9644ca323010339be0c40ee'/>
<id>808e9df477757955a9644ca323010339be0c40ee</id>
<content type='text'>
This ioctl request reads from uffdio_continue structure written by
userspace which justifies _IOC_WRITE flag.  It also writes back to that
structure which justifies _IOC_READ flag.

See NOTEs in include/uapi/asm-generic/ioctl.h for more information.

Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
Signed-off-by: Gleb Fotengauer-Malinovskiy &lt;glebfm@altlinux.org&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Reviewed-by: Dmitry V. Levin &lt;ldv@altlinux.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This ioctl request reads from uffdio_continue structure written by
userspace which justifies _IOC_WRITE flag.  It also writes back to that
structure which justifies _IOC_READ flag.

See NOTEs in include/uapi/asm-generic/ioctl.h for more information.

Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
Signed-off-by: Gleb Fotengauer-Malinovskiy &lt;glebfm@altlinux.org&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Reviewed-by: Dmitry V. Levin &lt;ldv@altlinux.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>userfaultfd: add UFFDIO_CONTINUE ioctl</title>
<updated>2021-05-05T18:27:22+00:00</updated>
<author>
<name>Axel Rasmussen</name>
<email>axelrasmussen@google.com</email>
</author>
<published>2021-05-05T01:35:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f619147104c8ea71e120e4936d2b68ec11a1e527'/>
<id>f619147104c8ea71e120e4936d2b68ec11a1e527</id>
<content type='text'>
This ioctl is how userspace ought to resolve "minor" userfaults.  The
idea is, userspace is notified that a minor fault has occurred.  It
might change the contents of the page using its second non-UFFD mapping,
or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".

Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.

It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`.  We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case.  (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)

Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Reviewed-by: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Adam Ruprecht &lt;ruprecht@google.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Cannon Matthews &lt;cannonmatthews@google.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Chinwen Chang &lt;chinwen.chang@mediatek.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Cc: "Matthew Wilcox (Oracle)" &lt;willy@infradead.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: "Michal Koutn" &lt;mkoutny@suse.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Oliver Upton &lt;oupton@google.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Shawn Anastasio &lt;shawn@anastas.io&gt;
Cc: Steven Price &lt;steven.price@arm.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This ioctl is how userspace ought to resolve "minor" userfaults.  The
idea is, userspace is notified that a minor fault has occurred.  It
might change the contents of the page using its second non-UFFD mapping,
or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".

Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.

It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`.  We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case.  (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)

Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Reviewed-by: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Adam Ruprecht &lt;ruprecht@google.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Cannon Matthews &lt;cannonmatthews@google.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Chinwen Chang &lt;chinwen.chang@mediatek.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Cc: "Matthew Wilcox (Oracle)" &lt;willy@infradead.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: "Michal Koutn" &lt;mkoutny@suse.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Oliver Upton &lt;oupton@google.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Shawn Anastasio &lt;shawn@anastas.io&gt;
Cc: Steven Price &lt;steven.price@arm.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>userfaultfd: add minor fault registration mode</title>
<updated>2021-05-05T18:27:22+00:00</updated>
<author>
<name>Axel Rasmussen</name>
<email>axelrasmussen@google.com</email>
</author>
<published>2021-05-05T01:35:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7677f7fd8be76659cd2d0db8ff4093bbb51c20e5'/>
<id>7677f7fd8be76659cd2d0db8ff4093bbb51c20e5</id>
<content type='text'>
Patch series "userfaultfd: add minor fault handling", v9.

Overview
========

This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults.  By "minor" fault, I mean the following
situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory).  One of the mappings is registered with userfaultfd (in minor
mode), and the other is not.  Via the non-UFFD mapping, the underlying
pages have already been allocated &amp; filled with some contents.  The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault.  As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...).  In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.

Interaction with Existing APIs
==============================

Because this is a feature, a registered VMA could potentially receive both
missing and minor faults.  I spent some time thinking through how the
existing API interacts with the new feature:

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated.  This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).

Future Work
===========

This series only supports hugetlbfs.  I have a second series in flight to
support shmem as well, extending the functionality.  This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.

This patch (of 6):

This feature allows userspace to intercept "minor" faults.  By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not.  Via the non-UFFD mapping, the underlying pages have already been
allocated &amp; filled with some contents.  The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault.  As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.

This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered.  In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.

This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].

However, doing it this was requires we allocate a VM_* flag for the new
registration mode.  On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.

[1] https://lore.kernel.org/patchwork/patch/1380226/

[peterx@redhat.com: fix minor fault page leak]
  Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com

Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Reviewed-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Chinwen Chang &lt;chinwen.chang@mediatek.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Cc: "Matthew Wilcox (Oracle)" &lt;willy@infradead.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: "Michal Koutn" &lt;mkoutny@suse.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Shawn Anastasio &lt;shawn@anastas.io&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Steven Price &lt;steven.price@arm.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Adam Ruprecht &lt;ruprecht@google.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: Cannon Matthews &lt;cannonmatthews@google.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Oliver Upton &lt;oupton@google.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Patch series "userfaultfd: add minor fault handling", v9.

Overview
========

This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults.  By "minor" fault, I mean the following
situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory).  One of the mappings is registered with userfaultfd (in minor
mode), and the other is not.  Via the non-UFFD mapping, the underlying
pages have already been allocated &amp; filled with some contents.  The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault.  As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...).  In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.

Interaction with Existing APIs
==============================

Because this is a feature, a registered VMA could potentially receive both
missing and minor faults.  I spent some time thinking through how the
existing API interacts with the new feature:

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated.  This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).

Future Work
===========

This series only supports hugetlbfs.  I have a second series in flight to
support shmem as well, extending the functionality.  This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.

This patch (of 6):

This feature allows userspace to intercept "minor" faults.  By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not.  Via the non-UFFD mapping, the underlying pages have already been
allocated &amp; filled with some contents.  The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault.  As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.

This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered.  In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.

This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].

However, doing it this was requires we allocate a VM_* flag for the new
registration mode.  On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.

[1] https://lore.kernel.org/patchwork/patch/1380226/

[peterx@redhat.com: fix minor fault page leak]
  Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com

Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Reviewed-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Chinwen Chang &lt;chinwen.chang@mediatek.com&gt;
Cc: Huang Ying &lt;ying.huang@intel.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Cc: "Matthew Wilcox (Oracle)" &lt;willy@infradead.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: "Michal Koutn" &lt;mkoutny@suse.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Shawn Anastasio &lt;shawn@anastas.io&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Steven Price &lt;steven.price@arm.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Adam Ruprecht &lt;ruprecht@google.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: Cannon Matthews &lt;cannonmatthews@google.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Oliver Upton &lt;oupton@google.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>userfaultfd: add UFFD_USER_MODE_ONLY</title>
<updated>2020-12-15T20:13:46+00:00</updated>
<author>
<name>Lokesh Gidra</name>
<email>lokeshgidra@google.com</email>
</author>
<published>2020-12-15T03:13:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=37cd0575b8510159992d279c530c05f872990b02'/>
<id>37cd0575b8510159992d279c530c05f872990b02</id>
<content type='text'>
Patch series "Control over userfaultfd kernel-fault handling", v6.

This patch series is split from [1].  The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.

It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel.  For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3].  Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.

This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object.  It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.

The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.

[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808

This patch (of 2):

userfaultfd handles page faults from both user and kernel code.  Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.

A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.

Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione &lt;dancol@google.com&gt;
Signed-off-by: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Reviewed-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: &lt;calin@google.com&gt;
Cc: Daniel Colascione &lt;dancol@dancol.org&gt;
Cc: Eric Biggers &lt;ebiggers@kernel.org&gt;
Cc: Iurii Zaikin &lt;yzaikin@google.com&gt;
Cc: Jeff Vander Stoep &lt;jeffv@google.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: "Joel Fernandes (Google)" &lt;joel@joelfernandes.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Mauro Carvalho Chehab &lt;mchehab+huawei@kernel.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Nitin Gupta &lt;nigupta@nvidia.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Stephen Smalley &lt;stephen.smalley.work@gmail.com&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Patch series "Control over userfaultfd kernel-fault handling", v6.

This patch series is split from [1].  The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.

It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel.  For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3].  Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.

This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object.  It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.

The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.

[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808

This patch (of 2):

userfaultfd handles page faults from both user and kernel code.  Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.

A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.

Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione &lt;dancol@google.com&gt;
Signed-off-by: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Reviewed-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: &lt;calin@google.com&gt;
Cc: Daniel Colascione &lt;dancol@dancol.org&gt;
Cc: Eric Biggers &lt;ebiggers@kernel.org&gt;
Cc: Iurii Zaikin &lt;yzaikin@google.com&gt;
Cc: Jeff Vander Stoep &lt;jeffv@google.com&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: "Joel Fernandes (Google)" &lt;joel@joelfernandes.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Mauro Carvalho Chehab &lt;mchehab+huawei@kernel.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Nitin Gupta &lt;nigupta@nvidia.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Stephen Smalley &lt;stephen.smalley.work@gmail.com&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>userfaultfd: wp: enabled write protection in userfaultfd API</title>
<updated>2020-04-07T17:43:39+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2020-04-07T03:06:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e06f1e1dd4998ffc9da37f580703b55a93fc4de4'/>
<id>e06f1e1dd4998ffc9da37f580703b55a93fc4de4</id>
<content type='text'>
Now it's safe to enable write protection in userfaultfd API

Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Jerome Glisse &lt;jglisse@redhat.com&gt;
Reviewed-by: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Bobby Powers &lt;bobbypowers@gmail.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Denis Plotnikov &lt;dplotnikov@virtuozzo.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Martin Cracauer &lt;cracauer@cons.org&gt;
Cc: Marty McFadden &lt;mcfadden8@llnl.gov&gt;
Cc: Maya Gokhale &lt;gokhale2@llnl.gov&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Pavel Emelyanov &lt;xemul@openvz.org&gt;
Link: http://lkml.kernel.org/r/20200220163112.11409-15-peterx@redhat.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now it's safe to enable write protection in userfaultfd API

Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Jerome Glisse &lt;jglisse@redhat.com&gt;
Reviewed-by: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Bobby Powers &lt;bobbypowers@gmail.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Denis Plotnikov &lt;dplotnikov@virtuozzo.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Martin Cracauer &lt;cracauer@cons.org&gt;
Cc: Marty McFadden &lt;mcfadden8@llnl.gov&gt;
Cc: Maya Gokhale &lt;gokhale2@llnl.gov&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Pavel Emelyanov &lt;xemul@openvz.org&gt;
Link: http://lkml.kernel.org/r/20200220163112.11409-15-peterx@redhat.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>userfaultfd: wp: add the writeprotect API to userfaultfd ioctl</title>
<updated>2020-04-07T17:43:39+00:00</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2020-04-07T03:06:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=63b2d4174c4ad1f40b48d7138e71bcb564c1fe03'/>
<id>63b2d4174c4ad1f40b48d7138e71bcb564c1fe03</id>
<content type='text'>
Introduce the new uffd-wp APIs for userspace.

Firstly, we'll allow to do UFFDIO_REGISTER with write protection tracking
using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this flag can
co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the
userspace program can not only resolve missing page faults, and at the
same time tracking page data changes along the way.

Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page level
write protection tracking.  Note that we will need to register the memory
region with UFFDIO_REGISTER_MODE_WP before that.

[peterx@redhat.com: write up the commit message]
[peterx@redhat.com: remove useless block, write commit message, check against
 VM_MAYWRITE rather than VM_WRITE when register]
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: Bobby Powers &lt;bobbypowers@gmail.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Denis Plotnikov &lt;dplotnikov@virtuozzo.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: "Kirill A . Shutemov" &lt;kirill@shutemov.name&gt;
Cc: Martin Cracauer &lt;cracauer@cons.org&gt;
Cc: Marty McFadden &lt;mcfadden8@llnl.gov&gt;
Cc: Maya Gokhale &lt;gokhale2@llnl.gov&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Pavel Emelyanov &lt;xemul@openvz.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Introduce the new uffd-wp APIs for userspace.

Firstly, we'll allow to do UFFDIO_REGISTER with write protection tracking
using the new UFFDIO_REGISTER_MODE_WP flag.  Note that this flag can
co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the
userspace program can not only resolve missing page faults, and at the
same time tracking page data changes along the way.

Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page level
write protection tracking.  Note that we will need to register the memory
region with UFFDIO_REGISTER_MODE_WP before that.

[peterx@redhat.com: write up the commit message]
[peterx@redhat.com: remove useless block, write commit message, check against
 VM_MAYWRITE rather than VM_WRITE when register]
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: Bobby Powers &lt;bobbypowers@gmail.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Denis Plotnikov &lt;dplotnikov@virtuozzo.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: "Kirill A . Shutemov" &lt;kirill@shutemov.name&gt;
Cc: Martin Cracauer &lt;cracauer@cons.org&gt;
Cc: Marty McFadden &lt;mcfadden8@llnl.gov&gt;
Cc: Maya Gokhale &lt;gokhale2@llnl.gov&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Pavel Emelyanov &lt;xemul@openvz.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>userfaultfd: wp: add UFFDIO_COPY_MODE_WP</title>
<updated>2020-04-07T17:43:39+00:00</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2020-04-07T03:05:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=72981e0e7b609c741d7764cc920c8fec00920bd5'/>
<id>72981e0e7b609c741d7764cc920c8fec00920bd5</id>
<content type='text'>
This allows UFFDIO_COPY to map pages write-protected.

[peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
 around "dst_vma-&gt;vm_flags &amp; VM_WRITE"; fix wordings in comments and
 commit messages]
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Jerome Glisse &lt;jglisse@redhat.com&gt;
Reviewed-by: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Bobby Powers &lt;bobbypowers@gmail.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Denis Plotnikov &lt;dplotnikov@virtuozzo.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: "Kirill A . Shutemov" &lt;kirill@shutemov.name&gt;
Cc: Martin Cracauer &lt;cracauer@cons.org&gt;
Cc: Marty McFadden &lt;mcfadden8@llnl.gov&gt;
Cc: Maya Gokhale &lt;gokhale2@llnl.gov&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Pavel Emelyanov &lt;xemul@openvz.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This allows UFFDIO_COPY to map pages write-protected.

[peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
 around "dst_vma-&gt;vm_flags &amp; VM_WRITE"; fix wordings in comments and
 commit messages]
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Jerome Glisse &lt;jglisse@redhat.com&gt;
Reviewed-by: Mike Rapoport &lt;rppt@linux.vnet.ibm.com&gt;
Cc: Bobby Powers &lt;bobbypowers@gmail.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Denis Plotnikov &lt;dplotnikov@virtuozzo.com&gt;
Cc: "Dr . David Alan Gilbert" &lt;dgilbert@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: "Kirill A . Shutemov" &lt;kirill@shutemov.name&gt;
Cc: Martin Cracauer &lt;cracauer@cons.org&gt;
Cc: Marty McFadden &lt;mcfadden8@llnl.gov&gt;
Cc: Maya Gokhale &lt;gokhale2@llnl.gov&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Pavel Emelyanov &lt;xemul@openvz.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
