<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/arch/mips/include/uapi/asm, branch master</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>Merge tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2026-06-14T22:29:45+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-06-14T22:29:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7e0e7bd60d4a812b694c477716597fcb038b00cb'/>
<id>7e0e7bd60d4a812b694c477716597fcb038b00cb</id>
<content type='text'>
Pull misc vfs updates from Christian Brauner:
 "Features:

   - Reduce pipe-&gt;mutex contention by pre-allocating pages outside the
     lock in anon_pipe_write().

     anon_pipe_write() called alloc_page() once per page while holding
     pipe-&gt;mutex. The allocation can sleep doing direct reclaim and runs
     memcg charging, which extends the critical section and stalls any
     concurrent reader on the same mutex. Now up to 8 pages are
     pre-allocated before the mutex is taken, leftovers are recycled
     into the per-pipe tmp_page[] cache before unlock, and any remainder
     is released after unlock, keeping the allocator out of the critical
     section on both sides. On a writers x readers sweep with 64KB
     writes against a 1 MB pipe throughput improves 6-28% and average
     write latency drops 5-22%; under memory pressure - when the cost of
     holding the mutex across reclaim is highest - throughput improves
     21-48% and latency drops 17-33%. The microbenchmark is added to
     selftests.

   - uaccess/sockptr: fix the ignored_trailing logic in
     copy_struct_to_user() to behave as documented and the usize check
     in copy_struct_from_sockptr() for user pointers, and add
     copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
     helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).

   - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
     inode backing a dentry via d_real_inode(). On overlayfs the inode
     attached to the dentry doesn't carry the underlying device
     information; this is used by the filesystem restriction BPF program
     that was merged into systemd.

   - docs: add guidelines for submitting new filesystems, motivated by
     the maintenance burden abandoned and untestable filesystems impose
     on VFS developers, blocking infrastructure work like folio
     conversions and iomap migration.

  Fixes:

   - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
     and drop the now-redundant assignments in callers. This began as a
     one-line dma-buf fix for a path_noexec() warning; a pseudo
     filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
     callers were audited: the only visible effect is on dma-buf where
     SB_I_NOEXEC silences the warning.

   - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
     qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
     device with a sector size &gt; PAGE_SIZE crashed roughly half of them;
     the rest had the same missing error handling pattern. Plus a
     follow-up releasing the superblock buffer_head when setting the
     minix v3 block size fails.

   - mount: honour SB_NOUSER in the new mount API.

   - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
     switching the process-group paths of send_sigio() and send_sigurg()
     from read_lock(&amp;tasklist_lock) to RCU, matching the single-PID
     path.

   - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
     delegated NFS mounts (fsopen() in a container with the mount
     performed by a privileged daemon) that broke when non-init
     s_user_ns was tied to FS_USERNS_MOUNT.

   - selftests/namespaces: fix a hang in nsid_test where an unreaped
     grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
     listns_efault_test, and a false FAIL on kernels without listns()
     where the tests should SKIP.

   - filelock: fix the break_lease() stub signature for
     CONFIG_FILE_LOCKING=n.

   - init/initramfs_test: wait for the async initramfs unpacking before
     running; the test and do_populate_rootfs() share the parser state.

   - fs/coredump: reduce redundant log noise in
     validate_coredump_safety().

   - iomap: pass the correct length to fserror_report_io() in
     __iomap_write_begin().

   - backing-file: fix the backing_file_open() kerneldoc.

  Cleanups:

   - initramfs: refactor the cpio hex header parsing to use hex2bin()
     instead of the hand-rolled simple_strntoul() which is reverted, and
     extend the initramfs KUnit tests to cover header fields with 0x
     prefixes.

   - Replace __get_free_pages() and friends with kmalloc()/kzalloc()
     across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
     isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
     do_mounts init code - part of the larger work of replacing page
     allocator calls with kmalloc().

   - Use clear_and_wake_up_bit() in unlock_buffer() and
     journal_end_buffer_io_sync() instead of open-coding the sequence.

   - Drop unused VFS exports: unexport drop_super_exclusive(), remove
     start_removing_user_path_at(), and fold __start_removing_path()
     into start_removing_path().

   - fs/read_write: narrow the __kernel_write() export with
     EXPORT_SYMBOL_FOR_MODULES().

   - vfs: uapi: retire octal and hex constants in favor of (1 &lt;&lt; n) for
     the O_ flags. Finding a free bit for a new flag across the
     architectures was needlessly hard with the mixed bases.

   - dcache: add extra sanity checks of dead dentries in dentry_free()
     via a new DENTRY_WARN_ONCE() that also prints d_flags.

   - iov_iter: use kmemdup_array() in dup_iter() to harden the
     allocation against multiplication overflow.

   - fs/pipe: write to -&gt;poll_usage only once.

   - vfs: remove an always-taken if-branch in find_next_fd().

   - dcache: use kmalloc_flex() for struct external_name in __d_alloc().

   - namei: use QSTR() instead of QSTR_INIT() in path_pts().

   - sync_file_range: delete dead S_ISLNK code.

   - Comment fixes: retire a stale comment in fget_task_next() and fix
     assorted spelling mistakes"

* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
  backing-file: fix backing_file_open() kerneldoc parameter
  iomap: pass the correct len to fserror_report_io in __iomap_write_begin
  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
  filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
  vfs: uapi: retire octal and hex numbers in favor of (1 &lt;&lt; n) for O_ flags
  bpf: add bpf_real_inode() kfunc
  fs/read_write: Do not export __kernel_write() to the entire world
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
  mount: honour SB_NOUSER in the new mount API
  fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe-&gt;mutex in anon_pipe_write
  fs: retire stale comment in fget_task_next()
  fs: fix spelling mistakes in comment
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull misc vfs updates from Christian Brauner:
 "Features:

   - Reduce pipe-&gt;mutex contention by pre-allocating pages outside the
     lock in anon_pipe_write().

     anon_pipe_write() called alloc_page() once per page while holding
     pipe-&gt;mutex. The allocation can sleep doing direct reclaim and runs
     memcg charging, which extends the critical section and stalls any
     concurrent reader on the same mutex. Now up to 8 pages are
     pre-allocated before the mutex is taken, leftovers are recycled
     into the per-pipe tmp_page[] cache before unlock, and any remainder
     is released after unlock, keeping the allocator out of the critical
     section on both sides. On a writers x readers sweep with 64KB
     writes against a 1 MB pipe throughput improves 6-28% and average
     write latency drops 5-22%; under memory pressure - when the cost of
     holding the mutex across reclaim is highest - throughput improves
     21-48% and latency drops 17-33%. The microbenchmark is added to
     selftests.

   - uaccess/sockptr: fix the ignored_trailing logic in
     copy_struct_to_user() to behave as documented and the usize check
     in copy_struct_from_sockptr() for user pointers, and add
     copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
     helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).

   - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
     inode backing a dentry via d_real_inode(). On overlayfs the inode
     attached to the dentry doesn't carry the underlying device
     information; this is used by the filesystem restriction BPF program
     that was merged into systemd.

   - docs: add guidelines for submitting new filesystems, motivated by
     the maintenance burden abandoned and untestable filesystems impose
     on VFS developers, blocking infrastructure work like folio
     conversions and iomap migration.

  Fixes:

   - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
     and drop the now-redundant assignments in callers. This began as a
     one-line dma-buf fix for a path_noexec() warning; a pseudo
     filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
     callers were audited: the only visible effect is on dma-buf where
     SB_I_NOEXEC silences the warning.

   - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
     qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
     device with a sector size &gt; PAGE_SIZE crashed roughly half of them;
     the rest had the same missing error handling pattern. Plus a
     follow-up releasing the superblock buffer_head when setting the
     minix v3 block size fails.

   - mount: honour SB_NOUSER in the new mount API.

   - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
     switching the process-group paths of send_sigio() and send_sigurg()
     from read_lock(&amp;tasklist_lock) to RCU, matching the single-PID
     path.

   - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
     delegated NFS mounts (fsopen() in a container with the mount
     performed by a privileged daemon) that broke when non-init
     s_user_ns was tied to FS_USERNS_MOUNT.

   - selftests/namespaces: fix a hang in nsid_test where an unreaped
     grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
     listns_efault_test, and a false FAIL on kernels without listns()
     where the tests should SKIP.

   - filelock: fix the break_lease() stub signature for
     CONFIG_FILE_LOCKING=n.

   - init/initramfs_test: wait for the async initramfs unpacking before
     running; the test and do_populate_rootfs() share the parser state.

   - fs/coredump: reduce redundant log noise in
     validate_coredump_safety().

   - iomap: pass the correct length to fserror_report_io() in
     __iomap_write_begin().

   - backing-file: fix the backing_file_open() kerneldoc.

  Cleanups:

   - initramfs: refactor the cpio hex header parsing to use hex2bin()
     instead of the hand-rolled simple_strntoul() which is reverted, and
     extend the initramfs KUnit tests to cover header fields with 0x
     prefixes.

   - Replace __get_free_pages() and friends with kmalloc()/kzalloc()
     across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
     isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
     do_mounts init code - part of the larger work of replacing page
     allocator calls with kmalloc().

   - Use clear_and_wake_up_bit() in unlock_buffer() and
     journal_end_buffer_io_sync() instead of open-coding the sequence.

   - Drop unused VFS exports: unexport drop_super_exclusive(), remove
     start_removing_user_path_at(), and fold __start_removing_path()
     into start_removing_path().

   - fs/read_write: narrow the __kernel_write() export with
     EXPORT_SYMBOL_FOR_MODULES().

   - vfs: uapi: retire octal and hex constants in favor of (1 &lt;&lt; n) for
     the O_ flags. Finding a free bit for a new flag across the
     architectures was needlessly hard with the mixed bases.

   - dcache: add extra sanity checks of dead dentries in dentry_free()
     via a new DENTRY_WARN_ONCE() that also prints d_flags.

   - iov_iter: use kmemdup_array() in dup_iter() to harden the
     allocation against multiplication overflow.

   - fs/pipe: write to -&gt;poll_usage only once.

   - vfs: remove an always-taken if-branch in find_next_fd().

   - dcache: use kmalloc_flex() for struct external_name in __d_alloc().

   - namei: use QSTR() instead of QSTR_INIT() in path_pts().

   - sync_file_range: delete dead S_ISLNK code.

   - Comment fixes: retire a stale comment in fget_task_next() and fix
     assorted spelling mistakes"

* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
  backing-file: fix backing_file_open() kerneldoc parameter
  iomap: pass the correct len to fserror_report_io in __iomap_write_begin
  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
  filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
  vfs: uapi: retire octal and hex numbers in favor of (1 &lt;&lt; n) for O_ flags
  bpf: add bpf_real_inode() kfunc
  fs/read_write: Do not export __kernel_write() to the entire world
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
  mount: honour SB_NOUSER in the new mount API
  fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe-&gt;mutex in anon_pipe_write
  fs: retire stale comment in fget_task_next()
  fs: fix spelling mistakes in comment
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>vfs: uapi: retire octal and hex numbers in favor of (1 &lt;&lt; n) for O_ flags</title>
<updated>2026-06-06T13:34:03+00:00</updated>
<author>
<name>Jori Koolstra</name>
<email>jkoolstra@xs4all.nl</email>
</author>
<published>2026-06-04T22:24:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0da79c259ad0554b36761a7135d4f92eb7c46263'/>
<id>0da79c259ad0554b36761a7135d4f92eb7c46263</id>
<content type='text'>
A recent build failure[1] exposed the diffculty of working with the
current octal and hex definitions of O_ flags when trying to find a gap
for a new flag. This difficulty is compounded by the fact that O_ flags
may have architectural specific values.

Replace the hex/octal #defines, which are hard to parse when looking for
free bits, with explicit bit shifts like (1 &lt;&lt; 11). Also, add comments
that identify which architectures redefine some of the seemingly free
("cursed") bits in uapi/asm-generic/fcntl.h. These should not be used to
define new O_ flags (for now, at least).

The translastion was done with Claude Opus 4.8, and verified with a
(non-AI) gawk script. The accounting of which architectures claim
which bit-gaps in uapi/asm-generic/fcntl.h is also done by hand.

[1]: https://lore.kernel.org/all/agruPPybCx8q2XcJ@sirena.org.uk/

Assisted-by: Claude:Opus 4.8
Signed-off-by: Jori Koolstra &lt;jkoolstra@xs4all.nl&gt;
Link: https://patch.msgid.link/20260604222405.5382-1-jkoolstra@xs4all.nl
Signed-off-by: Christian Brauner (Amutable) &lt;brauner@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A recent build failure[1] exposed the diffculty of working with the
current octal and hex definitions of O_ flags when trying to find a gap
for a new flag. This difficulty is compounded by the fact that O_ flags
may have architectural specific values.

Replace the hex/octal #defines, which are hard to parse when looking for
free bits, with explicit bit shifts like (1 &lt;&lt; 11). Also, add comments
that identify which architectures redefine some of the seemingly free
("cursed") bits in uapi/asm-generic/fcntl.h. These should not be used to
define new O_ flags (for now, at least).

The translastion was done with Claude Opus 4.8, and verified with a
(non-AI) gawk script. The accounting of which architectures claim
which bit-gaps in uapi/asm-generic/fcntl.h is also done by hand.

[1]: https://lore.kernel.org/all/agruPPybCx8q2XcJ@sirena.org.uk/

Assisted-by: Claude:Opus 4.8
Signed-off-by: Jori Koolstra &lt;jkoolstra@xs4all.nl&gt;
Link: https://patch.msgid.link/20260604222405.5382-1-jkoolstra@xs4all.nl
Signed-off-by: Christian Brauner (Amutable) &lt;brauner@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>openat2: introduce EFTYPE error code</title>
<updated>2026-05-21T08:53:41+00:00</updated>
<author>
<name>Dorjoy Chowdhury</name>
<email>dorjoychy111@gmail.com</email>
</author>
<published>2026-03-28T17:22:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=cf1b04aaef8b83f668fac10bfc4d4d76ba6e6fa1'/>
<id>cf1b04aaef8b83f668fac10bfc4d4d76ba6e6fa1</id>
<content type='text'>
Introduce a new error code EFTYPE for wrong file type operations.
EFTYPE is already used in BSD systems like FreeBSD and macOS.

This will be used by the upcoming OPENAT2_REGULAR flag support to
return a specific error when a path doesn't refer to a regular file.

Signed-off-by: Dorjoy Chowdhury &lt;dorjoychy111@gmail.com&gt;
Link: https://patch.msgid.link/20260328172314.45807-2-dorjoychy111@gmail.com
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Reviewed-by: Aleksa Sarai &lt;aleksa@amutable.com&gt;
Signed-off-by: Christian Brauner (Amutable) &lt;brauner@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Introduce a new error code EFTYPE for wrong file type operations.
EFTYPE is already used in BSD systems like FreeBSD and macOS.

This will be used by the upcoming OPENAT2_REGULAR flag support to
return a specific error when a path doesn't refer to a regular file.

Signed-off-by: Dorjoy Chowdhury &lt;dorjoychy111@gmail.com&gt;
Link: https://patch.msgid.link/20260328172314.45807-2-dorjoychy111@gmail.com
Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Reviewed-by: Aleksa Sarai &lt;aleksa@amutable.com&gt;
Signed-off-by: Christian Brauner (Amutable) &lt;brauner@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>uapi: promote EFSCORRUPTED and EUCLEAN to errno.h</title>
<updated>2026-01-13T08:58:01+00:00</updated>
<author>
<name>Darrick J. Wong</name>
<email>djwong@kernel.org</email>
</author>
<published>2026-01-13T00:31:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=602544773763da411ffa67567fa1d146f3a40231'/>
<id>602544773763da411ffa67567fa1d146f3a40231</id>
<content type='text'>
Stop definining these privately and instead move them to the uapi
errno.h so that they become canonical instead of copy pasta.

Cc: linux-api@vger.kernel.org
Signed-off-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Link: https://patch.msgid.link/176826402587.3490369.17659117524205214600.stgit@frogsfrogsfrogs
Reviewed-by: Gao Xiang &lt;hsiangkao@linux.alibaba.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Stop definining these privately and instead move them to the uapi
errno.h so that they become canonical instead of copy pasta.

Cc: linux-api@vger.kernel.org
Signed-off-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Link: https://patch.msgid.link/176826402587.3490369.17659117524205214600.stgit@frogsfrogsfrogs
Reviewed-by: Gao Xiang &lt;hsiangkao@linux.alibaba.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: Introduce SO_INQ.</title>
<updated>2025-07-09T01:05:25+00:00</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@google.com</email>
</author>
<published>2025-07-02T22:35:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=df30285b3670bf52e1e5512e4d4482bec5e93c16'/>
<id>df30285b3670bf52e1e5512e4d4482bec5e93c16</id>
<content type='text'>
We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).

TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an
extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an
alternative.

Let's introduce the generic version of TCP_INQ.

If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that
contains the exact value of ioctl(SIOCINQ).  The cmsg is also
included when msg-&gt;msg_get_inq is non-zero to make sockets
io_uring-friendly.

Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to
override setsockopt() for SOL_SOCKET.

By having the flag in struct unix_sock, instead of struct sock, we
can later add SO_INQ support for TCP and reuse tcp_sk(sk)-&gt;recvmsg_inq.

Note also that supporting custom getsockopt() for SOL_SOCKET will need
preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP).

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@google.com&gt;
Reviewed-by: Willem de Bruijn &lt;willemb@google.com&gt;
Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).

TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an
extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an
alternative.

Let's introduce the generic version of TCP_INQ.

If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that
contains the exact value of ioctl(SIOCINQ).  The cmsg is also
included when msg-&gt;msg_get_inq is non-zero to make sockets
io_uring-friendly.

Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to
override setsockopt() for SOL_SOCKET.

By having the flag in struct unix_sock, instead of struct sock, we
can later add SO_INQ support for TCP and reuse tcp_sk(sk)-&gt;recvmsg_inq.

Note also that supporting custom getsockopt() for SOL_SOCKET will need
preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP).

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@google.com&gt;
Reviewed-by: Willem de Bruijn &lt;willemb@google.com&gt;
Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>af_unix: Introduce SO_PASSRIGHTS.</title>
<updated>2025-05-23T09:24:18+00:00</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2025-05-19T20:57:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=77cbe1a6d8730a07f99f9263c2d5f2304cf5e830'/>
<id>77cbe1a6d8730a07f99f9263c2d5f2304cf5e830</id>
<content type='text'>
As long as recvmsg() or recvmmsg() is used with cmsg, it is not
possible to avoid receiving file descriptors via SCM_RIGHTS.

This behaviour has occasionally been flagged as problematic, as
it can be (ab)used to trigger DoS during close(), for example, by
passing a FUSE-controlled fd or a hung NFS fd.

For instance, as noted on the uAPI Group page [0], an untrusted peer
could send a file descriptor pointing to a hung NFS mount and then
close it.  Once the receiver calls recvmsg() with msg_control, the
descriptor is automatically installed, and then the responsibility
for the final close() now falls on the receiver, which may result
in blocking the process for a long time.

Regarding this, systemd calls cmsg_close_all() [1] after each
recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS.

However, this cannot work around the issue at all, because the final
fput() may still occur on the receiver's side once sendmsg() with
SCM_RIGHTS succeeds.  Also, even filtering by LSM at recvmsg() does
not work for the same reason.

Thus, we need a better way to refuse SCM_RIGHTS at sendmsg().

Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS.

Note that this option is enabled by default for backward
compatibility.

Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0]
Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1]
Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Reviewed-by: Willem de Bruijn &lt;willemb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
As long as recvmsg() or recvmmsg() is used with cmsg, it is not
possible to avoid receiving file descriptors via SCM_RIGHTS.

This behaviour has occasionally been flagged as problematic, as
it can be (ab)used to trigger DoS during close(), for example, by
passing a FUSE-controlled fd or a hung NFS fd.

For instance, as noted on the uAPI Group page [0], an untrusted peer
could send a file descriptor pointing to a hung NFS mount and then
close it.  Once the receiver calls recvmsg() with msg_control, the
descriptor is automatically installed, and then the responsibility
for the final close() now falls on the receiver, which may result
in blocking the process for a long time.

Regarding this, systemd calls cmsg_close_all() [1] after each
recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS.

However, this cannot work around the issue at all, because the final
fput() may still occur on the receiver's side once sendmsg() with
SCM_RIGHTS succeeds.  Also, even filtering by LSM at recvmsg() does
not work for the same reason.

Thus, we need a better way to refuse SCM_RIGHTS at sendmsg().

Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS.

Note that this option is enabled by default for backward
compatibility.

Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0]
Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1]
Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Reviewed-by: Willem de Bruijn &lt;willemb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sock: Introduce SO_RCVPRIORITY socket option</title>
<updated>2024-12-17T02:16:44+00:00</updated>
<author>
<name>Anna Emese Nyiri</name>
<email>annaemesenyiri@gmail.com</email>
</author>
<published>2024-12-13T08:44:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e45469e594b255ef8d750ed5576698743450d2ac'/>
<id>e45469e594b255ef8d750ed5576698743450d2ac</id>
<content type='text'>
Add new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the
ancillary data returned by recvmsg().
This is analogous to the existing support for SO_RCVMARK,
as implemented in commit 6fd1d51cfa253 ("net: SO_RCVMARK socket option
for SO_MARK with recvmsg()").

Reviewed-by: Willem de Bruijn &lt;willemb@google.com&gt;
Suggested-by: Ferenc Fejes &lt;fejes@inf.elte.hu&gt;
Signed-off-by: Anna Emese Nyiri &lt;annaemesenyiri@gmail.com&gt;
Link: https://patch.msgid.link/20241213084457.45120-5-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add new socket option, SO_RCVPRIORITY, to include SO_PRIORITY in the
ancillary data returned by recvmsg().
This is analogous to the existing support for SO_RCVMARK,
as implemented in commit 6fd1d51cfa253 ("net: SO_RCVMARK socket option
for SO_MARK with recvmsg()").

Reviewed-by: Willem de Bruijn &lt;willemb@google.com&gt;
Suggested-by: Ferenc Fejes &lt;fejes@inf.elte.hu&gt;
Signed-off-by: Anna Emese Nyiri &lt;annaemesenyiri@gmail.com&gt;
Link: https://patch.msgid.link/20241213084457.45120-5-annaemesenyiri@gmail.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2024-11-23T17:58:07+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-11-23T17:58:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5c00ff742bf5caf85f60e1c73999f99376fb865d'/>
<id>5c00ff742bf5caf85f60e1c73999f99376fb865d</id>
<content type='text'>
Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent cleaner and potentiall saves a large number of pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcgv2 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.

 - The series "page-&gt;index removals in mm" from Matthew Wilcox remove
   most references to page-&gt;index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent cleaner and potentiall saves a large number of pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcgv2 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.

 - The series "page-&gt;index removals in mm" from Matthew Wilcox remove
   most references to page-&gt;index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: madvise: implement lightweight guard page mechanism</title>
<updated>2024-11-11T08:26:45+00:00</updated>
<author>
<name>Lorenzo Stoakes</name>
<email>lorenzo.stoakes@oracle.com</email>
</author>
<published>2024-10-28T14:13:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=662df3e5c37666d6ed75c88098699e070a4b35b5'/>
<id>662df3e5c37666d6ed75c88098699e070a4b35b5</id>
<content type='text'>
Implement a new lightweight guard page feature, that is regions of
userland virtual memory that, when accessed, cause a fatal signal to
arise.

Currently users must establish PROT_NONE ranges to achieve this.

However this is very costly memory-wise - we need a VMA for each and every
one of these regions AND they become unmergeable with surrounding VMAs.

In addition repeated mmap() calls require repeated kernel context switches
and contention of the mmap lock to install these ranges, potentially also
having to unmap memory if installed over existing ranges.

The lightweight guard approach eliminates the VMA cost altogether - rather
than establishing a PROT_NONE VMA, it operates at the level of page table
entries - establishing PTE markers such that accesses to them cause a
fault followed by a SIGSGEV signal being raised.

This is achieved through the PTE marker mechanism, which we have already
extended to provide PTE_MARKER_GUARD, which we installed via the generic
page walking logic which we have extended for this purpose.

These guard ranges are established with MADV_GUARD_INSTALL.  If the range
in which they are installed contain any existing mappings, they will be
zapped, i.e.  free the range and unmap memory (thus mimicking the
behaviour of MADV_DONTNEED in this respect).

Any existing guard entries will be left untouched.  There is therefore no
nesting of guarded pages.

Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both
instances the memory range may be reused at which point a user would
expect guards to still be in place), but they are cleared via
MADV_GUARD_REMOVE, process teardown or unmapping of memory ranges.

The guard property can be removed from ranges via MADV_GUARD_REMOVE.  The
ranges over which this is applied, should they contain non-guard entries,
will be untouched, with only guard entries being cleared.

We permit this operation on anonymous memory only, and only VMAs which are
non-special, non-huge and not mlock()'d (if we permitted this we'd have to
drop locked pages which would be rather counterintuitive).

Racing page faults can cause repeated attempts to install guard pages that
are interrupted, result in a zap, and this process can end up being
repeated.  If this happens more than would be expected in normal
operation, we rescind locks and retry the whole thing, which avoids lock
contention in this scenario.

Link: https://lkml.kernel.org/r/6aafb5821bf209f277dfae0787abb2ef87a37542.1730123433.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Suggested-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Suggested-by: Jann Horn &lt;jannh@google.com&gt;
Suggested-by: David Hildenbrand &lt;david@redhat.com&gt;
Suggested-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Suggested-by: Jann Horn &lt;jannh@google.com&gt;
Suggested-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Arnd Bergmann &lt;arnd@kernel.org&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Chris Zankel &lt;chris@zankel.net&gt;
Cc: Helge Deller &lt;deller@gmx.de&gt;
Cc: James E.J. Bottomley &lt;James.Bottomley@HansenPartnership.com&gt;
Cc: Jeff Xu &lt;jeffxu@chromium.org&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Liam R. Howlett &lt;Liam.Howlett@Oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Matt Turner &lt;mattst88@gmail.com&gt;
Cc: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Cc: Richard Henderson &lt;richard.henderson@linaro.org&gt;
Cc: Shuah Khan &lt;shuah@kernel.org&gt;
Cc: Shuah Khan &lt;skhan@linuxfoundation.org&gt;
Cc: Sidhartha Kumar &lt;sidhartha.kumar@oracle.com&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Thomas Bogendoerfer &lt;tsbogend@alpha.franken.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Implement a new lightweight guard page feature, that is regions of
userland virtual memory that, when accessed, cause a fatal signal to
arise.

Currently users must establish PROT_NONE ranges to achieve this.

However this is very costly memory-wise - we need a VMA for each and every
one of these regions AND they become unmergeable with surrounding VMAs.

In addition repeated mmap() calls require repeated kernel context switches
and contention of the mmap lock to install these ranges, potentially also
having to unmap memory if installed over existing ranges.

The lightweight guard approach eliminates the VMA cost altogether - rather
than establishing a PROT_NONE VMA, it operates at the level of page table
entries - establishing PTE markers such that accesses to them cause a
fault followed by a SIGSGEV signal being raised.

This is achieved through the PTE marker mechanism, which we have already
extended to provide PTE_MARKER_GUARD, which we installed via the generic
page walking logic which we have extended for this purpose.

These guard ranges are established with MADV_GUARD_INSTALL.  If the range
in which they are installed contain any existing mappings, they will be
zapped, i.e.  free the range and unmap memory (thus mimicking the
behaviour of MADV_DONTNEED in this respect).

Any existing guard entries will be left untouched.  There is therefore no
nesting of guarded pages.

Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both
instances the memory range may be reused at which point a user would
expect guards to still be in place), but they are cleared via
MADV_GUARD_REMOVE, process teardown or unmapping of memory ranges.

The guard property can be removed from ranges via MADV_GUARD_REMOVE.  The
ranges over which this is applied, should they contain non-guard entries,
will be untouched, with only guard entries being cleared.

We permit this operation on anonymous memory only, and only VMAs which are
non-special, non-huge and not mlock()'d (if we permitted this we'd have to
drop locked pages which would be rather counterintuitive).

Racing page faults can cause repeated attempts to install guard pages that
are interrupted, result in a zap, and this process can end up being
repeated.  If this happens more than would be expected in normal
operation, we rescind locks and retry the whole thing, which avoids lock
contention in this scenario.

Link: https://lkml.kernel.org/r/6aafb5821bf209f277dfae0787abb2ef87a37542.1730123433.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Suggested-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Suggested-by: Jann Horn &lt;jannh@google.com&gt;
Suggested-by: David Hildenbrand &lt;david@redhat.com&gt;
Suggested-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Suggested-by: Jann Horn &lt;jannh@google.com&gt;
Suggested-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Arnd Bergmann &lt;arnd@kernel.org&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Chris Zankel &lt;chris@zankel.net&gt;
Cc: Helge Deller &lt;deller@gmx.de&gt;
Cc: James E.J. Bottomley &lt;James.Bottomley@HansenPartnership.com&gt;
Cc: Jeff Xu &lt;jeffxu@chromium.org&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Liam R. Howlett &lt;Liam.Howlett@Oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Matt Turner &lt;mattst88@gmail.com&gt;
Cc: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Paul E. McKenney &lt;paulmck@kernel.org&gt;
Cc: Richard Henderson &lt;richard.henderson@linaro.org&gt;
Cc: Shuah Khan &lt;shuah@kernel.org&gt;
Cc: Shuah Khan &lt;skhan@linuxfoundation.org&gt;
Cc: Sidhartha Kumar &lt;sidhartha.kumar@oracle.com&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Thomas Bogendoerfer &lt;tsbogend@alpha.franken.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message</title>
<updated>2024-10-04T18:52:19+00:00</updated>
<author>
<name>Vadim Fedorenko</name>
<email>vadfed@meta.com</email>
</author>
<published>2024-10-01T12:57:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4aecca4c76808f3736056d18ff510df80424bc9f'/>
<id>4aecca4c76808f3736056d18ff510df80424bc9f</id>
<content type='text'>
SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of lockless
nature of UDP transmit, several threads may send packets in parallel. In
case of RAW sockets MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing ID with each sendmsg for UDP sockets.
The documentation is also added in this patch.

[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/

Reviewed-by: Willem de Bruijn &lt;willemb@google.com&gt;
Reviewed-by: Jason Xing &lt;kerneljasonxing@gmail.com&gt;
Signed-off-by: Vadim Fedorenko &lt;vadfed@meta.com&gt;
Link: https://patch.msgid.link/20241001125716.2832769-2-vadfed@meta.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
timestamps and packets sent via socket. Unfortunately, there is no way
to reliably predict socket timestamp ID value in case of error returned
by sendmsg. For UDP sockets it's impossible because of lockless
nature of UDP transmit, several threads may send packets in parallel. In
case of RAW sockets MSG_MORE option makes things complicated. More
details are in the conversation [1].
This patch adds new control message type to give user-space
software an opportunity to control the mapping between packets and
values by providing ID with each sendmsg for UDP sockets.
The documentation is also added in this patch.

[1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/

Reviewed-by: Willem de Bruijn &lt;willemb@google.com&gt;
Reviewed-by: Jason Xing &lt;kerneljasonxing@gmail.com&gt;
Signed-off-by: Vadim Fedorenko &lt;vadfed@meta.com&gt;
Link: https://patch.msgid.link/20241001125716.2832769-2-vadfed@meta.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
