linux-toradex.git/fs/ocfs2, branch master

Merge tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T22:37:58+00:00

Pull procfs updates from Christian Brauner:

 - Revamp fs/filesystems.c

   The file was a mess with a hand-rolled linked list in desperate need
   of a cleanup. The filesystems list is now RCU-ified, /proc files can
   be marked permanent from outside fs/proc/, and the string emitted
   when reading /proc/filesystems is pre-generated and cached instead of
   pointer-chasing and printfing entry by entry on every read.

   The file is read frequently because libselinux reads it and is linked
   into numerous frequently used programs (even ones you would not
   suspect, like sed!). Scalability also improves since reference
   maintenance on open/close is bypassed.

    open+read+close cycle single-threaded (ops/s):
      before: 442732
      after:  1063462 (+140%)

    open+read+close cycle with 20 processes (ops/s):
      before: 606177
      after:  3300576 (+444%)

   A follow-up patch adds missing unlocks in some corner cases and
   tidies things up.

 - Relax the mount visibility check for subset=pid mounts

   When procfs is mounted with subset=pid, all static files become
   unavailable and only the dynamic pid information is accessible. In
   that case there is no point in imposing the full mount visibility
   restrictions on the mounter - everything that can be hidden in procfs
   is already inaccessible. These restrictions prevented procfs from
   being mounted inside rootless containers since almost all container
   implementations overmount parts of procfs to hide certain
   directories.

   As part of this /proc/self/net is only shown in subset=pid mounts for
   CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the
   SB_I_USERNS_VISIBLE superblock flag is replaced with an
   FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are
   recorded in a list, and the mount restrictions are finally
   documented.

 - Protect ptrace_may_access() with exec_update_lock in procfs

   Most uses of ptrace_may_access() in procfs should hold
   exec_update_lock to avoid TOCTOU issues with concurrent privileged
   execve() (like setuid binary execution).

   This fixes the easy cases - the owner and visibility checks and the
   FD link permission checks - with the gnarlier ones to follow later.

* tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: fix ups and tidy ups to /proc/filesystems caching
  proc: protect ptrace_may_access() with exec_update_lock (FD links)
  proc: protect ptrace_may_access() with exec_update_lock (part 1)
  docs: proc: add documentation about mount restrictions
  proc: handle subset=pid separately in userns visibility checks
  proc: prevent reconfiguring subset=pid
  proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  fs: cache the string generated by reading /proc/filesystems
  sysfs: remove trivial sysfs_get_tree() wrapper
  fs: RCU-ify filesystems list
  fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED
  proc: allow to mark /proc files permanent outside of fs/proc/
  namespace: record fully visible mounts in list

Merge tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T22:29:45+00:00

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Reduce pipe->mutex contention by pre-allocating pages outside the
     lock in anon_pipe_write().

     anon_pipe_write() called alloc_page() once per page while holding
     pipe->mutex. The allocation can sleep doing direct reclaim and runs
     memcg charging, which extends the critical section and stalls any
     concurrent reader on the same mutex. Now up to 8 pages are
     pre-allocated before the mutex is taken, leftovers are recycled
     into the per-pipe tmp_page[] cache before unlock, and any remainder
     is released after unlock, keeping the allocator out of the critical
     section on both sides. On a writers x readers sweep with 64KB
     writes against a 1 MB pipe throughput improves 6-28% and average
     write latency drops 5-22%; under memory pressure - when the cost of
     holding the mutex across reclaim is highest - throughput improves
     21-48% and latency drops 17-33%. The microbenchmark is added to
     selftests.

   - uaccess/sockptr: fix the ignored_trailing logic in
     copy_struct_to_user() to behave as documented and the usize check
     in copy_struct_from_sockptr() for user pointers, and add
     copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
     helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).

   - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
     inode backing a dentry via d_real_inode(). On overlayfs the inode
     attached to the dentry doesn't carry the underlying device
     information; this is used by the filesystem restriction BPF program
     that was merged into systemd.

   - docs: add guidelines for submitting new filesystems, motivated by
     the maintenance burden abandoned and untestable filesystems impose
     on VFS developers, blocking infrastructure work like folio
     conversions and iomap migration.

  Fixes:

   - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
     and drop the now-redundant assignments in callers. This began as a
     one-line dma-buf fix for a path_noexec() warning; a pseudo
     filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
     callers were audited: the only visible effect is on dma-buf where
     SB_I_NOEXEC silences the warning.

   - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
     qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
     device with a sector size > PAGE_SIZE crashed roughly half of them;
     the rest had the same missing error handling pattern. Plus a
     follow-up releasing the superblock buffer_head when setting the
     minix v3 block size fails.

   - mount: honour SB_NOUSER in the new mount API.

   - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
     switching the process-group paths of send_sigio() and send_sigurg()
     from read_lock(&tasklist_lock) to RCU, matching the single-PID
     path.

   - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
     delegated NFS mounts (fsopen() in a container with the mount
     performed by a privileged daemon) that broke when non-init
     s_user_ns was tied to FS_USERNS_MOUNT.

   - selftests/namespaces: fix a hang in nsid_test where an unreaped
     grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
     listns_efault_test, and a false FAIL on kernels without listns()
     where the tests should SKIP.

   - filelock: fix the break_lease() stub signature for
     CONFIG_FILE_LOCKING=n.

   - init/initramfs_test: wait for the async initramfs unpacking before
     running; the test and do_populate_rootfs() share the parser state.

   - fs/coredump: reduce redundant log noise in
     validate_coredump_safety().

   - iomap: pass the correct length to fserror_report_io() in
     __iomap_write_begin().

   - backing-file: fix the backing_file_open() kerneldoc.

  Cleanups:

   - initramfs: refactor the cpio hex header parsing to use hex2bin()
     instead of the hand-rolled simple_strntoul() which is reverted, and
     extend the initramfs KUnit tests to cover header fields with 0x
     prefixes.

   - Replace __get_free_pages() and friends with kmalloc()/kzalloc()
     across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
     isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
     do_mounts init code - part of the larger work of replacing page
     allocator calls with kmalloc().

   - Use clear_and_wake_up_bit() in unlock_buffer() and
     journal_end_buffer_io_sync() instead of open-coding the sequence.

   - Drop unused VFS exports: unexport drop_super_exclusive(), remove
     start_removing_user_path_at(), and fold __start_removing_path()
     into start_removing_path().

   - fs/read_write: narrow the __kernel_write() export with
     EXPORT_SYMBOL_FOR_MODULES().

   - vfs: uapi: retire octal and hex constants in favor of (1 << n) for
     the O_ flags. Finding a free bit for a new flag across the
     architectures was needlessly hard with the mixed bases.

   - dcache: add extra sanity checks of dead dentries in dentry_free()
     via a new DENTRY_WARN_ONCE() that also prints d_flags.

   - iov_iter: use kmemdup_array() in dup_iter() to harden the
     allocation against multiplication overflow.

   - fs/pipe: write to ->poll_usage only once.

   - vfs: remove an always-taken if-branch in find_next_fd().

   - dcache: use kmalloc_flex() for struct external_name in __d_alloc().

   - namei: use QSTR() instead of QSTR_INIT() in path_pts().

   - sync_file_range: delete dead S_ISLNK code.

   - Comment fixes: retire a stale comment in fget_task_next() and fix
     assorted spelling mistakes"

* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
  backing-file: fix backing_file_open() kerneldoc parameter
  iomap: pass the correct len to fserror_report_io in __iomap_write_begin
  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
  filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
  vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
  bpf: add bpf_real_inode() kfunc
  fs/read_write: Do not export __kernel_write() to the entire world
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
  mount: honour SB_NOUSER in the new mount API
  fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  fs: retire stale comment in fget_task_next()
  fs: fix spelling mistakes in comment
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  ...

ocfs2: Convert ocfs2_write_super_or_backup to bh_submit()

2026-06-04T08:28:09+00:00

Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().

Signed-off-by: Matthew Wilcox (Oracle) 
Link: https://patch.msgid.link/20260528173150.1093780-22-willy@infradead.org
Reviewed-by: Joseph Qi 
Reviewed-by: Jan Kara 
Cc: ocfs2-devel@lists.linux.dev
Signed-off-by: Christian Brauner (Amutable)

ocfs2: Convert ocfs2_read_blocks to bh_submit()

2026-06-04T08:28:08+00:00

Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().

Signed-off-by: Matthew Wilcox (Oracle) 
Link: https://patch.msgid.link/20260528173150.1093780-21-willy@infradead.org
Reviewed-by: Joseph Qi 
Reviewed-by: Jan Kara 
Cc: ocfs2-devel@lists.linux.dev
Signed-off-by: Christian Brauner (Amutable)

ocfs2: Convert ocfs2_read_block to bh_submit()

2026-06-04T08:28:08+00:00

Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().

Signed-off-by: Matthew Wilcox (Oracle) 
Link: https://patch.msgid.link/20260528173150.1093780-20-willy@infradead.org
Reviewed-by: Joseph Qi 
Reviewed-by: Jan Kara 
Cc: ocfs2-devel@lists.linux.dev
Signed-off-by: Christian Brauner (Amutable)

ocfs2: Convert ocfs2_write_block to bh_submit()

2026-06-04T08:28:08+00:00

Avoid an extra indirect function call and changing the buffer
refcount by using bh_submit() instead of submit_bh().

Signed-off-by: Matthew Wilcox (Oracle) 
Link: https://patch.msgid.link/20260528173150.1093780-19-willy@infradead.org
Reviewed-by: Joseph Qi 
Reviewed-by: Jan Kara 
Cc: ocfs2-devel@lists.linux.dev
Signed-off-by: Christian Brauner (Amutable)

ocfs2/dlm: replace __get_free_page() with kmalloc()

2026-05-27T13:12:23+00:00

A few places in ocsfs2 allocate temporary buffers with __get_free_page() or
get_zeroed_page().

kmalloc() is a better API for such use and it also provides better
scalability and more debugging possibilities.

Replace use of __get_free_page() and get_zeroed_page() with kmalloc() and
kzalloc() respectively.

Signed-off-by: Mike Rapoport (Microsoft) 
Link: https://patch.msgid.link/20260523-b4-fs-v1-3-275e36a83f0e@kernel.org
Reviewed-by: Joseph Qi 
Reviewed-by: Jan Kara 
Signed-off-by: Christian Brauner (Amutable)

fs: RCU-ify filesystems list

2026-05-11T21:13:01+00:00

The drivers list was protected by an rwlock; every mount, every open
of /proc/filesystems and the legacy sysfs(2) syscall walked a
hand-rolled singly-linked list under it. /proc/filesystems is
especially hot because libselinux causes programs as mundane as
mkdir, ls and sed to open and read it on every invocation.

Convert the list to an RCU-protected hlist and switch the writer side
to a plain spinlock. Writers keep their existing non-sleeping
section while readers walk under rcu_read_lock() with no lock traffic:

  - register_filesystem()/unregister_filesystem() take
    file_systems_lock, publish via hlist_{add_tail,del_init}_rcu()
    and invalidate the cached /proc/filesystems string.
    unregister_filesystem() keeps its synchronize_rcu() after
    dropping the lock so in-flight readers are drained before the
    module (and its embedded file_system_type) can go away.

  - __get_fs_type(), list_bdev_fs_names() and the
    fs_index()/fs_name()/fs_maxindex() helpers walk the list under
    rcu_read_lock(). fs_name() continues to drop the read-side
    lock after try_module_get() and accesses ->name outside the RCU
    section; the module reference pins the embedded file_system_type
    across the boundary.

struct file_system_type::next becomes struct hlist_node list; no
in-tree caller references the old ->next field outside
fs/filesystems.c.

Link: https://patch.msgid.link/20260425220844.1763933-3-mjguzik@gmail.com
Signed-off-by: Christian Brauner

Merge tag 'pull-dcache-busy-wait' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

2026-04-21T14:30:44+00:00

Pull dcache busy loop updates from Al Viro:
 "Fix livelocks in shrink_dcache_tree()

  If shrink_dcache_tree() finds a dentry in the middle of being killed
  by another thread, it has to wait until the victim finishes dying,
  gets detached from the tree and ceases to pin its parent.

  The way we used to deal with that amounted to busy-wait;
  unfortunately, it's not just inefficient but can lead to reliably
  reproducible hard livelocks.

  Solved by having shrink_dentry_tree() attach a completion to such
  dentry, with dentry_unlist() calling complete() on all objects
  attached to it. With a bit of care it can be done without growing
  struct dentry or adding overhead in normal case"

* tag 'pull-dcache-busy-wait' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  get rid of busy-waiting in shrink_dcache_tree()
  dcache.c: more idiomatic "positives are not allowed" sanity checks
  struct dentry: make ->d_u anonymous
  for_each_alias(): helper macro for iterating through dentries of given inode

Merge tag 'ext4_for_linux-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

2026-04-18T00:08:31+00:00

Pull ext4 updates from Ted Ts'o:

 - Refactor code paths involved with partial block zero-out in
   prearation for converting ext4 to use iomap for buffered writes

 - Remove use of d_alloc() from ext4 in preparation for the deprecation
   of this interface

 - Replace some J_ASSERTS with a journal abort so we can avoid a kernel
   panic for a localized file system error

 - Simplify various code paths in mballoc, move_extent, and fast commit

 - Fix rare deadlock in jbd2_journal_cancel_revoke() that can be
   triggered by generic/013 when blocksize < pagesize

 - Fix memory leak when releasing an extended attribute when its value
   is stored in an ea_inode

 - Fix various potential kunit test bugs in fs/ext4/extents.c

 - Fix potential out-of-bounds access in check_xattr() with a corrupted
   file system

 - Make the jbd2_inode dirty range tracking safe for lockless reads

 - Avoid a WARN_ON when writeback files due to a corrupted file system;
   we already print an ext4 warning indicatign that data will be lost,
   so the WARN_ON is not necessary and doesn't add any new information

* tag 'ext4_for_linux-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (37 commits)
  jbd2: fix deadlock in jbd2_journal_cancel_revoke()
  ext4: fix missing brelse() in ext4_xattr_inode_dec_ref_all()
  ext4: fix possible null-ptr-deref in mbt_kunit_exit()
  ext4: fix possible null-ptr-deref in extents_kunit_exit()
  ext4: fix the error handling process in extents_kunit_init).
  ext4: call deactivate_super() in extents_kunit_exit()
  ext4: fix miss unlock 'sb->s_umount' in extents_kunit_init()
  ext4: fix bounds check in check_xattrs() to prevent out-of-bounds access
  ext4: zero post-EOF partial block before appending write
  ext4: move pagecache_isize_extended() out of active handle
  ext4: remove ctime/mtime update from ext4_alloc_file_blocks()
  ext4: unify SYNC mode checks in fallocate paths
  ext4: ensure zeroed partial blocks are persisted in SYNC mode
  ext4: move zero partial block range functions out of active handle
  ext4: pass allocate range as loff_t to ext4_alloc_file_blocks()
  ext4: remove handle parameters from zero partial block functions
  ext4: move ordered data handling out of ext4_block_do_zero_range()
  ext4: rename ext4_block_zero_page_range() to ext4_block_zero_range()
  ext4: factor out journalled block zeroing range
  ext4: rename and extend ext4_block_truncate_page()
  ...