linux-toradex.git/include/linux/fs, branch master

Merge tag 'fuse-update-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

2026-06-18T15:50:52+00:00

Pull fuse updates from Miklos Szeredi:

 - Fix lots of bugs, most from the late 6.x era, but some going back
   to 2.6.x

 - Add subsystems (io-uring, passthrough) and respective maintainers
   (Bernd, Joanne and Amir)

 - Separate transport and fs layers (Miklos)

 - Don't block on cat /dev/fuse (Joanne)

 - Perform some refactoring in fuse-uring (Joanne)

 - Don't use bounce-buffer for READDIR reply in virtio-fs (Matthew Ochs)

 - Clean up documentation (Randy)

 - Improve tracing (Amir)

 - Extend page cache invalidation after DIO (Cheng Ding)

 - Invalidate readdir cache on epoch change (Jun Wu)

 - Misc cleanups

* tag 'fuse-update-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (81 commits)
  fuse-uring: clear ent->fuse_req in commit_fetch error path
  fuse-uring: use named constants for io-uring iovec indices
  fuse-uring: refactor setting up copy state for payload copying
  fuse-uring: use enum types for header copying
  fuse-uring: refactor io-uring header copying from ring
  fuse-uring: refactor io-uring header copying to ring
  fuse-uring: separate next request fetching from sending logic
  fuse: invalidate readdir cache on epoch bump
  virtio-fs: avoid double-free on failed queue setup
  fuse: invalidate page cache after DIO and async DIO writes
  fuse: set ff->flock only on success
  fuse: clean up interrupt reading
  fuse: remove stray newline in fuse_dev_do_read()
  fuse: use READ_ONCE in fuse_chan_num_background()
  fuse: dax: Move long delayed work on system_dfl_long_wq
  fuse: add fuse_request_sent tracepoint
  fuse: Add SPDX ID lines to some files
  fuse: use QSTR() instead of QSTR_INIT() in fuse_get_dentry
  fuse: convert page array allocation to kcalloc()
  fuse: use current creds for backing files
  ...

fuse: invalidate page cache after DIO and async DIO writes

2026-06-15T12:06:20+00:00

This fix performs page cache invalidation after DIO and async DIO writes for
both the O_DIRECT and FOPEN_DIRECT_IO cases.

Commit b359af8275a9 ("fuse: Invalidate the page cache after FOPEN_DIRECT_IO
write") fixed xfstests generic/209 for DIO writes in the FOPEN_DIRECT_IO
path. DIO writes without FOPEN_DIRECT_IO are already handled by
generic_file_direct_write().
However, async DIO writes (xfstests generic/451) remain unhandled.

After this fix:
- Async write with FUSE_ASYNC_DIO:
    invalidate in fuse_aio_invalidate_worker()

- Otherwise (Sync or async write without FUSE_ASYNC_DIO):
    - With FOPEN_DIRECT_IO:
        invalidate in fuse_direct_write_iter()
    - Without FOPEN_DIRECT_IO:
        invalidate in generic_file_direct_write()

A workqueue is required for async write invalidation to prevent a deadlock:
calling it directly in the I/O end routine (which runs in fuse worker thread
context) can block on a folio lock held by a buffered I/O thread that is
itself waiting for the same fuse worker thread.
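
The decision tree above can be sketched as a small userspace C function.
This is only an illustration: the enum, the function name and the three
boolean parameters are hypothetical and do not appear in the patch; only
the three kernel function names in the comments come from the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three invalidation sites named above. */
enum invalidate_site {
    SITE_AIO_WORKER,        /* fuse_aio_invalidate_worker() */
    SITE_FUSE_DIRECT_WRITE, /* fuse_direct_write_iter() */
    SITE_GENERIC_DIRECT     /* generic_file_direct_write() */
};

/*
 * Pick the site that invalidates the page cache after a DIO write.
 *   is_async:        the write was submitted asynchronously
 *   fuse_async_dio:  FUSE_ASYNC_DIO is in effect
 *   fopen_direct_io: the file was opened with FOPEN_DIRECT_IO
 */
enum invalidate_site pick_invalidate_site(bool is_async,
                                          bool fuse_async_dio,
                                          bool fopen_direct_io)
{
    if (is_async && fuse_async_dio)
        return SITE_AIO_WORKER;        /* deferred to a workqueue */
    if (fopen_direct_io)
        return SITE_FUSE_DIRECT_WRITE;
    return SITE_GENERIC_DIRECT;
}
```

For instance, pick_invalidate_site(true, false, true) selects the
fuse_direct_write_iter() site, since the write is async but FUSE_ASYNC_DIO
is not in effect.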

Co-developed-by: Jingbo Xu 
Signed-off-by: Jingbo Xu 
Signed-off-by: Cheng Ding 
Reviewed-by: Jingbo Xu 
Signed-off-by: Miklos Szeredi

Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

2026-06-14T22:45:31+00:00

Pull dcache updates from Al Viro:

 - d_alloc_parallel() API change (Neil's with my changes)

 - NORCU fixes

 - Reorganization and simplification of dentry eviction logic

 - Simplifying rcu_read_lock() scopes in fs/dcache.c

 - Secondary roots work - getting rid of NFS fake root dentries and
   dealing with remaining shrink_dcache_for_umount() and
   shrink_dentry_list() races

 - making cursors NORCU (surprisingly easy)

* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits)
  make cursors NORCU
  nfs: get rid of fake root dentries
  wind ->s_roots via ->d_sib instead of ->d_hash
  shrink_dentry_tree(): unify the calls of shrink_dentry_list()
  shrinking rcu_read_lock() scope in d_alloc_parallel()
  d_walk(): shrink rcu_read_lock() scope
  document dentry_kill()
  adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()
  Document rcu_read_lock() use in select_collect2()
  Shift rcu_read_{,un}lock() inside fast_dput()
  simplify safety for lock_for_kill() slowpath
  fold lock_for_kill() and __dentry_kill() into common helper
  fold lock_for_kill() into shrink_kill()
  shrink_dentry_list(): start with removing from shrink list
  d_prune_aliases(): make sure to skip NORCU aliases
  kill d_dispose_if_unused()
  make to_shrink_list() return whether it has moved dentry to list
  select_collect(): ignore dentries on shrink lists if they have positive refcounts
  find_acceptable_alias(): skip NORCU aliases with zero refcount
  fix a race between d_find_any_alias() and final dput() of NORCU dentries
  ...

Merge tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T22:37:58+00:00

Pull procfs updates from Christian Brauner:

 - Revamp fs/filesystems.c

   The file was a mess with a hand-rolled linked list in desperate need
   of a cleanup. The filesystems list is now RCU-ified, /proc files can
   be marked permanent from outside fs/proc/, and the string emitted
   when reading /proc/filesystems is pre-generated and cached instead of
   pointer-chasing and printfing entry by entry on every read.

   The file is read frequently because libselinux reads it and is linked
   into numerous frequently used programs (even ones you would not
   suspect, like sed!). Scalability also improves since reference
   maintenance on open/close is bypassed.

    open+read+close cycle single-threaded (ops/s):
      before: 442732
      after:  1063462 (+140%)

    open+read+close cycle with 20 processes (ops/s):
      before: 606177
      after:  3300576 (+444%)

   A follow-up patch adds missing unlocks in some corner cases and
   tidies things up.

 - Relax the mount visibility check for subset=pid mounts

   When procfs is mounted with subset=pid, all static files become
   unavailable and only the dynamic pid information is accessible. In
   that case there is no point in imposing the full mount visibility
   restrictions on the mounter - everything that can be hidden in procfs
   is already inaccessible. These restrictions prevented procfs from
   being mounted inside rootless containers since almost all container
   implementations overmount parts of procfs to hide certain
   directories.

   As part of this, /proc/self/net is only shown in subset=pid mounts for
   CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the
   SB_I_USERNS_VISIBLE superblock flag is replaced with an
   FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are
   recorded in a list, and the mount restrictions are finally
   documented.

 - Protect ptrace_may_access() with exec_update_lock in procfs

   Most uses of ptrace_may_access() in procfs should hold
   exec_update_lock to avoid TOCTOU issues with concurrent privileged
   execve() (like setuid binary execution).

   This fixes the easy cases - the owner and visibility checks and the
   FD link permission checks - with the gnarlier ones to follow later.
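
The /proc/filesystems caching described in the first item can be pictured
with a small userspace model: the listing text is regenerated only when
the list changes, and the read path returns the cached string. This is a
sketch under assumptions, not the kernel code; the struct and function
names are hypothetical, and locking and RCU are left out entirely. The
line format ("nodev" or nothing, a tab, then the name) matches what
/proc/filesystems prints.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_FS     16
#define CACHE_SIZE 512

struct fs_entry {
    const char *name;
    int nodev;               /* 1 if no block device is required */
};

struct fs_registry {
    struct fs_entry entries[MAX_FS];
    int count;
    char cache[CACHE_SIZE];  /* pre-generated listing text */
};

/* Rebuild the cached text; called only when the list changes. */
void fs_rebuild_cache(struct fs_registry *r)
{
    size_t off = 0;

    r->cache[0] = '\0';
    for (int i = 0; i < r->count; i++) {
        size_t room = CACHE_SIZE - off;
        int n = snprintf(r->cache + off, room, "%s\t%s\n",
                         r->entries[i].nodev ? "nodev" : "",
                         r->entries[i].name);
        if (n < 0 || (size_t)n >= room) {
            r->cache[off] = '\0';  /* drop the partial line */
            break;
        }
        off += (size_t)n;
    }
}

/* Register a filesystem and refresh the cache once. */
int fs_register(struct fs_registry *r, const char *name, int nodev)
{
    if (r->count >= MAX_FS)
        return -1;
    r->entries[r->count].name = name;
    r->entries[r->count].nodev = nodev;
    r->count++;
    fs_rebuild_cache(r);
    return 0;
}

/* Read path: no list walk and no formatting, just the cached text. */
const char *fs_read(const struct fs_registry *r)
{
    return r->cache;
}
```

The cost of formatting moves from every read to the rare registration,
which is the trade the message describes for a file that is read far more
often than the list changes.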

* tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: fix ups and tidy ups to /proc/filesystems caching
  proc: protect ptrace_may_access() with exec_update_lock (FD links)
  proc: protect ptrace_may_access() with exec_update_lock (part 1)
  docs: proc: add documentation about mount restrictions
  proc: handle subset=pid separately in userns visibility checks
  proc: prevent reconfiguring subset=pid
  proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  fs: cache the string generated by reading /proc/filesystems
  sysfs: remove trivial sysfs_get_tree() wrapper
  fs: RCU-ify filesystems list
  fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED
  proc: allow to mark /proc files permanent outside of fs/proc/
  namespace: record fully visible mounts in list

wind ->s_roots via ->d_sib instead of ->d_hash

2026-06-05T04:34:56+00:00

shrink_dcache_for_umount() is supposed to handle the possibility of
some of the dentries to be evicted being in other threads' shrink
lists; it either kills them, leaving an empty husk to be freed by
the owner of shrink list whenever it gets around to that, or it
waits for the eviction in progress to get completed.

That relies upon dentry remaining attached to the tree until the
eviction reaches dentry_unlist() and its ->d_sib gets removed
from the list.  Unfortunately, the secondary roots are linked
via ->d_hash rather than ->d_sib, and they are removed from
that list before their inode references are dropped.

If shrink_dentry_list() from another thread ends up evicting
one of the secondary roots and gets to that point in dentry_kill()
when shrink_dcache_for_umount() is looking for secondary roots,
the latter will *not* notice anything, possibly leading to
warnings about busy inodes at umount time and all kinds of breakage
after that.

Moreover, shrink_dcache_for_umount() walks the list of secondary
roots with no protection whatsoever, so it might end up calling
dget() on a dentry that already passed through
	lockref_mark_dead(&dentry->d_lockref);
ending up with corrupted refcount and possible UAF.

AFAICS, the most straightforward way to deal with that would be
to have secondary roots linked via ->d_sib rather than ->d_hash;
then they would remain on the list until killed, and we could
use d_add_waiter() machinery to wait for eviction in progress.

Changes:
	* secondary roots look the same as ->s_root from d_unhashed()
and d_unlinked() POV now.
	* secondary roots are represented as "no parent, but on ->d_sib"
instead of "no parent, but on ->d_hash".
	* since ->d_sib is a plain hlist, we protect it with per-superblock
spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for
non-root dentries it would be protected by ->d_lock of parent).
	* __d_obtain_alias() uses ->d_sib for linkage when allocating
a secondary root.
	* d_splice_alias_ops() detects splicing of a secondary root and
removes it from the list before calling __d_move().
	* dentry_unlist() detects eviction of a secondary root and
removes it from the list; no need to play the games for d_walk() sake,
since the latter is not going to look for the next sibling of those
anyway.
	* ___d_drop() doesn't care about ->s_roots anymore.
	* shrink_dcache_for_umount() uses proper locking for access to
the list of secondary roots and if it runs into one that is in the middle
of eviction waits for that to finish.

Signed-off-by: Al Viro

writeback: use a per-sb counter to drain inode wb switches at umount

2026-05-22T10:06:35+00:00

Tracking in-flight inode wb switches with a single global counter
(isw_nr_in_flight) plus a synchronize_rcu() based wait in
cgroup_writeback_umount() forces every umount to take a global hit
whenever any other superblock on the system has wb switches in flight,
even if the superblock being unmounted has none of its own.

Replace the global synchronize_rcu()/flush_workqueue() pair with a
per-sb counter, s_isw_nr_in_flight, plus three small helpers:

  - cgroup_writeback_pin(sb)   - increment counter
  - cgroup_writeback_unpin(sb) - decrement and wake drainer if last
  - cgroup_writeback_drain(sb) - wait for counter to reach zero

The wiring is:

  - inode_prepare_wbs_switch() pins before checking SB_ACTIVE and
    grabbing the inode; failure paths unpin before returning.  A
    lockless SB_ACTIVE check at the top of the function lets us skip
    the atomic_inc/smp_mb dance once SB_ACTIVE has been cleared (it
    is monotonic and never set back).
  - process_inode_switch_wbs() unpins after the matching iput().
  - cgroup_writeback_umount() drains the per-sb counter via
    wait_var_event().

The smp_mb() pair between inode_prepare_wbs_switch() and
cgroup_writeback_umount() keeps the SB_ACTIVE / counter ordering:
either the umounter sees a non-zero counter and waits, or the
switcher sees SB_ACTIVE cleared and aborts before grabbing the
inode.
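
The pin/unpin/drain protocol and its ordering argument can be modelled in
userspace with C11 atomics. This is a minimal sketch under assumptions,
not the kernel implementation: struct sb_model and the sb_* functions are
hypothetical names, sequentially consistent atomics stand in for the
atomic_inc/smp_mb pairing, clearing the flag and draining are folded into
one function, and a yield loop replaces wait_var_event().

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the superblock state involved. */
struct sb_model {
    atomic_bool active;     /* models SB_ACTIVE; only goes true -> false */
    atomic_int  in_flight;  /* models s_isw_nr_in_flight */
};

void sb_init(struct sb_model *sb)
{
    atomic_init(&sb->active, true);
    atomic_init(&sb->in_flight, 0);
}

/* Drop one pin. */
void sb_unpin(struct sb_model *sb)
{
    atomic_fetch_sub(&sb->in_flight, 1);
}

/* Switcher side: returns true if the switch may proceed (pin held). */
bool sb_try_pin(struct sb_model *sb)
{
    if (!atomic_load(&sb->active))        /* lockless early exit */
        return false;
    atomic_fetch_add(&sb->in_flight, 1);  /* seq_cst: increment + barrier */
    if (!atomic_load(&sb->active)) {      /* umount raced with us */
        sb_unpin(sb);                     /* failure path unpins */
        return false;
    }
    return true;
}

/* Umount side: clear the flag, then wait for all pins to drop. */
void sb_deactivate_and_drain(struct sb_model *sb)
{
    atomic_store(&sb->active, false);
    while (atomic_load(&sb->in_flight) != 0)
        sched_yield();                    /* the kernel sleeps instead */
}
```

Because the switcher increments before re-checking the flag and the
umounter clears the flag before reading the counter, at least one side
observes the other: either the drain loop sees a non-zero counter, or
sb_try_pin() sees the cleared flag and backs out.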

The global isw_nr_in_flight is left in place, since it is still used
to throttle in-flight switches via WB_FRN_MAX_IN_FLIGHT.

The rcu_read_lock() extension in inode_switch_wbs() and
cleanup_offline_cgwb() that the race fix added is no longer needed
and is reverted; the synchronize_rcu() that the race fix added to
cgroup_writeback_umount() is dropped as well.

The following numbers were measured on a 16 vCPU QEMU guest with 4
background superblocks each churning "create memcg -> write 1 MiB ->
rmdir memcg" to keep the global isw_nr_in_flight non-zero.  Latencies
are wall-clock around umount(8); only the target sb's umount is
measured.

Target sb runs its own cgwb churn:

                              p50      p95      p99      max
  global synchronize_rcu()   67.6 ms  88.3 ms  88.3 ms  96.8 ms
  per-sb counter (this)       7.9 ms  10.0 ms  10.0 ms  10.1 ms

Idle target umount latency under cross-sb cgwb-switch pressure:

                              p50      p95      p99      max
  global synchronize_rcu()   62.7 ms  95.4 ms 108.1 ms 108.6 ms
  per-sb counter (this)       5.3 ms   6.9 ms   7.4 ms   7.4 ms
  no-pressure baseline        4.9 ms   5.9 ms   6.3 ms   6.7 ms

8 concurrent umounts of idle sbs under the same pressure:

                              p50      p95      max
  global synchronize_rcu()   61.3 ms  99.5 ms 113.7 ms
  per-sb counter (this)       8.1 ms   9.1 ms   9.5 ms

In-kernel cgroup_writeback_umount() time across the same run
(bpftrace, ~340 calls covering all scenarios):

  global synchronize_rcu()    12371 ms total (~36 ms / call)
  per-sb counter (this)        1.37 ms total ( ~4 us / call)

Suggested-by: Christian Brauner 
Link: https://lore.kernel.org/r/177910456953.488929.2169908940676707307.b4-review@b4
Reviewed-by: Jan Kara 
Signed-off-by: Baokun Li 
Link: https://patch.msgid.link/20260521095016.2791354-4-libaokun@linux.alibaba.com
Acked-by: Tejun Heo 
Signed-off-by: Christian Brauner (Amutable)

proc: handle subset=pid separately in userns visibility checks

2026-05-11T21:13:02+00:00

When procfs is mounted with subset=pid, only the dynamic process-related
part of the filesystem remains visible. That part cannot be hidden by
overmounts, so checking whether an existing procfs mount is fully
visible does not make sense for this mode.

At the same time, a subset=pid procfs mount must not be used as evidence
that a later procfs mount would not reveal additional information. It
provides a restricted view of procfs, not the full filesystem view.

Mark subset=pid procfs instances as restricted variants. Ignore
restricted variants when looking for an already-visible mount, and allow
new restricted variants without consulting mnt_already_visible().
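
The rule in the last paragraph can be sketched as a userspace model. The
struct and function names here are hypothetical, and the model ignores
every other permission check a real mount goes through; it shows only how
restricted variants are skipped on both sides of the decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mount_model {
    bool restricted;     /* e.g. procfs mounted with subset=pid */
    bool fully_visible;  /* nothing overmounted on top of it */
};

/* Is there an existing full (non-restricted), fully visible mount? */
bool already_visible(const struct mount_model *mounts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (mounts[i].restricted)
            continue;    /* a restricted view is no evidence */
        if (mounts[i].fully_visible)
            return true;
    }
    return false;
}

/* May a new instance be mounted, as far as visibility is concerned? */
bool may_mount(bool new_is_restricted,
               const struct mount_model *mounts, size_t n)
{
    if (new_is_restricted)
        return true;     /* the visibility check is not consulted */
    return already_visible(mounts, n);
}
```

With only a subset=pid mount present, a new subset=pid mount is allowed
while a new full mount is not, which matches the two halves of the rule.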

Signed-off-by: Alexey Gladkov 
Link: https://patch.msgid.link/4d5e760c3d534dd2e05578d119cc408450053a98.1777278334.git.legion@kernel.org
Reviewed-by: Aleksa Sarai 
Signed-off-by: Christian Brauner

fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED

2026-05-11T21:13:01+00:00

Whether a filesystem's mounts need to undergo a visibility check in user
namespaces is a static property of the filesystem type, not a runtime
property of each superblock instance. Both proc and sysfs always set
SB_I_USERNS_VISIBLE on their superblocks unconditionally (sysfs does so
on first creation, and subsequent mounts reuse the same superblock).

Move this flag from sb->s_iflags (SB_I_USERNS_VISIBLE) to
file_system_type->fs_flags (FS_USERNS_MOUNT_RESTRICTED) so the intent
is expressed at the filesystem type level where it belongs.

All check sites are updated to test sb->s_type->fs_flags instead of
sb->s_iflags. The SB_I_NOEXEC and SB_I_NODEV flags remain on the
superblock as they are runtime properties set during fill_super.
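
As a rough userspace illustration of the new check site, assuming mock
structs that mirror only the fields named above: the flag names come from
the message, but the bit values, struct names and helper name are
placeholders and not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values are placeholders for illustration, not the kernel's. */
#define FS_USERNS_MOUNT_RESTRICTED 0x1u  /* static, on the fs type */
#define SB_I_NOEXEC                0x2u  /* runtime, per superblock */
#define SB_I_NODEV                 0x4u  /* runtime, per superblock */

struct fs_type_model {
    unsigned int fs_flags;
};

struct super_model {
    const struct fs_type_model *s_type;
    unsigned int s_iflags;
};

/* The check now reads the filesystem type, not the superblock. */
bool needs_visibility_check(const struct super_model *sb)
{
    return (sb->s_type->fs_flags & FS_USERNS_MOUNT_RESTRICTED) != 0;
}
```

Every superblock of a given type answers the same way, which is the
point of moving the flag off the per-instance s_iflags word.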

Link: https://patch.msgid.link/72887c5b6204dc3adf5a53104f0be6bd8bc4f6cd.1777278334.git.legion@kernel.org
Reviewed-by: Aleksa Sarai 
Signed-off-by: Christian Brauner

writeback: don't block sync for filesystems with no data integrity guarantees

2026-03-20T13:18:56+00:00

Add an SB_I_NO_DATA_INTEGRITY superblock flag for filesystems that cannot
guarantee data persistence on sync (e.g. fuse). For superblocks with this
flag set, sync kicks off writeback of dirty inodes but does not wait
for the flusher threads to complete the writeback.

This replaces the per-inode AS_NO_DATA_INTEGRITY mapping flag added in
commit f9a49aa302a0 ("fs/writeback: skip AS_NO_DATA_INTEGRITY mappings
in wait_sb_inodes()"). The flag belongs at the superblock level because
data integrity is a filesystem-wide property, not a per-inode one.
Having this flag at the superblock level also allows us to skip having
to iterate every dirty inode in wait_sb_inodes() only to skip each inode
individually.
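
The behavioural difference can be reduced to a toy model. Everything in
it is hypothetical apart from the flag name: the bit value is a
placeholder, and the two counters only record which phases of sync ran.

```c
#include <assert.h>

#define SB_I_NO_DATA_INTEGRITY 0x1u  /* placeholder bit value */

struct sync_model {
    unsigned int s_iflags;
    int started;  /* times writeback was kicked off */
    int waited;   /* times sync waited for completion */
};

/* Model of syncing one superblock. */
void sync_sb(struct sync_model *sb)
{
    sb->started++;  /* writeback is always kicked off */
    if (sb->s_iflags & SB_I_NO_DATA_INTEGRITY)
        return;     /* one check per superblock, no per-inode walk */
    sb->waited++;   /* wait for the flusher and for the inodes */
}
```

A single test on the superblock replaces the per-inode skip, so a flagged
filesystem never reaches the waiting phase at all.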

Prior to this commit, mappings with no data integrity guarantees skipped
waiting on writeback completion but still waited on the flusher threads
to finish initiating the writeback. Waiting on the flusher threads is
unnecessary. This commit kicks off writeback but does not wait on the
flusher threads. This change properly addresses a recent report [1] for
a suspend-to-RAM hang seen on fuse-overlayfs that was caused by waiting
on the flusher threads to finish:

Workqueue: pm_fs_sync pm_fs_sync_work_fn
Call Trace:
 <TASK>
 __schedule+0x457/0x1720
 schedule+0x27/0xd0
 wb_wait_for_completion+0x97/0xe0
 sync_inodes_sb+0xf8/0x2e0
 __iterate_supers+0xdc/0x160
 ksys_sync+0x43/0xb0
 pm_fs_sync_work_fn+0x17/0xa0
 process_one_work+0x193/0x350
 worker_thread+0x1a1/0x310
 kthread+0xfc/0x240
 ret_from_fork+0x243/0x280
 ret_from_fork_asm+0x1a/0x30
 </TASK>

On fuse this is problematic because there are paths that may cause the
flusher thread to block. For example, systemd may freeze the user session
cgroups first, which freezes the fuse daemon, before invoking the kernel
suspend. The kernel suspend triggers ->write_inode(), which on fuse issues
a synchronous setattr request that cannot be processed since the daemon is
frozen. Alternatively, if the daemon is buggy and cannot properly complete
writeback, initiating writeback on a dirty folio already under writeback
leads to writeback_get_folio() -> folio_prepare_writeback() -> an
unconditional wait on writeback to finish, which will cause a hang.
This commit restores fuse to its prior behavior before tmp folios were
removed, where sync was essentially a no-op.

[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1a-asuvfrbKXbEwwDSctvemF+6zfhdnuzO65Pt8HsFSRw@mail.gmail.com/T/#m632c4648e9cafc4239299887109ebd880ac6c5c1

Fixes: 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree")
Reported-by: John 
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong 
Link: https://patch.msgid.link/20260320005145.2483161-2-joannelkoong@gmail.com
Reviewed-by: Jan Kara 
Reviewed-by: David Hildenbrand (Arm) 
Signed-off-by: Christian Brauner

Merge tag 'vfs-7.0-rc1.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-02-09T22:43:47+00:00

Pull vfs mount updates from Christian Brauner:

 - statmount: accept fd as a parameter

   Extend struct mnt_id_req with a file descriptor field and a new
   STATMOUNT_BY_FD flag. When set, statmount() returns mount information
   for the mount the fd resides on — including detached mounts
   (unmounted via umount2(MNT_DETACH)).

   For detached mounts the STATMOUNT_MNT_POINT and STATMOUNT_MNT_NS_ID
   mask bits are cleared since neither is meaningful. The capability
   check is skipped for STATMOUNT_BY_FD since holding an fd already
   implies prior access to the mount and equivalent information is
   available through fstatfs() and /proc/pid/mountinfo without
   privilege. Includes comprehensive selftests covering both attached
   and detached mount cases.

 - fs: Remove internal old mount API code (1 patch)

   Now that every in-tree filesystem has been converted to the new
   mount API, remove all the legacy shim code in fs_context.c that
   handled unconverted filesystems. This deletes ~280 lines including
   legacy_init_fs_context(), the legacy_fs_context struct, and
   associated wrappers. The mount(2) syscall path for userspace remains
   untouched. Documentation references to the legacy callbacks are
   cleaned up.

 - mount: add OPEN_TREE_NAMESPACE to open_tree()

   Container runtimes currently use CLONE_NEWNS to copy the caller's
   entire mount namespace — only to then pivot_root() and recursively
   unmount everything they just copied. With large mount tables and
   thousands of parallel container launches this creates significant
   contention on the namespace semaphore.

   OPEN_TREE_NAMESPACE copies only the specified mount tree (like
   OPEN_TREE_CLONE) but returns a mount namespace fd instead of a
   detached mount fd. The new namespace contains the copied tree mounted
   on top of a clone of the real rootfs.

   This functions as a combined unshare(CLONE_NEWNS) + pivot_root() in a
   single syscall. Works with user namespaces: an unshare(CLONE_NEWUSER)
   followed by OPEN_TREE_NAMESPACE creates a mount namespace owned by
   the new user namespace. Mount namespace file mounts are excluded from
   the copy to prevent cycles. Includes ~1000 lines of selftests.
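
A hedged userspace sketch of how a caller might request such a namespace
fd through the open_tree() syscall. The wrapper name is hypothetical, and
passing OPEN_TREE_NAMESPACE together with OPEN_TREE_CLOEXEC is an
assumption; the exact flag combinations the kernel accepts are not stated
above. The flag's value is deliberately not defined here: if the installed
headers do not provide it, the wrapper fails with ENOSYS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Ask for a mount namespace fd holding a copy of the tree at `path`.
 * Returns the fd, or -1 with errno set.
 */
int open_tree_namespace(const char *path)
{
#if defined(SYS_open_tree) && defined(OPEN_TREE_NAMESPACE)
    /* Assumed flag combination; see the lead-in. */
    return (int)syscall(SYS_open_tree, AT_FDCWD, path,
                        OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
#else
    (void)path;
    errno = ENOSYS;  /* headers predate OPEN_TREE_NAMESPACE */
    return -1;
#endif
}
```

Since the returned fd is described as a mount namespace fd, a container
runtime would presumably enter it with setns(); that step is left out
because it needs privileges and a kernel that supports the flag.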

* tag 'vfs-7.0-rc1.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/open_tree: add OPEN_TREE_NAMESPACE tests
  mount: add OPEN_TREE_NAMESPACE
  fs: Remove internal old mount API code
  selftests: statmount: tests for STATMOUNT_BY_FD
  statmount: accept fd as a parameter
  statmount: permission check should return EPERM