linux-toradex.git/fs/kernfs, branch master

Merge tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

2026-04-14T02:03:11+00:00

Pull driver core updates from Danilo Krummrich:
 "debugfs:
   - Fix NULL pointer dereference in debugfs_create_str()
   - Fix misplaced EXPORT_SYMBOL_GPL for debugfs_create_str()
   - Fix soundwire debugfs NULL pointer dereference from uninitialized
     firmware_file

  device property:
   - Make fwnode flags modifications thread safe; widen the field to
     unsigned long and use set_bit() / clear_bit() based accessors
   - Document how to check for the property presence

  devres:
   - Separate struct devres_node from its "subclasses" (struct devres,
     struct devres_group); give struct devres_node its own release and
     free callbacks for per-type dispatch
   - Introduce struct devres_action for devres actions, avoiding the
     ARCH_DMA_MINALIGN alignment overhead of struct devres
   - Export struct devres_node and its init/add/remove/dbginfo
     primitives for use by Rust Devres
   - Fix missing node debug info in devm_krealloc()
   - Use guard(spinlock_irqsave) where applicable; consolidate unlock
     paths in devres_release_group()

  driver_override:
   - Convert PCI, WMI, vdpa, s390/cio, s390/ap, and fsl-mc to the
     generic driver_override infrastructure, replacing per-bus
     driver_override strings, sysfs attributes, and match logic; fixes a
     potential UAF from unsynchronized access to driver_override in bus
     match() callbacks
   - Simplify __device_set_driver_override() logic

  kernfs:
   - Send IN_DELETE_SELF and IN_IGNORED inotify events on kernfs file
     and directory removal
   - Add corresponding selftests for memcg

  platform:
   - Allow attaching software nodes when creating platform devices via a
     new 'swnode' field in struct platform_device_info
   - Add kerneldoc for struct platform_device_info

  software node:
   - Move software node initialization from postcore_initcall() to
     driver_init(), making it available early in the boot process
   - Move kernel_kobj initialization (ksysfs_init) earlier to support
     the above
   - Remove software_node_exit(); dead code in a built-in unit

  SoC:
   - Introduce of_machine_read_compatible() and of_machine_read_model()
     OF helpers and export soc_attr_read_machine() to replace direct
     accesses to of_root from SoC drivers; also enables
     CONFIG_COMPILE_TEST coverage for these drivers

  sysfs:
   - Constify attribute group array pointers to
     'const struct attribute_group *const *' in sysfs functions,
     device_add_groups() / device_remove_groups(), and struct class

  Rust:
   - Devres:
      - Embed struct devres_node directly in Devres instead of going
        through devm_add_action(), avoiding the extra allocation and the
        unnecessary ARCH_DMA_MINALIGN alignment

   - I/O:
      - Turn IoCapable from a marker trait into a functional trait
        carrying the raw I/O accessor implementation (io_read /
        io_write), providing working defaults for the per-type Io
        methods
      - Add RelaxedMmio wrapper type, making relaxed accessors usable in
        code generic over the Io trait
      - Remove overloaded per-type Io methods and per-backend macros
        from Mmio and PCI ConfigSpace

   - I/O (Register):
      - Add IoLoc trait and generic read/write/update methods to the Io
        trait, making I/O operations parameterizable by typed locations
      - Add register! macro for defining hardware register types with
        typed bitfield accessors backed by Bounded values; supports
        direct, relative, and array register addressing
      - Add write_reg() / try_write_reg() and LocatedRegister trait
      - Update PCI sample driver to demonstrate the register! macro

         Example:

         ```
             register! {
                 /// UART control register.
                 CTRL(u32) @ 0x18 {
                     /// Receiver enable.
                     19:19   rx_enable => bool;
                     /// Parity configuration.
                     14:13   parity ?=> Parity;
                 }

                 /// FIFO watermark and counter register.
                 WATER(u32) @ 0x2c {
                     /// Number of datawords in the receive FIFO.
                     26:24   rx_count;
                     /// RX interrupt threshold.
                     17:16   rx_water;
                 }
             }

             impl WATER {
                 fn rx_above_watermark(&self) -> bool {
                     self.rx_count() > self.rx_water()
                 }
             }

             fn init(bar: &pci::Bar) {
                 let water = WATER::zeroed()
                     .with_const_rx_water::<1>(); // > 3 would not compile
                 bar.write_reg(water);

                 let ctrl = CTRL::zeroed()
                     .with_parity(Parity::Even)
                     .with_rx_enable(true);
                 bar.write_reg(ctrl);
             }

             fn handle_rx(bar: &pci::Bar) {
                 if bar.read(WATER).rx_above_watermark() {
                     // drain the FIFO
                 }
             }

             fn set_parity(bar: &pci::Bar, parity: Parity) {
                 bar.update(CTRL, |r| r.with_parity(parity));
             }
         ```

   - IRQ:
      - Move 'static bounds from where clauses to trait declarations for
        IRQ handler traits

   - Misc:
      - Enable the generic_arg_infer Rust feature
      - Extend Bounded with shift operations, single-bit bool
        conversion, and const get()

  Misc:
   - Make deferred_probe_timeout default a Kconfig option
   - Drop auxiliary_dev_pm_ops; the PM core falls back to driver PM
     callbacks when no bus type PM ops are set
   - Add conditional guard support for device_lock()
   - Add ksysfs.c to the DRIVER CORE MAINTAINERS entry
   - Fix kernel-doc warnings in base.h
   - Fix stale reference to memory_block_add_nid() in documentation"

* tag 'driver-core-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (67 commits)
  bus: fsl-mc: use generic driver_override infrastructure
  s390/ap: use generic driver_override infrastructure
  s390/cio: use generic driver_override infrastructure
  vdpa: use generic driver_override infrastructure
  platform/wmi: use generic driver_override infrastructure
  PCI: use generic driver_override infrastructure
  driver core: make software nodes available earlier
  software node: remove software_node_exit()
  kernel: ksysfs: initialize kernel_kobj earlier
  MAINTAINERS: add ksysfs.c to the DRIVER CORE entry
  drivers/base/memory: fix stale reference to memory_block_add_nid()
  device property: Document how to check for the property presence
  soundwire: debugfs: initialize firmware_file to empty string
  debugfs: fix placement of EXPORT_SYMBOL_GPL for debugfs_create_str()
  debugfs: check for NULL pointer in debugfs_create_str()
  driver core: Make deferred_probe_timeout default a Kconfig option
  driver core: simplify __device_set_driver_override() clearing logic
  driver core: auxiliary bus: Drop auxiliary_dev_pm_ops
  device property: Make modifications of fwnode "flags" thread safe
  rust: devres: embed struct devres_node directly
  ...

Merge tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-04-13T17:10:28+00:00

Pull vfs xattr updates from Christian Brauner:
 "This reworks the simple_xattr infrastructure and adds support for
  user.* extended attributes on sockets.

  The simple_xattr subsystem currently uses an rbtree protected by a
  reader-writer spinlock. This series replaces the rbtree with an
  rhashtable giving O(1) average-case lookup with RCU-based lockless
  reads. This sped up concurrent access patterns on tmpfs quite a bit
  and it's an overall easy enough conversion to do and gets rid of
  rwlock_t.

  The conversion is done incrementally: a new rhashtable path is added
  alongside the existing rbtree, consumers are migrated one at a time
  (shmem, kernfs, pidfs), and then the rbtree code is removed. All three
  consumers switch from embedded structs to pointer-based lazy
  allocation so the rhashtable overhead is only paid for inodes that
  actually use xattrs.

  With this infrastructure in place the series adds support for user.*
  xattrs on sockets. Path-based AF_UNIX sockets inherit xattr support
  from the underlying filesystem (e.g. tmpfs) but sockets in sockfs -
  that is everything created via socket() including abstract namespace
  AF_UNIX sockets - had no xattr support at all.

  The xattr_permission() checks are reworked to allow user.* xattrs on
  S_IFSOCK inodes. Sockfs sockets get per-inode limits of 128 xattrs and
  128KB total value size matching the limits already in use for kernfs.

  The practical motivation comes from several directions. systemd and
  GNOME are expanding their use of Varlink as an IPC mechanism.

  For D-Bus there are tools like dbus-monitor that can observe IPC
  traffic across the system but this only works because D-Bus has a
  central broker.

  For Varlink there is no broker and there is currently no way to
  identify which sockets speak Varlink. With user.* xattrs on sockets a
  service can label its socket with the IPC protocol it speaks (e.g.,
  user.varlink=1) and an eBPF program can then selectively capture
  traffic on those sockets. Enumerating bound sockets via netlink
  combined with these xattr labels gives a way to discover all Varlink
  IPC entrypoints for debugging and introspection.

  Similarly, systemd-journald wants to use xattrs on the /dev/log socket
  for protocol negotiation to indicate whether RFC 5424 structured
  syslog is supported or whether only the legacy RFC 3164 format should
  be used.

  In containers these labels are particularly useful as high-privilege
  or more complicated solutions for socket identification aren't
  available.

  The series comes with comprehensive selftests covering path-based
  AF_UNIX sockets, sockfs socket operations, per-inode limit
  enforcement, and xattr operations across multiple address families
  (AF_INET, AF_INET6, AF_NETLINK, AF_PACKET)"

* tag 'vfs-7.1-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/xattr: test xattrs on various socket families
  selftests/xattr: sockfs socket xattr tests
  selftests/xattr: path-based AF_UNIX socket xattr tests
  xattr: support extended attributes on sockets
  xattr,net: support limited amount of extended attributes on sockfs sockets
  xattr: move user limits for xattrs to generic infra
  xattr: switch xattr_permission() to switch statement
  xattr: add xattr_permission_error()
  xattr: remove rbtree-based simple_xattr infrastructure
  pidfs: adapt to rhashtable-based simple_xattrs
  kernfs: adapt to rhashtable-based simple_xattrs with lazy allocation
  shmem: adapt to rhashtable-based simple_xattrs with lazy allocation
  xattr: add rhashtable-based simple_xattr infrastructure
  xattr: add rcu_head and rhash_head to struct simple_xattr

kernfs: make directory seek namespace-aware

2026-04-09T12:36:52+00:00

The rbtree backing kernfs directories is ordered by (hash, ns_id, name)
but kernfs_dir_pos() only searches by hash when seeking to a position
during readdir. When two nodes from different namespaces share the same
hash value, the binary search can land on a node in the wrong namespace.
The subsequent skip-forward loop walks rb_next() and may overshoot the
correct node, silently dropping an entry from the readdir results.

With the recent switch from raw namespace pointers to public namespace
ids as hash seeds, computing hash collisions became an offline operation.
An unprivileged user could unshare into a new network namespace, create
a single interface whose name-hash collides with a target entry in
init_net, and cause a victim's seekdir/readdir on /sys/class/net to miss
that entry.

Fix this by extending the rbtree search in kernfs_dir_pos() to also
compare namespace ids when hashes match. Since the rbtree is already
ordered by (hash, ns_id, name), this makes the seek land directly in the
correct namespace's range, eliminating the wrong-namespace overshoot.

Signed-off-by: Christian Brauner
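
The overshoot described in the commit above can be reproduced in a small
userspace model. This is only a sketch: it replaces the rbtree with a
sorted array, and struct entry, seek_hash_only() and seek_ns_aware() are
hypothetical names, not kernel code. It shows how a hash-only search can
land on a same-hash entry from another namespace, and how comparing
namespace ids on a hash tie avoids that.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernfs directory entry. */
struct entry {
    unsigned int hash;
    uint64_t ns_id;
    const char *name;
};

/* Sorted by (hash, ns_id, name), mirroring the rbtree ordering. */
static const struct entry dir[] = {
    { 10, 1, "eth0" },
    { 20, 1, "target" },   /* entry the victim expects to see */
    { 20, 2, "collider" }, /* same hash, different namespace */
    { 30, 1, "lo" },
};
#define DIR_LEN ((int)(sizeof(dir) / sizeof(dir[0])))

/* Walk forward past entries that belong to other namespaces. */
static int skip_foreign(int pos, uint64_t ns_id)
{
    while (pos >= 0 && dir[pos].ns_id != ns_id)
        pos = (pos + 1 < DIR_LEN) ? pos + 1 : -1;
    return pos;
}

/* Old behaviour: stop at the first probed entry whose hash matches. */
static int seek_hash_only(unsigned int hash, uint64_t ns_id)
{
    int lo = 0, hi = DIR_LEN, pos = -1;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        pos = mid;
        if (hash < dir[mid].hash)
            hi = mid;
        else if (hash > dir[mid].hash)
            lo = mid + 1;
        else
            break;
    }
    return skip_foreign(pos, ns_id);
}

/* Fixed behaviour: break hash ties by namespace id. */
static int seek_ns_aware(unsigned int hash, uint64_t ns_id)
{
    int lo = 0, hi = DIR_LEN, pos = -1;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        pos = mid;
        if (hash < dir[mid].hash)
            hi = mid;
        else if (hash > dir[mid].hash)
            lo = mid + 1;
        else if (ns_id < dir[mid].ns_id)
            hi = mid;
        else if (ns_id > dir[mid].ns_id)
            lo = mid + 1;
        else
            break;
    }
    return skip_foreign(pos, ns_id);
}
```

In this model, seeking to hash 20 from namespace 1 with seek_hash_only()
probes "collider" first and then skips forward to "lo", so "target" is
never returned. seek_ns_aware() lands on "target" directly.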

kernfs: use namespace id instead of pointer for hashing and comparison

2026-04-09T12:36:52+00:00

kernfs uses the namespace tag as both a hash seed (via init_name_hash())
and a comparison key in the rbtree. The resulting hash values are exposed
to userspace through directory seek positions (ctx->pos), and the raw
pointer comparisons in kernfs_name_compare() encode kernel pointer
ordering into the rbtree layout.

This constitutes a KASLR information leak since the hash and ordering
derived from kernel pointers can be observed from userspace.

Fix this by using the 64-bit namespace id (ns_common::ns_id) instead of
the raw pointer value for both hashing and comparison. The namespace id
is a stable, non-secret identifier that is already exposed to userspace
through other interfaces (e.g., /proc/pid/ns/, ioctl NS_GET_NSID).

Introduce kernfs_ns_id() as a helper that extracts the namespace id from
a potentially-NULL ns_common pointer, returning 0 for the no-namespace
case.

All namespace equality checks in the directory iteration and dentry
revalidation paths are also switched from pointer comparison to ns_id
comparison for consistency.

Signed-off-by: Christian Brauner
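
A minimal userspace sketch of the idea in the commit above, assuming a
reduced stand-in for struct ns_common that carries only an id. The names
ns_common_model, model_ns_id() and model_name_compare() are hypothetical
and chosen so they are not mistaken for the kernel's own helpers. The
point is that ordering depends only on the stable id, never on a pointer
value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in; the kernel's struct ns_common has more fields. */
struct ns_common_model {
    uint64_t ns_id;
};

/* Mirrors the described helper: 0 stands for "no namespace". */
static uint64_t model_ns_id(const struct ns_common_model *ns)
{
    return ns ? ns->ns_id : 0;
}

/* Order by (hash, ns_id, name) without comparing pointer values. */
static int model_name_compare(unsigned int hash, const char *name,
                              const struct ns_common_model *ns,
                              unsigned int other_hash,
                              const char *other_name,
                              const struct ns_common_model *other_ns)
{
    uint64_t a = model_ns_id(ns);
    uint64_t b = model_ns_id(other_ns);

    if (hash != other_hash)
        return hash < other_hash ? -1 : 1;
    if (a != b)
        return a < b ? -1 : 1;
    return strcmp(name, other_name);
}
```

Because the result depends only on values that userspace can already
learn, the directory layout in this model reveals nothing about where
the namespace objects live in memory.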

kernfs: pass struct ns_common instead of const void * for namespace tags

2026-04-09T12:36:52+00:00

kernfs has historically used const void * to pass around namespace tags
used for directory-level namespace filtering. The only current user of
this is sysfs network namespace tagging where struct net pointers are
cast to void *.

Replace all const void * namespace parameters with const struct
ns_common * throughout the kernfs, sysfs, and kobject namespace layers.
This includes the kobj_ns_type_operations callbacks, kobject_namespace(),
and all sysfs/kernfs APIs that accept or return namespace tags.

Passing struct ns_common is needed because various codepaths require
access to the underlying namespace. A struct ns_common can always be
converted back to the concrete namespace type (e.g., struct net) via
container_of() or to_ns_common() in the reverse direction.

This is a preparatory change for switching to ns_id-based directory
iteration to prevent a KASLR pointer leak through the current use of
raw namespace pointers as hash seeds and comparison keys.

Signed-off-by: Christian Brauner
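
The round trip between a concrete namespace type and its embedded common
part, as mentioned in the commit above, can be sketched in userspace as
follows. The types net_model and ns_common_model, the macro
model_container_of() and the two helper functions are hypothetical
stand-ins for illustration; the kernel's real definitions differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced version of the common namespace part. */
struct ns_common_model {
    uint64_t ns_id;
};

/* Hypothetical concrete namespace type embedding the common part. */
struct net_model {
    int ifindex_base;
    struct ns_common_model ns;
};

/* Userspace equivalent of the container_of() idea. */
#define model_container_of(ptr, type, member) \
    ((type *)((const char *)(ptr) - offsetof(type, member)))

/* Concrete type to tag: take the address of the embedded member. */
static const struct ns_common_model *net_to_tag(const struct net_model *net)
{
    return net ? &net->ns : NULL;
}

/* Tag back to concrete type: subtract the member offset. */
static const struct net_model *tag_to_net(const struct ns_common_model *ns)
{
    return ns ? model_container_of(ns, const struct net_model, ns) : NULL;
}
```

Unlike a const void * tag, the typed pointer lets the compiler reject
callers that pass an unrelated object, and it gives the receiving code
direct access to fields such as the namespace id.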

kernfs: Add missing documentation for kernfs_put_active's drop_supers argument

2026-03-14T11:12:07+00:00

The drop_supers argument was added to kernfs_put_active to control
whether the kernfs_supers_rwsem is temporarily dropped along with the
kernfs_rwsem, but no documentation was added for it.

Fixes: eea5d2bb34ba ("kernfs: Send IN_DELETE_SELF and IN_IGNORED")
Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-kbuild-all/202603130112.2FcCzv1g-lkp@intel.com/
Signed-off-by: T.J. Mercier 
Link: https://patch.msgid.link/20260313175153.235681-1-tjmercier@google.com
Signed-off-by: Greg Kroah-Hartman

kernfs: Send IN_DELETE_SELF and IN_IGNORED

2026-03-12T14:51:03+00:00

Currently some kernfs files (e.g. cgroup.events, memory.events) support
inotify watches for IN_MODIFY, but unlike with regular filesystems, they
do not receive IN_DELETE_SELF or IN_IGNORED events when they are
removed. This means inotify watches persist after file deletion until
the process exits and the inotify file descriptor is cleaned up, or
until inotify_rm_watch is called manually.

This creates a problem for processes monitoring cgroups. For example, a
service monitoring memory.events for memory.high breaches needs to know
when a cgroup is removed to clean up its state. Where it's known that a
cgroup is removed when all processes die, without IN_DELETE_SELF the
service must resort to inefficient workarounds such as:
  1) Periodically scanning procfs to detect process death (wastes CPU
     and is susceptible to PID reuse).
  2) Holding a pidfd for every monitored cgroup (can exhaust file
     descriptors).

This patch enables IN_DELETE_SELF and IN_IGNORED events for kernfs files
and directories by clearing inode i_nlink values during removal. This
allows VFS to make the necessary fsnotify calls so that userspace
receives the inotify events.

As a result, applications can rely on a single existing watch on a file
of interest (e.g. memory.events) to receive notifications for both
modifications and the eventual removal of the file, as well as automatic
watch descriptor cleanup, simplifying userspace logic and improving
efficiency.

There is a gap in this implementation for certain file removals due to their
unique nature in kernfs. Directory removals that trigger file removals
occur through vfs_rmdir, which shrinks the dcache and emits fsnotify
events after the rmdir operation; there is no issue here. However kernfs
writes to particular files (e.g. cgroup.subtree_control) can also cause
file removal, but vfs_write does not attempt to emit fsnotify events
after the write operation, even if i_nlink counts are 0. As a use case
for monitoring this category of file removals is not known, they are
left without having IN_DELETE or IN_DELETE_SELF events generated.
Fanotify recursive monitoring also does not work for kernfs nodes that
do not have inodes attached, as they are created on-demand in kernfs.

Suggested-by: Jan Kara 
Signed-off-by: T.J. Mercier 
Tested-by: syzbot@syzkaller.appspotmail.com
Acked-by: Tejun Heo 
Link: https://patch.msgid.link/20260225223404.783173-3-tjmercier@google.com
Signed-off-by: Greg Kroah-Hartman
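
From the userspace side, the commit above means a single watch can
report both modifications and removal. The following is a sketch of the
event-handling half only, assuming the buffer was filled by read(2) on
an inotify file descriptor. struct watch_status and scan_events() are
hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/types.h>

struct watch_status {
    unsigned int modified; /* number of IN_MODIFY events seen */
    int deleted;           /* IN_DELETE_SELF seen */
    int ignored;           /* IN_IGNORED seen: the watch is gone */
};

/*
 * Walk a buffer of inotify events and record those that belong to the
 * watch descriptor wd.
 */
static void scan_events(const char *buf, ssize_t len, int wd,
                        struct watch_status *st)
{
    ssize_t off = 0;

    while (off + (ssize_t)sizeof(struct inotify_event) <= len) {
        struct inotify_event ev;

        /* Copy the fixed header out to avoid unaligned access. */
        memcpy(&ev, buf + off, sizeof(ev));
        if (ev.wd == wd) {
            if (ev.mask & IN_MODIFY)
                st->modified++;
            if (ev.mask & IN_DELETE_SELF)
                st->deleted = 1;
            if (ev.mask & IN_IGNORED)
                st->ignored = 1;
        }
        /* Skip the header plus the optional name that follows it. */
        off += (ssize_t)sizeof(ev) + (ssize_t)ev.len;
    }
}
```

A caller would typically create the descriptor with inotify_init1(),
register the file with inotify_add_watch(fd, path, IN_MODIFY |
IN_DELETE_SELF), and stop tracking the cgroup once st.ignored is set.
IN_IGNORED does not need to be requested in the mask.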

kernfs: Don't set_nlink for directories being removed

2026-03-12T14:51:03+00:00

If a directory is already in the process of removal its i_nlink count
becomes irrelevant because its contents are also about to be removed and
any pending filesystem operations on it or its contents will soon start
to fail. So we can avoid setting it for directories already flagged for
removal.

This avoids a race in the next patch, which adds clearing of the i_nlink
count for kernfs nodes being removed to support inotify delete events.

Use protection from the kernfs_iattr_rwsem to avoid adding more
contention to the kernfs_rwsem for calls to kernfs_refresh_inode.

Signed-off-by: T.J. Mercier 
Tested-by: syzbot@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20260225223404.783173-2-tjmercier@google.com
Signed-off-by: Greg Kroah-Hartman

xattr: move user limits for xattrs to generic infra

2026-03-02T10:06:42+00:00

Link: https://patch.msgid.link/20260216-work-xattr-socket-v1-9-c2efa4f74cb7@kernel.org
Acked-by: Darrick J. Wong 
Signed-off-by: Christian Brauner

kernfs: adapt to rhashtable-based simple_xattrs with lazy allocation

2026-03-02T10:05:50+00:00

Adapt kernfs to use the rhashtable-based xattr path and switch from an
embedded struct to pointer-based lazy allocation.

Change kernfs_iattrs.xattrs from embedded 'struct simple_xattrs' to a
pointer 'struct simple_xattrs *', initialized to NULL (zeroed by
kmem_cache_zalloc). Since kernfs_iattrs is already lazily allocated
itself, this adds a second level of lazy allocation specifically for
the xattr store.

The xattr store is allocated on first setxattr. Read paths
check for NULL and return -ENODATA or empty list.

Replaced xattr entries are freed via simple_xattr_free_rcu() to allow
concurrent RCU readers to finish.

The cleanup paths in kernfs_free_rcu() and __kernfs_new_node() error
handling conditionally free the xattr store only when allocated.

As Jan noted in [1]:

> This is a slight change in the lifetime rules because previously kernfs
> xattrs could be safely accessed only under RCU but after this change you
> have to hold inode reference *and* RCU to safely access them. I don't think
> anybody would be accessing xattrs without holding inode reference so this
> should be safe [...].

Link: https://patch.msgid.link/20260216-work-xattr-socket-v1-4-c2efa4f74cb7@kernel.org
Link: https://lore.kernel.org/3cnmtqmakpbb2uwhenrj7kdqu3uefykiykjllgfbtpkiwhaa4s@sghkevv7jned [1]
Acked-by: Darrick J. Wong 
Reviewed-by: Jan Kara 
Signed-off-by: Christian Brauner
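
The pointer-based lazy allocation pattern described in the commit above
can be sketched in userspace as follows. This is a simplified model, not
the kernfs code: the store holds a single attribute instead of an
rhashtable, there is no RCU, and all names (xattr_store_model,
iattrs_model, model_getxattr(), model_setxattr(), model_free()) are
hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical single-slot store standing in for struct simple_xattrs. */
struct xattr_store_model {
    char name[32];
    char value[32];
};

struct iattrs_model {
    struct xattr_store_model *xattrs; /* NULL until the first set */
};

/* Read path: a missing store simply means "no such attribute". */
static int model_getxattr(const struct iattrs_model *ia, const char *name,
                          char *out, size_t size)
{
    size_t len;

    if (!ia->xattrs || strcmp(ia->xattrs->name, name) != 0)
        return -ENODATA;
    len = strlen(ia->xattrs->value);
    if (len + 1 > size)
        return -ERANGE;
    memcpy(out, ia->xattrs->value, len + 1);
    return (int)len;
}

/* Write path: allocate the store on first use only. */
static int model_setxattr(struct iattrs_model *ia, const char *name,
                          const char *value)
{
    if (strlen(name) >= sizeof(ia->xattrs->name) ||
        strlen(value) >= sizeof(ia->xattrs->value))
        return -ERANGE;
    if (!ia->xattrs) {
        ia->xattrs = calloc(1, sizeof(*ia->xattrs));
        if (!ia->xattrs)
            return -ENOMEM;
    }
    strcpy(ia->xattrs->name, name);
    strcpy(ia->xattrs->value, value);
    return 0;
}

/* Cleanup: free the store only if it was ever allocated. */
static void model_free(struct iattrs_model *ia)
{
    if (ia->xattrs) {
        free(ia->xattrs);
        ia->xattrs = NULL;
    }
}
```

Nodes that never receive an xattr pay only for one NULL pointer in this
model, which is the saving the commit is after.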