summaryrefslogtreecommitdiff
path: root/fs/overlayfs/readdir.c
AgeCommit message (Collapse)Author
5 daysConvert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 daysConvert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 daystreewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
10 daysMerge tag 'vfs-7.0-rc1.misc.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull more misc vfs updates from Christian Brauner: "Features: - Optimize close_range() from O(range size) to O(active FDs) by using find_next_bit() on the open_fds bitmap instead of linearly scanning the entire requested range. This is a significant improvement for large-range close operations on sparse file descriptor tables. - Add FS_XFLAG_VERITY file attribute for fs-verity files, retrievable via FS_IOC_FSGETXATTR and file_getattr(). The flag is read-only. Add tracepoints for fs-verity enable and verify operations, replacing the previously removed debug printk's. - Prevent nfsd from exporting special kernel filesystems like pidfs and nsfs. These filesystems have custom ->open() and ->permission() export methods that are designed for open_by_handle_at(2) only and are incompatible with nfsd. Update the exportfs documentation accordingly. Fixes: - Fix KMSAN uninit-value in ovl_fill_real() where strcmp() was used on a non-null-terminated decrypted directory entry name from fscrypt. This triggered on encrypted lower layers when the decrypted name buffer contained uninitialized tail data. The fix also adds VFS-level name_is_dot(), name_is_dotdot(), and name_is_dot_dotdot() helpers, replacing various open-coded "." and ".." checks across the tree. - Fix read-only fsflags not being reset together with xflags in vfs_fileattr_set(). Currently harmless since no read-only xflags overlap with flags, but this would cause inconsistencies for any future shared read-only flag - Return -EREMOTE instead of -ESRCH from PIDFD_GET_INFO when the target process is in a different pid namespace. This lets userspace distinguish "process exited" from "process in another namespace", matching glibc's pidfd_getpid() behavior Cleanups: - Use C-string literals in the Rust seq_file bindings, replacing the kernel::c_str!() macro (available since Rust 1.77) - Fix typo in d_walk_ret enum comment, add porting notes for the readlink_copy() calling convention change" * tag 'vfs-7.0-rc1.misc.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add porting notes about readlink_copy() pidfs: return -EREMOTE when PIDFD_GET_INFO is called on another ns nfsd: do not allow exporting of special kernel filesystems exportfs: clarify the documentation of open()/permission() expotrfs ops fsverity: add tracepoints fs: add FS_XFLAG_VERITY for fs-verity files rust: seq_file: replace `kernel::c_str!` with C-Strings fs: dcache: fix typo in enum d_walk_ret comment ovl: use name_is_dot* helpers in readdir code fs: add helpers name_is_dot{,dot,_dotdot} ovl: Fix uninit-value in ovl_fill_real fs: reset read-only fsflags together with xflags fs/file: optimize close_range() complexity from O(N) to O(Sparse)
2026-01-29ovl: use name_is_dot* helpers in readdir codeAmir Goldstein
Use the helpers in place of all the different open coded variants. This makes the code more readable and robust. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20260128132406.23768-4-amir73il@gmail.com Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-29fs: add helpers name_is_dot{,dot,_dotdot}Amir Goldstein
Rename the helper is_dot_dotdot() into the name_ namespace and add complementary helpers to check for dot and dotdot names individually. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20260128132406.23768-3-amir73il@gmail.com Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-29ovl: Fix uninit-value in ovl_fill_realQing Wang
Syzbot reported a KMSAN uninit-value issue in ovl_fill_real. This iusse's call chain is: __do_sys_getdents64() -> iterate_dir() ... -> ext4_readdir() -> fscrypt_fname_alloc_buffer() // alloc -> fscrypt_fname_disk_to_usr // write without tail '\0' -> dir_emit() -> ovl_fill_real() // read by strcmp() The string is used to store the decrypted directory entry name for an encrypted inode. As shown in the call chain, fscrypt_fname_disk_to_usr() write it without null-terminate. However, ovl_fill_real() uses strcmp() to compare the name against "..", which assumes a null-terminated string and may trigger a KMSAN uninit-value warning when the buffer tail contains uninit data. Reported-by: syzbot+d130f98b2c265fae5297@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d130f98b2c265fae5297 Fixes: 4edb83bb1041 ("ovl: constant d_ino for non-merge dirs") Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20260128132406.23768-2-amir73il@gmail.com Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12overlayfs: add setlease file operationJeff Layton
Add the setlease file_operation to ovl_file_operations and ovl_dir_operations, pointing to generic_setlease. A future patch will change the default behavior to reject lease attempts with -EINVAL when there is no setlease file operation defined. Add generic_setlease to retain the ability to set leases on this filesystem. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260108-setlease-6-20-v1-17-ea4dec9b67fa@kernel.org Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19ovl: port ovl_check_empty_dir() to cred guardChristian Brauner
Use the scoped ovl cred guard. Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-27-b31603935724@kernel.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19ovl: port ovl_dir_llseek() to cred guardChristian Brauner
Use the scoped ovl cred guard. Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-26-b31603935724@kernel.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19ovl: refactor ovl_iterate() and port to cred guardChristian Brauner
factor out ovl_iterate_merged() and move some code into ovl_iterate_real() for easier use of the scoped ovl cred guard. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-25-b31603935724@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19ovl: don't override credentials for ovl_check_whiteouts()Christian Brauner
The function is only called when rdd->dentry is non-NULL: if (!err && rdd->first_maybe_whiteout && rdd->dentry) err = ovl_check_whiteouts(realpath, rdd); | Caller | Sets rdd->dentry? | Can call ovl_check_whiteouts()? | |-------------------------------|-------------------|---------------------------------| | ovl_dir_read_merged() | ✓ Yes (line 430) | ✓ YES | | ovl_dir_read_impure() | ✗ No | ✗ NO | | ovl_check_d_type_supported() | ✗ No | ✗ NO | | ovl_workdir_cleanup_recurse() | ✗ No | ✗ NO | | ovl_indexdir_cleanup() | ✗ No | ✗ NO | VFS layer (.iterate_shared file operation) → ovl_iterate() [CRED OVERRIDE] → ovl_cache_get() → ovl_dir_read_merged() → ovl_dir_read() → ovl_check_whiteouts() [CRED REVERT] ovl_unlink() → ovl_do_remove() → ovl_check_empty_dir() [CRED OVERRIDE] → ovl_dir_read_merged() → ovl_dir_read() → ovl_check_whiteouts() [CRED REVERT] ovl_rename() → ovl_check_empty_dir() [CRED OVERRIDE] → ovl_dir_read_merged() → ovl_dir_read() → ovl_check_whiteouts() [CRED REVERT] All valid callchains already override credentials so drop the override. Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-24-b31603935724@kernel.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14VFS: introduce start_removing_dentry()NeilBrown
start_removing_dentry() is similar to start_removing() but instead of providing a name for lookup, the target dentry is given. start_removing_dentry() checks that the dentry is still hashed and in the parent, and if so it locks and increases the refcount so that end_removing() can be used to finish the operation. This is used in cachefiles, overlayfs, smb/server, and apparmor. There will be other users including ecryptfs. As start_removing_dentry() takes an extra reference to the dentry (to be put by end_removing()), there is no need to explicitly take an extra reference to stop d_delete() from using dentry_unlink_inode() to negate the dentry - as in cachefiles_delete_object(), and ksmbd_vfs_unlink(). cachefiles_bury_object() now gets an extra ref to the victim, which is drops. As it includes the needed end_removing() calls, the caller doesn't need them. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> Link: https://patch.msgid.link/20251113002050.676694-9-neilb@ownmail.net Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-03Merge tag 'ovl-update-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs updates from Amir Goldstein: - Work by André Almeida to support case-insensitive overlayfs Underlying case-insensitive filesystems casefolding is per directory, but for overlayfs it is all-or-nothing. It supports layers where all directories are casefolded (with same encoding) or layers where no directories are casefolded. - A fix for a "bug" in Neil's ovl directory lock changes, which only manifested itself with casefold enabled layers which may return an unhashed negative dentry from lookup. * tag 'ovl-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: make sure that ovl_create_real() returns a hashed dentry ovl: Support mounting case-insensitive enabled layers ovl: Check for casefold consistency when creating new dentries ovl: Add S_CASEFOLD as part of the inode flag to be copied ovl: Set case-insensitive dentry operations for ovl sb ovl: Ensure that all layers have the same encoding ovl: Create ovl_casefold() to support casefolded strncmp() ovl: Prepare for mounting case-insensitive enabled layers fs: Create sb_same_encoding() helper fs: Create sb_encoding() helper
2025-09-23VFS/ovl: add lookup_one_positive_killable()NeilBrown
ovl wants a lookup which won't block on a fatal signal. It currently uses down_write_killable() and then repeatedly calls to lookup_one() The lock may not be needed if the name is already in the dcache and it aids proposed future changes if the locking is kept internal to namei.c So this patch adds lookup_one_positive_killable() which is like lookup_one_positive() but will abort in the face of a fatal signal. overlayfs is changed to use this. Note that instead of always getting an exclusive lock, ovl now only gets a shared lock, and only sometimes. The exclusive lock was never needed. However down_read_killable() was only added in v4.15 but overlayfs started using down_write_killable() here in v4.7. Note that the linked list ->first_maybe_whiteout ->next_maybe_white is local to the thread so there is no concurrency in that list which could be threatened by removing the locking. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-23ovl: Create ovl_casefold() to support casefolded strncmp()André Almeida
To add overlayfs support casefold layers, create a new function ovl_casefold(), to be able to do case-insensitive strncmp(). ovl_casefold() allocates a new buffer and stores the casefolded version of the string on it. If the allocation or the casefold operation fails, fallback to use the original string. The case-insentive name is then used in the rb-tree search/insertion operation. If the name is found in the rb-tree, the name can be discarded and the buffer is freed. If the name isn't found, it's then stored at struct ovl_cache_entry to be used later. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2025-07-18ovl: rename ovl_cleanup_unlocked() to ovl_cleanup()NeilBrown
The only remaining user of ovl_cleanup() is ovl_cleanup_locked(), so we no longer need both. This patch renames ovl_cleanup() to ovl_cleanup_locked() and makes it static. ovl_cleanup_unlocked() is renamed to ovl_cleanup(). Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-22-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: change ovl_cleanup_and_whiteout() to take rename lock as neededNeilBrown
Rather than locking the directory(s) before calling ovl_cleanup_and_whiteout(), change it (and ovl_whiteout()) to do the locking, so the locking can be fine grained as will be needed for proposed locking changes. Sometimes this is called to whiteout something in the index dir, in which case only that dir must be locked. In one case it is called on something in an upperdir, so two directories must be locked. We use ovl_lock_rename_workdir() for this and remove the restriction that upperdir cannot be indexdir - because now sometimes it is. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-18-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: change ovl_workdir_cleanup() to take dir lock as needed.NeilBrown
Rather than calling ovl_workdir_cleanup() with the dir already locked, change it to take the dir lock only when needed. Also change ovl_workdir_cleanup() to take a dentry for the parent rather than an inode. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-16-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_workdir_cleanup_recurse()NeilBrown
Only take the dir lock when needed, rather than for the whole loop. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-15-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_indexdir_cleanup()NeilBrown
Instead of taking the directory lock for the whole cleanup, only take it when needed. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-14-neil@brown.name Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-18ovl: narrow locking in ovl_cleanup_whiteouts()NeilBrown
Rather than lock the directory for the whole operation, use ovl_lookup_upper_unlocked() and ovl_cleanup_unlocked() to take the lock only when needed. This makes way for future changes where locks are taken on individual dentries rather than the whole directory. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250716004725.1206467-11-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16VFS: change old_dir and new_dir in struct renamedata to dentrysNeilBrown
all users of 'struct renamedata' have the dentry for the old and new directories, and often have no use for the inode except to store it in the renamedata. This patch changes struct renamedata to hold the dentry, rather than the inode, for the old and new directories, and changes callers to match. The names are also changed from a _dir suffix to _parent. This is consistent with other usage in namei.c and elsewhere. This results in the removal of several local variables and several dereferences of ->d_inode at the cost of adding ->d_inode dereferences to vfs_rename(). Acked-by: Miklos Szeredi <miklos@szeredi.hu> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/174977089072.608730.4244531834577097454@noble.neil.brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-06Merge tag 'ovl-update-v2-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs update from Miklos Szeredi: - Fix a regression in getting the path of an open file (e.g. in /proc/PID/maps) for a nested overlayfs setup (André Almeida) - Support data-only layers and verity in a user namespace (unprivileged composefs use case) - Fix a gcc warning (Kees) - Cleanups * tag 'ovl-update-v2-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: Annotate struct ovl_entry with __counted_by() ovl: Replace offsetof() with struct_size() in ovl_stack_free() ovl: Replace offsetof() with struct_size() in ovl_cache_entry_new() ovl: Check for NULL d_inode() in ovl_dentry_upper() ovl: Use str_on_off() helper in ovl_show_options() ovl: don't require "metacopy=on" for "verity" ovl: relax redirect/metacopy requirements for lower -> data redirect ovl: make redirect/metacopy rejection consistent ovl: Fix nested backing file paths
2025-05-26Merge tag 'vfs-6.16-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Use folios for symlinks in the page cache FUSE already uses folios for its symlinks. Mirror that conversion in the generic code and the NFS code. That lets us get rid of a few folio->page->folio conversions in this path, and some of the few remaining users of read_cache_page() / read_mapping_page() - Try and make a few filesystem operations killable on the VFS inode->i_mutex level - Add sysctl vfs_cache_pressure_denom for bulk file operations Some workloads need to preserve more dentries than we currently allow through out sysctl interface A HDFS servers with 12 HDDs per server, on a HDFS datanode startup involves scanning all files and caching their metadata (including dentries and inodes) in memory. Each HDD contains approximately 2 million files, resulting in a total of ~20 million cached dentries after initialization To minimize dentry reclamation, they set vfs_cache_pressure to 1. Despite this configuration, memory pressure conditions can still trigger reclamation of up to 50% of cached dentries, reducing the cache from 20 million to approximately 10 million entries. During the subsequent cache rebuild period, any HDFS datanode restart operation incurs substantial latency penalties until full cache recovery completes To maintain service stability, more dentries need to be preserved during memory reclamation. The current minimum reclaim ratio (1/100 of total dentries) remains too aggressive for such workload. This patch introduces vfs_cache_pressure_denom for more granular cache pressure control The configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000] effectively maintains the full 20 million dentry cache under memory pressure, preventing datanode restart performance degradation - Avoid some jumps in inode_permission() using likely()/unlikely() - Avid a memory access which is most likely a cache miss when descending into devcgroup_inode_permission() - Add fastpath predicts for stat() and fdput() - Anonymous inodes currently don't come with a proper mode causing issues in the kernel when we want to add useful VFS debug assert. Fix that by giving them a proper mode and masking it off when we report it to userspace which relies on them not having any mode - Anonymous inodes currently allow to change inode attributes because the VFS falls back to simple_setattr() if i_op->setattr isn't implemented. This means the ownership and mode for every single user of anon_inode_inode can be changed. Block that as it's either useless or actively harmful. If specific ownership is needed the respective subsystem should allocate anonymous inodes from their own private superblock - Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock - Add proper tests for anonymous inode behavior - Make it easy to detect proper anonymous inodes and to ensure that we can detect them in codepaths such as readahead() Cleanups: - Port pidfs to the new anon_inode_{g,s}etattr() helpers - Try to remove the uselib() system call - Add unlikely branch hint return path for poll - Add unlikely branch hint on return path for core_sys_select - Don't allow signals to interrupt getdents copying for fuse - Provide a size hint to dir_context for during readdir() - Use writeback_iter directly in mpage_writepages - Update compression and mtime descriptions in initramfs documentation - Update main netfs API document - Remove useless plus one in super_cache_scan() - Remove unnecessary NULL-check guards during setns() - Add separate separate {get,put}_cgroup_ns no-op cases Fixes: - Fix typo in root= kernel parameter description - Use KERN_INFO for infof()|info_plog()|infofc() - Correct comments of fs_validate_description() - Mark an unlikely if condition with unlikely() in vfs_parse_monolithic_sep() - Delete macro fsparam_u32hex() - Remove unused and problematic validate_constant_table() - Fix potential unsigned integer underflow in fs_name() - Make file-nr output the total allocated file handles" * tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits) fs: Pass a folio to page_put_link() nfs: Use a folio in nfs_get_link() fs: Convert __page_get_link() to use a folio fs/read_write: make default_llseek() killable fs/open: make do_truncate() killable fs/open: make chmod_common() and chown_common() killable include/linux/fs.h: add inode_lock_killable() readdir: supply dir_context.count as readdir buffer size hint vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations fuse: don't allow signals to interrupt getdents copying Documentation: fix typo in root= kernel parameter description include/cgroup: separate {get,put}_cgroup_ns no-op case kernel/nsproxy: remove unnecessary guards fs: use writeback_iter directly in mpage_writepages fs: remove useless plus one in super_cache_scan() fs: add S_ANON_INODE fs: remove uselib() system call device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission() fs/fs_parse: Remove unused and problematic validate_constant_table() fs: touch up predicts in inode_permission() ...
2025-05-15readdir: supply dir_context.count as readdir buffer size hintMiklos Szeredi
This is a preparation for large readdir buffers in fuse. Simply setting the fuse buffer size to the userspace buffer size should work, the record sizes are similar (fuse's is slightly larger than libc's, so no overflow should ever happen). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jaco Kroon <jaco@uls.co.za> Link: https://lore.kernel.org/20250513151012.1476536-1-mszeredi@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-05ovl: Replace offsetof() with struct_size() in ovl_cache_entry_new()Thorsten Blum
Compared to offsetof(), struct_size() provides additional compile-time checks for structs with flexible arrays (e.g., __must_be_array()). No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-04-07VFS: improve interface for lookup_one functionsNeilBrown
The family of functions: lookup_one() lookup_one_unlocked() lookup_one_positive_unlocked() appear designed to be used by external clients of the filesystem rather than by filesystems acting on themselves as the lookup_one_len family are used. They are used by: btrfs/ioctl - which is a user-space interface rather than an internal activity exportfs - i.e. from nfsd or the open_by_handle_at interface overlayfs - at access the underlying filesystems smb/server - for file service They should be used by nfsd (more than just the exportfs path) and cachefs but aren't. It would help if the documentation didn't claim they should "not be called by generic code". Also the path component name is passed as "name" and "len" which are (confusingly?) separate by the "base". In some cases the len in simply "strlen" and so passing a qstr using QSTR() would make the calling clearer. Other callers do pass separate name and len which are stored in a struct. Sometimes these are already stored in a qstr, other times it easily could be. So this patch changes these three functions to receive a 'struct qstr *', and improves the documentation. QSTR_LEN() is added to make it easy to pass a QSTR containing a known len. [brauner@kernel.org: take a struct qstr pointer] Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-11ovl: use wrapper ovl_revert_creds()Vinicius Costa Gomes
Introduce ovl_revert_creds() wrapper of revert_creds() to match callers of ovl_override_creds(). Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-01-23ovl: mark xwhiteouts directory with overlay.opaque='x'Amir Goldstein
An opaque directory cannot have xwhiteouts, so instead of marking an xwhiteouts directory with a new xattr, overload overlay.opaque xattr for marking both opaque dir ('y') and xwhiteouts dir ('x'). This is more efficient as the overlay.opaque xattr is checked during lookup of directory anyway. This also prevents unnecessary checking the xattr when reading a directory without xwhiteouts, i.e. most of the time. Note that the xwhiteouts marker is not checked on the upper layer and on the last layer in lowerstack, where xwhiteouts are not expected. Fixes: bc8df7a3dc03 ("ovl: Add an alternative type of whiteout") Cc: <stable@vger.kernel.org> # v6.7 Reviewed-by: Alexander Larsson <alexl@redhat.com> Tested-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-11-20ovl: remove redundant ofs->indexdir memberAmir Goldstein
When the index feature is disabled, ofs->indexdir is NULL. When the index feature is enabled, ofs->indexdir has the same value as ofs->workdir and takes an extra reference. This makes the code harder to understand when it is not always clear that ofs->indexdir in one function is the same dentry as ofs->workdir in another function. Remove this redundancy, by referencing ofs->workdir directly in index helpers and by using the ovl_indexdir() accessor in generic code. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-31ovl: Add an alternative type of whiteoutAlexander Larsson
An xattr whiteout (called "xwhiteout" in the code) is a reguar file of zero size with the "overlay.whiteout" xattr set. A file like this in a directory with the "overlay.whiteouts" xattrs set will be treated the same way as a regular whiteout. The "overlay.whiteouts" directory xattr is used in order to efficiently handle overlay checks in readdir(), as we only need to checks xattrs in affected directories. The advantage of this kind of whiteout is that they can be escaped using the standard overlay xattr escaping mechanism. So, a file with a "overlay.overlay.whiteout" xattr would be unescaped to "overlay.whiteout", which could then be consumed by another overlayfs as a whiteout. Overlayfs itself doesn't create whiteouts like this, but a userspace mechanism could use this alternative mechanism to convert images that may contain whiteouts to be used with overlayfs. To work as a whiteout for both regular overlayfs mounts as well as userxattr mounts both the "user.overlay.whiteout*" and the "trusted.overlay.whiteout*" xattrs will need to be created. Signed-off-by: Alexander Larsson <alexl@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-08-06vfs: get rid of old '->iterate' directory operationLinus Torvalds
All users now just use '->iterate_shared()', which only takes the directory inode lock for reading. Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode. This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era. The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf. Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19ovl: pass ovl_fs to xino helpersAmir Goldstein
Internal ovl methods should use ovl_fs and not sb as much as possible. Use a constant_table to translate from enum xino mode to string in preperation for new mount api option parsing. Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-01-19fs: port ->permission() to pass mnt_idmapChristian Brauner
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-12-08ovl: use inode instead of dentry where possibleMiklos Szeredi
Passing dentry to some helpers is unnecessary. Simplify these cases. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-12-08ovl: use plain list filler in indexdir and workdir cleanupAmir Goldstein
Those two cleanup routines are using the helper ovl_dir_read() with the merge dir filler, which populates an rb tree, that is never used. The index dir entry names all have a long (42 bytes) constant prefix, so it is not surprising that perf top has demostrated high CPU usage by rb tree population during cleanup of a large index dir: - 9.53% ovl_fill_merge - 78.41% ovl_cache_entry_find_link.constprop.27 + 72.11% strncmp Use the plain list filler that does not populate the unneeded rb tree. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-10-06Merge tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs constification updates from Al Viro: "whack-a-mole: constifying struct path *" * tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ecryptfs: constify path spufs: constify path nd_jump_link(): constify path audit_init_parent(): constify path __io_setxattr(): constify path do_proc_readlink(): constify path overlayfs: constify path fs/notify: constify path may_linkat(): constify path do_sys_name_to_handle(): constify path ->getprocattr(): attribute name is const char *, TYVM...
2022-09-01overlayfs: constify pathAl Viro
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-17Change calling conventions for filldir_tAl Viro
filldir_t instances (directory iterators callbacks) used to return 0 for "OK, keep going" or -E... for "stop". Note that it's *NOT* how the error values are reported - the rules for those are callback-dependent and ->iterate{,_shared}() instances only care about zero vs. non-zero (look at emit_dir() and friends). So let's just return bool ("should we keep going?") - it's less confusing that way. The choice between "true means keep going" and "true means stop" is bikesheddable; we have two groups of callbacks - do something for everything in directory, until we run into problem and find an entry in directory and do something to it. The former tended to use 0/-E... conventions - -E<something> on failure. The latter tended to use 0/1, 1 being "stop, we are done". The callers treated anything non-zero as "stop", ignoring which non-zero value did they get. "true means stop" would be more natural for the second group; "true means keep going" - for the first one. I tried both variants and the things like if allocation failed something = -ENOMEM; return true; just looked unnatural and asking for trouble. [folded suggestion from Matthew Wilcox <willy@infradead.org>] Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-28ovl: handle idmappings for layer lookupChristian Brauner
Make the two places where lookup helpers can be called either on lower or upper layers take the mount's idmapping into account. To this end we pass down the mount in struct ovl_lookup_data. It can later also be used to construct struct path for various other helpers. This is needed to support idmapped base layers with overlay. Cc: <linux-unionfs@vger.kernel.org> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-04-28ovl: use ovl_lookup_upper() wrapperChristian Brauner
Introduce ovl_lookup_upper() as a simple wrapper around lookup_one(). Make it clear in the helper's name that this only operates on the upper layer. The wrapper will take upper layer's idmapping into account when checking permission in lookup_one(). Cc: <linux-unionfs@vger.kernel.org> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-04-28ovl: pass ofs to creation operationsChristian Brauner
Pass down struct ovl_fs to all creation helpers so we can ultimately retrieve the relevant upper mount and take the mount's idmapping into account when creating new filesystem objects. This is needed to support idmapped base layers with overlay. Cc: <linux-unionfs@vger.kernel.org> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-04-28ovl: use wrappers to all vfs_*xattr() callsAmir Goldstein
Use helpers ovl_*xattr() to access user/trusted.overlay.* xattrs and use helpers ovl_do_*xattr() to access generic xattrs. This is a preparatory patch for using idmapped base layers with overlay. Note that a few of those places called vfs_*xattr() calls directly to reduce the amount of debug output. But as Miklos pointed out since overlayfs has been stable for quite some time the debug output isn't all that relevant anymore and the additional debug in all locations was actually quite helpful when developing this patch series. Cc: <linux-unionfs@vger.kernel.org> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10ovl: skip stale entries in merge dir cache iterationAmir Goldstein
On the first getdents call, ovl_iterate() populates the readdir cache with a list of entries, but for upper entries with origin lower inode, p->ino remains zero. Following getdents calls traverse the readdir cache list and call ovl_cache_update_ino() for entries with zero p->ino to lookup the entry in the overlay and return d_ino that is consistent with st_ino. If the upper file was unlinked between the first getdents call and the getdents call that lists the file entry, ovl_cache_update_ino() will not find the entry and fall back to setting d_ino to the upper real st_ino, which is inconsistent with how this object was presented to users. Instead of listing a stale entry with inconsistent d_ino, simply skip the stale entry, which is better for users. xfstest overlay/077 is failing without this patch. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/fstests/CAOQ4uxgR_cLnC_vdU5=seP3fwqVkuZM_-WfD6maFTMbMYq=a9w@mail.gmail.com/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-30Merge tag 'ovl-update-5.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: - Fix a regression introduced in 5.2 that resulted in valid overlayfs mounts being rejected with ELOOP (Too many levels of symbolic links) - Fix bugs found by various tools - Miscellaneous improvements and cleanups * tag 'ovl-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: add debug print to ovl_do_getxattr() ovl: invalidate readdir cache on changes to dir with origin ovl: allow upperdir inside lowerdir ovl: show "userxattr" in the mount data ovl: trivial typo fixes in the file inode.c ovl: fix misspellings using codespell tool ovl: do not copy attr several times ovl: remove ovl_map_dev_ino() return value ovl: fix error for ovl_fill_super() ovl: fix missing revert_creds() on error path ovl: fix leaked dentry ovl: restrict lower null uuid for "xino=auto" ovl: check that upperdir path is not on a read-only mount ovl: plumb through flush method
2021-04-12ovl: remove unneeded ioctlsMiklos Szeredi
The FS_IOC_[GS]ETFLAGS/FS_IOC_FS[GS]ETXATTR ioctls are now handled via the fileattr api. The only unconverted filesystem remaining is CIFS and it is not allowed to be overlayed due to case insensitive filenames. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-12ovl: invalidate readdir cache on changes to dir with originAmir Goldstein
The test in ovl_dentry_version_inc() was out-dated and did not include the case where readdir cache is used on a non-merge dir that has origin xattr, indicating that it may contain leftover whiteouts. To make the code more robust, use the same helper ovl_dir_is_real() to determine if readdir cache should be used and if readdir cache should be invalidated. Fixes: b79e05aaa166 ("ovl: no direct iteration for dir with origin xattr") Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxht70nODhNHNwGFMSqDyOKLXOKrY0H6g849os4BQ7cokA@mail.gmail.com/ Cc: Chris Murphy <lists@colorremedies.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-28ovl: implement volatile-specific fsync error behaviourSargun Dhillon
Overlayfs's volatile option allows the user to bypass all forced sync calls to the upperdir filesystem. This comes at the cost of safety. We can never ensure that the user's data is intact, but we can make a best effort to expose whether or not the data is likely to be in a bad state. The best way to handle this in the time being is that if an overlayfs's upperdir experiences an error after a volatile mount occurs, that error will be returned on fsync, fdatasync, sync, and syncfs. This is contradictory to the traditional behaviour of VFS which fails the call once, and only raises an error if a subsequent fsync error has occurred, and been raised by the filesystem. One awkward aspect of the patch is that we have to manually set the superblock's errseq_t after the sync_fs callback as opposed to just returning an error from syncfs. This is because the call chain looks something like this: sys_syncfs -> sync_filesystem -> __sync_filesystem -> /* The return value is ignored here sb->s_op->sync_fs(sb) _sync_blockdev /* Where the VFS fetches the error to raise to userspace */ errseq_check_and_advance Because of this we call errseq_set every time the sync_fs callback occurs. Due to the nature of this seen / unseen dichotomy, if the upperdir is an inconsistent state at the initial mount time, overlayfs will refuse to mount, as overlayfs cannot get a snapshot of the upperdir's errseq that will increment on error until the user calls syncfs. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Fixes: c86243b090bc ("ovl: provide a mount option "volatile"") Cc: stable@vger.kernel.org Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-28ovl: avoid deadlock on directory ioctlMiklos Szeredi
The function ovl_dir_real_file() currently uses the inode lock to serialize writes to the od->upperfile field. However, this function will get called by ovl_ioctl_set_flags(), which utilizes the inode lock too. In this case ovl_dir_real_file() will try to claim a lock that is owned by a function in its call stack, which won't get released before ovl_dir_real_file() returns. Fix by replacing the open coded compare and exchange by an explicit atomic op. Fixes: 61536bed2149 ("ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories") Cc: stable@vger.kernel.org # v5.10 Reported-by: Icenowy Zheng <icenowy@aosc.io> Tested-by: Icenowy Zheng <icenowy@aosc.io> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>