summaryrefslogtreecommitdiff
path: root/fs/super.c
AgeCommit message (Collapse)Author
2020-01-12fs: call fsnotify_sb_delete after evict_inodesEric Sandeen
[ Upstream commit 1edc8eb2e93130e36ac74ac9c80913815a57d413 ] When a filesystem is unmounted, we currently call fsnotify_sb_delete() before evict_inodes(), which means that fsnotify_unmount_inodes() must iterate over all inodes on the superblock looking for any inodes with watches. This is inefficient and can lead to livelocks as it iterates over many unwatched inodes. At this point, SB_ACTIVE is gone and dropping refcount to zero kicks the inode out out immediately, so anything processed by fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop. After that, the call to evict_inodes will evict everything else with a zero refcount. This should speed things up overall, and avoid livelocks in fsnotify_unmount_inodes(). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-10Merge branch 'work.mount3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount fixes from Al Viro: "A couple of regressions from the mount series" * 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: add missing blkdev_put() in get_tree_bdev() shmem: fix LSM options parsing
2019-10-09vfs: add missing blkdev_put() in get_tree_bdev()Ian Kent
Is there are a couple of missing blkdev_put() in get_tree_bdev()? Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-25Merge tag 'fuse-update-5.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Continue separating the transport (user/kernel communication) and the filesystem layers of fuse. Getting rid of most layering violations will allow for easier cleanup and optimization later on. - Prepare for the addition of the virtio-fs filesystem. The actual filesystem will be introduced by a separate pull request. - Convert to new mount API. - Various fixes, optimizations and cleanups. * tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (55 commits) fuse: Make fuse_args_to_req static fuse: fix memleak in cuse_channel_open fuse: fix beyond-end-of-page access in fuse_parse_cache() fuse: unexport fuse_put_request fuse: kmemcg account fs data fuse: on 64-bit store time in d_fsdata directly fuse: fix missing unlock_page in fuse_writepage() fuse: reserve byteswapped init opcodes fuse: allow skipping control interface and forced unmount fuse: dissociate DESTROY from fuseblk fuse: delete dentry if timeout is zero fuse: separate fuse device allocation and installation in fuse_conn fuse: add fuse_iqueue_ops callbacks fuse: extract fuse_fill_super_common() fuse: export fuse_dequeue_forget() function fuse: export fuse_get_unique() fuse: export fuse_send_init_request() fuse: export fuse_len_args() fuse: export fuse_end_request() fuse: fix request limit ...
2019-09-19Merge branch 'work.mount2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc mount API conversions from Al Viro: "Conversions to new API for shmem and friends and for mount_mtd()-using filesystems. As for the rest of the mount API conversions in -next, some of them belong in the individual trees (e.g. binderfs one should definitely go through android folks, after getting redone on top of their changes). I'm going to drop those and send the rest (trivial ones + stuff ACKed by maintainers) in a separate series - by that point they are independent from each other. Some stuff has already migrated into individual trees (NFS conversion, for example, or FUSE stuff, etc.); those presumably will go through the regular merges from corresponding trees." * 'work.mount2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Make fs_parse() handle fs_param_is_fd-type params better vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API shmem_parse_one(): switch to use of fs_parse() shmem_parse_options(): take handling a single option into a helper shmem_parse_options(): don't bother with mpol in separate variable shmem_parse_options(): use a separate structure to keep the results make shmem_fill_super() static make ramfs_fill_super() static devtmpfs: don't mix {ramfs,shmem}_fill_super() with mount_single() vfs: Convert squashfs to use the new mount API mtd: Kill mount_mtd() vfs: Convert jffs2 to use the new mount API vfs: Convert cramfs to use the new mount API vfs: Convert romfs to use the new mount API vfs: Add a single-or-reconfig keying to vfs_get_super()
2019-09-19Merge tag 'y2038-vfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground Pull y2038 vfs updates from Arnd Bergmann: "Add inode timestamp clamping. This series from Deepa Dinamani adds a per-superblock minimum/maximum timestamp limit for a file system, and clamps timestamps as they are written, to avoid random behavior from integer overflow as well as having different time stamps on disk vs in memory. At mount time, a warning is now printed for any file system that can represent current timestamps but not future timestamps more than 30 years into the future, similar to the arbitrary 30 year limit that was added to settimeofday(). This was picked as a compromise to warn users to migrate to other file systems (e.g. ext4 instead of ext3) when they need the file system to survive beyond 2038 (or similar limits in other file systems), but not get in the way of normal usage" * tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground: ext4: Reduce ext4 timestamp warnings isofs: Initialize filesystem timestamp ranges pstore: fs superblock limits fs: omfs: Initialize filesystem timestamp ranges fs: hpfs: Initialize filesystem timestamp ranges fs: ceph: Initialize filesystem timestamp ranges fs: sysv: Initialize filesystem timestamp ranges fs: affs: Initialize filesystem timestamp ranges fs: fat: Initialize filesystem timestamp ranges fs: cifs: Initialize filesystem timestamp ranges fs: nfs: Initialize filesystem timestamp ranges ext4: Initialize timestamps limits 9p: Fill min and max timestamps in sb fs: Fill in max and min timestamps in superblock utimes: Clamp the timestamps before update mount: Add mount warning for impending timestamp expiry timestamp_truncate: Replace users of timespec64_trunc vfs: Add timestamp_truncate() api vfs: Add file timestamp range support
2019-09-18Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds
Pull fscrypt updates from Eric Biggers: "This is a large update to fs/crypto/ which includes: - Add ioctls that add/remove encryption keys to/from a filesystem-level keyring. These fix user-reported issues where e.g. an encrypted home directory can break NetworkManager, sshd, Docker, etc. because they don't get access to the needed keyring. These ioctls also provide a way to lock encrypted directories that doesn't use the vm.drop_caches sysctl, so is faster, more reliable, and doesn't always need root. - Add a new encryption policy version ("v2") which switches to a more standard, secure, and flexible key derivation function, and starts verifying that the correct key was supplied before using it. The key derivation improvement is needed for its own sake as well as for ongoing feature work for which the current way is too inflexible. Work is in progress to update both Android and the 'fscrypt' userspace tool to use both these features. (Working patches are available and just need to be reviewed+merged.) Chrome OS will likely use them too. This has also been tested on ext4, f2fs, and ubifs with xfstests -- both the existing encryption tests, and the new tests for this. This has also been in linux-next since Aug 16 with no reported issues. I'm also using an fscrypt v2-encrypted home directory on my personal desktop" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: (27 commits) ext4 crypto: fix to check feature status before get policy fscrypt: document the new ioctls and policy version ubifs: wire up new fscrypt ioctls f2fs: wire up new fscrypt ioctls ext4: wire up new fscrypt ioctls fscrypt: require that key be added when setting a v2 encryption policy fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl fscrypt: allow unprivileged users to add/remove keys for v2 policies fscrypt: v2 encryption policy support fscrypt: add an HKDF-SHA512 implementation fscrypt: add FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl fscrypt: rename keyinfo.c to keysetup.c fscrypt: move v1 policy key setup to keysetup_v1.c fscrypt: refactor key setup code in preparation for v2 policies fscrypt: rename fscrypt_master_key to fscrypt_direct_key fscrypt: add ->ci_inode to fscrypt_info fscrypt: use FSCRYPT_* definitions, not FS_* fscrypt: use FSCRYPT_ prefix for uapi constants ...
2019-09-06vfs: subtype handling moved to fuseDavid Howells
The unused vfs code can be removed. Don't pass empty subtype (same as if ->parse callback isn't called). The bits that are left involve determining whether it's permitted to split the filesystem type string passed in to mount(2). Consequently, this means that we cannot get rid of the FS_HAS_SUBTYPE flag unless we define that a type string with a dot in it always indicates a subtype specification. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-05vfs: Add a single-or-reconfig keying to vfs_get_super()David Howells
Add an additional keying mode to vfs_get_super() to indicate that only a single superblock should exist in the system, and that, if it does, further mounts should invoke reconfiguration upon it. This allows mount_single() to be replaced. [Fix by Eric Biggers folded in] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05vfs: Create fs_context-aware mount_bdev() replacementDavid Howells
Create a function, get_tree_bdev(), that is fs_context-aware and a ->get_tree() counterpart of mount_bdev(). It caches the block device pointer in the fs_context struct so that this information can be passed into sget_fc()'s test and set functions. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jens Axboe <axboe@kernel.dk> cc: linux-block@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05new helper: get_tree_keyed()Al Viro
For vfs_get_keyed_super users. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30vfs: Add file timestamp range supportDeepa Dinamani
Add fields to the superblock to track the min and max timestamps supported by filesystems. Initially, when a superblock is allocated, initialize it to the max and min values the fields can hold. Individual filesystems override these to match their actual limits. Pseudo filesystems are assumed to always support the min and max allowable values for the fields. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Jeff Layton <jlayton@kernel.org>
2019-08-12fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctlEric Biggers
Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an encryption key to the filesystem's fscrypt keyring ->s_master_keys, making any files encrypted with that key appear "unlocked". Why we need this ~~~~~~~~~~~~~~~~ The main problem is that the "locked/unlocked" (ciphertext/plaintext) status of encrypted files is global, but the fscrypt keys are not. fscrypt only looks for keys in the keyring(s) the process accessing the filesystem is subscribed to: the thread keyring, process keyring, and session keyring, where the session keyring may contain the user keyring. Therefore, userspace has to put fscrypt keys in the keyrings for individual users or sessions. But this means that when a process with a different keyring tries to access encrypted files, whether they appear "unlocked" or not is nondeterministic. This is because it depends on whether the files are currently present in the inode cache. Fixing this by consistently providing each process its own view of the filesystem depending on whether it has the key or not isn't feasible due to how the VFS caches work. Furthermore, while sometimes users expect this behavior, it is misguided for two reasons. First, it would be an OS-level access control mechanism largely redundant with existing access control mechanisms such as UNIX file permissions, ACLs, LSMs, etc. Encryption is actually for protecting the data at rest. Second, almost all users of fscrypt actually do need the keys to be global. The largest users of fscrypt, Android and Chromium OS, achieve this by having PID 1 create a "session keyring" that is inherited by every process. This works, but it isn't scalable because it prevents session keyrings from being used for any other purpose. On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't similarly abuse the session keyring, so to make 'sudo' work on all systems it has to link all the user keyrings into root's user keyring [2]. This is ugly and raises security concerns. Moreover it can't make the keys available to system services, such as sshd trying to access the user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to read certificates from the user's home directory (see [5]); or to Docker containers (see [6], [7]). By having an API to add a key to the *filesystem* we'll be able to fix the above bugs, remove userspace workarounds, and clearly express the intended semantics: the locked/unlocked status of an encrypted directory is global, and encryption is orthogonal to OS-level access control. Why not use the add_key() syscall ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We use an ioctl for this API rather than the existing add_key() system call because the ioctl gives us the flexibility needed to implement fscrypt-specific semantics that will be introduced in later patches: - Supporting key removal with the semantics such that the secret is removed immediately and any unused inodes using the key are evicted; also, the eviction of any in-use inodes can be retried. - Calculating a key-dependent cryptographic identifier and returning it to userspace. - Allowing keys to be added and removed by non-root users, but only keys for v2 encryption policies; and to prevent denial-of-service attacks, users can only remove keys they themselves have added, and a key is only really removed after all users who added it have removed it. Trying to shoehorn these semantics into the keyrings syscalls would be very difficult, whereas the ioctls make things much easier. However, to reuse code the implementation still uses the keyrings service internally. Thus we get lockless RCU-mode key lookups without having to re-implement it, and the keys automatically show up in /proc/keys for debugging purposes. References: [1] https://github.com/google/fscrypt [2] https://goo.gl/55cCrI#heading=h.vf09isp98isb [3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939 [4] https://github.com/google/fscrypt/issues/116 [5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715 [6] https://github.com/google/fscrypt/issues/128 [7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-07-31Unbreak mount_capable()Al Viro
In "consolidate the capability checks in sget_{fc,userns}())" the wrong argument had been passed to mount_capable() by sget_fc(). That mistake had been further obscured later, when switching mount_capable() to fs_context has moved the calculation of bogus argument from sget_fc() to mount_capable() itself. It should've been fc->user_ns all along. Screwed-up-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: Christian Brauner <christian@brauner.io> Tested-by: Christian Brauner <christian@brauner.io> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04convenience helper: get_tree_single()Al Viro
counterpart of mount_single(); switch fusectl to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04convenience helper get_tree_nodev()Al Viro
counterpart of mount_nodev(). Switch hugetlb and pseudo to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25vfs: Kill sget_userns()David Howells
Kill sget_userns(), folding it into sget() as that's the only remaining user. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-fsdevel@vger.kernel.org
2019-05-25vfs: Provide sb->s_iflags settings in fs_context structDavid Howells
Provide a field in the fs_context struct through which bits in the sb->s_iflags superblock field can be set. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-fsdevel@vger.kernel.org
2019-05-25move mount_capable() further outAl Viro
Call graph of vfs_get_tree(): vfs_fsconfig_locked() # neither kernmount, nor submount do_new_mount() # neither kernmount, nor submount fc_mount() afs_mntpt_do_automount() # submount mount_one_hugetlbfs() # kernmount pid_ns_prepare_proc() # kernmount mq_create_mount() # kernmount vfs_kern_mount() simple_pin_fs() # kernmount vfs_submount() # submount kern_mount() # kernmount init_mount_tree() btrfs_mount() nfs_do_root_mount() The first two need the check (unconditionally). init_mount_tree() is setting rootfs up; any capability checks make zero sense for that one. And btrfs_mount()/ nfs_do_root_mount() have the checks already done in their callers. IOW, we can shift mount_capable() handling into the two callers - one in the normal case of mount(2), another - in fsconfig(2) handling of FSCONFIG_CMD_CREATE. I.e. the syscalls that set a new filesystem up. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25move mount_capable() calls to vfs_get_tree()Al Viro
sget_fc() is called only from ->get_tree() instances and the only instance not calling it is legacy_get_tree(), which calls mount_capable() directly. In all sget_fc() callers the checks could be moved to the very beginning of ->get_tree() - ->user_ns is not changed in between. So lifting the checks to the only caller of ->get_tree() is OK. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25switch mount_capable() to fs_contextAl Viro
now both callers of mount_capable() have access to fs_context; the only difference is that for sget_fc() we have the possibility of fc->global being true, while for legacy_get_tree() it's guaranteed to be impossible. Unify to more generic variant... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25move the capability checks from sget_userns() to legacy_get_tree()Al Viro
1) all call chains leading to sget_userns() pass through ->mount() instances. 2) none of ->mount() instances is ever called directly - the only call site is legacy_get_tree() 3) all remaining ->mount() instances end up calling sget_userns() IOW, we might as well do the capability checks just before calling ->mount(). As for the arguments passed to mount_capable(), in case of call chains to sget_userns() going through sget(), we either don't call mount_capable() at all, or pass current_user_ns() to it. The call chains going through mount_pseudo_xattr() don't call mount_capable() at all (SB_KERNMOUNT in flags on those). That could've been split into smaller steps (lifting the checks into sget(), then callers of sget(), then all the way to the entries of every ->mount() out there, then to the sole caller), but that would be too much churn for little benefit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25vfs: Kill mount_ns()David Howells
Kill mount_ns() as it has been replaced by vfs_get_super() in the new mount API. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25consolidate the capability checks in sget_{fc,userns}()Al Viro
... into a common helper - mount_capable(type, userns) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25start massaging the checks in sget_...(): move to sget_userns()Al Viro
there are 3 remaining callers of sget_userns() - sget(), mount_ns() and mount_pseudo_xattr(). Extra check in sget() is conditional upon mount being neither KERNMOUNT nor SUBMOUNT, the identical one in mount_ns() - upon being not KERNMOUNT; mount_pseudo_xattr() has no such checks at all. However, mount_ns() is never used with SUBMOUNT and mount_pseudo_xattr() is used only for KERNMOUNT, so both would be fine with the same logics as currently done in sget(), allowing to consolidate the entire thing in sget_userns() itself. That's not where these checks will end up in the long run, though - the whole reason why they'd been done so deep in the bowels of mount(2) was that there had been no way for a filesystem to specify which userns to look at until it has entered ->mount(). Now there is a place where filesystem could override the defaults - ->init_fs_context(). Which allows to pull the checks out into the callers of vfs_get_tree(). That'll take quite a bit of massage, but that mess is possible to tease apart. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-28[fix] get rid of checking for absent device name in vfs_get_tree()Al Viro
It has no business being there, it's checked by relevant ->get_tree() as it is *and* it returns the wrong error for no reason whatsoever. Fixes: f3a09c92018a "introduce fs_context methods" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28vfs: Add some logging to the core users of the fs_context logDavid Howells
Add some logging to the core users of the fs_context log so that information can be extracted from them as to the reason for failure. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28convenience helpers: vfs_get_super() and sget_fc()Al Viro
the former is an analogue of mount_{single,nodev} for use in ->get_tree() instances, the latter - analogue of sget() for the same. These are fairly similar to the originals, but the callback signature for sget_fc() is different from sget() ones, so getting bits and pieces shared would be too convoluted; we might get around to that later, but for now let's just remember to keep them in sync. They do live next to each other, and changes in either won't be hard to spot. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30introduce fs_context methodsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30convert do_remount_sb() to fs_contextDavid Howells
Replace do_remount_sb() with a function, reconfigure_super(), that's fs_context aware. The fs_context is expected to be parameterised already and have ->root pointing to the superblock to be reconfigured. A legacy wrapper is provided that is intended to be called from the fs_context ops when those appear, but for now is called directly from reconfigure_super(). This wrapper invokes the ->remount_fs() superblock op for the moment. It is intended that the remount_fs() op will be phased out. The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate that the context is being used for reconfiguration. do_umount_root() is provided to consolidate remount-to-R/O for umount and emergency remount by creating a context and invoking reconfiguration. do_remount(), do_umount() and do_emergency_remount_callback() are switched to use the new process. [AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the umount / bug, gets rid of pointless complexity] [AV -- set ->net_ns in all cases; nfs remount will need that] [AV -- shift security_sb_remount() call into reconfigure_super(); the callers that didn't do security_sb_remount() have NULL fc->security anyway, so it's a no-op for them] Signed-off-by: David Howells <dhowells@redhat.com> Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30vfs_get_tree(): evict the call of security_sb_kern_mount()Al Viro
Right now vfs_get_tree() calls security_sb_kern_mount() (i.e. mount MAC) unless it gets MS_KERNMOUNT or MS_SUBMOUNT in flags. Doing it that way is both clumsy and imprecise. Consider the callers' tree of vfs_get_tree(): vfs_get_tree() <- do_new_mount() <- vfs_kern_mount() <- simple_pin_fs() <- vfs_submount() <- kern_mount_data() <- init_mount_tree() <- btrfs_mount() <- vfs_get_tree() <- nfs_do_root_mount() <- nfs4_try_mount() <- nfs_fs_mount() <- vfs_get_tree() <- nfs4_referral_mount() do_new_mount() always does need MAC (we are guaranteed that neither MS_KERNMOUNT nor MS_SUBMOUNT will be passed there). simple_pin_fs(), vfs_submount() and kern_mount_data() pass explicit flags inhibiting that check. So does nfs4_referral_mount() (the flags there are ulimately coming from vfs_submount()). init_mount_tree() is called too early for anything LSM-related; it doesn't matter whether we attempt those checks, they'll do nothing. Finally, in case of btrfs_mount() and nfs_fs_mount(), doing MAC is pointless - either the caller will do it, or the flags are such that we wouldn't have done it either. In other words, the one and only case when we want that check done is when we are called from do_new_mount(), and there we want it unconditionally. So let's simply move it there. The superblock is still locked, so nobody is going to get access to it (via ustat(2), etc.) until we get a chance to apply the checks - we are free to move them to any point up to where we drop ->s_umount (in do_new_mount_fc()). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30teach vfs_get_tree() to handle subtype, switch do_new_mount() to itAl Viro
Roll the handling of subtypes into do_new_mount() and vfs_get_tree(). The former determines any subtype string and hangs it off the fs_context; the latter applies it. Make do_new_mount() create, parameterise and commit an fs_context and create a mount for itself rather than calling vfs_kern_mount(). [AV -- missing kstrdup()] [AV -- ... and no kstrdup() if we get to setting ->s_submount - we simply transfer it from fc, leaving NULL behind] [AV -- constify ->s_submount, while we are at it] Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30vfs: Introduce fs_context, switch vfs_kern_mount() to it.David Howells
Introduce a filesystem context concept to be used during superblock creation for mount and superblock reconfiguration for remount. This is allocated at the beginning of the mount procedure and into it is placed: (1) Filesystem type. (2) Namespaces. (3) Source/Device names (there may be multiple). (4) Superblock flags (SB_*). (5) Security details. (6) Filesystem-specific data, as set by the mount options. Accessor functions are then provided to set up a context, parameterise it from monolithic mount data (the data page passed to mount(2)) and tear it down again. A legacy wrapper is provided that implements what will be the basic operations, wrapping access to filesystems that aren't yet aware of the fs_context. Finally, vfs_kern_mount() is changed to make use of the fs_context and mount_fs() is replaced by vfs_get_tree(), called from vfs_kern_mount(). [AV -- add missing kstrdup()] [AV -- put_cred() can be unconditional - fc->cred can't be NULL] [AV -- take legacy_validate() contents into legacy_parse_monolithic()] [AV -- merge KERNEL_MOUNT and USER_MOUNT] [AV -- don't unlock superblock on success return from vfs_get_tree()] [AV -- kill 'reference' argument of init_fs_context()] Signed-off-by: David Howells <dhowells@redhat.com> Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21mount_fs: suppress MAC on MS_SUBMOUNT as well as MS_KERNMOUNTAl Viro
Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21LSM: hide struct security_mnt_opts from any generic codeAl Viro
Keep void * instead, allocate on demand (in parse_str_opts, at the moment). Eventually both selinux and smack will be better off with private structures with several strings in those, rather than this "counter and two pointers to dynamically allocated arrays" ugliness. This commit allows to do that at leisure, without disrupting anything outside of given module. Changes: * instead of struct security_mnt_opt use an opaque pointer initialized to NULL. * security_sb_eat_lsm_opts(), security_sb_parse_opts_str() and security_free_mnt_opts() take it as var argument (i.e. as void **); call sites are unchanged. * security_sb_set_mnt_opts() and security_sb_remount() take it by value (i.e. as void *). * new method: ->sb_free_mnt_opts(). Takes void *, does whatever freeing that needs to be done. * ->sb_set_mnt_opts() and ->sb_remount() might get NULL as mnt_opts argument, meaning "empty". Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21LSM: split ->sb_set_mnt_opts() out of ->sb_kern_mount()Al Viro
... leaving the "is it kernel-internal" logics in the caller. Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21new helper: security_sb_eat_lsm_opts()Al Viro
combination of alloc_secdata(), security_sb_copy_data(), security_sb_parse_opt_str() and free_secdata(). Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21LSM: lift parsing LSM options into the caller of ->sb_kern_mount()Al Viro
This paves the way for retaining the LSM options from a common filesystem mount context during a mount parameter parsing phase to be instituted prior to actual mount/reconfiguration actions. Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-20vfs: Suppress MS_* flag defs within the kernel unless explicitly enabledDavid Howells
Only the mount namespace code that implements mount(2) should be using the MS_* flags. Suppress them inside the kernel unless uapi/linux/mount.h is included. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Howells <dhowells@redhat.com>
2018-09-03fsnotify: add super block object typeAmir Goldstein
Add the infrastructure to attach a mark to a super_block struct and detach all attached marks when super block is destroyed. This is going to be used by fanotify backend to setup super block marks. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-26Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds
Pull IDA updates from Matthew Wilcox: "A better IDA API: id = ida_alloc(ida, GFP_xxx); ida_free(ida, id); rather than the cumbersome ida_simple_get(), ida_simple_remove(). The new IDA API is similar to ida_simple_get() but better named. The internal restructuring of the IDA code removes the bitmap preallocation nonsense. I hope the net -200 lines of code is convincing" * 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits) ida: Change ida_get_new_above to return the id ida: Remove old API test_ida: check_ida_destroy and check_ida_alloc test_ida: Convert check_ida_conv to new API test_ida: Move ida_check_max test_ida: Move ida_check_leaf idr-test: Convert ida_check_nomem to new API ida: Start new test_ida module target/iscsi: Allocate session IDs from an IDA iscsi target: fix session creation failure handling drm/vmwgfx: Convert to new IDA API dmaengine: Convert to new IDA API ppc: Convert vas ID allocation to new IDA API media: Convert entity ID allocation to new IDA API ppc: Convert mmu context allocation to new IDA API Convert net_namespace to new IDA API cb710: Convert to new IDA API rsxx: Convert to new IDA API osd: Convert to new IDA API sd: Convert to new IDA API ...
2018-08-21fs: Convert unnamed_dev_ida to new APIMatthew Wilcox
The new API is much easier for this user. Also add kerneldoc for get_anon_bdev(). Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-17mm: add SHRINK_EMPTY shrinker methods return valueKirill Tkhai
We need to distinguish the situations when shrinker has very small amount of objects (see vfs_pressure_ratio() called from super_cache_count()), and when it has no objects at all. Currently, in the both of these cases, shrinker::count_objects() returns 0. The patch introduces new SHRINK_EMPTY return value, which will be used for "no objects at all" case. It's is a refactoring mostly, as SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this patch, and all the magic will happen in further. Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs: propagate shrinker::id to list_lruKirill Tkhai
Add list_lru::shrinker_id field and populate it by registered shrinker id. This will be used to set correct bit in memcg shrinkers map by lru code in next patches, after there appeared the first related to memcg element in list_lru. Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/super.c: refactor alloc_super()Kirill Tkhai
Do two list_lru_init_memcg() calls after prealloc_super(). destroy_unused_super() in fail path is OK with this. Next patch needs such the order. Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-04Merge branch 'work.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Misc bits and pieces not fitting into anything more specific" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: delete unnecessary assignment in vfs_listxattr Documentation: filesystems: update filesystem locking documentation vfs: namei: use path_equal() in follow_dotdot() fs.h: fix outdated comment about file flags __inode_security_revalidate() never gets NULL opt_dentry make xattr_getsecurity() static vfat: simplify checks in vfat_lookup() get rid of dead code in d_find_alias() it's SB_BORN, not MS_BORN... msdos_rmdir(): kill BS comment remove rpc_rmdir() fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()
2018-05-11fs: don't scan the inode cache before SB_BORN is setDave Chinner
We recently had an oops reported on a 4.14 kernel in xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage and so the m_perag_tree lookup walked into lala land. It produces an oops down this path during the failed mount: radix_tree_gang_lookup_tag+0xc4/0x130 xfs_perag_get_tag+0x37/0xf0 xfs_reclaim_inodes_count+0x32/0x40 xfs_fs_nr_cached_objects+0x11/0x20 super_cache_count+0x35/0xc0 shrink_slab.part.66+0xb1/0x370 shrink_node+0x7e/0x1a0 try_to_free_pages+0x199/0x470 __alloc_pages_slowpath+0x3a1/0xd20 __alloc_pages_nodemask+0x1c3/0x200 cache_grow_begin+0x20b/0x2e0 fallback_alloc+0x160/0x200 kmem_cache_alloc+0x111/0x4e0 The problem is that the superblock shrinker is running before the filesystem structures it depends on have been fully set up. i.e. the shrinker is registered in sget(), before ->fill_super() has been called, and the shrinker can call into the filesystem before fill_super() does it's setup work. Essentially we are exposed to both use-after-free and use-before-initialisation bugs here. To fix this, add a check for the SB_BORN flag in super_cache_count. In general, this flag is not set until ->fs_mount() completes successfully, so we know that it is set after the filesystem setup has completed. This matches the trylock_super() behaviour which will not let super_cache_scan() run if SB_BORN is not set, and hence will not allow the superblock shrinker from entering the filesystem while it is being set up or after it has failed setup and is being torn down. Cc: stable@kernel.org Signed-Off-By: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-10it's SB_BORN, not MS_BORN...Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-16mm,vmscan: Allow preallocating memory for register_shrinker().Tetsuo Handa
syzbot is catching so many bugs triggered by commit 9ee332d99e4d5a97 ("sget(): handle failures of register_shrinker()"). That commit expected that calling kill_sb() from deactivate_locked_super() without successful fill_super() is safe, but the reality was different; some callers assign attributes which are needed for kill_sb() after sget() succeeds. For example, [1] is a report where sb->s_mode (which seems to be either FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not assigned unless sget() succeeds. But it does not worth complicate sget() so that register_shrinker() failure path can safely call kill_block_super() via kill_sb(). Making alloc_super() fail if memory allocation for register_shrinker() failed is much simpler. Let's avoid calling deactivate_locked_super() from sget_userns() by preallocating memory for the shrinker and making register_shrinker() in sget_userns() never fail. [1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+5a170e19c963a2e0df79@syzkaller.appspotmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-12Merge branch 'work.thaw' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs thaw updates from Al Viro: "An ancient series that has fallen through the cracks in the previous cycle" * 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: buffer.c: call thaw_super during emergency thaw vfs: factor sb iteration out of do_emergency_remount