summaryrefslogtreecommitdiff
path: root/fs/namei.c
AgeCommit message (Collapse)Author
2014-09-14vfs: avoid non-forwarding large load after small store in path lookupLinus Torvalds
The performance regression that Josef Bacik reported in the pathname lookup (see commit 99d263d4c5b2 "vfs: fix bad hashing of dentries") made me look at performance stability of the dcache code, just to verify that the problem was actually fixed. That turned up a few other problems in this area. There are a few cases where we exit RCU lookup mode and go to the slow serializing case when we shouldn't, Al has fixed those and they'll come in with the next VFS pull. But my performance verification also shows that link_path_walk() turns out to have a very unfortunate 32-bit store of the length and hash of the name we look up, followed by a 64-bit read of the combined hash_len field. That screws up the processor store to load forwarding, causing an unnecessary hickup in this critical routine. It's caused by the ugly calling convention for the "hash_name()" function, and easily fixed by just making hash_name() fill in the whole 'struct qstr' rather than passing it a pointer to just the hash value. With that, the profile for this function looks much smoother. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-13vfs: fix bad hashing of dentriesLinus Torvalds
Josef Bacik found a performance regression between 3.2 and 3.10 and narrowed it down to commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for dcache name comparison and hashing"). He reports: "The test case is essentially for (i = 0; i < 1000000; i++) mkdir("a$i"); On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k dir/sec with 3.10. This is because we spend waaaaay more time in __d_lookup on 3.10 than in 3.2. The new hashing function for strings is suboptimal for < sizeof(unsigned long) string names (and hell even > sizeof(unsigned long) string names that I've tested). I broke out the old hashing function and the new one into a userspace helper to get real numbers and this is what I'm getting: Old hash table had 1000000 entries, 0 dupes, 0 max dupes New hash table had 12628 entries, 987372 dupes, 900 max dupes We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash My test does the hash, and then does the d_hash into a integer pointer array the same size as the dentry hash table on my system, and then just increments the value at the address we got to see how many entries we overlap with. As you can see the old hash function ended up with all 1 million entries in their own bucket, whereas the new one they are only distributed among ~12.5k buckets, which is why we're using so much more CPU in __d_lookup". The reason for this hash regression is two-fold: - On 64-bit architectures the down-mixing of the original 64-bit word-at-a-time hash into the final 32-bit hash value is very simplistic and suboptimal, and just adds the two 32-bit parts together. In particular, because there is no bit shuffling and the mixing boundary is also a byte boundary, similar character patterns in the low and high word easily end up just canceling each other out. - the old byte-at-a-time hash mixed each byte into the final hash as it hashed the path component name, resulting in the low bits of the hash generally being a good source of hash data. That is not true for the word-at-a-time case, and the hash data is distributed among all the bits. The fix is the same in both cases: do a better job of mixing the bits up and using as much of the hash data as possible. We already have the "hash_32|64()" functions to do that. Reported-by: Josef Bacik <jbacik@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-07namei: trivial fix to vfs_rename_dir commentJ. Bruce Fields
Looks like the directory loop check is actually done in renameat? Whatever, leave this out rather than trying to keep it up to date with the code. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.NeilBrown
In REF-walk mode, ->d_manage can return -EISDIR to indicate that the dentry is not really a mount trap (or even a mount point) and that any mounts or any DCACHE_NEED_AUTOMOUNT flag should be ignored. RCU-walk mode doesn't currently support this, so if there is a dentry with DCACHE_NEED_AUTOMOUNT set but which shouldn't be a mount-trap, lookup_fast() will always drop in REF-walk mode. With this patch, an -EISDIR from ->d_manage will always cause mounts and automounts to be ignored, both in REF-walk and RCU-walk. Bug-fixed-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ian Kent <raven@themaw.net> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07fs: call rename2 if existsMiklos Szeredi
Christoph Hellwig suggests: 1) make vfs_rename call ->rename2 if it exists instead of ->rename 2) switch all filesystems that you're adding NOREPLACE support for to use ->rename2 3) see how many ->rename instances we'll have left after a few iterations of 2. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-07-24fs: umount on symlink leaks mnt countVasily Averin
Currently umount on symlink blocks following umount: /vz is separate mount # ls /vz/ -al | grep test drwxr-xr-x. 2 root root 4096 Jul 19 01:14 testdir lrwxrwxrwx. 1 root root 11 Jul 19 01:16 testlink -> /vz/testdir # umount -l /vz/testlink umount: /vz/testlink: not mounted (expected) # lsof /vz # umount /vz umount: /vz: device is busy. (unexpected) In this case mountpoint_last() gets an extra refcount on path->mnt Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Ian Kent <raven@themaw.net> Acked-by: Jeff Layton <jlayton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-06-10fs,userns: Change inode_capable to capable_wrt_inode_uidgidAndy Lutomirski
The kernel has no concept of capabilities with respect to inodes; inodes exist independently of namespaces. For example, inode_capable(inode, CAP_LINUX_IMMUTABLE) would be nonsense. This patch changes inode_capable to check for uid and gid mappings and renames it to capable_wrt_inode_uidgid, which should make it more obvious what it does. Fixes CVE-2014-4014. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Chinner <david@fromorbit.com> Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-19fix races between __d_instantiate() and checks of dentry flagsAl Viro
in non-lazy walk we need to be careful about dentry switching from negative to positive - both ->d_flags and ->d_inode are updated, and in some places we might see only one store. The cases where dentry has been obtained by dcache lookup with ->i_mutex held on parent are safe - ->d_lock and ->i_mutex provide all the barriers we need. However, there are several places where we run into trouble: * do_last() fetches ->d_inode, then checks ->d_flags and assumes that inode won't be NULL unless d_is_negative() is true. Race with e.g. creat() - we might have fetched the old value of ->d_inode (still NULL) and new value of ->d_flags (already not DCACHE_MISS_TYPE). Lin Ming has observed and reported the resulting oops. * a bunch of places checks ->d_inode for being non-NULL, then checks ->d_flags for "is it a symlink". Race with symlink(2) in case if our CPU sees ->d_inode update first - we see non-NULL there, but ->d_flags still contains DCACHE_MISS_TYPE instead of DCACHE_SYMLINK_TYPE. Result: false negative on "should we follow link here?", with subsequent unpleasantness. Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one Reported-and-tested-by: Lin Ming <minggr@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
2014-04-04Merge branch 'locks-3.15' of git://git.samba.org/jlayton/linuxLinus Torvalds
Pull file locking updates from Jeff Layton: "Highlights: - maintainership change for fs/locks.c. Willy's not interested in maintaining it these days, and is OK with Bruce and I taking it. - fix for open vs setlease race that Al ID'ed - cleanup and consolidation of file locking code - eliminate unneeded BUG() call - merge of file-private lock implementation" * 'locks-3.15' of git://git.samba.org/jlayton/linux: locks: make locks_mandatory_area check for file-private locks locks: fix locks_mandatory_locked to respect file-private locks locks: require that flock->l_pid be set to 0 for file-private locks locks: add new fcntl cmd values for handling file private locks locks: skip deadlock detection on FL_FILE_PVT locks locks: pass the cmd value to fcntl_getlk/getlk64 locks: report l_pid as -1 for FL_FILE_PVT locks locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT" locks: rename locks_remove_flock to locks_remove_file locks: consolidate checks for compatible filp->f_mode values in setlk handlers locks: fix posix lock range overflow handling locks: eliminate BUG() call when there's an unexpected lock on file close locks: add __acquires and __releases annotations to locks_start and locks_stop locks: remove "inline" qualifier from fl_link manipulation functions locks: clean up comment typo locks: close potential race between setlease and open MAINTAINERS: update entry for fs/locks.c
2014-04-01new helper: readlink_copy()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01namei.c: move EXPORT_SYMBOL to corresponding definitionsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01get_write_access() is inlined, exporting it is pointlessAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01vfs: add cross-renameMiklos Szeredi
If flags contain RENAME_EXCHANGE then exchange source and destination files. There's no restriction on the type of the files; e.g. a directory can be exchanged with a symlink. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01security: add flags to rename hooksMiklos Szeredi
Add flags to security_path_rename() and security_inode_rename() hooks. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01vfs: add RENAME_NOREPLACE flagMiklos Szeredi
If this flag is specified and the target of the rename exists then the rename syscall fails with EEXIST. The VFS does the existence checking, so it is trivial to enable for most local filesystems. This patch only enables it in ext4. For network filesystems the VFS check is not enough as there may be a race between a remote create and the rename, so these filesystems need to handle this flag in their ->rename() implementations to ensure atomicity. Andy writes about why this is useful: "The trivial answer: to eliminate the race condition from 'mv -i'. Another answer: there's a common pattern to atomically create a file with contents: open a temporary file, write to it, optionally fsync it, close it, then link(2) it to the final name, then unlink the temporary file. The reason to use link(2) is because it won't silently clobber the destination. This is annoying: - It requires an extra system call that shouldn't be necessary. - It doesn't work on (IMO sensible) filesystems that don't support hard links (e.g. vfat). - It's not atomic -- there's an intermediate state where both files exist. - It's ugly. The new rename flag will make this totally sensible. To be fair, on new enough kernels, you can also use O_TMPFILE and linkat to achieve the same thing even more cleanly." Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01vfs: add renameat2 syscallMiklos Szeredi
Add new renameat2 syscall, which is the same as renameat with an added flags argument. Pass flags to vfs_rename() and to i_op->rename() as well. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01vfs: rename: use common code for dir and non-dirMiklos Szeredi
There's actually very little difference between vfs_rename_dir() and vfs_rename_other() so move both inline into vfs_rename() which still stays reasonably readable. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01vfs: rename: move d_move() upMiklos Szeredi
Move the d_move() in vfs_rename_dir() up, similarly to how it's done in vfs_rename_other(). The next patch will consolidate these two functions and this is the only structural difference between them. I'm not sure if doing the d_move() after the dput is even valid. But there may be a logical explanation for that. But moving the d_move() before the dput() (and the mutex_unlock()) should definitely not hurt. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-04-01vfs: add d_is_dir()Miklos Szeredi
Add d_is_dir(dentry) helper which is analogous to S_ISDIR(). To avoid confusion, rename d_is_directory() to d_can_lookup(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
2014-03-31locks: fix locks_mandatory_locked to respect file-private locksJeff Layton
As Trond pointed out, you can currently deadlock yourself by setting a file-private lock on a file that requires mandatory locking and then trying to do I/O on it. Avoid this problem by plumbing some knowledge of file-private locks into the mandatory locking code. In order to do this, we must pass down information about the struct file that's being used to locks_verify_locked. Reported-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: J. Bruce Fields <bfields@redhat.com>
2014-03-23rcuwalk: recheck mount_lock after mountpoint crossing attemptsAl Viro
We can get false negative from __lookup_mnt() if an unrelated vfsmount gets moved. In that case legitimize_mnt() is guaranteed to fail, and we will fall back to non-RCU walk... unless we end up running into a hard error on a filesystem object we wouldn't have reached if not for that false negative. IOW, delaying that check until the end of pathname resolution is wrong - we should recheck right after we attempt to cross the mountpoint. We don't need to recheck unless we see d_mountpoint() being true - in that case even if we have just raced with mount/umount, we can simply go on as if we'd come at the moment when the sucker wasn't a mountpoint; if we run into a hard error as the result, it was a legitimate outcome. __lookup_mnt() returning NULL is different in that respect, since it might've happened due to operation on completely unrelated mountpoint. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10vfs: atomic f_pos accesses as per POSIXLinus Torvalds
Our write() system call has always been atomic in the sense that you get the expected thread-safe contiguous write, but we haven't actually guaranteed that concurrent writes are serialized wrt f_pos accesses, so threads (or processes) that share a file descriptor and use "write()" concurrently would quite likely overwrite each others data. This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says: "2.9.7 Thread Interactions with Regular File Operations All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: [...]" and one of the effects is the file position update. This unprotected file position behavior is not new behavior, and nobody has ever cared. Until now. Yongzhi Pan reported unexpected behavior to Michael Kerrisk that was due to this. This resolves the issue with a f_pos-specific lock that is taken by read/write/lseek on file descriptors that may be shared across threads or processes. Reported-by: Yongzhi Pan <panyongzhi@gmail.com> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-02-05execve: use 'struct filename *' for executable name passingLinus Torvalds
This changes 'do_execve()' to get the executable name as a 'struct filename', and to free it when it is done. This is what the normal users want, and it simplifies and streamlines their error handling. The controlled lifetime of the executable name also fixes a use-after-free problem with the trace_sched_process_exec tracepoint: the lifetime of the passed-in string for kernel users was not at all obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize the pathname allocation lifetime with the execve() having finished, which in turn meant that the trace point that happened after mm_release() of the old process VM ended up using already free'd memory. To solve the kernel string lifetime issue, this simply introduces "getname_kernel()" that works like the normal user-space getname() function, except with the source coming from kernel memory. As Oleg points out, this also means that we could drop the tcomm[] array from 'struct linux_binprm', since the pathname lifetime now covers setup_new_exec(). That would be a separate cleanup. Reported-by: Igor Zhbanov <i.zhbanov@samsung.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-31Fix mountpoint reference leakage in linkatOleg Drokin
Recent changes to retry on ESTALE in linkat (commit 442e31ca5a49e398351b2954b51f578353fdf210) introduced a mountpoint reference leak and a small memory leak in case a filesystem link operation returns ESTALE which is pretty normal for distributed filesystems like lustre, nfs and so on. Free old_path in such a case. [AV: there was another missing path_put() nearby - on the previous goto retry] Signed-off-by: Oleg Drokin: <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-31vfs: unexport the getname() symbolJeff Layton
Leaving getname() exported when putname() isn't is a bad idea. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25fs: add get_acl helperChristoph Hellwig
Factor out the code to get an ACL either from the inode or disk from check_acl, so that it can be used elsewhere later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-12-12dcache: allow word-at-a-time name hashing with big-endian CPUsWill Deacon
When explicitly hashing the end of a string with the word-at-a-time interface, we have to be careful which end of the word we pick up. On big-endian CPUs, the upper-bits will contain the data we're after, so ensure we generate our masks accordingly (and avoid hashing whatever random junk may have been sitting after the string). This patch adds a new dcache helper, bytemask_from_count, which creates a mask appropriate for the CPU endianness. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-29fix bogus path_put() of nd->root after some unlazy_walk() failuresAl Viro
Failure to grab reference to parent dentry should go through the same cleanup as nd->seq mismatch. As it is, we might end up with caller thinking it needs to path_put() nd->root, with obvious nasty results once we'd hit that bug enough times to drive the refcount of root dentry all the way to zero... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-21Merge git://git.infradead.org/users/eparis/auditLinus Torvalds
Pull audit updates from Eric Paris: "Nothing amazing. Formatting, small bug fixes, couple of fixes where we didn't get records due to some old VFS changes, and a change to how we collect execve info..." Fixed conflict in fs/exec.c as per Eric and linux-next. * git://git.infradead.org/users/eparis/audit: (28 commits) audit: fix type of sessionid in audit_set_loginuid() audit: call audit_bprm() only once to add AUDIT_EXECVE information audit: move audit_aux_data_execve contents into audit_context union audit: remove unused envc member of audit_aux_data_execve audit: Kill the unused struct audit_aux_data_capset audit: do not reject all AUDIT_INODE filter types audit: suppress stock memalloc failure warnings since already managed audit: log the audit_names record type audit: add child record before the create to handle case where create fails audit: use given values in tty_audit enable api audit: use nlmsg_len() to get message payload length audit: use memset instead of trying to initialize field by field audit: fix info leak in AUDIT_GET requests audit: update AUDIT_INODE filter rule to comparator function audit: audit feature to set loginuid immutable audit: audit feature to only allow unsetting the loginuid audit: allow unsetting the loginuid (with priv) audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE audit: loginuid functions coding style selinux: apply selinux checks on new audit message types ...
2013-11-13Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "All kinds of stuff this time around; some more notable parts: - RCU'd vfsmounts handling - new primitives for coredump handling - files_lock is gone - Bruce's delegations handling series - exportfs fixes plus misc stuff all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits) ecryptfs: ->f_op is never NULL locks: break delegations on any attribute modification locks: break delegations on link locks: break delegations on rename locks: helper functions for delegation breaking locks: break delegations on unlink namei: minor vfs_unlink cleanup locks: implement delegations locks: introduce new FL_DELEG lock flag vfs: take i_mutex on renamed file vfs: rename I_MUTEX_QUOTA now that it's not used for quotas vfs: don't use PARENT/CHILD lock classes for non-directories vfs: pull ext4's double-i_mutex-locking into common code exportfs: fix quadratic behavior in filehandle lookup exportfs: better variable name exportfs: move most of reconnect_path to helper function exportfs: eliminate unused "noprogress" counter exportfs: stop retrying once we race with rename/remove exportfs: clear DISCONNECTED on all parents sooner exportfs: more detailed comment for path_reconnect ...
2013-11-09locks: break delegations on linkJ. Bruce Fields
Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Dustin Kirkland <dustin.kirkland@gazzang.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09locks: break delegations on renameJ. Bruce Fields
Cc: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09locks: helper functions for delegation breakingJ. Bruce Fields
We'll need the same logic for rename and link. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09locks: break delegations on unlinkJ. Bruce Fields
We need to break delegations on any operation that changes the set of links pointing to an inode. Start with unlink. Such operations also hold the i_mutex on a parent directory. Breaking a delegation may require waiting for a timeout (by default 90 seconds) in the case of a unresponsive NFS client. To avoid blocking all directory operations, we therefore drop locks before waiting for the delegation. The logic then looks like: acquire locks ... test for delegation; if found: take reference on inode release locks wait for delegation break drop reference on inode retry It is possible this could never terminate. (Even if we take precautions to prevent another delegation being acquired on the same inode, we could get a different inode on each retry.) But this seems very unlikely. The initial test for a delegation happens after the lock on the target inode is acquired, but the directory inode may have been acquired further up the call stack. We therefore add a "struct inode **" argument to any intervening functions, which we use to pass the inode back up to the caller in the case it needs a delegation synchronously broken. Cc: David Howells <dhowells@redhat.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Dustin Kirkland <dustin.kirkland@gazzang.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09namei: minor vfs_unlink cleanupJ. Bruce Fields
We'll be using dentry->d_inode in one more place. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09vfs: take i_mutex on renamed fileJ. Bruce Fields
A read delegation is used by NFSv4 as a guarantee that a client can perform local read opens without informing the server. The open operation takes the last component of the pathname as an argument, thus is also a lookup operation, and giving the client the above guarantee means informing the client before we allow anything that would change the set of names pointing to the inode. Therefore, we need to break delegations on rename, link, and unlink. We also need to prevent new delegations from being acquired while one of these operations is in progress. We could add some completely new locking for that purpose, but it's simpler to use the i_mutex, since that's already taken by all the operations we care about. The single exception is rename. So, modify rename to take the i_mutex on the file that is being renamed. Also fix up lockdep and Documentation/filesystems/directory-locking to reflect the change. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09dcache: fix outdated DCACHE_NEED_LOOKUP commentJ. Bruce Fields
The DCACHE_NEED_LOOKUP case referred to here was removed with 39e3c9553f34381a1b664c27b0c696a266a5735e "vfs: remove DCACHE_NEED_LOOKUP". There are only four real_lookup() callers and all of them pass in an unhashed dentry just returned from d_alloc. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09VFS: Put a small type field into struct dentry::d_flagsDavid Howells
Put a type field into struct dentry::d_flags to indicate if the dentry is one of the following types that relate particularly to pathwalk: Miss (negative dentry) Directory "Automount" directory (defective - no i_op->lookup()) Symlink Other (regular, socket, fifo, device) The type field is set to one of the first five types on a dentry by calls to __d_instantiate() and d_obtain_alias() from information in the inode (if one is given). The type is cleared by dentry_unlink_inode() when it reconstitutes an existing dentry as a negative dentry. Accessors provided are: d_set_type(dentry, type) d_is_directory(dentry) d_is_autodir(dentry) d_is_symlink(dentry) d_is_file(dentry) d_is_negative(dentry) d_is_positive(dentry) A bunch of checks in pathname resolution switched to those. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09get rid of {lock,unlock}_rcu_walk()Al Viro
those have become aliases for rcu_read_{lock,unlock}() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09RCU'd vfsmountsAl Viro
* RCU-delayed freeing of vfsmounts * vfsmount_lock replaced with a seqlock (mount_lock) * sequence number from mount_lock is stored in nameidata->m_seq and used when we exit RCU mode * new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its caller knows that vfsmount will have no surviving references. * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock() and doing pending mntput(). * new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence number against seq, then grabs reference to mnt. Then it rechecks mount_lock again to close the race and either returns success or drops the reference it has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can simply decrement the refcount and sod off - aforementioned synchronize_rcu() makes sure that final mntput() won't come until we leave RCU mode. We need that, since we don't want to end up with some lazy pathwalk racing with umount() and stealing the final mntput() from it - caller of umount() may expect it to return only once the fs is shut down and we don't want to break that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do full-blown mntput() in case of mount_lock sequence number mismatch happening just as we'd grabbed the reference, but in those cases we won't be stealing the final mntput() from anything that would care. * mntput_no_expire() doesn't lock anything on the fast path now. Incidentally, SMP and UP cases are handled the same way - no ifdefs there. * normal pathname resolution does *not* do any writes to mount_lock. It does, of course, bump the refcounts of vfsmount and dentry in the very end, but that's it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-05audit: add child record before the create to handle case where create failsJeff Layton
Historically, when a syscall that creates a dentry fails, you get an audit record that looks something like this (when trying to create a file named "new" in "/tmp/tmp.SxiLnCcv63"): type=PATH msg=audit(1366128956.279:965): item=0 name="/tmp/tmp.SxiLnCcv63/new" inode=2138308 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023 This record makes no sense since it's associating the inode information for "/tmp/tmp.SxiLnCcv63" with the path "/tmp/tmp.SxiLnCcv63/new". The recent patch I posted to fix the audit_inode call in do_last fixes this, by making it look more like this: type=PATH msg=audit(1366128765.989:13875): item=0 name="/tmp/tmp.DJ1O8V3e4f/" inode=141 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023 While this is more correct, if the creation of the file fails, then we have no record of the filename that the user tried to create. This patch adds a call to audit_inode_child to may_create. This creates an AUDIT_TYPE_CHILD_CREATE record that will sit in place until the create succeeds. When and if the create does succeed, then this record will be updated with the correct inode info from the create. This fixes what was broken in commit bfcec708. Commit 79f6530c should also be backported to stable v3.7+. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2013-10-24split __lookup_mnt() in two functionsAl Viro
Instead of passing the direction as argument (and checking it on every step through the hash chain), just have separate __lookup_mnt() and __lookup_mnt_last(). And use the standard iterators... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-22fs/namei.c: fix new kernel-doc warningRandy Dunlap
Add @path parameter to fix kernel-doc warning. Also fix a spello/typo. Warning(fs/namei.c:2304): No description found for parameter 'path' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-17atomic_open: take care of EEXIST in no-open case with O_CREAT|O_EXCL in ↵Al Viro
fs/namei.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16vfs: don't set FILE_CREATED before calling ->atomic_open()Miklos Szeredi
If O_CREAT|O_EXCL are passed to open, then we know that either - the file is successfully created, or - the operation fails in some way. So previously we set FILE_CREATED before calling ->atomic_open() so the filesystem doesn't have to. This, however, led to bugs in the implementation that went unnoticed when the filesystem didn't check for existence, yet returned success. To prevent this kind of bug, require filesystems to always explicitly set FILE_CREATED on O_CREAT|O_EXCL and verify this in the VFS. Also added a couple more verifications for the result of atomic_open(): - Warn if filesystem set FILE_CREATED despite the lack of O_CREAT. - Warn if filesystem set FILE_CREATED but gave a negative dentry. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10Add missing unlocks to error paths of mountpoint_last.Dave Jones
Signed-off-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10... and fold the renamed __vfs_follow_link() into its only callerAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10fs: remove vfs_follow_linkChristoph Hellwig
For a long time no filesystem has been using vfs_follow_link, and as seen by recent filesystem submissions any new use is accidental as well. Remove vfs_follow_link, document the replacement in Documentation/filesystems/porting and also rename __vfs_follow_link to match its only caller better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 3 (of many) from Al Viro: "Waiman's conversion of d_path() and bits related to it, kern_path_mountpoint(), several cleanups and fixes (exportfs one is -stable fodder, IMO). There definitely will be more... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: split read_seqretry_or_unlock(), convert d_walk() to resulting primitives dcache: Translating dentry into pathname without taking rename_lock autofs4 - fix device ioctl mount lookup introduce kern_path_mountpoint() rename user_path_umountat() to user_path_mountpoint_at() take unlazy_walk() into umount_lookup_last() Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c prune_super(): sb->s_op is never NULL exportfs: don't assume that ->iterate() won't feed us too long entries afs: get rid of redundant ->d_name.len checks