summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-02-10Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds
Merge misc fixes from Andrew Morton: "A bunch of fixes" * emailed patches fron Andrew Morton <akpm@linux-foundation.org>: ocfs2: check existence of old dentry in ocfs2_link() ocfs2: update inode size after zeroing the hole ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_size mm/memory-failure.c: move refcount only in !MF_COUNT_INCREASED smp.h: fix x86+cpu.c sparse warnings about arch nonboot CPU calls mm: fix page leak at nfs_symlink() slub: do not assert not having lock in removing freed partial gitignore: add all.config ocfs2: fix ocfs2_sync_file() if filesystem is readonly drivers/edac/edac_mc_sysfs.c: poll timeout cannot be zero fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem xen: properly account for _PAGE_NUMA during xen pte translations mm/slub.c: list_lock may not be held in some circumstances drivers/md/bcache/extents.c: use %zi to format size_t vmcore: prevent PT_NOTE p_memsz overflow during header update drivers/message/i2o/i2o_config.c: fix deadlock in compat_ioctl(I2OGETIOPS) Documentation/: update 00-INDEX files checkpatch: fix detection of git repository get_maintainer: fix detection of git repository drivers/misc/sgi-gru/grukdump.c: unlocking should be conditional in gru_dump_context()
2014-02-10ocfs2: check existence of old dentry in ocfs2_link()Xue jiufei
System call linkat first calls user_path_at(), check the existence of old dentry, and then calls vfs_link()->ocfs2_link() to do the actual work. There may exist a race when Node A create a hard link for file while node B rm it. Node A Node B user_path_at() ->ocfs2_lookup(), find old dentry exist rm file, add inode say inodeA to orphan_dir call ocfs2_link(),create a hard link for inodeA. rm the link, add inodeA to orphan_dir again When orphan_scan work start, it calls ocfs2_queue_orphans() to do the main work. It first tranverses entrys in orphan_dir, linking all inodes in this orphan_dir to a list look like this: inodeA->inodeB->...->inodeA When tranvering this list, it will fall into loop, calling iput() again and again. And finally trigger BUG_ON(inode->i_state & I_CLEAR). Signed-off-by: joyce <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10ocfs2: update inode size after zeroing the holeJunxiao Bi
fs-writeback will release the dirty pages without page lock whose offset are over inode size, the release happens at block_write_full_page_endio(). If not update, dirty pages in file holes may be released before flushed to the disk, then file holes will contain some non-zero data, this will cause sparse file md5sum error. To reproduce the bug, find a big sparse file with many holes, like vm image file, its actual size should be bigger than available mem size to make writeback work more frequently, tar it with -S option, then keep untar it and check its md5sum again and again until you get a wrong md5sum. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Younger Liu <younger.liu@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_sizeYounger Liu
The issue scenario is as following: - Create a small file and fallocate a large disk space for a file with FALLOC_FL_KEEP_SIZE option. - ftruncate the file back to the original size again. but the disk free space is not changed back. This is a real bug that be fixed in this patch. In order to solve the issue above, we modified ocfs2_setattr(), if attr->ia_size != i_size_read(inode), It calls ocfs2_truncate_file(), and truncate disk space to attr->ia_size. Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Tested-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Sunil Mushran <sunil.mushran@gmail.com> Reviewed-by: Jensen <shencanquan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10mm: fix page leak at nfs_symlink()Rafael Aquini
Changes in commit a0b8cab3b9b2 ("mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API") have introduced a call to add_to_page_cache_lru() which causes a leak in nfs_symlink() as now the page gets an extra refcount that is not dropped. Jan Stancek observed and reported the leak effect while running test8 from Connectathon Testsuite. After several iterations over the test case, which creates several symlinks on a NFS mountpoint, the test system was quickly getting into an out-of-memory scenario. This patch fixes the page leak by dropping that extra refcount add_to_page_cache_lru() is grabbing. Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Jeff Layton <jlayton@redhat.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: <stable@vger.kernel.org> [3.11.x+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10ocfs2: fix ocfs2_sync_file() if filesystem is readonlyYounger Liu
If filesystem is readonly, there is no need to flush drive's caches or force any uncommitted transactions. [akpm@linux-foundation.org: return -EROFS, not 0] Signed-off-by: Younger Liu <younger.liucn@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmemEric W. Biederman
Recently due to a spike in connections per second memcached on 3 separate boxes triggered the OOM killer from accept. At the time the OOM killer was triggered there was 4GB out of 36GB free in zone 1. The problem was that alloc_fdtable was allocating an order 3 page (32KiB) to hold a bitmap, and there was sufficient fragmentation that the largest page available was 8KiB. I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious but I do agree that order 3 allocations are very likely to succeed. There are always pathologies where order > 0 allocations can fail when there are copious amounts of free memory available. Using the pigeon hole principle it is easy to show that it requires 1 page more than 50% of the pages being free to guarantee an order 1 (8KiB) allocation will succeed, 1 page more than 75% of the pages being free to guarantee an order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of the pages being free to guarantee an order 3 allocate will succeed. A server churning memory with a lot of small requests and replies like memcached is a common case that if anything can will skew the odds against large pages being available. Therefore let's not give external applications a practical way to kill linux server applications, and specify __GFP_NORETRY to the kmalloc in alloc_fdmem. Unless I am misreading the code and by the time the code reaches should_alloc_retry in __alloc_pages_slowpath (where __GFP_NORETRY becomes signification). We have already tried everything reasonable to allocate a page and the only thing left to do is wait. So not waiting and falling back to vmalloc immediately seems like the reasonable thing to do even if there wasn't a chance of triggering the OOM killer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Cong Wang <cwang@twopensource.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10vmcore: prevent PT_NOTE p_memsz overflow during header updateGreg Pearson
Currently, update_note_header_size_elf64() and update_note_header_size_elf32() will add the size of a PT_NOTE entry to real_sz even if that causes real_sz to exceeds max_sz. This patch corrects the while loop logic in those routines to ensure that does not happen and prints a warning if a PT_NOTE entry is dropped. If zero PT_NOTE entries are found or this condition is encountered because the only entry was dropped, a warning is printed and an error is returned. One possible negative side effect of exceeding the max_sz limit is an allocation failure in merge_note_headers_elf64() or merge_note_headers_elf32() which would produce console output such as the following while booting the crash kernel. vmalloc: allocation failure: 14076997632 bytes swapper/0: page allocation failure: order:0, mode:0x80d2 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-gbp1 #7 Call Trace: dump_stack+0x19/0x1b warn_alloc_failed+0xf0/0x160 __vmalloc_node_range+0x19e/0x250 vmalloc_user+0x4c/0x70 merge_note_headers_elf64.constprop.9+0x116/0x24a vmcore_init+0x2d4/0x76c do_one_initcall+0xe2/0x190 kernel_init_freeable+0x17c/0x207 kernel_init+0xe/0x180 ret_from_fork+0x7c/0xb0 Kdump: vmcore not initialized kdump: dump target is /dev/sda4 kdump: saving to /sysroot//var/crash/127.0.0.1-2014.01.28-13:58:52/ kdump: saving vmcore-dmesg.txt Cannot open /proc/vmcore: No such file or directory kdump: saving vmcore-dmesg.txt failed kdump: saving vmcore kdump: saving vmcore failed This type of failure has been seen on a four socket prototype system with certain memory configurations. Most PT_NOTE sections have a single entry similar to: n_namesz = 0x5 n_descsz = 0x150 n_type = 0x1 Occasionally, a second entry is encountered with very large n_namesz and n_descsz sizes: n_namesz = 0x80000008 n_descsz = 0x510ae163 n_type = 0x80000008 Not yet sure of the source of these extra entries, they seem bogus, but they shouldn't cause crash dump to fail. Signed-off-by: Greg Pearson <greg.pearson@hp.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fixes from Steve French: "Small fix from Jeff for writepages leak, and some fixes for ACLs and xattrs when SMB2 enabled. Am expecting another fix from Jeff and at least one more fix (for mounting SMB2 with cifsacl) in the next week" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: [CIFS] clean up page array when uncached write send fails cifs: use a flexarray in cifs_writedata retrieving CIFS ACLs when mounted with SMB2 fails dropping session Add protocol specific operation for CIFS xattrs
2014-02-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes, both -stable fodder. The O_SYNC bug is fairly old..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix a kmap leak in virtio_console fix O_SYNC|O_APPEND syncing the wrong range on write()
2014-02-09fix O_SYNC|O_APPEND syncing the wrong range on write()Al Viro
It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support) when sync_page_range() had been introduced; generic_file_write{,v}() correctly synced pos_after_write - written .. pos_after_write - 1 but generic_file_aio_write() synced pos_before_write .. pos_before_write + written - 1 instead. Which is not the same thing with O_APPEND, obviously. A couple of years later correct variant had been killed off when everything switched to use of generic_file_aio_write(). All users of generic_file_aio_write() are affected, and the same bug has been copied into other instances of ->aio_write(). The fix is trivial; the only subtle point is that generic_write_sync() ought to be inlined to avoid calculations useless for the majority of calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-02-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is a small collection of fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix data corruption when reading/updating compressed extents Btrfs: don't loop forever if we can't run because of the tree mod log btrfs: reserve no transaction units in btrfs_ioctl_set_features btrfs: commit transaction after setting label and features Btrfs: fix assert screwup for the pending move stuff
2014-02-08Btrfs: fix data corruption when reading/updating compressed extentsFilipe David Borba Manana
When using a mix of compressed file extents and prealloc extents, it is possible to fill a page of a file with random, garbage data from some unrelated previous use of the page, instead of a sequence of zeroes. A simple sequence of steps to get into such case, taken from the test case I made for xfstests, is: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar This results in the following file items in the fs tree: item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160 inode generation 6 transid 6 size 542872 block group 0 mode 100600 item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 24576 ram 266240 extent compression 0 item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53 prealloc data disk byte 12849152 nr 241664 gen 6 prealloc data offset 0 nr 241664 item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 20480 ram 20480 extent compression 2 item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53 prealloc data disk byte 13090816 nr 405504 gen 6 prealloc data offset 0 nr 258048 The on disk extent at offset 266240 (which corresponds to 1 single disk block), contains 5 compressed chunks of file data. Each of the first 4 compress 4096 bytes of file data, while the last one only compresses 3024 bytes of file data. Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 = 1072 bytes) should always return zeroes (our next extent is a prealloc one). The solution here is the compression code path to zero the remaining (untouched) bytes of the last page it uncompressed data into, as the information about how much space the file data consumes in the last page is not known in the upper layer fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing the remainder of the page but only if it corresponds to the last page of the inode and if the inode's size is not a multiple of the page size. This would cause not only returning random data on reads, but also permanently storing random data when updating parts of the region that should be zeroed. For the example above, it means updating a single byte in the region [285648 ; 286720[ would store that byte correctly but also store random data on disk. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08Btrfs: don't loop forever if we can't run because of the tree mod logJosef Bacik
A user reported a 100% cpu hang with my new delayed ref code. Turns out I forgot to increase the count check when we can't run a delayed ref because of the tree mod log. If we can't run any delayed refs during this there is no point in continuing to look, and we need to break out. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08btrfs: reserve no transaction units in btrfs_ioctl_set_featuresDavid Sterba
Added in patch "btrfs: add ioctls to query/change feature bits online" modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08btrfs: commit transaction after setting label and featuresJeff Mahoney
The set_fslabel ioctl uses btrfs_end_transaction, which means it's possible that the change will be lost if the system crashes, same for the newly set features. Let's use btrfs_commit_transaction instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08Btrfs: fix assert screwup for the pending move stuffJosef Bacik
Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't reproduce. Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set, which meant that a key part of Filipe's original patch was not being built in. This appears to be a mess up with merging Filipe's patch as it does not exist in his original patch. Fix this by changing how we make sure del_waiting_dir_move asserts that it did not error and take the function out of the ifdef check. This makes btrfs/030 pass with the assert on or off. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-08Merge tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggyLinus Torvalds
Pull jfs fix from David Kleikamp: "Fix regression" * tag 'jfs-3.14-rc2' of git://github.com/kleikamp/linux-shaggy: jfs: fix generic posix ACL regression
2014-02-08jfs: fix generic posix ACL regressionDave Kleikamp
I missed a couple errors in reviewing the patches converting jfs to use the generic posix ACL function. Setting ACL's currently fails with -EOPNOTSUPP. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Reported-by: Michael L. Semon <mlsemon35@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-02-07[CIFS] clean up page array when uncached write send failsSteve French
In the event that a send fails in an uncached write, or we end up needing to reissue it (-EAGAIN case), we'll kfree the wdata but the pages currently leak. Fix this by adding a new kref release routine for uncached writedata that releases the pages, and have the uncached codepaths use that. [original patch by Jeff modified to fix minor formatting problems] Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>
2014-02-07cifs: use a flexarray in cifs_writedataJeff Layton
The cifs_writedata code uses a single element trailing array, which just adds unneeded complexity. Use a flexarray instead. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>
2014-02-07Merge tag 'driver-core-3.14-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fix from Greg KH: "Here is a single kernfs fix to resolve a much-reported lockdep issue with the removal of entries in sysfs" * tag 'driver-core-3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flag
2014-02-07retrieving CIFS ACLs when mounted with SMB2 fails dropping sessionSteve French
The get/set ACL xattr support for CIFS ACLs attempts to send old cifs dialect protocol requests even when mounted with SMB2 or later dialects. Sending cifs requests on an smb2 session causes problems - the server drops the session due to the illegal request. This patch makes CIFS ACL operations protocol specific to fix that. Attempting to query/set CIFS ACLs for SMB2 will now return EOPNOTSUPP (until we add worker routines for sending query ACL requests via SMB2) instead of sending invalid (cifs) requests. A separate followon patch will be needed to fix cifs_acl_to_fattr (which takes a cifs specific u16 fid so can't be abstracted to work with SMB2 until that is changed) and will be needed to fix mount problems when "cifsacl" is specified on mount with e.g. vers=2.1 Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com> CC: Stable <stable@kernel.org>
2014-02-07Add protocol specific operation for CIFS xattrsSteve French
Changeset 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 added protocol operations for get/setxattr to avoid calling cifs operations on smb2/smb3 mounts for xattr operations and this changeset adds the calls to cifs specific protocol operations for xattrs (in order to reenable cifs support for xattrs which was temporarily disabled by the previous changeset. We do not have SMB2/SMB3 worker function for setting xattrs yet so this only enables it for cifs. CCing stable since without these two small changsets (its small coreq 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 is also needed) calling getfattr/setfattr on smb2/smb3 mounts causes problems. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com> CC: Stable <stable@kernel.org>
2014-02-06mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irqKOSAKI Motohiro
To use spin_{un}lock_irq is dangerous if caller disabled interrupt. During aio buffer migration, we have a possibility to see the following call stack. aio_migratepage [disable interrupt] migrate_page_copy clear_page_dirty_for_io set_page_dirty __set_page_dirty_buffers __set_page_dirty spin_lock_irq This mean, current aio migration is a deadlockable. spin_lock_irqsave is a safer alternative and we should use it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: David Rientjes rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-06ocfs2: free allocated clusters if error occurs after ocfs2_claim_clustersZongxun Wang
Even if using the same jbd2 handle, we cannot rollback a transaction. So once some error occurs after successfully allocating clusters, the allocated clusters will never be used and it means they are lost. For example, call ocfs2_claim_clusters successfully when expanding a file, but failed in ocfs2_insert_extent. So we need free the allocated clusters if they are not used indeed. Signed-off-by: Zongxun Wang <wangzongxun@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-05execve: use 'struct filename *' for executable name passingLinus Torvalds
This changes 'do_execve()' to get the executable name as a 'struct filename', and to free it when it is done. This is what the normal users want, and it simplifies and streamlines their error handling. The controlled lifetime of the executable name also fixes a use-after-free problem with the trace_sched_process_exec tracepoint: the lifetime of the passed-in string for kernel users was not at all obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize the pathname allocation lifetime with the execve() having finished, which in turn meant that the trace point that happened after mm_release() of the old process VM ended up using already free'd memory. To solve the kernel string lifetime issue, this simply introduces "getname_kernel()" that works like the normal user-space getname() function, except with the source coming from kernel memory. As Oleg points out, this also means that we could drop the tcomm[] array from 'struct linux_binprm', since the pathname lifetime now covers setup_new_exec(). That would be a separate cleanup. Reported-by: Igor Zhbanov <i.zhbanov@samsung.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-05kernfs: make kernfs_deactivate() honor KERNFS_LOCKDEP flagTejun Heo
kernfs_deactivate() forgot to check whether KERNFS_LOCKDEP is set before performing lockdep annotations and ends up feeding uninitialized lockdep_map to lockdep triggering warning like the following on USB stick hotunplug. usb 1-2: USB disconnect, device number 2 INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 1 PID: 62 Comm: khubd Not tainted 3.13.0-work+ #82 Hardware name: empty empty/S3992, BIOS 080011 10/26/2007 ffff880065ca7f60 ffff88013a4ffa08 ffffffff81cfb6bd 0000000000000002 ffff88013a4ffac8 ffffffff810f8530 ffff88013a4fc710 0000000000000002 ffff880100000000 ffffffff82a3db50 0000000000000001 ffff88013a4fc710 Call Trace: [<ffffffff81cfb6bd>] dump_stack+0x4e/0x7a [<ffffffff810f8530>] __lock_acquire+0x1910/0x1e70 [<ffffffff810f931a>] lock_acquire+0x9a/0x1d0 [<ffffffff8127c75e>] kernfs_deactivate+0xee/0x130 [<ffffffff8127d4c8>] kernfs_addrm_finish+0x38/0x60 [<ffffffff8127d701>] kernfs_remove_by_name_ns+0x51/0xa0 [<ffffffff8127b4f1>] remove_files.isra.1+0x41/0x80 [<ffffffff8127b7e7>] sysfs_remove_group+0x47/0xa0 [<ffffffff8127b873>] sysfs_remove_groups+0x33/0x50 [<ffffffff8177d66d>] device_remove_attrs+0x4d/0x80 [<ffffffff8177e25e>] device_del+0x12e/0x1d0 [<ffffffff819722c2>] usb_disconnect+0x122/0x1a0 [<ffffffff819749b5>] hub_thread+0x3c5/0x1290 [<ffffffff810c6a6d>] kthread+0xed/0x110 [<ffffffff81d0a56c>] ret_from_fork+0x7c/0xb0 Fix it by making kernfs_deactivate() perform lockdep annotations only if KERNFS_LOCKDEP is set. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fabio Estevam <festevam@gmail.com> Reported-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Jiri Kosina <jkosina@suse.cz> Reported-by: Dave Jones <davej@redhat.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Tested-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe is fixing compile and boot problems with our crc32c rework, and Josef has disabled snapshot aware defrag for now. As the number of snapshots increases, we're hitting OOM. For the short term we're disabling things until a bigger fix is ready" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: use late_initcall instead of module_init Btrfs: use btrfs_crc32c everywhere instead of libcrc32c Btrfs: disable snapshot aware defrag for now
2014-02-04Merge tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights: - Fix NFSv3 acl regressions - Fix NFSv4 memory corruption due to slot table abuse in nfs4_proc_open_confirm - nfs4_destroy_session must call rpc_destroy_waitqueue" * tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: fs: get_acl() must be allowed to return EOPNOTSUPP NFSv3: Fix return value of nfs3_proc_setacls NFSv3: Remove unused function nfs3_proc_set_default_acl NFSv4.1: nfs4_destroy_session must call rpc_destroy_waitqueue NFSv4: Fix memory corruption in nfs4_proc_open_confirm nfs: fix setting of ACLs on file creation.
2014-02-03Merge branch 'acl_fixes' into linux-nextTrond Myklebust
2014-02-03fs: get_acl() must be allowed to return EOPNOTSUPPTrond Myklebust
posix_acl_xattr_get requires get_acl() to return EOPNOTSUPP if the filesystem cannot support acls. This is needed for NFS, which can't know whether or not the server supports acls until it tries to get/set one. This patch converts posix_acl_chmod and posix_acl_create to deal with EOPNOTSUPP return values from get_acl(). Reported-by: Russell King <linux@arm.linux.org.uk> Link: http://lkml.kernel.org/r/20140130140834.GW15937@n2100.arm.linux.org.uk Cc: Al Viro viro@zeniv.linux.org.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-03NFSv3: Fix return value of nfs3_proc_setaclsTrond Myklebust
nfs3_proc_setacls is used internally by the NFSv3 create operations to set the acl after the file has been created. If the operation fails because the server doesn't support acls, then it must return '0', not -EOPNOTSUPP. Reported-by: Russell King <linux@arm.linux.org.uk> Link: http://lkml.kernel.org/r/20140201010328.GI15937@n2100.arm.linux.org.uk Cc: Christoph Hellwig <hch@lst.de> Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-03NFSv3: Remove unused function nfs3_proc_set_default_aclTrond Myklebust
Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-03Btrfs: use late_initcall instead of module_initFilipe David Borba Manana
It seems that when init_btrfs_fs() is called, crc32c/crc32c-intel might not always be already initialized, which results in the call to crypto_alloc_shash() returning -ENOENT, as experienced by Ahmet who reported this. Therefore make sure init_btrfs_fs() is called after crc32c is initialized (which is at initialization level 6, module_init), by using late_initcall (which is at initialization level 7) instead of module_init for btrfs. Reported-and-Tested-by: Ahmet Inan <ainan@mathematik.uni-freiburg.de> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-03Btrfs: use btrfs_crc32c everywhere instead of libcrc32cFilipe David Borba Manana
After the commit titled "Btrfs: fix btrfs boot when compiled as built-in", LIBCRC32C requirement was removed from btrfs' Kconfig. This made it not possible to build a kernel with btrfs enabled (either as module or built-in) if libcrc32c is not enabled as well. So just replace all uses of libcrc32c with the equivalent function in btrfs hash.h - btrfs_crc32c. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-03Btrfs: disable snapshot aware defrag for nowJosef Bacik
It's just broken and it's taking a lot of effort to fix it, so for now just disable it so people can defrag in peace. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-02-02hpfs: optimize quad buffer loadingMikulas Patocka
HPFS needs to load 4 consecutive 512-byte sectors when accessing the directory nodes or bitmaps. We can't switch to 2048-byte block size because files are allocated in the units of 512-byte sectors. Previously, the driver would allocate a 2048-byte area using kmalloc, copy the data from four buffers to this area and eventually copy them back if they were modified. In the current implementation of the buffer cache, buffers are allocated in the pagecache. That means that 4 consecutive 512-byte buffers are stored in consecutive areas in the kernel address space. So, we don't need to allocate extra memory and copy the content of the buffers there. This patch optimizes the code to avoid copying the buffers. It checks if the four buffers are stored in contiguous memory - if they are not, it falls back to allocating a 2048-byte area and copying data there. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-02hpfs: remember free spaceMikulas Patocka
Previously, hpfs scanned all bitmaps each time the user asked for free space using statfs. This patch changes it so that hpfs scans the bitmaps only once, remembes the free space and on next invocation of statfs it returns the value instantly. New versions of wine are hammering on the statfs syscall very heavily, making some games unplayable when they're stored on hpfs, with load times in minutes. This should be backported to the stable kernels because it fixes user-visible problem (excessive level load times in wine). Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-01NFSv4.1: nfs4_destroy_session must call rpc_destroy_waitqueueTrond Myklebust
There may still be timers active on the session waitqueues. Make sure that we kill them before freeing the memory. Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-01NFSv4: Fix memory corruption in nfs4_proc_open_confirmTrond Myklebust
nfs41_wake_and_assign_slot() relies on the task->tk_msg.rpc_argp and task->tk_msg.rpc_resp always pointing to the session sequence arguments. nfs4_proc_open_confirm tries to pull a fast one by reusing the open sequence structure, thus causing corruption of the NFSv4 slot table. Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-02-01afs: proc cells and rootcell are writeablePali Rohár
Both proc files are writeable and used for configuring cells. But there is missing correct mode flag for writeable files. Without this patch both proc files are read only. [ It turns out they aren't really read-only, since root can write to them even if the write bit isn't set due to CAP_DAC_OVERRIDE ] Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-01Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "A set of cifs fixes (mostly for symlinks, and SMB2 xattrs) and cleanups" * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix check for regular file in couldbe_mf_symlink() [CIFS] Fix SMB2 mounts so they don't try to set or get xattrs via cifs CIFS: Cleanup cifs open codepath CIFS: Remove extra indentation in cifs_sfu_type CIFS: Cleanup cifs_mknod CIFS: Cleanup CIFSSMBOpen cifs: Add support for follow_link on dfs shares under posix extensions cifs: move unix extension call to cifs_query_symlink() cifs: Re-order M-F Symlink code cifs: Add create MFSymlinks to protocol ops struct cifs: use protocol specific call for query_mf_symlink() cifs: Rename MF symlink function names cifs: Rename and cleanup open_query_close_cifs_symlink() cifs: Fix memory leak in cifs_hardlink()
2014-02-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Several obvious fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Fix mountpoint reference leakage in linkat hfsplus: use xattr handlers for removexattr Typo in compat_sys_lseek() declaration fs/super.c: sync ro remount after blocking writers vfs: unexport the getname() symbol
2014-01-31nfs: fix setting of ACLs on file creation.Noah Massey
nfs3_get_acl() tries to skip posix equivalent ACLs, but misinterprets the return value of posix_acl_equiv_mode(). Fix it. This is a regression introduced by "nfs: use generic posix ACL infrastructure for v3 Posix ACLs" CC: Christoph Hellwig <hch@infradead.org> CC: linux-nfs@vger.kernel.org CC: linux-fsdevel@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-01-31Merge tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights: - Fix several races in nfs_revalidate_mapping - NFSv4.1 slot leakage in the pNFS files driver - Stable fix for a slot leak in nfs40_sequence_done - Don't reject NFSv4 servers that support ACLs with only ALLOW aces" * tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: initialize the ACL support bits to zero. NFSv4.1: Cleanup NFSv4.1: Clean up nfs41_sequence_done NFSv4: Fix a slot leak in nfs40_sequence_done NFSv4.1 free slot before resending I/O to MDS nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING NFS: Fix races in nfs_revalidate_mapping sunrpc: turn warn_gssd() log message into a dprintk() NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping nfs: handle servers that support only ALLOW ACE type.
2014-01-31Fix mountpoint reference leakage in linkatOleg Drokin
Recent changes to retry on ESTALE in linkat (commit 442e31ca5a49e398351b2954b51f578353fdf210) introduced a mountpoint reference leak and a small memory leak in case a filesystem link operation returns ESTALE which is pretty normal for distributed filesystems like lustre, nfs and so on. Free old_path in such a case. [AV: there was another missing path_put() nearby - on the previous goto retry] Signed-off-by: Oleg Drokin: <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-31hfsplus: use xattr handlers for removexattrChristoph Hellwig
hfsplus was already using the handlers for get and set operations, and with the removal of can_set_xattr we've now allow operations that wouldn't otherwise be allowed. With this we can also centralize the special-casing of the osx. attrs that don't have prefixes on disk in the osx xattr handlers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-31fs/super.c: sync ro remount after blocking writersAndrew Ruder
Move sync_filesystem() after sb_prepare_remount_readonly(). If writers sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly() it can cause inodes to be dirtied and writeback to occur well after sys_mount() has completely successfully. This was spotted by corrupted ubifs filesystems on reboot, but appears that it can cause issues with any filesystem using writeback. Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> CC: Richard Weinberger <richard@nod.at> Co-authored-by: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Ruder <andrew.ruder@elecsyscorp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-31vfs: unexport the getname() symbolJeff Layton
Leaving getname() exported when putname() isn't is a bad idea. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>