summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-08-03nfs: fix potential credential leak in ff_layout_update_mirror_credJeff Layton
commit a24221dca1868101c9b4b5adde4a6a5b1a3a64a7 upstream. If we have two tasks racing to update a mirror's credentials, then they can end up leaking one (or more) sets of credentials. The first task will set mirror->cred and then the second task will just overwrite it. Use a cmpxchg to ensure that the creds are only set once. If we get to the point where we would set mirror->cred and find that they're already set, then we just release the creds that were just found. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03NFS: Ensure we set NFS_CONTEXT_RESEND_WRITES when requeuing writesTrond Myklebust
commit c70701131f7a8edea91fc49d11796d342cff7c62 upstream. If a write attempt fails, and the write is queued up for resending to the server, as opposed to being dropped, then we need to set the appropriate flag so that nfs_file_fsync() does the right thing. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03fixing infinite OPEN loop in 4.0 stateid recoveryOlga Kornievskaia
commit e8d975e73e5fa05f983fbf2723120edcf68e0b38 upstream. Problem: When an operation like WRITE receives a BAD_STATEID, even though recovery code clears the RECLAIM_NOGRACE recovery flag before recovering the open state, because of clearing delegation state for the associated inode, nfs_inode_find_state_and_recover() gets called and it makes the same state with RECLAIM_NOGRACE flag again. As a results, when we restart looking over the open states, we end up in the infinite loop instead of breaking out in the next test of state flags. Solution: unset the RECLAIM_NOGRACE set because of calling of nfs_inode_find_state_and_recover() after returning from calling recover_open() function. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03NFS: Fix size of NFSACL SETACL operationsChuck Lever
commit d683cc49daf7c5afca8cd9654aaa1bf63cdf2ad9 upstream. When encoding the NFSACL SETACL operation, reserve just the estimated size of the ACL rather than a fixed maximum. This eliminates needless zero padding on the wire that the server ignores. Fixes: ee5dc7732bd5 ('NFS: Fix "kernel BUG at fs/nfs/nfs3xdr.c:1338!"') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03pNFS/flexfiles: Fix the reset of struct pgio_header when resendingTrond Myklebust
commit d620876990f02788d5a663075df007ffb91bdfad upstream. hdr->good_bytes needs to be set to the length of the request, not zero. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03pNFS: Fix a memory leak when attempted pnfs failsTrond Myklebust
commit 1ca018d28d96d07788474abf66a5f3e9594841f5 upstream. pnfs_do_write() expects the call to pnfs_write_through_mds() to free the pgio header and to release the layout segment before exiting. The problem is that nfs_pgio_data_destroy() doesn't actually do this; it only frees the memory allocated by nfs_generic_pgio(). Ditto for pnfs_do_read()... Fix in both cases is to add a call to hdr->release(hdr). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ovl: lookup whiteouts outside iterate_dir()Miklos Szeredi
commit cdb672795876d7bc1870aed9a2d7cb59f43d1d96 upstream. If jffs2 can deadlock on overlayfs readdir because it takes the same lock on ->iterate() as in ->lookup(). Fix by moving whiteout checking outside iterate_dir(). Optimized by collecting potential whiteouts (DT_CHR) in a temporary list and if non-empty iterating throug these and checking for a 0/0 chardev. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Fixes: 49c21e1cacd7 ("ovl: check whiteout while reading directory") Reported-by: Roman Yeryomin <leroi.lists@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03fuse: initialize fc->release before calling itMiklos Szeredi
commit 0ad0b3255a08020eaf50e34ef0d6df5bdf5e09ed upstream. fc->release is called from fuse_conn_put() which was used in the error cleanup before fc->release was initialized. [Jeremiah Mahler <jmmahler@gmail.com>: assign fc->release after calling fuse_conn_init(fc) instead of before.] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Fixes: a325f9b92273 ("fuse: update fuse_conn_init() and separate out fuse_conn_kill()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03Btrfs: fix file corruption after cloning inline extentsFilipe Manana
commit ed958762644b404654a6f5d23e869f496fe127c6 upstream. Using the clone ioctl (or extent_same ioctl, which calls the same extent cloning function as well) we end up allowing copy an inline extent from the source file into a non-zero offset of the destination file. This is something not expected and that the btrfs code is not prepared to deal with - all inline extents must be at a file offset equals to 0. For example, the following excerpt of a test case for fstests triggers a crash/BUG_ON() on a write operation after an inline extent is cloned into a non-zero offset: _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test files. File foo has the same 2K of data at offset 4K # as file bar has at its offset 0. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \ -c "pwrite -S 0xbb 4k 2K" \ -c "pwrite -S 0xcc 8K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # File bar consists of a single inline extent (2K size). $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \ $SCRATCH_MNT/bar | _filter_xfs_io # Now call the clone ioctl to clone the extent of file bar into file # foo at its offset 4K. This made file foo have an inline extent at # offset 4K, something which the btrfs code can not deal with in future # IO operations because all inline extents are supposed to start at an # offset of 0, resulting in all sorts of chaos. # So here we validate that clone ioctl returns an EOPNOTSUPP, which is # what it returns for other cases dealing with inlined extents. $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Because of the inline extent at offset 4K, the following write made # the kernel crash with a BUG_ON(). $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io status=0 exit The stack trace of the BUG_ON() triggered by the last write is: [152154.035903] ------------[ cut here ]------------ [152154.036424] kernel BUG at mm/page-writeback.c:2286! [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$ [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G W 4.1.0-rc6-btrfs-next-11+ #2 [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000 [152154.036424] RIP: 0010:[<ffffffff8111a9d5>] [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP: 0018:ffff880429effc68 EFLAGS: 00010246 [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001 [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0 [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001 [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68 [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80 [152154.036424] FS: 00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000 [152154.036424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0 [152154.036424] Stack: [152154.036424] ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1 [152154.036424] ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8 [152154.036424] 0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000 [152154.036424] Call Trace: [152154.036424] [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs] [152154.036424] [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs] [152154.036424] [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs] [152154.036424] [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5 [152154.036424] [<ffffffff81165f89>] vfs_write+0xa0/0xe4 [152154.036424] [<ffffffff81166855>] SyS_pwrite64+0x64/0x82 [152154.036424] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$ [152154.036424] RIP [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP <ffff880429effc68> [152154.242621] ---[ end trace e3d3376b23a57041 ]--- Fix this by returning the error EOPNOTSUPP if an attempt to copy an inline extent into a non-zero offset happens, just like what is done for other scenarios that would require copying/splitting inline extents, which were introduced by the following commits: 00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split") 3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03Btrfs: fix list transaction->pending_ordered corruptionFilipe Manana
commit d3efe08400317888f559bbedf0e42cd31575d0ef upstream. When we call btrfs_commit_transaction(), we splice the list "ordered" of our transaction handle into the transaction's "pending_ordered" list, but we don't re-initialize the "ordered" list of our transaction handle, this means it still points to the same elements it used to before the splice. Then we check if the current transaction's state is >= TRANS_STATE_COMMIT_START and if it is we end up calling btrfs_end_transaction() which simply splices again the "ordered" list of our handle into the transaction's "pending_ordered" list, leaving multiple pointers to the same ordered extents which results in list corruption when we are iterating, removing and freeing ordered extents at btrfs_wait_pending_ordered(), resulting in access to dangling pointers / use-after-free issues. Similarly, btrfs_end_transaction() can end up in some cases calling btrfs_commit_transaction(), and both did a list splice of the transaction handle's "ordered" list into the transaction's "pending_ordered" without re-initializing the handle's "ordered" list, resulting in exactly the same problem. This produces the following warning on a kernel with linked list debugging enabled: [109749.265416] ------------[ cut here ]------------ [109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98() [109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d (...) [109749.287505] Call Trace: [109749.288135] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [109749.298080] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2 [109749.331605] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [109749.334849] [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98 [109749.337093] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 [109749.337847] [<ffffffff81260642>] __list_del_entry+0x5a/0x98 [109749.338678] [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs] [109749.340145] [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs] [109749.348313] [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs] [109749.349745] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf [109749.350819] [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs] [109749.351976] [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e [109749.360341] [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e [109749.368828] [<ffffffff8118ee1d>] do_fsync+0x34/0x4e [109749.369790] [<ffffffff8118f045>] SyS_fsync+0x10/0x14 [109749.370925] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [109749.382274] ---[ end trace 48e0d07f7c03d95a ]--- On a non-debug kernel this leads to invalid memory accesses, causing a crash. Fix this by using list_splice_init() instead of list_splice() in btrfs_commit_transaction() and btrfs_end_transaction(). Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3" Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03Btrfs: fix memory leak in the extent_same ioctlFilipe Manana
commit 497b4050e0eacd4c746dd396d14916b1e669849d upstream. We were allocating memory with memdup_user() but we were never releasing that memory. This affected pretty much every call to the ioctl, whether it deduplicated extents or not. This issue was reported on IRC by Julian Taylor and on the mailing list by Marcel Ritter, credit goes to them for finding the issue. Reported-by: Julian Taylor <jtaylor.debian@googlemail.com> Reported-by: Marcel Ritter <ritter.marcel@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03Btrfs: fix fsync data loss after append writeFilipe Manana
commit e4545de5b035c7debb73d260c78377dbb69cbfb5 upstream. If we do an append write to a file (which increases its inode's i_size) that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, and the previous transaction added a new hard link to the file, which sets the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync the file, the inode's new i_size isn't logged. This has the consequence that after the fsync log is replayed, the file size remains what it was before the append write operation, which means users/applications will not be able to read the data that was successsfully fsync'ed before. This happens because neither the inode item nor the delayed inode get their i_size updated when the append write is made - doing so would require starting a transaction in the buffered write path, something that we do not do intentionally for performance reasons. Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is set the inode is logged with its current i_size (log the in-memory inode into the log tree). This issue is not a recent regression and is easy to reproduce with the following test case for fstests: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" here=`pwd` tmp=/tmp/$$ status=1 # failure is the default! _cleanup() { _cleanup_flakey rm -f $tmp.* } trap "_cleanup; exit \$status" 0 1 2 3 15 # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _supported_fs generic _supported_os Linux _need_to_be_root _require_scratch _require_dm_flakey _require_metadata_journaling $SCRATCH_DEV _crash_and_mount() { # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again and mount. This makes the fs replay its fsync log. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey } rm -f $seqres.full _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create the test file with some initial data and then fsync it. # The fsync here is only needed to trigger the issue in btrfs, as it causes the # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \ -c "fsync" \ $SCRATCH_MNT/foo | _filter_xfs_io sync # Add a hard link to our file. # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode, # which is a necessary condition to trigger the issue. ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar # Sync the filesystem to force a commit of the current btrfs transaction, this # is a necessary condition to trigger the bug on btrfs. sync # Now append more data to our file, increasing its size, and fsync the file. # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the # write path did not update the inode item in the btree nor the delayed inode # item (in memory struture) in the current transaction (created by the fsync # handler), the fsync did not record the inode's new i_size in the fsync # log/journal. This made the data unavailable after the fsync log/journal is # replayed. $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \ -c "fsync" \ $SCRATCH_MNT/foo | _filter_xfs_io echo "File content after fsync and before crash:" od -t x1 $SCRATCH_MNT/foo _crash_and_mount echo "File content after crash and log replay:" od -t x1 $SCRATCH_MNT/foo status=0 exit The expected file output before and after the crash/power failure expects the appended data to be available, which is: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb * 0200000 Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03Btrfs: fix race between caching kthread and returning inode to inode cacheFilipe Manana
commit ae9d8f17118551bedd797406a6768b87c2146234 upstream. While the inode cache caching kthread is calling btrfs_unpin_free_ino(), we could have a concurrent call to btrfs_return_ino() that adds a new entry to the root's free space cache of pinned inodes. This concurrent call does not acquire the fs_info->commit_root_sem before adding a new entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem because the caching kthread calls btrfs_unpin_free_ino() after setting the caching state to BTRFS_CACHE_FINISHED and therefore races with the task calling btrfs_return_ino(), which is adding a new entry, while the former (caching kthread) is navigating the cache's rbtree, removing and freeing nodes from the cache's rbtree without acquiring the spinlock that protects the rbtree. This race resulted in memory corruption due to double free of struct btrfs_free_space objects because both tasks can end up doing freeing the same objects. Note that adding a new entry can result in merging it with other entries in the cache, in which case those entries are freed. This is particularly important as btrfs_free_space structures are also used for the block group free space caches. This memory corruption can be detected by a debugging kernel, which reports it with the following trace: [132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected [132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1 [132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [132408.505075] ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce [132408.505075] ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68 [132408.505075] ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f [132408.505075] Call Trace: [132408.505075] [<ffffffff8145eec7>] dump_stack+0x4f/0x7b [132408.505075] [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2 [132408.505075] [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36 [132408.505075] [<ffffffff81155733>] __cache_free+0xe2/0x4b6 [132408.505075] [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs] [132408.505075] [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs] [132408.505075] [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28 [132408.505075] [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf [132408.505075] [<ffffffff811563a1>] ? kfree+0xb6/0x14e [132408.505075] [<ffffffff811563d0>] kfree+0xe5/0x14e [132408.505075] [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs] [132408.505075] [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs] [132408.505075] [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs] [132408.505075] [<ffffffff8106698f>] kthread+0xef/0xf7 [132408.505075] [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28 [132408.505075] [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad [132408.505075] [<ffffffff814653d2>] ret_from_fork+0x42/0x70 [132408.505075] [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad [132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b. [132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320 [132409.503355] ------------[ cut here ]------------ [132409.504241] kernel BUG at mm/slab.c:2571! Therefore fix this by having btrfs_unpin_free_ino() acquire the lock that protects the rbtree while doing the searches and removing entries. Fixes: 1c70d8fb4dfa ("Btrfs: fix inode caching vs tree log") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03Btrfs: use kmem_cache_free when freeing entry in inode cacheFilipe Manana
commit c3f4a1685bb87e59c886ee68f7967eae07d4dffa upstream. The free space entries are allocated using kmem_cache_zalloc(), through __btrfs_add_free_space(), therefore we should use kmem_cache_free() and not kfree() to avoid any confusion and any potential problem. Looking at the kfree() definition at mm/slab.c it has the following comment: /* * (...) * * Don't free memory not originally allocated by kmalloc() * or you will run into trouble. */ So better be safe and use kmem_cache_free(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03Btrfs: don't invalidate root dentry when subvolume deletion failsOmar Sandoval
commit 64ad6c488975d7516230cf7849190a991fd615ae upstream. Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"), mounted subvolumes can be deleted because d_invalidate() won't fail. However, we run into problems when we attempt to delete the default subvolume while it is mounted as the root filesystem: # btrfs subvol list / ID 257 gen 306 top level 5 path rootvol ID 267 gen 334 top level 5 path snap1 # btrfs subvol get-default / ID 267 gen 334 top level 5 path snap1 # btrfs inspect-internal rootid / 267 # mount -o subvol=/ /dev/vda1 /mnt # btrfs subvol del /mnt/snap1 Delete subvolume (no-commit): '/mnt/snap1' ERROR: cannot delete '/mnt/snap1' - Operation not permitted # findmnt / findmnt: can't read /proc/mounts: No such file or directory # ls /proc # Markus reported that this same scenario simply led to a kernel oops. This happens because in btrfs_ioctl_snap_destroy(), we call d_invalidate() before we check may_destroy_subvol(), which means that we detach the submounts and drop the dentry before erroring out. Instead, we should only invalidate the dentry once the deletion has succeeded. Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate() will prune the dcache for the deleted subvolume. Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate") Reported-by: Markus Schauler <mschauler@gmail.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03xfs: don't truncate attribute extents if no extents existBrian Foster
commit f66bf042693b620133d39af8d2f13615f03eadfc upstream. The xfs_attr3_root_inactive() call from xfs_attr_inactive() assumes that attribute blocks exist to invalidate. It is possible to have an attribute fork without extents, however. Consider the case where the attribute fork is created towards the beginning of xfs_attr_set() but some part of the subsequent attribute set fails. If an inode in such a state hits xfs_attr_inactive(), it eventually calls xfs_dabuf_map() and possibly xfs_bmapi_read(). The former emits a filesystem corruption warning, returns an error that bubbles back up to xfs_attr_inactive(), and leads to destruction of the in-core attribute fork without an on-disk reset. If the inode happens to make it back through xfs_inactive() in this state (e.g., via a concurrent bulkstat that cycles the inode from the reclaim state and releases it), i_afp might not exist when xfs_bmapi_read() is called and causes a NULL dereference panic. A '-p 2' fsstress run to ENOSPC on a relatively small fs (1GB) reproduces these problems. The behavior is a regression caused by: 6dfe5a0 xfs: xfs_attr_inactive leaves inconsistent attr fork state behind ... which removed logic that avoided the attribute extent truncate when no extents exist. Restore this logic to ensure the attribute fork is destroyed and reset correctly if it exists without any allocated extents. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03xfs: fix remote symlinks on V5/CRC filesystemsEric Sandeen
commit 2ac56d3d4bd625450a54d4c3f9292d58f6b88232 upstream. If we create a CRC filesystem, mount it, and create a symlink with a path long enough that it can't live in the inode, we get a very strange result upon remount: # ls -l mnt total 4 lrwxrwxrwx. 1 root root 929 Jun 15 16:58 link -> XSLM XSLM is the V5 symlink block header magic (which happens to be followed by a NUL, so the string looks terminated). xfs_readlink_bmap() advanced cur_chunk by the size of the header for CRC filesystems, but never actually used that pointer; it kept reading from bp->b_addr, which is the start of the block, rather than the start of the symlink data after the header. Looks like this problem goes back to v3.10. Fixing this gets us reading the proper link target, again. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03jbd2: fix ocfs2 corrupt when updating journal superblock failsJoseph Qi
commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a upstream. If updating journal superblock fails after journal data has been flushed, the error is omitted and this will mislead the caller as a normal case. In ocfs2, the checkpoint will be treated successfully and the other node can get the lock to update. Since the sb_start is still pointing to the old log block, it will rewrite the journal data during journal recovery by the other node. Thus the new updates will be overwritten and ocfs2 corrupts. So in above case we have to return the error, and ocfs2_commit_cache will take care of the error and prevent the other node to do update first. And only after recovering journal it can do the new updates. The issue discussion mail can be found at: https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html http://comments.gmane.org/gmane.comp.file-systems.ext4/48841 [ Fixed bug in patch which allowed a non-negative error return from jbd2_cleanup_journal_tail() to leak out of jbd2_fjournal_flush(); this was causing xfstests ext4/306 to fail. -- Ted ] Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()Dmitry Monakhov
commit b4f1afcd068f6e533230dfed00782cd8a907f96b upstream. jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start() So allocations should be done with GFP_NOFS [Full stack trace snipped from 3.10-rh7] [<ffffffff815c4bd4>] dump_stack+0x19/0x1b [<ffffffff8105dba1>] warn_slowpath_common+0x61/0x80 [<ffffffff8105dcca>] warn_slowpath_null+0x1a/0x20 [<ffffffff815c2142>] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17 [<ffffffff8119c045>] kmem_cache_alloc+0x55/0x210 [<ffffffff811477f5>] ? mempool_alloc_slab+0x15/0x20 [<ffffffff811477f5>] mempool_alloc_slab+0x15/0x20 [<ffffffff81147939>] mempool_alloc+0x69/0x170 [<ffffffff815cb69e>] ? _raw_spin_unlock_irq+0xe/0x20 [<ffffffff8109160d>] ? finish_task_switch+0x5d/0x150 [<ffffffff811f1a8e>] bio_alloc_bioset+0x1be/0x2e0 [<ffffffff8127ee49>] blkdev_issue_flush+0x99/0x120 [<ffffffffa019a733>] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL [<ffffffffa019aca1>] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2] [<ffffffffa019afc7>] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2] [<ffffffffa01952d8>] start_this_handle+0x2d8/0x550 [jbd2] [<ffffffff811b02a9>] ? __memcg_kmem_put_cache+0x29/0x30 [<ffffffff8119c120>] ? kmem_cache_alloc+0x130/0x210 [<ffffffffa019573a>] jbd2__journal_start+0xba/0x190 [jbd2] [<ffffffff811532ce>] ? lru_cache_add+0xe/0x10 [<ffffffffa01c9549>] ? ext4_da_write_begin+0xf9/0x330 [ext4] [<ffffffffa01f2c77>] __ext4_journal_start_sb+0x77/0x160 [ext4] [<ffffffffa01c9549>] ext4_da_write_begin+0xf9/0x330 [ext4] [<ffffffff811446ec>] generic_file_buffered_write_iter+0x10c/0x270 [<ffffffff81146918>] __generic_file_write_iter+0x178/0x390 [<ffffffff81146c6b>] __generic_file_aio_write+0x8b/0xb0 [<ffffffff81146ced>] generic_file_aio_write+0x5d/0xc0 [<ffffffffa01bf289>] ext4_file_write+0xa9/0x450 [ext4] [<ffffffff811c31d9>] ? pipe_read+0x379/0x4f0 [<ffffffff811b93f0>] do_sync_write+0x90/0xe0 [<ffffffff811b9b6d>] vfs_write+0xbd/0x1e0 [<ffffffff811ba5b8>] SyS_write+0x58/0xb0 [<ffffffff815d4799>] system_call_fastpath+0x16/0x1b Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: replace open coded nofail allocation in ext4_free_blocks()Michal Hocko
commit 7444a072c387a93ebee7066e8aee776954ab0e41 upstream. ext4_free_blocks is looping around the allocation request and mimics __GFP_NOFAIL behavior without any allocation fallback strategy. Let's remove the open coded loop and replace it with __GFP_NOFAIL. Without the flag the allocator has no way to find out never-fail requirement and cannot help in any way. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: correctly migrate a file with a hole at the beginningEryu Guan
commit 8974fec7d72e3e02752fe0f27b4c3719c78d9a15 upstream. Currently ext4_ind_migrate() doesn't correctly handle a file which contains a hole at the beginning of the file. This caused the migration to be done incorrectly, and then if there is a subsequent following delayed allocation write to the "hole", this would reclaim the same data blocks again and results in fs corruption. # assmuing 4k block size ext4, with delalloc enabled # skip the first block and write to the second block xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile # converting to indirect-mapped file, which would move the data blocks # to the beginning of the file, but extent status cache still marks # that region as a hole chattr -e /mnt/ext4/testfile # delayed allocation writes to the "hole", reclaim the same data block # again, results in i_blocks corruption xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile umount /mnt/ext4 e2fsck -nf /dev/sda6 ... Inode 53, i_blocks is 16, should be 8. Fix? no ... Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: be more strict when migrating to non-extent based fileEryu Guan
commit d6f123a9297496ad0b6335fe881504c4b5b2a5e5 upstream. Currently the check in ext4_ind_migrate() is not enough before doing the real conversion: a) delayed allocated extents could bypass the check on eh->eh_entries and eh->eh_depth This can be demonstrated by this script xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile where testfile has two extents but still be converted to non-extent based file format. b) only extent length is checked but not the offset, which would result in data lose (delalloc) or fs corruption (nodelalloc), because non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30) blocks This can be demostrated by xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile sync If delalloc is enabled, dmesg prints EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53 EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5 EXT4-fs (dm-4): This should not happen!! Data will be lost If delalloc is disabled, e2fsck -nf shows corruption Inode 53, i_size is 5497558142976, should be 4096. Fix? no Fix the two issues by a) forcing all delayed allocation blocks to be allocated before checking eh->eh_depth and eh->eh_entries b) limiting the last logical block of the extent is within direct map Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: fix reservation release on invalidatepage for delalloc fsLukas Czerner
commit 9705acd63b125dee8b15c705216d7186daea4625 upstream. On delalloc enabled file system on invalidatepage operation in ext4_da_page_release_reservation() we want to clear the delayed buffer and remove the extent covering the delayed buffer from the extent status tree. However currently there is a bug where on the systems with page size > block size we will always remove extents from the start of the page regardless where the actual delayed buffers are positioned in the page. This leads to the errors like this: EXT4-fs warning (device loop0): ext4_da_release_space:1225: ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data blocks This however can cause data loss on writeback time if the file system is in ENOSPC condition because we're releasing reservation for someones else delayed buffer. Fix this by only removing extents that corresponds to the part of the page we want to invalidate. This problem is reproducible by the following fio receipt (however I was only able to reproduce it with fio-2.1 or older. [global] bs=8k iodepth=1024 iodepth_batch=60 randrepeat=1 size=1m directory=/mnt/test numjobs=20 [job1] ioengine=sync bs=1k direct=1 rw=randread filename=file1:file2 [job2] ioengine=libaio rw=randwrite direct=1 filename=file1:file2 [job3] bs=1k ioengine=posixaio rw=randwrite direct=1 filename=file1:file2 [job5] bs=1k ioengine=sync rw=randread filename=file1:file2 [job7] ioengine=libaio rw=randwrite filename=file1:file2 [job8] ioengine=posixaio rw=randwrite filename=file1:file2 [job10] ioengine=mmap rw=randwrite bs=1k filename=file1:file2 [job11] ioengine=mmap rw=randwrite direct=1 filename=file1:file2 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: avoid deadlocks in the writeback path by using sb_getblk_gfpNikolay Borisov
commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 upstream. Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible deadlocks in the page writeback path. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: fix fencepost error in lazytime optimizationTheodore Ts'o
commit 0f0ff9a9f3fa2ec6f427603fd521d5f3a0b076d1 upstream. Commit 8f4d8558391: "ext4: fix lazytime optimization" was not a complete fix. In the case where the inode number is a multiple of 16, and we could still end up updating an inode with dirty timestamps written to the wrong inode on disk. Oops. This can be easily reproduced by using generic/005 with a file system with metadata_csum and lazytime enabled. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: set lazytime on remount if MS_LAZYTIME is set by mountTheodore Ts'o
commit a2fd66d069d86d793e9d39d4079b96f46d13f237 upstream. Newer versions of mount parse the lazytime feature and pass it to the mount system call via the flags field in the mount system call, removing the lazytime string from the mount options list. So we need to check for the presence of MS_LAZYTIME and set it in sb->s_flags in order for this flag to be set on a remount. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: don't retry file block mapping on bigalloc fs with non-extent fileDarrick J. Wong
commit 292db1bc6c105d86111e858859456bcb11f90f91 upstream. ext4 isn't willing to map clusters to a non-extent file. Don't signal this with an out of space error, since the FS will retry the allocation (which didn't fail) forever. Instead, return EUCLEAN so that the operation will fail immediately all the way back to userspace. (The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: call sync_blockdev() before invalidate_bdev() in put_super()Theodore Ts'o
commit 89d96a6f8e6491f24fc8f99fd6ae66820e85c6c1 upstream. Normally all of the buffers will have been forced out to disk before we call invalidate_bdev(), but there will be some cases, where a file system operation was aborted due to an ext4_error(), where there may still be some dirty buffers in the buffer cache for the device. So try to force them out to memory before calling invalidate_bdev(). This fixes a warning triggered by generic/081: WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f() Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: fix race between truncate and __ext4_journalled_writepage()Theodore Ts'o
commit bdf96838aea6a265f2ae6cbcfb12a778c84a0b8e upstream. The commit cf108bca465d: "ext4: Invert the locking order of page_lock and transaction start" caused __ext4_journalled_writepage() to drop the page lock before the page was written back, as part of changing the locking order to jbd2_journal_start -> page_lock. However, this introduced a potential race if there was a truncate racing with the data=journalled writeback mode. Fix this by grabbing the page lock after starting the journal handle, and then checking to see if page had gotten truncated out from under us. This fixes a number of different warnings or BUG_ON's when running xfstests generic/086 in data=journalled mode, including: jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7 c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0 - and - kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200! ... Call Trace: [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c027d883>] ? lock_buffer+0x36/0x36 [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22 [<c0229139>] do_invalidatepage+0x22/0x26 [<c0229198>] truncate_inode_page+0x5b/0x85 [<c022934b>] truncate_inode_pages_range+0x156/0x38c [<c0229592>] truncate_inode_pages+0x11/0x15 [<c022962d>] truncate_pagecache+0x55/0x71 [<c02b913b>] ext4_setattr+0x4a9/0x560 [<c01ca542>] ? current_kernel_time+0x10/0x44 [<c026c4d8>] notify_change+0x1c7/0x2be [<c0256a00>] do_truncate+0x65/0x85 [<c0226f31>] ? file_ra_state_init+0x12/0x29 - and - WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396 irty_metadata+0x14a/0x1ae() ... Call Trace: [<c01b879f>] ? console_unlock+0x3a1/0x3ce [<c082cbb4>] dump_stack+0x48/0x60 [<c0178b65>] warn_slowpath_common+0x89/0xa0 [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae [<c0178bef>] warn_slowpath_null+0x14/0x18 [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d [<c02b2f44>] write_end_fn+0x40/0x53 [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a [<c02b59e7>] ext4_writepage+0x354/0x3b8 [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4 [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b5a5b>] __writepage+0x10/0x2e [<c0225956>] write_cache_pages+0x22d/0x32c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b6ee8>] ext4_writepages+0x102/0x607 [<c019adfe>] ? sched_clock_local+0x10/0x10e [<c01a8a7c>] ? __lock_is_held+0x2e/0x44 [<c01a8ad5>] ? lock_is_held+0x43/0x51 [<c0226dff>] do_writepages+0x1c/0x29 [<c0276bed>] __writeback_single_inode+0xc3/0x545 [<c0277c07>] writeback_sb_inodes+0x21f/0x36d ... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21mnt: Modify fs_fully_visible to deal with locked ro nodev and atimeEric W. Biederman
commit 8c6cf9cc829fcd0b179b59f7fe288941d0e31108 upstream. Ignore an existing mount if the locked readonly, nodev or atime attributes are less permissive than the desired attributes of the new mount. On success ensure the new mount locks all of the same readonly, nodev and atime attributes as the old mount. The nosuid and noexec attributes are not checked here as this change is destined for stable and enforcing those attributes causes a regression in lxc and libvirt-lxc where those applications will not start and there are no known executables on sysfs or proc and no known way to create exectuables without code modifications Fixes: e51db73532955 ("userns: Better restrictions on when proc and sysfs can be mounted") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21mnt: Refactor the logic for mounting sysfs and proc in a user namespaceEric W. Biederman
commit 1b852bceb0d111e510d1a15826ecc4a19358d512 upstream. Fresh mounts of proc and sysfs are a very special case that works very much like a bind mount. Unfortunately the current structure can not preserve the MNT_LOCK... mount flags. Therefore refactor the logic into a form that can be modified to preserve those lock bits. Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount of the filesystem be fully visible in the current mount namespace, before the filesystem may be mounted. Move the logic for calling fs_fully_visible from proc and sysfs into fs/namespace.c where it has greater access to mount namespace state. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21mnt: Update fs_fully_visible to test for permanently empty directoriesEric W. Biederman
commit 7236c85e1be51a9e25ba0f6e087a66ca89605a49 upstream. fs_fully_visible attempts to make fresh mounts of proc and sysfs give the mounter no more access to proc and sysfs than if they could have by creating a bind mount. One aspect of proc and sysfs that makes this particularly tricky is that there are other filesystems that typically mount on top of proc and sysfs. As those filesystems are mounted on empty directories in practice it is safe to ignore them. However testing to ensure filesystems are mounted on empty directories has not been something the in kernel data structures have supported so the current test for an empty directory which checks to see if nlink <= 2 is a bit lacking. proc and sysfs have recently been modified to use the new empty_dir infrastructure to create all of their dedicated mount points. Instead of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a directory is empty, test for is_empty_dir_inode(inode). That small change guaranteess mounts found on proc and sysfs really are safe to ignore, because the directories are not only empty but nothing can ever be added to them. This guarantees there is nothing to worry about when mounting proc and sysfs. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21sysfs: Create mountpoints with sysfs_create_mount_pointEric W. Biederman
commit f9bb48825a6b5d02f4cabcc78967c75db903dcdc upstream. This allows for better documentation in the code and it allows for a simpler and fully correct version of fs_fully_visible to be written. The mount points converted and their filesystems are: /sys/hypervisor/s390/ s390_hypfs /sys/kernel/config/ configfs /sys/kernel/debug/ debugfs /sys/firmware/efi/efivars/ efivarfs /sys/fs/fuse/connections/ fusectl /sys/fs/pstore/ pstore /sys/kernel/tracing/ tracefs /sys/fs/cgroup/ cgroup /sys/kernel/security/ securityfs /sys/fs/selinux/ selinuxfs /sys/fs/smackfs/ smackfs Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21sysfs: Add support for permanently empty directories to serve as mount points.Eric W. Biederman
commit 87d2846fcf88113fae2341da1ca9a71f0d916f2c upstream. Add two functions sysfs_create_mount_point and sysfs_remove_mount_point that hang a permanently empty directory off of a kobject or remove a permanently emptpy directory hanging from a kobject. Export these new functions so modular filesystems can use them. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21kernfs: Add support for always empty directories.Eric W. Biederman
commit ea015218f2f7ace2dad9cedd21ed95bdba2886d7 upstream. Add a new function kernfs_create_empty_dir that can be used to create directory that can not be modified. Update the code to use make_empty_dir_inode when reporting a permanently empty directory to the vfs. Update the code to not allow adding to permanently empty directories. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21proc: Allow creating permanently empty directories that serve as mount pointsEric W. Biederman
commit eb6d38d5427b3ad42f5268da0f1dd31bb0af1264 upstream. Add a new function proc_create_mount_point that when used to creates a directory that can not be added to. Add a new function is_empty_pde to test if a function is a mount point. Update the code to use make_empty_dir_inode when reporting a permanently empty directory to the vfs. Update the code to not allow adding to permanently empty directories. Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21sysctl: Allow creating permanently empty directories that serve as mountpoints.Eric W. Biederman
commit f9bd6733d3f11e24f3949becf277507d422ee1eb upstream. Add a magic sysctl table sysctl_mount_point that when used to create a directory forces that directory to be permanently empty. Update the code to use make_empty_dir_inode when accessing permanently empty directories. Update the code to not allow adding to permanently empty directories. Update /proc/sys/fs/binfmt_misc to be a permanently empty directory. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21fs: Add helper functions for permanently empty directories.Eric W. Biederman
commit fbabfd0f4ee2e8847bf56edf481249ad1bb8c44d upstream. To ensure it is safe to mount proc and sysfs I need to check if filesystems that are mounted on top of them are mounted on truly empty directories. Given that some directories can gain entries over time, knowing that a directory is empty right now is insufficient. Therefore add supporting infrastructure for permantently empty directories that proc and sysfs can use when they create mount points for filesystems and fs_fully_visible can use to test for permanently empty directories to ensure that nothing will be gained by mounting a fresh copy of proc or sysfs. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10fs/ufs: restore s_lock mutex_init()Fabian Frederick
commit e4f95517f18271b1da36cfc5d700e46844396d6e upstream. Add last missing line in commit "cdd9eefdf905" ("fs/ufs: restore s_lock mutex") Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10ufs: Fix possible deadlock when looking up directoriesJan Kara
commit 514d748f69c97a51a2645eb198ac5c6218f22ff9 upstream. Commit e4502c63f56aeca88 (ufs: deal with nfsd/iget races) made ufs create inodes with I_NEW flag set. However ufs_mkdir() never cleared this flag. Thus if someone ever tried to lookup the directory by inode number, he would deadlock waiting for I_NEW to be cleared. Luckily this mostly happens only if the filesystem is exported over NFS since otherwise we have the inode attached to dentry and don't look it up by inode number. In rare cases dentry can get freed without inode being freed and then we'd hit the deadlock even without NFS export. Fix the problem by clearing I_NEW before instantiating new directory inode. Fixes: e4502c63f56aeca887ced37f24e0def1ef11cec8 Reported-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10ufs: Fix warning from unlock_new_inode()Jan Kara
commit 12ecbb4b1d765a5076920999298d9625439dbe58 upstream. Commit e4502c63f56aeca88 (ufs: deal with nfsd/iget races) introduced unlock_new_inode() call into ufs_add_nondir(). However that function gets called also from ufs_link() which hands it already initialized inode and thus unlock_new_inode() complains. The problem is harmless but annoying. Fix the problem by opencoding necessary stuff in ufs_link() Fixes: e4502c63f56aeca887ced37f24e0def1ef11cec8 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10vfs: Ignore unlocked mounts in fs_fully_visibleEric W. Biederman
commit ceeb0e5d39fcdf4dca2c997bf225c7fc49200b37 upstream. Limit the mounts fs_fully_visible considers to locked mounts. Unlocked can always be unmounted so considering them adds hassle but no security benefit. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10vfs: Remove incorrect debugging WARN in prepend_pathEric W. Biederman
commit 93e3bce6287e1fb3e60d3324ed08555b5bbafa89 upstream. The warning message in prepend_path is unclear and outdated. It was added as a warning that the mechanism for generating names of pseudo files had been removed from prepend_path and d_dname should be used instead. Unfortunately the warning reads like a general warning, making it unclear what to do with it. Remove the warning. The transition it was added to warn about is long over, and I added code several years ago which in rare cases causes the warning to fire on legitimate code, and the warning is now firing and scaring people for no good reason. Reported-by: Ivan Delalande <colona@arista.com> Reported-by: Omar Sandoval <osandov@osandov.com> Fixes: f48cfddc6729e ("vfs: In d_path don't call d_dname on a mount point") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10fs/ufs: restore s_lock mutexFabian Frederick
commit cdd9eefdf905e92e7fc6cc393314efe68dc6ff66 upstream. Commit 0244756edc4b98c ("ufs: sb mutex merge + mutex_destroy") generated deadlocks in read/write mode on mkdir. This patch partially reverts it keeping fixes by Andrew Morton and mutex_destroy() [AV: fixed a missing bit in ufs_remount()] Signed-off-by: Fabian Frederick <fabf@skynet.be> Reported-by: Ian Campbell <ian.campbell@citrix.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Roger Pau Monne <roger.pau@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10fs/ufs: revert "ufs: fix deadlocks introduced by sb mutex merge"Fabian Frederick
commit 13b987ea275840d74d9df9a44326632fab1894da upstream. This reverts commit 9ef7db7f38d0 ("ufs: fix deadlocks introduced by sb mutex merge") That patch tried to solve commit 0244756edc4b98c ("ufs: sb mutex merge + mutex_destroy") which is itself partially reverted due to multiple deadlocks. Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Jan Kara <jack@suse.cz> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Roger Pau Monne <roger.pau@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10fs: Fix S_NOSEC handlingJan Kara
commit 2426f3910069ed47c0cc58559a6d088af7920201 upstream. file_remove_suid() could mistakenly set S_NOSEC inode bit when root was modifying the file. As a result following writes to the file by ordinary user would avoid clearing suid or sgid bits. Fix the bug by checking actual mode bits before setting S_NOSEC. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-02vfs: read file_handle only once in handle_to_pathSasha Levin
We used to read file_handle twice. Once to get the amount of extra bytes, and once to fetch the entire structure. This may be problematic since we do size verifications only after the first read, so if the number of extra bytes changes in userspace between the first and second calls, we'll have an incoherent view of file_handle. Instead, read the constant size once, and copy that over to the final structure without having to re-read it again. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-31Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fix from Al Viro: "Off-by-one in d_walk()/__dentry_kill() race fix. It's very hard to hit; possible in the same conditions as the original bug, except that you need the skipped branch to contain all the remaining evictables, so that the d_walk()-calling loop in d_invalidate() decides there's nothing more to do and doesn't go for another pass - otherwise that next pass will sweep the sucker. So it's not too urgent, but seeing that the fix is obvious and the original commit has spread into all -stable branches..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: d_walk() might skip too much
2015-05-29Merge tag 'xfs-for-linus-4.1-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "This is a little larger than I'd like late in the release cycle, but all the fixes are for regressions introduced in the 4.1-rc1 merge, or are needed back in -stable kernels fairly quickly as they are filesystem corruption or userspace visible correctness issues. Changes in this update: - regression fix for new rename whiteout code - regression fixes for new superblock generic per-cpu counter code - fix for incorrect error return sign introduced in 3.17 - metadata corruption fixes that need to go back to -stable kernels" * tag 'xfs-for-linus-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: fix broken i_nlink accounting for whiteout tmpfile inode xfs: xfs_iozero can return positive errno xfs: xfs_attr_inactive leaves inconsistent attr fork state behind xfs: extent size hints can round up extents past MAXEXTLEN xfs: inode and free block counters need to use __percpu_counter_compare percpu_counter: batch size aware __percpu_counter_compare() xfs: use percpu_counter_read_positive for mp->m_icount
2015-05-28d_walk() might skip too muchAl Viro
when we find that a child has died while we'd been trying to ascend, we should go into the first live sibling itself, rather than its sibling. Off-by-one in question had been introduced in "deal with deadlock in d_walk()" and the fix needs to be backported to all branches this one has been backported to. Cc: stable@vger.kernel.org # 3.2 and later Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>