summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-12-14Btrfs: fix regression running delayed references when using qgroupsFilipe Manana
commit b06c4bf5c874a57254b197f53ddf588e7a24a2bf upstream. In the kernel 4.2 merge window we had a big changes to the implementation of delayed references and qgroups which made the no_quota field of delayed references not used anymore. More specifically the no_quota field is not used anymore as of: commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.") Leaving the no_quota field actually prevents delayed references from getting merged, which in turn cause the following BUG_ON(), at fs/btrfs/extent-tree.c, to be hit when qgroups are enabled: static int run_delayed_tree_ref(...) { (...) BUG_ON(node->ref_mod != 1); (...) } This happens on a scenario like the following: 1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. 2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota. 3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref2 is incompatible due to Ref2->no_quota != Ref3->no_quota. 4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref3 is incompatible due to Ref3->no_quota != Ref4->no_quota. 5) We run delayed references, trigger merging of delayed references, through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs(). 6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and all other conditions are satisfied too. So Ref1 gets a ref_mod value of 2. 7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and all other conditions are satisfied too. So Ref2 gets a ref_mod value of 2. 8) Ref1 and Ref2 aren't merged, because they have different values for their no_quota field. 9) Delayed reference Ref1 is picked for running (select_delayed_ref() always prefers references with an action == BTRFS_ADD_DELAYED_REF). So run_delayed_tree_ref() is called for Ref1 which triggers the BUG_ON because Ref1->red_mod != 1 (equals 2). So fix this by removing the no_quota field, as it's not used anymore as of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism."). The use of no_quota was also buggy in at least two places: 1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting no_quota to 0 instead of 1 when the following condition was true: is_fstree(ref_root) || !fs_info->quota_enabled 2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to reset a node's no_quota when the condition "!is_fstree(root_objectid) || !root->fs_info->quota_enabled" was true but we did it only in an unused local stack variable, that is, we never reset the no_quota value in the node itself. This fixes the remainder of problems several people have been having when running delayed references, mostly while a balance is running in parallel, on a 4.2+ kernel. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Also, this fixes deadlock issue when using the clone ioctl with qgroups enabled, as reported by Elias Probst in the mailing list. The deadlock happens because after calling btrfs_insert_empty_item we have our path holding a write lock on a leaf of the fs/subvol tree and then before releasing the path we called check_ref() which did backref walking, when qgroups are enabled, and tried to read lock the same leaf. The trace for this case is the following: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds. (...) Call Trace: [<ffffffff86999201>] schedule+0x74/0x83 [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea [<ffffffff86137ed7>] ? wait_woken+0x74/0x74 [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810 [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127 [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667 [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6 [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0 [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65 [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88 [<ffffffff863e852e>] check_ref+0x64/0xc4 [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77 [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb (...) The problem goes away by eleminating check_ref(), which no longer is needed as its purpose was to get a value for the no_quota field of a delayed reference (this patch removes the no_quota field as mentioned earlier). Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Elias Probst <mail@eliasprobst.eu> Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-12-14ceph: fix message length computationArnd Bergmann
commit 777d738a5e58ba3b6f3932ab1543ce93703f4873 upstream. create_request_message() computes the maximum length of a message, but uses the wrong type for the time stamp: sizeof(struct timespec) may be 8 or 16 depending on the architecture, while sizeof(struct ceph_timespec) is always 8, and that is what gets put into the message. Found while auditing the uses of timespec for y2038 problems. Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14ocfs2: fix umask ignored issueJunxiao Bi
commit 8f1eb48758aacf6c1ffce18179295adbf3bd7640 upstream. New created file's mode is not masked with umask, and this makes umask not work for ocfs2 volume. Fixes: 702e5bc ("ocfs2: use generic posix ACL infrastructure") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14nfs: if we have no valid attrs, then don't declare the attribute cache validJeff Layton
commit c812012f9ca7cf89c9e1a1cd512e6c3b5be04b85 upstream. If we pass in an empty nfs_fattr struct to nfs_update_inode, it will (correctly) not update any of the attributes, but it then clears the NFS_INO_INVALID_ATTR flag, which indicates that the attributes are up to date. Don't clear the flag if the fattr struct has no valid attrs to apply. Reviewed-by: Steve French <steve.french@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14nfs4: start callback_ident at idr 1Benjamin Coddington
commit c68a027c05709330fe5b2f50c50d5fa02124b5d8 upstream. If clp->cl_cb_ident is zero, then nfs_cb_idr_remove_locked() skips removing it when the nfs_client is freed. A decoding or server bug can then find and try to put that first nfs_client which would lead to a crash. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: d6870312659d ("nfs4client: convert to idr_alloc()") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14debugfs: fix refcount imbalance in start_creatingDaniel Borkmann
commit 0ee9608c89e81a1ccee52ecb58a7ff040e2522d9 upstream. In debugfs' start_creating(), we pin the file system to safely access its root. When we failed to create a file, we unpin the file system via failed_creating() to release the mount count and eventually the reference of the vfsmount. However, when we run into an error during lookup_one_len() when still in start_creating(), we only release the parent's mutex but not so the reference on the mount. Looks like it was done in the past, but after splitting portions of __create_file() into start_creating() and end_creating() via 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off"), this seemed missed. Noticed during code review. Fixes: 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14nfsd: eliminate sending duplicate and repeated delegationsAndrew Elble
commit 34ed9872e745fa56f10e9bef2cf3d2336c6c8816 upstream. We've observed the nfsd server in a state where there are multiple delegations on the same nfs4_file for the same client. The nfs client does attempt to DELEGRETURN these when they are presented to it - but apparently under some (unknown) circumstances the client does not manage to return all of them. This leads to the eventual attempt to CB_RECALL more than one delegation with the same nfs filehandle to the same client. The first recall will succeed, but the next recall will fail with NFS4ERR_BADHANDLE. This leads to the server having delegations on cl_revoked that the client has no way to FREE or DELEGRETURN, with resulting inability to recover. The state manager on the server will continually assert SEQ4_STATUS_RECALLABLE_STATE_REVOKED, and the state manager on the client will be looping unable to satisfy the server. List discussion also reports a race between OPEN and DELEGRETURN that will be avoided by only sending the delegation once to the client. This is also logically in accordance with RFC5561 9.1.1 and 10.2. So, let's: 1.) Not hand out duplicate delegations. 2.) Only send them to the client once. RFC 5561: 9.1.1: "Delegations and layouts, on the other hand, are not associated with a specific owner but are associated with the client as a whole (identified by a client ID)." 10.2: "...the stateid for a delegation is associated with a client ID and may be used on behalf of all the open-owners for the given client. A delegation is made to the client as a whole and not to any specific process or thread of control within it." Reported-by: Eric Meddaugh <etmsys@rit.edu> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14nfsd: serialize state seqid morphing operationsJeff Layton
commit 35a92fe8770ce54c5eb275cd76128645bea2d200 upstream. Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were running in parallel. The server would receive the OPEN_DOWNGRADE first and check its seqid, but then an OPEN would race in and bump it. The OPEN_DOWNGRADE would then complete and bump the seqid again. The result was that the OPEN_DOWNGRADE would be applied after the OPEN, even though it should have been rejected since the seqid changed. The only recourse we have here I think is to serialize operations that bump the seqid in a stateid, particularly when we're given a seqid in the call. To address this, we add a new rw_semaphore to the nfs4_ol_stateid struct. We do a down_write prior to checking the seqid after looking up the stateid to ensure that nothing else is going to bump it while we're operating on it. In the case of OPEN, we do a down_read, as the call doesn't contain a seqid. Those can run in parallel -- we just need to serialize them when there is a concurrent OPEN_DOWNGRADE or CLOSE. LOCK and LOCKU however always take the write lock as there is no opportunity for parallelizing those. Reported-and-Tested-by: Andrew W Elble <aweits@rit.edu> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14ext4, jbd2: ensure entering into panic after recording an error in superblockDaeho Jeong
commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 upstream. If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the journaling will be aborted first and the error number will be recorded into JBD2 superblock and, finally, the system will enter into the panic state in "errors=panic" option. But, in the rare case, this sequence is little twisted like the below figure and it will happen that the system enters into panic state, which means the system reset in mobile environment, before completion of recording an error in the journal superblock. In this case, e2fsck cannot recognize that the filesystem failure occurred in the previous run and the corruption wouldn't be fixed. Task A Task B ext4_handle_error() -> jbd2_journal_abort() -> __journal_abort_soft() -> __jbd2_journal_abort_hard() | -> journal->j_flags |= JBD2_ABORT; | | __ext4_abort() | -> jbd2_journal_abort() | | -> __journal_abort_soft() | | -> if (journal->j_flags & JBD2_ABORT) | | return; | -> panic() | -> jbd2_journal_update_sb_errno() Tested-by: Hobin Woo <hobin.woo@samsung.com> Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14ext4: fix potential use after free in __ext4_journal_stopLukas Czerner
commit 6934da9238da947628be83635e365df41064b09b upstream. There is a use-after-free possibility in __ext4_journal_stop() in the case that we free the handle in the first jbd2_journal_stop() because we're referencing handle->h_err afterwards. This was introduced in 9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by storing the handle->h_err value beforehand and avoid referencing potentially freed handle. Fixes: 9705acd63b125dee8b15c705216d7186daea4625 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14ext4 crypto: replace some BUG_ON()'s with error checksTheodore Ts'o
commit 687c3c36e754a999a8263745b27965128db4fee5 upstream. Buggy (or hostile) userspace should not be able to cause the kernel to crash. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14ext4 crypto: fix memory leak in ext4_bio_write_page()Theodore Ts'o
commit 937d7b84dca58f2565715f2c8e52f14c3d65fb22 upstream. There are times when ext4_bio_write_page() is called even though we don't actually need to do any I/O. This happens when ext4_writepage() gets called by the jbd2 commit path when an inode needs to force its pages written out in order to provide data=ordered guarantees --- and a page is backed by an unwritten (e.g., uninitialized) block on disk, or if delayed allocation means the page's backing store hasn't been allocated yet. In that case, we need to skip the call to ext4_encrypt_page(), since in addition to wasting CPU, it leads to a bounce page and an ext4 crypto context getting leaked. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14btrfs: fix signed overflows in btrfs_sync_fileDavid Sterba
commit 9dcbeed4d7e11e1dcf5e55475de3754f0855d1c2 upstream. The calculation of range length in btrfs_sync_file leads to signed overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin. https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 The fsync call passes 0 and LLONG_MAX, the range length does not fit to loff_t and overflows, but the value is converted to u64 so it silently works as expected. The minimal fix is a typecast to u64, switching functions to take (start, end) instead of (start, len) would be more intrusive. Coccinelle script found that there's one more opencoded calculation of the length. <smpl> @@ loff_t start, end; @@ * end - start </smpl> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14Btrfs: fix race when listing an inode's xattrsFilipe Manana
commit f1cd1f0b7d1b5d4aaa5711e8f4e4898b0045cb6d upstream. When listing a inode's xattrs we have a time window where we race against a concurrent operation for adding a new hard link for our inode that makes us not return any xattr to user space. In order for this to happen, the first xattr of our inode needs to be at slot 0 of a leaf and the previous leaf must still have room for an inode ref (or extref) item, and this can happen because an inode's listxattrs callback does not lock the inode's i_mutex (nor does the VFS does it for us), but adding a hard link to an inode makes the VFS lock the inode's i_mutex before calling the inode's link callback. If we have the following leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 XATTR_ITEM 12345), ... ] slot N - 2 slot N - 1 slot 0 The race illustrated by the following sequence diagram is possible: CPU 1 CPU 2 btrfs_listxattr() searches for key (257 XATTR_ITEM 0) gets path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() releases the path adds key (257 INODE_REF 666) to the end of leaf X (slot N), and leaf X now has N + 1 items searches for the key (257 INODE_REF 256), with path->keep_locks == 1, because that is the last key it saw in leaf X before releasing the path ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in leaf X, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 666) btrfs_listxattr's loop iteration sees that the type of the key pointed by the path is different from the type BTRFS_XATTR_ITEM_KEY and so it breaks the loop and stops looking for more xattr items --> the application doesn't get any xattr listed for our inode So fix this by breaking the loop only if the key's type is greater than BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14Btrfs: fix race leading to BUG_ON when running delalloc for nodatacowFilipe Manana
commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d upstream. If we are using the NO_HOLES feature, we have a tiny time window when running delalloc for a nodatacow inode where we can race with a concurrent link or xattr add operation leading to a BUG_ON. This happens because at run_delalloc_nocow() we end up casting a leaf item of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a file extent item (struct btrfs_file_extent_item) and then analyse its extent type field, which won't match any of the expected extent types (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an explicit BUG_ON(1). The following sequence diagram shows how the race happens when running a no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 (Note the implicit hole for inode 257 regarding the [0, 8K[ range) CPU 1 CPU 2 run_dealloc_nocow() btrfs_lookup_file_extent() --> searches for a key with value (257 EXTENT_DATA 4096) in the fs/subvol tree --> returns us a path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() --> releases the path hard link added to our inode, with key (257 INODE_REF 500) added to the end of leaf X, so leaf X now has N + 1 keys --> searches for the key (257 INODE_REF 256), because it was the last key in leaf X before it released the path, with path->keep_locks set to 1 --> ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in the leaf, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 500) the loop iteration of run_dealloc_nocow() does not break out the loop and continues because the key referenced in the path at path->nodes[0] and path->slots[0] is for inode 257, its type is < BTRFS_EXTENT_DATA_KEY and its offset (500) is less then our delalloc range's end (8192) the item pointed by the path, an inode reference item, is (incorrectly) interpreted as a file extent item and we get an invalid extent type, leading to the BUG_ON(1): if (extent_type == BTRFS_FILE_EXTENT_REG || extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) } else { BUG_ON(1) } The same can happen if a xattr is added concurrently and ends up having a key with an offset smaller then the delalloc's range end. So fix this by skipping keys with a type smaller than BTRFS_EXTENT_DATA_KEY. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14Btrfs: fix race leading to incorrect item deletion when dropping extentsFilipe Manana
commit aeafbf8486c9e2bd53f5cc3c10c0b7fd7149d69c upstream. While running a stress test I got the following warning triggered: [191627.672810] ------------[ cut here ]------------ [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]() (...) [191627.701485] Call Trace: [191627.702037] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [191627.702992] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2 [191627.704091] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [191627.705380] [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.706637] [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c [191627.707789] [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.709155] [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0 [191627.712444] [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18 [191627.714162] [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs] [191627.715887] [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs] [191627.717287] [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs] [191627.728865] [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs] [191627.730045] [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs] [191627.731256] [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs] [191627.732661] [<ffffffff81061119>] process_one_work+0x24c/0x4ae [191627.733822] [<ffffffff810615b0>] worker_thread+0x206/0x2c2 [191627.734857] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f [191627.736052] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f [191627.737349] [<ffffffff810669a6>] kthread+0xef/0xf7 [191627.738267] [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28 [191627.739330] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad [191627.741976] [<ffffffff81465592>] ret_from_fork+0x42/0x70 [191627.743080] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad [191627.744206] ---[ end trace bbfddacb7aaada8d ]--- $ cat -n fs/btrfs/file.c 691 int __btrfs_drop_extents(struct btrfs_trans_handle *trans, (...) 758 btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); 759 if (key.objectid > ino || 760 key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end) 761 break; 762 763 fi = btrfs_item_ptr(leaf, path->slots[0], 764 struct btrfs_file_extent_item); 765 extent_type = btrfs_file_extent_type(leaf, fi); 766 767 if (extent_type == BTRFS_FILE_EXTENT_REG || 768 extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) 774 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) 778 } else { 779 WARN_ON(1); 780 extent_end = search_start; 781 } (...) This happened because the item we were processing did not match a file extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this case we cast the item to a struct btrfs_file_extent_item pointer and then find a type field value that does not match any of the expected values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens due to a tiny time window where a race can happen as exemplified below. For example, consider the following scenario where we're using the NO_HOLES feature and we have the following two neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather than explicit because NO_HOLES is enabled). Now if our inode has an ordered extent for the range [4K, 8K[ that is finishing, the following can happen: CPU 1 CPU 2 btrfs_finish_ordered_io() insert_reserved_file_extent() __btrfs_drop_extents() Searches for the key (257 EXTENT_DATA 4096) through btrfs_lookup_file_extent() Key not found and we get a path where path->nodes[0] == leaf X and path->slots[0] == N Because path->slots[0] is >= btrfs_header_nritems(leaf X), we call btrfs_next_leaf() btrfs_next_leaf() releases the path inserts key (257 INODE_REF 4096) at the end of leaf X, leaf X now has N + 1 keys, and the new key is at slot N btrfs_next_leaf() searches for key (257 INODE_REF 256), with path->keep_locks set to 1, because it was the last key it saw in leaf X finds it in leaf X again and notices it's no longer the last key of the leaf, so it returns 0 with path->nodes[0] == leaf X and path->slots[0] == N (which is now < btrfs_header_nritems(leaf X)), pointing to the new key (257 INODE_REF 4096) __btrfs_drop_extents() casts the item at path->nodes[0], slot path->slots[0], to a struct btrfs_file_extent_item - it does not skip keys for the target inode with a type less than BTRFS_EXTENT_DATA_KEY (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY) sees a bogus value for the type field triggering the WARN_ON in the trace shown above, and sets extent_end = search_start (4096) does the if-then-else logic to fixup 0 length extent items created by a past bug from hole punching: if (extent_end == key.offset && extent_end >= search_start) goto delete_extent_item; that evaluates to true and it ends up deleting the key pointed to by path->slots[0], (257 INODE_REF 4096), from leaf X The same could happen for example for a xattr that ends up having a key with an offset value that matches search_start (very unlikely but not impossible). So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are skipped, never casted to struct btrfs_file_extent_item and never deleted by accident. Also protect against the unexpected case of getting a key for a lower inode number by skipping that key and issuing a warning. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14Btrfs: fix regression when running delayed referencesFilipe Manana
commit 2c3cf7d5f6105bb957df125dfce61d4483b8742d upstream. In the kernel 4.2 merge window we had a refactoring/rework of the delayed references implementation in order to fix certain problems with qgroups. However that rework introduced one more regression that leads to the following trace when running delayed references for metadata: [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832! [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2 [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000 [35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs] [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293 [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000 [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000 [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8 [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000 [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000 [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000 [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0 [35908.065201] Stack: [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000 [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8 [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000 [35908.065201] Call Trace: [35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs] [35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs] [35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs] [35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs] [35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs] [35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs] [35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs] [35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs] [35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs] [35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7 [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31 [35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs] [35908.065201] RSP <ffff88010c4cbb08> [35908.310885] ---[ end trace fe4299baf0666457 ]--- This happens because the new delayed references code no longer merges delayed references that have different sequence values. The following steps are an example sequence leading to this issue: 1) Transaction N starts, fs_info->tree_mod_seq has value 0; 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for bytenr A is created, with a value of 1 and a seq value of 0; 3) fs_info->tree_mod_seq is incremented to 1; 4) Extent buffer A is deleted through btrfs_del_items(), which calls btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The later returns the metadata extent associated to extent buffer A to the free space cache (the range is not pinned), because the extent buffer was created in the current transaction (N) and writeback never happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set in the extent buffer). This creates the delayed reference Ref2 for bytenr A, with a value of -1 and a seq value of 1; 5) Delayed reference Ref2 is not merged with Ref1 when we create it, because they have different sequence numbers (decided at add_delayed_ref_tail_merge()); 6) fs_info->tree_mod_seq is incremented to 2; 7) Some task attempts to allocate a new extent buffer (done at extent-tree.c:find_free_extent()), but due to heavy fragmentation and running low on metadata space the clustered allocation fails and we fall back to unclustered allocation, which finds the extent at offset A, so a new extent buffer at offset A is allocated. This creates delayed reference Ref3 for bytenr A, with a value of 1 and a seq value of 2; 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they all have different seq values; 9) We start running the delayed references (__btrfs_run_delayed_refs()); 10) The delayed Ref1 is the first one being applied, which ends up creating an inline extent backref in the extent tree; 10) Next the delayed reference Ref3 is selected for execution, and not Ref2, because select_delayed_ref() always gives a preference for positive references (that have an action of BTRFS_ADD_DELAYED_REF); 11) When running Ref3 we encounter alreay the inline extent backref in the extent tree at insert_inline_extent_backref(), which makes us hit the following BUG_ON: BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID); This is always true because owner corresponds to the level of the extent buffer/btree node in the btree. For the scenario described above we hit the BUG_ON because we never merge references that have different seq values. We used to do the merging before the 4.2 kernel, more specifically, before the commmits: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.") This issue became more exposed after the following change that was added to 4.2 as well: cffc3374e567 ("Btrfs: fix order by which delayed references are run") Which in turn fixed another regression by the two commits previously mentioned. So fix this by bringing back the delayed reference merge code, with the proper adaptations so that it operates against the new data structure (linked list vs old red black tree implementation). This issue was hit running fstest btrfs/063 in a loop. Several people have reported this issue in the mailing list when running on kernels 4.2+. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14Btrfs: fix truncation of compressed and inlined extentsFilipe Manana
commit 0305cd5f7fca85dae392b9ba85b116896eb7c1c7 upstream. When truncating a file to a smaller size which consists of an inline extent that is compressed, we did not discard (or made unusable) the data between the new file size and the old file size, wasting metadata space and allowing for the truncated data to be leaked and the data corruption/loss mentioned below. We were also not correctly decrementing the number of bytes used by the inode, we were setting it to zero, giving a wrong report for callers of the stat(2) syscall. The fsck tool also reported an error about a mismatch between the nbytes of the file versus the real space used by the file. Now because we weren't discarding the truncated region of the file, it was possible for a caller of the clone ioctl to actually read the data that was truncated, allowing for a security breach without requiring root access to the system, using only standard filesystem operations. The scenario is the following: 1) User A creates a file which consists of an inline and compressed extent with a size of 2000 bytes - the file is not accessible to any other users (no read, write or execution permission for anyone else); 2) The user truncates the file to a size of 1000 bytes; 3) User A makes the file world readable; 4) User B creates a file consisting of an inline extent of 2000 bytes; 5) User B issues a clone operation from user A's file into its own file (using a length argument of 0, clone the whole range); 6) User B now gets to see the 1000 bytes that user A truncated from its file before it made its file world readbale. User B also lost the bytes in the range [1000, 2000[ bytes from its own file, but that might be ok if his/her intention was reading stale data from user A that was never supposed to be public. Note that this contrasts with the case where we truncate a file from 2000 bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In this case reading any byte from the range [1000, 2000[ will return a value of 0x00, instead of the original data. This problem exists since the clone ioctl was added and happens both with and without my recent data loss and file corruption fixes for the clone ioctl (patch "Btrfs: fix file corruption and data loss after cloning inline extents"). So fix this by truncating the compressed inline extents as we do for the non-compressed case, which involves decompressing, if the data isn't already in the page cache, compressing the truncated version of the extent, writing the compressed content into the inline extent and then truncate it. The following test case for fstests reproduces the problem. In order for the test to pass both this fix and my previous fix for the clone ioctl that forbids cloning a smaller inline extent into a larger one, which is titled "Btrfs: fix file corruption and data loss after cloning inline extents", are needed. Without that other fix the test fails in a different way that does not leak the truncated data, instead part of destination file gets replaced with zeroes (because the destination file has a larger inline extent than the source). seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create our test files. File foo is going to be the source of a clone operation # and consists of a single inline extent with an uncompressed size of 512 bytes, # while file bar consists of a single inline extent with an uncompressed size of # 256 bytes. For our test's purpose, it's important that file bar has an inline # extent with a size smaller than foo's inline extent. $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128" \ -c "pwrite -S 0x2a 128 384" \ $SCRATCH_MNT/foo | _filter_xfs_io $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io # Now durably persist all metadata and data. We do this to make sure that we get # on disk an inline extent with a size of 512 bytes for file foo. sync # Now truncate our file foo to a smaller size. Because it consists of a # compressed and inline extent, btrfs did not shrink the inline extent to the # new size (if the extent was not compressed, btrfs would shrink it to 128 # bytes), it only updates the inode's i_size to 128 bytes. $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo # Now clone foo's inline extent into bar. # This clone operation should fail with errno EOPNOTSUPP because the source # file consists only of an inline extent and the file's size is smaller than # the inline extent of the destination (128 bytes < 256 bytes). However the # clone ioctl was not prepared to deal with a file that has a size smaller # than the size of its inline extent (something that happens only for compressed # inline extents), resulting in copying the full inline extent from the source # file into the destination file. # # Note that btrfs' clone operation for inline extents consists of removing the # inline extent from the destination inode and copy the inline extent from the # source inode into the destination inode, meaning that if the destination # inode's inline extent is larger (N bytes) than the source inode's inline # extent (M bytes), some bytes (N - M bytes) will be lost from the destination # file. Btrfs could copy the source inline extent's data into the destination's # inline extent so that we would not lose any data, but that's currently not # done due to the complexity that would be needed to deal with such cases # (specially when one or both extents are compressed), returning EOPNOTSUPP, as # it's normally not a very common case to clone very small files (only case # where we get inline extents) and copying inline extents does not save any # space (unlike for normal, non-inlined extents). $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar # Now because the above clone operation used to succeed, and due to foo's inline # extent not being shinked by the truncate operation, our file bar got the whole # inline extent copied from foo, making us lose the last 128 bytes from bar # which got replaced by the bytes in range [128, 256[ from foo before foo was # truncated - in other words, data loss from bar and being able to read old and # stale data from foo that should not be possible to read anymore through normal # filesystem operations. Contrast with the case where we truncate a file from a # size N to a smaller size M, truncate it back to size N and then read the range # [M, N[, we should always get the value 0x00 for all the bytes in that range. # We expected the clone operation to fail with errno EOPNOTSUPP and therefore # not modify our file's bar data/metadata. So its content should be 256 bytes # long with all bytes having the value 0xbb. # # Without the btrfs bug fix, the clone operation succeeded and resulted in # leaking truncated data from foo, the bytes that belonged to its range # [128, 256[, and losing data from bar in that same range. So reading the # file gave us the following content: # # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 # * # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a # * # 0000400 echo "File bar's content after the clone operation:" od -t x1 $SCRATCH_MNT/bar # Also because the foo's inline extent was not shrunk by the truncate # operation, btrfs' fsck, which is run by the fstests framework everytime a # test completes, failed reporting the following error: # # root 5 inode 257 errors 400, nbytes wrong status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14Btrfs: fix file corruption and data loss after cloning inline extentsFilipe Manana
commit 8039d87d9e473aeb740d4fdbd59b9d2f89b2ced9 upstream. Currently the clone ioctl allows to clone an inline extent from one file to another that already has other (non-inlined) extents. This is a problem because btrfs is not designed to deal with files having inline and regular extents, if a file has an inline extent then it must be the only extent in the file and must start at file offset 0. Having a file with an inline extent followed by regular extents results in EIO errors when doing reads or writes against the first 4K of the file. Also, the clone ioctl allows one to lose data if the source file consists of a single inline extent, with a size of N bytes, and the destination file consists of a single inline extent with a size of M bytes, where we have M > N. In this case the clone operation removes the inline extent from the destination file and then copies the inline extent from the source file into the destination file - we lose the M - N bytes from the destination file, a read operation will get the value 0x00 for any bytes in the the range [N, M] (the destination inode's i_size remained as M, that's why we can read past N bytes). So fix this by not allowing such destructive operations to happen and return errno EOPNOTSUPP to user space. Currently the fstest btrfs/035 tests the data loss case but it totally ignores this - i.e. expects the operation to succeed and does not check the we got data loss. The following test case for fstests exercises all these cases that result in file corruption and data loss: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" rm -f $seqres.full test_cloning_inline_extents() { local mkfs_opts=$1 local mount_opts=$2 _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1 _scratch_mount $mount_opts # File bar, the source for all the following clone operations, consists # of a single inline extent (50 bytes). $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \ | _filter_xfs_io # Test cloning into a file with an extent (non-inlined) where the # destination offset overlaps that extent. It should not be possible to # clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo data after clone operation:" # All bytes should have the value 0xaa (clone operation failed and did # not modify our file). od -t x1 $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io # Test cloning the inline extent against a file which has a hole in its # first 4K followed by a non-inlined extent. It should not be possible # as well to clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo2 data after clone operation:" # All bytes should have the value 0x00 (clone operation failed and did # not modify our file). od -t x1 $SCRATCH_MNT/foo2 $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io # Test cloning the inline extent against a file which has a size of zero # but has a prealloc extent. It should not be possible as well to clone # the inline extent from file bar into this file. $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "First 50 bytes of foo3 after clone operation:" # Should not be able to read any bytes, file has 0 bytes i_size (the # clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo3 $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io # Test cloning the inline extent against a file which consists of a # single inline extent that has a size not greater than the size of # bar's inline extent (40 < 50). # It should be possible to do the extent cloning from bar to this file. $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4 # Doing IO against any range in the first 4K of the file should work. echo "File foo4 data after clone operation:" # Must match file bar's content. od -t x1 $SCRATCH_MNT/foo4 $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io # Test cloning the inline extent against a file which consists of a # single inline extent that has a size greater than the size of bar's # inline extent (60 > 50). # It should not be possible to clone the inline extent from file bar # into this file. $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5 # Reading the file should not fail. echo "File foo5 data after clone operation:" # Must have a size of 60 bytes, with all bytes having a value of 0x03 # (the clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo5 # Test cloning the inline extent against a file which has no extents but # has a size greater than bar's inline extent (16K > 50). # It should not be possible to clone the inline extent from file bar # into this file. $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6 # Reading the file should not fail. echo "File foo6 data after clone operation:" # Must have a size of 16K, with all bytes having a value of 0x00 (the # clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo6 # Test cloning the inline extent against a file which has no extents but # has a size not greater than bar's inline extent (30 < 50). # It should be possible to clone the inline extent from file bar into # this file. $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7 # Reading the file should not fail. echo "File foo7 data after clone operation:" # Must have a size of 50 bytes, with all bytes having a value of 0xbb. od -t x1 $SCRATCH_MNT/foo7 # Test cloning the inline extent against a file which has a size not # greater than the size of bar's inline extent (20 < 50) but has # a prealloc extent that goes beyond the file's size. It should not be # possible to clone the inline extent from bar into this file. $XFS_IO_PROG -f -c "falloc -k 0 1M" \ -c "pwrite -S 0x88 0 20" \ $SCRATCH_MNT/foo8 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8 echo "File foo8 data after clone operation:" # Must have a size of 20 bytes, with all bytes having a value of 0x88 # (the clone operation did not modify our file). od -t x1 $SCRATCH_MNT/foo8 _scratch_unmount } echo -e "\nTesting without compression and without the no-holes feature...\n" test_cloning_inline_extents echo -e "\nTesting with compression and without the no-holes feature...\n" test_cloning_inline_extents "" "-o compress" echo -e "\nTesting without compression and with the no-holes feature...\n" test_cloning_inline_extents "-O no-holes" "" echo -e "\nTesting with compression and with the no-holes feature...\n" test_cloning_inline_extents "-O no-holes" "-o compress" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14btrfs: fix resending received snapshot with parentRobin Ruede
commit b96b1db039ebc584d03a9933b279e0d3e704c528 upstream. This fixes a regression introduced by 37b8d27d between v4.1 and v4.2. When a snapshot is received, its received_uuid is set to the original uuid of the subvolume. When that snapshot is then resent to a third filesystem, it's received_uuid is set to the second uuid instead of the original one. The same was true for the parent_uuid. This behaviour was partially changed in 37b8d27d, but in that patch only the parent_uuid was taken from the real original, not the uuid itself, causing the search for the parent to fail in the case below. This happens for example when trying to send a series of linked snapshots (e.g. created by snapper) from the backup file system back to the original one. The following commands reproduce the issue in v4.2.1 (no error in 4.1.6) # setup three test file systems for i in 1 2 3; do truncate -s 50M fs$i mkfs.btrfs fs$i mkdir $i mount fs$i $i done echo "content" > 1/testfile btrfs su snapshot -r 1/ 1/snap1 echo "changed content" > 1/testfile btrfs su snapshot -r 1/ 1/snap2 # works fine: btrfs send 1/snap1 | btrfs receive 2/ btrfs send -p 1/snap1 1/snap2 | btrfs receive 2/ # ERROR: could not find parent subvolume btrfs send 2/snap1 | btrfs receive 3/ btrfs send -p 2/snap1 2/snap2 | btrfs receive 3/ Signed-off-by: Robin Ruede <rruede+git@gmail.com> Fixes: 37b8d27de5d0 ("Btrfs: use received_uuid of parent during send") Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Ed Tomlinson <edt@aei.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-09fs/proc, core/debug: Don't expose absolute kernel addresses via wchanIngo Molnar
commit b2f73922d119686323f14fbbe46587f863852328 upstream. So the /proc/PID/stat 'wchan' field (the 30th field, which contains the absolute kernel address of the kernel function a task is blocked in) leaks absolute kernel addresses to unprivileged user-space: seq_put_decimal_ull(m, ' ', wchan); The absolute address might also leak via /proc/PID/wchan as well, if KALLSYMS is turned off or if the symbol lookup fails for some reason: static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { unsigned long wchan; char symname[KSYM_NAME_LEN]; wchan = get_wchan(task); if (lookup_symbol_name(wchan, symname) < 0) { if (!ptrace_may_access(task, PTRACE_MODE_READ)) return 0; seq_printf(m, "%lu", wchan); } else { seq_printf(m, "%s", symname); } return 0; } This isn't ideal, because for example it trivially leaks the KASLR offset to any local attacker: fomalhaut:~> printf "%016lx\n" $(cat /proc/$$/stat | cut -d' ' -f35) ffffffff8123b380 Most real-life uses of wchan are symbolic: ps -eo pid:10,tid:10,wchan:30,comm and procps uses /proc/PID/wchan, not the absolute address in /proc/PID/stat: triton:~/tip> strace -f ps -eo pid:10,tid:10,wchan:30,comm 2>&1 | grep wchan | tail -1 open("/proc/30833/wchan", O_RDONLY) = 6 There's one compatibility quirk here: procps relies on whether the absolute value is non-zero - and we can provide that functionality by outputing "0" or "1" depending on whether the task is blocked (whether there's a wchan address). These days there appears to be very little legitimate reason user-space would be interested in the absolute address. The absolute address is mostly historic: from the days when we didn't have kallsyms and user-space procps had to do the decoding itself via the System.map. So this patch sets all numeric output to "0" or "1" and keeps only symbolic output, in /proc/PID/wchan. ( The absolute sleep address can generally still be profiled via perf, by tasks with sufficient privileges. ) Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: kasan-dev <kasan-dev@googlegroups.com> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150930135917.GA3285@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-09btrfs: fix possible leak in btrfs_ioctl_balance()Christian Engelmayer
commit 0f89abf56abbd0e1c6e3cef9813e6d9f05383c1e upstream. Commit 8eb934591f8b ("btrfs: check unsupported filters in balance arguments") adds a jump to exit label out_bargs in case the argument check fails. At this point in addition to the bargs memory, the memory for struct btrfs_balance_control has already been allocated. Ownership of bctl is passed to btrfs_balance() in the good case, thus the memory is not freed due to the introduced jump. Make sure that the memory gets freed in any case as necessary. Detected by Coverity CID 1328378. Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-09ovl: fix open in stacked overlayMiklos Szeredi
commit 1c8a47df36d72ace8cf78eb6c228aa0f8027d3c2 upstream. If two overlayfs filesystems are stacked on top of each other, then we need recursion in ovl_d_select_inode(). I guess d_backing_inode() is supposed to do that. But currently it doesn't and that functionality is open coded in vfs_open(). This is now copied into ovl_d_select_inode() to fix this regression. Reported-by: Alban Crequy <alban.crequy@gmail.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay...") Cc: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-09ovl: fix dentry reference leakDavid Howells
commit ab79efab0a0ba01a74df782eb7fa44b044dae8b5 upstream. In ovl_copy_up_locked(), newdentry is leaked if the function exits through out_cleanup as this just to out after calling ovl_cleanup() - which doesn't actually release the ref on newdentry. The out_cleanup segment should instead exit through out2 as certainly newdentry leaks - and possibly upper does also, though this isn't caught given the catch of newdentry. Without this fix, something like the following is seen: BUG: Dentry ffff880023e9eb20{i=f861,n=#ffff880023e82d90} still in use (1) [unmount of tmpfs tmpfs] BUG: Dentry ffff880023ece640{i=0,n=bigfile} still in use (1) [unmount of tmpfs tmpfs] when unmounting the upper layer after an error occurred in copyup. An error can be induced by creating a big file in a lower layer with something like: dd if=/dev/zero of=/lower/a/bigfile bs=65536 count=1 seek=$((0xf000)) to create a large file (4.1G). Overlay an upper layer that is too small (on tmpfs might do) and then induce a copy up by opening it writably. Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-09ovl: use O_LARGEFILE in ovl_copy_up()David Howells
commit 0480334fa60488d12ae101a02d7d9e1a3d03d7dd upstream. Open the lower file with O_LARGEFILE in ovl_copy_up(). Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for catching 32-bit userspace dealing with a file large enough that it'll be mishandled if the application isn't aware that there might be an integer overflow. Inside the kernel, there shouldn't be any problems. Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-09ovl: free lower_mnt array in ovl_put_superKonstantin Khlebnikov
commit 5ffdbe8bf1e485026e1c7e4714d2841553cf0b40 upstream. This fixes memory leak after umount. Kmemleak report: unreferenced object 0xffff8800ba791010 (size 8): comm "mount", pid 2394, jiffies 4294996294 (age 53.920s) hex dump (first 8 bytes): 20 1c 13 02 00 88 ff ff ....... backtrace: [<ffffffff811f8cd4>] create_object+0x124/0x2c0 [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0 [<ffffffff811dffe6>] __kmalloc+0x106/0x340 [<ffffffffa0152bfc>] ovl_fill_super+0x55c/0x9b0 [overlay] [<ffffffff81200ac4>] mount_nodev+0x54/0xa0 [<ffffffffa0152118>] ovl_mount+0x18/0x20 [overlay] [<ffffffff81201ab3>] mount_fs+0x43/0x170 [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170 [<ffffffff812233ad>] do_mount+0x22d/0xdf0 [<ffffffff812242cb>] SyS_mount+0x7b/0xc0 [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: dd662667e6d3 ("ovl: add mutli-layer infrastructure") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-09ovl: free stack of paths in ovl_fill_superKonstantin Khlebnikov
commit 0f95502ad84874b3c05fc7cdd9d4d9d5cddf7859 upstream. This fixes small memory leak after mount. Kmemleak report: unreferenced object 0xffff88003683fe00 (size 16): comm "mount", pid 2029, jiffies 4294909563 (age 33.380s) hex dump (first 16 bytes): 20 27 1f bb 00 88 ff ff 40 4b 0f 36 02 88 ff ff '......@K.6.... backtrace: [<ffffffff811f8cd4>] create_object+0x124/0x2c0 [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0 [<ffffffff811dffe6>] __kmalloc+0x106/0x340 [<ffffffffa01b7a29>] ovl_fill_super+0x389/0x9a0 [overlay] [<ffffffff81200ac4>] mount_nodev+0x54/0xa0 [<ffffffffa01b7118>] ovl_mount+0x18/0x20 [overlay] [<ffffffff81201ab3>] mount_fs+0x43/0x170 [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170 [<ffffffff812233ad>] do_mount+0x22d/0xdf0 [<ffffffff812242cb>] SyS_mount+0x7b/0xc0 [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: a78d9f0d5d5c ("ovl: support multiple lower layers") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-27nfsd/blocklayout: accept any minlengthChristoph Hellwig
commit 8c3ad9cb7343dc5f61b8cf3cdbe1016c5e7c2c8b upstream. Recent Linux clients have started to send GETLAYOUT requests with minlength less than blocksize. Servers aren't really allowed to impose this kind of restriction on layouts; see RFC 5661 section 18.43.3 for details. This has been observed to cause indefinite hangs on fsx runs on some clients. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-27btrfs: fix use after free iterating extrefsChris Mason
commit dc6c5fb3b514221f2e9d21ee626a9d95d3418dff upstream. The code for btrfs inode-resolve has never worked properly for files with enough hard links to trigger extrefs. It was trying to get the leaf out of a path after freeing the path: btrfs_release_path(path); leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, slot); The fix here is to use the extent buffer we cloned just a little higher up to avoid deadlocks caused by using the leaf in the path. Signed-off-by: Chris Mason <clm@fb.com> cc: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-27btrfs: check unsupported filters in balance argumentsDavid Sterba
commit 8eb934591f8bf584969454a658f629cd06e59f3a upstream. We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22namei: results of d_is_negative() should be checked after dentry revalidationTrond Myklebust
commit daf3761c9fcde0f4ca64321cbed6c1c86d304193 upstream. Leandro Awa writes: "After switching to version 4.1.6, our parallelized and distributed workflows now fail consistently with errors of the form: T34: ./regex.c:39:22: error: config.h: No such file or directory From our 'git bisect' testing, the following commit appears to be the possible cause of the behavior we've been seeing: commit 766c4cbfacd8" Al Viro says: "What happens is that 766c4cbfacd8 got the things subtly wrong. We used to treat d_is_negative() after lookup_fast() as "fall with ENOENT". That was wrong - checking ->d_flags outside of ->d_seq protection is unreliable and failing with hard error on what should've fallen back to non-RCU pathname resolution is a bug. Unfortunately, we'd pulled the test too far up and ran afoul of another kind of staleness. The dentry might have been absolutely stable from the RCU point of view (and we might be on UP, etc), but stale from the remote fs point of view. If ->d_revalidate() returns "it's actually stale", dentry gets thrown away and the original code wouldn't even have looked at its ->d_flags. What we need is to check ->d_flags where 766c4cbfacd8 does (prior to ->d_seq validation) but only use the result in cases where we do not discard this dentry outright" Reported-by: Leandro Awa <lawa@nvidia.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911 Fixes: 766c4cbfacd8 ("namei: d_is_negative() should be checked...") Tested-by: Leandro Awa <lawa@nvidia.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22nfs/filelayout: Fix NULL reference caused by double freeing of fh_arrayKinglong Mee
commit 3ec0c97959abff33a42db9081c22132bcff5b4f2 upstream. If filelayout_decode_layout fail, _filelayout_free_lseg will causes a double freeing of fh_array. [ 1179.279800] BUG: unable to handle kernel NULL pointer dereference at (null) [ 1179.280198] IP: [<ffffffffa027222d>] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files] [ 1179.281010] PGD 0 [ 1179.281443] Oops: 0000 [#1] [ 1179.281831] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) xfs libcrc32c coretemp nfsd crct10dif_pclmul ppdev crc32_pclmul crc32c_intel auth_rpcgss ghash_clmulni_intel nfs_acl lockd vmw_balloon grace sunrpc parport_pc vmw_vmci parport shpchp i2c_piix4 vmwgfx drm_kms_helper ttm drm serio_raw mptspi scsi_transport_spi mptscsih e1000 mptbase ata_generic pata_acpi [last unloaded: fscache] [ 1179.283891] CPU: 0 PID: 13336 Comm: cat Tainted: G OE 4.3.0-rc1-pnfs+ #244 [ 1179.284323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 [ 1179.285206] task: ffff8800501d48c0 ti: ffff88003e3c4000 task.ti: ffff88003e3c4000 [ 1179.285668] RIP: 0010:[<ffffffffa027222d>] [<ffffffffa027222d>] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files] [ 1179.286612] RSP: 0018:ffff88003e3c77f8 EFLAGS: 00010202 [ 1179.287092] RAX: 0000000000000000 RBX: ffff88001fe78900 RCX: 0000000000000000 [ 1179.287731] RDX: ffffea0000f40760 RSI: ffff88001fe789c8 RDI: ffff88001fe789c0 [ 1179.288383] RBP: ffff88003e3c7810 R08: ffffea0000f40760 R09: 0000000000000000 [ 1179.289170] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88001fe789c8 [ 1179.289959] R13: ffff88001fe789c0 R14: ffff88004ec05a80 R15: ffff88004f935b88 [ 1179.290791] FS: 00007f4e66bb5700(0000) GS:ffffffff81c29000(0000) knlGS:0000000000000000 [ 1179.291580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1179.292209] CR2: 0000000000000000 CR3: 00000000203f8000 CR4: 00000000001406f0 [ 1179.292731] Stack: [ 1179.293195] ffff88001fe78900 00000000000000d0 ffff88001fe78178 ffff88003e3c7868 [ 1179.293676] ffffffffa0272737 0000000000000001 0000000000000001 ffff88001fe78800 [ 1179.294151] 00000000614fffce ffffffff81727671 ffff88001fe78100 ffff88001fe78100 [ 1179.294623] Call Trace: [ 1179.295092] [<ffffffffa0272737>] filelayout_alloc_lseg+0xa7/0x2d0 [nfs_layout_nfsv41_files] [ 1179.295625] [<ffffffff81727671>] ? out_of_line_wait_on_bit+0x81/0xb0 [ 1179.296133] [<ffffffffa040407e>] pnfs_layout_process+0xae/0x320 [nfsv4] [ 1179.296632] [<ffffffffa03e0a01>] nfs4_proc_layoutget+0x2b1/0x360 [nfsv4] [ 1179.297134] [<ffffffffa0402983>] pnfs_update_layout+0x853/0xb30 [nfsv4] [ 1179.297632] [<ffffffffa039db24>] ? nfs_get_lock_context+0x74/0x170 [nfs] [ 1179.298158] [<ffffffffa0271807>] filelayout_pg_init_read+0x37/0x50 [nfs_layout_nfsv41_files] [ 1179.298834] [<ffffffffa03a72d9>] __nfs_pageio_add_request+0x119/0x460 [nfs] [ 1179.299385] [<ffffffffa03a6bd7>] ? nfs_create_request.part.9+0x37/0x2e0 [nfs] [ 1179.299872] [<ffffffffa03a7cc3>] nfs_pageio_add_request+0xa3/0x1b0 [nfs] [ 1179.300362] [<ffffffffa03a8635>] readpage_async_filler+0x85/0x260 [nfs] [ 1179.300907] [<ffffffff81180cb1>] read_cache_pages+0x91/0xd0 [ 1179.301391] [<ffffffffa03a85b0>] ? nfs_read_completion+0x220/0x220 [nfs] [ 1179.301867] [<ffffffffa03a8dc8>] nfs_readpages+0x128/0x200 [nfs] [ 1179.302330] [<ffffffff81180ef3>] __do_page_cache_readahead+0x203/0x280 [ 1179.302784] [<ffffffff81180dc8>] ? __do_page_cache_readahead+0xd8/0x280 [ 1179.303413] [<ffffffff81181116>] ondemand_readahead+0x1a6/0x2f0 [ 1179.303855] [<ffffffff81181371>] page_cache_sync_readahead+0x31/0x50 [ 1179.304286] [<ffffffff811750a6>] generic_file_read_iter+0x4a6/0x5c0 [ 1179.304711] [<ffffffffa03a0316>] ? __nfs_revalidate_mapping+0x1f6/0x240 [nfs] [ 1179.305132] [<ffffffffa039ccf2>] nfs_file_read+0x52/0xa0 [nfs] [ 1179.305540] [<ffffffff811e343c>] __vfs_read+0xcc/0x100 [ 1179.305936] [<ffffffff811e3d15>] vfs_read+0x85/0x130 [ 1179.306326] [<ffffffff811e4a98>] SyS_read+0x58/0xd0 [ 1179.306708] [<ffffffff8172caaf>] entry_SYSCALL_64_fastpath+0x12/0x76 [ 1179.307094] Code: c4 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 8b 07 49 89 f4 85 c0 74 47 48 8b 06 49 89 fd <48> 8b 38 48 85 ff 74 22 31 db eb 0c 48 63 d3 48 8b 3c d0 48 85 [ 1179.308357] RIP [<ffffffffa027222d>] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files] [ 1179.309177] RSP <ffff88003e3c77f8> [ 1179.309582] CR2: 0000000000000000 Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: William Dauchy <william@gandi.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22vfs: Test for and handle paths that are unreachable from their mnt_rootEric W. Biederman
commit 397d425dc26da728396e66d392d5dcb8dac30c37 upstream. In rare cases a directory can be renamed out from under a bind mount. In those cases without special handling it becomes possible to walk up the directory tree to the root dentry of the filesystem and down from the root dentry to every other file or directory on the filesystem. Like division by zero .. from an unconnected path can not be given a useful semantic as there is no predicting at which path component the code will realize it is unconnected. We certainly can not match the current behavior as the current behavior is a security hole. Therefore when encounting .. when following an unconnected path return -ENOENT. - Add a function path_connected to verify path->dentry is reachable from path->mnt.mnt_root. AKA to validate that rename did not do something nasty to the bind mount. To avoid races path_connected must be called after following a path component to it's next path component. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22dcache: Handle escaped paths in prepend_pathEric W. Biederman
commit cde93be45a8a90d8c264c776fab63487b5038a65 upstream. A rename can result in a dentry that by walking up d_parent will never reach it's mnt_root. For lack of a better term I call this an escaped path. prepend_path is called by four different functions __d_path, d_absolute_path, d_path, and getcwd. __d_path only wants to see paths are connected to the root it passes in. So __d_path needs prepend_path to return an error. d_absolute_path similarly wants to see paths that are connected to some root. Escaped paths are not connected to any mnt_root so d_absolute_path needs prepend_path to return an error greater than 1. So escaped paths will be treated like paths on lazily unmounted mounts. getcwd needs to prepend "(unreachable)" so getcwd also needs prepend_path to return an error. d_path is the interesting hold out. d_path just wants to print something, and does not care about the weird cases. Which raises the question what should be printed? Given that <escaped_path>/<anything> should result in -ENOENT I believe it is desirable for escaped paths to be printed as empty paths. As there are not really any meaninful path components when considered from the perspective of a mount tree. So tweak prepend_path to return an empty path with an new error code of 3 when it encounters an escaped path. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22UBIFS: Kill unneeded locking in ubifs_init_securityRichard Weinberger
commit cf6f54e3f133229f02a90c04fe0ff9dd9d3264b4 upstream. Fixes the following lockdep splat: [ 1.244527] ============================================= [ 1.245193] [ INFO: possible recursive locking detected ] [ 1.245193] 4.2.0-rc1+ #37 Not tainted [ 1.245193] --------------------------------------------- [ 1.245193] cp/742 is trying to acquire lock: [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff812b3f69>] ubifs_init_security+0x29/0xb0 [ 1.245193] [ 1.245193] but task is already holding lock: [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff81198e7f>] path_openat+0x3af/0x1280 [ 1.245193] [ 1.245193] other info that might help us debug this: [ 1.245193] Possible unsafe locking scenario: [ 1.245193] [ 1.245193] CPU0 [ 1.245193] ---- [ 1.245193] lock(&sb->s_type->i_mutex_key#9); [ 1.245193] lock(&sb->s_type->i_mutex_key#9); [ 1.245193] [ 1.245193] *** DEADLOCK *** [ 1.245193] [ 1.245193] May be due to missing lock nesting notation [ 1.245193] [ 1.245193] 2 locks held by cp/742: [ 1.245193] #0: (sb_writers#5){.+.+.+}, at: [<ffffffff811ad37f>] mnt_want_write+0x1f/0x50 [ 1.245193] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff81198e7f>] path_openat+0x3af/0x1280 [ 1.245193] [ 1.245193] stack backtrace: [ 1.245193] CPU: 2 PID: 742 Comm: cp Not tainted 4.2.0-rc1+ #37 [ 1.245193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140816_022509-build35 04/01/2014 [ 1.245193] ffffffff8252d530 ffff88007b023a38 ffffffff814f6f49 ffffffff810b56c5 [ 1.245193] ffff88007c30cc80 ffff88007b023af8 ffffffff810a150d ffff88007b023a68 [ 1.245193] 000000008101302a ffff880000000000 00000008f447e23f ffffffff8252d500 [ 1.245193] Call Trace: [ 1.245193] [<ffffffff814f6f49>] dump_stack+0x4c/0x65 [ 1.245193] [<ffffffff810b56c5>] ? console_unlock+0x1c5/0x510 [ 1.245193] [<ffffffff810a150d>] __lock_acquire+0x1a6d/0x1ea0 [ 1.245193] [<ffffffff8109fa78>] ? __lock_is_held+0x58/0x80 [ 1.245193] [<ffffffff810a1a93>] lock_acquire+0xd3/0x270 [ 1.245193] [<ffffffff812b3f69>] ? ubifs_init_security+0x29/0xb0 [ 1.245193] [<ffffffff814fc83b>] mutex_lock_nested+0x6b/0x3a0 [ 1.245193] [<ffffffff812b3f69>] ? ubifs_init_security+0x29/0xb0 [ 1.245193] [<ffffffff812b3f69>] ? ubifs_init_security+0x29/0xb0 [ 1.245193] [<ffffffff812b3f69>] ubifs_init_security+0x29/0xb0 [ 1.245193] [<ffffffff8128e286>] ubifs_create+0xa6/0x1f0 [ 1.245193] [<ffffffff81198e7f>] ? path_openat+0x3af/0x1280 [ 1.245193] [<ffffffff81195d15>] vfs_create+0x95/0xc0 [ 1.245193] [<ffffffff8119929c>] path_openat+0x7cc/0x1280 [ 1.245193] [<ffffffff8109ffe3>] ? __lock_acquire+0x543/0x1ea0 [ 1.245193] [<ffffffff81088f20>] ? sched_clock_cpu+0x90/0xc0 [ 1.245193] [<ffffffff81088c00>] ? calc_global_load_tick+0x60/0x90 [ 1.245193] [<ffffffff81088f20>] ? sched_clock_cpu+0x90/0xc0 [ 1.245193] [<ffffffff811a9cef>] ? __alloc_fd+0xaf/0x180 [ 1.245193] [<ffffffff8119ac55>] do_filp_open+0x75/0xd0 [ 1.245193] [<ffffffff814ffd86>] ? _raw_spin_unlock+0x26/0x40 [ 1.245193] [<ffffffff811a9cef>] ? __alloc_fd+0xaf/0x180 [ 1.245193] [<ffffffff81189bd9>] do_sys_open+0x129/0x200 [ 1.245193] [<ffffffff81189cc9>] SyS_open+0x19/0x20 [ 1.245193] [<ffffffff81500717>] entry_SYSCALL_64_fastpath+0x12/0x6f While the lockdep splat is a false positive, becuase path_openat holds i_mutex of the parent directory and ubifs_init_security() tries to acquire i_mutex of a new inode, it reveals that taking i_mutex in ubifs_init_security() is in vain because it is only being called in the inode allocation path and therefore nobody else can see the inode yet. Reported-and-tested-by: Boris Brezillon <boris.brezillon@free-electrons.com> Reviewed-and-tested-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: dedekind1@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22cifs: use server timestamp for ntlmv2 authenticationPeter Seiderer
commit 98ce94c8df762d413b3ecb849e2b966b21606d04 upstream. Linux cifs mount with ntlmssp against an Mac OS X (Yosemite 10.10.5) share fails in case the clocks differ more than +/-2h: digest-service: digest-request: od failed with 2 proto=ntlmv2 digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2 Fix this by (re-)using the given server timestamp for the ntlmv2 authentication (as Windows 7 does). A related problem was also reported earlier by Namjae Jaen (see below): Windows machine has extended security feature which refuse to allow authentication when there is time difference between server time and client time when ntlmv2 negotiation is used. This problem is prevalent in embedded enviornment where system time is set to default 1970. Modern servers send the server timestamp in the TargetInfo Av_Pair structure in the challenge message [see MS-NLMP 2.2.2.1] In [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must use the server provided timestamp if present OR current time if it is not Reported-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Peter Seiderer <ps.report@gmx.net> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22Do not fall back to SMBWriteX in set_file_size error casesSteve French
commit 646200a041203f440fb6fcf9cacd9efeda9de74c upstream. The error paths in set_file_size for cifs and smb3 are incorrect. In the unlikely event that a server did not support set file info of the file size, the code incorrectly falls back to trying SMBWriteX (note that only the original core SMB Write, used for example by DOS, can set the file size this way - this actually does not work for the more recent SMBWriteX). The idea was since the old DOS SMB Write could set the file size if you write zero bytes at that offset then use that if server rejects the normal set file info call. Fortunately the SMBWriteX will never be sent on the wire (except when file size is zero) since the length and offset fields were reversed in the two places in this function that call SMBWriteX causing the fall back path to return an error. It is also important to never call an SMB request from an SMB2/sMB3 session (which theoretically would be possible, and can cause a brief session drop, although the client recovers) so this should be fixed. In practice this path does not happen with modern servers but the error fall back to SMBWriteX is clearly wrong. Removing the calls to SMBWriteX in the error paths in cifs_set_file_size Pointed out by PaX/grsecurity team Signed-off-by: Steve French <steve.french@primarydata.com> Reported-by: PaX Team <pageexec@freemail.hu> CC: Emese Revfy <re.emese@gmail.com> CC: Brad Spengler <spender@grsecurity.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22disabling oplocks/leases via module parm enable_oplocks broken for SMB3Steve French
commit e0ddde9d44e37fbc21ce893553094ecf1a633ab5 upstream. leases (oplocks) were always requested for SMB2/SMB3 even when oplocks disabled in the cifs.ko module. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Chandrika Srinivasan <chandrika.srinivasan@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22Fix sec=krb5 on smb3 mountsSteve French
commit ceb1b0b9b4d1089e9f2731a314689ae17784c861 upstream. Kerberos, which is very important for security, was only enabled for CIFS not SMB2/SMB3 mounts (e.g. vers=3.0) Patch based on the information detailed in http://thread.gmane.org/gmane.linux.kernel.cifs/10081/focus=10307 to enable Kerberized SMB2/SMB3 a) SMB2_negotiate: enable/use decode_negTokenInit in SMB2_negotiate b) SMB2_sess_setup: handle Kerberos sectype and replicate Kerberos SMB1 processing done in sess_auth_kerberos Signed-off-by: Noel Power <noel.power@suse.com> Signed-off-by: Jim McDonough <jmcd@samba.org> Signed-off-by: Steve French <steve.french@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22NFS: Fix a write performance regressionTrond Myklebust
commit 8fa4592a14ebb3c22a21d846d1e4f65dab7d1a7c upstream. If all other conditions in nfs_can_extend_write() are met, and there are no locks, then we should be able to assume close-to-open semantics and the ability to extend our write to cover the whole page. With this patch, the xfstests generic/074 test completes in 242s instead of >1400s on my test rig. Fixes: bd61e0a9c852 ("locks: convert posix locks to file_lock_context") Cc: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22nfs: fix pg_test page count calculationPeng Tao
commit 048883e0b934d9a5103d40e209cb14b7f33d2933 upstream. We really want sizeof(struct page *) instead. Otherwise we limit maximum IO size to 64 pages rather than 512 pages on a 64bit system. Fixes 2e11f829(nfs: cap request size to fit a kmalloced page array). Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Fixes: 2e11f8296d22 ("nfs: cap request size to fit a kmalloced page array") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22NFSv4: Recovery of recalled read delegations is brokenTrond Myklebust
commit 24311f884189d42d40354a6f38ca218eb9aeb811 upstream. When a read delegation is being recalled, and we're reclaiming the cached opens, we need to make sure that we only reclaim read-only modes. A previous attempt to do this, relied on retrieving the delegation type from the nfs4_opendata structure. Unfortunately, as Kinglong pointed out, this field can only be set when performing reboot recovery. Furthermore, if we call nfs4_open_recover(), then we end up clobbering the state->flags for all modes that we're not recovering... The fix is to have the delegation recall code pass this information to the recovery call, and then refactor the recovery code so that nfs4_open_delegation_recall() does not need to call nfs4_open_recover(). Reported-by: Kinglong Mee <kinglongmee@gmail.com> Fixes: 39f897fdbd46 ("NFSv4: When returning a delegation, don't...") Tested-by: Kinglong Mee <kinglongmee@gmail.com> Cc: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22NFS: Do cleanup before resetting pageio read/write to mdsKinglong Mee
commit 6f29b9bba7b08c6b1d6f2cc4cf750b342fc1946c upstream. There is a reference leak of layout segment after resetting pageio read/write to mds. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22nfs: fix v4.2 SEEK on files over 2 gigsJ. Bruce Fields
commit 306a5549355966e480e0dcacdc6b9321d153e0c0 upstream. We're incorrectly assigning a loff_t return to an int. If SEEK_HOLE or SEEK_DATA returns an offset over 2^31 then the application will see a weird lseek() result (usually -EIO). Fixes: bdcc2cd14e4e "NFSv4.2: handle NFS-specific llseek errors" Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22Btrfs: update fix for read corruption of compressed and shared extentsFilipe Manana
commit 808f80b46790f27e145c72112189d6a3be2bc884 upstream. My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of compressed and shared extents") was effective only if the compressed extents cover a file range with a length that is not a multiple of 16 pages. That's because the detection of when we reached a different range of the file that shares the same compressed extent as the previously processed range was done at extent_io.c:__do_contiguous_readpages(), which covers subranges with a length up to 16 pages, because extent_readpages() groups the pages in clusters no larger than 16 pages. So fix this by tracking the start of the previously processed file range's extent map at extent_readpages(). The following test case for fstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create our test file with a single extent of 64Kb that is going to # be compressed no matter which compression algo is used (zlib/lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone the compressed extent into an adjacent file offset. $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo echo "File digest before unmount:" md5sum $SCRATCH_MNT/foo | _filter_scratch # Remount the fs or clear the page cache to trigger the bug in # btrfs. Because the extent has an uncompressed length that is a # multiple of 16 pages, all the pages belonging to the second range # of the file (64K to 128K), which points to the same extent as the # first range (0K to 64K), had their contents full of zeroes instead # of the byte 0xaa. This was a bug exclusively in the read path of # compressed extents, the correct data was stored on disk, btrfs # just failed to fill in the pages correctly. _scratch_remount echo "File digest after remount:" # Must match the digest we got before. md5sum $SCRATCH_MNT/foo | _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Timofey Titovets <nefelim4ag@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22Btrfs: fix read corruption of compressed and shared extentsFilipe Manana
commit 005efedf2c7d0a270ffbe28d8997b03844f3e3e7 upstream. If a file has a range pointing to a compressed extent, followed by another range that points to the same compressed extent and a read operation attempts to read both ranges (either completely or part of them), the pages that correspond to the second range are incorrectly filled with zeroes. Consider the following example: File layout [0 - 8K] [8K - 24K] | | | | points to extent X, points to extent X, offset 4K, length of 8K offset 0, length 16K [extent X, compressed length = 4K uncompressed length = 16K] If a readpages() call spans the 2 ranges, a single bio to read the extent is submitted - extent_io.c:submit_extent_page() would only create a new bio to cover the second range pointing to the extent if the extent it points to had a different logical address than the extent associated with the first range. This has a consequence of the compressed read end io handler (compression.c:end_compressed_bio_read()) finish once the extent is decompressed into the pages covering the first range, leaving the remaining pages (belonging to the second range) filled with zeroes (done by compression.c:btrfs_clear_biovec_end()). So fix this by submitting the current bio whenever we find a range pointing to a compressed extent that was preceded by a range with a different extent map. This is the simplest solution for this corner case. Making the end io callback populate both ranges (or more, if we have multiple pointing to the same extent) is a much more complex solution since each bio is tightly coupled with a single extent map and the extent maps associated to the ranges pointing to the shared extent can have different offsets and lengths. The following test case for fstests triggers the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create a test file with a single extent that is compressed (the # data we write into it is highly compressible no matter which # compression algorithm is used, zlib or lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \ -c "pwrite -S 0xbb 4K 8K" \ -c "pwrite -S 0xcc 12K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone our extent into an adjacent offset. $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo # Same as before but for this file we clone the extent into a lower # file offset. $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \ -c "pwrite -S 0xbb 12K 8K" \ -c "pwrite -S 0xcc 20K 4K" \ $SCRATCH_MNT/bar | _filter_xfs_io $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/bar echo "File digests before unmounting filesystem:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch # Evicting the inode or clearing the page cache before reading # again the file would also trigger the bug - reads were returning # all bytes in the range corresponding to the second reference to # the extent with a value of 0, but the correct data was persisted # (it was a bug exclusively in the read path). The issue happened # only if the same readpages() call targeted pages belonging to the # first and second ranges that point to the same compressed extent. _scratch_remount echo "File digests after mounting filesystem again:" # Must match the same digests we got before. md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22btrfs: skip waiting on ordered range for special filesJeff Mahoney
commit a30e577c96f59b1e1678ea5462432b09bf7d5cbc upstream. In btrfs_evict_inode, we properly truncate the page cache for evicted inodes but then we call btrfs_wait_ordered_range for every inode as well. It's the right thing to do for regular files but results in incorrect behavior for device inodes for block devices. filemap_fdatawrite_range gets called with inode->i_mapping which gets resolved to the block device inode before getting passed to wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens next depends on whether there's an open file handle associated with the inode. If there is, we write to the block device, which is unexpected behavior. If there isn't, we through normally and inode->i_data is used. We can also end up racing against open/close which can result in crashes when i_mapping points to a block device inode that has been closed. Since there can't be any page cache associated with special file inodes, it's safe to skip the btrfs_wait_ordered_range call entirely and avoid the problem. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911 Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22ocfs2/dlm: fix deadlock when dispatch assert masterJoseph Qi
commit 012572d4fc2e4ddd5c8ec8614d51414ec6cae02a upstream. The order of the following three spinlocks should be: dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock But dlm_dispatch_assert_master() is called while holding dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls dlm_grab() which will take dlm_domain_lock. Once another thread (for example, dlm_query_join_handler) has already taken dlm_domain_lock, and tries to take dlm_ctxt->spinlock deadlock happens. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: "Junxiao Bi" <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22blockdev: don't set S_DAX for misaligned partitionsJeff Moyer
commit f0b2e563bc419df7c1b3d2f494574c25125f6aed upstream. The dax code doesn't currently support misaligned partitions, so disable O_DIRECT via dax until such time as that support materializes. Suggested-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-22dax: fix O_DIRECT I/O to the last block of a blockdevJeff Moyer
commit e94f5a2285fc94202a9efb2c687481f29b64132c upstream. commit bbab37ddc20b (block: Add support for DAX reads/writes to block devices) caused a regression in mkfs.xfs. That utility sets the block size of the device to the logical block size using the BLKBSZSET ioctl, and then issues a single sector read from the last sector of the device. This results in the dax_io code trying to do a page-sized read from 512 bytes from the end of the device. The result is -ERANGE being returned to userspace. The fix is to align the block to the page size before calling get_block. Thanks to willy for simplifying my original patch. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Linda Knippers <linda.knippers@hp.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>