summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2014-09-05Btrfs: fix task hang under heavy compressed writeLiu Bo
commit 9e0af23764344f7f1b68e4eefbe7dc865018b63d upstream. This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now migrates to use kernel workqueue, but it introduces this hang problem. Btrfs has a kind of work queued as an ordered way, which means that its ordered_func() must be processed in the way of FIFO, so it usually looks like -- normal_work_helper(arg) work = container_of(arg, struct btrfs_work, normal_work); work->func() <---- (we name it work X) for ordered_work in wq->ordered_list ordered_work->ordered_func() ordered_work->ordered_free() The hang is a rare case, first when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request btrfs_readpages() for page that is not in page cache __do_readpage() submit_extent_page() btrfs_submit_bio_hook() btrfs_bio_wq_end_io() submit_bio() end_workqueue_bio() <--(ret by the 1st endio) queue a work(named work Y) for the 2nd also the real endio() So the hang occurs when work Y's work_struct and work X's work_struct happens to share the same address. A bit more explanation, A,B,C -- struct btrfs_work arg -- struct work_struct kthread: worker_thread() pick up a work_struct from @worklist process_one_work(arg) worker->current_work = arg; <-- arg is A->normal_work worker->current_func(arg) normal_work_helper(arg) A = container_of(arg, struct btrfs_work, normal_work); A->func() A->ordered_func() A->ordered_free() <-- A gets freed B->ordered_func() submit_compressed_extents() find_free_extent() load_free_space_inode() ... <-- (the above readhead stack) end_workqueue_bio() btrfs_queue_work(work C) B->ordered_free() As if work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that kernel workqueue code worker_thread() still has worker->current_work pointer to be work A->normal_work's, ie. arg's address. Meanwhile, work C is allocated after work A is freed, work C->normal_work and work A->normal_work are likely to share the same address(I confirmed this with ftrace output, so I'm not just guessing, it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it(see find_worker_executing_work()), it'll think work C as a collision and skip then, which ends up nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides, there're other cases that can lead to deadlock, but the real problem is that all btrfs workqueue shares one work->func, -- normal_work_helper, so this makes each workqueue to have its own helper function, but only a wraper pf normal_work_helper. With this patch, I no long hit the above hang. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: fix filemap_flush call in btrfs_file_releaseChris Mason
commit f6dc45c7a93a011dff6eb9b2ffda59c390c7705a upstream. We should only be flushing on close if the file was flagged as needing it during truncate. I broke this with my ordered data vs transaction commit deadlock fix. Thanks to Miao Xie for catching this. Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Miao Xie <miaox@cn.fujitsu.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: fix crash on endio of reading corrupted blockLiu Bo
commit 38c1c2e44bacb37efd68b90b3f70386a8ee370ee upstream. The crash is ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:2124! [...] Workqueue: btrfs-endio normal_work_helper [btrfs] RIP: 0010:[<ffffffffa02d6055>] [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs] This is in fact a regression. It is because we forgot to increase @offset properly in reading corrupted block, so that the @offset remains, and this leads to checksum errors while reading left blocks queued up in the same bio, and then ends up with hiting the above BUG_ON. Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05btrfs: disable strict file flushes for renames and truncatesChris Mason
commit 8d875f95da43c6a8f18f77869f2ef26e9594fecc upstream. Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk. Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete. This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock. This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: fix compressed write corruption on enospcLiu Bo
commit ce62003f690dff38d3164a632ec69efa15c32cbf upstream. When failing to allocate space for the whole compressed extent, we'll fallback to uncompressed IO, but we've forgotten to redirty the pages which belong to this compressed extent, and these 'clean' pages will simply skip 'submit' part and go to endio directly, at last we got data corruption as we write nothing. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Tested-By: Martin Steigerwald <martin@lichtvoll.de> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: read lock extent buffer while walking backrefsFilipe Manana
commit 6f7ff6d7832c6be13e8c95598884dbc40ad69fb7 upstream. Before processing the extent buffer, acquire a read lock on it, so that we're safe against concurrent updates on the extent buffer. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: fix csum tree corruption, duplicate and outdated checksumsFilipe Manana
commit 27b9a8122ff71a8cadfbffb9c4f0694300464f3b upstream. Under rare circumstances we can end up leaving 2 versions of a checksum for the same file extent range. The reason for this is that after calling btrfs_next_leaf we process slot 0 of the leaf it returns, instead of processing the slot set in path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after btrfs_next_leaf() releases the path and before it searches for the next leaf, another task might cause a split of the next leaf, which migrates some of its keys to the leaf we were processing before calling btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the same leaf but with path->slots[0] having a slot number corresponding to the first new key it got, that is, a slot number that didn't exist before calling btrfs_next_leaf(), as the leaf now has more keys than it had before. So we must really process the returned leaf starting at path->slots[0] always, as it isn't always 0, and the key at slot 0 can have an offset much lower than our search offset/bytenr. For example, consider the following scenario, where we have: sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568 four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472 Leaf N: slot = 0 slot = btrfs_header_nritems() - 1 |-------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] | |-------------------------------------------------------------------| Leaf N + 1: slot = 0 slot = btrfs_header_nritems() - 1 |--------------------------------------------------------------------| | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 | |--------------------------------------------------------------------| Because we are at the last slot of leaf N, we call btrfs_next_leaf() to find the next highest key, which releases the current path and then searches for that next key. However after releasing the path and before finding that next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore btrfs_next_leaf() will returns us a path again with leaf N but with the slot pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N is then: slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1 |----------------------------------------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] | |----------------------------------------------------------------------------------------------------| And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump into the "insert:" label, which will set tmp to: tmp = min((sums->len - total_bytes) >> blocksize_bits, (next_offset - file_key.offset) >> blocksize_bits) = min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) = min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4 and ins_size = csum_size * tmp = 4 * 4 = 16 bytes. In other words, we insert a new csum item in the tree with key (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong, because the item with key (CSUM CSUM 40161280) (the one that was moved from leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288 bytes of our data and won't get those old checksums removed. So this leaves us 2 different checksums for 3 4kb blocks of data in the tree, and breaks the logical rule: Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover An obvious bad effect of this is that a subsequent csum tree lookup to get the checksum of any of the blocks with logical offset of 40161280, 40165376 or 40169472 (the last 3 4kb blocks of file data), will get the old checksums. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: Fix memory corruption by ulist_add_merge() on 32bit archTakashi Iwai
commit 4eb1f66dce6c4dc28dd90a7ffbe6b2b1cb08aa4e upstream. We've got bug reports that btrfs crashes when quota is enabled on 32bit kernel, typically with the Oops like below: BUG: unable to handle kernel NULL pointer dereference at 00000004 IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs] *pde = 00000000 Oops: 0000 [#1] SMP CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1 Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs] task: f1478130 ti: f147c000 task.ti: f147c000 EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0 EIP is at find_parent_nodes+0x360/0x1380 [btrfs] EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000 ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690 Stack: 00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050 00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000 00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000 Call Trace: [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs] [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs] [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs] [<c025e38b>] process_one_work+0x11b/0x390 [<c025eea1>] worker_thread+0x101/0x340 [<c026432b>] kthread+0x9b/0xb0 [<c0712a71>] ret_from_kernel_thread+0x21/0x30 [<c0264290>] kthread_create_on_node+0x110/0x110 This indicates a NULL corruption in prefs_delayed list. The further investigation and bisection pointed that the call of ulist_add_merge() results in the corruption. ulist_add_merge() takes u64 as aux and writes a 64bit value into old_aux. The callers of this function in backref.c, however, pass a pointer of a pointer to old_aux. That is, the function overwrites 64bit value on 32bit pointer. This caused a NULL in the adjacent variable, in this case, prefs_delayed. Here is a quick attempt to band-aid over this: a new function, ulist_add_merge_ptr() is introduced to pass/store properly a pointer value instead of u64. There are still ugly void ** cast remaining in the callers because void ** cannot be taken implicitly. But, it's safer than explicit cast to u64, anyway. Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046 Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We have two more fixes in my for-linus branch. I was hoping to also include a fix for a btrfs deadlock with compression enabled, but we're still nailing that one down" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: test for valid bdev before kobj removal in btrfs_rm_device Btrfs: fix abnormal long waiting in fsync
2014-07-19btrfs: test for valid bdev before kobj removal in btrfs_rm_deviceEric Sandeen
commit 99994cd btrfs: dev delete should remove sysfs entry added a btrfs_kobj_rm_device, which dereferences device->bdev... right after we check whether device->bdev might be NULL. I don't honestly know if it's possible to have a NULL device->bdev here, but assuming that it is (given the test), we need to move the kobject removal to be under that test. (Coverity spotted this) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-19Btrfs: fix abnormal long waiting in fsyncLiu Bo
xfstests generic/127 detected this problem. With commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8, now fsync will only flush data within the passed range. This is the cause of the above problem, -- btrfs's fsync has a stage called 'sync log' which will wait for all the ordered extents it've recorded to finish. In xfstests/generic/127, with mixed operations such as truncate, fallocate, punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will mmap, and then msync. And I find that msync will wait for quite a long time (about 20s in my case), thanks to ftrace, it turns out that the previous fallocate calls 'btrfs_wait_ordered_range()' to flush dirty pages, but as the range of dirty pages may be larger than 'btrfs_wait_ordered_range()' wants, there can be some ordered extents created but not getting corresponding pages flushed, then they're left in memory until we fsync which runs into the stage 'sync log', and fsync will just wait for the system writeback thread to flush those pages and get ordered extents finished, so the latency is inevitable. This adds a flush similar to btrfs_start_ordered_extent() in btrfs_wait_logged_extents() to fix that. Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've queued up a few fixes in my for-linus branch" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix crash when starting transaction Btrfs: fix btrfs_print_leaf for skinny metadata Btrfs: fix race of using total_bytes_pinned btrfs: use E2BIG instead of EIO if compression does not help btrfs: remove stale comment from btrfs_flush_all_pending_stuffs Btrfs: fix use-after-free when cloning a trailing file hole btrfs: fix null pointer dereference in btrfs_show_devname when name is null btrfs: fix null pointer dereference in clone_fs_devices when name is null btrfs: fix nossd and ssd_spread mount option regression Btrfs: fix race between balance recovery and root deletion Btrfs: atomically set inode->i_flags in btrfs_update_iflags btrfs: only unlock block in verify_parent_transid if we locked it Btrfs: assert send doesn't attempt to start transactions btrfs compression: reuse recently used workspace Btrfs: fix crash when mounting raid5 btrfs with missing disks btrfs: create sprout should rename fsid on the sysfs as well btrfs: dev replace should replace the sysfs entry btrfs: dev add should add its sysfs entry btrfs: dev delete should remove sysfs entry btrfs: rename add_device_membership to btrfs_kobj_add_device
2014-07-03Btrfs: fix crash when starting transactionFilipe Manana
Often when starting a transaction we commit the currently running transaction, which can end up writing block group caches when the current process has its journal_info set to NULL (and not to a transaction). This makes our assertion at btrfs_check_data_free_space() (current_journal != NULL) fail, resulting in a crash/hang. Therefore fix it by setting journal_info. Two different traces of this issue follow below. 1) [51502.241936] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670 [51502.242213] ------------[ cut here ]------------ [51502.242493] kernel BUG at fs/btrfs/ctree.h:3964! [51502.242669] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC (...) [51502.244010] Call Trace: [51502.244010] [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs] [51502.244010] [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs] [51502.244010] [<ffffffffa0357a6a>] commit_cowonly_roots+0x164/0x226 [btrfs] [51502.244010] [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs] [51502.244010] [<ffffffff8168ec7b>] ? _raw_spin_unlock+0x2b/0x40 [51502.244010] [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs] [51502.244010] [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs] [51502.244010] [<ffffffffa02d73e1>] __unlink_start_trans+0x31/0xe0 [btrfs] [51502.244010] [<ffffffffa02dea67>] btrfs_unlink+0x37/0xc0 [btrfs] [51502.244010] [<ffffffff811bb054>] ? do_unlinkat+0x114/0x2a0 [51502.244010] [<ffffffff811baebc>] vfs_unlink+0xcc/0x150 [51502.244010] [<ffffffff811bb1a0>] do_unlinkat+0x260/0x2a0 [51502.244010] [<ffffffff811a9ef4>] ? filp_close+0x64/0x90 [51502.244010] [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0 [51502.244010] [<ffffffff81349cab>] ? trace_hardirqs_on_thunk+0x3a/0x3f [51502.244010] [<ffffffff811be9eb>] SyS_unlinkat+0x1b/0x40 [51502.244010] [<ffffffff81698452>] system_call_fastpath+0x16/0x1b [51502.244010] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 71 13 36 a0 48 89 fe 31 c0 48 c7 c7 b8 43 36 a0 48 89 e5 e8 5d b0 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5 [51502.244010] RIP [<ffffffffa03575da>] assfail.constprop.88+0x1e/0x20 [btrfs] 2) [25405.097230] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670 [25405.097488] ------------[ cut here ]------------ [25405.097767] kernel BUG at fs/btrfs/ctree.h:3964! [25405.097940] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC (...) [25405.100008] Call Trace: [25405.100008] [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs] [25405.100008] [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs] [25405.100008] [<ffffffffa035755a>] commit_cowonly_roots+0x164/0x226 [btrfs] [25405.100008] [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs] [25405.100008] [<ffffffff8109c170>] ? bit_waitqueue+0xc0/0xc0 [25405.100008] [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs] [25405.100008] [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs] [25405.100008] [<ffffffffa02e3407>] btrfs_create+0x47/0x210 [btrfs] [25405.100008] [<ffffffffa02d74cc>] ? btrfs_permission+0x3c/0x80 [btrfs] [25405.100008] [<ffffffff811bc63b>] vfs_create+0x9b/0x130 [25405.100008] [<ffffffff811bcf19>] do_last+0x849/0xe20 [25405.100008] [<ffffffff811b9409>] ? link_path_walk+0x79/0x820 [25405.100008] [<ffffffff811bd5b5>] path_openat+0xc5/0x690 [25405.100008] [<ffffffff810ab07d>] ? trace_hardirqs_on+0xd/0x10 [25405.100008] [<ffffffff811cdcd2>] ? __alloc_fd+0x32/0x1d0 [25405.100008] [<ffffffff811be2a3>] do_filp_open+0x43/0xa0 [25405.100008] [<ffffffff811cddf1>] ? __alloc_fd+0x151/0x1d0 [25405.100008] [<ffffffff811abcfc>] do_sys_open+0x13c/0x230 [25405.100008] [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0 [25405.100008] [<ffffffff811abe12>] SyS_open+0x22/0x30 [25405.100008] [<ffffffff81698452>] system_call_fastpath+0x16/0x1b [25405.100008] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 51 13 36 a0 48 89 fe 31 c0 48 c7 c7 d0 43 36 a0 48 89 e5 e8 6d b5 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5 [25405.100008] RIP [<ffffffffa03570ca>] assfail.constprop.88+0x1e/0x20 [btrfs] Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03Btrfs: fix btrfs_print_leaf for skinny metadataJosef Bacik
We wouldn't actuall print the extent information if we had a skinny metadata item, this fixes that. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03Btrfs: fix race of using total_bytes_pinnedLiu Bo
This percpu counter @total_bytes_pinned is introduced to skip unnecessary operations of 'commit transaction', it accounts for those space we may free but are stuck in delayed refs. And we zero out @space_info->total_bytes_pinned every transaction period so we have a better idea of how much space we'll actually free up by committing this transaction. However, we do the 'zero out' part a little earlier, before we actually unpin space, so we end up returning ENOSPC when we actually have free space that's just unpinned from committing transaction. xfstests/generic/074 complained then. This fixes it by actually accounting the percpu pinned number when 'unpin', and since it's protected by space_info->lock, the race is gone now. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03btrfs: use E2BIG instead of EIO if compression does not helpDavid Sterba
Return codes got updated in 60e1975acb48fc3d74a3422b21dde74c977ac3d5 (btrfs: return errno instead of -1 from compression) lzo wrapper returns E2BIG in this case, do the same for zlib. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-07-03btrfs: remove stale comment from btrfs_flush_all_pending_stuffsDavid Sterba
Commit fcebe4562dec83b3f8d3088d77584727b09130b2 (Btrfs: rework qgroup accounting) removed the qgroup accounting after delayed refs. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-07-03Btrfs: fix use-after-free when cloning a trailing file holeFilipe Manana
The transaction handle was being used after being freed. Cc: Chris Mason <clm@fb.com> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03btrfs: fix null pointer dereference in btrfs_show_devname when name is nullAnand Jain
dev->name is null but missing flag is not set. Strictly speaking the missing flag should have been set, but there are more places where code just checks if name is null. For now this patch does the same. stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000064 IP: [<ffffffffa0228908>] btrfs_show_devname+0x58/0xf0 [btrfs] [<ffffffff81198879>] show_vfsmnt+0x39/0x130 [<ffffffff81178056>] m_show+0x16/0x20 [<ffffffff8117d706>] seq_read+0x296/0x390 [<ffffffff8115aa7d>] vfs_read+0x9d/0x160 [<ffffffff8115b549>] SyS_read+0x49/0x90 [<ffffffff817abe52>] system_call_fastpath+0x16/0x1b reproducer: mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2 btrfstune -S 1 /dev/sdg1 modprobe -r btrfs && modprobe btrfs mount -o degraded /dev/sdg1 /btrfs btrfs dev add /dev/sdg3 /btrfs Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03btrfs: fix null pointer dereference in clone_fs_devices when name is nullAnand Jain
when one of the device path is missing btrfs_device name is null. So this patch will check for that. stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: [<ffffffff812e18c0>] strlen+0x0/0x30 [<ffffffffa01cd92a>] ? clone_fs_devices+0xaa/0x160 [btrfs] [<ffffffffa01cdcf7>] btrfs_init_new_device+0x317/0xca0 [btrfs] [<ffffffff81155bca>] ? __kmalloc_track_caller+0x15a/0x1a0 [<ffffffffa01d6473>] btrfs_ioctl+0xaa3/0x2860 [btrfs] [<ffffffff81132a6c>] ? handle_mm_fault+0x48c/0x9c0 [<ffffffff81192a61>] ? __blkdev_put+0x171/0x180 [<ffffffff817a784c>] ? __do_page_fault+0x4ac/0x590 [<ffffffff81193426>] ? blkdev_put+0x106/0x110 [<ffffffff81179175>] ? mntput+0x35/0x40 [<ffffffff8116d4b0>] do_vfs_ioctl+0x460/0x4a0 [<ffffffff8115c72e>] ? ____fput+0xe/0x10 [<ffffffff81068033>] ? task_work_run+0xb3/0xd0 [<ffffffff8116d547>] SyS_ioctl+0x57/0x90 [<ffffffff817a793e>] ? do_page_fault+0xe/0x10 [<ffffffff817abe52>] system_call_fastpath+0x16/0x1b reproducer: mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2 btrfstune -S 1 /dev/sdg1 modprobe -r btrfs && modprobe btrfs mount -o degraded /dev/sdg1 /btrfs btrfs dev add /dev/sdg3 /btrfs Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03btrfs: fix nossd and ssd_spread mount option regressionEric Sandeen
The commit 0780253 btrfs: Cleanup the btrfs_parse_options for remount. broke ssd options quite badly; it stopped making ssd_spread imply ssd, and it made "nossd" unsettable. Put things back at least as well as they were before (though ssd mount option handling is still pretty odd: # mount -o "nossd,ssd_spread" works?) Reported-by: Roman Mamedov <rm@romanrm.net> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03Btrfs: fix race between balance recovery and root deletionWang Shilong
Balance recovery is called when RW mounting or remounting from RO to RW, it is called to finish roots merging. When doing balance recovery, relocation root's corresponding fs root(whose root refs is 0) might be destroyed by cleaner thread, this will make btrfs fail to mount. Fix this problem by holding @cleaner_mutex when doing balance recovery. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03Btrfs: atomically set inode->i_flags in btrfs_update_iflagsFilipe Manana
This change is based on the corresponding recent change for ext4: ext4: atomically set inode->i_flags in ext4_set_inode_flags() That has the following commit message that applies to btrfs as well: "Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time." Replacing EXT4_IMMUTABLE_FL and EXT4_APPEND_FL with BTRFS_INODE_IMMUTABLE and BTRFS_INODE_APPEND, respectively. Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: only unlock block in verify_parent_transid if we locked itJosef Bacik
This is a regression from my patch a26e8c9f75b0bfd8cccc9e8f110737b136eb5994, we need to only unlock the block if we were the one who locked it. Otherwise this will trip BUG_ON()'s in locking.c Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28Btrfs: assert send doesn't attempt to start transactionsFilipe Manana
When starting a transaction just assert that current->journal_info doesn't contain a send transaction stub, since send isn't supposed to start transactions and when it finishes (either successfully or not) it's supposed to set current->journal_info to NULL. This is motivated by the change titled: Btrfs: fix crash when starting transaction Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs compression: reuse recently used workspaceSergey Senozhatsky
Add compression `workspace' in free_workspace() to `idle_workspace' list head, instead of tail. So we have better chances to reuse most recently used `workspace'. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28Btrfs: fix crash when mounting raid5 btrfs with missing disksLiu Bo
The reproducer is $ mkfs.btrfs D1 D2 D3 -mraid5 $ mkfs.ext4 D2 && mkfs.ext4 D3 $ mount D1 /btrfs -odegraded ------------------- [ 87.672992] ------------[ cut here ]------------ [ 87.673845] kernel BUG at fs/btrfs/raid56.c:1828! ... [ 87.673845] RIP: 0010:[<ffffffff813efc7e>] [<ffffffff813efc7e>] __raid_recover_end_io+0x4ae/0x4d0 ... [ 87.673845] Call Trace: [ 87.673845] [<ffffffff8116bbc6>] ? mempool_free+0x36/0xa0 [ 87.673845] [<ffffffff813f0255>] raid_recover_end_io+0x75/0xa0 [ 87.673845] [<ffffffff81447c5b>] bio_endio+0x5b/0xa0 [ 87.673845] [<ffffffff81447cb2>] bio_endio_nodec+0x12/0x20 [ 87.673845] [<ffffffff81374621>] end_workqueue_fn+0x41/0x50 [ 87.673845] [<ffffffff813ad2aa>] normal_work_helper+0xca/0x2c0 [ 87.673845] [<ffffffff8108ba2b>] process_one_work+0x1eb/0x530 [ 87.673845] [<ffffffff8108b9c9>] ? process_one_work+0x189/0x530 [ 87.673845] [<ffffffff8108c15b>] worker_thread+0x11b/0x4f0 [ 87.673845] [<ffffffff8108c040>] ? rescuer_thread+0x290/0x290 [ 87.673845] [<ffffffff810939c4>] kthread+0xe4/0x100 [ 87.673845] [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220 [ 87.673845] [<ffffffff817e7c7c>] ret_from_fork+0x7c/0xb0 [ 87.673845] [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220 ------------------- It's because that we miscalculate @rbio->bbio->error so that it doesn't reach maximum of tolerable errors while it should have. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Tested-by: Satoru Takeuchi<takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: create sprout should rename fsid on the sysfs as wellAnand Jain
Creating sprout will change the fsid of the mounted root. do the same on the sysfs as well. reproducer: mount /dev/sdb /btrfs (seed disk) btrfs dev add /dev/sdc /btrfs mount -o rw,remount /btrfs btrfs dev del /dev/sdb /btrfs mount /dev/sdb /btrfs Error: kobject_add_internal failed for fe350492-dc28-4051-a601-e017b17e6145 with -EEXIST, don't try to register things with the same name in the same directory. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: dev replace should replace the sysfs entryAnand Jain
when we replace the device its corresponding sysfs entry has to be replaced as well Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: dev add should add its sysfs entryAnand Jain
we would need the device links to be created, when device is added. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: dev delete should remove sysfs entryAnand Jain
when we delete the device from the mounted btrfs, we would need its corresponding sysfs enty to be removed as well. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28btrfs: rename add_device_membership to btrfs_kobj_add_deviceAnand Jain
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This fixes some lockups in btrfs reported with rc1. It probably has some performance impact because it is backing off our spinning locks more often and switching to a blocking lock. I'll be able to nail that down next week, but for now I want to get the lockups taken care of. Otherwise some more stack reduction and assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix wrong error handle when the device is missing or is not writeable Btrfs: fix deadlock when mounting a degraded fs Btrfs: use bio_endio_nodec instead of open code Btrfs: fix NULL pointer crash when running balance and scrub concurrently btrfs: Skip scrubbing removed chunks to avoid -ENOENT. Btrfs: fix broken free space cache after the system crashed Btrfs: make free space cache write out functions more readable Btrfs: remove unused wait queue in struct extent_buffer Btrfs: fix deadlocks with trylock on tree nodes
2014-06-19Btrfs: fix wrong error handle when the device is missing or is not writeableMiao Xie
The original bio might be submitted, so we shoud increase bi_remaining to account for it when we deal with the error that the device is missing or is not writeable, or we would skip the endio handle. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19Btrfs: fix deadlock when mounting a degraded fsMiao Xie
The deadlock happened when we mount degraded filesystem, the reproduced steps are following: # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1> # echo 1 > /sys/block/`basename <dev0>`/device/delete # mount -o degraded <dev1> <mnt> The reason was that the counter -- bi_remaining was wrong. If the missing or unwriteable device was the last device in the mapping array, we would not submit the original bio, so we shouldn't increase bi_remaining of it in btrfs_end_bio(), or we would skip the final endio handle. Fix this problem by adding a flag into btrfs bio structure. If we submit the original bio, we will set the flag, and we increase bi_remaining counter, or we don't. Though there is another way to fix it -- decrease bi_remaining counter of the original bio when we make sure the original bio is not submitted, this method need add more check and is easy to make mistake. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19Btrfs: use bio_endio_nodec instead of open codeMiao Xie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19Btrfs: fix NULL pointer crash when running balance and scrub concurrentlyWang Shilong
While running balance, scrub, fsstress concurrently we hit the following kernel crash: [56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132 [56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 [56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs] [56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0 [56561.524325] Oops: 0000 [#1] SMP [....] [56561.527237] Call Trace: [56561.527309] [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs] [56561.527392] [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0 [56561.527476] [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs] [56561.527561] [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs] [56561.527639] [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0 [56561.527712] [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60 [56561.527788] [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0 [56561.527870] [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0 [56561.527941] [<ffffffff815707f7>] tracesys+0xdd/0xe2 [...] [56561.528304] RIP [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs] [56561.528395] RSP <ffff88004c0f5be8> [56561.528454] CR2: 0000000000000078 This is because in btrfs_relocate_chunk(), we will free @bdev directly while scrub may still hold extent mapping, and may access freed memory. Fix this problem by wrapping freeing @bdev work into free_extent_map() which is based on reference count. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19btrfs: Skip scrubbing removed chunks to avoid -ENOENT.Qu Wenruo
When run scrub with balance, sometimes -ENOENT will be returned, since in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but btrfs_lookup_block_group() will search block group in *MEMORY*, so if a chunk is removed but not committed, -ENOENT will be returned. However, there is no need to stop scrubbing since other chunks may be scrubbed without problem. So this patch changes the behavior to skip removed chunks and continue to scrub the rest. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19Btrfs: fix broken free space cache after the system crashedMiao Xie
When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group xxxx has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group xxx It is because we didn't update the metadata of the allocated space (in extent tree) until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time (commit transaction), those spaces were lost. In fact, only the free space that is used to store the file data had this problem, the others didn't because the metadata of them is updated in the same transaction context. There are many methods which can fix the above problem - track the allocated space, and write it out when we write out the free space cache - account the size of the allocated space that is used to store the file data, if the size is not zero, don't write out the free space cache. The first one is complex and may make the performance drop down. This patch chose the second method, we use a per-block-group variant to account the size of that allocated space. Besides that, we also introduce a per-block-group read-write semaphore to avoid the race between the allocation and the free space cache write out. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19Btrfs: make free space cache write out functions more readableMiao Xie
This patch makes the free space cache write out functions more readable, and beisdes that, it also reduces the stack space that the function -- __btrfs_write_out_cache uses from 194bytes to 144bytes. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19Btrfs: remove unused wait queue in struct extent_bufferFilipe Manana
The lock_wq wait queue is not used anywhere, therefore just remove it. On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320 bytes down to 296 bytes, which means a 4Kb page can now be used for 13 extent buffers instead of 12. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19Btrfs: fix deadlocks with trylock on tree nodesChris Mason
The Btrfs tree trylock function is poorly named. It always takes the spinlock and backs off if the blocking lock is held. This can lead to surprising lockups because people expect it to really be a trylock. This commit makes it a pure trylock, both for the spinlock and the blocking lock. It also reworks the nested lock handling slightly to avoid taking the read lock while a spinning write lock might be held. Signed-off-by: Chris Mason <clm@fb.com>
2014-06-14Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull more btrfs updates from Chris Mason: "This has a few fixes since our last pull and a new ioctl for doing btree searches from userland. It's very similar to the existing ioctl, but lets us return larger items back down to the app" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix error handling in create_pending_snapshot btrfs: fix use of uninit "ret" in end_extent_writepage() btrfs: free ulist in qgroup_shared_accounting() error path Btrfs: fix qgroups sanity test crash or hang btrfs: prevent RCU warning when dereferencing radix tree slot Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting btrfs: new ioctl TREE_SEARCH_V2 btrfs: tree_search, search_ioctl: direct copy to userspace btrfs: new function read_extent_buffer_to_user btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer btrfs: tree_search, search_ioctl: accept varying buffer btrfs: tree_search: eliminate redundant nr_items check
2014-06-13btrfs: fix error handling in create_pending_snapshotEric Sandeen
fcebe456 cut and pasted some code to a later point in create_pending_snapshot(), but didn't switch to the appropriate error handling for this stage of the function. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13btrfs: fix use of uninit "ret" in end_extent_writepage()Eric Sandeen
If this condition in end_extent_writepage() is false: if (tree->ops && tree->ops->writepage_end_io_hook) we will then test an uninitialized "ret" at: ret = ret < 0 ? ret : -EIO; The test for ret is for the case where ->writepage_end_io_hook failed, and we'd choose that ret as the error; but if there is no ->writepage_end_io_hook, nothing sets ret. Initializing ret to 0 should be sufficient; if writepage_end_io_hook wasn't set, (!uptodate) means non-zero err was passed in, so we choose -EIO in that case. Signed-of-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13btrfs: free ulist in qgroup_shared_accounting() error pathEric Sandeen
If tmp = ulist_alloc(GFP_NOFS) fails, we return without freeing the previously allocated qgroups = ulist_alloc(GFP_NOFS) and cause a memory leak. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13Btrfs: fix qgroups sanity test crash or hangFilipe Manana
Often when running the qgroups sanity test, a crash or a hang happened. This is because the extent buffer the test uses for the root node doesn't have an header level explicitly set, making it have a random level value. This is a problem when it's not zero for the btrfs_search_slot() calls the test ends up doing, resulting in crashes or hangs such as the following: [ 6454.127192] Btrfs loaded, debug=on, assert=on, integrity-checker=on (...) [ 6454.127760] BTRFS: selftest: Running qgroup tests [ 6454.127964] BTRFS: selftest: Running test_test_no_shared_qgroup [ 6454.127966] BTRFS: selftest: Qgroup basic add [ 6480.152005] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:5383] [ 6480.152005] Modules linked in: btrfs(+) xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4 i2c_core pcspkr evbug psmouse serio_raw e1000 [last unloaded: btrfs] [ 6480.152005] irq event stamp: 188448 [ 6480.152005] hardirqs last enabled at (188447): [<ffffffff8168ef5c>] restore_args+0x0/0x30 [ 6480.152005] hardirqs last disabled at (188448): [<ffffffff81698e6a>] apic_timer_interrupt+0x6a/0x80 [ 6480.152005] softirqs last enabled at (188446): [<ffffffff810516cf>] __do_softirq+0x1cf/0x450 [ 6480.152005] softirqs last disabled at (188441): [<ffffffff81051c25>] irq_exit+0xb5/0xc0 [ 6480.152005] CPU: 0 PID: 5383 Comm: modprobe Not tainted 3.15.0-rc8-fdm-btrfs-next-33+ #4 [ 6480.152005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 6480.152005] task: ffff8802146125a0 ti: ffff8800d0d00000 task.ti: ffff8800d0d00000 [ 6480.152005] RIP: 0010:[<ffffffff81349a63>] [<ffffffff81349a63>] __write_lock_failed+0x13/0x20 [ 6480.152005] RSP: 0018:ffff8800d0d038e8 EFLAGS: 00000287 [ 6480.152005] RAX: 0000000000000000 RBX: ffffffff8168ef5c RCX: 000005deb8525852 [ 6480.152005] RDX: 0000000000000000 RSI: 0000000000001d45 RDI: ffff8802105000b8 [ 6480.152005] RBP: ffff8800d0d038e8 R08: fffffe12710f63db R09: ffffffffa03196fb [ 6480.152005] R10: ffff8802146125a0 R11: ffff880214612e28 R12: ffff8800d0d03858 [ 6480.152005] R13: 0000000000000000 R14: ffff8800d0d00000 R15: ffff8802146125a0 [ 6480.152005] FS: 00007f14ff804700(0000) GS:ffff880215e00000(0000) knlGS:0000000000000000 [ 6480.152005] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 6480.152005] CR2: 00007fff4df0dac8 CR3: 00000000d1796000 CR4: 00000000000006f0 [ 6480.152005] Stack: [ 6480.152005] ffff8800d0d03908 ffffffff810ae967 0000000000000001 ffff8802105000b8 [ 6480.152005] ffff8800d0d03938 ffffffff8168e57e ffffffffa0319c16 0000000000000007 [ 6480.152005] ffff880210500000 ffff880210500100 ffff8800d0d039b8 ffffffffa0319c16 [ 6480.152005] Call Trace: [ 6480.152005] [<ffffffff810ae967>] do_raw_write_lock+0x47/0xa0 [ 6480.152005] [<ffffffff8168e57e>] _raw_write_lock+0x5e/0x80 [ 6480.152005] [<ffffffffa0319c16>] ? btrfs_tree_lock+0x116/0x270 [btrfs] [ 6480.152005] [<ffffffffa0319c16>] btrfs_tree_lock+0x116/0x270 [btrfs] [ 6480.152005] [<ffffffffa02b2acb>] btrfs_lock_root_node+0x3b/0x50 [btrfs] [ 6480.152005] [<ffffffffa02b81a6>] btrfs_search_slot+0x916/0xa20 [btrfs] [ 6480.152005] [<ffffffff811a727f>] ? create_object+0x23f/0x300 [ 6480.152005] [<ffffffffa02b9958>] btrfs_insert_empty_items+0x78/0xd0 [btrfs] [ 6480.152005] [<ffffffffa036041a>] insert_normal_tree_ref.constprop.4+0xa2/0x19a [btrfs] [ 6480.152005] [<ffffffffa03605c3>] test_no_shared_qgroup+0xb1/0x1ca [btrfs] [ 6480.152005] [<ffffffff8108cad6>] ? local_clock+0x16/0x30 [ 6480.152005] [<ffffffffa035ef8e>] btrfs_test_qgroups+0x1ae/0x1d7 [btrfs] [ 6480.152005] [<ffffffffa03a69d2>] ? ftrace_define_fields_btrfs_space_reservation+0xfd/0xfd [btrfs] [ 6480.152005] [<ffffffffa03a6a86>] init_btrfs_fs+0xb4/0x153 [btrfs] [ 6480.152005] [<ffffffff81000352>] do_one_initcall+0x102/0x150 [ 6480.152005] [<ffffffff8103d223>] ? set_memory_nx+0x43/0x50 [ 6480.152005] [<ffffffff81682668>] ? set_section_ro_nx+0x6d/0x74 [ 6480.152005] [<ffffffff810d91cc>] load_module+0x1cdc/0x2630 (...) Therefore initialize the extent buffer as an empty leaf (level 0). Issue easy to reproduce when btrfs is built as a module via: $ for ((i = 1; i <= 1000000; i++)); do rmmod btrfs; modprobe btrfs; done Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13btrfs: prevent RCU warning when dereferencing radix tree slotSasha Levin
Mark the dereference as protected by lock. Not doing so triggers an RCU warning since the radix tree assumed that RCU is in use. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13Btrfs: fix unfinished readahead thread for raid5/6 degraded mountingWang Shilong
Steps to reproduce: # mkfs.btrfs -f /dev/sd[b-f] -m raid5 -d raid5 # mkfs.ext4 /dev/sdc --->corrupt one of btrfs device # mount /dev/sdb /mnt -o degraded # btrfs scrub start -BRd /mnt This is because readahead would skip missing device, this is not true for RAID5/6, because REQ_GET_READ_MIRRORS return 1 for RAID5/6 block mapping. If expected data locates in missing device, readahead thread would not call __readahead_hook() which makes event @rc->elems=0 wait forever. Fix this problem by checking return value of btrfs_map_block(),we can only skip missing device safely if there are several mirrors. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13btrfs: new ioctl TREE_SEARCH_V2Gerhard Heift
This new ioctl call allows the user to supply a buffer of varying size in which a tree search can store its results. This is much more flexible if you want to receive items which are larger than the current fixed buffer of 3992 bytes or if you want to fetch more items at once. Items larger than this buffer are for example some of the type EXTENT_CSUM. Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>