Age | Commit message (Collapse) | Author |
|
commit bd3599a0e142cd73edd3b6801068ac3f48ac771a upstream.
When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
later ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allow blocking and the range not being locked (which
is the case during the clone operation) nor being the extent map flagged
as pinned (also the case for cloning).
The following example, turned into a test for fstests, reproduces the
issue:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
$ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
# reflink destination offset corresponds to the size of file bar,
# 2728Kb minus 4Kb.
$ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
$ md5sum /mnt/bar
95a95813a8c2abc9aa75a6c2914a077e /mnt/bar
<power fail>
$ mount /dev/sdb /mnt
$ md5sum /mnt/bar
207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
# digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
# power failure
In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.
Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger then 16Mb or gfp flags do not allow blocking).
CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ff3d27a048d926b3920ccdb75d98788c567cae0d ]
Under the following case, qgroup rescan can double account cowed tree
blocks:
In this case, extent tree only has one tree block.
-
| transid=5 last committed=4
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| | transid = 5
| |- qgroup_rescan_leaf()
| |- btrfs_search_slot_for_read() on extent tree
| Get the only extent tree block from commit root (transid = 4).
| Scan it, set qgroup_rescan_progress to the last
| EXTENT/META_ITEM + 1
| now qgroup_rescan_progress = A + 1.
|
| fs tree get CoWed, new tree block is at A + 16K
| transid 5 get committed
-
| transid=6 last committed=5
| btrfs_qgroup_rescan_worker()
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| | transid = 5
| |- qgroup_rescan_leaf()
| |- btrfs_search_slot_for_read() on extent tree
| Get the only extent tree block from commit root (transid = 5).
| scan it using qgroup_rescan_progress (A + 1).
| found new tree block beyong A, and it's fs tree block,
| account it to increase qgroup numbers.
-
In above case, tree block A, and tree block A + 16K get accounted twice,
while qgroup rescan should stop when it already reach the last leaf,
other than continue using its qgroup_rescan_progress.
Such case could happen by just looping btrfs/017 and with some
possibility it can hit such double qgroup accounting problem.
Fix it by checking the path to determine if we should finish qgroup
rescan, other than relying on next loop to exit.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3d3a2e610ea5e7c6d4f9481ecce5d8e2d8317843 ]
Currently the code assumes that there's an implied barrier by the
sequence of code preceding the wakeup, namely the mutex unlock.
As Nikolay pointed out:
I think this is wrong (not your code) but the original assumption that
the RELEASE semantics provided by mutex_unlock is sufficient.
According to memory-barriers.txt:
Section 'LOCK ACQUISITION FUNCTIONS' states:
(2) RELEASE operation implication:
Memory operations issued before the RELEASE will be completed before the
RELEASE operation has completed.
Memory operations issued after the RELEASE *may* be completed before the
RELEASE operation has completed.
(I've bolded the may portion)
The example given there:
As an example, consider the following:
*A = a;
*B = b;
ACQUIRE
*C = c;
*D = d;
RELEASE
*E = e;
*F = f;
The following sequence of events is acceptable:
ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
So if we assume that *C is modifying the flag which the waitqueue is checking,
and *E is the actual wakeup, then those accesses can be re-ordered...
IMHO this code should be considered broken...
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5811375325420052fcadd944792a416a43072b7f upstream.
Fstests generic/475 provides a way to fail metadata reads while
checking if checksum exists for the inode inside run_delalloc_nocow(),
and csum_exist_in_range() interprets error (-EIO) as inode having
checksum and makes its caller enter the cow path.
In case of free space inode, this ends up with a warning in
cow_file_range().
The same problem applies to btrfs_cross_ref_exist() since it may also
read metadata in between.
With this, run_delalloc_nocow() bails out when errors occur at the two
places.
cc: <stable@vger.kernel.org> v2.6.28+
Fixes: 17d217fe970d ("Btrfs: fix nodatasum handling in balancing code")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c5b4a50b74018b3677098151ec5f4fce07d5e6a0 upstream.
If we failed during a rename exchange operation after starting/joining a
transaction, we would end up replacing the return value, stored in the
local 'ret' variable, with the return value from btrfs_end_transaction().
So this could end up returning 0 (success) to user space despite the
operation having failed and aborted the transaction, because if there are
multiple tasks having a reference on the transaction at the time
btrfs_end_transaction() is called by the rename exchange, that function
returns 0 (otherwise it returns -EIO and not the original error value).
So fix this by not overwriting the return value on error after getting
a transaction handle.
Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ac0b4145d662a3b9e34085dea460fb06ede9b69b upstream.
[BUG]
Btrfs can create compressed extent without checksum (even though it
shouldn't), and if we then try to replace device containing such extent,
the result device will contain all the uncompressed data instead of the
compressed one.
Test case already submitted to fstests:
https://patchwork.kernel.org/patch/10442353/
[CAUSE]
When handling compressed extent without checksum, device replace will
goe into copy_nocow_pages() function.
In that function, btrfs will get all inodes referring to this data
extents and then use find_or_create_page() to get pages direct from that
inode.
The problem here is, pages directly from inode are always uncompressed.
And for compressed data extent, they mismatch with on-disk data.
Thus this leads to corrupted compressed data extent written to replace
device.
[FIX]
In this attempt, we could just remove the "optimization" branch, and let
unified scrub_pages() to handle it.
Although scrub_pages() won't bother reusing page cache, it will be a
little slower, but it does the correct csum checking and won't cause
such data corruption caused by "optimization".
Note about the fix: this is the minimal fix that can be backported to
older stable trees without conflicts. The whole callchain from
copy_nocow_pages() can be deleted, and will be in followup patches.
Fixes: ff023aac3119 ("Btrfs: add code to scrub to copy read data to another disk")
CC: stable@vger.kernel.org # 4.4+
Reported-by: James Harvey <jamespharvey20@gmail.com>
Reviewed-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ remove code removal, add note why ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fd4e994bd1f9dc9628e168a7f619bf69f6984635 upstream.
If we have invalid flags set, when we error out we must drop our writer
counter and free the buffer we allocated for the arguments. This bug is
trivially reproduced with the following program on 4.7+:
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>
int main(int argc, char **argv)
{
struct btrfs_ioctl_vol_args_v2 vol_args = {
.flags = UINT64_MAX,
};
int ret;
int fd;
if (argc != 2) {
fprintf(stderr, "usage: %s PATH\n", argv[0]);
return EXIT_FAILURE;
}
fd = open(argv[1], O_WRONLY);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args);
if (ret == -1)
perror("ioctl");
close(fd);
return EXIT_SUCCESS;
}
When unmounting the filesystem, we'll hit the
WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the
filesystem to be remounted read-only as the writer count will stay
lifted.
Fixes: 6b526ed70cf1 ("btrfs: introduce device delete by devid")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b5c40d598f5408bd0ca22dfffa82f03cd9433f23 upstream.
In btrfs_clone_files(), we must check the NODATASUM flag while the
inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
will change the flags after we check and we can end up with a party
checksummed file.
The race window is only a few instructions in size, between the if and
the locks which is:
3834 if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
3835 return -EISDIR;
where the setflags must be run and toggle the NODATASUM flag (provided
the file size is 0). The clone will block on the inode lock, segflags
takes the inode lock, changes flags, releases log and clone continues.
Not impossible but still needs a lot of bad luck to hit unintentionally.
Fixes: 0e7b824c4ef9 ("Btrfs: don't make a file partly checksummed through file clone")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8810f7517a3bc4ca2d41d022446d3f5fd6b77c09 ]
There is a scenario that can end up with rebuild process failing to
return good content, i.e.
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,
- the 1st retry is to rebuild with all other stripes, it'll eventually
be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
that it will do raid6 style rebuild,
however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content, and users will think of this as data loss.
More seriouly, if the loss happens on some important internal btree
roots, it could refuse to mount.
This extends btrfs to do more retries and each retry fails only one
stripe. Since raid6 can tolerate 2 disk failures, if there is one
more failure besides the failure on which we're recovering, this can
always work.
The worst case is to retry as many times as the number of raid6 disks,
but given the fact that such a scenario is really rare in practice,
it's still acceptable.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit 186a6519dc94964a4c5c68fca482f20f71551f26.
This commit used an incorrect log message.
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e2731e55884f2138a252b0a3d7b24d57e49c3c59 upstream.
btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34).
So just define that in kernel so that we know its been used.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8a5a916d9a35e13576d79cc16e24611821b13e34 ]
While running btrfs/011, I hit the following lockdep splat.
This is the important bit:
pcpu_alloc+0x1ac/0x5e0
__percpu_counter_init+0x4e/0xb0
btrfs_init_fs_root+0x99/0x1c0 [btrfs]
btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
resolve_indirect_refs+0x130/0x830 [btrfs]
find_parent_nodes+0x69e/0xff0 [btrfs]
btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
btrfs_find_all_roots+0x50/0x70 [btrfs]
btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
The percpu_counter_init call in btrfs_alloc_subvolume_writers
uses GFP_KERNEL, which we can't do during transaction commit.
This switches it to GFP_NOFS.
========================================================
WARNING: possible irq lock inversion dependency detected
4.12.14-kvmsmall #8 Tainted: G W
--------------------------------------------------------
kswapd0/50 just changed the state of lock:
(&delayed_node->mutex){+.+.-.}, at: [<ffffffffc06994fa>] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
(pcpu_alloc_mutex){+.+.+.}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(pcpu_alloc_mutex);
local_irq_disable();
lock(&delayed_node->mutex);
lock(&found->groups_sem);
<Interrupt>
lock(&delayed_node->mutex);
*** DEADLOCK ***
2 locks held by kswapd0/50:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff811dc11f>] shrink_slab+0x7f/0x5b0
#1: (&type->s_umount_key#30){+++++.}, at: [<ffffffff8126dec6>] trylock_super+0x16/0x50
the shortest dependencies between 2nd lock and 1st lock:
-> (pcpu_alloc_mutex){+.+.+.} ops: 4904 {
HARDIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
pcpu_alloc+0x1ac/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
__do_tune_cpucache+0x2c/0x220
do_tune_cpucache+0x26/0xc0
enable_cpucache+0x6d/0xf0
kmem_cache_init_late+0x42/0x75
start_kernel+0x343/0x4cb
x86_64_start_kernel+0x127/0x134
secondary_startup_64+0xa5/0xb0
SOFTIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
pcpu_alloc+0x1ac/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
__do_tune_cpucache+0x2c/0x220
do_tune_cpucache+0x26/0xc0
enable_cpucache+0x6d/0xf0
kmem_cache_init_late+0x42/0x75
start_kernel+0x343/0x4cb
x86_64_start_kernel+0x127/0x134
secondary_startup_64+0xa5/0xb0
RECLAIM_FS-ON-W at:
__kmalloc+0x47/0x310
pcpu_extend_area_map+0x2b/0xc0
pcpu_alloc+0x3ec/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
__do_tune_cpucache+0x2c/0x220
do_tune_cpucache+0x26/0xc0
enable_cpucache+0x6d/0xf0
__kmem_cache_create+0x1bf/0x390
create_cache+0xba/0x1b0
kmem_cache_create+0x1f8/0x2b0
ksm_init+0x6f/0x19d
do_one_initcall+0x50/0x1b0
kernel_init_freeable+0x201/0x289
kernel_init+0xa/0x100
ret_from_fork+0x3a/0x50
INITIAL USE at:
__mutex_lock+0x4e/0x8c0
pcpu_alloc+0x1ac/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
setup_cpu_cache+0x2f/0x1f0
__kmem_cache_create+0x1bf/0x390
create_boot_cache+0x8b/0xb1
kmem_cache_init+0xa1/0x19e
start_kernel+0x270/0x4cb
x86_64_start_kernel+0x127/0x134
secondary_startup_64+0xa5/0xb0
}
... key at: [<ffffffff821d8e70>] pcpu_alloc_mutex+0x70/0xa0
... acquired at:
pcpu_alloc+0x1ac/0x5e0
__percpu_counter_init+0x4e/0xb0
btrfs_init_fs_root+0x99/0x1c0 [btrfs]
btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
resolve_indirect_refs+0x130/0x830 [btrfs]
find_parent_nodes+0x69e/0xff0 [btrfs]
btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
btrfs_find_all_roots+0x50/0x70 [btrfs]
btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
transaction_kthread+0x176/0x1b0 [btrfs]
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
-> (&fs_info->commit_root_sem){++++..} ops: 1566382 {
HARDIRQ-ON-W at:
down_write+0x3e/0xa0
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
cow_file_range.isra.66+0x133/0x470 [btrfs]
run_delalloc_range+0x121/0x410 [btrfs]
writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
__extent_writepage+0x19a/0x360 [btrfs]
extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
extent_writepages+0x4d/0x60 [btrfs]
do_writepages+0x1a/0x70
__filemap_fdatawrite_range+0xa7/0xe0
btrfs_rename+0x5ee/0xdb0 [btrfs]
vfs_rename+0x52a/0x7e0
SyS_rename+0x351/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
HARDIRQ-ON-R at:
down_read+0x35/0x90
caching_thread+0x57/0x560 [btrfs]
normal_work_helper+0x1c0/0x5e0 [btrfs]
process_one_work+0x1e0/0x5c0
worker_thread+0x44/0x390
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
SOFTIRQ-ON-W at:
down_write+0x3e/0xa0
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
cow_file_range.isra.66+0x133/0x470 [btrfs]
run_delalloc_range+0x121/0x410 [btrfs]
writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
__extent_writepage+0x19a/0x360 [btrfs]
extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
extent_writepages+0x4d/0x60 [btrfs]
do_writepages+0x1a/0x70
__filemap_fdatawrite_range+0xa7/0xe0
btrfs_rename+0x5ee/0xdb0 [btrfs]
vfs_rename+0x52a/0x7e0
SyS_rename+0x351/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-R at:
down_read+0x35/0x90
caching_thread+0x57/0x560 [btrfs]
normal_work_helper+0x1c0/0x5e0 [btrfs]
process_one_work+0x1e0/0x5c0
worker_thread+0x44/0x390
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
INITIAL USE at:
down_write+0x3e/0xa0
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
cow_file_range.isra.66+0x133/0x470 [btrfs]
run_delalloc_range+0x121/0x410 [btrfs]
writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
__extent_writepage+0x19a/0x360 [btrfs]
extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
extent_writepages+0x4d/0x60 [btrfs]
do_writepages+0x1a/0x70
__filemap_fdatawrite_range+0xa7/0xe0
btrfs_rename+0x5ee/0xdb0 [btrfs]
vfs_rename+0x52a/0x7e0
SyS_rename+0x351/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
}
... key at: [<ffffffffc0729578>] __key.61970+0x0/0xfffffffffff9aa88 [btrfs]
... acquired at:
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
btrfs_create_tree+0xbb/0x2a0 [btrfs]
btrfs_create_uuid_tree+0x37/0x140 [btrfs]
open_ctree+0x23c0/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> (&found->groups_sem){++++..} ops: 2134587 {
HARDIRQ-ON-W at:
down_write+0x3e/0xa0
__link_block_group+0x34/0x130 [btrfs]
btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
open_ctree+0x2054/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
HARDIRQ-ON-R at:
down_read+0x35/0x90
btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
open_ctree+0x207b/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-W at:
down_write+0x3e/0xa0
__link_block_group+0x34/0x130 [btrfs]
btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
open_ctree+0x2054/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-R at:
down_read+0x35/0x90
btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
open_ctree+0x207b/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
INITIAL USE at:
down_write+0x3e/0xa0
__link_block_group+0x34/0x130 [btrfs]
btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
open_ctree+0x2054/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
}
... key at: [<ffffffffc0729488>] __key.59101+0x0/0xfffffffffff9ab78 [btrfs]
... acquired at:
find_free_extent+0xcb4/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
__btrfs_cow_block+0x110/0x5b0 [btrfs]
btrfs_cow_block+0xd7/0x290 [btrfs]
btrfs_search_slot+0x1f6/0x960 [btrfs]
btrfs_lookup_inode+0x2a/0x90 [btrfs]
__btrfs_update_delayed_inode+0x65/0x210 [btrfs]
btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs]
btrfs_evict_inode+0x3fe/0x6a0 [btrfs]
evict+0xc4/0x190
__dentry_kill+0xbf/0x170
dput+0x2ae/0x2f0
SyS_rename+0x2a6/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> (&delayed_node->mutex){+.+.-.} ops: 5580204 {
HARDIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
btrfs_update_inode+0x83/0x110 [btrfs]
btrfs_dirty_inode+0x62/0xe0 [btrfs]
touch_atime+0x8c/0xb0
do_generic_file_read+0x818/0xb10
__vfs_read+0xdc/0x150
vfs_read+0x8a/0x130
SyS_read+0x45/0xa0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
btrfs_update_inode+0x83/0x110 [btrfs]
btrfs_dirty_inode+0x62/0xe0 [btrfs]
touch_atime+0x8c/0xb0
do_generic_file_read+0x818/0xb10
__vfs_read+0xdc/0x150
vfs_read+0x8a/0x130
SyS_read+0x45/0xa0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
IN-RECLAIM_FS-W at:
__mutex_lock+0x4e/0x8c0
__btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
btrfs_evict_inode+0x22c/0x6a0 [btrfs]
evict+0xc4/0x190
dispose_list+0x35/0x50
prune_icache_sb+0x42/0x50
super_cache_scan+0x139/0x190
shrink_slab+0x262/0x5b0
shrink_node+0x2eb/0x2f0
kswapd+0x2eb/0x890
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
INITIAL USE at:
__mutex_lock+0x4e/0x8c0
btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
btrfs_update_inode+0x83/0x110 [btrfs]
btrfs_dirty_inode+0x62/0xe0 [btrfs]
touch_atime+0x8c/0xb0
do_generic_file_read+0x818/0xb10
__vfs_read+0xdc/0x150
vfs_read+0x8a/0x130
SyS_read+0x45/0xa0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
}
... key at: [<ffffffffc072d488>] __key.56935+0x0/0xfffffffffff96b78 [btrfs]
... acquired at:
__lock_acquire+0x264/0x11c0
lock_acquire+0xbd/0x1e0
__mutex_lock+0x4e/0x8c0
__btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
btrfs_evict_inode+0x22c/0x6a0 [btrfs]
evict+0xc4/0x190
dispose_list+0x35/0x50
prune_icache_sb+0x42/0x50
super_cache_scan+0x139/0x190
shrink_slab+0x262/0x5b0
shrink_node+0x2eb/0x2f0
kswapd+0x2eb/0x890
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
stack backtrace:
CPU: 1 PID: 50 Comm: kswapd0 Tainted: G W 4.12.14-kvmsmall #8 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x78/0xb7
print_irq_inversion_bug.part.38+0x19f/0x1aa
check_usage_forwards+0x102/0x120
? ret_from_fork+0x3a/0x50
? check_usage_backwards+0x110/0x110
mark_lock+0x16c/0x270
__lock_acquire+0x264/0x11c0
? pagevec_lookup_entries+0x1a/0x30
? truncate_inode_pages_range+0x2b3/0x7f0
lock_acquire+0xbd/0x1e0
? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
__mutex_lock+0x4e/0x8c0
? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
? btrfs_evict_inode+0x1f6/0x6a0 [btrfs]
__btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
btrfs_evict_inode+0x22c/0x6a0 [btrfs]
evict+0xc4/0x190
dispose_list+0x35/0x50
prune_icache_sb+0x42/0x50
super_cache_scan+0x139/0x190
shrink_slab+0x262/0x5b0
shrink_node+0x2eb/0x2f0
kswapd+0x2eb/0x890
kthread+0x102/0x140
? mem_cgroup_shrink_node+0x2c0/0x2c0
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x3a/0x50
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8434ec46c6e3232cebc25a910363b29f5c617820 ]
When logging an inode, at tree-log.c:copy_items(), if we call
btrfs_next_leaf() at the loop which checks for the need to log holes, we
need to make sure copy_items() returns the value 1 to its caller and
not 0 (on success). This is because the path the caller passed was
released and is now different from what is was before, and the caller
expects a return value of 0 to mean both success and that the path
has not changed, while a return value of 1 means both success and
signals the caller that it can not reuse the path, it has to perform
another tree search.
Even though this is a case that should not be triggered on normal
circumstances or very rare at least, its consequences can be very
unpredictable (especially when replaying a log tree).
Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3c0efdf03b2d127f0e40e30db4e7aa0429b1b79a ]
The extent tree of the test fs is like the following:
BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
item 0 key (4096 168 4096) itemoff 3944 itemsize 51
extent refs 1 gen 1 flags 2
tree block key (68719476736 0 0) level 1
^^^^^^^
ref#0: tree block backref root 5
And it's using an empty tree for fs tree, so there is no way that its
level can be 1.
For REAL (created by mkfs) fs tree backref with no skinny metadata, the
result should look like:
item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
refs 1 gen 4 flags TREE_BLOCK
tree block key (256 INODE_ITEM 0) level 0
^^^^^^^
tree block backref root 5
Fix the level to 0, so it won't break later tree level checker.
Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1e1c50a929bc9e49bc3f9935b92450d9e69f8158 ]
do_chunk_alloc implements a loop checking whether there is a pending
chunk allocation and if so causes the caller do loop. Generally this
loop is executed only once, however testing with btrfs/072 on a single
core vm machines uncovered an extreme case where the system could loop
indefinitely. This is due to a missing cond_resched when loop which
doesn't give a chance to the previous chunk allocator finish its job.
The fix is to simply add the missing cond_resched.
Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 80c0b4210a963e31529e15bf90519708ec947596 ]
0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
returned, path->nodes[0] could be NULL, log_dir_items lacks such a
check for <0 and we may run into a null pointer dereference panic.
Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b98def7ca6e152ee55e36863dddf6f41f12d1dc6 ]
If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
to bail out, otherwise @ret would be forced to be 0 after 'break;' and
the caller won't be aware of it.
Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d4dfc0f4d39475ccbbac947880b5464a74c30b99 ]
When doing an incremental send of a filesystem with the no-holes feature
enabled, we end up issuing a write operation when using the no data mode
send flag, instead of issuing an update extent operation. Fix this by
issuing the update extent operation instead.
Trivial reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdc
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdc /mnt/sdc
$ mount /dev/sdd /mnt/sdd
$ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1
$ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2
$ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
$ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
| btrfs receive -vv /mnt/sdd
Before this change the output of the second receive command is:
receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
utimes
write foobar, offset 8192, len 8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...
After this change it is:
receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
utimes
update_extent foobar: offset=8192, len=8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6f794e3c5c8f8fdd3b5bb20d9ded894e685b5bbe ]
It appears from the original commit [1] that there isn't any design
specific reason not to fail the mount instead of just warning. This
patch will change it to fail.
[1]
commit 319e4d0661e5323c9f9945f0f8fb5905e5fe74c3
btrfs: Enhance super validation check
Fixes: 319e4d0661e5323 ("btrfs: Enhance super validation check")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 762221f095e3932669093466aaf4b85ed9ad2ac1 ]
The raid6 corruption is that,
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,
- the 1st retry is to rebuild with all other stripes, it'll eventually
be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
that it will do raid6 style rebuild,
however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content.
We've fixed normal reads to rebuild raid6 correctly with more retries
in Patch "Btrfs: make raid6 rebuild retry more"[1], this is to fix
scrub to do the exactly same rebuild process.
[1]: https://patchwork.kernel.org/patch/10091755/
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9ea2c7c9da13c9073e371c046cbbc45481ecb459 ]
When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then
the level variable is going to be 7 (this is the max height of the
tree). On the other hand btrfs_cow_block is always called with
"level + 1" as an index into the nodes and slots arrays. This leads to
an out of bounds access. Admittdely this will be benign since an OOB
access of the nodes array will likely read the 0th element from the
slots array, which in this case is going to be 0 (since we start CoW at
the top of the tree). The OOB access into the slots array in turn will
read the 0th and 1st values of the locks array, which would both be 0
at the time. However, this benign behavior relies on the fact that the
path being passed hasn't been initialised, if it has already been used to
query a btree then it could potentially have populated the nodes/slots arrays.
Fix it by explicitly checking if we are at level 7 (the maximum allowed
index in nodes/slots arrays) and explicitly call the CoW routine with
NULL for parent's node/slot.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Fixes-coverity-id: 711515
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 343e4fc1c60971b0734de26dbbd475d433950982 ]
Setting plug can merge adjacent IOs before dispatching IOs to the disk
driver.
Without plug, it'd not be a problem for single disk usecases, but for
multiple disks using raid profile, a large IO can be split to several
IOs of stripe length, and plug can be helpful to bring them together
for each disk so that we can save several disk access.
Moreover, fsync issues synchronous writes, so plug can really take
effect.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream.
For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex. Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.
Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode(). All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.
Cc: stable@kernel.org # 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 02a3307aa9c20b4f6626255b028f07f6cfa16feb upstream.
If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.
btrfs_search_slot()
read_block_for_search() # hold parent and its lock, go to read child
btrfs_release_path()
read_tree_block() # read child
Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block. It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.
A simple PoC is included in btrfs/124,
0. A two-disk raid1 btrfs,
1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
2. Mount this filesystem and put it in use, after a while, device tree's
root got COW but block A hasn't been allocated/overwritten yet.
3. Umount it and reload the btrfs module to remove both disks from the
global @fs_devices list.
4. mount -odegraded dev1 and write some data, so now block A is allocated
to be a leaf in checksum tree. Note that only dev1 has the latest
metadata of this filesystem.
5. Umount it and mount it again normally (with both disks), since raid1
can pick up one disk by the writer task's pid, if btrfs_search_slot()
needs to read block A, dev2 which does NOT have the latest metadata
might be read for block A, then we got a stale block A.
6. As parent transid is not checked, block A is marked as uptodate and
put into the extent buffer cache, so the future search won't bother
to read disk again, which means it'll make changes on this stale
one and make it dirty and flush it onto disk.
To avoid the problem, parent transid needs to be passed to
read_tree_block().
In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.
This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760415c4). The fix is to replace 0 by 'gen'.
Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 02ee654d3a04563c67bfe658a05384548b9bb105 upstream.
We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance()
only, which isn't called during the remount. So when resuming from
the paused balance we hit the bug:
kernel: kernel BUG at fs/btrfs/volumes.c:3890!
::
kernel: balance_kthread+0x51/0x60 [btrfs]
kernel: kthread+0x111/0x130
::
kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8
Reproducer:
On a mounted filesystem:
btrfs balance start --full-balance /btrfs
btrfs balance pause /btrfs
mount -o remount,ro /dev/sdb /btrfs
mount -o remount,rw /dev/sdb /btrfs
To fix this set the BTRFS_BALANCE_RESUME flag in
btrfs_resume_balance_async().
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9a8fca62aacc1599fea8e813d01e1955513e4fad upstream.
If a file has xattrs, we fsync it, to ensure we clear the flags
BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
inode, the current transaction commits and then we fsync it (without
either of those bits being set in its inode), we end up not logging
all its xattrs. This results in deleting all xattrs when replying the
log after a power failure.
Trivial reproducer
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foobar
$ setfattr -n user.xa -v qwerty /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
$ sync
$ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
<power failure>
$ mount /dev/sdb /mnt
$ getfattr --absolute-names --dump /mnt/foobar
<empty output>
$
So fix this by making sure all xattrs are logged if we log a file's inode
item and neither the flags BTRFS_INODE_NEEDS_FULL_SYNC nor
BTRFS_INODE_COPY_EVERYTHING were set in the inode.
Fixes: 36283bf777d9 ("Btrfs: fix fsync xattr loss in the fast fsync path")
Cc: <stable@vger.kernel.org> # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit bff5baf8aa37a97293725a16c03f49872249c07e ]
The setting of return code ret should be based on the error code
passed into function end_extent_writepage and not on ret. Thanks
to Liu Bo for spotting this mistake in the original fix I submitted.
Detected by CoverityScan, CID#1414312 ("Logically dead code")
Fixes: 5dca6eea91653e ("Btrfs: mark mapping with error flag to report errors to userspace")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0dde10bed2c44a4024eb446cc72fe4e0cb97ec06 upstream.
There is no need for the extra pair of parentheses, remove it. This
fixes the following warning when building with clang:
fs/btrfs/tree-log.c:3694:10: warning: equality comparison with extraneous
parentheses [-Wparentheses-equality]
if ((i == (nr - 1)))
~~^~~~~~~~~~~
Also remove the unnecessary parentheses around the substraction.
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit be2d253cc98244765323a7c94cc1ac5cd5a17072 ]
If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
extent map structure. The failure can happen either due to an -ENOMEM
condition or, when quotas are enabled, due to -EDQUOT for example.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e1cbfd7bf6dabdac561c75d08357571f44040a45 ]
Normally we don't have inline extents followed by regular extents, but
there's currently at least one harmless case where this happens. For
example, when the page size is 4Kb and compression is enabled:
$ mkfs.btrfs -f /dev/sdb
$ mount -o compress /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
$ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
In this case we get a compressed inline extent, representing 4Kb of
data, followed by a hole extent and then a regular data extent. The
inline extent was not expanded/converted to a regular extent exactly
because it represents 4Kb of data. This does not cause any apparent
problem (such as the issue solved by commit e1699d2d7bf6
("btrfs: add missing memset while reading compressed inline extents"))
except trigger an unexpected case in the incremental send code path
that makes us issue an operation to write a hole when it's not needed,
resulting in more writes at the receiver and wasting space at the
receiver.
So teach the incremental send code to deal with this particular case.
The issue can be currently triggered by running fstests btrfs/137 with
compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1c81ba237bcecad9bc885a1ddcf02d725ea38482 ]
When using compression, if we fail to insert an inline extent we
incorrectly end up attempting to free the reserved data space twice,
once through extent_clear_unlock_delalloc(), because we pass it the
flag EXTENT_DO_ACCOUNTING, and once through a direct call to
btrfs_free_reserved_data_space_noquota(). This results in a trace
like the following:
[ 834.576240] ------------[ cut here ]------------
[ 834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[ 834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[ 834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[ 834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
[ 834.596103] Call Trace:
[ 834.596103] dump_stack+0x67/0x90
[ 834.596103] __warn+0xc2/0xdd
[ 834.596103] warn_slowpath_null+0x1d/0x1f
[ 834.596103] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[ 834.596103] compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
[ 834.596103] ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
[ 834.596103] async_cow_start+0x32/0x4d [btrfs]
[ 834.596103] btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
[ 834.596103] btrfs_delalloc_helper+0xe/0x10 [btrfs]
[ 834.596103] process_one_work+0x273/0x4e4
[ 834.596103] worker_thread+0x1eb/0x2ca
[ 834.596103] ? rescuer_thread+0x2b6/0x2b6
[ 834.596103] kthread+0x100/0x108
[ 834.596103] ? __list_del_entry+0x22/0x22
[ 834.596103] ret_from_fork+0x2e/0x40
[ 834.611656] ---[ end trace 719902fe6bdef08f ]---
So fix this by not calling directly btrfs_free_reserved_data_space_noquota()
if an error happened.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 14506127979a5a3d0c5d9b4cc76ce9d4ec23b717 ]
If your filesystem has, eg, data:raid0 metadata:raid1, and you run "btrfs
balance -dconvert=raid1", the meta.target field will be uninitialized.
That's otherwise ok, as it's unused except for this warning.
Thus, let's use the existing set of raid levels for the comparison.
As a side effect, non-convert balances will now nag about data>metadata.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fd649f10c3d21ee9d7542c609f29978bdf73ab94 upstream.
Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced
btrfs_free_stale_device which iterates the device lists for all
registered btrfs filesystems and deletes those devices which aren't
mounted. In a btrfs_devices structure has only 1 device attached to it
and it is unused then btrfs_free_stale_devices will proceed to also free
the btrfs_fs_devices struct itself. Currently this leads to a use after
free since list_for_each_entry will try to perform a check on the
already freed memory to see if it has to terminate the loop.
The fix is to use 'break' when we know we are freeing the current
fs_devs.
Fixes: 4fde46f0cc71 ("Btrfs: free the stale device")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 92e222df7b8f05c565009c7383321b593eca488b upstream.
In case of using DUP, we search for enough unallocated disk space on a
device to hold two stripes.
The devices_info[ndevs-1].max_avail that holds the amount of unallocated
space found is directly assigned to stripe_size, while it's actually
twice the stripe size.
Later on in the code, an unconditional division of stripe_size by
dev_stripes corrects the value, but in the meantime there's a check to
see if the stripe_size does not exceed max_chunk_size. Since during this
check stripe_size is twice the amount as intended, the check will reduce
the stripe_size to max_chunk_size if the actual correct to be used
stripe_size is more than half the amount of max_chunk_size.
The unconditional division later tries to correct stripe_size, but will
actually make sure we can't allocate more than half the max_chunk_size.
Fix this by moving the division by dev_stripes before the max chunk size
check, so it always contains the right value, instead of putting a duct
tape division in further on to get it fixed again.
Since in all other cases than DUP, dev_stripes is 1, this change only
affects DUP.
Other attempts in the past were made to fix this:
* 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
to fix the same problem, but still resulted in part of the code acting
on a wrongly doubled stripe_size value.
* 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
broke this fix again.
The real problem was already introduced with the rest of the code in
73c5de0051.
The user visible result however will be that the max chunk size for DUP
will suddenly double, while it's actually acting according to the limits
in the code again like it was 5 years ago.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d7d824966530acfe32b94d1ed672e6fe1638cd68 upstream.
When changing a file's acl mask, btrfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.
If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.
Prevent this by restoring the original mode bits if __btrfs_set_acl
fails.
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31 ]
The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
are in the range of [0, 4079]. Currently the code uses the define but this
represents an off-by-one.
Implications:
Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.
btrfs-progs store the arguments on stack, but kernel does own copy of
the ioctl buffer and the off-by-one overwrite does not affect userspace,
but the ending 0 might be lost.
Kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.
Before this patch the buffer was one byte larger, but then the -1 was
not added.
Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added implications ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 900c9981680067573671ecc5cbfa7c5770be3a40 upstream.
The highest objectid, which is assigned to new inode, is decided at
the time of initializing fs roots. However, in cases where log replay
gets processed, the btree which fs root owns might be changed, so we
have to search it again for the highest objectid, otherwise creating
new inode would end up with -EEXIST.
cc: <stable@vger.kernel.org> v4.4-rc6+
Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e8f1bc1493855e32b7a2a019decc3c353d94daf6 upstream.
This regression is introduced in
commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction").
There are two problems,
a) it is ->destroy_inode() that does the final free on inode, not
->evict_inode(),
b) clear_inode() must be called before ->evict_inode() returns.
This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
in evict() because I_CLEAR is set in clear_inode().
Fixes: commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction")
Cc: <stable@vger.kernel.org> # v4.7-rc6+
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 55237a5f2431a72435e3ed39e4306e973c0446b7 upstream.
It's possible that btrfs_sync_log() bails out after one of the two
btrfs_write_marked_extents() which convert extent state's state bit into
EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY
and EXTENT_NEW are searched by free_log_tree() so that those extent states
with EXTENT_NEED_WAIT lead to memory leak.
cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1846430c24d66e85cc58286b3319c82cd54debb2 upstream.
In cases that the whole fs flips into readonly status due to failures in
critical sections, then log tree's blocks are still dirty, and this leads
to a crash during umount time, the crash is about use-after-free,
umount
-> close_ctree
-> stop workers
-> iput(btree_inode)
-> iput_final
-> write_inode_now
-> ...
-> queue job on stop'd workers
cc: <stable@vger.kernel.org> v3.12+
Fixes: 681ae50917df ("Btrfs: cleanup reserved space when freeing tree log on error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e89166990f11c3f21e1649d760dd35f9e410321c upstream.
@cur_offset is not set back to what it should be (@cow_start) if
btrfs_next_leaf() returns something wrong, and the range [cow_start,
cur_offset) remains locked forever.
cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f3038ee3a3f1017a1cbe9907e31fa12d366c5dcb upstream.
This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers
to deal with pages that have been improperly dirtied") and it didn't do
any error handling then. This function might very well fail in ENOMEM
situation, yet it's not handled, this could lead to inconsistent state.
So let's handle the failure by setting the mapping error bit.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b77000ed558daa3bef0899d29bf171b8c9b5e6a8 ]
If we fail to prepare our pages for whatever reason (out of memory in
our case) we need to make sure to drop the block_group->data_rwsem,
otherwise hilarity ensues.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add label and use existing unlocking code ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 457ae7268b29c33dee1c0feb143a15f6029d177b ]
This isn't super serious because you need CAP_ADMIN to run this code.
I added this integer overflow check last year but apparently I am
rubbish at writing integer overflow checks... There are two issues.
First, access_ok() works on unsigned long type and not u64 so on 32 bit
systems the access_ok() could be checking a truncated size. The other
issue is that we should be using a stricter limit so we don't overflow
the kzalloc() setting ctx->clone_roots later in the function after the
access_ok():
alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
Fixes: f5ecec3ce21f ("btrfs: send: silence an integer overflow warning")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9ca2e97fa3c3216200afe35a3b111ec51cc796d2 ]
If 'btrfs_alloc_path()' fails, we must free the resources already
allocated, as done in the other error handling paths in this function.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e1699d2d7bf6e6cce3e1baff19f9dd4595a58664 ]
This is a story about 4 distinct (and very old) btrfs bugs.
Commit c8b978188c ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).
Commit 93c82d5750 ("Btrfs: zero page past end of inline file items")
fixed bug #1: uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read. The fix
was to add a memset in btrfs_get_extent to zero out the hole.
Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption")
fixed bug #2: compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases. This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.
There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record. This memset doesn't cover the region between the end of the
decompressed data and the end of the page. It has also moved around a
few times over the years, so there's no single patch to refer to.
This patch fixes bug #3: compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same: zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.
The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code. In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.
A fast reproducer for bug #0 is presented below. A few offending extents
are also created in the wild during large rsync transfers with the -S
flag. A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well. Once an offending
file is created, it can present different content to userspace each
time it is read.
Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y
kernel back to v3.5 has this behavior. There are fossil records of this
bug's effects in commits all the way back to v2.6.32. I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.
It is not clear whether bug #0 is worth fixing. A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.
Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak. So enough about
bug #0, let's get back to bug #3 (this patch).
An example of on-disk structure leading to data corruption found in
the wild:
item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
inode generation 50 transid 50 size 47424 nbytes 49141
block group 0 mode 100644 links 1 uid 0 gid 0
rdev 0 flags 0x0(none)
item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
inode ref index 3 namelen 10 name: DB_File.so
item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
inline extent data size 1341 ram 4085 compress(zlib)
item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
extent data disk byte 5367308288 nr 20480
extent data offset 0 nr 45056 ram 45056
extent compression(zlib)
Different data appears in userspace during each read of the 11 bytes
between 4085 and 4096. The extent in item 63 is not long enough to
fill the first page of the file, so a memset is required to fill the
space between item 63 (ending at 4085) and item 64 (beginning at 4096)
with zero.
Here is a reproducer from Liu Bo, which demonstrates another method
of creating the same inline extent and hole pattern:
Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:
# touch foo
# chattr +c foo
# xfs_io -f -c "pwrite -W 0 1000" foo
# xfs_io -f -c "falloc 4 8188" foo
# od -x foo
# echo 3 >/proc/sys/vm/drop_caches
# od -x foo
This produce the following on my box:
Correct output: file contains 1000 data bytes followed
by zeros:
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
0001760 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000
Actual output: the data after the first 1000 bytes
will be different each run:
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
(...)
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e19182c0fff451e3744c1107d98f072e7ca377a0 upstream.
If btrfs_del_root fails in btrfs_drop_snapshot, we'll pick up the
error but then return 0 anyway due to mixing err and ret.
Fixes: 79787eaab4612 ("btrfs: replace many BUG_ONs with proper error handling")
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8e138e0d92c6c9d3d481674fb14e3439b495be37 upstream.
We discovered a box that had double allocations, and suspected the space
cache may be to blame. While auditing the write out path I noticed that
if we've already setup the space cache we will just carry on. This
means that any error we hit after cache_save_setup before we go to
actually write the cache out we won't reset the inode generation, so
whatever was already written will be considered correct, except it'll be
stale. Fix this by _always_ resetting the generation on the block group
inode, this way we only ever have valid or invalid cache.
With this patch I was no longer able to reproduce cache corruption with
dm-log-writes and my bpf error injection tool.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 73ba39ab9307340dc98ec3622891314bbc09cc2e ]
In function btrfs_uuid_tree_iterate(), errno is assigned to variable ret
on errors. However, it directly returns 0. It may be better to return
ret. This patch also removes the warning, because the caller already
prints a warning.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188731
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
[ edited subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4dd9920d991745c4a16f53a8f615f706fbe4b3f7 ]
Under certain situations, an incremental send operation can fail due to a
premature attempt to create a new top level inode (a direct child of the
subvolume/snapshot root) whose name collides with another inode that was
removed from the send snapshot.
Consider the following example scenario.
Parent snapshot:
. (ino 256, gen 8)
|---- a1/ (ino 257, gen 9)
|---- a2/ (ino 258, gen 9)
Send snapshot:
. (ino 256, gen 3)
|---- a2/ (ino 257, gen 7)
In this scenario, when receiving the incremental send stream, the btrfs
receive command fails like this (ran in verbose mode, -vv argument):
rmdir a1
mkfile o257-7-0
rename o257-7-0 -> a2
ERROR: rename o257-7-0 -> a2 failed: Is a directory
What happens when computing the incremental send stream is:
1) An operation to remove the directory with inode number 257 and
generation 9 is issued.
2) An operation to create the inode with number 257 and generation 7 is
issued. This creates the inode with an orphanized name of "o257-7-0".
3) An operation rename the new inode 257 to its final name, "a2", is
issued. This is incorrect because inode 258, which has the same name
and it's a child of the same parent (root inode 256), was not yet
processed and therefore no rmdir operation for it was yet issued.
The rename operation is issued because we fail to detect that the
name of the new inode 257 collides with inode 258, because their
parent, a subvolume/snapshot root (inode 256) has a different
generation in both snapshots.
So fix this by ignoring the generation value of a parent directory that
matches a root inode (number 256) when we are checking if the name of the
inode currently being processed collides with the name of some other
inode that was not yet processed.
We can achieve this scenario of different inodes with the same number but
different generation values either by mounting a filesystem with the inode
cache option (-o inode_cache) or by creating and sending snapshots across
different filesystems, like in the following example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/a1
$ mkdir /mnt/a2
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/a2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt -f /tmp/1.snap
# Take note that once the filesystem is created, its current
# generation has value 7 so the inode from the second snapshot has
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10.
$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ btrfs receive /mnt /tmp/1.snap
# Receive of snapshot snap2 used to fail.
$ btrfs receive /mnt /tmp/2.snap
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|