linux-toradex.git/fs/btrfs/ioctl.c, branch v4.0.6

Btrfs: fix uninit variable in clone ioctl

2015-06-23T00:03:37+00:00

commit de249e66a73d696666281cd812087979c6fae552 upstream.

Commit 0d97a64e0 creates a new variable but doesn't always set it up.
This puts it back to the original method (key.offset + 1) for the cases
not covered by Filipe's new logic.

Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix range cloning when same inode used as source and destination

2015-06-23T00:03:36+00:00

commit df858e76723ace61342b118aa4302bd09de4e386 upstream.

While searching for extents to clone we might find one where we only use
a part of it coming from its tail. If our destination inode is the same
as the source inode, we end up removing the tail part of the extent item
and inserting after it a new one that points to the same extent with an
adjusted key file offset and data offset. After this we search for the next
extent item in the fs/subvol tree with a key that has an offset incremented
by one. But this second search leaves us at the new extent item we inserted
previously, and since that extent item has a non-zero data offset, it
can make us call btrfs_drop_extents with an empty range (start == end)
which causes the following warning:

[23978.537119] WARNING: CPU: 6 PID: 16251 at fs/btrfs/file.c:550 btrfs_drop_extent_cache+0x43/0x385 [btrfs]()
(...)
[23978.557266] Call Trace:
[23978.557978]  [] dump_stack+0x4c/0x65
[23978.559191]  [] warn_slowpath_common+0xa1/0xbb
[23978.560699]  [] ? btrfs_drop_extent_cache+0x43/0x385 [btrfs]
[23978.562389]  [] warn_slowpath_null+0x1a/0x1c
[23978.563613]  [] btrfs_drop_extent_cache+0x43/0x385 [btrfs]
[23978.565103]  [] ? time_hardirqs_off+0x15/0x28
[23978.566294]  [] ? trace_hardirqs_off+0xd/0xf
[23978.567438]  [] __btrfs_drop_extents+0x6b/0x9e1 [btrfs]
[23978.568702]  [] ? trace_hardirqs_on+0xd/0xf
[23978.569763]  [] ? ____cache_alloc+0x69/0x2eb
[23978.570817]  [] ? virt_to_head_page+0x9/0x36
[23978.571872]  [] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
[23978.573466]  [] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[23978.574962]  [] btrfs_drop_extents+0x66/0x7f [btrfs]
[23978.576179]  [] btrfs_clone+0x516/0xaf5 [btrfs]
[23978.577311]  [] ? lock_extent_range+0x7b/0xcd [btrfs]
[23978.578520]  [] btrfs_ioctl_clone+0x28e/0x39f [btrfs]
[23978.580282]  [] btrfs_ioctl+0xb51/0x219a [btrfs]
(...)
[23978.591887] ---[ end trace 988ec2a653d03ed3 ]---

Then we attempt to insert a new extent item with a key that already
exists, which makes btrfs_insert_empty_item return -EEXIST resulting in
abortion of the current transaction:

[23978.594355] WARNING: CPU: 6 PID: 16251 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
(...)
[23978.622589] Call Trace:
[23978.623181]  [] dump_stack+0x4c/0x65
[23978.624359]  [] warn_slowpath_common+0xa1/0xbb
[23978.625573]  [] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[23978.626971]  [] warn_slowpath_fmt+0x46/0x48
[23978.628003]  [] ? vprintk_default+0x1d/0x1f
[23978.629138]  [] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[23978.630528]  [] btrfs_clone+0x7fc/0xaf5 [btrfs]
[23978.631635]  [] ? lock_extent_range+0x7b/0xcd [btrfs]
[23978.632886]  [] btrfs_ioctl_clone+0x28e/0x39f [btrfs]
[23978.634119]  [] btrfs_ioctl+0xb51/0x219a [btrfs]
(...)
[23978.647714] ---[ end trace 988ec2a653d03ed4 ]---

This is wrong because we should not process the extent item that we just
inserted previously, and instead process the extent item that follows it
in the tree.

For example for the test case I wrote for fstests:

   bs=$((64 * 1024))
   mkfs.btrfs -f -l $bs -O ^no-holes /dev/sdc
   mount /dev/sdc /mnt

   xfs_io -f -c "pwrite -S 0xaa $(($bs * 2)) $(($bs * 2))" /mnt/foo

   $CLONER_PROG -s $((3 * $bs)) -d $((267 * $bs)) -l 0 /mnt/foo /mnt/foo
   $CLONER_PROG -s $((217 * $bs)) -d $((95 * $bs)) -l 0 /mnt/foo /mnt/foo

The second clone call fails with -EEXIST, because when we process the
first extent item (offset 262144), we drop part of it (counting from the
end) and then insert a new extent item with a key greater than the key we
found. The next time we search the tree we search for a key with offset
262144 + 1, which leaves us at the new extent item we have just inserted
but we think it refers to an extent that we need to clone.

Fix this by ensuring the next search key uses an offset corresponding to
the offset of the key we found previously plus the data length of the
corresponding extent item. This ensures we skip new extent items that we
inserted and works for the case of implicit holes too (NO_HOLES feature).
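
The difference between the two search strategies can be sketched with a
small user-space model. This is not the kernel code: the sorted list of key
offsets stands in for the fs/subvol tree, and the function names and the
extent layout (found_offset, data_len, inserted) are hypothetical values
chosen only for illustration.

```python
import bisect


def search_slot(key_offsets, wanted):
    """Return the first key offset >= wanted, or None.

    Models a forward search in a sorted tree of extent item keys.
    """
    i = bisect.bisect_left(key_offsets, wanted)
    return key_offsets[i] if i < len(key_offsets) else None


def next_item(key_offsets, found_offset, data_len, use_fix):
    """Pick the key offset the clone loop would process next."""
    if use_fix:
        # Fixed behaviour: skip the whole range covered by the item we found.
        start = found_offset + data_len
    else:
        # Old behaviour: only step past the key we found.
        start = found_offset + 1
    return search_slot(key_offsets, start)


# Hypothetical layout: an item found at 262144 that covers data_len bytes,
# a new item inserted inside that range while processing it, and the item
# that really follows in the file.
bs = 64 * 1024
found_offset = 4 * bs
data_len = 200 * bs
inserted = 100 * bs
following = found_offset + data_len
keys = sorted([found_offset, inserted, following])

old_next = next_item(keys, found_offset, data_len, use_fix=False)
new_next = next_item(keys, found_offset, data_len, use_fix=True)
print(old_next == inserted, new_next == following)
```

With the old start key the model lands on the item that was just inserted;
with the start key advanced by the data length it lands on the item that
follows it.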

A test case for fstests follows soon.

Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

btrfs: unlock i_mutex after attempting to delete subvolume during send

2015-05-13T12:14:27+00:00

commit 909e26dce3f7600f5e293ac0522c28790a0c8c9c upstream.

Whenever the check for a send in progress introduced in commit
521e0546c970 (btrfs: protect snapshots from deleting during send) is
hit, we return without unlocking inode->i_mutex. This is easy to see
with lockdep enabled:

[  +0.000059] ================================================
[  +0.000028] [ BUG: lock held when returning to user space! ]
[  +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted
[  +0.000026] ------------------------------------------------
[  +0.000029] btrfs/211 is leaving the kernel with locks still held!
[  +0.000029] 1 lock held by btrfs/211:
[  +0.000023]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [] btrfs_ioctl_snap_destroy+0x2df/0x7a0

Make sure we unlock it in the error path.
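
The pattern behind the bug and the fix can be illustrated with a user-space
analogue. This is only a sketch: threading.Lock stands in for
inode->i_mutex, and the function names and the error value are hypothetical
rather than taken from the kernel source.

```python
import threading

ERR_SEND_IN_PROGRESS = -1  # hypothetical error value for this sketch


def destroy_buggy(lock, send_in_progress):
    """Early return on the error path leaves the lock held."""
    lock.acquire()
    if send_in_progress:
        return ERR_SEND_IN_PROGRESS  # bug: lock is never released
    lock.release()
    return 0


def destroy_fixed(lock, send_in_progress):
    """Every exit path, including the error path, releases the lock."""
    lock.acquire()
    try:
        if send_in_progress:
            return ERR_SEND_IN_PROGRESS
        return 0
    finally:
        lock.release()


buggy_lock = threading.Lock()
fixed_lock = threading.Lock()
destroy_buggy(buggy_lock, send_in_progress=True)
destroy_fixed(fixed_lock, send_in_progress=True)
print(buggy_lock.locked(), fixed_lock.locked())
```

The first lock is still held after the call returns, which is the condition
lockdep reports above; the second is released on every path.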

Reviewed-by: Filipe Manana 
Reviewed-by: David Sterba 
Signed-off-by: Omar Sandoval 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix inode eviction infinite loop after extent_same ioctl

2015-05-06T20:03:38+00:00

commit 113e8283869b9855c8b999796aadd506bbac155f upstream.

If we pass a length of 0 to the extent_same ioctl, we end up locking an
extent range with a start offset greater than its end offset (if the
destination file's offset is greater than zero). This results in a warning
from extent_io.c:insert_state through the following call chain:

  btrfs_extent_same()
    btrfs_double_lock()
      lock_extent_range()
        lock_extent(inode->io_tree, offset, offset + len - 1)
          lock_extent_bits()
            __set_extent_bit()
              insert_state()
                --> WARN_ON(end < start)

This leads to an infinite loop when evicting the inode. This is the same
problem that my previous patch titled
"Btrfs: fix inode eviction infinite loop after cloning into it" addressed
but for the extent_same ioctl instead of the clone ioctl.
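
The range arithmetic can be checked with a short sketch. It assumes
unsigned 64-bit wrap-around for the offsets, as with the kernel's u64 type;
the helper names and the sample offset of 4096 are illustrative only.

```python
U64_MASK = (1 << 64) - 1


def lock_range(offset, length):
    """Inclusive [start, end] range computed as offset + len - 1 (mod 2^64)."""
    start = offset
    end = (offset + length - 1) & U64_MASK
    return start, end


def is_inverted(offset, length):
    """True when the computed range has end < start."""
    start, end = lock_range(offset, length)
    return end < start


print(lock_range(4096, 4096))  # ordinary range
print(lock_range(4096, 0))     # zero length at a non-zero offset
print(is_inverted(0, 0))       # offset 0 wraps around instead of inverting
```

A zero length at a non-zero offset yields an end one byte below the start,
while at offset 0 the subtraction wraps to the largest u64 value, which is
consistent with the note that the problem needs a destination offset
greater than zero.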

Signed-off-by: Filipe Manana 
Reviewed-by: Omar Sandoval 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix inode eviction infinite loop after cloning into it

2015-05-06T20:03:37+00:00

commit ccccf3d67294714af2d72a6fd6fd7d73b01c9329 upstream.

If we attempt to clone a 0 length region into a file we can end up
inserting a range in the inode's extent_io tree with a start offset
that is greater than the end offset, which immediately triggers the
following warning:

[ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3914.620886] BTRFS: end < start 4095 4096
(...)
[ 3914.638093] Call Trace:
[ 3914.638636]  [] dump_stack+0x4c/0x65
[ 3914.639620]  [] warn_slowpath_common+0xa1/0xbb
[ 3914.640789]  [] ? insert_state+0x4b/0x10b [btrfs]
[ 3914.642041]  [] warn_slowpath_fmt+0x46/0x48
[ 3914.643236]  [] insert_state+0x4b/0x10b [btrfs]
[ 3914.644441]  [] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3914.645711]  [] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3914.646914]  [] ? _raw_spin_unlock+0x28/0x33
[ 3914.648058]  [] ? test_range_bit+0xcc/0xde [btrfs]
[ 3914.650105]  [] lock_extent+0x13/0x15 [btrfs]
[ 3914.651361]  [] lock_extent_range+0x3d/0xcd [btrfs]
[ 3914.652761]  [] btrfs_ioctl_clone+0x278/0x388 [btrfs]
[ 3914.654128]  [] ? might_fault+0x58/0xb5
[ 3914.655320]  [] btrfs_ioctl+0xb51/0x2195 [btrfs]
(...)
[ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---

This later makes the inode eviction handler enter an infinite loop that
keeps dumping the following warning over and over:

[ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3915.119913] BTRFS: end < start 4095 4096
(...)
[ 3915.137394] Call Trace:
[ 3915.137913]  [] dump_stack+0x4c/0x65
[ 3915.139154]  [] warn_slowpath_common+0xa1/0xbb
[ 3915.140316]  [] ? insert_state+0x4b/0x10b [btrfs]
[ 3915.141505]  [] warn_slowpath_fmt+0x46/0x48
[ 3915.142709]  [] insert_state+0x4b/0x10b [btrfs]
[ 3915.143849]  [] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3915.145120]  [] ? btrfs_kill_super+0x17/0x23 [btrfs]
[ 3915.146352]  [] ? deactivate_locked_super+0x3b/0x50
[ 3915.147565]  [] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3915.148785]  [] ? _raw_write_unlock+0x28/0x33
[ 3915.149931]  [] btrfs_evict_inode+0x196/0x482 [btrfs]
[ 3915.151154]  [] evict+0xa0/0x148
[ 3915.152094]  [] dispose_list+0x39/0x43
[ 3915.153081]  [] evict_inodes+0xdc/0xeb
[ 3915.154062]  [] generic_shutdown_super+0x49/0xef
[ 3915.155193]  [] kill_anon_super+0x13/0x1e
[ 3915.156274]  [] btrfs_kill_super+0x17/0x23 [btrfs]
(...)
[ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---

So just bail out of the clone ioctl if the length of the region to clone
is zero, without locking any extent range, in order to prevent this issue
(same behaviour as a pwrite with a 0 length for example).

This is trivial to reproduce. For example, the steps for the test I just
made for fstests:

  mkfs.btrfs -f $SCRATCH_DEV
  mount $SCRATCH_DEV $SCRATCH_MNT

  touch $SCRATCH_MNT/foo
  touch $SCRATCH_MNT/bar

  $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
  umount $SCRATCH_MNT

A test case for fstests follows soon.

Signed-off-by: Filipe Manana 
Reviewed-by: Omar Sandoval 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)

2015-02-22T16:38:41+00:00

Convert the following where appropriate:

 (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).

 (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).

 (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
     complicated than it appears as some calls should be converted to
     d_can_lookup() instead.  The difference is whether the directory in
     question is a real dir with a ->lookup op or whether it's a fake dir with
     a ->d_automount op.

In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).

Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer.  In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.

However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.

There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
intended for special directory entry types that don't have attached inodes.

The following perl+coccinelle script was used:

use strict;

my @callers;
open(my $fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}
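
The effect of the three semantic patch rules can be approximated with a
regular expression. This is only a rough sketch and not a substitute for
Coccinelle: it handles a plain identifier in place of the expression E,
and it does not make the d_can_lookup() distinction described above. The
names RULES, PATTERN and convert are hypothetical.

```python
import re

# Macro name -> replacement helper, mirroring the three cocci rules.
RULES = {
    "S_ISLNK": "d_is_symlink",
    "S_ISDIR": "d_is_dir",
    "S_ISREG": "d_is_reg",
}

# Matches e.g. S_ISDIR(dentry->d_inode->i_mode); E is limited to an identifier.
PATTERN = re.compile(r"\b(S_ISLNK|S_ISDIR|S_ISREG)\((\w+)->d_inode->i_mode\)")


def convert(line):
    """Rewrite S_ISxxx(E->d_inode->i_mode) into the matching d_is_xxx(E)."""
    return PATTERN.sub(lambda m: f"{RULES[m.group(1)]}({m.group(2)})", line)


print(convert("if (S_ISDIR(dentry->d_inode->i_mode))"))
print(convert("if (S_ISLNK(inode->i_mode))"))  # no d_inode: left untouched
```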

[AV: overlayfs parts skipped]

Signed-off-by: David Howells 
Signed-off-by: Al Viro

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs

2014-12-12T19:15:23+00:00

Pull btrfs update from Chris Mason:
 "From a feature point of view, most of the code here comes from Miao
  Xie and others at Fujitsu to implement scrubbing and replacing devices
  on raid56.  This has been in development for a while, and it's a big
  improvement.

  Filipe and Josef have a great assortment of fixes, many of which solve
  corruption problems either after a crash or in error conditions.  I
  still have a round two from Filipe for next week that solves
  corruptions with discard and block group removal"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
  Btrfs: make get_caching_control unconditionally return the ctl
  Btrfs: fix unprotected deletion from pending_chunks list
  Btrfs: fix fs mapping extent map leak
  Btrfs: fix memory leak after block remove + trimming
  Btrfs: make btrfs_abort_transaction consider existence of new block groups
  Btrfs: fix race between writing free space cache and trimming
  Btrfs: fix race between fs trimming and block group remove/allocation
  Btrfs, replace: enable dev-replace for raid56
  Btrfs: fix freeing used extents after removing empty block group
  Btrfs: fix crash caused by block group removal
  Btrfs: fix invalid block group rbtree access after bg is removed
  Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  ...

Btrfs: fix snapshot inconsistency after a file write followed by truncate

2014-11-25T15:41:23+00:00

If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possible to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.
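
The numbers in the dump above can be cross-checked with a few lines of
arithmetic. This is a sketch only; align_up and find_gaps are hypothetical
helper names, and the 4096 alignment is taken from the interval expression
above.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def find_gaps(items):
    """Return [start, end) file ranges not covered by any extent item.

    items is a list of (file_offset, num_bytes) sorted by file_offset.
    """
    gaps = []
    expected = 0
    for offset, num_bytes in items:
        if offset > expected:
            gaps.append((expected, offset))
        expected = offset + num_bytes
    return gaps


K = 1024
second_write_end = 16 * K + 32770              # end of the second pwrite
hole_start = align_up(second_write_end, 4096)  # key offset of item 123
hole_end = hole_start + 40960                  # end of the hole extent

# File extent items of inode 257 as shown in the snapshot's btree.
items = [(0, 32768), (53248, 40960)]

print(hole_start, hole_end, align_up(90123, 4096))
print(find_gaps(items))
```

The hole item starts at the aligned end of the second write and ends at the
aligned truncate size, so the only uncovered range is the one between the
first extent and the hole, which is where the missing write belongs.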

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) contains only the first write
3) contains only the 2 writes
4) contains both writes and the truncation

But the snapshot should never capture a state where only the first write and
the truncation were captured (since the second write was performed before the
truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason

Btrfs: ensure send always works on roots without orphans

2014-11-25T15:41:23+00:00

Move the logic from the snapshot creation ioctl into send. This avoids
doing the transaction commit if send isn't used, and ensures that if
a crash/reboot happens after the transaction commit that created the
snapshot and before the transaction commit that switched the commit
root, send will not get a commit root that differs from the main root
(that has orphan items).

Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason

btrfs: get rid of f_dentry use

2014-11-19T18:01:21+00:00

Signed-off-by: Al Viro