linux-toradex.git/fs, branch v3.10.78

ext4: fix data corruption caused by unwritten and delayed extents

2015-05-13T12:15:42+00:00

commit d2dc317d564a46dfc683978a2e5a4f91434e9711 upstream.

Currently it is possible to lose a whole file system block worth of data
when we hit a specific interaction between unwritten and delayed extents
in the extent status tree.

The problem is that when we insert a delayed extent into the extent
status tree, the only way to get rid of it is to write out the delayed
buffer. However, there is a limitation in the extent status tree
implementation: when inserting an unwritten extent, should there be even
a single delayed block, the whole unwritten extent is marked as delayed.

At this point, there is no way to get rid of the delayed extents,
because there are no delayed buffers to write out. So when we write
into said unwritten extent we will convert it to written, but it still
remains delayed.

When we try to write into that block later, ext4_da_map_blocks() will
set the buffer new and delayed and map it to an invalid block, which
causes the rest of the block to be zeroed, losing already written data.

For now we can fix this by simply not allowing the delayed status to be
set on a written extent in the extent status tree. Also add WARN_ON() to
make sure that we notice if this happens in the future.
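
The rule described above can be pictured with a small userspace sketch.
This is not the kernel patch: the flag names and es_status_for_insert()
are hypothetical stand-ins, and assert() plays the role of WARN_ON().

```c
#include <assert.h>

/* Hypothetical status bits, loosely modelled on the extent status tree. */
#define ES_WRITTEN   0x1u
#define ES_UNWRITTEN 0x2u
#define ES_DELAYED   0x4u

/*
 * Compute the status to store for an extent that is about to be inserted.
 * range_has_delalloc is nonzero when at least one block in the range is
 * delayed. The delayed bit is only added when the extent is not written.
 */
static unsigned int es_status_for_insert(unsigned int status,
                                         int range_has_delalloc)
{
    if (range_has_delalloc && !(status & ES_WRITTEN))
        status |= ES_DELAYED;

    /* Stand-in for the WARN_ON(): written and delayed must not coexist. */
    assert(!((status & ES_WRITTEN) && (status & ES_DELAYED)));
    return status;
}
```

In this model an unwritten extent overlapping a delayed block still
becomes delayed, while a written extent never does.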

This problem can be easily reproduced by running the following xfs_io
commands:

xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
          -c "falloc 0 131072" \
          -c "pwrite -S 0xbb 65536 2048" \
          -c "fsync" /mnt/test/fff

echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
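
A short sketch of the offset arithmetic behind the reproducer, assuming
4 KiB file system blocks (the block size is an assumption, and the helper
names are hypothetical). It shows that the 0xbb write at 65536 and the
0xdd write at 67584 land in the same block, so zeroing "the rest of the
block" on the second write would wipe the first one.

```c
#include <stdint.h>

#define FS_BLOCK_SIZE 4096u  /* assumption: 4 KiB file system blocks */

/* Index of the file system block that contains the given byte offset. */
static uint64_t block_of(uint64_t offset)
{
    return offset / FS_BLOCK_SIZE;
}

/* Nonzero when both byte offsets fall into the same block. */
static int same_block(uint64_t a, uint64_t b)
{
    return block_of(a) == block_of(b);
}
```

With these helpers, block_of(65536) and block_of(67584) both evaluate to
16, while the initial 0xaa write at 4096 sits in block 1.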

In theory this can also be reproduced at random by running fsx, but
that is not very reliable. On machines with a bigger page size (like
ppc) it can be seen more often (especially with xfstest generic/127).

Signed-off-by: Lukas Czerner 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

RCU pathwalk breakage when running into a symlink overmounting something

2015-05-06T19:56:27+00:00

commit 3cab989afd8d8d1bc3d99fef0e7ed87c31e7b647 upstream.

Calling unlazy_walk() in walk_component() and do_last() when we find
a symlink that needs to be followed doesn't acquire a reference to vfsmount.
That's fine when the symlink is on the same vfsmount as the parent directory
(which is almost always the case), but it's not always true - one _can_
manage to bind a symlink on top of something.  And in such cases we end up
with excessive mntput().

Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman

ext4: make fsync to sync parent dir in no-journal for real this time

2015-05-06T19:56:25+00:00

commit e12fb97222fc41e8442896934f76d39ef99b590a upstream.

Previously commit 14ece1028b3ed53ffec1b1213ffc6acaf79ad77c added
support for syncing the parent directory of newly created inodes to
make sure that the inode is not lost after a power failure in
no-journal mode.

However, this does not work in the majority of cases, namely:
 - if the directory has inline data
 - if the directory is already indexed
 - if the directory already has at least one block and:
	- the new entry fits into it
	- or we've successfully converted it to indexed

So in those cases we might lose the inode entirely even after fsync in
no-journal mode. This obviously also includes the ext2 default mode.

I noticed this while running xfstest generic/321: even though the test
is expected to fail (we need to run fsck after a crash in no-journal
mode), I could not find newly created entries even when they had been
fsynced before.

Fix this by adjusting the ext4_add_entry() successful exit paths to set
the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the
parent directory as well.
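
For comparison, applications that cannot rely on the file system to
persist the directory entry often fsync the parent directory themselves.
The following is a minimal userspace sketch of that pattern, not part of
this patch; create_durably() is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Create `name` inside `dirpath`, then fsync both the new file and the
 * directory that holds its entry. Returns 0 on success, -1 on error.
 */
static int create_durably(const char *dirpath, const char *name)
{
    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    int fd = openat(dirfd, name, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        close(dirfd);
        return -1;
    }

    int rc = 0;
    if (fsync(fd) < 0)       /* flush the new inode and its data */
        rc = -1;
    if (fsync(dirfd) < 0)    /* flush the directory entry pointing at it */
        rc = -1;

    close(fd);
    close(dirfd);
    return rc;
}
```

With the fix described above, the second fsync() should no longer be
needed on ext4 in no-journal mode for newly created entries, though it
remains harmless.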

Signed-off-by: Lukas Czerner 
Signed-off-by: Theodore Ts'o 
Reviewed-by: Jan Kara 
Cc: Frank Mayhar 
Signed-off-by: Greg Kroah-Hartman

fs/binfmt_elf.c: fix bug in loading of PIE binaries

2015-05-06T19:56:24+00:00

commit a87938b2e246b81b4fb713edb371a9fa3c5c3c86 upstream.

With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down
address allocation strategy, load_elf_binary() will attempt to map a PIE
binary into an address range immediately below mm->mmap_base.

Unfortunately, load_elf_binary() does not take account of the need to
allocate sufficient space for the entire binary which means that, while
the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent
PT_LOAD segment(s) end up being mapped above mm->mmap_base into the area
that is supposed to be the "gap" between the stack and the binary.

Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this
means that binaries with large data segments > 128MB can end up mapping
part of their data segment over their stack resulting in corruption of the
stack (and the data segment once the binary starts to run).

Any PIE binary with a data segment > 128MB is vulnerable to this although
address randomization means that the actual gap between the stack and the
end of the binary is normally greater than 128MB.  The larger the data
segment of the binary the higher the probability of failure.

Fix this by calculating the total size of the binary in the same way as
load_elf_interp().
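
A rough sketch of what "total size of the binary" means here: the span
from the page-aligned start of the first PT_LOAD segment to the end of
the last one. This is an illustration under stated assumptions, not the
kernel code: struct seg and total_span() are hypothetical, pages are
assumed to be 4 KiB, and segments are assumed to be sorted by ascending
virtual address.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ  4096u  /* assumption: 4 KiB pages */
#define SEG_LOAD 1u     /* same numeric value as PT_LOAD */

/* Hypothetical, trimmed-down stand-in for an ELF program header. */
struct seg {
    uint32_t type;   /* segment type, SEG_LOAD for loadable segments */
    uint64_t vaddr;  /* virtual address of the segment */
    uint64_t memsz;  /* size of the segment in memory */
};

/*
 * Size of the address range needed to hold every loadable segment,
 * or 0 if there is none.
 */
static uint64_t total_span(const struct seg *segs, size_t n)
{
    int first = -1, last = -1;

    for (size_t i = 0; i < n; i++) {
        if (segs[i].type != SEG_LOAD)
            continue;
        if (first < 0)
            first = (int)i;
        last = (int)i;
    }
    if (first < 0)
        return 0;

    /* Round the start of the first segment down to a page boundary. */
    uint64_t start = segs[first].vaddr & ~(uint64_t)(PAGE_SZ - 1);
    return segs[last].vaddr + segs[last].memsz - start;
}
```

Reserving a range of this size for the first mapping keeps the later
segments inside the reserved range instead of spilling into the gap.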

Signed-off-by: Michael Davidson 
Cc: Alexander Viro 
Cc: Jiri Kosina 
Cc: Kees Cook 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix inode eviction infinite loop after cloning into it

2015-05-06T19:56:21+00:00

commit ccccf3d67294714af2d72a6fd6fd7d73b01c9329 upstream.

If we attempt to clone a 0 length region into a file we can end up
inserting a range in the inode's extent_io tree with a start offset
that is greater than the end offset, which immediately triggers the
following warning:

[ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3914.620886] BTRFS: end < start 4095 4096
(...)
[ 3914.638093] Call Trace:
[ 3914.638636]  [] dump_stack+0x4c/0x65
[ 3914.639620]  [] warn_slowpath_common+0xa1/0xbb
[ 3914.640789]  [] ? insert_state+0x4b/0x10b [btrfs]
[ 3914.642041]  [] warn_slowpath_fmt+0x46/0x48
[ 3914.643236]  [] insert_state+0x4b/0x10b [btrfs]
[ 3914.644441]  [] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3914.645711]  [] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3914.646914]  [] ? _raw_spin_unlock+0x28/0x33
[ 3914.648058]  [] ? test_range_bit+0xcc/0xde [btrfs]
[ 3914.650105]  [] lock_extent+0x13/0x15 [btrfs]
[ 3914.651361]  [] lock_extent_range+0x3d/0xcd [btrfs]
[ 3914.652761]  [] btrfs_ioctl_clone+0x278/0x388 [btrfs]
[ 3914.654128]  [] ? might_fault+0x58/0xb5
[ 3914.655320]  [] btrfs_ioctl+0xb51/0x2195 [btrfs]
(...)
[ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---

This later makes the inode eviction handler enter an infinite loop that
keeps dumping the following warning over and over:

[ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3915.119913] BTRFS: end < start 4095 4096
(...)
[ 3915.137394] Call Trace:
[ 3915.137913]  [] dump_stack+0x4c/0x65
[ 3915.139154]  [] warn_slowpath_common+0xa1/0xbb
[ 3915.140316]  [] ? insert_state+0x4b/0x10b [btrfs]
[ 3915.141505]  [] warn_slowpath_fmt+0x46/0x48
[ 3915.142709]  [] insert_state+0x4b/0x10b [btrfs]
[ 3915.143849]  [] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3915.145120]  [] ? btrfs_kill_super+0x17/0x23 [btrfs]
[ 3915.146352]  [] ? deactivate_locked_super+0x3b/0x50
[ 3915.147565]  [] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3915.148785]  [] ? _raw_write_unlock+0x28/0x33
[ 3915.149931]  [] btrfs_evict_inode+0x196/0x482 [btrfs]
[ 3915.151154]  [] evict+0xa0/0x148
[ 3915.152094]  [] dispose_list+0x39/0x43
[ 3915.153081]  [] evict_inodes+0xdc/0xeb
[ 3915.154062]  [] generic_shutdown_super+0x49/0xef
[ 3915.155193]  [] kill_anon_super+0x13/0x1e
[ 3915.156274]  [] btrfs_kill_super+0x17/0x23 [btrfs]
(...)
[ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---

So just bail out of the clone ioctl if the length of the region to clone
is zero, without locking any extent range, in order to prevent this issue
(same behaviour as a pwrite with a 0 length for example).
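
The arithmetic behind the warning can be sketched as follows, assuming
the locked range uses an inclusive end offset of off + len - 1. With
off = 4096 and len = 0 that gives end = 4095 < start = 4096, which
matches the "end < start 4095 4096" message. struct range and
clone_lock_range() are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Byte range with an inclusive end offset. */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Compute the range to lock for a clone of `len` bytes at offset `off`.
 * Returns 1 without touching *out when there is nothing to clone, and
 * 0 with *out filled in otherwise.
 */
static int clone_lock_range(uint64_t off, uint64_t len, struct range *out)
{
    if (len == 0)
        return 1;            /* bail out before any range is locked */

    out->start = off;
    out->end = off + len - 1;
    return 0;
}
```

Without the early return, the len == 0 case would produce a range whose
end lies before its start.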

This is trivial to reproduce. For example, the steps for the test I just
made for fstests:

  mkfs.btrfs -f $SCRATCH_DEV
  mount $SCRATCH_DEV $SCRATCH_MNT

  touch $SCRATCH_MNT/foo
  touch $SCRATCH_MNT/bar

  $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
  umount $SCRATCH_MNT

A test case for fstests follows soon.

Signed-off-by: Filipe Manana 
Reviewed-by: Omar Sandoval 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix log tree corruption when fs mounted with -o discard

2015-05-06T19:56:21+00:00

commit dcc82f4783ad91d4ab654f89f37ae9291cdc846a upstream.

While committing a transaction we free the log roots before we write the
new super block. Freeing the log roots implies marking the disk location
of every node/leaf (metadata extent) as pinned before the new super block
is written. This is to prevent the disk location of log metadata extents
from being reused before the new super block is written, otherwise we
would have a corrupted log tree if before the new super block is written
a crash/reboot happens and the location of any log tree metadata extent
ended up being reused and rewritten.

Even though we pinned the log tree's metadata extents, we were issuing a
discard against them if the fs was mounted with the -o discard option,
resulting in corruption of the log tree if a crash/reboot happened before
writing the new super block - the next time the fs was mounted, during
the log replay process we would find nodes/leafs of the log btree filled
with zeroes, causing the process to fail and requiring the use of the
btrfs-zero-log tool to wipe out the log tree (with all previously
fsynced data being lost forever).

Fix this by not doing a discard when pinning an extent. The discard will
be done later when it's safe (after the new super block is committed) at
extent-tree.c:btrfs_finish_extent_commit().

Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log)
Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

dcache: Fix locking bugs in backported "deal with deadlock in d_walk()"

2015-04-29T08:34:02+00:00

commit 20defcec264ceab2630356fb9d397f3d237b5e6d upstream in 3.2-stable

Steven Rostedt reported:
> Porting -rt to the latest 3.2 stable tree I triggered this bug:
>
> =====================================
> [ BUG: bad unlock balance detected! ]
> -------------------------------------
> rm/1638 is trying to release lock (rcu_read_lock) at:
> [] rcu_read_unlock+0x0/0x23
> but there are no more locks to release!
>
> other info that might help us debug this:
> 2 locks held by rm/1638:
>  #0:  (&sb->s_type->i_mutex_key#9/1){+.+.+.}, at: [] do_rmdir+0x5f/0xd2
>  #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] vfs_rmdir+0x49/0xac
>
> stack backtrace:
> Pid: 1638, comm: rm Not tainted 3.2.66-test-rt96+ #2
> Call Trace:
>  [] ? printk+0x1d/0x1f
>  [] print_unlock_inbalance_bug+0xc3/0xcd
>  [] lock_release_non_nested+0x98/0x1ec
>  [] ? trace_hardirqs_off_caller+0x18/0x90
>  [] ? local_clock+0x2d/0x50
>  [] ? d_hash+0x2f/0x2f
>  [] ? d_hash+0x2f/0x2f
>  [] lock_release+0x192/0x1ad
>  [] rcu_read_unlock+0x17/0x23
>  [] shrink_dcache_parent+0x227/0x270
>  [] vfs_rmdir+0x68/0xac
>  [] do_rmdir+0x98/0xd2
>  [] ? fput+0x1a3/0x1ab
>  [] ? sysenter_exit+0xf/0x1a
>  [] ? trace_hardirqs_on_caller+0x118/0x149
>  [] sys_unlinkat+0x2b/0x35
>  [] sysenter_do_call+0x12/0x12
>
> There's a path to calling rcu_read_unlock() without calling
> rcu_read_lock() in have_submounts().
>
> 	goto positive;
>
> positive:
> 	if (!locked && read_seqretry(&rename_lock, seq))
> 		goto rename_retry;
>
> rename_retry:
> 	rcu_read_unlock();
>
> in the above path, rcu_read_lock() is never done before calling
> rcu_read_unlock();

I reviewed locking contexts in all three functions that I changed when
backporting "deal with deadlock in d_walk()".  It's actually worse
than this:

- We don't hold this_parent->d_lock at the 'positive' label in
  have_submounts(), but it is unlocked after 'rename_retry'.
- There is an rcu_read_unlock() after the 'out' label in
  select_parent(), but it's not held at the 'goto out'.

Fix all three lock imbalances.

Reported-by: Steven Rostedt 
Signed-off-by: Ben Hutchings 
Tested-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman

deal with deadlock in d_walk()

2015-04-29T08:34:00+00:00

commit ca5358ef75fc69fee5322a38a340f5739d997c10 upstream.

... by not hitting rename_retry for reasons other than rename having
happened.  In other words, do _not_ restart when finding that
between unlocking the child and locking the parent the former got
into __dentry_kill().  Skip the killed siblings instead...

Signed-off-by: Al Viro 
Cc: Ben Hutchings 
[hujianyang: Backported to 3.10 based on the work of Ben Hutchings in 3.2:
 - As we only have try_to_ascend() and not d_walk(), apply this
   change to all callers of try_to_ascend()
 - Adjust context to make __dentry_kill() apply to d_kill()]
Signed-off-by: hujianyang 
Signed-off-by: Greg Kroah-Hartman

move d_rcu from overlapping d_child to overlapping d_alias

2015-04-29T08:34:00+00:00

commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream.

Signed-off-by: Al Viro 
Cc: Ben Hutchings 
[hujianyang: Backported to 3.10 based on the work of Ben Hutchings in 3.2:
 - Apply name changes in all the different places we use d_alias and d_child
 - Move the WARN_ON() in __d_free() to d_free() as we don't have dentry_free()]
Signed-off-by: hujianyang 
Signed-off-by: Greg Kroah-Hartman

splice: Apply generic position and size checks to each write

2015-04-29T08:33:57+00:00

commit 894c6350eaad7e613ae267504014a456e00a3e2a from the 3.2-stable branch.

We need to check the position and size of file writes against various
limits, using generic_write_checks().  This was not being done for
the splice write path.  It was fixed upstream by commit 8d0207652cbe
("->splice_write() via ->write_iter()") but we can't apply that.
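
As a rough illustration of the kind of check involved, the sketch below
clamps a write so that it never extends past a size limit and rejects
writes that start at or beyond it. It is a heavily simplified model, not
the kernel's generic_write_checks(); toy_write_checks() and its single
`limit` parameter are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate a write of *count bytes at position pos against limit.
 * Returns 0 on success, shrinking *count if the write would cross the
 * limit, or -EFBIG if a non-empty write starts at or past the limit.
 */
static int toy_write_checks(uint64_t pos, uint64_t *count, uint64_t limit)
{
    if (*count == 0)
        return 0;            /* empty writes are always acceptable */
    if (pos >= limit)
        return -EFBIG;       /* nothing can be written at this position */
    if (*count > limit - pos)
        *count = limit - pos; /* truncate to the bytes that still fit */
    return 0;
}
```

Applying such a check to each write in the splice path, rather than only
in the regular write path, is the point of the change described above.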

CVE-2014-7822

Signed-off-by: Ben Hutchings 
[Ben fixed it in 3.2 stable, I ported it to 3.10 stable]
Signed-off-by: Zhang Zhen 
Signed-off-by: Greg Kroah-Hartman