linux-toradex.git/fs/ext4, branch v3.14.57

ext4: replace open coded nofail allocation in ext4_free_blocks()

2015-08-03T16:29:53+00:00

commit 7444a072c387a93ebee7066e8aee776954ab0e41 upstream.

ext4_free_blocks() loops around the allocation request and mimics
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open coded loop and replace it with __GFP_NOFAIL. Without the
flag the allocator has no way to learn of the never-fail requirement and
cannot help in any way.
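
The difference can be sketched in userspace C. This is only an
illustration, not the kernel change: SKETCH_NOFAIL, sketch_alloc() and
the failure counter are hypothetical names that stand in for
__GFP_NOFAIL and the page allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag, standing in for the kernel's __GFP_NOFAIL. */
#define SKETCH_NOFAIL 0x1u

/* Test hook: number of upcoming attempts that should fail. */
static int sketch_failures_left;
/* Counts how often the allocator ran its fallback path. */
static int sketch_reclaim_calls;

/* One raw allocation attempt; fails while sketch_failures_left > 0. */
static void *sketch_try_alloc(size_t size)
{
	if (sketch_failures_left > 0) {
		sketch_failures_left--;
		return NULL;
	}
	return malloc(size);
}

/* Placeholder for a fallback strategy such as reclaiming memory. */
static void sketch_reclaim(void)
{
	sketch_reclaim_calls++;
}

/*
 * Toy allocator. Because the caller states the never-fail requirement
 * through the flag, the allocator can run its fallback and retry.
 */
static void *sketch_alloc(size_t size, unsigned int flags)
{
	assert(size > 0);
	for (;;) {
		void *p = sketch_try_alloc(size);

		if (p || !(flags & SKETCH_NOFAIL))
			return p;
		sketch_reclaim();
	}
}

/* Open-coded variant: the allocator only ever sees plain requests. */
static void *sketch_alloc_open_coded(size_t size)
{
	void *p;

	do {
		p = sketch_alloc(size, 0);
	} while (!p);
	return p;
}
```

Both variants eventually return memory, but only sketch_alloc() with
SKETCH_NOFAIL gives the allocator a chance to run its fallback path.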

Signed-off-by: Michal Hocko 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

ext4: correctly migrate a file with a hole at the beginning

2015-08-03T16:29:53+00:00

commit 8974fec7d72e3e02752fe0f27b4c3719c78d9a15 upstream.

Currently ext4_ind_migrate() doesn't correctly handle a file which
contains a hole at the beginning of the file.  This causes the migration
to be done incorrectly, and a subsequent delayed allocation write to
the "hole" then reclaims the same data blocks again, which results in
fs corruption.

  # assuming 4k block size ext4, with delalloc enabled
  # skip the first block and write to the second block
  xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile

  # converting to indirect-mapped file, which would move the data blocks
  # to the beginning of the file, but extent status cache still marks
  # that region as a hole
  chattr -e /mnt/ext4/testfile

  # delayed allocation writes to the "hole", reclaim the same data block
  # again, results in i_blocks corruption
  xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
  umount /mnt/ext4
  e2fsck -nf /dev/sda6
  ...
  Inode 53, i_blocks is 16, should be 8.  Fix? no
  ...
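
The idea behind the fix can be sketched as follows. This is a
simplified userspace model, not the kernel code: struct sketch_extent
and sketch_extent_to_direct() are hypothetical names, and only the
direct block slots are modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NDIR_BLOCKS 12	/* direct block slots in the inode */

/* Simplified stand-in for a single on-disk extent. */
struct sketch_extent {
	uint32_t start;		/* first logical block covered */
	uint32_t len;		/* number of blocks */
	uint32_t pblk;		/* first physical block */
};

/*
 * Fill the direct-block array from one extent. Slot i must describe
 * logical block i, so the copy starts at ex->start, not at slot 0.
 * Returns 0 on success, -1 if the extent does not fit the direct slots.
 */
static int sketch_extent_to_direct(const struct sketch_extent *ex,
				   uint32_t *i_data)
{
	uint32_t i, blk;

	assert(ex && i_data);
	memset(i_data, 0, SKETCH_NDIR_BLOCKS * sizeof(*i_data));
	if (ex->start > SKETCH_NDIR_BLOCKS ||
	    ex->len > SKETCH_NDIR_BLOCKS - ex->start)
		return -1;
	blk = ex->pblk;
	for (i = ex->start; i < ex->start + ex->len; i++)
		i_data[i] = blk++;
	return 0;
}
```

For the test file above (one block written at logical block 1), slot 0
stays empty and slot 1 receives the physical block, so the hole at the
start of the file is preserved.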

Signed-off-by: Eryu Guan 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

ext4: be more strict when migrating to non-extent based file

2015-08-03T16:29:53+00:00

commit d6f123a9297496ad0b6335fe881504c4b5b2a5e5 upstream.

Currently the checks done in ext4_ind_migrate() before the real
conversion are not sufficient:

a) delayed allocated extents could bypass the check on eh->eh_entries
   and eh->eh_depth

This can be demonstrated by this script

  xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
  chattr -e /mnt/ext4/testfile

where testfile has two extents but is still converted to the non-extent
based file format.

b) only extent length is checked but not the offset, which would result
   in data loss (delalloc) or fs corruption (nodelalloc), because a
   non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
   blocks

This can be demonstrated by

  xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
  chattr -e /mnt/ext4/testfile
  sync

If delalloc is enabled, dmesg prints
  EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
  EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
  EXT4-fs (dm-4): This should not happen!! Data will be lost

If delalloc is disabled, e2fsck -nf shows corruption
  Inode 53, i_size is 5497558142976, should be 4096.  Fix? no

Fix the two issues by

a) forcing all delayed allocation blocks to be allocated before checking
   eh->eh_depth and eh->eh_entries
b) requiring that the last logical block of the extent is within the
   direct map
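
The numbers in the dmesg output above can be checked with a little
arithmetic. The helpers below are hypothetical and assume 4-byte block
numbers; with 4k blocks each indirect block then holds 1024 entries,
which gives the (12 + 2^10 + 2^20 + 2^30) limit quoted in b).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Highest block count an indirect-mapped file can address: 12 direct
 * slots plus single, double and triple indirect trees. Each indirect
 * block holds block_size / 4 32-bit block numbers.
 */
static uint64_t sketch_max_ind_blocks(uint32_t block_size)
{
	uint64_t n = block_size / 4;

	return 12 + n + n * n + n * n * n;
}

/* Logical block index that a byte offset falls into. */
static uint64_t sketch_offset_to_block(uint64_t offset, uint32_t block_size)
{
	assert(block_size != 0);
	return offset / block_size;
}
```

With 4k blocks the limit is 1074791436 blocks, while the write at 5T
lands in logical block 1342177280, the block number reported by
ext4_block_to_path in the warning.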

Signed-off-by: Eryu Guan 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

ext4: fix reservation release on invalidatepage for delalloc fs

2015-08-03T16:29:53+00:00

commit 9705acd63b125dee8b15c705216d7186daea4625 upstream.

On a delalloc enabled file system, on an invalidatepage operation,
ext4_da_page_release_reservation() should clear the delayed buffer and
remove the extent covering the delayed buffer from the extent status
tree.

However, there is currently a bug: on systems with page size > block
size we always remove extents from the start of the page, regardless of
where the actual delayed buffers are positioned in the page. This leads
to errors like this:

EXT4-fs warning (device loop0): ext4_da_release_space:1225:
ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
blocks

This, however, can cause data loss at writeback time if the file system
is in an ENOSPC condition, because we're releasing the reservation for
someone else's delayed buffer.

Fix this by only removing extents that correspond to the part of the
page we want to invalidate.
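
A rough sketch of the range handling, assuming a page split into
equally sized buffers. sketch_delayed_in_range() and its parameters
are made up for illustration and only model the bookkeeping, not the
kernel's buffer_head walk.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count the delayed buffers that lie completely inside the invalidated
 * byte range [offset, offset + length) of one page, and report the
 * index (within the page) of the first such buffer through *first.
 * delayed[i] says whether buffer i of the page is delayed.
 */
static unsigned int sketch_delayed_in_range(const bool *delayed,
					    unsigned int nr_bufs,
					    unsigned int blocksize,
					    unsigned int offset,
					    unsigned int length,
					    unsigned int *first)
{
	unsigned int stop = offset + length;
	unsigned int i, count = 0;

	assert(delayed && first && blocksize != 0);
	for (i = 0; i < nr_bufs; i++) {
		unsigned int curr_off = i * blocksize;
		unsigned int next_off = curr_off + blocksize;

		if (next_off > stop)
			break;
		if (curr_off >= offset && delayed[i]) {
			if (count == 0)
				*first = i;
			count++;
		}
	}
	return count;
}
```

With four 1k buffers in a 4k page and only the second half of the page
invalidated, the first buffer to release is buffer 2, not buffer 0.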

This problem is reproducible with the following fio recipe (however, I
was only able to reproduce it with fio-2.1 or older):

[global]
bs=8k
iodepth=1024
iodepth_batch=60
randrepeat=1
size=1m
directory=/mnt/test
numjobs=20
[job1]
ioengine=sync
bs=1k
direct=1
rw=randread
filename=file1:file2
[job2]
ioengine=libaio
rw=randwrite
direct=1
filename=file1:file2
[job3]
bs=1k
ioengine=posixaio
rw=randwrite
direct=1
filename=file1:file2
[job5]
bs=1k
ioengine=sync
rw=randread
filename=file1:file2
[job7]
ioengine=libaio
rw=randwrite
filename=file1:file2
[job8]
ioengine=posixaio
rw=randwrite
filename=file1:file2
[job10]
ioengine=mmap
rw=randwrite
bs=1k
filename=file1:file2
[job11]
ioengine=mmap
rw=randwrite
direct=1
filename=file1:file2

Signed-off-by: Lukas Czerner 
Signed-off-by: Theodore Ts'o 
Reviewed-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman

ext4: don't retry file block mapping on bigalloc fs with non-extent file

2015-08-03T16:29:53+00:00

commit 292db1bc6c105d86111e858859456bcb11f90f91 upstream.

ext4 isn't willing to map clusters to a non-extent file.  Don't signal
this with an out of space error, since the FS will retry the
allocation (which didn't fail) forever.  Instead, return EUCLEAN so
that the operation will fail immediately all the way back to userspace.

(The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.)
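
Why the error code matters can be sketched with a generic retry loop.
This is not the ext4 retry logic: sketch_map_with_retry() and the two
mapper callbacks are hypothetical, and the loop is bounded here so
that it terminates.

```c
#include <assert.h>
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value; fallback for other platforms */
#endif

/* Hypothetical mapping callback: returns 0 or a negative errno. */
typedef int (*sketch_map_fn)(void);

/*
 * Retry only while the mapper reports -ENOSPC, on the assumption that
 * space may free up. Any other error, such as -EUCLEAN, ends the loop
 * at once. *attempts receives the number of calls made.
 */
static int sketch_map_with_retry(sketch_map_fn map, int max_retries,
				 int *attempts)
{
	int ret, tries = 0;

	assert(map && attempts);
	do {
		ret = map();
		tries++;
	} while (ret == -ENOSPC && tries <= max_retries);
	*attempts = tries;
	return ret;
}

/* Mapper that models the old behaviour: a misleading "no space". */
static int sketch_map_enospc(void)
{
	return -ENOSPC;
}

/* Mapper that models the new behaviour: fail with -EUCLEAN. */
static int sketch_map_euclean(void)
{
	return -EUCLEAN;
}
```

The -ENOSPC mapper uses up every retry, while the -EUCLEAN mapper
returns to the caller after a single attempt.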

Signed-off-by: Darrick J. Wong 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

ext4: call sync_blockdev() before invalidate_bdev() in put_super()

2015-08-03T16:29:53+00:00

commit 89d96a6f8e6491f24fc8f99fd6ae66820e85c6c1 upstream.

Normally all of the buffers will have been forced out to disk before
we call invalidate_bdev(), but in some cases, where a file system
operation was aborted due to an ext4_error(), there may still be some
dirty buffers in the buffer cache for the device.  So try to force
them out to disk before calling invalidate_bdev().
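
The ordering can be illustrated with a toy buffer cache. This model
assumes that invalidation only drops clean buffers; struct
sketch_cache and both helpers are hypothetical and merely mimic the
roles of sync_blockdev() and invalidate_bdev().

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NBUF 4

/* Toy buffer cache for one device. */
struct sketch_cache {
	bool present[SKETCH_NBUF];	/* buffer is cached */
	bool dirty[SKETCH_NBUF];	/* buffer has unwritten changes */
	int written;			/* buffers written to the device */
};

/* Write every dirty buffer out, loosely modelled on sync_blockdev(). */
static void sketch_sync(struct sketch_cache *c)
{
	assert(c);
	for (int i = 0; i < SKETCH_NBUF; i++) {
		if (c->present[i] && c->dirty[i]) {
			c->dirty[i] = false;
			c->written++;
		}
	}
}

/*
 * Drop clean buffers only, loosely modelled on invalidate_bdev().
 * Returns how many dirty buffers were left behind in the cache.
 */
static int sketch_invalidate(struct sketch_cache *c)
{
	int left = 0;

	assert(c);
	for (int i = 0; i < SKETCH_NBUF; i++) {
		if (!c->present[i])
			continue;
		if (c->dirty[i])
			left++;
		else
			c->present[i] = false;
	}
	return left;
}
```

Calling sketch_invalidate() alone leaves the dirty buffer in the
cache; calling sketch_sync() first lets the invalidation empty it.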

This fixes a warning triggered by generic/081:

WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()

Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

ext4: fix race between truncate and __ext4_journalled_writepage()

2015-08-03T16:29:52+00:00

commit bdf96838aea6a265f2ae6cbcfb12a778c84a0b8e upstream.

The commit cf108bca465d: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock.  However, this
introduced a potential race between a truncate and writeback in
data=journalled mode.

Fix this by grabbing the page lock after starting the journal handle,
and then checking to see if the page had gotten truncated out from
under us.
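
The recheck pattern can be sketched with toy types. The structures and
sketch_journalled_writepage() below are hypothetical stand-ins; locking
is reduced to a flag and the journal handle is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's address_space and page. */
struct sketch_mapping {
	int id;
};

struct sketch_page {
	const struct sketch_mapping *mapping;	/* NULL once truncated */
	int locked;
};

enum sketch_result {
	SKETCH_WRITTEN,
	SKETCH_SKIPPED_TRUNCATED
};

/*
 * Meant to run after the journal handle has been started. Take the
 * page lock again and only then decide whether the page still belongs
 * to the file; a truncate may have detached it while the lock was
 * dropped.
 */
static enum sketch_result
sketch_journalled_writepage(struct sketch_page *page,
			    const struct sketch_mapping *mapping)
{
	enum sketch_result ret = SKETCH_WRITTEN;

	assert(page && mapping);
	page->locked = 1;			/* models lock_page() */
	if (page->mapping != mapping)
		ret = SKETCH_SKIPPED_TRUNCATED;	/* nothing left to journal */
	/* ... otherwise the buffers would be journalled here ... */
	page->locked = 0;			/* models unlock_page() */
	return ret;
}
```

A page whose mapping no longer matches is skipped instead of having
its buffers journalled.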

This fixes a number of different warnings or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:

jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7c0, 164), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0

	      	      	  - and -

kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
    ...
Call Trace:
 [] ? __ext4_journalled_invalidatepage+0x117/0x117
 [] __ext4_journalled_invalidatepage+0x10f/0x117
 [] ? __ext4_journalled_invalidatepage+0x117/0x117
 [] ? lock_buffer+0x36/0x36
 [] ext4_journalled_invalidatepage+0xd/0x22
 [] do_invalidatepage+0x22/0x26
 [] truncate_inode_page+0x5b/0x85
 [] truncate_inode_pages_range+0x156/0x38c
 [] truncate_inode_pages+0x11/0x15
 [] truncate_pagecache+0x55/0x71
 [] ext4_setattr+0x4a9/0x560
 [] ? current_kernel_time+0x10/0x44
 [] notify_change+0x1c7/0x2be
 [] do_truncate+0x65/0x85
 [] ? file_ra_state_init+0x12/0x29

	      	      	  - and -

WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
    ...
Call Trace:
 [] ? console_unlock+0x3a1/0x3ce
 [] dump_stack+0x48/0x60
 [] warn_slowpath_common+0x89/0xa0
 [] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
 [] warn_slowpath_null+0x14/0x18
 [] jbd2_journal_dirty_metadata+0x14a/0x1ae
 [] __ext4_handle_dirty_metadata+0xd4/0x19d
 [] write_end_fn+0x40/0x53
 [] ext4_walk_page_buffers+0x4e/0x6a
 [] ext4_writepage+0x354/0x3b8
 [] ? mpage_release_unused_pages+0xd4/0xd4
 [] ? wait_on_buffer+0x2c/0x2c
 [] ? ext4_writepage+0x3b8/0x3b8
 [] __writepage+0x10/0x2e
 [] write_cache_pages+0x22d/0x32c
 [] ? ext4_writepage+0x3b8/0x3b8
 [] ext4_writepages+0x102/0x607
 [] ? sched_clock_local+0x10/0x10e
 [] ? __lock_is_held+0x2e/0x44
 [] ? lock_is_held+0x43/0x51
 [] do_writepages+0x1c/0x29
 [] __writeback_single_inode+0xc3/0x545
 [] writeback_sb_inodes+0x21f/0x36d
    ...

Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

ext4: check for zero length extent explicitly

2015-06-06T15:19:36+00:00

commit 2f974865ffdfe7b9f46a9940836c8b167342563d upstream.

The following commit introduced a bug when checking for zero length
extents:

5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries()

A zero length extent could pass the check if lblock is zero.

Add the explicit check for zero length back.
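
How a zero length extent at lblock 0 can slip past a pure range check
is easiest to see with unsigned arithmetic. The two functions below
are a hypothetical sketch, not a copy of the kernel's validity check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Range check alone: relies on lblock > last to catch bad lengths. */
static bool sketch_valid_range_only(uint32_t lblock, uint32_t len)
{
	uint32_t last = lblock + len - 1;	/* wraps when lblock + len == 0 */

	return lblock <= last;
}

/* Range check plus the explicit zero-length test. */
static bool sketch_valid_extent(uint32_t lblock, uint32_t len)
{
	uint32_t last = lblock + len - 1;

	if (len == 0 || lblock > last)
		return false;
	return true;
}
```

For lblock 5 and len 0 the range check alone already rejects the
extent, because last becomes 4. For lblock 0 and len 0, last wraps to
the largest 32-bit value, so only the explicit len == 0 test catches it.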

Signed-off-by: Eryu Guan 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

ext4: fix NULL pointer dereference when journal restart fails

2015-06-06T15:19:35+00:00

commit 9d506594069355d1fb2de3f9104667312ff08ed3 upstream.

Currently when journal restart fails, we'll have the h_transaction of
the handle set to NULL to indicate that the handle has been effectively
aborted. We handle this situation quietly in jbd2_journal_stop() and
just free the handle and exit because everything else has been done
before we attempted (and failed) to restart the journal.

Unfortunately there are a number of problems with that approach
introduced with commit

41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart()
fails"

First of all in ext4 jbd2_journal_stop() will be called through
__ext4_journal_stop() where we would try to get a hold of the superblock
by dereferencing h_transaction which in this case would lead to NULL
pointer dereference and crash.

In addition we're going to free the handle regardless of the refcount
which is bad as well, because others up the call chain will still
reference the handle so we might potentially reference already freed
memory.

Moreover it's expected that we'll get an aborted handle as well as a
detached handle in some of the journalling functions as the error
propagates up the stack, so it's unnecessary to call WARN_ON every time
we get a detached handle.

And finally we might leak some memory by forgetting to free the reserved
handle in jbd2_journal_stop() in the case where the handle was detached
from the transaction (h_transaction is NULL).

Fix the NULL pointer dereference in __ext4_journal_stop() by just
calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
the potential memory leak in jbd2_journal_stop() and use proper
handle refcounting before we attempt to free it to avoid use-after-free
issues.

And finally remove all WARN_ON(!transaction) from the code so that we do
not get random traces when something goes wrong, because when a journal
restart fails we will reach some of those functions.
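
The refcounting part can be sketched with a toy handle. Field and
function names below are hypothetical and only loosely follow jbd2;
the work done for handles that are still attached to a transaction is
left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy journal handle; field names loosely follow jbd2's handle_t. */
struct sketch_handle {
	void *transaction;		/* NULL once the handle is detached */
	int ref;			/* nested users of this handle */
	struct sketch_handle *rsv;	/* optional reserved handle */
};

static int sketch_frees;		/* counts freed handles, for testing */

static void sketch_free_handle(struct sketch_handle *h)
{
	free(h);
	sketch_frees++;
}

/*
 * Stop a handle. A detached handle (transaction == NULL) is not an
 * error here: drop one reference, and free the handle together with
 * its reserved handle only when the last reference goes away.
 * Returns the remaining reference count.
 */
static int sketch_journal_stop(struct sketch_handle *h)
{
	assert(h && h->ref > 0);
	if (--h->ref > 0)
		return h->ref;
	if (h->rsv)
		sketch_free_handle(h->rsv);
	sketch_free_handle(h);
	return 0;
}
```

With two users of a detached handle, the first stop only drops a
reference; the second one frees both the handle and its reserved
handle.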

Signed-off-by: Lukas Czerner 
Signed-off-by: Theodore Ts'o 
Reviewed-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman

ext4: fix data corruption caused by unwritten and delayed extents

2015-05-13T12:16:57+00:00

commit d2dc317d564a46dfc683978a2e5a4f91434e9711 upstream.

Currently it is possible to lose a whole file system block's worth of
data when we hit a specific interaction between unwritten and delayed
extents in the extent status tree.

The problem is that when we insert a delayed extent into the extent
status tree, the only way to get rid of it is to write out the delayed
buffer. However, there is a limitation in the extent status tree
implementation: when inserting an unwritten extent, should there be even
a single delayed block, the whole unwritten extent is marked as delayed.

At this point, there is no way to get rid of the delayed extents,
because there are no delayed buffers to write out. So when we write
into said unwritten extent we will convert it to written, but it still
remains delayed.

When we try to write into that block later, ext4_da_map_blocks() will
set the buffer new and delayed and map it to an invalid block, which
causes the rest of the block to be zeroed, losing already written data.

For now we can fix this by simply not allowing the delayed status to be
set on a written extent in the extent status tree. Also add WARN_ON() to
make sure that we notice if this happens in the future.
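
The rule can be sketched with plain status bits. The SKETCH_ES_*
values and sketch_es_status() are hypothetical and only loosely
modelled on the extent status flags; an assert() plays the role of the
added WARN_ON().

```c
#include <assert.h>

/* Hypothetical status bits, loosely modelled on EXTENT_STATUS_*. */
#define SKETCH_ES_WRITTEN	0x1u
#define SKETCH_ES_UNWRITTEN	0x2u
#define SKETCH_ES_DELAYED	0x4u

/*
 * Compute the status to store for a mapped extent. has_delayed_block
 * says whether any block in the range is still marked delayed. The
 * delayed bit is never added to a written extent.
 */
static unsigned int sketch_es_status(unsigned int status,
				     int has_delayed_block)
{
	if (has_delayed_block && !(status & SKETCH_ES_WRITTEN))
		status |= SKETCH_ES_DELAYED;
	/* Mirrors the intent of the added WARN_ON(): must never trigger. */
	assert(!((status & SKETCH_ES_DELAYED) &&
		 (status & SKETCH_ES_WRITTEN)));
	return status;
}
```

An unwritten extent with a delayed block still picks up the delayed
bit, while a written extent keeps its status unchanged.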

This problem can be easily reproduced by running the following xfs_io
commands:

xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
          -c "falloc 0 131072" \
          -c "pwrite -S 0xbb 65536 2048" \
          -c "fsync" /mnt/test/fff

echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff

In theory this can also be reproduced at random by running fsx, but
that is not very reliable; on machines with a bigger page size (like
ppc) it is seen more often (especially with xfstests generic/127).

Signed-off-by: Lukas Czerner 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman