linux-toradex.git/fs, branch v3.5.3

ext4: fix kernel BUG on large-scale rm -rf commands

2012-08-26T02:31:42+00:00

commit 89a4e48f8479f8145eca9698f39fe188c982212f upstream.

Commit 968dee7722: "ext4: fix hole punch failure when depth is greater
than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel
crashes when users ran run "rm -rf" on large directory hierarchy on
ext4 filesystems on RAID devices:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

    Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710)
    Call Trace:
     [] ? __ext4_handle_dirty_metadata+0x83/0x110
     [] ext4_ext_truncate+0x193/0x1d0
     [] ? ext4_mark_inode_dirty+0x7f/0x1f0
     [] ext4_truncate+0xf5/0x100
     [] ext4_evict_inode+0x461/0x490
     [] evict+0xa2/0x1a0
     [] iput+0x103/0x1f0
     [] do_unlinkat+0x154/0x1c0
     [] ? sys_newfstatat+0x2a/0x40
     [] sys_unlinkat+0x1b/0x50
     [] system_call_fastpath+0x16/0x1b
    Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00

    RIP  [] ext4_ext_remove_space+0xa34/0xdf0

This could be reproduced as follows:

The problem in commit 968dee7722 was that caused the variable 'i' to
be left uninitialized if the truncate required more space than was
available in the journal.  This resulted in the function
ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused
ext4_ext_remove_space() to restart the truncate operation after
starting a new jbd2 handle.

Reported-by: Maciej Żenczykowski 
Reported-by: Marti Raudsepp 
Tested-by: Fengguang Wu 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: fix long mount times on very big file systems

2012-08-26T02:31:42+00:00

commit 0548bbb85337e532ca2ed697c3e9b227ff2ed4b4 upstream.

Commit 8aeb00ff85a: "ext4: fix overhead calculation used by
ext4_statfs()" introduced a O(n**2) calculation which makes very large
file systems take forever to mount.  Fix this with an optimization for
non-bigalloc file systems.  (For bigalloc file systems the overhead
needs to be set in the the superblock.)

Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: don't call ext4_error while block group is locked

2012-08-26T02:31:42+00:00

commit 7a4c5de27efa4c2ecca87af0a3deea63446367e2 upstream.

While in ext4_validate_block_bitmap(), if an block allocation bitmap
is found to be invalid, we call ext4_error() while the block group is
still locked.  This causes ext4_commit_super() to call a function
which might sleep while in an atomic context.

There's no need to keep the block group locked at this point, so hoist
the ext4_error() call up to ext4_validate_block_bitmap() and release
the block group spinlock before calling ext4_error().

The reported stack trace can be found at:

	http://article.gmane.org/gmane.comp.file-systems.ext4/33731

Reported-by: Dave Jones 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: avoid kmemcheck complaint from reading uninitialized memory

2012-08-26T02:31:42+00:00

commit 7e731bc9a12339f344cddf82166b82633d99dd86 upstream.

Commit 03179fe923 introduced a kmemcheck complaint in
ext4_da_get_block_prep() because we save and restore
ei->i_da_metadata_calc_last_lblock even though it is left
uninitialized in the case where i_da_metadata_calc_len is zero.

This doesn't hurt anything, but silencing the kmemcheck complaint
makes it easier for people to find real bugs.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=45631
(which is marked as a regression).

Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: make sure the journal sb is written in ext4_clear_journal_err()

2012-08-26T02:31:42+00:00

commit d796c52ef0b71a988364f6109aeb63d79c5b116b upstream.

After we transfer set the EXT4_ERROR_FS bit in the file system
superblock, it's not enough to call jbd2_journal_clear_err() to clear
the error indication from journal superblock --- we need to call
jbd2_journal_update_sb_errno() as well.  Otherwise, when the root file
system is mounted read-only, the journal is replayed, and the error
indicator is transferred to the superblock --- but the s_errno field
in the jbd2 superblock is left set (since although we cleared it in
memory, we never flushed it out to disk).

This can end up confusing e2fsck.  We should make e2fsck more robust
in this case, but the kernel shouldn't be leaving things in this
confused state, either.

Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

fuse: verify all ioctl retry iov elements

2012-08-26T02:31:39+00:00

commit fb6ccff667712c46b4501b920ea73a326e49626a upstream.

Commit 7572777eef78ebdee1ecb7c258c0ef94d35bad16 attempted to verify that
the total iovec from the client doesn't overflow iov_length() but it
only checked the first element.  The iovec could still overflow by
starting with a small element.  The obvious fix is to check all the
elements.

The overflow case doesn't look dangerous to the kernel as the copy is
limited by the length after the overflow.  This fix restores the
intention of returning an error instead of successfully copying less
than the iovec represented.

I found this by code inspection.  I built it but don't have a test case.
I'm cc:ing stable because the initial commit did as well.

Signed-off-by: Zach Brown 
Signed-off-by: Miklos Szeredi 
Signed-off-by: Greg Kroah-Hartman

ore: Fix out-of-bounds access in _ios_obj()

2012-08-15T14:52:46+00:00

commit 9e62bb4458ad2cf28bd701aa5fab380b846db326 upstream.

_ios_obj() is accessed by group_index not device_table index.

The oc->comps array is only a group_full of devices at a time
it is not like ore_comp_dev() which is indexed by a global
device_table index.

This did not BUG until now because exofs only uses a single
COMP for all devices. But with other FSs like PanFS this is
not true.

This bug was only in the write_path, all other users were
using it correctly

[This is a bug since 3.2 Kernel]

Signed-off-by: Boaz Harrosh 
Signed-off-by: Greg Kroah-Hartman

nilfs2: fix deadlock issue between chcp and thaw ioctls

2012-08-15T14:52:31+00:00

commit 572d8b3945a31bee7c40d21556803e4807fd9141 upstream.

An fs-thaw ioctl causes deadlock with a chcp or mkcp -s command:

 chcp            D ffff88013870f3d0     0  1325   1324 0x00000004
 ...
 Call Trace:
   nilfs_transaction_begin+0x11c/0x1a0 [nilfs2]
   wake_up_bit+0x20/0x20
   copy_from_user+0x18/0x30 [nilfs2]
   nilfs_ioctl_change_cpmode+0x7d/0xcf [nilfs2]
   nilfs_ioctl+0x252/0x61a [nilfs2]
   do_page_fault+0x311/0x34c
   get_unmapped_area+0x132/0x14e
   do_vfs_ioctl+0x44b/0x490
   __set_task_blocked+0x5a/0x61
   vm_mmap_pgoff+0x76/0x87
   __set_current_blocked+0x30/0x4a
   sys_ioctl+0x4b/0x6f
   system_call_fastpath+0x16/0x1b
 thaw            D ffff88013870d890     0  1352   1351 0x00000004
 ...
 Call Trace:
   rwsem_down_failed_common+0xdb/0x10f
   call_rwsem_down_write_failed+0x13/0x20
   down_write+0x25/0x27
   thaw_super+0x13/0x9e
   do_vfs_ioctl+0x1f5/0x490
   vm_mmap_pgoff+0x76/0x87
   sys_ioctl+0x4b/0x6f
   filp_close+0x64/0x6c
   system_call_fastpath+0x16/0x1b

where the thaw ioctl deadlocked at thaw_super() when called while chcp was
waiting at nilfs_transaction_begin() called from
nilfs_ioctl_change_cpmode().  This deadlock is 100% reproducible.

This is because nilfs_ioctl_change_cpmode() first locks sb->s_umount in
read mode and then waits for unfreezing in nilfs_transaction_begin(),
whereas thaw_super() locks sb->s_umount in write mode.  The locking of
sb->s_umount here was intended to make snapshot mounts and the downgrade
of snapshots to checkpoints exclusive.

This fixes the deadlock issue by replacing the sb->s_umount usage in
nilfs_ioctl_change_cpmode() with a dedicated mutex which protects snapshot
mounts.

Signed-off-by: Ryusuke Konishi 
Cc: Fernando Luis Vazquez Cao 
Tested-by: Ryusuke Konishi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

ext4: use s_csum_seed instead of i_csum_seed for xattr block

2012-08-09T15:23:13+00:00

commit 41eb70dde42b2360074a559a6f1fc49860a50179 upstream.

In xattr block operation, we use h_refcount to indicate whether the
xattr block is shared among many inodes. And xattr block csum uses
s_csum_seed if it is shared and i_csum_seed if it belongs to
one inode. But this has a problem. So consider the block is shared
first bewteen inode A and B, and B has some xattr update and CoW
the xattr block. When it updates the *old* xattr block(because
of the h_refcount change) and calls ext4_xattr_release_block, we
has no idea that inode A is the real owner of the *old* xattr
block and we can't use the i_csum_seed of inode A either in xattr
block csum calculation. And I don't think we have an easy way to
find inode A.

So this patch just removes the tricky i_csum_seed and we now uses
s_csum_seed every time for the xattr block csum. The corresponding
patch for the e2fsprogs will be sent in another patch.

This is spotted by xfstests 117.

Signed-off-by: Tao Ma 
Signed-off-by: "Theodore Ts'o" 
Acked-by: Darrick J. Wong 
Signed-off-by: Greg Kroah-Hartman

ext4: use proper csum calculation in ext4_rename

2012-08-09T15:23:13+00:00

commit ef58f69c3c34f6377f1e21d3533c806dbd980ad0 upstream.

In ext4_rename, when the old name is a dir, we need to
change ".." to its new parent and journal the change, so
with metadata_csum enabled, we have to re-calc the csum.

As the first block of the dir can be either a htree root
or a normal directory block and we have different csum
calculation for these 2 types, we have to choose the right
one in ext4_rename.

btw, it is found by xfstests 013.

Signed-off-by: Tao Ma 
Signed-off-by: "Theodore Ts'o" 
Acked-by: Darrick J. Wong 
Signed-off-by: Greg Kroah-Hartman