linux-toradex.git/fs, branch v3.14.4

lockd: ensure we tear down any live sockets when socket creation fails during lockd_up

2014-05-13T11:32:56+00:00

commit 679b033df48422191c4cac52b610d9980e019f9b upstream.

We had a Fedora ABRT report with a stack trace like this:

kernel BUG at net/sunrpc/svc.c:550!
invalid opcode: 0000 [#1] SMP
[...]
CPU: 2 PID: 913 Comm: rpc.nfsd Not tainted 3.13.6-200.fc20.x86_64 #1
Hardware name: Hewlett-Packard HP ProBook 4740s/1846, BIOS 68IRR Ver. F.40 01/29/2013
task: ffff880146b00000 ti: ffff88003f9b8000 task.ti: ffff88003f9b8000
RIP: 0010:[]  [] svc_destroy+0x128/0x130 [sunrpc]
RSP: 0018:ffff88003f9b9de0  EFLAGS: 00010206
RAX: ffff88003f829628 RBX: ffff88003f829600 RCX: 00000000000041ee
RDX: 0000000000000000 RSI: 0000000000000286 RDI: 0000000000000286
RBP: ffff88003f9b9de8 R08: 0000000000017360 R09: ffff88014fa97360
R10: ffffffff8114ce57 R11: ffffea00051c9c00 R12: ffff88003f829600
R13: 00000000ffffff9e R14: ffffffff81cc7cc0 R15: 0000000000000000
FS:  00007f4fde284840(0000) GS:ffff88014fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4fdf5192f8 CR3: 00000000a569a000 CR4: 00000000001407e0
Stack:
 ffff88003f792300 ffff88003f9b9e18 ffffffffa02de02a 0000000000000000
 ffffffff81cc7cc0 ffff88003f9cb000 0000000000000008 ffff88003f9b9e60
 ffffffffa033bb35 ffffffff8131c86c ffff88003f9cb000 ffff8800a5715008
Call Trace:
 [] lockd_up+0xaa/0x330 [lockd]
 [] nfsd_svc+0x1b5/0x2f0 [nfsd]
 [] ? simple_strtoull+0x2c/0x50
 [] ? write_pool_threads+0x280/0x280 [nfsd]
 [] write_threads+0x8b/0xf0 [nfsd]
 [] ? __get_free_pages+0x14/0x50
 [] ? get_zeroed_page+0x16/0x20
 [] ? simple_transaction_get+0xb1/0xd0
 [] nfsctl_transaction_write+0x48/0x80 [nfsd]
 [] vfs_write+0xb4/0x1f0
 [] ? putname+0x29/0x40
 [] SyS_write+0x49/0xa0
 [] ? __audit_syscall_exit+0x1f6/0x2a0
 [] system_call_fastpath+0x16/0x1b
Code: 31 c0 e8 82 db 37 e1 e9 2a ff ff ff 48 8b 07 8b 57 14 48 c7 c7 d5 c6 31 a0 48 8b 70 20 31 c0 e8 65 db 37 e1 e9 f4 fe ff ff 0f 0b <0f> 0b 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55
RIP  [] svc_destroy+0x128/0x130 [sunrpc]
 RSP 

Evidently, we created some lockd sockets and then failed to create
others. make_socks then returned an error and we tried to tear down the
svc, but svc->sv_permsocks was not empty so we ended up tripping over
the BUG() in svc_destroy().

Fix this by ensuring that we tear down any live sockets we created when
socket creation is going to return an error.

Fixes: 786185b5f8abefa (SUNRPC: move per-net operations from...)
Reported-by: Raphos 
Signed-off-by: Jeff Layton 
Reviewed-by: Stanislav Kinsbursky 
Signed-off-by: J. Bruce Fields 
Signed-off-by: Greg Kroah-Hartman

aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration

2014-05-13T11:32:56+00:00

commit fa8a53c39f3fdde98c9eace6a9b412143f0f6ed6 upstream.

As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues
exist in the aio ring page migration support.

As a result, for example, we have the following problem:

            thread 1                      |              thread 2
                                          |
aio_migratepage()                         |
 |-> take ctx->completion_lock            |
 |-> migrate_page_copy(new, old)          |
 |   *NOW*, ctx->ring_pages[idx] == old   |
                                          |
                                          |    *NOW*, ctx->ring_pages[idx] == old
                                          |    aio_read_events_ring()
                                          |     |-> ring = kmap_atomic(ctx->ring_pages[0])
                                          |     |-> ring->head = head;          *HERE, write to the old ring page*
                                          |     |-> kunmap_atomic(ring);
                                          |
 |-> ctx->ring_pages[idx] = new           |
 |   *BUT NOW*, the content of            |
 |    ring_pages[idx] is old.             |
 |-> release ctx->completion_lock         |

As above, the new ring page will not be updated.

Fix this issue, as well as prevent races in aio_ring_setup() by holding
the ring_lock mutex during kioctx setup and page migration.  This avoids
the overhead of taking another spinlock in aio_read_events_ring() as Tang's
and Gu's original fix did, pushing the overhead into the migration code.

Note that to handle the nesting of ring_lock inside of mmap_sem, the
migratepage operation uses mutex_trylock().  Page migration is not a 100%
critical operation in this case, so the ocassional failure can be
tolerated.  This issue was reported by Sasha Levin.

Based on feedback from Linus, avoid the extra taking of ctx->completion_lock.
Instead, make page migration fully serialised by mapping->private_lock, and
have aio_free_ring() simply disconnect the kioctx from the mapping by calling
put_aio_ring_file() before touching ctx->ring_pages[].  This simplifies the
error handling logic in aio_migratepage(), and should improve robustness.

v4: always do mutex_unlock() in cases when kioctx setup fails.

Reported-by: Yasuaki Ishimatsu 
Reported-by: Sasha Levin 
Signed-off-by: Benjamin LaHaise 
Cc: Tang Chen 
Cc: Gu Zheng 
Signed-off-by: Greg Kroah-Hartman

locks: allow __break_lease to sleep even when break_time is 0

2014-05-13T11:32:53+00:00

commit 4991a628a789dc5954e98e79476d9808812292ec upstream.

A fl->fl_break_time of 0 has a special meaning to the lease break code
that basically means "never break the lease". knfsd uses this to ensure
that leases don't disappear out from under it.

Unfortunately, the code in __break_lease can end up passing this value
to wait_event_interruptible as a timeout, which prevents it from going
to sleep at all. This causes __break_lease to spin in a tight loop and
causes soft lockups.

Fix this by ensuring that we pass a minimum value of 1 as a timeout
instead.

Cc: J. Bruce Fields 
Reported-by: Terry Barnaby 
Signed-off-by: Jeff Layton 
Signed-off-by: J. Bruce Fields 
Signed-off-by: Greg Kroah-Hartman

ext4: use i_size_read in ext4_unaligned_aio()

2014-05-06T14:59:37+00:00

commit 6e6358fc3c3c862bfe9a5bc029d3f8ce43dc9765 upstream.

We haven't taken i_mutex yet, so we need to use i_size_read().

Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: move ext4_update_i_disksize() into mpage_map_and_submit_extent()

2014-05-06T14:59:37+00:00

commit 622cad1325e404598fe3b148c3fa640dbaabc235 upstream.

The function ext4_update_i_disksize() is used in only one place, in
the function mpage_map_and_submit_extent().  Move its code to simplify
the code paths, and also move the call to ext4_mark_inode_dirty() into
the i_data_sem's critical region, to be consistent with all of the
other places where we update i_disksize.  That way, we also keep the
raw_inode's i_disksize protected, to avoid the following race:

      CPU #1                                 CPU #2

   down_write(&i_data_sem)
   Modify i_disk_size
   up_write(&i_data_sem)
                                        down_write(&i_data_sem)
                                        Modify i_disk_size
                                        Copy i_disk_size to on-disk inode
                                        up_write(&i_data_sem)
   Copy i_disk_size to on-disk inode

Signed-off-by: "Theodore Ts'o" 
Reviewed-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman

ext4: fix jbd2 warning under heavy xattr load

2014-05-06T14:59:37+00:00

commit ec4cb1aa2b7bae18dd8164f2e9c7c51abcf61280 upstream.

When heavily exercising xattr code the assertion that
jbd2_journal_dirty_metadata() shouldn't return error was triggered:

WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237
jbd2_journal_dirty_metadata+0x1ba/0x260()

CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G    W 3.10.0-ceph-00049-g68d04c9 #1
Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968
 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80
 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978
Call Trace:
 [] dump_stack+0x19/0x1b
 [] warn_slowpath_common+0x70/0xa0
 [] warn_slowpath_null+0x1a/0x20
 [] jbd2_journal_dirty_metadata+0x1ba/0x260
 [] __ext4_handle_dirty_metadata+0xa3/0x140
 [] ext4_xattr_release_block+0x103/0x1f0
 [] ext4_xattr_block_set+0x1e0/0x910
 [] ext4_xattr_set_handle+0x38b/0x4a0
 [] ? trace_hardirqs_on+0xd/0x10
 [] ext4_xattr_set+0xc2/0x140
 [] ext4_xattr_user_set+0x47/0x50
 [] generic_setxattr+0x6e/0x90
 [] __vfs_setxattr_noperm+0x7b/0x1c0
 [] vfs_setxattr+0xc4/0xd0
 [] setxattr+0x13e/0x1e0
 [] ? __sb_start_write+0xe7/0x1b0
 [] ? mnt_want_write_file+0x28/0x60
 [] ? fget_light+0x3c/0x130
 [] ? mnt_want_write_file+0x28/0x60
 [] ? __mnt_want_write+0x58/0x70
 [] SyS_fsetxattr+0xbe/0x100
 [] system_call_fastpath+0x16/0x1b

The reason for the warning is that buffer_head passed into
jbd2_journal_dirty_metadata() didn't have journal_head attached. This is
caused by the following race of two ext4_xattr_release_block() calls:

CPU1                                CPU2
ext4_xattr_release_block()          ext4_xattr_release_block()
lock_buffer(bh);
/* False */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
} else {
  le32_add_cpu(&BHDR(bh)->h_refcount, -1);
  unlock_buffer(bh);
                                    lock_buffer(bh);
                                    /* True */
                                    if (BHDR(bh)->h_refcount == cpu_to_le32(1))
                                      get_bh(bh);
                                      ext4_free_blocks()
                                        ...
                                        jbd2_journal_forget()
                                          jbd2_journal_unfile_buffer()
                                          -> JH is gone
  error = ext4_handle_dirty_xattr_block(handle, inode, bh);
  -> triggers the warning

We fix the problem by moving ext4_handle_dirty_xattr_block() under the
buffer lock. Sadly this cannot be done in nojournal mode as that
function can call sync_dirty_buffer() which would deadlock. Luckily in
nojournal mode the race is harmless (we only dirty already freed buffer)
and thus for nojournal mode we leave the dirtying outside of the buffer
lock.

Reported-by: Sage Weil 
Signed-off-by: Jan Kara 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: note the error in ext4_end_bio()

2014-05-06T14:59:37+00:00

commit 9503c67c93ed0b95ba62d12d1fd09da6245dbdd6 upstream.

ext4_end_bio() currently throws away the error that it receives.  Chances
are this is part of a spate of errors, one of which will end up getting
the error returned to userspace somehow, but we shouldn't take that risk.
Also print out the errno to aid in debug.

Signed-off-by: Matthew Wilcox 
Signed-off-by: "Theodore Ts'o" 
Reviewed-by: Jan Kara 
Signed-off-by: Greg Kroah-Hartman

ext4: FIBMAP ioctl causes BUG_ON due to handle EXT_MAX_BLOCKS

2014-05-06T14:59:36+00:00

commit 4adb6ab3e0fa71363a5ef229544b2d17de6600d7 upstream.

When we try to get 2^32-1 block of the file which has the extent
(ee_block=2^32-2, ee_len=1) with FIBMAP ioctl, it causes BUG_ON
in ext4_ext_put_gap_in_cache().

To avoid the problem, ext4_map_blocks() needs to check the file logical block
number. ext4_ext_put_gap_in_cache() called via ext4_map_blocks() cannot
handle 2^32-1 because the maximum file logical block number is 2^32-2.

Note that ext4_ind_map_blocks() returns -EIO when the block number is invalid.
So ext4_map_blocks() should also return the same errno.

Signed-off-by: Kazuya Mio 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

smarter propagate_mnt()

2014-05-06T14:59:36+00:00

commit f2ebb3a921c1ca1e2ddd9242e95a1989a50c4c68 upstream.

The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint.  That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations.  It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups.  And iterate through
the peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporary) marking
the masters of mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}.  It means that the master
of M_{k} will be among the sequence of masters of M.  On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).

So we go through the sequence of masters of M until we find
a marked one (P).  Let N be the one before it.  Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.

That's it for the hard part; the rest is fairly simple.  Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().

It seems to survive all tests and gives a noticably better performance
than the current mainline for setups that are seriously using shared
subtrees.

Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman

ocfs2: fix panic on kfree(xattr->name)

2014-05-06T14:59:36+00:00

commit f81c20158f8d5f7938d5eb86ecc42ecc09273ce6 upstream.

Commit 9548906b2bb7 ('xattr: Constify ->name member of "struct xattr"')
missed that ocfs2 is calling kfree(xattr->name).  As a result, kernel
panic occurs upon calling kfree(xattr->name) because xattr->name refers
static constant names.  This patch removes kfree(xattr->name) from
ocfs2_mknod() and ocfs2_symlink().

Signed-off-by: Tetsuo Handa 
Reported-by: Tariq Saeed 
Tested-by: Tariq Saeed 
Reviewed-by: Srinivas Eeda 
Cc: Joel Becker 
Cc: Mark Fasheh 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman