commit d1908f52557b3230fbd63c0429f3b4b748bf2b6d upstream.
Tetsuo has noticed that an OOM stress test which performs large write
requests can fully deplete the memory reserves. He has tracked this
down to the following path:
__alloc_pages_nodemask+0x436/0x4d0
alloc_pages_current+0x97/0x1b0
__page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728
pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331
grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773
iomap_write_begin+0x50/0xd0 fs/iomap.c:118
iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190
? iomap_write_end+0x80/0x80 fs/iomap.c:150
iomap_apply+0xb3/0x130 fs/iomap.c:79
iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
? iomap_write_end+0x80/0x80
xfs_file_buffered_aio_write+0x132/0x390 [xfs]
? remove_wait_queue+0x59/0x60
xfs_file_write_iter+0x90/0x130 [xfs]
__vfs_write+0xe5/0x140
vfs_write+0xc7/0x1f0
? syscall_trace_enter+0x1d0/0x380
SyS_write+0x58/0xc0
do_syscall_64+0x6c/0x200
entry_SYSCALL64_slow_path+0x25/0x25
The OOM victim has access to all memory reserves to make forward
progress towards exiting easier. But iomap_file_buffered_write and
other callers of iomap_apply loop until they complete the full request,
so we need to check for fatal signals and back off with a short write
instead.
As iomap_apply delegates all the work down to the actors, we have to
hook into those. All callers that work with the page cache call
iomap_write_begin, so we check for signals there. dax_iomap_actor has
to handle the situation explicitly because it copies data to userspace
directly. Other callers, like iomap_page_mkwrite, work on a single
page, while iomap_fiemap_actor does not allocate memory based on the
given len.
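A minimal sketch of the check described above, as it could sit at the
top of iomap_write_begin() (the exact placement and the -EINTR return
value are assumptions here, not the verbatim patch):
	if (fatal_signal_pending(current))
		return -EINTR;	/* back off with a short write instead of
				 * dipping further into memory reserves */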
Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 81ddd8c0c5e1cb41184d66567140cb48c53eb3d1 upstream.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
file_info_lock is not initialized in initiate_cifs_search(), leading to the
following splat after a simple "mount.cifs ... dir && ls dir/":
BUG: spinlock bad magic on CPU#0, ls/486
lock: 0xffff880009301110, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 0 PID: 486 Comm: ls Not tainted 4.9.0 #27
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
ffffc900042f3db0 ffffffff81327533 0000000000000000 ffff880009301110
ffffc900042f3dd0 ffffffff810baf75 ffff880009301110 ffffffff817ae077
ffffc900042f3df0 ffffffff810baff6 ffff880009301110 ffff880008d69900
Call Trace:
[<ffffffff81327533>] dump_stack+0x65/0x92
[<ffffffff810baf75>] spin_dump+0x85/0xe0
[<ffffffff810baff6>] spin_bug+0x26/0x30
[<ffffffff810bb159>] do_raw_spin_lock+0xe9/0x130
[<ffffffff8159ad2f>] _raw_spin_lock+0x1f/0x30
[<ffffffff8127e50d>] cifs_closedir+0x4d/0x100
[<ffffffff81181cfd>] __fput+0x5d/0x160
[<ffffffff81181e3e>] ____fput+0xe/0x10
[<ffffffff8109410e>] task_work_run+0x7e/0xa0
[<ffffffff81002512>] exit_to_usermode_loop+0x92/0xa0
[<ffffffff810026f9>] syscall_return_slowpath+0x49/0x50
[<ffffffff8159b484>] entry_SYSCALL_64_fastpath+0xa7/0xa9
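The fix is simply to initialize the lock when the private search data
is set up; a minimal sketch (the placement inside initiate_cifs_search()
and the variable name are assumptions based on the splat above):
	/* in initiate_cifs_search(), right after the private data is allocated */
	spin_lock_init(&cifsFile->file_info_lock);	/* the missing initialization */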
Fixes: 3afca265b5f53a0 ("Clarify locking of cifs file and tcon structures and make more granular")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d19fb70dd68c4e960e2ac09b0b9c79dfdeefa726 upstream.
nfsd assigns nfs4_free_lock_stateid to .sc_free in init_lock_stateid().
If nfsd doesn't go through init_lock_stateid() and puts the stateid at
the end, there is a NULL dereference of .sc_free when calling
nfs4_put_stid(ns).
This patch moves the nfs4_stid.sc_free assignment into nfs4_alloc_stid().
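A sketch of the shape of that change (the extra callback parameter and
the caller shown are assumptions based on the description above, not
the verbatim patch):
	struct nfs4_stid *
	nfs4_alloc_stid(struct nfs4_client *cl, struct kmem_cache *slab,
			void (*sc_free)(struct nfs4_stid *))
	{
		struct nfs4_stid *stid;

		/* ... existing allocation and setup ... */
		stid->sc_free = sc_free;	/* always set, so nfs4_put_stid() never sees NULL */
		return stid;
	}

	/* a lock stateid caller would then pass its release function, e.g. */
	stid = nfs4_alloc_stid(clp, stateid_slab, nfs4_free_lock_stateid);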
Fixes: 356a95ece7aa ("nfsd: clean up races in lock stateid searching...")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3a4b77cd47bb837b8557595ec7425f281f2ca1fe upstream.
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. It turns out that the kernel crashed while
calculating the fs overhead (ext4_calculate_overhead()) because the
image has a very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the bitmap buffer, which is only
PAGE_SIZE, when setting bits in count_overhead().
ext4_calculate_overhead():
buf = get_zeroed_page(GFP_NOFS); <=== PAGE_SIZE buffer
blks = count_overhead(sb, i, buf);
count_overhead():
for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
ext4_set_bit(EXT4_B2C(sbi, s++), buf); <=== buffer overrun
count++;
}
This can be reproduced easily with the following script:
#!/bin/bash
rm -f fs.img
mkdir -p /mnt/ext4
fallocate -l 16M fs.img
mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
debugfs -w -R "ssv first_meta_bg 842150400" fs.img
mount -o loop fs.img /mnt/ext4
Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.
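A sketch of the kind of mount-time check described above (the bound
against the group descriptor block count and the error handling are
assumptions, not the verbatim patch):
	/* in ext4_fill_super(), once db_count is known */
	if (le32_to_cpu(es->s_first_meta_bg) > db_count) {
		ext4_msg(sb, KERN_WARNING,
			 "first meta block group too large: %u "
			 "(group descriptor block count %u)",
			 le32_to_cpu(es->s_first_meta_bg), db_count);
		goto failed_mount;
	}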
Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c364b6d0b6cda1cd5d9ab689489adda3e82529aa upstream.
In a bmapx call, bmv_count is the total size of the array, including the
zeroth element that userspace uses to supply the search key. The output
array starts at offset 1 so that we can set up the user for the next
invocation. Since we now can split an extent into multiple bmap records
due to shared/unshared status, we have to be careful that we don't
overflow the output array.
In the original patch f86f403794b ("xfs: teach get_bmapx about shared
extents and the CoW fork") I used cur_ext (the output index) to check
for overflows, albeit with an off-by-one error. Since nexleft no longer
describes the number of unfilled slots in the output, we can rip all
that out and use cur_ext for the overflow check directly.
Failure to do this causes heap corruption in bmapx callers such as
xfs_io and xfs_scrub. xfs/328 can reproduce this problem.
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2aa6ba7b5ad3189cc27f14540aa2f57f0ed8df4b upstream.
If we try to allocate memory pages to back an xfs_buf that we're trying
to read, it's possible that we'll be so short on memory that the page
allocation fails. For a blocking read we'll just wait, but for
readahead we simply dump all the pages we've collected so far.
Unfortunately, after dumping the pages we neglect to clear the
_XBF_PAGES state, which means that the subsequent call to xfs_buf_free
thinks that b_pages still points to pages we own. It then double-frees
the b_pages pages.
This results in screaming about negative page refcounts from the memory
manager, which xfs oughtn't be triggering. To reproduce this case,
mount a filesystem where the size of the inodes far outweighs the
available memory (a ~500M inode filesystem on a VM with 300MB memory
did the trick here) and run bulkstat in parallel with other memory
eating processes to put a huge load on the system. The "check summary"
phase of xfs_scrub also works for this purpose.
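A sketch of the missing piece in the readahead error path (the label
and loop shape are assumptions based on the description above):
	out_free_pages:
		for (i = 0; i < bp->b_page_count; i++)
			__free_page(bp->b_pages[i]);
		bp->b_flags &= ~_XBF_PAGES;	/* pages are gone; keep xfs_buf_free()
						 * from freeing them a second time */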
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 493611ebd62673f39e2f52c2561182c558a21cb6 upstream.
With COW files, extent size hints are the hotpath, just like for files
with the extent size hint attribute set. We really shouldn't
micro-manage anything but failure cases with unlikely.
Additionally Arnd Bergmann recently reported that one of these two
unlikely annotations causes link failures together with an upcoming
kernel instrumentation patch, so let's get rid of it ASAP.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5a93790d4e2df73e30c965ec6e49be82fc3ccfce upstream.
xfs_attr_[get|remove]() have unlocked attribute fork checks to optimize
away a lock cycle in cases where the fork does not exist or is otherwise
empty. This check is not safe, however, because an attribute fork short
form to extent format conversion includes a transient state that causes
the xfs_inode_hasattr() check to fail. Specifically,
xfs_attr_shortform_to_leaf() creates an empty extent format attribute
fork and then adds the existing shortform attributes to it.
This means that lookup of an existing xattr can spuriously return
-ENOATTR when racing against a setxattr that causes the associated
format conversion. This was originally reproduced by an untar on a
particularly configured glusterfs volume, but can also be reproduced on
demand with properly crafted xattr requests.
The format conversion occurs under the exclusive ilock. xfs_attr_get()
and xfs_attr_remove() already have the proper locking and checks further
down in the functions to handle this situation correctly. Drop the
unlocked checks to avoid the spurious failure and rely on the existing
logic.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 83d230eb5c638949350f4761acdfc0af5cb1bc00 upstream.
sb_dirblklog is added to sb_blocklog to compute the directory block size
in bytes. Therefore, we must compare the sum of both those values
against XFS_MAX_BLOCKSIZE_LOG, not just dirblklog.
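A sketch of the corrected superblock check (the surrounding verifier
context is assumed):
	/* the directory block size is 1 << (sb_blocklog + sb_dirblklog), so
	 * the sum of the two logs is what must stay within the limit */
	if (sbp->sb_blocklog + sbp->sb_dirblklog > XFS_MAX_BLOCKSIZE_LOG)
		return -EFSCORRUPTED;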
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d2b3964a0780d2d2994eba57f950d6c9fe489ed8 upstream.
Due to the way xfs_iomap_write_allocate tries to convert whole found
extents from delalloc to real space, we can run into a race condition
with multiple threads doing writes to this same extent.
For the non-COW case that is harmless, as the only thing that can
happen is that we call xfs_bmapi_write on an extent that has already
been converted to a real allocation. For COW writes, where we move the
extent from the COW to the data fork after I/O completion, the race is,
however, not quite as harmless. In the worst case we are now calling
xfs_bmapi_write on a region that contains a hole in the COW fork, which
will trip up an assert in debug builds or lead to file system
corruption in non-debug builds. This seems to be reproducible with
workloads of small O_DSYNC writes, although so far I've not managed to
come up with an isolated reproducer.
The fix for the issue is relatively simple: tell xfs_bmapi_write
that we are only asked to convert delayed allocations and skip holes
in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fd29f7af75b7adf250beccffa63746c6a88e2b74 upstream.
A harmless warning just got introduced:
fs/xfs/libxfs/xfs_dir2.h:40:8: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]
Removing the 'const' modifier avoids the warning and has no
other effect.
Fixes: 1fc4d33fed12 ("xfs: replace xfs_mode_to_ftype table with switch statement")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 657bdfb7f5e68ca5e2ed009ab473c429b0d6af85 upstream.
The GETNEXTQUOTA ioctl takes whatever ID is sent in
and looks for the next active quota for an ID
equal to or higher than that ID.
But if we are at the maximum ID and then ask for the "next"
one, we may wrap back to zero. In this case, userspace
may loop forever, because it will start querying again
at zero.
We'll fix this in userspace as well, but for the kernel,
return -ENOENT if we ask for the next quota ID
past UINT_MAX so the caller knows to stop.
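A sketch of the boundary check described above (where it lives and how
the error is propagated are assumptions):
	/* there is no quota ID past UINT_MAX, so report ENOENT
	 * instead of wrapping back to zero and looping forever */
	if (id == UINT_MAX)
		return -ENOENT;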
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a324cbf10a3c67aaa10c9f47f7b5801562925bc2 upstream.
Check for invalid file type in xfs_dinode_verify()
and fail to load the inode structure from disk.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fab8eef86c814c3dd46bc5d760b6e4a53d5fc5a6 upstream.
The helper xfs_dentry_to_name() is used by two different
classes of callers: callers that pass a zero mode and don't
care about the returned name.type field, and callers that pass
a non-zero mode and do care about the name.type field.
Change xfs_dentry_to_name() to not take the mode argument and
change the call sites of the first class to not pass the mode
argument.
Create a new helper xfs_dentry_mode_to_name() which does take
the mode argument and returns -EFSCORRUPTED if mode is invalid.
Callers that translate a non-zero mode to an on-disk file type now
check the return value and will export the error to userspace instead
of staging an invalid file type to be written to the directory entry.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1fc4d33fed124fb182e8e6c214e973a29389ae83 upstream.
The size of the xfs_mode_to_ftype[] conversion table
was too small to handle an invalid value of mode=S_IFMT.
Instead of fixing the table size, replace the conversion table
with a conversion helper that uses a switch statement.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b597dd5373a1ccc08218665dc8417433b1c09550 upstream.
xfs_dir2.h dereferences some data types in inline functions
and fails to include those type definitions, e.g.:
xfs_dir2_data_aoff_t, struct xfs_da_geometry.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3c6f46eacd876bd723a9bad3c6882714c052fd8e upstream.
This change fixes an assertion hit when fuzzing on-disk
i_mode values.
The easy case to fix is when an empty file's i_mode is
changed to S_IFDIR. In this case, xfs_dinode_verify()
detects an illegal zero size for a directory and fails
to load the inode structure from disk.
For the case of a non-empty file whose i_mode is changed
to S_IFDIR, the ASSERT() statement in xfs_dir2_isblock()
is replaced with return -EFSCORRUPTED, to avoid interacting
with corrupted junk even when XFS_DEBUG is disabled.
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bf46ecc3d8cca05f2907cf482755c42c2b11a79d upstream.
The ASSERT() condition is the normal case, not the exception,
so testing the condition should be likely(), not unlikely().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 84a4620cfe97c9d57e39b2369bfb77faff55063d upstream.
There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail:
one is an I/O error, for which xlog_bdstrat already logs a warning, and
the second is an already shutdown log due to a previous I/O error. In
the latter case we'll already have a previous indication for the actual
error, but the large stream of misleading warnings from xfs_log_force
will probably scroll it out of the message buffer.
Simply removing the warnings thus makes the XFS log reporting significantly
better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 12ef830198b0d71668eb9b59f9ba69d32951a48a upstream.
->total is a bit of an odd parameter passed down to the low-level
allocator all the way from the high-level callers. It's supposed to
contain the maximum number of blocks to be allocated for the whole
transaction [1].
But in xfs_iomap_write_allocate we only convert existing delayed
allocations and thus only have a minimal block reservation for the
current transaction, so xfs_alloc_space_available can't use it for
the allocation decisions. Use the maximum of args->total and the
calculated block requirement to make a decision. We probably should
get rid of args->total eventually and instead apply ->minleft more
broadly, but that will require some extensive changes all over.
[1] which creates lots of confusion as most callers don't decrement it
after doing a first allocation. But that's for a separate series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 54fee133ad59c87ab01dd84ab3e9397134b32acb upstream.
We must decide in xfs_alloc_fix_freelist whether an allocation from
a given AG is possible or not based on the available space, and
should not fail the allocation past that point on a healthy file
system.
But currently we have two additional places that second-guess
xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the
maxlen parameter to remove the reservation before doing the
allocation (but ignores the various minimum freespace requirements),
and xfs_alloc_fix_minleft tries to fix up the allocated length
after we've found an extent, but ignores the reservations and also
doesn't take the AGFL into account (and thus fails allocations
for not matching minlen in some cases).
Remove all these later fixups and just correct the maxlen argument
inside xfs_alloc_fix_freelist once we have the AGF buffer locked.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 255c516278175a6dc7037d1406307f35237d8688 upstream.
We can't just set minleft to 0 when we're low on space - that's exactly
what we need minleft for: to protect space in the AG for btree block
allocations when we are low on free space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5149fd327f16e393c1d04fa5325ab072c32472bf upstream.
Setting aside 4 blocks globally for bmbt splits isn't all that useful,
as different threads can allocate space in parallel. Bump it to 4
blocks per AG to allow each thread that is currently doing an
allocation to dip into it separately. Without that we may not have
enough reserved blocks if there are enough parallel transactions
in an almost out-of-space file system that all run into bmap btree
splits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a430607b2ef7c3be090f88c71cfcb1b3988aa7c0 upstream.
Some NFSv4.0 servers may return a mode for the verifier following an open
with EXCLUSIVE4 createmode, but this does not mean the client should skip
setting the mode in the following SETATTR. It should only do that for
EXCLUSIVE4_1 or UNGUARDED createmode.
Fixes: 5334c5bdac92 ("NFS: Send attributes in OPEN request for NFS4_CREATE_EXCLUSIVE4_1")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8ac092519ad91931c96d306c4bfae2c6587c325f upstream.
We cannot call nfs4_handle_exception() without first ensuring that the
slot has been freed. If not, we end up deadlocking with the process
waiting for recovery to complete, and recovery waiting for the slot
table to drain.
Fixes: 2e80dbe7ac51 ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 57b59ed2e5b91e958843609c7884794e29e6c4cb upstream.
Subvolume directory inodes can't have ACLs.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1fdf41941b8010691679638f8d0c8d08cfee7726 upstream.
When you snapshot a subvolume containing a subvolume, you get a
placeholder directory where the subvolume would be. These directory
inodes have ->i_ops set to btrfs_dir_ro_inode_operations. Previously,
these i_ops didn't include the xattr operation callbacks. The conversion
to xattr_handlers missed this case, leading to bogus attempts to set
xattrs on these inodes. This manifested itself as failures when running
delayed inodes.
To fix this, clear IOP_XATTR in ->i_opflags on these inodes.
Fixes: 6c6ef9f26e59 ("xattr: Stop calling {get,set,remove}xattr inode operations")
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Reported-by: Chris Murphy <lists@colorremedies.com>
Tested-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 67ade058ef2c65a3e56878af9c293ec76722a2e5 upstream.
As Jeff explained in c2951f32d36c ("btrfs: remove old tree_root dirent
processing in btrfs_real_readdir()"), supporting this old format is no
longer necessary since the Btrfs magic number has been updated since we
changed to the current format. There are other places where we still
handle this old format, but since this is part of a fix that is going to
stable, I'm only removing this one for now.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e0d76fa4475ef2cf4b52d18588b8ce95153d021b upstream.
Quotacheck runs at mount time in situations where quota accounting must
be recalculated. In doing so, it uses bulkstat to visit every inode in
the filesystem. Historically, every inode processed during quotacheck
was released and immediately tagged for reclaim because quotacheck runs
before the superblock is marked active by the VFS. In other words,
the final iput() led to an immediate ->destroy_inode() call, which
allowed the XFS background reclaim worker to start reclaiming inodes.
Commit 17c12bcd3 ("xfs: when replaying bmap operations, don't let
unlinked inodes get reaped") marks the XFS superblock active sooner as
part of the mount process to support caching inodes processed during log
recovery. This occurs before quotacheck and thus means all inodes
processed by quotacheck are inserted to the LRU on release. The
s_umount lock is held until the mount has completed and thus prevents
the shrinkers from operating on the sb. This means that quotacheck can
excessively populate the inode LRU and lead to OOM conditions on systems
without sufficient RAM.
Update the quotacheck bulkstat handler to set XFS_IGET_DONTCACHE on
inodes processed by quotacheck. This causes ->drop_inode() to return 1
and in turn causes iput_final() to evict the inode. This preserves the
original quotacheck behavior and prevents it from overloading the LRU
and running out of memory.
Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fe2ed42517533068ac03eed5630fffafff27eacf upstream.
sparse says:
fs/ceph/inode.c:308:36: warning: incorrect type in argument 1 (different base types)
fs/ceph/inode.c:308:36: expected unsigned int [unsigned] [usertype] a
fs/ceph/inode.c:308:36: got restricted __le32 [usertype] frag
fs/ceph/inode.c:308:46: warning: incorrect type in argument 2 (different base types)
fs/ceph/inode.c:308:46: expected unsigned int [unsigned] [usertype] b
fs/ceph/inode.c:308:46: got restricted __le32 [usertype] frag
We need to convert these values to host-endian before calling the
comparator.
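A sketch of the converted comparator (the function and type names are
inferred from the warning's location and are assumptions here):
	static int frag_tree_split_cmp(const void *l, const void *r)
	{
		struct ceph_frag_tree_split *ls = (struct ceph_frag_tree_split *)l;
		struct ceph_frag_tree_split *rs = (struct ceph_frag_tree_split *)r;

		/* ->frag is little-endian on the wire; convert before comparing */
		return ceph_frag_compare(le32_to_cpu(ls->frag),
					 le32_to_cpu(rs->frag));
	}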
Fixes: a407846ef7c6 ("ceph: don't assume frag tree splits in mds reply are sorted")
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1097680d759918ce4a8705381c0ab2ed7bd60cf1 upstream.
sparse says:
fs/ceph/dir.c:1248:50: warning: incorrect type in assignment (different base types)
fs/ceph/dir.c:1248:50: expected restricted __le32 [usertype] mask
fs/ceph/dir.c:1248:50: got int [signed] [assigned] mask
Fixes: 200fd27c8fa2 ("ceph: use lookup request to revalidate dentry")
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6e09d0fb64402cec579f029ca4c7f39f5c48fc60 upstream.
Commit 5c341ee32881 ("ceph: fix scheduler warning due to nested
blocking") causes an infinite loop when the process is interrupted. Fix it.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5c341ee32881c554727ec14b71ec3e8832f01989 upstream.
try_get_cap_refs can be used as a condition in wait_event* calls.
This is all fine until it has to call __ceph_do_pending_vmtruncate,
which in turn acquires the i_truncate_mutex. This leads to a situation
in which a task's state is !TASK_RUNNING and at the same time it's
trying to acquire a sleeping primitive. In essence, nested sleeping
primitives are being used. This causes the following warning:
WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 __might_sleep+0x9f/0xb0()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109447d>] prepare_to_wait_event+0x5d/0x110
ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G O 4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
0000000000000000 ffff8838b416fa88 ffffffff812f4409 ffff8838b416fad0
ffffffff81a034f2 ffff8838b416fac0 ffffffff81052b46 ffffffff81a0432c
0000000000000061 0000000000000000 0000000000000000 ffff88167bda54a0
Call Trace:
[<ffffffff812f4409>] dump_stack+0x67/0x9e
[<ffffffff81052b46>] warn_slowpath_common+0x86/0xc0
[<ffffffff81052bcc>] warn_slowpath_fmt+0x4c/0x50
[<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
[<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
[<ffffffff8107767f>] __might_sleep+0x9f/0xb0
[<ffffffff81612d30>] mutex_lock+0x20/0x40
[<ffffffffa04eea14>] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
[<ffffffffa04fa692>] try_get_cap_refs+0xa2/0x320 [ceph]
[<ffffffffa04fd6f5>] ceph_get_caps+0x255/0x2b0 [ceph]
[<ffffffff81094370>] ? wait_woken+0xb0/0xb0
[<ffffffffa04f2c11>] ceph_write_iter+0x2b1/0xde0 [ceph]
[<ffffffff81613f22>] ? schedule_timeout+0x202/0x260
[<ffffffff8117f01a>] ? kmem_cache_free+0x1ea/0x200
[<ffffffff811b46ce>] ? iput+0x9e/0x230
[<ffffffff81077632>] ? __might_sleep+0x52/0xb0
[<ffffffff81156147>] ? __might_fault+0x37/0x40
[<ffffffff8119e123>] ? cp_new_stat+0x153/0x170
[<ffffffff81198cfa>] __vfs_write+0xaa/0xe0
[<ffffffff81199369>] vfs_write+0xa9/0x190
[<ffffffff811b6d01>] ? set_close_on_exec+0x31/0x70
[<ffffffff8119a056>] SyS_write+0x46/0xa0
This happens since wait_event_interruptible can interfere with the
mutex locking code, since they both fiddle with the task state.
Fix the issue by using the newly-added nested blocking infrastructure
in 61ada528dea0 ("sched/wait: Provide infrastructure to deal with
nested blocking")
Link: https://lwn.net/Articles/628628/
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6df8c9d80a27cb587f61b4f06b57e248d8bc3f86 upstream.
sparse says:
fs/ceph/mds_client.c:291:23: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:293:28: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:294:28: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:296:28: warning: restricted __le32 degrades to integer
The op value is __le32, so we need to convert it before comparing it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1cb51a15b576ee325d527726afff40947218fd5e upstream.
When replaying the journal it can happen that a journal entry points to
a garbage collected node.
This is the case when a power-cut occurred between a garbage collect run
and a commit. In such a case nodes have to be read using the failable
read functions to detect whether the found node matches what we expect.
One corner case was forgotten: when the journal contains an entry to
remove an inode, all xattrs have to be removed too. UBIFS models xattrs
like directory entries, so the TNC code iterates over
all xattrs of the inode and removes them too. This code re-uses the
functions for walking directories and calls ubifs_tnc_next_ent().
ubifs_tnc_next_ent() expects to be used only after the journal has been
replayed and aborts when a node does not match the expected result.
This behavior can render a UBIFS volume unmountable after a power-cut
when xattrs are used.
Fix this issue by using failable read functions in ubifs_tnc_next_ent()
too when replaying the journal.
Fixes: 1e51764a3c2ac05a ("UBIFS: add new flash file system")
Reported-by: Rock Lee <rockdotlee@gmail.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 210675270caa33253e4c33f3c5e657e7d6060812 upstream.
Commit bcb6f6d2b9c2 ("fuse: use timespec64") introduced clamped nsec values
in time_to_jiffies but used the max of nsec and NSEC_PER_SEC - 1 instead of
the min. Because of this, dentries would stay in the cache longer than
requested and go stale in scenarios that relied on their timely eviction.
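A sketch of the corrected clamp in time_to_jiffies() (the surrounding
context is assumed):
	/* clamp nsec *down* into [0, NSEC_PER_SEC - 1]; using max here let
	 * oversized nsec values through and inflated the timeout */
	struct timespec64 ts = {
		sec,
		min_t(u32, nsec, NSEC_PER_SEC - 1)
	};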
Fixes: bcb6f6d2b9c2 ("fuse: use timespec64")
Signed-off-by: David Sheets <dsheets@docker.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a8a86d78d673b1c99fe9b0064739fde9e9774184 upstream.
fuse_abort_conn() moves requests from pending list to a temporary list
before canceling them. This operation races with request_wait_answer()
which also tries to remove the request after it gets a fatal signal. It
checks FR_PENDING flag to determine whether the request is still in the
pending list.
Make fuse_abort_conn() clear FR_PENDING flag so that request_wait_answer()
does not remove the request from temporary list.
This bug causes an Oops when trying to delete an already deleted list entry
in end_requests().
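A sketch of the fix as described (the list and variable names are
placeholders, not the verbatim patch):
	/* in fuse_abort_conn(), while moving each request off the pending list */
	clear_bit(FR_PENDING, &req->flags);	/* request_wait_answer() now leaves it alone */
	list_move(&req->list, &to_end);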
Fixes: ee314a870e40 ("fuse: abort: no fc->lock needed for request ending")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 497de07d89c1410d76a15bec2bb41f24a2a89f31 upstream.
The tmpfs modification was missed in the CVE-2016-7097 fix,
commit 073931017b49 ("posix_acl: Clear SGID bit when setting
file permissions").
It can be tested with xfstest generic/375, which fails to clear the
setgid bit in the following test case on tmpfs:
touch $testfile
chown 100:100 $testfile
chmod 2755 $testfile
_runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile
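A sketch of the kind of change described, applying the same
SGID-clearing helper that the CVE-2016-7097 fix introduced to the
simple/tmpfs ACL setter (the exact function touched is an assumption):
	int simple_set_acl(struct inode *inode, struct posix_acl *acl, int type)
	{
		int error;

		if (type == ACL_TYPE_ACCESS) {
			/* clears the SGID bit when the caller is not in the
			 * owning group and lacks CAP_FSETID */
			error = posix_acl_update_mode(inode, &inode->i_mode, &acl);
			if (error)
				return error;
		}

		inode->i_ctime = current_time(inode);
		set_cached_acl(inode, type, acl);
		return 0;
	}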
Signed-off-by: Gu Zheng <guzheng1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cfd278c280f997cf2fe4662e0acab0fe465f637b upstream.
Various places assume that if nfs4_fl_prepare_ds() returns a non-NULL 'ds',
then ds->ds_clp will also be non-NULL.
This is not necessarily true in the case when the process received a fatal signal
while nfs4_pnfs_ds_connect is waiting in nfs4_wait_ds_connect().
In that case ->ds_clp may not be set, and the devid may not recently have been marked
unavailable.
So add a test for ds_clp == NULL and return NULL in that case.
Fixes: c23266d532b4 ("NFS4.1 Fix data server connection race")
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Olga Kornievskaia <aglo@umich.edu>
Acked-by: Adamson, Andy <William.Adamson@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 79f687a3de9e3ba2518b4ea33f38ca6cbe9133eb upstream.
Ben Coddington reports that commit 311324ad1713, by adding the function
nfs_dir_mapping_need_revalidate() that checks page cache validity on
each call to nfs_readdir(), causes a performance regression when
the directory is being modified.
If the directory is changing while we're iterating through the directory,
POSIX does not require us to invalidate the page cache unless the user
calls rewinddir(). However, we still do want to ensure that we use
readdirplus in order to avoid a load of stat() calls when the user
is doing an 'ls -l' workload.
The fix should be to invalidate the page cache immediately when we're
setting the NFS_INO_ADVISE_RDPLUS bit.
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 311324ad1713 ("NFS: Be more aggressive in using readdirplus...")
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ee284e35d8c71bf5d4d807eaff6f67a17134b359 upstream.
We must put the task to sleep while holding the inode->i_lock in order
to ensure atomicity with the test for NFS_LAYOUT_RETURN.
Fixes: 500d701f336b ("NFS41: make close wait for layoutreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f36ab161bebe464d33b998294eff29b17a9c8918 upstream.
Fix typo in parameter description.
Fixes: 5405fc44c337 ("NFSv4.x: Add kernel parameter to control the callback server")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aa7c8da35d1905d80e840d075f07d26ec90144b5 upstream.
In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op
fails sets locked_ref->processing = 0 but doesn't re-increment
delayed_refs->num_heads_ready. As a result, we end up triggering
the WARN_ON in btrfs_select_ref_head.
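A sketch of the error-path correction described above (locking details
assumed):
	/* run_delayed_extent_op() failed: give the ref head back */
	spin_lock(&delayed_refs->lock);
	locked_ref->processing = 0;
	delayed_refs->num_heads_ready++;	/* the missing re-increment */
	spin_unlock(&delayed_refs->lock);
	btrfs_delayed_ref_unlock(locked_ref);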
Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads)
Reported-by: Jon Nelson <jnelson-suse@jamponi.net>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d0280996437081dd12ed1e982ac8aeaa62835ec4 upstream.
In __btrfs_run_delayed_refs, when we put back a delayed ref that's too
new, we have already dropped the lock on locked_ref when we set
->processing = 0.
This patch keeps the lock to cover that assignment.
Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 93362fa47fe98b62e4a34ab408c4a418432e7939 upstream.
This fixes CVE-2016-9191: proc_sys_readdir doesn't drop the reference
added by grab_header when returning from the !dir_emit_dots path.
This can cause any path that calls unregister_sysctl_table to
wait forever.
The calltrace of CVE-2016-9191:
[ 5535.960522] Call Trace:
[ 5535.963265] [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5535.968817] [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
[ 5535.975346] [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
[ 5535.982256] [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
[ 5535.988972] [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
[ 5535.994804] [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
[ 5536.001227] [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
[ 5536.007648] [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
[ 5536.014654] [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
[ 5536.021657] [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
[ 5536.029344] [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
[ 5536.036447] [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5536.043844] [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
[ 5536.051336] [<ffffffff8116789d>] update_flag+0x11d/0x210
[ 5536.057373] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.064186] [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
[ 5536.070899] [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
[ 5536.077420] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.084234] [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
[ 5536.091049] [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
[ 5536.097571] [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
[ 5536.104207] [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5536.110736] [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5536.117461] [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5536.123697] [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5536.130426] [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5536.135991] [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5536.142041] [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
One cgroup maintainer mentioned that "cgroup is trying to offline
a cpuset css, which takes place under cgroup_mutex. The offlining
ends up trying to drain active usages of a sysctl table which apparently
is not happening."
The real reason is that proc_sys_readdir doesn't drop the reference
added by grab_header when returning from the !dir_emit_dots path, so
this cpuset offline path will wait here forever.
See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13
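A sketch of the missing release on that early-return path (using
sysctl_head_finish() as the counterpart of grab_header() is an
assumption based on the description above):
	if (!dir_emit_dots(file, ctx)) {
		sysctl_head_finish(head);	/* the missing reference drop */
		return 0;
	}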
Fixes: f0c3b5093add ("[readdir] convert procfs")
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: Yang Shukui <yangshukui@huawei.com>
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3895dbf8985f656675b5bde610723a29cbce3fa7 upstream.
Protecting the mountpoint hashtable with namespace_sem was sufficient
until a call to umount_mnt was added to mntput_no_expire, at which
point it became possible for multiple calls of put_mountpoint on
the same hash chain to happen at the same time.
Kristen Johansen <kjlx@templeofstupid.com> reported:
> This can cause a panic when simultaneous callers of put_mountpoint
> attempt to free the same mountpoint. This occurs because some callers
> hold the mount_hash_lock, while others hold the namespace lock. Some
> even hold both.
>
> In this submitter's case, the panic manifested itself as a GP fault in
> put_mountpoint() when it called hlist_del() and attempted to dereference
> a m_hash.pprev that had been poisioned by another thread.
Al Viro observed that the simple fix is to switch from using the namespace_sem
to the mount_lock to protect the mountpoint hash table.
I have taken Al's suggested patch, moved put_mountpoint in pivot_root
(instead of taking mount_lock an additional time), and have replaced
new_mountpoint with get_mountpoint, a function that does the hash table
lookup and addition under the mount_lock. The introduction of get_mountpoint
ensures that only the mount_lock is needed to manipulate the mountpoint
hashtable.
d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
already set. This allows get_mountpoint to use the setting of
DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
happens exactly once.
Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts")
Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ac0c7cf8be00f269f82964cf7b144ca3edc5dbc4 upstream.
Enabling btrfs tracepoints leads to an instant crash, as reported. The wq
callbacks could free the memory and the tracepoints started to
dereference the members to get to fs_info.
The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2
removed the tracepoints but we could preserve them by passing only the
required data in a safe way.
Fixes: bc074524e123 ("btrfs: prefix fsid to all trace events")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0a417b8dc1f10b03e8f558b8a831f07ec4c23795 upstream.
Commit 99579ccec4e2 "xfs: skip dirty pages in ->releasepage()" started
to skip dirty pages in xfs_vm_releasepage() which also has the effect
that if a dirty page is truncated, it does not get freed by
block_invalidatepage() and is lingering in LRU list waiting for reclaim.
So a simple loop like:
while true; do
dd if=/dev/zero of=file bs=1M count=100
rm file
done
will keep using more and more memory until we hit the low watermarks and
start pagecache reclaim, which will eventually also reclaim the truncated
pages. Keeping these truncated (and thus never usable) pages in memory
is just a waste of memory, is unnecessarily stressing page cache
reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
prematurely.
So instead of just skipping dirty pages in xfs_vm_releasepage(), return
to old behavior of skipping them only if they have delalloc or unwritten
buffers and fix the spurious warnings by warning only if the page is
clean.
CC: Brian Foster <bfoster@redhat.com>
CC: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz>
Fixes: 99579ccec4e2 ("xfs: skip dirty pages in ->releasepage()")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e7ee2c089e94067d68475990bdeed211c8852917 upstream.
The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.
The crash backtrace is below:
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: Beginning quota recovery on device (253,18) for slot 2
ocfs2: Finishing quota recovery on device (253,18) for slot 2
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
------------[ cut here ]------------
kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
invalid opcode: 0000 [#1] SMP
Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: No, Unsupported modules are loaded
CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
RIP: 0010:[<ffffffffa05c8c30>] [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282
RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
Call Trace:
ocfs2_setattr+0x698/0xa90 [ocfs2]
notify_change+0x1ae/0x380
do_truncate+0x5e/0x90
do_sys_ftruncate.constprop.11+0x108/0x160
entry_SYSCALL_64_fastpath+0x12/0x6d
Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
It's because ocfs2_inode_lock() gets us a stale LVB in which the i_size
is not equal to the disk i_size. We mistakenly trust the LVB because
the underlying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But why?
The current code tries to downconvert the lock without the DLM_LKF_VALBLK
flag, to tell o2cb not to update the RSB's LVB if it's a PR->NULL
conversion, even if the lock resource type needs an LVB. This is not
the right way for fsdlm.
The fsdlm plugin behaves differently on DLM_LKF_VALBLK: it depends on
DLM_LKF_VALBLK to decide whether we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering the RSB's LVB from
this lkb and will set DLM_SBF_VALNOTVALID appropriately when a node
failure happens.
The following diagram briefly illustrates how this crash happens:
RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
The 1st round:
Node1 Node2
RSB1: PR
RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
ocfs2_dlm_lock(no DLM_LKF_VALBLK)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
dlm_lock(no DLM_LKF_VALBLK)
convert_lock(overwrite lkb->lkb_exflags
with no DLM_LKF_VALBLK)
RSB1: NULL RSB1: EX
reset Node2
dlm_recover_rsbs()
recover_lvb()
/* The LVB is not trustable if the node with EX fails and
* no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
*/
if (!(lkb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
        return;                       * invalidate the LVB here.
                                      */
The 2nd round:
Node 1 Node2
RSB1(become master from recovery)
ocfs2_setattr()
ocfs2_inode_lock(NULL->EX)
/* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
ocfs2_truncate_file()
mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */
The fix is quite straightforward. We keep setting the DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs an LVB and the fsdlm plugin
is used.
Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ff97f2399edac1e0fb3fa7851d5fbcbdf04717cf upstream.
The max_retries _show and _store functions should test against
cfg->max_retries, not cfg->retry_timeout.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>