Age | Commit message (Collapse) | Author |
|
commit a17712c8e4be4fa5404d20e9cd3b2b21eae7bc56 upstream.
This patch attempts to close a hole leading to a BUG seen with hot
removals during writes [1].
A block device (NVME namespace in this test case) is formatted to EXT4
without partitions. It's mounted and write I/O is run to a file, then
the device is hot removed from the slot. The superblock attempts to be
written to the drive which is no longer present.
The typical chain of events leading to the BUG:
ext4_commit_super()
__sync_dirty_buffer()
submit_bh()
submit_bh_wbc()
BUG_ON(!buffer_mapped(bh));
This fix checks for the superblock's buffer head being mapped prior to
syncing.
[1] https://www.spinics.net/lists/linux-ext4/msg56527.html
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bfe0a5f47ada40d7984de67e59a7d3390b9b9ecc upstream.
The kernel's ext4 mount-time checks were more permissive than
e2fsprogs's libext2fs checks when opening a file system. The
superblock is considered too insane for debugfs or e2fsck to operate
on it, the kernel has no business trying to mount it.
This will make file system fuzzing tools work harder, but the failure
cases that they find will be more useful and be easier to evaluate.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c37e9e013469521d9adb932d17a1795c139b36db upstream.
If there is a directory entry pointing to a system inode (such as a
journal inode), complain and declare the file system to be corrupted.
Also, if the superblock's first inode number field is too small,
refuse to mount the file system.
This addresses CVE-2018-10882.
https://bugzilla.kernel.org/show_bug.cgi?id=200069
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6e8ab72a812396996035a37e5ca4b3b99b5d214b upstream.
When converting from an inode from storing the data in-line to a data
block, ext4_destroy_inline_data_nolock() was only clearing the on-disk
copy of the i_blocks[] array. It was not clearing copy of the
i_blocks[] in ext4_inode_info, in i_data[], which is the copy actually
used by ext4_map_blocks().
This didn't matter much if we are using extents, since the extents
header would be invalid and thus the extents could would re-initialize
the extents tree. But if we are using indirect blocks, the previous
contents of the i_blocks array will be treated as block numbers, with
potentially catastrophic results to the file system integrity and/or
user data.
This gets worse if the file system is using a 1k block size and
s_first_data is zero, but even without this, the file system can get
quite badly corrupted.
This addresses CVE-2018-10881.
https://bugzilla.kernel.org/show_bug.cgi?id=200015
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bdbd6ce01a70f02e9373a584d0ae9538dcf0a121 upstream.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bc890a60247171294acc0bd67d211fa4b88d40ba upstream.
If there is a corupted file system where the claimed depth of the
extent tree is -1, this can cause a massive buffer overrun leading to
sadness.
This addresses CVE-2018-10877.
https://bugzilla.kernel.org/show_bug.cgi?id=199417
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8844618d8aa7a9973e7b527d038a2a589665002c upstream.
The bg_flags field in the block group descripts is only valid if the
uninit_bg or metadata_csum feature is enabled. We were not
consistently looking at this field; fix this.
Also block group #0 must never have uninitialized allocation bitmaps,
or need to be zeroed, since that's where the root inode, and other
special inodes are set up. Check for these conditions and mark the
file system as corrupted if they are detected.
This addresses CVE-2018-10876.
https://bugzilla.kernel.org/show_bug.cgi?id=199403
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 819b23f1c501b17b9694325471789e6b5cc2d0d2 upstream.
Regardless of whether the flex_bg feature is set, we should always
check to make sure the bits we are setting in the block bitmap are
within the block group bounds.
https://bugzilla.kernel.org/show_bug.cgi?id=199865
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 77260807d1170a8cf35dbb06e07461a655f67eee upstream.
It's really bad when the allocation bitmaps and the inode table
overlap with the block group descriptors, since it causes random
corruption of the bg descriptors. So we really want to head those off
at the pass.
https://bugzilla.kernel.org/show_bug.cgi?id=199865
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e09463f220ca9a1a1ecfda84fcda658f99a1f12a upstream.
Do not set the b_modified flag in block's journal head should not
until after we're sure that jbd2_journal_dirty_metadat() will not
abort with an error due to there not being enough space reserved in
the jbd2 handle.
Otherwise, future attempts to modify the buffer may lead a large
number of spurious errors and warnings.
This addresses CVE-2018-10883.
https://bugzilla.kernel.org/show_bug.cgi?id=200071
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7ffbe65578b44fafdef577a360eb0583929f7c6e upstream.
For every request we send, whether it is SMB1 or SMB2+, we attempt to
reconnect tcon (cifs_reconnect_tcon or smb2_reconnect) before carrying
out the request.
So, while server->tcpStatus != CifsNeedReconnect, we wait for the
reconnection to succeed on wait_event_interruptible_timeout(). If it
returns, that means that either the condition was evaluated to true, or
timeout elapsed, or it was interrupted by a signal.
Since we're not handling the case where the process woke up due to a
received signal (-ERESTARTSYS), the next call to
wait_event_interruptible_timeout() will _always_ fail and we end up
looping forever inside either cifs_reconnect_tcon() or smb2_reconnect().
Here's an example of how to trigger that:
$ mount.cifs //foo/share /mnt/test -o
username=foo,password=foo,vers=1.0,hard
(break connection to server before executing bellow cmd)
$ stat -f /mnt/test & sleep 140
[1] 2511
$ ps -aux -q 2511
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2511 0.0 0.0 12892 1008 pts/0 S 12:24 0:00 stat -f
/mnt/test
$ kill -9 2511
(wait for a while; process is stuck in the kernel)
$ ps -aux -q 2511
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2511 83.2 0.0 12892 1008 pts/0 R 12:24 30:01 stat -f
/mnt/test
By using 'hard' mount point means that cifs.ko will keep retrying
indefinitely, however we must allow the process to be killed otherwise
it would hang the system.
Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Cc: stable@vger.kernel.org
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fa65653e575fbd958bdf5fb9c4a71a324e39510d upstream.
Detect when a directory entry is (possibly partially) beyond directory
size and return EIO in that case since it means the filesystem is
corrupted. Otherwise directory operations can further corrupt the
directory and possibly also oops the kernel.
CC: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
CC: stable@vger.kernel.org
Reported-and-tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fc40724fc6731d90cc7fb6d62d66135f85a33dd2 upstream.
The correct behaviour for NFSv4 sequence IDs is to wrap around
to the value 0 after 0xffffffff.
See https://tools.ietf.org/html/rfc5661#section-2.10.6.1
Fixes: 5f83d86cf531d ("NFSv4.x: Fix wraparound issues when validing...")
Cc: stable@vger.kernel.org # 4.6+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d68894800ec5712d7ddf042356f11e36f87d7f78 upstream.
In nfs_idmap_read_and_verify_message there is an incorrect sprintf '%d'
that converts the __u32 'im_id' from struct idmap_msg to 'id_str', which
is a stack char array variable of length NFS_UINT_MAXLEN == 11.
If a uid or gid value is > 2147483647 = 0x7fffffff, the conversion
overflows into a negative value, for example:
crash> p (unsigned) (0x80000000)
$1 = 2147483648
crash> p (signed) (0x80000000)
$2 = -2147483648
The '-' sign is written to the buffer and this causes a 1 byte overflow
when the NULL byte is written, which corrupts kernel stack memory. If
CONFIG_CC_STACKPROTECTOR_STRONG is set we see a stack-protector panic:
[11558053.616565] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffa05b8a8c
[11558053.639063] CPU: 6 PID: 9423 Comm: rpc.idmapd Tainted: G W ------------ T 3.10.0-514.el7.x86_64 #1
[11558053.641990] Hardware name: Red Hat OpenStack Compute, BIOS 1.10.2-3.el7_4.1 04/01/2014
[11558053.644462] ffffffff818c7bc0 00000000b1f3aec1 ffff880de0f9bd48 ffffffff81685eac
[11558053.646430] ffff880de0f9bdc8 ffffffff8167f2b3 ffffffff00000010 ffff880de0f9bdd8
[11558053.648313] ffff880de0f9bd78 00000000b1f3aec1 ffffffff811dcb03 ffffffffa05b8a8c
[11558053.650107] Call Trace:
[11558053.651347] [<ffffffff81685eac>] dump_stack+0x19/0x1b
[11558053.653013] [<ffffffff8167f2b3>] panic+0xe3/0x1f2
[11558053.666240] [<ffffffff811dcb03>] ? kfree+0x103/0x140
[11558053.682589] [<ffffffffa05b8a8c>] ? idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4]
[11558053.689710] [<ffffffff810855db>] __stack_chk_fail+0x1b/0x30
[11558053.691619] [<ffffffffa05b8a8c>] idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4]
[11558053.693867] [<ffffffffa00209d6>] rpc_pipe_write+0x56/0x70 [sunrpc]
[11558053.695763] [<ffffffff811fe12d>] vfs_write+0xbd/0x1e0
[11558053.702236] [<ffffffff810acccc>] ? task_work_run+0xac/0xe0
[11558053.704215] [<ffffffff811fec4f>] SyS_write+0x7f/0xe0
[11558053.709674] [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b
Fix this by calling the internally defined nfs_map_numeric_to_string()
function which properly uses '%u' to convert this __u32. For consistency,
also replace the one other place where snprintf is called.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reported-by: Stephen Johnston <sjohnsto@redhat.com>
Fixes: cf4ab538f1516 ("NFSv4: Fix the string length returned by the idmapper")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9c2ece6ef67e9d376f32823086169b489c422ed0 upstream.
nfsd4_readdir_rsize restricts rd_maxcount to svc_max_payload when
estimating the size of the readdir reply, but nfsd_encode_readdir
restricts it to INT_MAX when encoding the reply. This can result in log
messages like "kernel: RPC request reserved 32896 but used 1049444".
Restrict rd_dircount similarly (no reason it should be larger than
svc_max_payload).
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 353748a359f1821ee934afc579cf04572406b420 upstream.
There is potential for the size and len fields in ubifs_data_node to be
too large causing either a negative value for the length fields or an
integer overflow leading to an incorrect memory allocation. Likewise,
when the len field is small, an integer underflow may occur.
Signed-off-by: Silvio Cesare <silvio.cesare@gmail.com>
Fixes: 1e51764a3c2ac ("UBIFS: add new flash file system")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5811375325420052fcadd944792a416a43072b7f upstream.
Fstests generic/475 provides a way to fail metadata reads while
checking if checksum exists for the inode inside run_delalloc_nocow(),
and csum_exist_in_range() interprets error (-EIO) as inode having
checksum and makes its caller enter the cow path.
In case of free space inode, this ends up with a warning in
cow_file_range().
The same problem applies to btrfs_cross_ref_exist() since it may also
read metadata in between.
With this, run_delalloc_nocow() bails out when errors occur at the two
places.
cc: <stable@vger.kernel.org> v2.6.28+
Fixes: 17d217fe970d ("Btrfs: fix nodatasum handling in balancing code")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c5b4a50b74018b3677098151ec5f4fce07d5e6a0 upstream.
If we failed during a rename exchange operation after starting/joining a
transaction, we would end up replacing the return value, stored in the
local 'ret' variable, with the return value from btrfs_end_transaction().
So this could end up returning 0 (success) to user space despite the
operation having failed and aborted the transaction, because if there are
multiple tasks having a reference on the transaction at the time
btrfs_end_transaction() is called by the rename exchange, that function
returns 0 (otherwise it returns -EIO and not the original error value).
So fix this by not overwriting the return value on error after getting
a transaction handle.
Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6becdb601bae2a043d7fb9762c4d48699528ea6e upstream.
syzbot is reporting NULL pointer dereference at fuse_ctl_remove_conn() [1].
Since fc->ctl_ndents is incremented by fuse_ctl_add_conn() when new_inode()
failed, fuse_ctl_remove_conn() reaches an inode-less dentry and tries to
clear d_inode(dentry)->i_private field.
Fix by only adding the dentry to the array after being fully set up.
When tearing down the control directory, do d_invalidate() on it to get rid
of any mounts that might have been added.
[1] https://syzkaller.appspot.com/bug?id=f396d863067238959c91c0b7cfc10b163638cac6
Reported-by: syzbot <syzbot+32c236387d66c4516827@syzkaller.appspotmail.com>
Fixes: bafa96541b25 ("[PATCH] fuse: add control filesystem")
Cc: <stable@vger.kernel.org> # v2.6.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 543b8f8662fe6d21f19958b666ab0051af9db21a upstream.
syzbot is reporting use-after-free at fuse_kill_sb_blk() [1].
Since sb->s_fs_info field is not cleared after fc was released by
fuse_conn_put() when initialization failed, fuse_kill_sb_blk() finds
already released fc and tries to hold the lock. Fix this by clearing
sb->s_fs_info field after calling fuse_conn_put().
[1] https://syzkaller.appspot.com/bug?id=a07a680ed0a9290585ca424546860464dd9658db
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+ec3986119086fe4eec97@syzkaller.appspotmail.com>
Fixes: 3b463ae0c626 ("fuse: invalidation reverse calls")
Cc: John Muir <john@jmuir.com>
Cc: Csaba Henk <csaba@gluster.com>
Cc: Anand Avati <avati@redhat.com>
Cc: <stable@vger.kernel.org> # v2.6.31
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit df0e91d488276086bc07da2e389986cae0048c37 upstream.
Fuse has an "atomic_o_trunc" mode, where userspace filesystem uses the
O_TRUNC flag in the OPEN request to truncate the file atomically with the
open.
In this mode there's no need to send a SETATTR request to userspace after
the open, so fuse_do_setattr() checks this mode and returns. But this
misses the important step of truncating the pagecache.
Add the missing parts of truncation to the ATTR_OPEN branch.
Reported-by: Chad Austin <chadaustin@fb.com>
Fixes: 6ff958edbf39 ("fuse: add atomic open+truncate support")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5cc41e099504b77014358b58567c5ea6293dd220 upstream.
WHen registering a new binfmt_misc handler, it is possible to overflow
the offset to get a negative value, which might crash the system, or
possibly leak kernel data.
Here is a crash log when 2500000000 was used as an offset:
BUG: unable to handle kernel paging request at ffff989cfd6edca0
IP: load_misc_binary+0x22b/0x470 [binfmt_misc]
PGD 1ef3e067 P4D 1ef3e067 PUD 0
Oops: 0000 [#1] SMP NOPTI
Modules linked in: binfmt_misc kvm_intel ppdev kvm irqbypass joydev input_leds serio_raw mac_hid parport_pc qemu_fw_cfg parpy
CPU: 0 PID: 2499 Comm: bash Not tainted 4.15.0-22-generic #24-Ubuntu
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
RIP: 0010:load_misc_binary+0x22b/0x470 [binfmt_misc]
Call Trace:
search_binary_handler+0x97/0x1d0
do_execveat_common.isra.34+0x667/0x810
SyS_execve+0x31/0x40
do_syscall_64+0x73/0x130
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Use kstrtoint instead of simple_strtoul. It will work as the code
already set the delimiter byte to '\0' and we only do it when the field
is not empty.
Tested with offsets -1, 2500000000, UINT_MAX and INT_MAX. Also tested
with examples documented at Documentation/admin-guide/binfmt-misc.rst
and other registrations from packages on Ubuntu.
Link: http://lkml.kernel.org/r/20180529135648.14254-1-cascardo@canonical.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f6a4b4c9d07dda90c7c29dae96d6119ac6425dca upstream.
As long as a symlink inode remains in-core, the destination (and
therefore size) will not be re-fetched from the server, as it cannot
change. The original implementation of the attribute cache assumed that
setting the expiry time in the past was sufficient to cause a re-fetch
of all attributes on the next getattr. That does not work in this case.
The bug manifested itself as follows. When the command sequence
touch foo; ln -s foo bar; ls -l bar
is run, the output was
lrwxrwxrwx. 1 fedora fedora 4906 Apr 24 19:10 bar -> foo
However, after a re-mount, ls -l bar produces
lrwxrwxrwx. 1 fedora fedora 3 Apr 24 19:10 bar -> foo
After this commit, even before a re-mount, the output is
lrwxrwxrwx. 1 fedora fedora 3 Apr 24 19:10 bar -> foo
Reported-by: Becky Ligon <ligon@clemson.edu>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Fixes: 71680c18c8f2 ("orangefs: Cache getattr results.")
Cc: stable@vger.kernel.org
Cc: hubcap@omnibond.com
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b2adf22fdfba85a6701c481faccdbbb3a418ccfc upstream.
The server detects reconnect by the (non-zero) value in PreviousSessionId
of SMB2/SMB3 SessionSetup request, but this behavior regressed due
to commit 166cea4dc3a4f66f020cfb9286225ecd228ab61d
("SMB2: Separate RawNTLMSSP authentication from SMB2_sess_setup")
CC: Stable <stable@vger.kernel.org>
CC: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ac0b4145d662a3b9e34085dea460fb06ede9b69b upstream.
[BUG]
Btrfs can create compressed extent without checksum (even though it
shouldn't), and if we then try to replace device containing such extent,
the result device will contain all the uncompressed data instead of the
compressed one.
Test case already submitted to fstests:
https://patchwork.kernel.org/patch/10442353/
[CAUSE]
When handling compressed extent without checksum, device replace will
goe into copy_nocow_pages() function.
In that function, btrfs will get all inodes referring to this data
extents and then use find_or_create_page() to get pages direct from that
inode.
The problem here is, pages directly from inode are always uncompressed.
And for compressed data extent, they mismatch with on-disk data.
Thus this leads to corrupted compressed data extent written to replace
device.
[FIX]
In this attempt, we could just remove the "optimization" branch, and let
unified scrub_pages() to handle it.
Although scrub_pages() won't bother reusing page cache, it will be a
little slower, but it does the correct csum checking and won't cause
such data corruption caused by "optimization".
Note about the fix: this is the minimal fix that can be backported to
older stable trees without conflicts. The whole callchain from
copy_nocow_pages() can be deleted, and will be in followup patches.
Fixes: ff023aac3119 ("Btrfs: add code to scrub to copy read data to another disk")
CC: stable@vger.kernel.org # 4.4+
Reported-by: James Harvey <jamespharvey20@gmail.com>
Reviewed-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ remove code removal, add note why ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fd4e994bd1f9dc9628e168a7f619bf69f6984635 upstream.
If we have invalid flags set, when we error out we must drop our writer
counter and free the buffer we allocated for the arguments. This bug is
trivially reproduced with the following program on 4.7+:
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>
int main(int argc, char **argv)
{
struct btrfs_ioctl_vol_args_v2 vol_args = {
.flags = UINT64_MAX,
};
int ret;
int fd;
if (argc != 2) {
fprintf(stderr, "usage: %s PATH\n", argv[0]);
return EXIT_FAILURE;
}
fd = open(argv[1], O_WRONLY);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args);
if (ret == -1)
perror("ioctl");
close(fd);
return EXIT_SUCCESS;
}
When unmounting the filesystem, we'll hit the
WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the
filesystem to be remounted read-only as the writer count will stay
lifted.
Fixes: 6b526ed70cf1 ("btrfs: introduce device delete by devid")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b5c40d598f5408bd0ca22dfffa82f03cd9433f23 upstream.
In btrfs_clone_files(), we must check the NODATASUM flag while the
inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
will change the flags after we check and we can end up with a party
checksummed file.
The race window is only a few instructions in size, between the if and
the locks which is:
3834 if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
3835 return -EISDIR;
where the setflags must be run and toggle the NODATASUM flag (provided
the file size is 0). The clone will block on the inode lock, segflags
takes the inode lock, changes flags, releases log and clone continues.
Not impossible but still needs a lot of bad luck to hit unintentionally.
Fixes: 0e7b824c4ef9 ("Btrfs: don't make a file partly checksummed through file clone")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4f2f76f751433908364ccff82f437a57d0e6e9b7 upstream.
ext4_resize_fs() has an off-by-one bug when checking whether growing of
a filesystem will not overflow inode count. As a result it allows a
filesystem with 8192 inodes per group to grow to 64TB which overflows
inode count to 0 and makes filesystem unusable. Fix it.
Cc: stable@vger.kernel.org
Fixes: 3f8a6411fbada1fa482276591e037f3b1adcf55b
Reported-by: Jaco Kroon <jaco@uls.co.za>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit eee597ac931305eff3d3fd1d61d6aae553bc0984 upstream.
Currently in ext4_punch_hole we're going to skip the mtime update if
there are no actual blocks to release. However we've actually modified
the file by zeroing the partial block so the mtime should be updated.
Moreover the sync and datasync handling is skipped as well, which is
also wrong. Fix it.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Joe Habermann <joe.habermann@quantum.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2ee3ee06a8fd792765fa3267ddf928997797eec5 upstream.
When ext4_ind_map_blocks() computes a length of a hole, it doesn't count
with the fact that mapped offset may be somewhere in the middle of the
completely empty subtree. In such case it will return too large length
of the hole which then results in lseek(SEEK_DATA) to end up returning
an incorrect offset beyond the end of the hole.
Fix the problem by correctly taking offset within a subtree into account
when computing a length of a hole.
Fixes: facab4d9711e7aa3532cb82643803e8f1b9518e8
CC: stable@vger.kernel.org
Reported-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8810f7517a3bc4ca2d41d022446d3f5fd6b77c09 ]
There is a scenario that can end up with rebuild process failing to
return good content, i.e.
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,
- the 1st retry is to rebuild with all other stripes, it'll eventually
be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
that it will do raid6 style rebuild,
however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content, and users will think of this as data loss.
More seriouly, if the loss happens on some important internal btree
roots, it could refuse to mount.
This extends btrfs to do more retries and each retry fails only one
stripe. Since raid6 can tolerate 2 disk failures, if there is one
more failure besides the failure on which we're recovering, this can
always work.
The worst case is to retry as many times as the number of raid6 disks,
but given the fact that such a scenario is really rare in practice,
it's still acceptable.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit 186a6519dc94964a4c5c68fca482f20f71551f26.
This commit used an incorrect log message.
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e2731e55884f2138a252b0a3d7b24d57e49c3c59 upstream.
btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34).
So just define that in kernel so that we know its been used.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4faa99965e027cc057c5145ce45fa772caa04e8d upstream.
If io_destroy() gets to cancelling everything that can be cancelled and
gets to kiocb_cancel() calling the function driver has left in ->ki_cancel,
it becomes vulnerable to a race with IO completion. At that point req
is already taken off the list and aio_complete() does *NOT* spin until
we (in free_ioctx_users()) releases ->ctx_lock. As the result, it proceeds
to kiocb_free(), freing req just it gets passed to ->ki_cancel().
Fix is simple - remove from the list after the call of kiocb_cancel(). All
instances of ->ki_cancel() already have to cope with the being called with
iocb still on list - that's what happens in io_cancel(2).
Cc: stable@kernel.org
Fixes: 0460fef2a921 "aio: use cancellation list lazily"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a27ba2607e60312554cbcd43fc660b2c7f29dc9c upstream.
The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.
This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.
Instead, avoid unnecessarily complex detection logic and and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.
Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.
This allows kernels that include the packing fix commit 96f859d52bcb
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 116e5258e4115aca0c64ac0bf40ded3b353ed626 ]
Currently when UDF filesystem is recorded without uid / gid (ids are set
to -1), we will assign INVALID_[UG]ID to vfs inode unless user uses uid=
and gid= mount options. In such case filesystem could not be modified in
any way as VFS refuses to modify files with invalid ids (even by root).
This is confusing to users and not very useful default since such media
mode is generally used for removable media. Use overflow[ug]id instead
so that at least root can modify the filesystem.
Reported-by: Steve Kenton <skenton@ou.edu>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 174d1232ebc84fcde8f5889d1171c9c7e74a10a7 ]
The chunk size of allocations in __gfs2_fallocate is calculated
incorrectly. The size can collapse, causing __gfs2_fallocate to
allocate one block at a time, which is very inefficient. This needs
fixing in two places:
In gfs2_quota_lock_check, always set ap->allowed to UINT_MAX to indicate
that there is no quota limit. This fixes callers that rely on
ap->allowed to be set even when quotas are off.
In __gfs2_fallocate, reset max_blks to UINT_MAX in each iteration of the
loop to make sure that allocation limits from one resource group won't
spill over into another resource group.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit bf617f7a92edc6bb2909db2bfa4576f50b280ee5 ]
If noextent_cache mount option is on, we will never initialize extent tree
in inode, but still we're going to access it in f2fs_drop_extent_tree,
result in kernel panic as below:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: _raw_write_lock+0xc/0x30
Call Trace:
? f2fs_drop_extent_tree+0x41/0x70 [f2fs]
f2fs_fallocate+0x5a0/0xdd0 [f2fs]
? common_file_perm+0x47/0xc0
? apparmor_file_permission+0x1a/0x20
vfs_fallocate+0x15b/0x290
SyS_fallocate+0x44/0x70
do_syscall_64+0x6e/0x160
entry_SYSCALL64_slow_path+0x25/0x25
This patch fixes to check extent cache status before using in
f2fs_drop_extent_tree.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8a5a916d9a35e13576d79cc16e24611821b13e34 ]
While running btrfs/011, I hit the following lockdep splat.
This is the important bit:
pcpu_alloc+0x1ac/0x5e0
__percpu_counter_init+0x4e/0xb0
btrfs_init_fs_root+0x99/0x1c0 [btrfs]
btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
resolve_indirect_refs+0x130/0x830 [btrfs]
find_parent_nodes+0x69e/0xff0 [btrfs]
btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
btrfs_find_all_roots+0x50/0x70 [btrfs]
btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
The percpu_counter_init call in btrfs_alloc_subvolume_writers
uses GFP_KERNEL, which we can't do during transaction commit.
This switches it to GFP_NOFS.
========================================================
WARNING: possible irq lock inversion dependency detected
4.12.14-kvmsmall #8 Tainted: G W
--------------------------------------------------------
kswapd0/50 just changed the state of lock:
(&delayed_node->mutex){+.+.-.}, at: [<ffffffffc06994fa>] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
(pcpu_alloc_mutex){+.+.+.}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(pcpu_alloc_mutex);
local_irq_disable();
lock(&delayed_node->mutex);
lock(&found->groups_sem);
<Interrupt>
lock(&delayed_node->mutex);
*** DEADLOCK ***
2 locks held by kswapd0/50:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff811dc11f>] shrink_slab+0x7f/0x5b0
#1: (&type->s_umount_key#30){+++++.}, at: [<ffffffff8126dec6>] trylock_super+0x16/0x50
the shortest dependencies between 2nd lock and 1st lock:
-> (pcpu_alloc_mutex){+.+.+.} ops: 4904 {
HARDIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
pcpu_alloc+0x1ac/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
__do_tune_cpucache+0x2c/0x220
do_tune_cpucache+0x26/0xc0
enable_cpucache+0x6d/0xf0
kmem_cache_init_late+0x42/0x75
start_kernel+0x343/0x4cb
x86_64_start_kernel+0x127/0x134
secondary_startup_64+0xa5/0xb0
SOFTIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
pcpu_alloc+0x1ac/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
__do_tune_cpucache+0x2c/0x220
do_tune_cpucache+0x26/0xc0
enable_cpucache+0x6d/0xf0
kmem_cache_init_late+0x42/0x75
start_kernel+0x343/0x4cb
x86_64_start_kernel+0x127/0x134
secondary_startup_64+0xa5/0xb0
RECLAIM_FS-ON-W at:
__kmalloc+0x47/0x310
pcpu_extend_area_map+0x2b/0xc0
pcpu_alloc+0x3ec/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
__do_tune_cpucache+0x2c/0x220
do_tune_cpucache+0x26/0xc0
enable_cpucache+0x6d/0xf0
__kmem_cache_create+0x1bf/0x390
create_cache+0xba/0x1b0
kmem_cache_create+0x1f8/0x2b0
ksm_init+0x6f/0x19d
do_one_initcall+0x50/0x1b0
kernel_init_freeable+0x201/0x289
kernel_init+0xa/0x100
ret_from_fork+0x3a/0x50
INITIAL USE at:
__mutex_lock+0x4e/0x8c0
pcpu_alloc+0x1ac/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
setup_cpu_cache+0x2f/0x1f0
__kmem_cache_create+0x1bf/0x390
create_boot_cache+0x8b/0xb1
kmem_cache_init+0xa1/0x19e
start_kernel+0x270/0x4cb
x86_64_start_kernel+0x127/0x134
secondary_startup_64+0xa5/0xb0
}
... key at: [<ffffffff821d8e70>] pcpu_alloc_mutex+0x70/0xa0
... acquired at:
pcpu_alloc+0x1ac/0x5e0
__percpu_counter_init+0x4e/0xb0
btrfs_init_fs_root+0x99/0x1c0 [btrfs]
btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
resolve_indirect_refs+0x130/0x830 [btrfs]
find_parent_nodes+0x69e/0xff0 [btrfs]
btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
btrfs_find_all_roots+0x50/0x70 [btrfs]
btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
transaction_kthread+0x176/0x1b0 [btrfs]
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
-> (&fs_info->commit_root_sem){++++..} ops: 1566382 {
HARDIRQ-ON-W at:
down_write+0x3e/0xa0
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
cow_file_range.isra.66+0x133/0x470 [btrfs]
run_delalloc_range+0x121/0x410 [btrfs]
writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
__extent_writepage+0x19a/0x360 [btrfs]
extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
extent_writepages+0x4d/0x60 [btrfs]
do_writepages+0x1a/0x70
__filemap_fdatawrite_range+0xa7/0xe0
btrfs_rename+0x5ee/0xdb0 [btrfs]
vfs_rename+0x52a/0x7e0
SyS_rename+0x351/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
HARDIRQ-ON-R at:
down_read+0x35/0x90
caching_thread+0x57/0x560 [btrfs]
normal_work_helper+0x1c0/0x5e0 [btrfs]
process_one_work+0x1e0/0x5c0
worker_thread+0x44/0x390
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
SOFTIRQ-ON-W at:
down_write+0x3e/0xa0
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
cow_file_range.isra.66+0x133/0x470 [btrfs]
run_delalloc_range+0x121/0x410 [btrfs]
writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
__extent_writepage+0x19a/0x360 [btrfs]
extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
extent_writepages+0x4d/0x60 [btrfs]
do_writepages+0x1a/0x70
__filemap_fdatawrite_range+0xa7/0xe0
btrfs_rename+0x5ee/0xdb0 [btrfs]
vfs_rename+0x52a/0x7e0
SyS_rename+0x351/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-R at:
down_read+0x35/0x90
caching_thread+0x57/0x560 [btrfs]
normal_work_helper+0x1c0/0x5e0 [btrfs]
process_one_work+0x1e0/0x5c0
worker_thread+0x44/0x390
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
INITIAL USE at:
down_write+0x3e/0xa0
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
cow_file_range.isra.66+0x133/0x470 [btrfs]
run_delalloc_range+0x121/0x410 [btrfs]
writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
__extent_writepage+0x19a/0x360 [btrfs]
extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
extent_writepages+0x4d/0x60 [btrfs]
do_writepages+0x1a/0x70
__filemap_fdatawrite_range+0xa7/0xe0
btrfs_rename+0x5ee/0xdb0 [btrfs]
vfs_rename+0x52a/0x7e0
SyS_rename+0x351/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
}
... key at: [<ffffffffc0729578>] __key.61970+0x0/0xfffffffffff9aa88 [btrfs]
... acquired at:
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
btrfs_create_tree+0xbb/0x2a0 [btrfs]
btrfs_create_uuid_tree+0x37/0x140 [btrfs]
open_ctree+0x23c0/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> (&found->groups_sem){++++..} ops: 2134587 {
HARDIRQ-ON-W at:
down_write+0x3e/0xa0
__link_block_group+0x34/0x130 [btrfs]
btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
open_ctree+0x2054/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
HARDIRQ-ON-R at:
down_read+0x35/0x90
btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
open_ctree+0x207b/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-W at:
down_write+0x3e/0xa0
__link_block_group+0x34/0x130 [btrfs]
btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
open_ctree+0x2054/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-R at:
down_read+0x35/0x90
btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
open_ctree+0x207b/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
INITIAL USE at:
down_write+0x3e/0xa0
__link_block_group+0x34/0x130 [btrfs]
btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
open_ctree+0x2054/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
}
... key at: [<ffffffffc0729488>] __key.59101+0x0/0xfffffffffff9ab78 [btrfs]
... acquired at:
find_free_extent+0xcb4/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
__btrfs_cow_block+0x110/0x5b0 [btrfs]
btrfs_cow_block+0xd7/0x290 [btrfs]
btrfs_search_slot+0x1f6/0x960 [btrfs]
btrfs_lookup_inode+0x2a/0x90 [btrfs]
__btrfs_update_delayed_inode+0x65/0x210 [btrfs]
btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs]
btrfs_evict_inode+0x3fe/0x6a0 [btrfs]
evict+0xc4/0x190
__dentry_kill+0xbf/0x170
dput+0x2ae/0x2f0
SyS_rename+0x2a6/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> (&delayed_node->mutex){+.+.-.} ops: 5580204 {
HARDIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
btrfs_update_inode+0x83/0x110 [btrfs]
btrfs_dirty_inode+0x62/0xe0 [btrfs]
touch_atime+0x8c/0xb0
do_generic_file_read+0x818/0xb10
__vfs_read+0xdc/0x150
vfs_read+0x8a/0x130
SyS_read+0x45/0xa0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
btrfs_update_inode+0x83/0x110 [btrfs]
btrfs_dirty_inode+0x62/0xe0 [btrfs]
touch_atime+0x8c/0xb0
do_generic_file_read+0x818/0xb10
__vfs_read+0xdc/0x150
vfs_read+0x8a/0x130
SyS_read+0x45/0xa0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
IN-RECLAIM_FS-W at:
__mutex_lock+0x4e/0x8c0
__btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
btrfs_evict_inode+0x22c/0x6a0 [btrfs]
evict+0xc4/0x190
dispose_list+0x35/0x50
prune_icache_sb+0x42/0x50
super_cache_scan+0x139/0x190
shrink_slab+0x262/0x5b0
shrink_node+0x2eb/0x2f0
kswapd+0x2eb/0x890
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
INITIAL USE at:
__mutex_lock+0x4e/0x8c0
btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
btrfs_update_inode+0x83/0x110 [btrfs]
btrfs_dirty_inode+0x62/0xe0 [btrfs]
touch_atime+0x8c/0xb0
do_generic_file_read+0x818/0xb10
__vfs_read+0xdc/0x150
vfs_read+0x8a/0x130
SyS_read+0x45/0xa0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
}
... key at: [<ffffffffc072d488>] __key.56935+0x0/0xfffffffffff96b78 [btrfs]
... acquired at:
__lock_acquire+0x264/0x11c0
lock_acquire+0xbd/0x1e0
__mutex_lock+0x4e/0x8c0
__btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
btrfs_evict_inode+0x22c/0x6a0 [btrfs]
evict+0xc4/0x190
dispose_list+0x35/0x50
prune_icache_sb+0x42/0x50
super_cache_scan+0x139/0x190
shrink_slab+0x262/0x5b0
shrink_node+0x2eb/0x2f0
kswapd+0x2eb/0x890
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
stack backtrace:
CPU: 1 PID: 50 Comm: kswapd0 Tainted: G W 4.12.14-kvmsmall #8 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x78/0xb7
print_irq_inversion_bug.part.38+0x19f/0x1aa
check_usage_forwards+0x102/0x120
? ret_from_fork+0x3a/0x50
? check_usage_backwards+0x110/0x110
mark_lock+0x16c/0x270
__lock_acquire+0x264/0x11c0
? pagevec_lookup_entries+0x1a/0x30
? truncate_inode_pages_range+0x2b3/0x7f0
lock_acquire+0xbd/0x1e0
? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
__mutex_lock+0x4e/0x8c0
? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
? btrfs_evict_inode+0x1f6/0x6a0 [btrfs]
__btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
btrfs_evict_inode+0x22c/0x6a0 [btrfs]
evict+0xc4/0x190
dispose_list+0x35/0x50
prune_icache_sb+0x42/0x50
super_cache_scan+0x139/0x190
shrink_slab+0x262/0x5b0
shrink_node+0x2eb/0x2f0
kswapd+0x2eb/0x890
kthread+0x102/0x140
? mem_cgroup_shrink_node+0x2c0/0x2c0
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x3a/0x50
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8434ec46c6e3232cebc25a910363b29f5c617820 ]
When logging an inode, at tree-log.c:copy_items(), if we call
btrfs_next_leaf() at the loop which checks for the need to log holes, we
need to make sure copy_items() returns the value 1 to its caller and
not 0 (on success). This is because the path the caller passed was
released and is now different from what is was before, and the caller
expects a return value of 0 to mean both success and that the path
has not changed, while a return value of 1 means both success and
signals the caller that it can not reuse the path, it has to perform
another tree search.
Even though this is a case that should not be triggered on normal
circumstances or very rare at least, its consequences can be very
unpredictable (especially when replaying a log tree).
Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3c0efdf03b2d127f0e40e30db4e7aa0429b1b79a ]
The extent tree of the test fs is like the following:
BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
item 0 key (4096 168 4096) itemoff 3944 itemsize 51
extent refs 1 gen 1 flags 2
tree block key (68719476736 0 0) level 1
^^^^^^^
ref#0: tree block backref root 5
And it's using an empty tree for fs tree, so there is no way that its
level can be 1.
For REAL (created by mkfs) fs tree backref with no skinny metadata, the
result should look like:
item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
refs 1 gen 4 flags TREE_BLOCK
tree block key (256 INODE_ITEM 0) level 0
^^^^^^^
tree block backref root 5
Fix the level to 0, so it won't break later tree level checker.
Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 2c98425720233ae3e135add0c7e869b32913502f ]
If the fscache asynchronous write operation elects to discard a page that's
pending storage to the cache because the page would be over the store limit
then it needs to wake the page as someone may be waiting on completion of
the write.
The problem is that the store limit may be updated by a different
asynchronous operation - and so may miss the write - and that the store
limit may not even get updated until later by the netfs.
Fix the kernel hang by making fscache_write_op() mark as written any pages
that are over the limit.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit bb34f24c7d2c98d0c81838a7700e6068325b17a0 ]
We should not handle migrate lockres if we are already in
'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres remains after leaving
dlm domain. At last other nodes will get stuck into infinite loop when
requsting lock from us.
The problem is caused by concurrency umount between nodes. Before
receiveing N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as the
migrate target. So N2 will continue sending lockres to N1 even though
N1 has left domain.
N1 N2 (owner)
touch file
access the file,
and get pr lock
begin leave domain and
pick up N1 as new owner
begin leave domain and
migrate all lockres done
begin migrate lockres to N1
end leave domain, but
the lockres left
unexpectedly, because
migrate task has passed
[piaojun@huawei.com: v3]
Link: http://lkml.kernel.org/r/5A9CBD19.5020107@huawei.com
Link: http://lkml.kernel.org/r/5A99F028.2090902@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1e1c50a929bc9e49bc3f9935b92450d9e69f8158 ]
do_chunk_alloc implements a loop checking whether there is a pending
chunk allocation and if so causes the caller do loop. Generally this
loop is executed only once, however testing with btrfs/072 on a single
core vm machines uncovered an extreme case where the system could loop
indefinitely. This is due to a missing cond_resched when loop which
doesn't give a chance to the previous chunk allocator finish its job.
The fix is to simply add the missing cond_resched.
Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 80c0b4210a963e31529e15bf90519708ec947596 ]
0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
returned, path->nodes[0] could be NULL, log_dir_items lacks such a
check for <0 and we may run into a null pointer dereference panic.
Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b98def7ca6e152ee55e36863dddf6f41f12d1dc6 ]
If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
to bail out, otherwise @ret would be forced to be 0 after 'break;' and
the caller won't be aware of it.
Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8c81dd46ef3c416b3b95e3020fb90dbd44e6140b ]
Forcing the log to disk after reading the agf is wrong, we might be
calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held.
This can cause a deadlock when racing a fstrim with a filesystem
shutdown.
The deadlock has been identified due a miscalculation bug in device-mapper
dm-thin, which returns lack of space to its users earlier than the device itself
really runs out of space, changing the device-mapper volume into an error state.
The problem happened while filling the filesystem with a single file,
triggering the bug in device-mapper, consequently causing an IO error
and shutting down the filesystem.
If such file is removed, and fstrim executed before the XFS finishes the
shut down process, the fstrim process will end up holding the buffer
lock, and going to sleep on the cil wait queue.
At this point, the shut down process will try to wake up all the threads
waiting on the cil wait queue, but for this, it will try to hold the
same buffer log already held my the fstrim, locking up the filesystem.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit a0b0d1c345d0317efe594df268feb5ccc99f651e ]
proc_sys_link_fill_cache() does not take currently unregistering sysctl
tables into account, which might result into a page fault in
sysctl_follow_link() - add a check to fix it.
This bug has been present since v3.4.
Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d4dfc0f4d39475ccbbac947880b5464a74c30b99 ]
When doing an incremental send of a filesystem with the no-holes feature
enabled, we end up issuing a write operation when using the no data mode
send flag, instead of issuing an update extent operation. Fix this by
issuing the update extent operation instead.
Trivial reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdc
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdc /mnt/sdc
$ mount /dev/sdd /mnt/sdd
$ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1
$ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2
$ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
$ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
| btrfs receive -vv /mnt/sdd
Before this change the output of the second receive command is:
receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
utimes
write foobar, offset 8192, len 8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...
After this change it is:
receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
utimes
update_extent foobar: offset=8192, len=8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 18106734b512664a8541026519ce4b862498b6c3 ]
When failing from ceph_fs_debugfs_init() in ceph_real_mount(),
there is lack of dput of root_dentry and it causes slab errors,
so change the calling order of ceph_fs_debugfs_init() and
open_root_dentry() and do some cleanups to avoid this issue.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|