summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2019-07-16Merge tag 'for-5.3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Highlights: - chunks that have been trimmed and unchanged since last mount are tracked and skipped on repeated trims - use hw assissed crc32c on more arches, speedups if native instructions or optimized implementation is available - the RAID56 incompat bit is automatically removed when the last block group of that type is removed Fixes: - fsync fix for reflink on NODATACOW files that could lead to ENOSPC - fix data loss after inode eviction, renaming it, and fsync it - fix fsync not persisting dentry deletions due to inode evictions - update ctime/mtime/iversion after hole punching - fix compression type validation (reported by KASAN) - send won't be allowed to start when relocation is in progress, this can cause spurious errors or produce incorrect send stream Core: - new tracepoints for space update - tree-checker: better check for end of extents for some tree items - preparatory work for more checksum algorithms - run delayed iput at unlink time and don't push the work to cleaner thread where it's not properly throttled - wrap block mapping to structures and helpers, base for further refactoring - split large files, part 1: - space info handling - block group reservations - delayed refs - delayed allocation - other cleanups and refactoring" * tag 'for-5.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (103 commits) btrfs: fix memory leak of path on error return path btrfs: move the subvolume reservation stuff out of extent-tree.c btrfs: migrate the delalloc space stuff to it's own home btrfs: migrate btrfs_trans_release_chunk_metadata btrfs: migrate the delayed refs rsv code btrfs: Evaluate io_tree in find_lock_delalloc_range() btrfs: migrate the global_block_rsv helpers to block-rsv.c btrfs: migrate the block-rsv code to block-rsv.c btrfs: stop using block_rsv_release_bytes everywhere btrfs: cleanup the target logic in __btrfs_block_rsv_release btrfs: export __btrfs_block_rsv_release btrfs: export btrfs_block_rsv_add_bytes btrfs: move btrfs_block_rsv definitions into it's own header btrfs: Simplify update of space_info in __reserve_metadata_bytes() btrfs: unexport can_overcommit btrfs: move reserve_metadata_bytes and supporting code to space-info.c btrfs: move dump_space_info to space-info.c btrfs: export block_rsv_use_bytes btrfs: move btrfs_space_info_add_*_bytes to space-info.c btrfs: move the space info update macro to space-info.h ...
2019-07-15Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block updates from Jens Axboe: "A later pull request with some followup items. I had some vacation coming up to the merge window, so certain things items were delayed a bit. This pull request also contains fixes that came in within the last few days of the merge window, which I didn't want to push right before sending you a pull request. This contains: - NVMe pull request, mostly fixes, but also a few minor items on the feature side that were timing constrained (Christoph et al) - Report zones fixes (Damien) - Removal of dead code (Damien) - Turn on cgroup psi memstall (Josef) - block cgroup MAINTAINERS entry (Konstantin) - Flush init fix (Josef) - blk-throttle low iops timing fix (Konstantin) - nbd resize fixes (Mike) - nbd 0 blocksize crash fix (Xiubo) - block integrity error leak fix (Wenwen) - blk-cgroup writeback and priority inheritance fixes (Tejun)" * tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits) MAINTAINERS: add entry for block io cgroup null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED block: Limit zone array allocation size sd_zbc: Fix report zones buffer allocation block: Kill gfp_t argument of blkdev_report_zones() block: Allow mapping of vmalloc-ed buffers block/bio-integrity: fix a memory leak bug nvme: fix NULL deref for fabrics options nbd: add netlink reconfigure resize support nbd: fix crash when the blksize is zero block: Disable write plugging for zoned block devices block: Fix elevator name declaration block: Remove unused definitions nvme: fix regression upon hot device removal and insertion blk-throttle: fix zero wait time for iops throttled group block: Fix potential overflow in blk_report_zones() blkcg: implement REQ_CGROUP_PUNT blkcg, writeback: Implement wbc_blkcg_css() blkcg, writeback: Add wbc->no_cgroup_owner blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner() ...
2019-07-12Merge tag 'vfs-fix-ioctl-checking-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong: "Here's a patch series that sets up common parameter checking functions for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations. The goal here is to reduce the amount of behaviorial variance between the filesystems where those ioctls originated (ext2 and XFS, respectively) and everybody else. - Standardize parameter checking for the SETFLAGS and FSSETXATTR ioctls (which were the file attribute setters for ext4 and xfs and have now been hoisted to the vfs) - Only allow the DAX flag to be set on files and directories" * tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: vfs: only allow FSSETXATTR to set DAX flag on files and dirs vfs: teach vfs_ioc_fssetxattr_check to check extent size hints vfs: teach vfs_ioc_fssetxattr_check to check project id info vfs: create a generic checking function for FS_IOC_FSSETXATTR vfs: create a generic checking and prep function for FS_IOC_SETFLAGS
2019-07-12Merge tag 'driver-core-5.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core and debugfs updates from Greg KH: "Here is the "big" driver core and debugfs changes for 5.3-rc1 It's a lot of different patches, all across the tree due to some api changes and lots of debugfs cleanups. Other than the debugfs cleanups, in this set of changes we have: - bus iteration function cleanups - scripts/get_abi.pl tool to display and parse Documentation/ABI entries in a simple way - cleanups to Documenatation/ABI/ entries to make them parse easier due to typos and other minor things - default_attrs use for some ktype users - driver model documentation file conversions to .rst - compressed firmware file loading - deferred probe fixes All of these have been in linux-next for a while, with a bunch of merge issues that Stephen has been patient with me for" * tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits) debugfs: make error message a bit more verbose orangefs: fix build warning from debugfs cleanup patch ubifs: fix build warning after debugfs cleanup patch driver: core: Allow subsystems to continue deferring probe drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT arch_topology: Remove error messages on out-of-memory conditions lib: notifier-error-inject: no need to check return value of debugfs_create functions swiotlb: no need to check return value of debugfs_create functions ceph: no need to check return value of debugfs_create functions sunrpc: no need to check return value of debugfs_create functions ubifs: no need to check return value of debugfs_create functions orangefs: no need to check return value of debugfs_create functions nfsd: no need to check return value of debugfs_create functions lib: 842: no need to check return value of debugfs_create functions debugfs: provide pr_fmt() macro debugfs: log errors when something goes wrong drivers: s390/cio: Fix compilation warning about const qualifiers drivers: Add generic helper to match by of_node driver_find_device: Unify the match function with class_find_device() bus_find_device: Unify the match callback with class_find_device ...
2019-07-10Merge tag 'fsnotify_for_v5.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "This contains cleanups of the fsnotify name removal hook and also a patch to disable fanotify permission events for 'proc' filesystem" * tag 'fsnotify_for_v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: get rid of fsnotify_nameremove() fsnotify: move fsnotify_nameremove() hook out of d_delete() configfs: call fsnotify_rmdir() hook debugfs: call fsnotify_{unlink,rmdir}() hooks debugfs: simplify __debugfs_remove_file() devpts: call fsnotify_unlink() hook tracefs: call fsnotify_{unlink,rmdir}() hooks rpc_pipefs: call fsnotify_{unlink,rmdir}() hooks btrfs: call fsnotify_rmdir() hook fsnotify: add empty fsnotify_{unlink,rmdir}() hooks fanotify: Disallow permission events for proc filesystem
2019-07-10blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()Tejun Heo
wbc_account_io() does a very specific job - try to see which cgroup is actually dirtying an inode and transfer its ownership to the majority dirtier if needed. The name is too generic and confusing. Let's rename it to something more specific. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-05btrfs: fix memory leak of path on error return pathColin Ian King
Currently if the allocation of roots or tmp_ulist fails the error handling does not free up the allocation of path causing a memory leak. Fix this and other similar leaks by moving the call of btrfs_free_path from label out to label out_free_ulist. Kudos to David Sterba for spotting the issue in my original fix and suggesting the correct way to fix the leak and Anand Jain for spotting a double free issue. Addresses-Coverity: ("Resource leak") Fixes: 5911c8fe05c5 ("btrfs: fiemap: preallocate ulists for btrfs_check_shared") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: move the subvolume reservation stuff out of extent-tree.cJosef Bacik
This is just two functions, put it in root-tree.c since it involves root items. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: migrate the delalloc space stuff to it's own homeJosef Bacik
We have code for data and metadata reservations for delalloc. There's quite a bit of code here, and it's used in a lot of places so I've separated it out to it's own file. inode.c and file.c are already pretty large, and this code is complicated enough to live in its own space. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: migrate btrfs_trans_release_chunk_metadataJosef Bacik
Move this into transaction.c with the rest of the transaction related code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: migrate the delayed refs rsv codeJosef Bacik
These belong with the delayed refs related code, not in extent-tree.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: Evaluate io_tree in find_lock_delalloc_range()Goldwyn Rodrigues
Simplification. No point passing the tree variable when it can be evaluated from inode. The tests now use the io_tree from btrfs_inode as opposed to creating one. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: migrate the global_block_rsv helpers to block-rsv.cJosef Bacik
These helpers belong in block-rsv.c Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: migrate the block-rsv code to block-rsv.cJosef Bacik
This moves everything out of extent-tree.c to block-rsv.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: stop using block_rsv_release_bytes everywhereJosef Bacik
block_rsv_release_bytes() is the internal to the block_rsv code, and shouldn't be called directly by anything else. Switch all users to the exported helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: cleanup the target logic in __btrfs_block_rsv_releaseJosef Bacik
This works for all callers already, but if we wanted to use the helper for the global_block_rsv it would end up trying to refill itself. Fix the logic to be able to be used no matter which block rsv is passed in to this helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: export __btrfs_block_rsv_releaseJosef Bacik
The delalloc reserve stuff calls this directly because it cares about the qgroup accounting stuff, so export it so we can move it around. Fix btrfs_block_rsv_release() to just be a static inline since it just calls __btrfs_block_rsv_release() with NULL for the qgroup stuff. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: export btrfs_block_rsv_add_bytesJosef Bacik
This is used in a few places, we need to make sure it's exported so we can move it around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: move btrfs_block_rsv definitions into it's own headerJosef Bacik
Prep work for separating out all of the block_rsv related code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: Simplify update of space_info in __reserve_metadata_bytes()Goldwyn Rodrigues
We don't need an if-else-if chain where we can use a simple OR since both conditions are performing the same action. The short-circuit for OR will ensure that if the first condition is true, can_overcommit() is not called. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: unexport can_overcommitJosef Bacik
Now that we've moved all of the users to space-info.c, unexport it and name it back to can_overcommit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: move reserve_metadata_bytes and supporting code to space-info.cJosef Bacik
This moves all of the metadata reservation code into space-info.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: move dump_space_info to space-info.cJosef Bacik
We'll need this exported so we can use it in all the various was we need to use it. This is prep work to move reserve_metadata_bytes. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: export block_rsv_use_bytesJosef Bacik
We are going to need this to move the metadata reservation stuff to space_info.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: move btrfs_space_info_add_*_bytes to space-info.cJosef Bacik
Now that we've moved all the pre-requisite stuff, move these two functions. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: move the space info update macro to space-info.hJosef Bacik
Also rename it to btrfs_space_info_update_* so it's clear what we're updating. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: move and export can_overcommitJosef Bacik
This is the first piece of moving the space reservation code to space-info.c Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: move the space_info handling code to space-info.cJosef Bacik
These are the basic init and lookup functions and some helper functions, fairly straightforward before the bad stuff starts. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: export space_info_add_*_bytesJosef Bacik
Prep work for consolidating all of the space_info code into one file. We need to export these so multiple files can use them. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: rename do_chunk_alloc to btrfs_chunk_allocJosef Bacik
Really we just need the enum, but as we break more things up it'll help to have this external to extent-tree.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: move space_info to space-info.hJosef Bacik
Migrate the struct definition and the one helper that's in ctree.h into space-info.h Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: lift bio_set_dev from bio allocation helpersDavid Sterba
The block device is passed around for the only purpose to set it in new bios. Move the assignment one level up. This is a preparatory patch for further bdev cleanups. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: use raid_attr for minimum stripe count in btrfs_calc_avail_data_spaceDavid Sterba
Minimum stripe count matches the minimum devices required for a given profile. The open coded assignments match the raid_attr table. What's changed here is the meaning for RAID5/6. Previously their min_stripes would be 1, while newly it's devs_min. This however shold be the same as before because it's not possible to create filesystem on fewer devices than the raid_attr table allows. There's no adjustment regarding the parity stripes (like calc_data_stripes does), because we're interested in overall space that would fit on the devices. Missing devices make no difference for the whole calculation, we have the size stored in the structures. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: use raid_attr to adjust minimal stripe size in ↵David Sterba
btrfs_calc_avail_data_space Special case for DUP can be replaced by lookup to the attribute table, where the dev_stripes is the right coefficient. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: drop default value assignments in enumsDavid Sterba
A few more instances whre we don't need to specify the values as long as they are the same that enum assigns automatically. All of the enums are in-memory only and nothing relies on the exact values. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: use common helpers for extent IO state insertion messagesDavid Sterba
Print the error messages using the helpers that also print the filesystem identification. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: run delayed iput at unlink timeJosef Bacik
We have been seeing issues in production where a cleaner script will end up unlinking a bunch of files that have pending iputs. This means they will get their final iput's run at btrfs-cleaner time and thus are not throttled, which impacts the workload. Since we are unlinking these files we can just drop the delayed iput at unlink time. We are already holding a reference to the inode so this will not be the final iput and thus is completely safe to do at this point. Doing this means we are more likely to be doing the final iput at unlink time, and thus will get the IO charged to the caller and get throttled appropriately without affecting the main workload. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02Btrfs: add missing inode version, ctime and mtime updates when punching holeFilipe Manana
If the range for which we are punching a hole covers only part of a page, we end up updating the inode item but we skip the update of the inode's iversion, mtime and ctime. Fix that by ensuring we update those properties of the inode. A patch for fstests test case generic/059 that tests this as been sent along with this fix. Fixes: 2aaa66558172b0 ("Btrfs: add hole punching") Fixes: e8c1c76e804b18 ("Btrfs: add missing inode update when punching hole") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02Btrfs: fix fsync not persisting dentry deletions due to inode evictionsFilipe Manana
In order to avoid searches on a log tree when unlinking an inode, we check if the inode being unlinked was logged in the current transaction, as well as the inode of its parent directory. When any of the inodes are logged, we proceed to delete directory items and inode reference items from the log, to ensure that if a subsequent fsync of only the inode being unlinked or only of the parent directory when the other is not fsync'ed as well, does not result in the entry still existing after a power failure. That check however is not reliable when one of the inodes involved (the one being unlinked or its parent directory's inode) is evicted, since the logged_trans field is transient, that is, it is not stored on disk, so it is lost when the inode is evicted and loaded into memory again (which is set to zero on load). As a consequence the checks currently being done by btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always return true if the inode was evicted before, regardless of the inode having been logged or not before (and in the current transaction), this results in the dentry being unlinked still existing after a log replay if after the unlink operation only one of the inodes involved is fsync'ed. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/foo $ xfs_io -c fsync /mnt/dir/foo # Keep an open file descriptor on our directory while we evict inodes. # We just want to evict the file's inode, the directory's inode must not # be evicted. $ ( cd /mnt/dir; while true; do :; done ) & $ pid=$! # Wait a bit to give time to background process to chdir to our test # directory. $ sleep 0.5 # Trigger eviction of the file's inode. $ echo 2 > /proc/sys/vm/drop_caches # Unlink our file and fsync the parent directory. After a power failure # we don't expect to see the file anymore, since we fsync'ed the parent # directory. $ rm -f $SCRATCH_MNT/dir/foo $ xfs_io -c fsync /mnt/dir <power failure> $ mount /dev/sdb /mnt $ ls /mnt/dir foo $ --> file still there, unlink not persisted despite explicit fsync on dir Fix this by checking if the inode has the full_sync bit set in its runtime flags as well, since that bit is set everytime an inode is loaded from disk, or for other less common cases such as after a shrinking truncate or failure to allocate extent maps for holes, and gets cleared after the first fsync. Also consider the inode as possibly logged only if it was last modified in the current transaction (besides having the full_fsync flag set). Fixes: 3a5f1d458ad161 ("Btrfs: Optimize btree walking while logging inodes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: Use btrfs_get_io_geometry appropriatelyNikolay Borisov
Presently btrfs_map_block is used not only to do everything necessary to map a bio to the underlying allocation profile but it's also used to identify how much data could be written based on btrfs' stripe logic without actually submitting anything. This is achieved by passing NULL for 'bbio_ret' parameter. This patch refactors all callers that require just the mapping length by switching them to using btrfs_io_geometry instead of calling btrfs_map_block with a special NULL value for 'bbio_ret'. No functional change. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: Introduce btrfs_io_geometry infrastructureNikolay Borisov
Add a structure that holds various parameters for IO calculations and a helper that fills the values. This will help further refactoring and reduction of functions that in some way open-coded the calculations. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: improve messages when updating feature flagsDavid Sterba
Currently the messages printed after setting an incompat feature are cryptis, we can easily make it better as the textual description is passed to the helpers. Old: setting 128 feature flag updated: setting incompat feature flag for RAID56 (0x80) Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: shut up bogus -Wmaybe-uninitialized warningArnd Bergmann
gcc sometimes can't determine whether a variable has been initialized when both the initialization and the use are conditional: fs/btrfs/props.c: In function 'inherit_props': fs/btrfs/props.c:389:4: error: 'num_bytes' may be used uninitialized in this function [-Werror=maybe-uninitialized] btrfs_block_rsv_release(fs_info, trans->block_rsv, This code is fine. Unfortunately, I cannot think of a good way to rephrase it in a way that makes gcc understand this, so I add a bogus initialization the way one should not. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> [ gcc 8 and 9 don't emit the warning ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02Btrfs: prevent send failures and crashes due to concurrent relocationFilipe Manana
Send always operates on read-only trees and always expected that while it is in progress, nothing changes in those trees. Due to that expectation and the fact that send is a read-only operation, it operates on commit roots and does not hold transaction handles. However relocation can COW nodes and leafs from read-only trees, which can cause unexpected failures and crashes (hitting BUG_ONs). while send using a node/leaf, it gets COWed, the transaction used to COW it is committed, a new transaction starts, the extent previously used for that node/leaf gets allocated, possibly for another tree, and the respective extent buffer' content changes while send is still using it. When this happens send normally fails with EIO being returned to user space and messages like the following are found in dmesg/syslog: [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253 [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984 Other times, less often, we hit a BUG_ON() because an extent buffer that send is using used to be a node, and while send is still using it, it got COWed and got reused as a leaf while send is still using, producing the following trace: [ 3478.466280] ------------[ cut here ]------------ [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806! [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1 [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246 [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002 [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000 [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708 [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000 [ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0 [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3478.479003] Call Trace: [ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0 [ 3478.480202] tree_advance+0x173/0x1d0 [btrfs] [ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs] [ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs] [ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs] [ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs] [ 3478.483581] ? rq_clock_task+0x2e/0x60 [ 3478.484086] ? wake_up_new_task+0x1f3/0x370 [ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0 [ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 3478.485552] do_vfs_ioctl+0xa2/0x6f0 [ 3478.486016] ? __fget+0x113/0x200 [ 3478.486467] ksys_ioctl+0x70/0x80 [ 3478.486911] __x64_sys_ioctl+0x16/0x20 [ 3478.487337] do_syscall_64+0x60/0x1b0 [ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7 (...) [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7 [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005 [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700 [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005 [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001 (...) [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]--- Another possibility, much less likely to happen, is that send will not fail but the contents of the stream it produces may not be correct. To avoid this, do not allow send and relocation (balance) to run in parallel. In the long term the goal is to allow for both to be able to run concurrently without any problems, but that will take a significant effort in development and testing. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: document BTRFS_MAX_MIRRORSDavid Sterba
The real meaning of that constant is not clear from the context due to the target device inclusion. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: use mask for RAID56 profilesDavid Sterba
We don't need to enumerate the profiles, use the mask for consistency. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: add mask for all RAID1 typesDavid Sterba
Preparatory patch for additional RAID1 profiles with more copies. The mask will contain 3-copy and 4-copy, most of the checks for plain RAID1 work the same for the other profiles. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit()Qu Wenruo
[BUG] Lockdep will report the following circular locking dependency: WARNING: possible circular locking dependency detected 5.2.0-rc2-custom #24 Tainted: G O ------------------------------------------------------ btrfs/8631 is trying to acquire lock: 000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs] but task is already holding lock: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&fs_info->tree_log_mutex){+.+.}: __mutex_lock+0x76/0x940 mutex_lock_nested+0x1b/0x20 btrfs_commit_transaction+0x475/0xa00 [btrfs] btrfs_commit_super+0x71/0x80 [btrfs] close_ctree+0x2bd/0x320 [btrfs] btrfs_put_super+0x15/0x20 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x16/0xa0 [btrfs] deactivate_locked_super+0x3a/0x80 deactivate_super+0x51/0x60 cleanup_mnt+0x3f/0x80 __cleanup_mnt+0x12/0x20 task_work_run+0x94/0xb0 exit_to_usermode_loop+0xd8/0xe0 do_syscall_64+0x210/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #1 (&fs_info->reloc_mutex){+.+.}: __mutex_lock+0x76/0x940 mutex_lock_nested+0x1b/0x20 btrfs_commit_transaction+0x40d/0xa00 [btrfs] btrfs_quota_enable+0x2da/0x730 [btrfs] btrfs_ioctl+0x2691/0x2b40 [btrfs] do_vfs_ioctl+0xa9/0x6d0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x65/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}: lock_acquire+0xa7/0x190 __mutex_lock+0x76/0x940 mutex_lock_nested+0x1b/0x20 btrfs_qgroup_inherit+0x40/0x620 [btrfs] create_pending_snapshot+0x9d7/0xe60 [btrfs] create_pending_snapshots+0x94/0xb0 [btrfs] btrfs_commit_transaction+0x415/0xa00 [btrfs] btrfs_mksubvol+0x496/0x4e0 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0xa90/0x2b40 [btrfs] do_vfs_ioctl+0xa9/0x6d0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x65/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: &fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->tree_log_mutex); lock(&fs_info->reloc_mutex); lock(&fs_info->tree_log_mutex); lock(&fs_info->qgroup_ioctl_lock#2); *** DEADLOCK *** 6 locks held by btrfs/8631: #0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60 #1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs] #2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs] #3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs] #4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs] #5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs] [CAUSE] Due to the delayed subvolume creation, we need to call btrfs_qgroup_inherit() inside commit transaction code, with a lot of other mutex hold. This hell of lock chain can lead to above problem. [FIX] On the other hand, we don't really need to hold qgroup_ioctl_lock if we're in the context of create_pending_snapshot(). As in that context, we're the only one being able to modify qgroup. All other qgroup functions which needs qgroup_ioctl_lock are either holding a transaction handle, or will start a new transaction: Functions will start a new transaction(): * btrfs_quota_enable() * btrfs_quota_disable() Functions hold a transaction handler: * btrfs_add_qgroup_relation() * btrfs_del_qgroup_relation() * btrfs_create_qgroup() * btrfs_remove_qgroup() * btrfs_limit_qgroup() * btrfs_qgroup_inherit() call inside create_subvol() So we have a higher level protection provided by transaction, thus we don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit(). Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in create_pending_snapshot() is already protected by transaction. So the fix is to detect the context by checking trans->transaction->state. If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction context and no need to get the mutex. Reported-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: correctly validate compression typeJohannes Thumshirn
Nikolay reported the following KASAN splat when running btrfs/048: [ 1843.470920] ================================================================== [ 1843.471971] BUG: KASAN: slab-out-of-bounds in strncmp+0x66/0xb0 [ 1843.472775] Read of size 1 at addr ffff888111e369e2 by task btrfs/3979 [ 1843.473904] CPU: 3 PID: 3979 Comm: btrfs Not tainted 5.2.0-rc3-default #536 [ 1843.475009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 1843.476322] Call Trace: [ 1843.476674] dump_stack+0x7c/0xbb [ 1843.477132] ? strncmp+0x66/0xb0 [ 1843.477587] print_address_description+0x114/0x320 [ 1843.478256] ? strncmp+0x66/0xb0 [ 1843.478740] ? strncmp+0x66/0xb0 [ 1843.479185] __kasan_report+0x14e/0x192 [ 1843.479759] ? strncmp+0x66/0xb0 [ 1843.480209] kasan_report+0xe/0x20 [ 1843.480679] strncmp+0x66/0xb0 [ 1843.481105] prop_compression_validate+0x24/0x70 [ 1843.481798] btrfs_xattr_handler_set_prop+0x65/0x160 [ 1843.482509] __vfs_setxattr+0x71/0x90 [ 1843.483012] __vfs_setxattr_noperm+0x84/0x130 [ 1843.483606] vfs_setxattr+0xac/0xb0 [ 1843.484085] setxattr+0x18c/0x230 [ 1843.484546] ? vfs_setxattr+0xb0/0xb0 [ 1843.485048] ? __mod_node_page_state+0x1f/0xa0 [ 1843.485672] ? _raw_spin_unlock+0x24/0x40 [ 1843.486233] ? __handle_mm_fault+0x988/0x1290 [ 1843.486823] ? lock_acquire+0xb4/0x1e0 [ 1843.487330] ? lock_acquire+0xb4/0x1e0 [ 1843.487842] ? mnt_want_write_file+0x3c/0x80 [ 1843.488442] ? debug_lockdep_rcu_enabled+0x22/0x40 [ 1843.489089] ? rcu_sync_lockdep_assert+0xe/0x70 [ 1843.489707] ? __sb_start_write+0x158/0x200 [ 1843.490278] ? mnt_want_write_file+0x3c/0x80 [ 1843.490855] ? __mnt_want_write+0x98/0xe0 [ 1843.491397] __x64_sys_fsetxattr+0xba/0xe0 [ 1843.492201] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 1843.493201] do_syscall_64+0x6c/0x230 [ 1843.493988] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1843.495041] RIP: 0033:0x7fa7a8a7707a [ 1843.495819] Code: 48 8b 0d 21 de 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 be 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee dd 2b 00 f7 d8 64 89 01 48 [ 1843.499203] RSP: 002b:00007ffcb73bca38 EFLAGS: 00000202 ORIG_RAX: 00000000000000be [ 1843.500210] RAX: ffffffffffffffda RBX: 00007ffcb73bda9d RCX: 00007fa7a8a7707a [ 1843.501170] RDX: 00007ffcb73bda9d RSI: 00000000006dc050 RDI: 0000000000000003 [ 1843.502152] RBP: 00000000006dc050 R08: 0000000000000000 R09: 0000000000000000 [ 1843.503109] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffcb73bda91 [ 1843.504055] R13: 0000000000000003 R14: 00007ffcb73bda82 R15: ffffffffffffffff [ 1843.505268] Allocated by task 3979: [ 1843.505771] save_stack+0x19/0x80 [ 1843.506211] __kasan_kmalloc.constprop.5+0xa0/0xd0 [ 1843.506836] setxattr+0xeb/0x230 [ 1843.507264] __x64_sys_fsetxattr+0xba/0xe0 [ 1843.507886] do_syscall_64+0x6c/0x230 [ 1843.508429] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1843.509558] Freed by task 0: [ 1843.510188] (stack is not available) [ 1843.511309] The buggy address belongs to the object at ffff888111e369e0 which belongs to the cache kmalloc-8 of size 8 [ 1843.514095] The buggy address is located 2 bytes inside of 8-byte region [ffff888111e369e0, ffff888111e369e8) [ 1843.516524] The buggy address belongs to the page: [ 1843.517561] page:ffff88813f478d80 refcount:1 mapcount:0 mapping:ffff88811940c300 index:0xffff888111e373b8 compound_mapcount: 0 [ 1843.519993] flags: 0x4404000010200(slab|head) [ 1843.520951] raw: 0004404000010200 ffff88813f48b008 ffff888119403d50 ffff88811940c300 [ 1843.522616] raw: ffff888111e373b8 000000000016000f 00000001ffffffff 0000000000000000 [ 1843.524281] page dumped because: kasan: bad access detected [ 1843.525936] Memory state around the buggy address: [ 1843.526975] ffff888111e36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 1843.528479] ffff888111e36900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 1843.530138] >ffff888111e36980: fc fc fc fc fc fc fc fc fc fc fc fc 02 fc fc fc [ 1843.531877] ^ [ 1843.533287] ffff888111e36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 1843.534874] ffff888111e36a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 1843.536468] ================================================================== This is caused by supplying a too short compression value ('lz') in the test-case and comparing it to 'lzo' with strncmp() and a length of 3. strncmp() read past the 'lz' when looking for the 'o' and thus caused an out-of-bounds read. Introduce a new check 'btrfs_compress_is_valid_type()' which not only checks the user-supplied value against known compression types, but also employs checks for too short values. Reported-by: Nikolay Borisov <nborisov@suse.com> Fixes: 272e5326c783 ("btrfs: prop: fix vanished compression property after failed set") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02Btrfs: fix data loss after inode eviction, renaming it, and fsync itFilipe Manana
When we log an inode, regardless of logging it completely or only that it exists, we always update it as logged (logged_trans and last_log_commit fields of the inode are updated). This is generally fine and avoids future attempts to log it from having to do repeated work that brings no value. However, if we write data to a file, then evict its inode after all the dealloc was flushed (and ordered extents completed), rename the file and fsync it, we end up not logging the new extents, since the rename may result in logging that the inode exists in case the parent directory was logged before. The following reproducer shows and explains how this can happen: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/foo $ touch /mnt/dir/bar # Do a direct IO write instead of a buffered write because with a # buffered write we would need to make sure dealloc gets flushed and # complete before we do the inode eviction later, and we can not do that # from user space with call to things such as sync(2) since that results # in a transaction commit as well. $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar # Keep the directory dir in use while we evict inodes. We want our file # bar's inode to be evicted but we don't want our directory's inode to # be evicted (if it were evicted too, we would not be able to reproduce # the issue since the first fsync below, of file foo, would result in a # transaction commit. $ ( cd /mnt/dir; while true; do :; done ) & $ pid=$! # Wait a bit to give time for the background process to chdir. $ sleep 0.1 # Evict all inodes, except the inode for the directory dir because it is # currently in use by our background process. $ echo 2 > /proc/sys/vm/drop_caches # fsync file foo, which ends up persisting information about the parent # directory because it is a new inode. $ xfs_io -c fsync /mnt/dir/foo # Rename bar, this results in logging that this inode exists (inode item, # names, xattrs) because the parent directory is in the log. $ mv /mnt/dir/bar /mnt/dir/baz # Now fsync baz, which ends up doing absolutely nothing because of the # rename operation which logged that the inode exists only. $ xfs_io -c fsync /mnt/dir/baz <power failure> $ mount /dev/sdb /mnt $ od -t x1 -A d /mnt/dir/baz 0000000 --> Empty file, data we wrote is missing. Fix this by not updating last_sub_trans of an inode when we are logging only that it exists and the inode was not yet logged since it was loaded from disk (full_sync bit set), this is enough to make btrfs_inode_in_log() return false for this scenario and make us log the inode. The logged_trans of the inode is still always setsince that alone is used to track if names need to be deleted as part of unlink operations. Fixes: 257c62e1bce03e ("Btrfs: avoid tree log commit when there are no changes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>