path: root/fs
Age  Commit message  Author
2025-11-05  coredump: use override credential guard  (Christian Brauner)
Use override credential guards for scoped credential override with automatic restoration on scope exit. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-10-b447b82f2c9b@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
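The converted call sites are not shown in this log view. As a rough illustration only, a scoped credential override can be built on the kernel's linux/cleanup.h machinery roughly as below; the class name cred_override and the helper function are assumptions for illustration, not the guard actually added by this series:

    #include <linux/cleanup.h>
    #include <linux/cred.h>

    /* Illustrative class: override_creds() returns the previously active
     * credentials, which revert_creds() restores when the guard variable
     * goes out of scope. */
    DEFINE_CLASS(cred_override, const struct cred *,
                 revert_creds(_T), override_creds(new), const struct cred *new)

    static void write_core_file(const struct cred *coredump_cred)
    {
        guard(cred_override)(coredump_cred);
        /* ... dump with the elevated credentials ... */
    }   /* previous credentials restored automatically here */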
2025-11-05  coredump: use prepare credential guard  (Christian Brauner)
Use the prepare credential guard for allocating a new set of credentials. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-9-b447b82f2c9b@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  coredump: split out do_coredump() from vfs_coredump()  (Christian Brauner)
Make the function easier to follow and prepare for some of the following changes. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-8-b447b82f2c9b@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  coredump: mark struct mm_struct as const  (Christian Brauner)
We don't actually modify it. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-7-b447b82f2c9b@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  coredump: pass struct linux_binfmt as const  (Christian Brauner)
We don't actually modify it. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-6-b447b82f2c9b@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  coredump: move revert_cred() before coredump_cleanup()  (Christian Brauner)
There's no need to pin the credentials across the coredump_cleanup() call. Nothing in there depends on elevated credentials. Link: https://patch.msgid.link/20251103-work-creds-guards-prepare_creds-v1-5-b447b82f2c9b@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  xfs: use super write guard in xfs_file_ioctl()  (Christian Brauner)
Link: https://patch.msgid.link/20251104-work-guards-v1-8-5108ac78a171@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
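None of the "use super write guard" conversions in this series carries a body in this log. As a sketch of the pattern they rely on (the guard class name sb_write is an assumption, not the identifier used by the series), a cleanup.h guard pairing sb_start_write()/sb_end_write() looks roughly like:

    #include <linux/cleanup.h>
    #include <linux/fs.h>

    /* Illustrative guard: freeze protection is taken here and dropped on
     * every exit path of the scope. */
    DEFINE_GUARD(sb_write, struct super_block *,
                 sb_start_write(_T), sb_end_write(_T))

    static void touch_filesystem(struct super_block *sb)
    {
        guard(sb_write)(sb);    /* sb_end_write() runs automatically */
        /* ... modify the filesystem ... */
    }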
2025-11-05  open: use super write guard in do_ftruncate()  (Christian Brauner)
Link: https://patch.msgid.link/20251104-work-guards-v1-7-5108ac78a171@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  btrfs: use super write guard in relocating_repair_kthread()  (Christian Brauner)
Link: https://patch.msgid.link/20251104-work-guards-v1-6-5108ac78a171@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  ext4: use super write guard in write_mmp_block()  (Christian Brauner)
Link: https://patch.msgid.link/20251104-work-guards-v1-5-5108ac78a171@kernel.org Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  btrfs: use super write guard in sb_start_write()  (Christian Brauner)
Link: https://patch.msgid.link/20251104-work-guards-v1-4-5108ac78a171@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  btrfs: use super write guard in btrfs_run_defrag_inode()  (Christian Brauner)
Link: https://patch.msgid.link/20251104-work-guards-v1-3-5108ac78a171@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  btrfs: use super write guard in btrfs_reclaim_bgs_work()  (Christian Brauner)
Link: https://patch.msgid.link/20251104-work-guards-v1-2-5108ac78a171@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  fs: inline current_umask() and move it to fs_struct.h  (Mateusz Guzik)
There is no good reason to have this as a func call, other than avoiding the churn of adding fs_struct.h as needed. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251104170448.630414-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
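The existing out-of-line helper just reads the umask off current->fs, so the inlined version is expected to look roughly like the following sketch (placement in include/linux/fs_struct.h per the subject; exact headers and form may differ):

    #include <linux/sched.h>    /* for current */

    static inline int current_umask(void)
    {
        return current->fs->umask;
    }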
2025-11-05  btrfs: release root after error in data_reloc_print_warning_inode()  (Zilin Guan)
data_reloc_print_warning_inode() calls btrfs_get_fs_root() to obtain local_root, but fails to release its reference when paths_from_inode() returns an error. This causes a potential memory leak. Add a missing btrfs_put_root() call in the error path to properly decrease the reference count of local_root. Fixes: b9a9a85059cde ("btrfs: output affected files when relocation fails") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
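A simplified sketch of the corrected error handling (not the verbatim patch; variable names follow the description above):

    local_root = btrfs_get_fs_root(fs_info, root, true);
    if (IS_ERR(local_root)) {
        ret = PTR_ERR(local_root);
        goto err;
    }

    ret = paths_from_inode(inum, ipath);
    if (ret) {
        btrfs_put_root(local_root);    /* reference was previously leaked here */
        goto err;
    }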
2025-11-05  btrfs: scrub: put bio after errors in scrub_raid56_parity_stripe()  (Zilin Guan)
scrub_raid56_parity_stripe() allocates a bio with bio_alloc(), but fails to release it on some error paths, leading to a potential memory leak. Add the missing bio_put() calls to properly drop the bio reference in those error cases. Fixes: 1009254bf22a3 ("btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-05  btrfs: do not update last_log_commit when logging inode due to a new name  (Filipe Manana)
When logging that a new name exists, we skip updating the inode's last_log_commit field to prevent a later explicit fsync against the inode from doing nothing (as updating last_log_commit makes btrfs_inode_in_log() return true). We are detecting, at btrfs_log_inode(), that logging a new name is happening by checking that the logging mode is not LOG_INODE_EXISTS, but that is not enough because we may log parent directories when logging a new name of a file in LOG_INODE_ALL mode - we need to check the logging_new_name field of the log context too.

An example scenario where this results in an explicit fsync against a directory not persisting changes to the directory is the following:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ touch /mnt/foo
  $ sync
  $ mkdir /mnt/dir

  # Write some data to our file and fsync it.
  $ xfs_io -c "pwrite -S 0xab 0 64K" -c "fsync" /mnt/foo

  # Add a new link to our file. Since the file was logged before, we
  # update it in the log tree by calling btrfs_log_new_name().
  $ ln /mnt/foo /mnt/dir/bar

  # fsync the root directory - we expect it to persist the dentry for
  # the new directory "dir".
  $ xfs_io -c "fsync" /mnt

  <power fail>

After mounting the fs the entry for directory "dir" does not exist, despite the explicit fsync on the root directory. Here's why this happens:

1) When we fsync the file we log the inode, so that it's present in the log tree;

2) When adding the new link we enter btrfs_log_new_name(), and since the inode is in the log tree we proceed to updating the inode in the log tree;

3) We first set the inode's last_unlink_trans to the current transaction (early in btrfs_log_new_name());

4) We then eventually enter btrfs_log_inode_parent(), and after logging the file's inode, we call btrfs_log_all_parents() because the inode's last_unlink_trans matches the current transaction's ID (updated in the previous step);

5) So btrfs_log_all_parents() logs the root directory by calling btrfs_log_inode() for the root's inode with a log mode of LOG_INODE_ALL so that new dentries are logged;

6) At btrfs_log_inode(), because the log mode is LOG_INODE_ALL, we update the root inode's last_log_commit to the last transaction that changed the inode (->last_sub_trans field of the inode), which corresponds to the current transaction's ID;

7) Then later when user space explicitly calls fsync against the root directory, we enter btrfs_sync_file(), which calls skip_inode_logging() and that returns true, since its call to btrfs_inode_in_log() returns true and there are no ordered extents (it's a directory, never has ordered extents). This results in btrfs_sync_file() returning without syncing the log or committing the current transaction, so all the updates we did when logging the new name, including logging the root directory, are not persisted.

So fix this by only updating the inode's last_log_commit if we are sure we are not logging a new name (if ctx->logging_new_name is false).

A test case for fstests will follow soon.

Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/03c5d7ec-5b3d-49d1-95bc-8970a7f82d87@gmail.com/
Fixes: 130341be7ffa ("btrfs: always update the logged transaction when logging new names")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
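Conceptually the fix makes the last_log_commit update conditional on the log context as well; a rough sketch using the names from the description above (locking details are an assumption, not the verbatim patch):

    /*
     * Only update last_log_commit when this is not a new-name logging pass,
     * so a later explicit fsync of this inode still syncs the log.
     */
    if (inode_only != LOG_INODE_EXISTS && !ctx->logging_new_name) {
        spin_lock(&inode->lock);
        inode->last_log_commit = inode->last_sub_trans;
        spin_unlock(&inode->lock);
    }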
2025-11-05  btrfs: zoned: fix stripe width calculation  (Naohiro Aota)
The stripe offset calculation in the zoned code for raid0 and raid10 wrongly uses map->stripe_size to calculate it. In fact, map->stripe_size is the size of the device extent composing the block group, which always is the zone_size on the zoned setup. Fix it by using BTRFS_STRIPE_LEN and BTRFS_STRIPE_LEN_SHIFT. Also, optimize the calculation a bit by doing the common calculation only once. Fixes: c0d90a79e8e6 ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups") CC: stable@vger.kernel.org # 6.17+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-05  btrfs: zoned: fix conventional zone capacity calculation  (Naohiro Aota)
When a block group contains both a conventional zone and a sequential zone, the capacity of the block group is wrongly set to the block group's full length. The capacity should be calculated in btrfs_load_block_group_* using the last allocation offset. Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree") CC: stable@vger.kernel.org # v6.12+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-05  xfs: fix zone selection in xfs_select_open_zone_mru  (Christoph Hellwig)
xfs_select_open_zone_mru needs to pass XFS_ZONE_ALLOC_OK to xfs_try_use_zone because we only want to tightly pack into zones of the same or a compatible temperature instead of any available zone. This got broken in commit 0301dae732a5 ("xfs: refactor hint based zone allocation"), which failed to update this particular caller when switching to an enum. xfs/638 sometimes, but not reliably, fails due to this change. Fixes: 0301dae732a5 ("xfs: refactor hint based zone allocation") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05  xfs: fix a rtgroup leak when xfs_init_zone fails  (Christoph Hellwig)
Drop the rtgroup reference when xfs_init_zone fails for a conventional device. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05  xfs: fix various problems in xfs_atomic_write_cow_iomap_begin  (Darrick J. Wong)
I think there are several things wrong with this function:

A) xfs_bmapi_write can return a much larger unwritten mapping than what the caller asked for. We convert part of that range to written, but return the entire written mapping to iomap even though that's inaccurate.

B) The arguments to xfs_reflink_convert_cow_locked are wrong -- an unwritten mapping could be *smaller* than the write range (or even the hole range). In this case, we convert too much file range to written state because we then return a smaller mapping to iomap.

C) It doesn't handle delalloc mappings. This I covered in the patch that I already sent to the list.

D) Reassigning count_fsb to handle the hole means that if the second cmap lookup attempt succeeds (due to racing with someone else) we trim the mapping more than is strictly necessary. The changing meaning of count_fsb makes this harder to notice.

E) The tracepoint is kinda wrong because @length is mutated. That makes it harder to chase the data flows through this function because you can't just grep on the pos/bytecount strings.

F) We don't actually check that the br_state = XFS_EXT_NORM assignment is accurate, i.e. that the cow fork actually contains a written mapping for the range we're interested in.

G) Somewhat inadequate documentation of why we need to xfs_trim_extent so aggressively in this function.

H) Not sure why xfs_iomap_end_fsb is used here, the vfs already clamped the write range to s_maxbytes.

Fix these issues, and then the atomic writes regressions in generic/760, generic/617, generic/091, generic/263, and generic/521 all go away for me.

Cc: stable@vger.kernel.org # v6.16
Fixes: bd1d2c21d5d249 ("xfs: add xfs_atomic_write_cow_iomap_begin()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05  xfs: fix delalloc write failures in software-provided atomic writes  (Darrick J. Wong)
With the 20 Oct 2025 release of fstests, generic/521 fails for me on regular (aka non-block-atomic-writes) storage:

  QA output created by 521
  dowrite: write: Input/output error
  LOG DUMP (8553 total operations):
  1( 1 mod 256): SKIPPED (no operation)
  2( 2 mod 256): WRITE 0x7e000 thru 0x8dfff (0x10000 bytes) HOLE
  3( 3 mod 256): READ 0x69000 thru 0x79fff (0x11000 bytes)
  4( 4 mod 256): FALLOC 0x53c38 thru 0x5e853 (0xac1b bytes) INTERIOR
  5( 5 mod 256): COPY 0x55000 thru 0x59fff (0x5000 bytes) to 0x25000 thru 0x29fff
  6( 6 mod 256): WRITE 0x74000 thru 0x88fff (0x15000 bytes)
  7( 7 mod 256): ZERO 0xedb1 thru 0x11693 (0x28e3 bytes)

with a warning in dmesg from iomap about XFS trying to give it a delalloc mapping for a directio write. Fix the software atomic write iomap_begin code to convert the reservation into a written mapping. This doesn't fix the data corruption problems reported by generic/760, but it's a start.

Cc: stable@vger.kernel.org # v6.16
Fixes: bd1d2c21d5d249 ("xfs: add xfs_atomic_write_cow_iomap_begin()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05  xfs: use blkdev_report_zones_cached()  (Damien Le Moal)
Modify xfs_mount_zones() to replace the call to blkdev_report_zones() with blkdev_report_zones_cached() to speed-up mount operations. Since this causes xfs_zone_validate_seq() to see zones with the BLK_ZONE_COND_ACTIVE condition, this function is also modified to accept this condition as valid. With this change, mounting a freshly formatted large capacity (30 TB) SMR HDD completes in under 2s compared to over 4.7s before. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05  btrfs: use blkdev_report_zones_cached()  (Damien Le Moal)
Modify btrfs_get_dev_zones() and btrfs_sb_log_location_bdev() to replace the call to blkdev_report_zones() with blkdev_report_zones_cached() to speed-up mount operations. btrfs_get_dev_zone_info() is also modified to take into account the BLK_ZONE_COND_ACTIVE condition, which is equivalent to either BLK_ZONE_COND_IMP_OPEN, BLK_ZONE_COND_EXP_OPEN or BLK_ZONE_COND_CLOSED. With this change, mounting a freshly formatted large capacity (30 TB) SMR HDD completes under 100ms compared to over 1.8s before. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05  xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super  (Yongpeng Yang)
sb_min_blocksize() may return 0. Check its return value to avoid accessing the filesystem super block when sb->s_blocksize is 0. Cc: stable@vger.kernel.org # v6.15 Fixes: a64e5a596067bd ("bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Link: https://patch.msgid.link/20251104125009.2111925-5-yangyongpeng.storage@gmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  isofs: check the return value of sb_min_blocksize() in isofs_fill_super  (Yongpeng Yang)
sb_min_blocksize() may return 0. Check its return value to avoid opt->blocksize and sb->s_blocksize being set to 0. Cc: stable@vger.kernel.org # v6.15 Fixes: 1b17a46c9243e9 ("isofs: convert isofs to use the new mount API") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Link: https://patch.msgid.link/20251104125009.2111925-4-yangyongpeng.storage@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  exfat: check return value of sb_min_blocksize in exfat_read_boot_sector  (Yongpeng Yang)
sb_min_blocksize() may return 0. Check its return value to avoid accessing the filesystem super block when sb->s_blocksize is 0. Cc: stable@vger.kernel.org # v6.15 Fixes: 719c1e1829166d ("exfat: add super block operations") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Link: https://patch.msgid.link/20251104125009.2111925-3-yangyongpeng.storage@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  vfat: fix missing sb_min_blocksize() return value checks  (Yongpeng Yang)
When emulating an nvme device on qemu with both logical_block_size and physical_block_size set to 8 KiB, but without formatting the device, a kernel panic was triggered during the early boot stage while attempting to mount a vfat filesystem.

  [95553.682035] EXT4-fs (nvme0n1): unable to set blocksize
  [95553.684326] EXT4-fs (nvme0n1): unable to set blocksize
  [95553.686501] EXT4-fs (nvme0n1): unable to set blocksize
  [95553.696448] ISOFS: unsupported/invalid hardware sector size 8192
  [95553.697117] ------------[ cut here ]------------
  [95553.697567] kernel BUG at fs/buffer.c:1582!
  [95553.697984] Oops: invalid opcode: 0000 [#1] SMP NOPTI
  [95553.698602] CPU: 0 UID: 0 PID: 7212 Comm: mount Kdump: loaded Not tainted 6.18.0-rc2+ #38 PREEMPT(voluntary)
  [95553.699511] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
  [95553.700534] RIP: 0010:folio_alloc_buffers+0x1bb/0x1c0
  [95553.701018] Code: 48 8b 15 e8 93 18 02 65 48 89 35 e0 93 18 02 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc <0f> 0b 90 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
  [95553.702648] RSP: 0018:ffffd1b0c676f990 EFLAGS: 00010246
  [95553.703132] RAX: ffff8cfc4176d820 RBX: 0000000000508c48 RCX: 0000000000000001
  [95553.703805] RDX: 0000000000002000 RSI: 0000000000000000 RDI: 0000000000000000
  [95553.704481] RBP: ffffd1b0c676f9c8 R08: 0000000000000000 R09: 0000000000000000
  [95553.705148] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
  [95553.705816] R13: 0000000000002000 R14: fffff8bc8257e800 R15: 0000000000000000
  [95553.706483] FS: 000072ee77315840(0000) GS:ffff8cfdd2c8d000(0000) knlGS:0000000000000000
  [95553.707248] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [95553.707782] CR2: 00007d8f2a9e5a20 CR3: 0000000039d0c006 CR4: 0000000000772ef0
  [95553.708439] PKRU: 55555554
  [95553.708734] Call Trace:
  [95553.709015]  <TASK>
  [95553.709266]  __getblk_slow+0xd2/0x230
  [95553.709641]  ? find_get_block_common+0x8b/0x530
  [95553.710084]  bdev_getblk+0x77/0xa0
  [95553.710449]  __bread_gfp+0x22/0x140
  [95553.710810]  fat_fill_super+0x23a/0xfc0
  [95553.711216]  ? __pfx_setup+0x10/0x10
  [95553.711580]  ? __pfx_vfat_fill_super+0x10/0x10
  [95553.712014]  vfat_fill_super+0x15/0x30
  [95553.712401]  get_tree_bdev_flags+0x141/0x1e0
  [95553.712817]  get_tree_bdev+0x10/0x20
  [95553.713177]  vfat_get_tree+0x15/0x20
  [95553.713550]  vfs_get_tree+0x2a/0x100
  [95553.713910]  vfs_cmd_create+0x62/0xf0
  [95553.714273]  __do_sys_fsconfig+0x4e7/0x660
  [95553.714669]  __x64_sys_fsconfig+0x20/0x40
  [95553.715062]  x64_sys_call+0x21ee/0x26a0
  [95553.715453]  do_syscall_64+0x80/0x670
  [95553.715816]  ? __fs_parse+0x65/0x1e0
  [95553.716172]  ? fat_parse_param+0x103/0x4b0
  [95553.716587]  ? vfs_parse_fs_param_source+0x21/0xa0
  [95553.717034]  ? __do_sys_fsconfig+0x3d9/0x660
  [95553.717548]  ? __x64_sys_fsconfig+0x20/0x40
  [95553.717957]  ? x64_sys_call+0x21ee/0x26a0
  [95553.718360]  ? do_syscall_64+0xb8/0x670
  [95553.718734]  ? __x64_sys_fsconfig+0x20/0x40
  [95553.719141]  ? x64_sys_call+0x21ee/0x26a0
  [95553.719545]  ? do_syscall_64+0xb8/0x670
  [95553.719922]  ? x64_sys_call+0x1405/0x26a0
  [95553.720317]  ? do_syscall_64+0xb8/0x670
  [95553.720702]  ? __x64_sys_close+0x3e/0x90
  [95553.721080]  ? x64_sys_call+0x1b5e/0x26a0
  [95553.721478]  ? do_syscall_64+0xb8/0x670
  [95553.721841]  ? irqentry_exit+0x43/0x50
  [95553.722211]  ? exc_page_fault+0x90/0x1b0
  [95553.722681]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [95553.723166] RIP: 0033:0x72ee774f3afe
  [95553.723562] Code: 73 01 c3 48 8b 0d 0a 33 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 49 89 ca b8 af 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d da 32 0f 00 f7 d8 64 89 01 48
  [95553.725188] RSP: 002b:00007ffe97148978 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
  [95553.725892] RAX: ffffffffffffffda RBX: 00005dcfe53d0080 RCX: 000072ee774f3afe
  [95553.726526] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
  [95553.727176] RBP: 00007ffe97148ac0 R08: 0000000000000000 R09: 000072ee775e7ac0
  [95553.727818] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [95553.728459] R13: 00005dcfe53d04b0 R14: 000072ee77670b00 R15: 00005dcfe53d1a28
  [95553.729086]  </TASK>

The panic occurs as follows:

1. logical_block_size is 8 KiB, causing {struct super_block *sb}->s_blocksize to be initialized to 0.

     vfat_fill_super
     - fat_fill_super
       - sb_min_blocksize
         - sb_set_blocksize    // returns 0 when size is 8 KiB

2. __bread_gfp is called with size == 0, causing folio_alloc_buffers() to compute an offset equal to folio_size(folio), which triggers a BUG_ON.

     fat_fill_super
     - sb_bread
       - __bread_gfp           // size == {struct super_block *sb}->s_blocksize == 0
         - bdev_getblk
           - __getblk_slow
             - grow_buffers
               - grow_dev_folio
                 - folio_alloc_buffers    // size == 0
                   - folio_set_bh         // offset == folio_size(folio) and panic

To fix this issue, add proper return value checks for sb_min_blocksize().

Cc: stable@vger.kernel.org # v6.15
Fixes: a64e5a596067bd ("bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()")
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Link: https://patch.msgid.link/20251104125009.2111925-2-yangyongpeng.storage@gmail.com
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Christian Brauner <brauner@kernel.org>
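The shape of the fix is the same in all four filesystems; for fat_fill_super() it is roughly the following sketch (error label and message text are illustrative, not the exact patch):

    if (!sb_min_blocksize(sb, 512)) {
        fat_msg(sb, KERN_ERR, "unable to set blocksize");
        error = -EINVAL;
        goto out_fail;
    }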
2025-11-05  binfmt_misc: restore write access before closing files opened by open_exec()  (Zilin Guan)
bm_register_write() opens an executable file using open_exec(), which internally calls do_open_execat() and denies write access on the file to avoid modification while it is being executed. However, when an error occurs, bm_register_write() closes the file using filp_close() directly. This does not restore the write permission, which may cause subsequent write operations on the same file to fail. Fix this by calling exe_file_allow_write_access() before filp_close() to restore the write permission properly. Fixes: e7850f4d844e ("binfmt_misc: fix possible deadlock in bm_register_write") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Link: https://patch.msgid.link/20251105022923.1813587-1-zilin@seu.edu.cn Signed-off-by: Christian Brauner <brauner@kernel.org>
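A sketch of the corrected error path (surrounding code simplified; the local variable and label are illustrative):

    f = open_exec(e->interpreter);
    if (IS_ERR(f)) {
        ret = PTR_ERR(f);
        goto free_entry;
    }

    /* ... a later registration step fails ... */

    /* open_exec() denied write access to the file; restore it before closing */
    exe_file_allow_write_access(f);
    filp_close(f, NULL);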
2025-11-05  virtio-fs: fix incorrect check for fsvq->kobj  (Alok Tiwari)
In virtio_fs_add_queues_sysfs(), the code incorrectly checks fs->mqs_kobj after calling kobject_create_and_add(). Change the check to fsvq->kobj (fs->mqs_kobj -> fsvq->kobj) to ensure the per-queue kobject is successfully created. Fixes: 87cbdc396a31 ("virtio_fs: add sysfs entries for queue information") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Link: https://patch.msgid.link/20251027104658.1668537-1-alok.a.tiwari@oracle.com Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
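That is, the check becomes (sketch; the name buffer argument and error handling are illustrative):

    fsvq->kobj = kobject_create_and_add(buff, fs->mqs_kobj);
    if (!fsvq->kobj) {      /* previously tested fs->mqs_kobj by mistake */
        ret = -ENOMEM;
        goto out_del;
    }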
2025-11-05  Fix a drop_nlink warning in minix_rename  (Jori Koolstra)
Syzbot found a drop_nlink warning that is triggered by an easy to detect nlink corruption. This patch adds sanity checks to minix_unlink and minix_rename to prevent the warning and instead return EFSCORRUPTED to the caller. The changes were tested using the syzbot reproducer as well as local testing. Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl> Link: https://patch.msgid.link/20251104143005.3283980-4-jkoolstra@xs4all.nl Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: syzbot+a65e824272c5f741247d@syzkaller.appspotmail.com Closes: https://syzbot.org/bug?extid=a65e824272c5f741247d Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  Fix a drop_nlink warning in minix_rmdir  (Jori Koolstra)
Syzbot found a drop_nlink warning that is triggered by an easy to detect nlink corruption of a directory. This patch adds a sanity check to minix_rmdir to prevent the warning and instead return EFSCORRUPTED to the caller. The changes were tested using the syzbot reproducer as well as local testing. Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl> Link: https://patch.msgid.link/20251104143005.3283980-3-jkoolstra@xs4all.nl Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: syzbot+4e49728ec1cbaf3b91d2@syzkaller.appspotmail.com Closes: https://syzbot.org/bug?extid=4e49728ec1cbaf3b91d2 Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  Add error handling to minix filesystem for inode corruption detection  (Jori Koolstra)
We would like to provide early and specific warnings of filesystem corruption without running into generic WARN_ONs and BUG_ONs. Towards this goal, ext4, e.g., has an EFSCORRUPTED errno and a standardized inode corruption message format. This patch adds this errno and message format to the minix filesystem. Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl> Link: https://patch.msgid.link/20251104143005.3283980-2-jkoolstra@xs4all.nl Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
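As a rough illustration of the pattern used by the three minix patches above (the message format is modeled on ext4 and is not the exact minix wording):

    /* e.g. in minix_unlink()/minix_rmdir(), before dropping the link count */
    if (inode->i_nlink == 0) {
        printk(KERN_ERR "MINIX-fs: corrupted inode %lu: i_nlink is zero\n",
               inode->i_ino);
        return -EFSCORRUPTED;   /* defined as EUCLEAN, as in ext4/xfs */
    }
    drop_nlink(inode);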
2025-11-05  xfs: support sub-block aligned vectors in always COW mode  (Christoph Hellwig)
Now that the block layer and iomap have grown support to indicate the bio sector size explicitly instead of assuming the device sector size, we can ask for logical block size alignment and thus support direct I/O writes where the overall size is logical block size aligned, but the boundaries between vectors might not be. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251031131045.1613229-3-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag  (Qu Wenruo)
Btrfs requires all of its bios to be fs block aligned. Normally this is fine, but with the incoming block size larger than page size (bs > ps) support, the requirement is no longer met for direct IO, because iomap_dio_bio_iter() calls bio_iov_iter_get_pages(), which only requires alignment to bdev_logical_block_size(). In the real world that value is either 512 or 4K, so on 4K page sized systems bio_iov_iter_get_pages() can break the bio at any page boundary, breaking btrfs' requirement for bs > ps cases. To address this problem, introduce a new public iomap dio flag, IOMAP_DIO_FSBLOCK_ALIGNED. When calling __iomap_dio_rw() with this new flag, iomap_dio::flags will inherit it, and iomap_dio_bio_iter() will take the fs block size into account when calculating the alignment and pass that alignment to bio_iov_iter_get_pages(), respecting the fs block size requirement. The initial user of this flag will be btrfs, which needs to calculate the checksum for direct reads and thus requires the biovec to be fs block aligned for the incoming bs > ps support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> [hch: also align pos/len, incorporate the trace flags from Darrick] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251031131045.1613229-2-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
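A filesystem opting in would then pass the flag to the direct I/O path, roughly like this (the call site and ops names are illustrative of btrfs, not taken from the actual btrfs patch):

    dio = __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
                         IOMAP_DIO_FSBLOCK_ALIGNED, NULL, 0);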
2025-11-05  xfs: error tag to force zeroing on debug kernels  (Brian Foster)
iomap_zero_range() has to cover various corner cases that are difficult to test on production kernels because it is used in fairly limited use cases. For example, it is currently only used by XFS and mostly only in partial block zeroing cases. While it's possible to test most of these functional cases, we can provide more robust test coverage by co-opting fallocate zero range to invoke zeroing of the entire range instead of the more efficient block punch/allocate sequence. Add an errortag to occasionally invoke forced zeroing. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: remove old partial eof zeroing optimization  (Brian Foster)
iomap_zero_range() optimizes the partial eof block zeroing use case by force zeroing if the mapping is dirty. This is to avoid frequent flushing on file extending workloads, which hurts performance. Now that the folio batch mechanism provides a more generic solution and is used by the only real zero range user (XFS), this isolated optimization is no longer needed. Remove the unnecessary code and let callers use the folio batch or fall back to flushing by default. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  xfs: fill dirty folios on zero range of unwritten mappings  (Brian Foster)
Use the iomap folio batch mechanism to select folios to zero on zero range of unwritten mappings. Trim the resulting mapping if the batch is filled (unlikely for current use cases) to distinguish between a range to skip and one that requires another iteration due to a full batch. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  xfs: always trim mapping to requested range for zero range  (Brian Foster)
Refactor and tweak the IOMAP_ZERO logic in preparation to support filling the folio batch for unwritten mappings. Drop the superfluous imap offset check since the hole case has already been filtered out. Split the delalloc case handling into a sub-branch, and always trim the imap to the requested offset/count so it can be more easily used to bound the range to lookup in pagecache. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: optional zero range dirty folio processing  (Brian Foster)
The only way zero range can currently process unwritten mappings with dirty pagecache is to check whether the range is dirty before mapping lookup and then flush when at least one underlying mapping is unwritten. This ordering is required to prevent iomap lookup from racing with folio writeback and reclaim. Since zero range can skip ranges of unwritten mappings that are clean in cache, this operation can be improved by allowing the filesystem to provide a set of dirty folios that require zeroing. In turn, rather than flush or iterate file offsets, zero range can iterate on folios in the batch and advance over clean or uncached ranges in between. Add a folio_batch in struct iomap and provide a helper for filesystems to populate the batch at lookup time. Update the folio lookup path to return the next folio in the batch, if provided, and advance the iter if the folio starts beyond the current offset. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: remove pos+len BUG_ON() to after folio lookup  (Brian Foster)
The bug checks at the top of iomap_write_begin() assume the pos/len reflect exactly the next range to process. This may no longer be the case once the get folio path is able to process a folio batch from the filesystem. On top of that, len is already trimmed to within the iomap/srcmap by iomap_length(), so these checks aren't terribly useful. Remove the unnecessary BUG_ON() checks. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  fuse: remove fc->blkbits workaround for partial writes  (Joanne Koong)
Now that fuse is integrated with iomap for read/readahead, we can remove the workaround that was added in commit bd24d2108e9c ("fuse: fix fuseblk i_blkbits for iomap partial writes"), which was previously needed to avoid a race condition where an iomap partial write may be overwritten by a read if blocksize < PAGE_SIZE. Now that fuse does iomap read/readahead, this is protected against since there is granular uptodate tracking of blocks, which means this workaround can be removed. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  fuse: use iomap for readahead  (Joanne Koong)
Do readahead in fuse using iomap. This gives us granular uptodate tracking for large folios, which optimizes how much data needs to be read in. If some portions of the folio are already uptodate (e.g. through a prior write), we only need to read in the non-uptodate portions. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  fuse: use iomap for read_folio  (Joanne Koong)
Read folio data into the page cache using iomap. This gives us granular uptodate tracking for large folios, which optimizes how much data needs to be read in. If some portions of the folio are already uptodate (e.g. through a prior write), we only need to read in the non-uptodate portions. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: make iomap_read_folio() a void return  (Joanne Koong)
No errors are propagated in iomap_read_folio(). Change iomap_read_folio() to a void return to make this clearer to callers. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
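Callers that used to forward the return value now just return success; e.g. a typical ->read_folio implementation becomes (sketch based on the XFS caller):

    static int xfs_vm_read_folio(struct file *unused, struct folio *folio)
    {
        iomap_read_folio(folio, &xfs_read_iomap_ops);
        return 0;
    }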
2025-11-05  iomap: move buffered io bio logic into new file  (Christoph Hellwig)
Move bio logic in the buffered io code into its own file and remove CONFIG_BLOCK gating for iomap read/readahead. [1] https://lore.kernel.org/linux-fsdevel/aMK2GuumUf93ep99@infradead.org/ Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: add caller-provided callbacks for read and readahead  (Joanne Koong)
Add caller-provided callbacks for read and readahead so that they can be used generically, especially by filesystems that are not block-based.

In particular, this:

* Modifies the read and readahead interface to take in a struct iomap_read_folio_ctx that is publicly defined as:

    struct iomap_read_folio_ctx {
        const struct iomap_read_ops *ops;
        struct folio *cur_folio;
        struct readahead_control *rac;
        void *read_ctx;
    };

  where struct iomap_read_ops is defined as:

    struct iomap_read_ops {
        int (*read_folio_range)(const struct iomap_iter *iter,
                                struct iomap_read_folio_ctx *ctx, size_t len);
        void (*read_submit)(struct iomap_read_folio_ctx *ctx);
    };

  read_folio_range() reads in the folio range and must be provided by the caller. read_submit() is optional and is used for submitting any pending read requests.

* Modifies existing filesystems that use iomap for read and readahead to use the new API, through the new statically inlined helpers iomap_bio_read_folio() and iomap_bio_readahead(). There is no change in functionality for those filesystems.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: set accurate iter->pos when reading folio ranges  (Joanne Koong)
Advance iter to the correct position before calling an IO helper to read in a folio range. This allows the helper to reliably use iter->pos to determine the starting offset for reading. This will simplify the interface for reading in folio ranges when iomap read/readahead supports caller-provided callbacks. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05  iomap: track pending read bytes more optimally  (Joanne Koong)
Instead of incrementing read_bytes_pending for every folio range read in (which requires acquiring the spinlock to do so), set read_bytes_pending to the folio size when the first range is asynchronously read in, keep track of how many bytes total are asynchronously read in, and adjust read_bytes_pending accordingly after issuing requests to read in all the necessary ranges. iomap_read_folio_ctx->cur_folio_in_bio can be removed since a non-zero value for pending bytes necessarily indicates the folio is in the bio. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>