| Age | Commit message (Collapse) | Author |
|
The d_path function does not require the caller to pre-zero the
buffer.
Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20251211123829.2777009-1-dolinux.peng@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fast commits only log operations that have dedicated replay support.
EXT4_IOC_GROUP_EXTEND grows the filesystem to the end of the last
block group and updates the same on-disk metadata without going
through the fast commit tracking paths.
In practice these operations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall
semantics harder to reason about and risks replay gaps if new call
sites appear.
Teach ext4 to mark the filesystem fast-commit ineligible when
EXT4_IOC_GROUP_EXTEND grows the filesystem.
This forces those transactions to fall back to a full commit,
ensuring that the group extension changes are captured by the normal
journal rather than partially encoded in fast commit TLVs.
This change should not affect common workloads but makes online
resize via GROUP_EXTEND safer and easier to reason about under fast
commit.
Testing:
1. prepare:
dd if=/dev/zero of=/root/fc_resize.img bs=1M count=0 seek=256
mkfs.ext4 -O fast_commit -F /root/fc_resize.img
mkdir -p /mnt/fc_resize && mount -t ext4 -o loop /root/fc_resize.img /mnt/fc_resize
2. Extended the filesystem to the end of the last block group using a
helper that calls EXT4_IOC_GROUP_EXTEND on the mounted filesystem
and checked fc_info:
./group_extend_helper /mnt/fc_resize
cat /proc/fs/ext4/loop0/fc_info
shows the "Resize" ineligible reason increased.
3. Fsynced a file on the resized filesystem and confirmed that the fast
commit ineligible counter incremented for the resize transaction:
touch /mnt/fc_resize/file
/root/fsync_file /mnt/fc_resize/file
sync
cat /proc/fs/ext4/loop0/fc_info
Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-6-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fast commits only log operations that have dedicated replay support.
Online resize via EXT4_IOC_GROUP_ADD updates the superblock and group
descriptor metadata without going through the fast commit tracking
paths.
In practice these operations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall
semantics harder to reason about and risks replay gaps if new call
sites appear.
Teach ext4 to mark the filesystem fast-commit ineligible when
ext4_ioctl_group_add() adds new block groups.
This forces those transactions to fall back to a full commit,
ensuring that the filesystem geometry updates are captured by the
normal journal rather than partially encoded in fast commit TLVs.
This change should not affect common workloads but makes online
resize via GROUP_ADD safer and easier to reason about under fast
commit.
Testing:
1. prepare:
dd if=/dev/zero of=/root/fc_resize.img bs=1M count=0 seek=256
mkfs.ext4 -O fast_commit -F /root/fc_resize.img
mkdir -p /mnt/fc_resize && mount -t ext4 -o loop /root/fc_resize.img /mnt/fc_resize
2. Ran a helper that issues EXT4_IOC_GROUP_ADD on the mounted
filesystem and checked the resize ineligible reason:
./group_add_helper /mnt/fc_resize
cat /proc/fs/ext4/loop0/fc_info
shows "Resize": > 0.
3. Fsynced a file on the resized filesystem and verified that the fast
commit stats report at least one ineligible commit:
touch /mnt/fc_resize/file
/root/fsync_file /mnt/fc_resize/file
sync
cat /proc/fs/ext4/loop0/fc_info
shows fc stats ineligible > 0.
Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-5-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fast commits only log operations that have dedicated replay support.
EXT4_IOC_MOVE_EXT swaps extents between regular files and may copy
data, rewriting the affected inodes' block mapping layout without
going through the fast commit tracking paths.
In practice these operations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall
semantics harder to reason about and risks replay gaps if new call
sites appear.
Teach ext4 to mark the filesystem fast-commit ineligible for the
journal transactions used by move_extent_per_page() when
EXT4_IOC_MOVE_EXT runs.
This forces those transactions to fall back to a full commit,
ensuring that these multi-inode extent swaps are captured by the
normal journal rather than partially encoded in fast commit TLVs.
This change should not affect common workloads but makes online
defragmentation safer and easier to reason about under fast commit.
Testing:
1. prepare:
dd if=/dev/zero of=/root/fc_move.img bs=1M count=0 seek=256
mkfs.ext4 -O fast_commit -F /root/fc_move.img
mkdir -p /mnt/fc_move && mount -t ext4 -o loop \
/root/fc_move.img /mnt/fc_move
2. Created two files, ran EXT4_IOC_MOVE_EXT via e4defrag, and checked
the ineligible reason statistics:
fallocate -l 64M /mnt/fc_move/file1
cp /mnt/fc_move/file1 /mnt/fc_move/file2
e4defrag /mnt/fc_move/file1
cat /proc/fs/ext4/loop0/fc_info
shows "Move extents": > 0 and fc stats ineligible > 0.
Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-4-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fast commits only log operations that have dedicated replay support.
Enabling fs-verity builds a Merkle tree and updates inode and orphan
state in ways that are not described by the fast commit replay tags.
In practice these operations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall
semantics harder to reason about and risks replay gaps if new call
sites appear.
Teach ext4 to mark the filesystem fast-commit ineligible when
ext4_end_enable_verity() starts its journal transaction.
This forces that transaction to fall back to a full commit, ensuring
that the fs-verity enable changes are captured by the normal journal
rather than partially encoded in fast commit TLVs.
This change should not affect common workloads but makes fs-verity
enable safer and easier to reason about under fast commit.
Testing:
1. prepare:
dd if=/dev/zero of=/root/fc_verity.img bs=1M count=0 seek=128
mkfs.ext4 -O fast_commit,verity -F /root/fc_verity.img
mkdir -p /mnt/fc_verity && mount -t ext4 -o loop /root/fc_verity.img /mnt/fc_verity
2. Enabled fs-verity on a file and verified reason accounting:
echo "data" > /mnt/fc_verity/verityfile
/root/enable_verity /mnt/fc_verity/verityfile
sync
tail -n 1 /proc/fs/ext4/loop0/fc_info
"fs-verity enable": 1
3. Enabled fs-verity on a second file, fsynced it, and checked that the
ineligible commit counter is updated too:
echo "data2" > /mnt/fc_verity/verityfile2
/root/enable_verity /mnt/fc_verity/verityfile2
/root/fsync_file /mnt/fc_verity/verityfile2
sync
/proc/fs/ext4/loop0/fc_info shows "fs-verity enable" incremented and
fc stats ineligible increased accordingly.
Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-3-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Fast commits only log operations that have dedicated replay support.
Inode format migration (indirect<->extent layout changes via
EXT4_IOC_MIGRATE or toggling EXT4_EXTENTS_FL) rewrites the block mapping
representation without going through the fast commit tracking paths.
In practice these migrations are rare and usually followed by further
updates, but mixing them into a fast commit makes the overall semantics
harder to reason about and risks replay gaps if new call sites appear.
Teach ext4 to mark the filesystem fast-commit ineligible when
ext4_ext_migrate() or ext4_ind_migrate() start their journal transactions.
This forces those transactions to fall back to a full commit, ensuring
that the entire inode layout change is captured by the normal journal
rather than partially encoded in fast commit TLVs. This change should
not affect common workloads but makes format migrations safer and easier
to reason about under fast commit.
Testing:
1. prepare:
dd if=/dev/zero of=/root/fc.img bs=1M count=0 seek=128
mkfs.ext4 -O fast_commit -F /root/fc.img
mkdir -p /mnt/fc && mount -t ext4 -o loop /root/fc.img /mnt/fc
2. Created a test file and toggled the extents flag to exercise both
ext4_ind_migrate() and ext4_ext_migrate():
touch /mnt/fc/migtest
chattr -e /mnt/fc/migtest
chattr +e /mnt/fc/migtest
3. Verified fast-commit ineligible statistics:
tail -n 1 /proc/fs/ext4/loop0/fc_info
"Inode format migration": 2
Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20251211115146.897420-2-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Add a new sysfs attribute "err_report_sec" to control the s_err_report
timer in ext4_sb_info. Writing '0' disables the timer, while writing
a non-zero value enables the timer and sets the timeout in seconds.
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Link: https://patch.msgid.link/20251211030256.28613-1-liubaolin12138@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When running `kvm-xfstests -c ext4/1k -C 1 generic/383` with the
`DOUBLE_CHECK` macro defined, the following panic is triggered:
==================================================================
EXT4-fs error (device vdc): ext4_validate_block_bitmap:423:
comm mount: bg 0: bad block bitmap checksum
BUG: unable to handle page fault for address: ff110000fa2cc000
PGD 3e01067 P4D 3e02067 PUD 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 2386 Comm: mount Tainted: G W
6.18.0-gba65a4e7120a-dirty #1152 PREEMPT(none)
RIP: 0010:percpu_counter_add_batch+0x13/0xa0
Call Trace:
<TASK>
ext4_mark_group_bitmap_corrupted+0xcb/0xe0
ext4_validate_block_bitmap+0x2a1/0x2f0
ext4_read_block_bitmap+0x33/0x50
mb_group_bb_bitmap_alloc+0x33/0x80
ext4_mb_add_groupinfo+0x190/0x250
ext4_mb_init_backend+0x87/0x290
ext4_mb_init+0x456/0x640
__ext4_fill_super+0x1072/0x1680
ext4_fill_super+0xd3/0x280
get_tree_bdev_flags+0x132/0x1d0
vfs_get_tree+0x29/0xd0
vfs_cmd_create+0x59/0xe0
__do_sys_fsconfig+0x4f6/0x6b0
do_syscall_64+0x50/0x1f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
==================================================================
This issue can be reproduced using the following commands:
mkfs.ext4 -F -q -b 1024 /dev/sda 5G
tune2fs -O quota,project /dev/sda
mount /dev/sda /tmp/test
With DOUBLE_CHECK defined, mb_group_bb_bitmap_alloc() reads
and validates the block bitmap. When the validation fails,
ext4_mark_group_bitmap_corrupted() attempts to update
sbi->s_freeclusters_counter. However, this percpu_counter has not been
initialized yet at this point, which leads to the panic described above.
Fix this by moving the execution of ext4_percpu_param_init() to occur
before ext4_mb_init(), ensuring the per-CPU counters are initialized
before they are used.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20251209133116.731350-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now we have ext4_es_cache_extent() to cache on-disk extents instead of
ext4_es_insert_extent(), so drop the TODO comment.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20251129103247.686136-15-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In ext4, the remaining places for inserting extents into the extent
status tree within ext4_ext_determine_insert_hole() and
ext4_map_query_blocks() directly cache on-disk extents. We can use
ext4_es_cache_extent() instead of ext4_es_insert_extent() in these
cases. This will help reduce unnecessary increases in extent sequence
numbers and cache invalidations after supporting IOMAP in the future.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20251129103247.686136-14-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Print a trace point after successfully inserting an extent in the
ext4_es_cache_extent() function. Additionally, similar to other extent
cache operation functions, call ext4_print_pending_tree() to display the
extent debug information of the inode when in ES_DEBUG mode.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20251129103247.686136-13-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, ext4_es_cache_extent() is used to load extents into the
extent status tree when reading on-disk extent blocks. But it inserts
information into the extent status tree if and only if there isn't
information about the specified range already. So it only used for the
initial loading and does not support overwrit extents.
However, there are many other places in ext4 where on-disk extents are
inserted into the extent status tree, such as in ext4_map_query_blocks().
Currently, they call ext4_es_insert_extent() to perform the insertion,
but they don't modify the extents, so ext4_es_cache_extent() would be a
more appropriate choice. However, when ext4_map_query_blocks() inserts
an extent, it may overwrite a short existing extent of the same type.
Therefore, to prepare for the replacements, we need to extend
ext4_es_cache_extent() to allow it to overwrite existing extents with
the same status. So it checks the found extents before removing and
inserting. (There is one exception, a hole in the on-disk extent but a
delayed extent in the extent status tree is allowed.)
In addition, since cached extents can be more lenient than the extents
they modify and do not involve modifying reserved blocks, it is not
necessary to ensure that the insertion operation succeeds as strictly as
in the ext4_es_insert_extent() function.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20251129103247.686136-12-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, __es_remove_extent() unconditionally removes extent status
entries within the specified range. In order to prepare for extending
the ext4_es_cache_extent() function to cache on-disk extents, which may
overwrite some existing short-length extents with the same status, allow
__es_remove_extent() to check the specified extent type before removing
it, and return error and pass out the conflicting extent if the status
does not match.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20251129103247.686136-11-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The out label in __es_remove_extent() is just return err value, we can
return it directly if something bad happens. Therefore, remove the
useless out label and rename out_get_reserved to out.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20251129103247.686136-10-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
zero_ex is a temporary variable used only for writing zeros and
inserting extent status entry, it will not be directly inserted into the
tree. Therefore, it can be assigned values from the target extent in
various scenarios, eliminating the need to explicitly assign values to
each variable individually.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20251129103247.686136-9-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When the split extent fails, we might leave some extents still being
processed and return an error directly, which will result in stale
extent entries remaining in the extent status tree. So drop all of the
remaining potentially stale extents if the splitting fails.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251129103247.686136-8-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When splitting an unwritten extent in the middle and converting it to
initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and
EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent.
Assume we have an unwritten file and buffered write in the middle of it
without dioread_nolock enabled, it will allocate blocks as written
extent.
0 A B N
[UUUUUUUUUUUU] on-disk extent U: unwritten extent
[UUUUUUUUUUUU] extent status tree
[--DDDDDDDD--] D: valid data
|<- ->| ----> this range needs to be initialized
ext4_split_extent() first try to split this extent at B with
EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
ext4_split_extent_at() failed to split this extent due to temporary lack
of space. It zeroout B to N and leave the entire extent as unwritten.
0 A B N
[UUUUUUUUUUUU] on-disk extent
[UUUUUUUUUUUU] extent status tree
[--DDDDDDDDZZ] Z: zeroed data
ext4_split_extent() then try to split this extent at A with
EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and
leave an written extent from A to N.
0 A B N
[UUWWWWWWWWWW] on-disk extent W: written extent
[UUUUUUUUUUUU] extent status tree
[--DDDDDDDDZZ]
Finally ext4_map_create_blocks() only insert extent A to B to the extent
status tree, and leave an stale unwritten extent in the status tree.
0 A B N
[UUWWWWWWWWWW] on-disk extent W: written extent
[UUWWWWWWWWUU] extent status tree
[--DDDDDDDDZZ]
Fix this issue by always cached extent status entry after zeroing out
the second part.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251129103247.686136-7-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Caching extents during the splitting process is risky, as it may result
in stale extents remaining in the status tree. Moreover, in most cases,
the corresponding extent block entries are likely already cached before
the split happens, making caching here not particularly useful.
Assume we have an unwritten extent, and then DIO writes the first half.
[UUUUUUUUUUUUUUUU] on-disk extent U: unwritten extent
[UUUUUUUUUUUUUUUU] extent status tree
|<- ->| ----> dio write this range
First, when ext4_split_extent_at() splits this extent, it truncates the
existing extent and then inserts a new one. During this process, this
extent status entry may be shrunk, and calls to ext4_find_extent() and
ext4_cache_extents() may occur, which could potentially insert the
truncated range as a hole into the extent status tree. After the split
is completed, this hole is not replaced with the correct status.
[UUUUUUU|UUUUUUUU] on-disk extent U: unwritten extent
[UUUUUUU|HHHHHHHH] extent status tree H: hole
Then, the outer calling functions will not correct this remaining hole
extent either. Finally, if we perform a delayed buffer write on this
latter part, it will re-insert the delayed extent and cause an error in
space accounting.
In adition, if the unwritten extent cache is not shrunk during the
splitting, ext4_cache_extents() also conflicts with existing extents
when caching extents. In the future, we will add checks when caching
extents, which will trigger a warning. Therefore, Do not cache extents
that are being split.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Message-ID: <20251129103247.686136-6-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Before submitting I/O and allocating blocks with the
EXT4_GET_BLOCKS_PRE_IO flag set, ext4_split_convert_extents() may
convert the target extent range to initialized due to ENOSPC, ENOMEM, or
EQUOTA errors. However, it still marks the mapping as incorrectly
unwritten. Although this may not seem to cause any practical problems,
it will result in an unnecessary extent conversion operation after I/O
completion. Therefore, it's better to correct the returned mapping
status.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20251129103247.686136-5-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When allocating blocks during within-EOF DIO and writeback with
dioread_nolock enabled, EXT4_GET_BLOCKS_PRE_IO was set to split an
existing large unwritten extent. However, EXT4_GET_BLOCKS_CONVERT was
set when calling ext4_split_convert_extents(), which may potentially
result in stale data issues.
Assume we have an unwritten extent, and then DIO writes the second half.
[UUUUUUUUUUUUUUUU] on-disk extent U: unwritten extent
[UUUUUUUUUUUUUUUU] extent status tree
|<- ->| ----> dio write this range
First, ext4_iomap_alloc() call ext4_map_blocks() with
EXT4_GET_BLOCKS_PRE_IO, EXT4_GET_BLOCKS_UNWRIT_EXT and
EXT4_GET_BLOCKS_CREATE flags set. ext4_map_blocks() find this extent and
call ext4_split_convert_extents() with EXT4_GET_BLOCKS_CONVERT and the
above flags set.
Then, ext4_split_convert_extents() calls ext4_split_extent() with
EXT4_EXT_MAY_ZEROOUT, EXT4_EXT_MARK_UNWRIT2 and EXT4_EXT_DATA_VALID2
flags set, and it calls ext4_split_extent_at() to split the second half
with EXT4_EXT_DATA_VALID2, EXT4_EXT_MARK_UNWRIT1, EXT4_EXT_MAY_ZEROOUT
and EXT4_EXT_MARK_UNWRIT2 flags set. However, ext4_split_extent_at()
failed to insert extent since a temporary lack -ENOSPC. It zeroes out
the first half but convert the entire on-disk extent to written since
the EXT4_EXT_DATA_VALID2 flag set, but left the second half as unwritten
in the extent status tree.
[0000000000SSSSSS] data S: stale data, 0: zeroed
[WWWWWWWWWWWWWWWW] on-disk extent W: written extent
[WWWWWWWWWWUUUUUU] extent status tree
Finally, if the DIO failed to write data to the disk, the stale data in
the second half will be exposed once the cached extent entry is gone.
Fix this issue by not passing EXT4_GET_BLOCKS_CONVERT when splitting
an unwritten extent before submitting I/O, and make
ext4_split_convert_extents() to zero out the entire extent range
to zero for this case, and also mark the extent in the extent status
tree for consistency.
Fixes: b8a8684502a0 ("ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Message-ID: <20251129103247.686136-4-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When allocating initialized blocks from a large unwritten extent, or
when splitting an unwritten extent during end I/O and converting it to
initialized, there is currently a potential issue of stale data if the
extent needs to be split in the middle.
0 A B N
[UUUUUUUUUUUU] U: unwritten extent
[--DDDDDDDD--] D: valid data
|<- ->| ----> this range needs to be initialized
ext4_split_extent() first try to split this extent at B with
EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
ext4_split_extent_at() failed to split this extent due to temporary lack
of space. It zeroout B to N and mark the entire extent from 0 to N
as written.
0 A B N
[WWWWWWWWWWWW] W: written extent
[SSDDDDDDDDZZ] Z: zeroed, S: stale data
ext4_split_extent() then try to split this extent at A with
EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
a stale written extent from 0 to A.
0 A B N
[WW|WWWWWWWWWW]
[SS|DDDDDDDDZZ]
Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
when splitting at B, don't convert the entire extent to written and left
it as unwritten after zeroing out B to N. The remaining work is just
like the standard two-part split. ext4_split_extent() will pass the
EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
second time, allowing it to properly handle the split. If the split is
successful, it will keep extent from 0 to A as unwritten.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Message-ID: <20251129103247.686136-3-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When splitting an extent, if the EXT4_GET_BLOCKS_CONVERT flag is set and
it is necessary to split the target extent in the middle,
ext4_split_extent() first handles splitting the latter half of the
extent and passes the EXT4_EXT_DATA_VALID1 flag. This flag implies that
all blocks before the split point contain valid data; however, this
assumption is incorrect.
Therefore, subdivid EXT4_EXT_DATA_VALID1 into
EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_DATA_PARTIAL_VALID1, which
indicate that the first half of the extent is either entirely valid or
only partially valid, respectively. These two flags cannot be set
simultaneously.
This patch does not use EXT4_EXT_DATA_PARTIAL_VALID1, it only replaces
EXT4_EXT_DATA_VALID1 with EXT4_EXT_DATA_ENTIRE_VALID1 at the location
where it is set, no logical changes.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Message-ID: <20251129103247.686136-2-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The error branch for ext4_xattr_inode_update_ref forget to release the
refcount for iloc.bh. Find this when review code.
Fixes: 57295e835408 ("ext4: guard against EA inode refcount underflow in xattr update")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20251213055706.3417529-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Commit 962e8a01eab9 ("ext4: introduce mext_move_extent()") attempts to
call ext4_swap_extents() on the failure path to recover the swapped
extents, but fails to acquire locks for the two inode->i_data_sem,
triggering the BUG_ON statement in ext4_swap_extents().
This issue can be fixed by calling ext4_double_down_write_data_sem()
before ext4_swap_extents().
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reported-by: syzbot+4ea6bd8737669b423aae@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69368649.a70a0220.38f243.0093.GAE@google.com/
Fixes: 962e8a01eab9 ("ext4: introduce mext_move_extent()")
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20251208123713.1971068-1-sunjunchao@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use the new fserror functions to report metadata errors to fsnotify.
Note that ext4 inconsistently passes around negative and positive error
numbers all over the codebase, so we force them all to negative for
consistency in what we report to fserror, and fserror ensures that only
positive error numbers are passed to fanotify, per the fanotify(7)
manpage.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/176826402693.3490369.5875002879192895558.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Stop definining these privately and instead move them to the uapi
errno.h so that they become canonical instead of copy pasta.
Cc: linux-api@vger.kernel.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/176826402587.3490369.17659117524205214600.stgit@frogsfrogsfrogs
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add the setlease file_operation to ext4_file_operations and
ext4_dir_operations, pointing to generic_setlease. A future patch will
change the default behavior to reject lease attempts with -EINVAL when
there is no setlease file operation defined. Add generic_setlease to
retain the ability to set leases on this filesystem.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260108-setlease-6-20-v1-6-ea4dec9b67fa@kernel.org
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a blk_crypto_submit_bio helper that either submits the bio when
it is not encrypted or inline encryption is provided, but otherwise
handles the encryption before going down into the low-level driver.
This reduces the risk from bio reordering and keeps memory allocation
as high up in the stack as possible.
Note that if the submitter knows that inline enctryption is known to
be supported by the underyling driver, it can still use plain
submit_bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'.
Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long'
and so cannot discard significant bits.
A couple of places need umin() because of loops like:
nfolios = DIV_ROUND_UP(ret + start, PAGE_SIZE);
for (i = 0; i < nfolios; i++) {
struct folio *folio = page_folio(pages[i]);
...
unsigned int len = umin(ret, PAGE_SIZE - start);
...
ret -= len;
...
}
where the compiler doesn't track things well enough to know that
'ret' is never negative.
The alternate loop:
for (i = 0; ret > 0; i++) {
struct folio *folio = page_folio(pages[i]);
...
unsigned int len = min(ret, PAGE_SIZE - start);
...
ret -= len;
...
}
would be equivalent and doesn't need 'nfolios'.
Most of the 'unsigned long' actually come from PAGE_SIZE.
Detected by an extra check added to min_t().
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Link: https://patch.msgid.link/20251119224140.8616-31-david.laight.linux@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New features and improvements for the ext4 file system:
- Optimize online defragmentation by using folios instead of
individual buffer heads
- Improve error codes stored in the superblock when the journal
aborts
- Minor cleanups and clarifications in ext4_map_blocks()
- Add documentation of the casefold and encrypt flags
- Add support for file systems with a blocksize greater than the
pagesize
- Improve performance by enabling the caching the fact that an inode
does not have a Posix ACL
Various Bug Fixes:
- Fix false positive complaints from smatch
- Fix error code which is returned by ext4fs_dirhash() when Siphash
is used without the encryption key
- Fix races when writing to inline data files which could trigger a
BUG
- Fix potential NULL dereference when there is an corrupt file system
with an extended attribute value stored in a inode
- Fix false positive lockdep report when syzbot uses ext4 and ocfs2
together
- Fix false positive reported by DEPT by adjusting lock annotation
- Avoid a potential BUG_ON in jbd2 when a file system is massively
corrupted
- Fix a WARN_ON when superblock is corrupted with a non-NULL
terminated mount options field
- Add check if the userspace passes in a non-NULL terminated mount
options field to EXT4_IOC_SET_TUNE_SB_PARAM
- Fix a potential journal checksum failure whena file system is
copied while it is mounted read-only
- Fix a potential potential orphan file tracking error which only
showed on 32-bit systems
- Fix assertion checks in mballoc (which have to be explicitly enbled
by manually enabling AGGRESSIVE_CHECKS and recompiling)
- Avoid complaining about overly large orphan files created by mke2fs
with with file systems with a 64k block size"
* tag 'ext4_for_linus-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
ext4: mark inodes without acls in __ext4_iget()
ext4: enable block size larger than page size
ext4: add checks for large folio incompatibilities when BS > PS
ext4: support verifying data from large folios with fs-verity
ext4: make data=journal support large block size
ext4: support large block size in __ext4_block_zero_page_range()
ext4: support large block size in mpage_prepare_extent_to_map()
ext4: support large block size in mpage_map_and_submit_buffers()
ext4: support large block size in ext4_block_write_begin()
ext4: support large block size in ext4_mpage_readpages()
ext4: rename 'page' references to 'folio' in multi-block allocator
ext4: prepare buddy cache inode for BS > PS with large folios
ext4: support large block size in ext4_mb_init_cache()
ext4: support large block size in ext4_mb_get_buddy_page_lock()
ext4: support large block size in ext4_mb_load_buddy_gfp()
ext4: add EXT4_LBLK_TO_PG and EXT4_PG_TO_LBLK for block/page conversion
ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion
ext4: support large block size in ext4_readdir()
ext4: support large block size in ext4_calculate_overhead()
ext4: introduce s_min_folio_order for future BS > PS support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull superblock lock guard updates from Christian Brauner:
"This starts the work of introducing guards for superblock related
locks.
Introduce super_write_guard for scoped superblock write protection.
This provides a guard-based alternative to the manual sb_start_write()
and sb_end_write() pattern, allowing the compiler to automatically
handle the cleanup"
* tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: use super write guard in xfs_file_ioctl()
open: use super write guard in do_ftruncate()
btrfs: use super write guard in relocating_repair_kthread()
ext4: use super write guard in write_mmp_block()
btrfs: use super write guard in sb_start_write()
btrfs: use super write guard btrfs_run_defrag_inode()
btrfs: use super write guard in btrfs_reclaim_bgs_work()
fs: add super_write_guard
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
"Add a new folio_next_pos() helper function that returns the file
position of the first byte after the current folio. This is a common
operation in filesystems when needing to know the end of the current
folio.
The helper is lifted from btrfs which already had its own version, and
is now used across multiple filesystems and subsystems:
- btrfs
- buffer
- ext4
- f2fs
- gfs2
- iomap
- netfs
- xfs
- mm
This fixes a long-standing bug in ocfs2 on 32-bit systems with files
larger than 2GiB. Presumably this is not a common configuration, but
the fix is backported anyway. The other filesystems did not have bugs,
they were just mildly inefficient.
This also introduce uoff_t as the unsigned version of loff_t. A recent
commit inadvertently changed a comparison from being unsigned (on
64-bit systems) to being signed (which it had always been on 32-bit
systems), leading to sporadic fstests failures.
Generally file sizes are restricted to being a signed integer, but in
places where -1 is passed to indicate "up to the end of the file", it
is convenient to have an unsigned type to ensure comparisons are
always unsigned regardless of architecture"
* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Add uoff_t
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, converts all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
|
|
Mark inodes without acls with cache_no_acl() in __ext4_iget() so that
path lookup can run in RCU mode from the start. This is interesting in
particular for the case where the file owner does the lookup because in
that case end up constantly hitting the slow path otherwise. We drop out
from the fast path (because ACL state is unknown) but never end up calling
check_acl() to cache ACL state.
The problem was originally analyzed by Linus and fix tested by Matheusz,
I'm just putting it into mergeable form :).
Link: https://lore.kernel.org/all/CAHk-=whSzc75TLLPWskV0xuaHR4tpWBr=LduqhcCFr4kCmme_w@mail.gmail.com
Reported-by: Mateusz Guzik <mjguzik@gmail.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Message-ID: <20251125101340.24276-2-jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since block device (See commit 3c20917120ce ("block/bdev: enable large
folio support for large logical block sizes")) and page cache (See commit
ab95d23bab220ef8 ("filemap: allocate mapping_min_order folios in the page
cache")) has the ability to have a minimum order when allocating folio,
and ext4 has supported large folio in commit 7ac67301e82f ("ext4: enable
large folio for regular file"), now add support for block_size > PAGE_SIZE
in ext4.
set_blocksize() -> bdev_validate_blocksize() already validates the block
size, so ext4_load_super() does not need to perform additional checks.
Here we only need to add the FS_LBS bit to fs_flags.
In addition, block sizes larger than the page size are currently supported
only when CONFIG_TRANSPARENT_HUGEPAGE is enabled. To make this explicit,
a blocksize_gt_pagesize entry has been added under /sys/fs/ext4/feature/,
indicating whether bs > ps is supported. This allows mke2fs to check the
interface and determine whether a warning should be issued when formatting
a filesystem with block size larger than the page size.
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-25-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Supporting a block size greater than the page size (BS > PS) requires
support for large folios. However, several features (e.g., encrypt)
do not yet support large folios.
To prevent conflicts, this patch adds checks at mount time to prohibit
these features from being used when BS > PS. Since these features cannot
be changed on remount, there is no need to check on remount.
This patch adds s_max_folio_order, initialized during mount according to
filesystem features and mount options. If s_max_folio_order is 0, large
folios are disabled.
With this in place, ext4_set_inode_mapping_order() can be simplified by
checking s_max_folio_order, avoiding redundant checks.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-24-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Eric Biggers already added support for verifying data from large folios
several years ago in commit 5d0f0e57ed90 ("fsverity: support verifying
data from large folios").
With ext4 now supporting large block sizes, the fs-verity tests
`kvm-xfstests -c ext4/64k -g verity -x encrypt` pass without issues.
Therefore, remove the restriction and allow large folios to be enabled
together with fs-verity.
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-23-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, ext4_set_inode_mapping_order() does not set max folio order
for files with the data journalling flag. For files that already have
large folios enabled, ext4_inode_journal_mode() ignores the data
journalling flag once max folio order is set.
This is not because data journalling cannot work with large folios, but
because credit estimates will go through the roof if there are too many
blocks per folio.
Since the real constraint is blocks-per-folio, to support data=journal
under LBS, we now set max folio order to be equal to min folio order for
files with the journalling flag. When LBS is disabled, the max folio order
remains unset as before.
Therefore, before ext4_change_inode_journal_flag() switches the journalling
mode, we call truncate_pagecache() to drop all page cache for that inode,
and filemap_write_and_wait() is called unconditionally.
After that, once the journalling mode has been switched, we can safely
reset the inode mapping order, and the mapping_large_folio_support() check
in ext4_inode_journal_mode() can be removed.
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-22-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use the EXT4_PG_TO_LBLK() macro to convert folio indexes to blocks to avoid
negative left shifts after supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-21-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use the EXT4_PG_TO_LBLK/EXT4_LBLK_TO_PG macros to complete the conversion
between folio indexes and blocks to avoid negative left/right shifts after
supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-20-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use the EXT4_PG_TO_LBLK/EXT4_LBLK_TO_PG macros to complete the conversion
between folio indexes and blocks to avoid negative left/right shifts after
supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-19-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use the EXT4_PG_TO_LBLK() macro to convert folio indexes to blocks to avoid
negative left shifts after supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-18-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use the EXT4_PG_TO_LBLK() macro to convert folio indexes to blocks to avoid
negative left shifts after supporting blocksize greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-17-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The ext4 multi-block allocator now fully supports folio objects. Update
all variable names, function names, and comments to replace legacy 'page'
terminology with 'folio', improving clarity and consistency.
No functional changes.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-16-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We use EXT4_BAD_INO for the buddy cache inode number. This inode is not
accessed via __ext4_new_inode() or __ext4_iget(), meaning
ext4_set_inode_mapping_order() is not called to set its folio order range.
However, future block size greater than page size support requires this
inode to support large folios, and the buddy cache code already handles
BS > PS. Therefore, ext4_set_inode_mapping_order() is now explicitly
called for this specific inode to set its folio order range.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-15-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, ext4_mb_init_cache() uses blocks_per_page to calculate the
folio index and offset. However, when blocksize is larger than PAGE_SIZE,
blocks_per_page becomes zero, leading to a potential division-by-zero bug.
Since we now have the folio, we know its exact size. This allows us to
convert {blocks, groups}_per_page to {blocks, groups}_per_folio, thus
supporting block sizes greater than page size.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-14-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, ext4_mb_get_buddy_page_lock() uses blocks_per_page to calculate
folio index and offset. However, when blocksize is larger than PAGE_SIZE,
blocks_per_page becomes zero, leading to a potential division-by-zero bug.
To support BS > PS, use bytes to compute folio index and offset within
folio to get rid of blocks_per_page.
Also, since ext4_mb_get_buddy_page_lock() already fully supports folio,
rename it to ext4_mb_get_buddy_folio_lock().
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-13-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, ext4_mb_load_buddy_gfp() uses blocks_per_page to calculate the
folio index and offset. However, when blocksize is larger than PAGE_SIZE,
blocks_per_page becomes zero, leading to a potential division-by-zero bug.
To support BS > PS, use bytes to compute folio index and offset within
folio to get rid of blocks_per_page.
Also, if buddy and bitmap land in the same folio, we get that folio’s ref
instead of looking it up again before updating the buddy.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-12-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
As BS > PS support is coming, all block number to page index (and
vice-versa) conversions must now go via bytes. Added EXT4_LBLK_TO_PG()
and EXT4_PG_TO_LBLK() macros to simplify these conversions and handle
both BS <= PS and BS > PS scenarios cleanly.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-11-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|