<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/fs/btrfs/delayed-ref.c, branch v6.7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>btrfs: fix qgroup record leaks when using simple quotas</title>
<updated>2023-11-09T13:01:59+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2023-11-06T20:17:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=609d99379736aa6c5b0658654084198aa808035a'/>
<id>609d99379736aa6c5b0658654084198aa808035a</id>
<content type='text'>
When using simple quotas we are not supposed to allocate qgroup records
when adding delayed references. However we allocate them if either mode
of quotas is enabled (the new simple one or the old one), but then we
never free them because running the accounting, which frees the records,
is only run when using the old quotas (at btrfs_qgroup_account_extents()),
resulting in a memory leak of the records allocated when adding delayed
references.

Fix this by allocating the records only if the old quotas mode is enabled.
Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas
mode is not enabled - meaning the caller has to free the record.

Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas")
Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When using simple quotas we are not supposed to allocate qgroup records
when adding delayed references. However we allocate them if either mode
of quotas is enabled (the new simple one or the old one), but then we
never free them because running the accounting, which frees the records,
is only run when using the old quotas (at btrfs_qgroup_account_extents()),
resulting in a memory leak of the records allocated when adding delayed
references.

Fix this by allocating the records only if the old quotas mode is enabled.
Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas
mode is not enabled - meaning the caller has to free the record.

Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas")
Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: stop reserving excessive space for block group item insertions</title>
<updated>2023-10-12T14:44:16+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2023-09-28T10:12:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9ef17228e1096e7e75bdde752ae1f0e9a5bcc8ab'/>
<id>9ef17228e1096e7e75bdde752ae1f0e9a5bcc8ab</id>
<content type='text'>
Space for block group item insertions, necessary after allocating a new
block group, is reserved in the delayed refs block reserve. Currently we
do this by incrementing the transaction handle's delayed_ref_updates
counter and then calling btrfs_update_delayed_refs_rsv(), which will
increase the size of the delayed refs block reserve by an amount that
corresponds to the same amount we use for delayed refs, given by
btrfs_calc_delayed_ref_bytes().

That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_insert_metadata_size(), since we only need to
insert a block group item in the extent tree (or block group tree if this
feature is enabled). By using btrfs_calc_insert_metadata_size() we will
need to reserve 2 times less space when using the free space tree, putting
less pressure on space reservation.

So use helpers to reserve and release space for block group item
insertions that use btrfs_calc_insert_metadata_size() for calculation of
the space.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Space for block group item insertions, necessary after allocating a new
block group, is reserved in the delayed refs block reserve. Currently we
do this by incrementing the transaction handle's delayed_ref_updates
counter and then calling btrfs_update_delayed_refs_rsv(), which will
increase the size of the delayed refs block reserve by an amount that
corresponds to the same amount we use for delayed refs, given by
btrfs_calc_delayed_ref_bytes().

That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_insert_metadata_size(), since we only need to
insert a block group item in the extent tree (or block group tree if this
feature is enabled). By using btrfs_calc_insert_metadata_size() we will
need to reserve 2 times less space when using the free space tree, putting
less pressure on space reservation.

So use helpers to reserve and release space for block group item
insertions that use btrfs_calc_insert_metadata_size() for calculation of
the space.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: stop reserving excessive space for block group item updates</title>
<updated>2023-10-12T14:44:16+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2023-09-28T10:12:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f66e0209bd914465c277c259472aa974cad94e3f'/>
<id>f66e0209bd914465c277c259472aa974cad94e3f</id>
<content type='text'>
Space for block group item updates, necessary after allocating or
deallocating an extent from a block group, is reserved in the delayed
refs block reserve. Currently we do this by incrementing the transaction
handle's delayed_ref_updates counter and then calling
btrfs_update_delayed_refs_rsv(), which will increase the size of the
delayed refs block reserve by an amount that corresponds to the same
amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().

That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_metadata_size(), since we only need to
update an existing block group item in the extent tree (or block group
tree if this feature is enabled). By using btrfs_calc_metadata_size() we
will need to reserve 4 times less space when using the free space tree
and 2 times less space when not using it, putting less pressure on space
reservation.

So use helpers to reserve and release space for block group item updates
that use btrfs_calc_metadata_size() for calculation of the space.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Space for block group item updates, necessary after allocating or
deallocating an extent from a block group, is reserved in the delayed
refs block reserve. Currently we do this by incrementing the transaction
handle's delayed_ref_updates counter and then calling
btrfs_update_delayed_refs_rsv(), which will increase the size of the
delayed refs block reserve by an amount that corresponds to the same
amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().

That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_metadata_size(), since we only need to
update an existing block group item in the extent tree (or block group
tree if this feature is enabled). By using btrfs_calc_metadata_size() we
will need to reserve 4 times less space when using the free space tree
and 2 times less space when not using it, putting less pressure on space
reservation.

So use helpers to reserve and release space for block group item updates
that use btrfs_calc_metadata_size() for calculation of the space.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: record simple quota deltas in delayed refs</title>
<updated>2023-10-12T14:44:11+00:00</updated>
<author>
<name>Boris Burkov</name>
<email>boris@bur.io</email>
</author>
<published>2023-06-28T18:00:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=cecbb533b5fcec4ff77e786b7f94457f6cacd9e7'/>
<id>cecbb533b5fcec4ff77e786b7f94457f6cacd9e7</id>
<content type='text'>
At the moment that we run delayed refs, we make the final ref-count
based decision on creating/removing extent (and metadata) items.
Therefore, it is exactly the spot to hook up simple quotas.

There are a few important subtleties to the fields we must collect to
accurately track simple quotas, particularly when removing an extent.
When removing a data extent, the ref could be in any tree (due to
reflink, for example) and so we need to recover the owning root id from
the owner ref item. When removing a metadata extent, we know the owning
root from the owner field in the header when we create the delayed ref,
so we can recover it from there.

We must also be careful to handle reservations properly to not leaked
reserved space. The happy path is freeing the reservation when the
simple quota delta runs on a data extent. If that doesn't happen, due to
refs canceling out or some error, the ref head already has the
must_insert_reserved machinery to handle this, so we piggy back on that
and use it to clean up the reserved data.

Signed-off-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
At the moment that we run delayed refs, we make the final ref-count
based decision on creating/removing extent (and metadata) items.
Therefore, it is exactly the spot to hook up simple quotas.

There are a few important subtleties to the fields we must collect to
accurately track simple quotas, particularly when removing an extent.
When removing a data extent, the ref could be in any tree (due to
reflink, for example) and so we need to recover the owning root id from
the owner ref item. When removing a metadata extent, we know the owning
root from the owner field in the header when we create the delayed ref,
so we can recover it from there.

We must also be careful to handle reservations properly to not leaked
reserved space. The happy path is freeing the reservation when the
simple quota delta runs on a data extent. If that doesn't happen, due to
refs canceling out or some error, the ref head already has the
must_insert_reserved machinery to handle this, so we piggy back on that
and use it to clean up the reserved data.

Signed-off-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: track original extent owner in head_ref</title>
<updated>2023-10-12T14:44:11+00:00</updated>
<author>
<name>Boris Burkov</name>
<email>boris@bur.io</email>
</author>
<published>2023-06-28T21:03:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=cf79ac47932b377d0cfe6b61f4472cdc17eac042'/>
<id>cf79ac47932b377d0cfe6b61f4472cdc17eac042</id>
<content type='text'>
Simple quotas requires tracking the original creating root of any given
extent. This gets complicated when multiple subvolumes create
overlapping/contradictory refs in the same transaction. For example,
due to modifying or deleting an extent while also snapshotting it.

To resolve this in a general way, take advantage of the fact that we are
essentially already tracking this for handling releasing reservations.
The head ref coalesces the various refs and uses must_insert_reserved to
check if it needs to create an extent/free reservation. Store the ref
that set must_insert_reserved as the owning ref on the head ref.

Note that this can result in writing an extent for the very first time
with an owner different from its only ref, but it will look the same as
if you first created it with the original owning ref, then added the
other ref, then removed the owning ref.

Signed-off-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Simple quotas requires tracking the original creating root of any given
extent. This gets complicated when multiple subvolumes create
overlapping/contradictory refs in the same transaction. For example,
due to modifying or deleting an extent while also snapshotting it.

To resolve this in a general way, take advantage of the fact that we are
essentially already tracking this for handling releasing reservations.
The head ref coalesces the various refs and uses must_insert_reserved to
check if it needs to create an extent/free reservation. Store the ref
that set must_insert_reserved as the owning ref on the head ref.

Note that this can result in writing an extent for the very first time
with an owner different from its only ref, but it will look the same as
if you first created it with the original owning ref, then added the
other ref, then removed the owning ref.

Signed-off-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: rename tree_ref and data_ref owning_root</title>
<updated>2023-10-12T14:44:11+00:00</updated>
<author>
<name>Boris Burkov</name>
<email>boris@bur.io</email>
</author>
<published>2023-06-28T21:32:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=610647d7efd1ab06e9cd7ed6f0abe84b16385da7'/>
<id>610647d7efd1ab06e9cd7ed6f0abe84b16385da7</id>
<content type='text'>
commit 113479d5b8eb ("btrfs: rename root fields in delayed refs structs")
changed these from ref_root to owning_root. However, there are many
circumstances where that name is not really accurate and the root on the
ref struct _is_ the referring root. In general, these are not the owning
root, though it does happen in some ref merging cases involving
overwrites during snapshots and similar.

Simple quotas cares quite a bit about tracking the original owner of an
extent through delayed refs, so rename these back to free up the name
for the real owning root (which will live on the generic btrfs_ref and
the head ref)

Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 113479d5b8eb ("btrfs: rename root fields in delayed refs structs")
changed these from ref_root to owning_root. However, there are many
circumstances where that name is not really accurate and the root on the
ref struct _is_ the referring root. In general, these are not the owning
root, though it does happen in some ref merging cases involving
overwrites during snapshots and similar.

Simple quotas cares quite a bit about tracking the original owner of an
extent through delayed refs, so rename these back to free up the name
for the real owning root (which will live on the generic btrfs_ref and
the head ref)

Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: qgroup: add new quota mode for simple quotas</title>
<updated>2023-10-12T14:44:10+00:00</updated>
<author>
<name>Boris Burkov</name>
<email>boris@bur.io</email>
</author>
<published>2023-05-16T23:35:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=182940f4f4dbd932776414744c8de64333957725'/>
<id>182940f4f4dbd932776414744c8de64333957725</id>
<content type='text'>
Add a new quota mode called "simple quotas". It can be enabled by the
existing quota enable ioctl via a new command, and sets an incompat
bit, as the implementation of simple quotas will make backwards
incompatible changes to the disk format of the extent tree.

Signed-off-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add a new quota mode called "simple quotas". It can be enabled by the
existing quota enable ioctl via a new command, and sets an incompat
bit, as the implementation of simple quotas will make backwards
incompatible changes to the disk format of the extent tree.

Signed-off-by: Boris Burkov &lt;boris@bur.io&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: always reserve space for delayed refs when starting transaction</title>
<updated>2023-10-12T14:44:06+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2023-09-08T17:20:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=28270e25c69a2c76ea1ed0922095bffb9b9a4f98'/>
<id>28270e25c69a2c76ea1ed0922095bffb9b9a4f98</id>
<content type='text'>
When starting a transaction (or joining an existing one with
btrfs_start_transaction()), we reserve space for the number of items we
want to insert in a btree, but we don't do it for the delayed refs we
will generate while using the transaction to modify (COW) extent buffers
in a btree or allocate new extent buffers. Basically how it works:

1) When we start a transaction we reserve space for the number of items
   the caller wants to be inserted/modified/deleted in a btree. This space
   goes to the transaction block reserve;

2) If the delayed refs block reserve is not full, its size is greater
   than the amount of its reserved space, and the flush method is
   BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
   it corresponding to the number of items the caller wants to
   insert/modify/delete in a btree;

3) The size of the delayed refs block reserve is increased when a task
   creates delayed refs after COWing an extent buffer, allocating a new
   one or deleting (freeing) an extent buffer. This happens after the
   the task started or joined a transaction, whenever it calls
   btrfs_update_delayed_refs_rsv();

4) The delayed refs block reserve is then refilled by anyone calling
   btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
   operations or when someone else calls btrfs_start_transaction() with
   a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;

5) As a task COWs or allocates extent buffers, it consumes space from the
   transaction block reserve. When the task releases its transaction
   handle (btrfs_end_transaction()) or it attempts to commit the
   transaction, it releases any remaining space in the transaction block
   reserve that it did not use, as not all space may have been used (due
   to pessimistic space calculation) by calling btrfs_block_rsv_release()
   which will try to add that unused space to the delayed refs block
   reserve (if its current size is greater than its reserved space).
   That transferred space may not be enough to completely fulfill the
   delayed refs block reserve.

   Plus we have some tasks that will attempt do modify as many leaves
   as they can before getting -ENOSPC (and then reserving more space and
   retrying), such as hole punching and extent cloning which call
   btrfs_replace_file_extents(). Such tasks can generate therefore a
   high number of delayed refs, for both metadata and data (we can't
   know in advance how many file extent items we will find in a range
   and therefore how many delayed refs for dropping references on data
   extents we will generate);

6) If a transaction starts its commit before the delayed refs block
   reserve is refilled, for example by the transaction kthread or by
   someone who called btrfs_join_transaction() before starting the
   commit, then when running delayed references if we don't have enough
   reserved space in the delayed refs block reserve, we will consume
   space from the global block reserve.

Now this doesn't make a lot of sense because:

1) We should reserve space for delayed references when starting the
   transaction, since we have no guarantees the delayed refs block
   reserve will be refilled;

2) If no refill happens then we will consume from the global block reserve
   when running delayed refs during the transaction commit;

3) If we have a bunch of tasks calling btrfs_start_transaction() with a
   number of items greater than zero and at the time the delayed refs
   reserve is full, then we don't reserve any space at
   btrfs_start_transaction() for the delayed refs that will be generated
   by a task, and we can therefore end up using a lot of space from the
   global reserve when running the delayed refs during a transaction
   commit;

4) There are also other operations that result in bumping the size of the
   delayed refs reserve, such as creating and deleting block groups, as
   well as the need to update a block group item because we allocated or
   freed an extent from the respective block group;

5) If we have a significant gap between the delayed refs reserve's size
   and its reserved space, two very bad things may happen:

   1) The reserved space of the global reserve may not be enough and we
      fail the transaction commit with -ENOSPC when running delayed refs;

   2) If the available space in the global reserve is enough it may result
      in nearly exhausting it. If the fs has no more unallocated device
      space for allocating a new block group and all the available space
      in existing metadata block groups is not far from the global
      reserve's size before we started the transaction commit, we may end
      up in a situation where after the transaction commit we have too
      little available metadata space, and any future transaction commit
      will fail with -ENOSPC, because although we were able to reserve
      space to start the transaction, we were not able to commit it, as
      running delayed refs generates some more delayed refs (to update the
      extent tree for example) - this includes not even being able to
      commit a transaction that was started with the goal of unlinking a
      file, removing an empty data block group or doing reclaim/balance,
      so there's no way to release metadata space.

      In the worst case the next time we mount the filesystem we may
      also fail with -ENOSPC due to failure to commit a transaction to
      cleanup orphan inodes. This later case was reported and hit by
      someone running a SLE (SUSE Linux Enterprise) distribution for
      example - where the fs had no more unallocated space that could be
      used to allocate a new metadata block group, and the available
      metadata space was about 1.5M, not enough to commit a transaction
      to cleanup an orphan inode (or do relocation of data block groups
      that were far from being full).

So improve on this situation by always reserving space for delayed refs
when calling start_transaction(), and if the flush method is
BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
reserve if it's not full. The space reserved for the delayed refs is added
to a local block reserve that is part of the transaction handle, and when
a task updates the delayed refs block reserve size, after creating a
delayed ref, the space is transferred from that local reserve to the
global delayed refs reserve (fs_info-&gt;delayed_refs_rsv). In case the
local reserve does not have enough space, which may happen for tasks
that generate a variable and potentially large number of delayed refs
(such as the hole punching and extent cloning cases mentioned before),
we transfer any available space and then rely on the current behaviour
of hoping some other task refills the delayed refs reserve or fallback
to the global block reserve.

Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When starting a transaction (or joining an existing one with
btrfs_start_transaction()), we reserve space for the number of items we
want to insert in a btree, but we don't do it for the delayed refs we
will generate while using the transaction to modify (COW) extent buffers
in a btree or allocate new extent buffers. Basically how it works:

1) When we start a transaction we reserve space for the number of items
   the caller wants to be inserted/modified/deleted in a btree. This space
   goes to the transaction block reserve;

2) If the delayed refs block reserve is not full, its size is greater
   than the amount of its reserved space, and the flush method is
   BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
   it corresponding to the number of items the caller wants to
   insert/modify/delete in a btree;

3) The size of the delayed refs block reserve is increased when a task
   creates delayed refs after COWing an extent buffer, allocating a new
   one or deleting (freeing) an extent buffer. This happens after the
   the task started or joined a transaction, whenever it calls
   btrfs_update_delayed_refs_rsv();

4) The delayed refs block reserve is then refilled by anyone calling
   btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
   operations or when someone else calls btrfs_start_transaction() with
   a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;

5) As a task COWs or allocates extent buffers, it consumes space from the
   transaction block reserve. When the task releases its transaction
   handle (btrfs_end_transaction()) or it attempts to commit the
   transaction, it releases any remaining space in the transaction block
   reserve that it did not use, as not all space may have been used (due
   to pessimistic space calculation) by calling btrfs_block_rsv_release()
   which will try to add that unused space to the delayed refs block
   reserve (if its current size is greater than its reserved space).
   That transferred space may not be enough to completely fulfill the
   delayed refs block reserve.

   Plus we have some tasks that will attempt do modify as many leaves
   as they can before getting -ENOSPC (and then reserving more space and
   retrying), such as hole punching and extent cloning which call
   btrfs_replace_file_extents(). Such tasks can generate therefore a
   high number of delayed refs, for both metadata and data (we can't
   know in advance how many file extent items we will find in a range
   and therefore how many delayed refs for dropping references on data
   extents we will generate);

6) If a transaction starts its commit before the delayed refs block
   reserve is refilled, for example by the transaction kthread or by
   someone who called btrfs_join_transaction() before starting the
   commit, then when running delayed references if we don't have enough
   reserved space in the delayed refs block reserve, we will consume
   space from the global block reserve.

Now this doesn't make a lot of sense because:

1) We should reserve space for delayed references when starting the
   transaction, since we have no guarantees the delayed refs block
   reserve will be refilled;

2) If no refill happens then we will consume from the global block reserve
   when running delayed refs during the transaction commit;

3) If we have a bunch of tasks calling btrfs_start_transaction() with a
   number of items greater than zero and at the time the delayed refs
   reserve is full, then we don't reserve any space at
   btrfs_start_transaction() for the delayed refs that will be generated
   by a task, and we can therefore end up using a lot of space from the
   global reserve when running the delayed refs during a transaction
   commit;

4) There are also other operations that result in bumping the size of the
   delayed refs reserve, such as creating and deleting block groups, as
   well as the need to update a block group item because we allocated or
   freed an extent from the respective block group;

5) If we have a significant gap between the delayed refs reserve's size
   and its reserved space, two very bad things may happen:

   1) The reserved space of the global reserve may not be enough and we
      fail the transaction commit with -ENOSPC when running delayed refs;

   2) If the available space in the global reserve is enough it may result
      in nearly exhausting it. If the fs has no more unallocated device
      space for allocating a new block group and all the available space
      in existing metadata block groups is not far from the global
      reserve's size before we started the transaction commit, we may end
      up in a situation where after the transaction commit we have too
      little available metadata space, and any future transaction commit
      will fail with -ENOSPC, because although we were able to reserve
      space to start the transaction, we were not able to commit it, as
      running delayed refs generates some more delayed refs (to update the
      extent tree for example) - this includes not even being able to
      commit a transaction that was started with the goal of unlinking a
      file, removing an empty data block group or doing reclaim/balance,
      so there's no way to release metadata space.

      In the worst case the next time we mount the filesystem we may
      also fail with -ENOSPC due to failure to commit a transaction to
      cleanup orphan inodes. This later case was reported and hit by
      someone running a SLE (SUSE Linux Enterprise) distribution for
      example - where the fs had no more unallocated space that could be
      used to allocate a new metadata block group, and the available
      metadata space was about 1.5M, not enough to commit a transaction
      to cleanup an orphan inode (or do relocation of data block groups
      that were far from being full).

So improve on this situation by always reserving space for delayed refs
when calling start_transaction(), and if the flush method is
BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
reserve if it's not full. The space reserved for the delayed refs is added
to a local block reserve that is part of the transaction handle, and when
a task updates the delayed refs block reserve size, after creating a
delayed ref, the space is transferred from that local reserve to the
global delayed refs reserve (fs_info-&gt;delayed_refs_rsv). In case the
local reserve does not have enough space, which may happen for tasks
that generate a variable and potentially large number of delayed refs
(such as the hole punching and extent cloning cases mentioned before),
we transfer any available space and then rely on the current behaviour
of hoping some other task refills the delayed refs reserve or fallback
to the global block reserve.

Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: stop doing excessive space reservation for csum deletion</title>
<updated>2023-10-12T14:44:06+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2023-09-08T17:20:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=adb86dbe426f9a54843d70092819deca220a224d'/>
<id>adb86dbe426f9a54843d70092819deca220a224d</id>
<content type='text'>
Currently when reserving space for deleting the csum items for a data
extent, when adding or updating a delayed ref head, we determine how
many leaves of csum items we can have and then pass that number to the
helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
space for all tree modifications we need when running delayed references,
however the amount of space it computes is excessive for deleting csum
items because:

1) It uses btrfs_calc_insert_metadata_size() which is excessive because
   we only need to delete csum items from the csum tree, we don't need
   to insert any items, so btrfs_calc_metadata_size() is all we need (as
   it computes space needed to delete an item);

2) If the free space tree is enabled, it doubles the amount of space,
   which is pointless for csum deletion since we don't need to touch the
   free space tree or any other tree other than the csum tree.

So improve on this by tracking how many csum deletions we have and using
a new helper to calculate space for csum deletions (just a wrapper around
btrfs_calc_metadata_size() with a comment). This reduces the amount of
space we need to reserve for csum deletions by a factor of 4, and it helps
reduce the number of times we have to block space reservations and have
the reclaim task enter the space flushing algorithm (flush delayed items,
flush delayed refs, etc) in order to satisfy tickets.

For example this results in a total time decrease when unlinking (or
truncating) files with many extents, as we end up having to block on space
metadata reservations less often. Example test:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/test

  umount $DEV &amp;&gt; /dev/null
  mkfs.btrfs -f $DEV
  # Use compression to quickly create files with a lot of extents
  # (each with a size of 128K).
  mount -o compress=lzo $DEV $MNT

  # 100G gives at least 983040 extents with a size of 128K.
  xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar

  # Flush all delalloc and clear all metadata from memory.
  umount $MNT
  mount -o compress=lzo $DEV $MNT

  start=$(date +%s%N)
  rm -f $MNT/foobar
  end=$(date +%s%N)
  dur=$(( (end - start) / 1000000 ))
  echo "rm took $dur milliseconds"

  umount $MNT

Before this change rm took: 7504 milliseconds
After this change rm took:  6574 milliseconds  (-12.4%)

Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently when reserving space for deleting the csum items for a data
extent, when adding or updating a delayed ref head, we determine how
many leaves of csum items we can have and then pass that number to the
helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
space for all tree modifications we need when running delayed references,
however the amount of space it computes is excessive for deleting csum
items because:

1) It uses btrfs_calc_insert_metadata_size() which is excessive because
   we only need to delete csum items from the csum tree, we don't need
   to insert any items, so btrfs_calc_metadata_size() is all we need (as
   it computes space needed to delete an item);

2) If the free space tree is enabled, it doubles the amount of space,
   which is pointless for csum deletion since we don't need to touch the
   free space tree or any other tree other than the csum tree.

So improve on this by tracking how many csum deletions we have and using
a new helper to calculate space for csum deletions (just a wrapper around
btrfs_calc_metadata_size() with a comment). This reduces the amount of
space we need to reserve for csum deletions by a factor of 4, and it helps
reduce the number of times we have to block space reservations and have
the reclaim task enter the space flushing algorithm (flush delayed items,
flush delayed refs, etc) in order to satisfy tickets.

For example this results in a total time decrease when unlinking (or
truncating) files with many extents, as we end up having to block on space
metadata reservations less often. Example test:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/test

  umount $DEV &amp;&gt; /dev/null
  mkfs.btrfs -f $DEV
  # Use compression to quickly create files with a lot of extents
  # (each with a size of 128K).
  mount -o compress=lzo $DEV $MNT

  # 100G gives at least 983040 extents with a size of 128K.
  xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar

  # Flush all delalloc and clear all metadata from memory.
  umount $MNT
  mount -o compress=lzo $DEV $MNT

  start=$(date +%s%N)
  rm -f $MNT/foobar
  end=$(date +%s%N)
  dur=$(( (end - start) / 1000000 ))
  echo "rm took $dur milliseconds"

  umount $MNT

Before this change rm took: 7504 milliseconds
After this change rm took:  6574 milliseconds  (-12.4%)

Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: remove pointless initialization at btrfs_delayed_refs_rsv_release()</title>
<updated>2023-10-12T14:44:06+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2023-09-08T17:20:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b6ea3e6ab569d9ed7472e8df4cbf5f78fe49f277'/>
<id>b6ea3e6ab569d9ed7472e8df4cbf5f78fe49f277</id>
<content type='text'>
There's no point in initializing to 0 the local variable 'released' as
we don't use it before the next assignment to it. So remove the
initialization. This may help avoid some warnings with clang tools such
as the one reported/fixed by commit 966de47ff0c9 ("btrfs: remove redundant
initialization of variables in log_new_ancestors").

Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There's no point in initializing to 0 the local variable 'released' as
we don't use it before the next assignment to it. So remove the
initialization. This may help avoid some warnings with clang tools such
as the one reported/fixed by commit 966de47ff0c9 ("btrfs: remove redundant
initialization of variables in log_new_ancestors").

Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
