summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs
AgeCommit message (Collapse)Author
2023-04-26xfs: fix forkoff miscalculation related to XFS_LITINO(mp)Gao Xiang
commit ada49d64fb3538144192181db05de17e2ffc3551 upstream. Currently, commit e9e2eae89ddb dropped a (int) decoration from XFS_LITINO(mp), and since sizeof() expression is also involved, the result of XFS_LITINO(mp) is simply as the size_t type (commonly unsigned long). Considering the expression in xfs_attr_shortform_bytesfit(): offset = (XFS_LITINO(mp) - bytes) >> 3; let "bytes" be (int)340, and "XFS_LITINO(mp)" be (unsigned long)336. on 64-bit platform, the expression is offset = ((unsigned long)336 - (int)340) >> 3 = (int)(0xfffffffffffffffcUL >> 3) = -1 but on 32-bit platform, the expression is offset = ((unsigned long)336 - (int)340) >> 3 = (int)(0xfffffffcUL >> 3) = 0x1fffffff instead. so offset becomes a large positive number on 32-bit platform, and cause xfs_attr_shortform_bytesfit() returns maxforkoff rather than 0. Therefore, one result is "ASSERT(new_size <= XFS_IFORK_SIZE(ip, whichfork));" assertion failure in xfs_idata_realloc(), which was also the root cause of the original bugreport from Dennis, see: https://bugzilla.redhat.com/show_bug.cgi?id=1894177 And it can also be manually triggered with the following commands: $ touch a; $ setfattr -n user.0 -v "`seq 0 80`" a; $ setfattr -n user.1 -v "`seq 0 80`" a on 32-bit platform. Fix the case in xfs_attr_shortform_bytesfit() by bailing out "XFS_LITINO(mp) < bytes" in advance suggested by Eric and a misleading comment together with this bugfix suggested by Darrick. It seems the other users of XFS_LITINO(mp) are not impacted. Fixes: e9e2eae89ddb ("xfs: only check the superblock version for dinode size calculation") Cc: <stable@vger.kernel.org> # 5.7+ Reported-and-tested-by: Dennis Gilmore <dgilmore@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20xfs: consider shutdown in bmapbt cursor delete assertBrian Foster
commit 1cd738b13ae9b29e03d6149f0246c61f76e81fcf upstream. [ Slightly modify fs/xfs/libxfs/xfs_btree.c to resolve merge conflicts ] The assert in xfs_btree_del_cursor() checks that the bmapbt block allocation field has been handled correctly before the cursor is freed. This field is used for accurate calculation of indirect block reservation requirements (for delayed allocations), for example. generic/019 reproduces a scenario where this assert fails because the filesystem has shutdown while in the middle of a bmbt record insertion. This occurs after a bmbt block has been allocated via the cursor but before the higher level bmap function (i.e. xfs_bmap_add_extent_hole_real()) completes and resets the field. Update the assert to accommodate the transient state if the filesystem has shutdown. While here, clean up the indentation and comments in the function. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20xfs: remove the di_version field from struct icdinodeChristoph Hellwig
commit 6471e9c5e7a109a952be8e3e80b8d9e262af239d upstream. We know the version is 3 if on a v5 file system. For earlier file systems formats we always upgrade the remaining v1 inodes to v2 and thus only use v2 inodes. Use the xfs_sb_version_has_large_dinode helper to check if we deal with small or large dinodes, and thus remove the need for the di_version field in struct icdinode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20xfs: only check the superblock version for dinode size calculationChristoph Hellwig
commit e9e2eae89ddb658ea332295153fdca78c12c1e0d upstream. The size of the dinode structure is only dependent on the file system version, so instead of checking the individual inode version just use the newly added xfs_sb_version_has_large_dinode helper, and simplify various calling conventions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20xfs: add a new xfs_sb_version_has_v3inode helperChristoph Hellwig
commit b81b79f4eda2ea98ae5695c0b6eb384c8d90b74d upstream. Add a new wrapper to check if a file system supports the v3 inode format with a larger dinode core. Previously we used xfs_sb_version_hascrc for that, which is technically correct but a little confusing to read. Also move xfs_dinode_good_version next to xfs_sb_version_has_v3inode so that we have one place that documents the superblock version to inode version relationship. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20xfs: remove the kuid/kgid conversion wrappersChristoph Hellwig
commit ba8adad5d036733d240fa8a8f4d055f3d4490562 upstream. Remove the XFS wrappers for converting from and to the kuid/kgid types. Mostly this means switching to VFS i_{u,g}id_{read,write} helpers, but in a few spots the calls to the conversion functions is open coded. To match the use of sb->s_user_ns in the helpers and other file systems, sb->s_user_ns is also used in the quota code. The ACL code already does the conversion in a grotty layering violation in the VFS xattr code, so it keeps using init_user_ns for the identity mapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20xfs: remove the icdinode di_uid/di_gid membersChristoph Hellwig
commit 542951592c99ff7b15c050954c051dd6dd6c0f97 upstream. Use the Linux inode i_uid/i_gid members everywhere and just convert from/to the scalar value when reading or writing the on-disk inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20xfs: ensure that the inode uid/gid match values match the icdinode onesChristoph Hellwig
commit 3d8f2821502d0b60bac2789d0bea951fda61de0c upstream. Instead of only synchronizing the uid/gid values in xfs_setup_inode, ensure that they always match to prepare for removing the icdinode fields. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20xfs: merge the projid fields in struct xfs_icdinodeChristoph Hellwig
commit de7a866fd41b227b0aa6e9cbeb0dae221c12f542 upstream. There is no point in splitting the fields like this in an purely in-memory structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: only relog deferred intent items if free space in the log gets lowDarrick J. Wong
commit 74f4d6a1e065c92428c5b588099e307a582d79d9 upstream. Now that we have the ability to ask the log how far the tail needs to be pushed to maintain its free space targets, augment the decision to relog an intent item so that we only do it if the log has hit the 75% full threshold. There's no point in relogging an intent into the same checkpoint, and there's no need to relog if there's plenty of free space in the log. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: periodically relog deferred intent itemsDarrick J. Wong
commit 4e919af7827a6adfc28e82cd6c4ffcfcc3dd6118 upstream. [ Modify xfs_{bmap|extfree|refcount|rmap}_item.c to fix merge conflicts ] There's a subtle design flaw in the deferred log item code that can lead to pinning the log tail. Taking up the defer ops chain examples from the previous commit, we can get trapped in sequences like this: Caller hands us a transaction t0 with D0-D3 attached. The defer ops chain will look like the following if the transaction rolls succeed: t1: D0(t0), D1(t0), D2(t0), D3(t0) t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) t3: d5(t1), D1(t0), D2(t0), D3(t0) ... t9: d9(t7), D3(t0) t10: D3(t0) t11: d10(t10), d11(t10) t12: d11(t10) In transaction 9, we finish d9 and try to roll to t10 while holding onto an intent item for D3 that we logged in t0. The previous commit changed the order in which we place new defer ops in the defer ops processing chain to reduce the maximum chain length. Now make xfs_defer_finish_noroll capable of relogging the entire chain periodically so that we can always move the log tail forward. Most chains will never get relogged, except for operations that generate very long chains (large extents containing many blocks with different sharing levels) or are on filesystems with small logs and a lot of ongoing metadata updates. Callers are now required to ensure that the transaction reservation is large enough to handle logging done items and new intent items for the maximum possible chain length. Most callers are careful to keep the chain lengths low, so the overhead should be minimal. The decision to relog an intent item is made based on whether the intent was logged in a previous checkpoint, since there's no point in relogging an intent into the same checkpoint. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: change the order in which child and parent defer ops are finishedDarrick J. Wong
commit 27dada070d59c28a441f1907d2cec891b17dcb26 upstream. The defer ops code has been finishing items in the wrong order -- if a top level defer op creates items A and B, and finishing item A creates more defer ops A1 and A2, we'll put the new items on the end of the chain and process them in the order A B A1 A2. This is kind of weird, since it's convenient for programmers to be able to think of A and B as an ordered sequence where all the sub-tasks for A must finish before we move on to B, e.g. A A1 A2 D. Right now, our log intent items are not so complex that this matters, but this will become important for the atomic extent swapping patchset. In order to maintain correct reference counting of extents, we have to unmap and remap extents in that order, and we want to complete that work before moving on to the next range that the user wants to swap. This patch fixes defer ops to satsify that requirement. The primary symptom of the incorrect order was noticed in an early performance analysis of the atomic extent swap code. An astonishingly large number of deferred work items accumulated when userspace requested an atomic update of two very fragmented files. The cause of this was traced to the same ordering bug in the inner loop of xfs_defer_finish_noroll. If the ->finish_item method of a deferred operation queues new deferred operations, those new deferred ops are appended to the tail of the pending work list. To illustrate, say that a caller creates a transaction t0 with four deferred operations D0-D3. The first thing defer ops does is roll the transaction to t1, leaving us with: t1: D0(t0), D1(t0), D2(t0), D3(t0) Let's say that finishing each of D0-D3 will create two new deferred ops. After finish D0 and roll, we'll have the following chain: t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1) d4 and d5 were logged to t1. Notice that while we're about to start work on D1, we haven't actually completed all the work implied by D0 being finished. So far we've been careful (or lucky) to structure the dfops callers such that D1 doesn't depend on d4 or d5 being finished, but this is a potential logic bomb. There's a second problem lurking. Let's see what happens as we finish D1-D3: t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2) t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3) t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) Let's say that d4-d11 are simple work items that don't queue any other operations, which means that we can complete each d4 and roll to t6: t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) ... t11: d10(t4), d11(t4) t12: d11(t4) <done> When we try to roll to transaction #12, we're holding defer op d11, which we logged way back in t4. This means that the tail of the log is pinned at t4. If the log is very small or there are a lot of other threads updating metadata, this means that we might have wrapped the log and cannot get roll to t11 because there isn't enough space left before we'd run into t4. Let's shift back to the original failure. I mentioned before that I discovered this flaw while developing the atomic file update code. In that scenario, we have a defer op (D0) that finds a range of file blocks to remap, creates a handful of new defer ops to do that, and then asks to be continued with however much work remains. So, D0 is the original swapext deferred op. The first thing defer ops does is rolls to t1: t1: D0(t0) We try to finish D0, logging d1 and d2 in the process, but can't get all the work done. We log a done item and a new intent item for the work that D0 still has to do, and roll to t2: t2: D0'(t1), d1(t1), d2(t1) We roll and try to finish D0', but still can't get all the work done, so we log a done item and a new intent item for it, requeue D0 a second time, and roll to t3: t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2) If it takes 48 more rolls to complete D0, then we'll finally dispense with D0 in t50: t50: D<fifty primes>(t49), d1(t1), ..., d102(t50) We then try to roll again to get a chain like this: t51: d1(t1), d2(t1), ..., d101(t50), d102(t50) ... t152: d102(t50) <done> Notice that in rolling to transaction #51, we're holding on to a log intent item for d1 that was logged in transaction #1. This means that the tail of the log is pinned at t1. If the log is very small or there are a lot of other threads updating metadata, this means that we might have wrapped the log and cannot roll to t51 because there isn't enough space left before we'd run into t1. This is of course problem #2 again. But notice the third problem with this scenario: we have 102 defer ops tied to this transaction! Each of these items are backed by pinned kernel memory, which means that we risk OOM if the chains get too long. Yikes. Problem #1 is a subtle logic bomb that could hit someone in the future; problem #2 applies (rarely) to the current upstream, and problem This is not how incremental deferred operations were supposed to work. The dfops design of logging in the same transaction an intent-done item and a new intent item for the work remaining was to make it so that we only have to juggle enough deferred work items to finish that one small piece of work. Deferred log item recovery will find that first unfinished work item and restart it, no matter how many other intent items might follow it in the log. Therefore, it's ok to put the new intents at the start of the dfops chain. For the first example, the chains look like this: t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) t3: d5(t1), D1(t0), D2(t0), D3(t0) ... t9: d9(t7), D3(t0) t10: D3(t0) t11: d10(t10), d11(t10) t12: d11(t10) For the second example, the chains look like this: t1: D0(t0) t2: d1(t1), d2(t1), D0'(t1) t3: d2(t1), D0'(t1) t4: D0'(t1) t5: d1(t4), d2(t4), D0''(t4) ... t148: D0<50 primes>(t147) t149: d101(t148), d102(t148) t150: d102(t148) <done> This actually sucks more for pinning the log tail (we try to roll to t10 while holding an intent item that was logged in t1) but we've solved problem #1. We've also reduced the maximum chain length from: sum(all the new items) + nr_original_items to: max(new items that each original item creates) + nr_original_items This solves problem #3 by sharply reducing the number of defer ops that can be attached to a transaction at any given time. The change makes the problem of log tail pinning worse, but is improvement we need to solve problem #2. Actually solving #2, however, is left to the next patch. Note that a subsequent analysis of some hard-to-trigger reflink and COW livelocks on extremely fragmented filesystems (or systems running a lot of IO threads) showed the same symptoms -- uncomfortably large numbers of incore deferred work items and occasional stalls in the transaction grant code while waiting for log reservations. I think this patch and the next one will also solve these problems. As originally written, the code used list_splice_tail_init instead of list_splice_init, so change that, and leave a short comment explaining our actions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: fix an incore inode UAF in xfs_bui_recoverDarrick J. Wong
commit ff4ab5e02a0447dd1e290883eb6cd7d94848e590 upstream. In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: xfs_defer_capture should absorb remaining transaction reservationDarrick J. Wong
commit 929b92f64048d90d23e40a59c47adf59f5026903 upstream. When xfs_defer_capture extracts the deferred ops and transaction state from a transaction, it should record the transaction reservation type from the old transaction so that when we continue the dfops chain, we still use the same reservation parameters. Doing this means that the log item recovery functions get to determine the transaction reservation instead of abusing tr_itruncate in yet another part of xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: xfs_defer_capture should absorb remaining block reservationsDarrick J. Wong
commit 4f9a60c48078c0efa3459678fa8d6e050e8ada5d upstream. When xfs_defer_capture extracts the deferred ops and transaction state from a transaction, it should record the remaining block reservations so that when we continue the dfops chain, we can reserve the same number of blocks to use. We capture the reservations for both data and realtime volumes. This adds the requirement that every log intent item recovery function must be careful to reserve enough blocks to handle both itself and all defer ops that it can queue. On the other hand, this enables us to do away with the handwaving block estimation nonsense that was going on in xlog_finish_defer_ops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: proper replay of deferred ops queued during log recoveryDarrick J. Wong
commit e6fff81e487089e47358a028526a9f63cdbcd503 upstream. When we replay unfinished intent items that have been recovered from the log, it's possible that the replay will cause the creation of more deferred work items. As outlined in commit 509955823cc9c ("xfs: log recovery should replay deferred ops in order"), later work items have an implicit ordering dependency on earlier work items. Therefore, recovery must replay the items (both recovered and created) in the same order that they would have been during normal operation. For log recovery, we enforce this ordering by using an empty transaction to collect deferred ops that get created in the process of recovering a log intent item to prevent them from being committed before the rest of the recovered intent items. After we finish committing all the recovered log items, we allocate a transaction with an enormous block reservation, splice our huge list of created deferred ops into that transaction, and commit it, thereby finishing all those ops. This is /really/ hokey -- it's the one place in XFS where we allow nested transactions; the splicing of the defer ops list is is inelegant and has to be done twice per recovery function; and the broken way we handle inode pointers and block reservations cause subtle use-after-free and allocator problems that will be fixed by this patch and the two patches after it. Therefore, replace the hokey empty transaction with a structure designed to capture each chain of deferred ops that are created as part of recovering a single unfinished log intent. Finally, refactor the loop that replays those chains to do so using one transaction per chain. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: log new intent items created as part of finishing recovered intent itemsDarrick J. Wong
commit 93293bcbde93567efaf4e6bcd58cad270e1fcbf5 upstream. [Slightly edit fs/xfs/xfs_bmap_item.c & fs/xfs/xfs_refcount_item.c to resolve merge conflicts] During a code inspection, I found a serious bug in the log intent item recovery code when an intent item cannot complete all the work and decides to requeue itself to get that done. When this happens, the item recovery creates a new incore deferred op representing the remaining work and attaches it to the transaction that it allocated. At the end of _item_recover, it moves the entire chain of deferred ops to the dummy parent_tp that xlog_recover_process_intents passed to it, but fail to log a new intent item for the remaining work before committing the transaction for the single unit of work. xlog_finish_defer_ops logs those new intent items once recovery has finished dealing with the intent items that it recovered, but this isn't sufficient. If the log is forced to disk after a recovered log item decides to requeue itself and the system goes down before we call xlog_finish_defer_ops, the second log recovery will never see the new intent item and therefore has no idea that there was more work to do. It will finish recovery leaving the filesystem in a corrupted state. The same logic applies to /any/ deferred ops added during intent item recovery, not just the one handling the remaining work. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: refactor xfs_defer_finish_norollChristoph Hellwig
commit bb47d79750f1a68a75d4c7defc2da934ba31de14 upstream. Split out a helper that operates on a single xfs_defer_pending structure to untangle the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: turn dfp_intent into a xfs_log_itemChristoph Hellwig
commit 13a8333339072b8654c1d2c75550ee9f41ee15de upstream. All defer op instance place their own extension of the log item into the dfp_intent field. Replace that with a xfs_log_item to improve type safety and make the code easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: merge the ->diff_items defer op into ->create_intentChristoph Hellwig
commit d367a868e46b025a8ced8e00ef2b3a3c2f3bf732 upstream. This avoids a per-item indirect call, and also simplifies the interface a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: merge the ->log_item defer op into ->create_intentChristoph Hellwig
commit c1f09188e8de0ae65433cb9c8ace4feb66359bcc upstream. These are aways called together, and my merging them we reduce the amount of indirect calls, improve type safety and in general clean up the code a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: factor out a xfs_defer_create_intent helperChristoph Hellwig
commit e046e949486ec92d83b2ccdf0e7e9144f74ef028 upstream. Create a helper that encapsulates the whole logic to create a defer intent. This reorders some of the work that was done, but none of that has an affect on the operation as only fields that don't directly interact are affected. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22xfs: remove the xfs_inode_log_item_t typedefChristoph Hellwig
commit fd9cbe51215198ccffa64169c98eae35b0916088 upstream. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-25xfs: redesign the reflink remap loop to fix blkres depletion crashDarrick J. Wong
commit 00fd1d56dd08a8ceaa9e4ee1a41fefd9f6c6bc7d upstream. The existing reflink remapping loop has some structural problems that need addressing: The biggest problem is that we create one transaction for each extent in the source file without accounting for the number of mappings there are for the same range in the destination file. In other words, we don't know the number of remap operations that will be necessary and we therefore cannot guess the block reservation required. On highly fragmented filesystems (e.g. ones with active dedupe) we guess wrong, run out of block reservation, and fail. The second problem is that we don't actually use the bmap intents to their full potential -- instead of calling bunmapi directly and having to deal with its backwards operation, we could call the deferred ops xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This makes the frontend loop much simpler. Solve all of these problems by refactoring the remapping loops so that we only perform one remapping operation per transaction, and each operation only tries to remap a single extent from source to dest. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reported-by: Edwin Török <edwin@etorok.net> Tested-by: Edwin Török <edwin@etorok.net> Acked-by: Darrick J. Wong <djwong@kernel.org> [backported to 5.4.y - Tested-by above does not refer to the backport] Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-25xfs: rename xfs_bmap_is_real_extent to is_written_extentDarrick J. Wong
commit 877f58f53684f14ca3202640f70592bf44890924 upstream. [ Slightly modify fs/xfs/libxfs/xfs_rtbitmap.c & fs/xfs/xfs_reflink.c to resolve merge conflict ] The name of this predicate is a little misleading -- it decides if the extent mapping is allocated and written. Change the name to be more direct, as we're going to add a new predicate in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-25xfs: preserve rmapbt swapext block reservation from freed blocksBrian Foster
commit f74681ba2006434be195402e0b15fc5763cddd7e upstream. [Slightly modify xfs_trans_alloc() to fix a merge conflict due to missing "atomic_inc(&mp->m_active_trans)" statement in v5.9 kernel] The rmapbt extent swap algorithm remaps individual extents between the source inode and the target to trigger reverse mapping metadata updates. If either inode straddles a format or other bmap allocation boundary, the individual unmap and map cycles can trigger repeated bmap block allocations and frees as the extent count bounces back and forth across the boundary. While net block usage is bound across the swap operation, this behavior can prematurely exhaust the transaction block reservation because it continuously drains as the transaction rolls. Each allocation accounts against the reservation and each free returns to global free space on transaction roll. The previous workaround to this problem attempted to detect this boundary condition and provide surplus block reservation to acommodate it. This is insufficient because more remaps can occur than implied by the extent counts; if start offset boundaries are not aligned between the two inodes, for example. To address this problem more generically and dynamically, add a transaction accounting mode that returns freed blocks to the transaction reservation instead of the superblock counters on transaction roll and use it when the rmapbt based algorithm is active. This allows the chain of remap transactions to preserve the block reservation based own its own frees and prevent premature exhaustion regardless of the remap pattern. Note that this is only safe for superblocks with lazy sb accounting, but the latter is required for v5 supers and the rmap feature depends on v5. Fixes: b3fed434822d0 ("xfs: account format bouncing into rmapbt swapext tx reservation") Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-10xfs: use ordered buffers to initialize dquot buffers during quotacheckDarrick J. Wong
commit 78bba5c812cc651cee51b64b786be926ab7fe2a9 upstream. While QAing the new xfs_repair quotacheck code, I uncovered a quota corruption bug resulting from a bad interaction between dquot buffer initialization and quotacheck. The bug can be reproduced with the following sequence: # mkfs.xfs -f /dev/sdf # mount /dev/sdf /opt -o usrquota # su nobody -s /bin/bash -c 'touch /opt/barf' # sync # xfs_quota -x -c 'report -ahi' /opt User quota on /opt (/dev/sdf) Inodes User ID Used Soft Hard Warn/Grace ---------- --------------------------------- root 3 0 0 00 [------] nobody 1 0 0 00 [------] # xfs_io -x -c 'shutdown' /opt # umount /opt # mount /dev/sdf /opt -o usrquota # touch /opt/man2 # xfs_quota -x -c 'report -ahi' /opt User quota on /opt (/dev/sdf) Inodes User ID Used Soft Hard Warn/Grace ---------- --------------------------------- root 1 0 0 00 [------] nobody 1 0 0 00 [------] # umount /opt Notice how the initial quotacheck set the root dquot icount to 3 (rootino, rbmino, rsumino), but after shutdown -> remount -> recovery, xfs_quota reports that the root dquot has only 1 icount. We haven't deleted anything from the filesystem, which means that quota is now under-counting. This behavior is not limited to icount or the root dquot, but this is the shortest reproducer. I traced the cause of this discrepancy to the way that we handle ondisk dquot updates during quotacheck vs. regular fs activity. Normally, when we allocate a disk block for a dquot, we log the buffer as a regular (dquot) buffer. Subsequent updates to the dquots backed by that block are done via separate dquot log item updates, which means that they depend on the logged buffer update being written to disk before the dquot items. Because individual dquots have their own LSN fields, that initial dquot buffer must always be recovered. However, the story changes for quotacheck, which can cause dquot block allocations but persists the final dquot counter values via a delwri list. Because recovery doesn't gate dquot buffer replay on an LSN, this means that the initial dquot buffer can be replayed over the (newer) contents that were delwritten at the end of quotacheck. In effect, this re-initializes the dquot counters after they've been updated. If the log does not contain any other dquot items to recover, the obsolete dquot contents will not be corrected by log recovery. Because quotacheck uses a transaction to log the setting of the CHKD flags in the superblock, we skip quotacheck during the second mount call, which allows the incorrect icount to remain. Fix this by changing the ondisk dquot initialization function to use ordered buffers to write out fresh dquot blocks if it detects that we're running quotacheck. If the system goes down before quotacheck can complete, the CHKD flags will not be set in the superblock and the next mount will run quotacheck again, which can fix uninitialized dquot buffers. This requires amending the defer code to maintaine ordered buffer state across defer rolls for the sake of the dquot allocation code. For regular operations we preserve the current behavior since the dquot items require properly initialized ondisk dquot records. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-10xfs: don't fail verifier on empty attr3 leaf blockBrian Foster
commit f28cef9e4daca11337cb9f144cdebedaab69d78c upstream. The attr fork can transition from shortform to leaf format while empty if the first xattr doesn't fit in shortform. While this empty leaf block state is intended to be transient, it is technically not due to the transactional implementation of the xattr set operation. We historically have a couple of bandaids to work around this problem. The first is to hold the buffer after the format conversion to prevent premature writeback of the empty leaf buffer and the second is to bypass the xattr count check in the verifier during recovery. The latter assumes that the xattr set is also in the log and will be recovered into the buffer soon after the empty leaf buffer is reconstructed. This is not guaranteed, however. If the filesystem crashes after the format conversion but before the xattr set that induced it, only the format conversion may exist in the log. When recovered, this creates a latent corrupted state on the inode as any subsequent attempts to read the buffer fail due to verifier failure. This includes further attempts to set xattrs on the inode or attempts to destroy the attr fork, which prevents the inode from ever being removed from the unlinked list. To avoid this condition, accept that an empty attr leaf block is a valid state and remove the count check from the verifier. This means that on rare occasions an attr fork might exist in an unexpected state, but is otherwise consistent and functional. Note that we retain the logic to avoid racing with metadata writeback to reduce the window where this can occur. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-29xfs: remove the xfs_qoff_logitem_t typedefPavel Reichl
commit d0bdfb106907e4a3ef4f25f6d27e392abf41f3a0 upstream. Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix a comment] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-29xfs: remove the xfs_disk_dquot_t and xfs_dquot_tPavel Reichl
commit aefe69a45d84901c702f87672ec1e93de1d03f73 upstream. Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix some of the comments] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-29xfs: check owner of dir3 blocksDarrick J. Wong
commit 1b2c1a63b678d63e9c98314d44413f5af79c9c80 upstream. Check the owner field of dir3 block headers. If it's corrupt, release the buffer and return EFSCORRUPTED. All callers handle this properly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-29xfs: check owner of dir3 data blocksDarrick J. Wong
commit a10c21ed5d5241d11cf1d5a4556730840572900b upstream. [Slightly edit xfs_dir3_data_read() to work with existing mapped_bno argument instead of flag values introduced in later kernels] Check the owner field of dir3 data block headers. If it's corrupt, release the buffer and return EFSCORRUPTED. All callers handle this properly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-29xfs: fix buffer corruption reporting when xfs_dir3_free_header_check failsDarrick J. Wong
commit ce99494c9699df58b31d0a839e957f86cd58c755 upstream. xfs_verifier_error is supposed to be called on a corrupt metadata buffer from within a buffer verifier function, whereas xfs_buf_mark_corrupt is the function to be called when a piece of code has read a buffer and catches something that a read verifier cannot. The first function sets b_error anticipating that the low level buffer handling code will see the nonzero b_error and clear XBF_DONE on the buffer, whereas the second function does not. Since xfs_dir3_free_header_check examines fields in the dir free block header that require more context than can be provided to read verifiers, we must call xfs_buf_mark_corrupt when it finds a problem. Switching the calls has a secondary effect that we no longer corrupt the buffer state by setting b_error and leaving XBF_DONE set. When /that/ happens, we'll trip over various state assertions (most commonly the b_error check in xfs_buf_reverify) on a subsequent attempt to read the buffer. Fixes: bc1a09b8e334bf5f ("xfs: refactor verifier callers to print address of failing check") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-29xfs: add a function to deal with corrupt buffers post-verifiersDarrick J. Wong
commit 8d57c21600a514d7a9237327c2496ae159bab5bb upstream. Add a helper function to get rid of buffers that we have decided are corrupt after the verifiers have run. This function is intended to handle metadata checks that can't happen in the verifiers, such as inter-block relationship checking. Note that we now mark the buffer stale so that it will not end up on any LRU and will be purged on release. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-29xfs: open code insert range extent split helperBrian Foster
commit b73df17e4c5ba977205253fb7ef54267717a3cba upstream. The insert range operation currently splits the extent at the target offset in a separate transaction and lock cycle from the one that shifts extents. In preparation for reworking insert range into an atomic operation, lift the code into the caller so it can be easily condensed to a single rolling transaction and lock cycle and eliminate the helper. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-07xfs: streamline xfs_attr3_leaf_inactiveDarrick J. Wong
commit 0bb9d159bd018b271e783d3b2d3bc82fa0727321 upstream. Now that we know we don't have to take a transaction to stale the incore buffers for a remote value, get rid of the unnecessary memory allocation in the leaf walker and call the rmt_stale function directly. Flatten the loop while we're at it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-07xfs: move incore structures out of xfs_da_format.hChristoph Hellwig
commit a39f089a25e75c3d17b955d8eb8bc781f23364f3 upstream. Move the abstract in-memory version of various btree block headers out of xfs_da_format.h as they aren't on-disk formats. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-07xfs: fix memory corruption during remote attr value buffer invalidationDarrick J. Wong
commit e8db2aafcedb7d88320ab83f1000f1606b26d4d7 upstream. [Replaced XFS_IS_CORRUPT() calls with ASSERT() for 5.4.y backport] While running generic/103, I observed what looks like memory corruption and (with slub debugging turned on) a slub redzone warning on i386 when inactivating an inode with a 64k remote attr value. On a v5 filesystem, maximally sized remote attr values require one block more than 64k worth of space to hold both the remote attribute value header (64 bytes). On a 4k block filesystem this results in a 68k buffer; on a 64k block filesystem, this would be a 128k buffer. Note that even though we'll never use more than 65,600 bytes of this buffer, XFS_MAX_BLOCKSIZE is 64k. This is a problem because the definition of struct xfs_buf_log_format allows for XFS_MAX_BLOCKSIZE worth of dirty bitmap (64k). On i386 when we invalidate a remote attribute, xfs_trans_binval zeroes all 68k worth of the dirty map, writing right off the end of the log item and corrupting memory. We've gotten away with this on x86_64 for years because the compiler inserts a u32 padding on the end of struct xfs_buf_log_format. Fortunately for us, remote attribute values are written to disk with xfs_bwrite(), which is to say that they are not logged. Fix the problem by removing all places where we could end up creating a buffer log item for a remote attribute value and leave a note explaining why. Next, replace the open-coded buffer invalidation with a call to the helper we created in the previous patch that does better checking for bad metadata before marking the buffer stale. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-07xfs: refactor remote attr value buffer invalidationDarrick J. Wong
commit 8edbb26b06023de31ad7d4c9b984d99f66577929 upstream. [Replaced XFS_IS_CORRUPT() calls with ASSERT() for 5.4.y backport] Hoist the code that invalidates remote extended attribute value buffers into a separate helper function. This prepares us for a memory corruption fix in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-07xfs: introduce XFS_MAX_FILEOFFDarrick J. Wong
commit a5084865524dee1fe8ea1fee17c60b4369ad4f5e upstream. Introduce a new #define for the maximum supported file block offset. We'll use this in the next patch to make it more obvious that we're doing some operation for all possible inode fork mappings after a given offset. We can't use ULLONG_MAX here because bunmapi uses that to detect when it's done. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-07xfs: fix misuse of the XFS_ATTR_INCOMPLETE flagChristoph Hellwig
commit 780d29057781d986cd87dbbe232cd02876ad430f upstream. XFS_ATTR_INCOMPLETE is a flag in the on-disk attribute format, and thus in a different namespace as the ATTR_* flags in xfs_da_args.flags. Switch to using a XFS_DA_OP_INCOMPLETE flag in op_flags instead. Without this users might be able to inject this flag into operations using the attr by handle ioctl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28xfs: fix an ABBA deadlock in xfs_renameDarrick J. Wong
commit 6da1b4b1ab36d80a3994fd4811c8381de10af604 upstream. When overlayfs is running on top of xfs and the user unlinks a file in the overlay, overlayfs will create a whiteout inode and ask xfs to "rename" the whiteout file atop the one being unlinked. If the file being unlinked loses its one nlink, we then have to put the inode on the unlinked list. This requires us to grab the AGI buffer of the whiteout inode to take it off the unlinked list (which is where whiteouts are created) and to grab the AGI buffer of the file being deleted. If the whiteout was created in a higher numbered AG than the file being deleted, we'll lock the AGIs in the wrong order and deadlock. Therefore, grab all the AGI locks we think we'll need ahead of time, and in order of increasing AG number per the locking rules. Reported-by: wenli xie <wlxie7296@gmail.com> Fixes: 93597ae8dac0 ("xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28xfs: don't commit sunit/swidth updates to disk if that would cause repair ↵Darrick J. Wong
failures commit 13eaec4b2adf2657b8167b67e27c97cc7314d923 upstream. Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit and swidth values could cause xfs_repair to fail loudly. The problem here is that repair calculates the where mkfs should have allocated the root inode, based on the superblock geometry. The allocation decisions depend on sunit, which means that we really can't go updating sunit if it would lead to a subsequent repair failure on an otherwise correct filesystem. Port from xfs_repair some code that computes the location of the root inode and teach mount to skip the ondisk update if it would cause problems for repair. Along the way we'll update the documentation, provide a function for computing the minimum AGFL size instead of open-coding it, and cut down some indenting in the mount code. Note that we allow the mount to proceed (and new allocations will reflect this new geometry) because we've never screened this kind of thing before. We'll have to wait for a new future incompat feature to enforce correct behavior, alas. Note that the geometry reporting always uses the superblock values, not the incore ones, so that is what xfs_info and xfs_growfs will report. [1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d Reported-by: Alex Lyakas <alex@zadara.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28xfs: refactor agfl length computation functionDarrick J. Wong
commit 1cac233cfe71f21e069705a4930c18e48d897be6 upstream. Refactor xfs_alloc_min_freelist to accept a NULL @pag argument, in which case it returns the largest possible minimum length. This will be used in an upcoming patch to compute the length of the AGFL at mkfs time. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28xfs: stabilize insert range start boundary to avoid COW writeback raceBrian Foster
commit d0c2204135a0cdbc607c94c481cf1ccb2f659aa7 upstream. generic/522 (fsx) occasionally fails with a file corruption due to an insert range operation. The primary characteristic of the corruption is a misplaced insert range operation that differs from the requested target offset. The reason for this behavior is a race between the extent shift sequence of an insert range and a COW writeback completion that causes a front merge with the first extent in the shift. The shift preparation function flushes and unmaps from the target offset of the operation to the end of the file to ensure no modifications can be made and page cache is invalidated before file data is shifted. An insert range operation then splits the extent at the target offset, if necessary, and begins to shift the start offset of each extent starting from the end of the file to the start offset. The shift sequence operates at extent level and so depends on the preparation sequence to guarantee no changes can be made to the target range during the shift. If the block immediately prior to the target offset was dirty and shared, however, it can undergo writeback and move from the COW fork to the data fork at any point during the shift. If the block is contiguous with the block at the start offset of the insert range, it can front merge and alter the start offset of the extent. Once the shift sequence reaches the target offset, it shifts based on the latest start offset and silently changes the target offset of the operation and corrupts the file. To address this problem, update the shift preparation code to stabilize the start boundary along with the full range of the insert. Also update the existing corruption check to fail if any extent is shifted with a start offset behind the target offset of the insert range. This prevents insert from racing with COW writeback completion and fails loudly in the event of an unexpected extent shift. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28xfs: always log corruption errorsDarrick J. Wong
commit a5155b870d687de1a5f07e774b49b1e8ef0f6f50 upstream. Make sure we log something to dmesg whenever we return -EFSCORRUPTED up the call stack. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()kaixuxia
commit 93597ae8dac0149b5c00b787cba6bf7ba213e666 upstream. When target_ip exists in xfs_rename(), the xfs_dir_replace() call may need to hold the AGF lock to allocate more blocks, and then invoking the xfs_droplink() call to hold AGI lock to drop target_ip onto the unlinked list, so we get the lock order AGF->AGI. This would break the ordering constraint on AGI and AGF locking - inode allocation locks the AGI, then can allocate a new extent for new inodes, locking the AGF after the AGI. In this patch we check whether the replace operation need more blocks firstly. If so, acquire the agi lock firstly to preserve locking order(AGI/AGF). Actually, the locking order problem only occurs when we are locking the AGI/AGF of the same AG. For multiple AGs the AGI lock will be released after the transaction committed. Signed-off-by: kaixuxia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: reword the comment] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28xfs: replace -EIO with -EFSCORRUPTED for corrupt metadataDarrick J. Wong
commit c2414ad6e66ab96b867309454498f7fb29b7e855 upstream. There are a few places where we return -EIO instead of -EFSCORRUPTED when we find corrupt metadata. Fix those places. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30writeback: Drop I_DIRTY_TIME_EXPIREJan Kara
commit 5fcd57505c002efc5823a7355e21f48dd02d5a51 upstream. The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that inode got there because flush worker decided it's time to writeback the dirty inode time stamps (either because we are syncing or because of age). However we can detect this directly in __writeback_single_inode() and there's no need for the strange propagation with I_DIRTY_TIME_EXPIRE flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-24xfs: revert "xfs: fix rmap key and record comparison functions"Darrick J. Wong
[ Upstream commit eb8409071a1d47e3593cfe077107ac46853182ab ] This reverts commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f. Your maintainer committed a major braino in the rmap code by adding the attr fork, bmbt, and unwritten extent usage bits into rmap record key comparisons. While XFS uses the usage bits *in the rmap records* for cross-referencing metadata in xfs_scrub and xfs_repair, it only needs the owner and offset information to distinguish between reverse mappings of the same physical extent into the data fork of a file at multiple offsets. The other bits are not important for key comparisons for index lookups, and never have been. Eric Sandeen reports that this causes regressions in generic/299, so undo this patch before it does more damage. Reported-by: Eric Sandeen <sandeen@sandeen.net> Fixes: 6ff646b2ceb0 ("xfs: fix rmap key and record comparison functions") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>