From 2e588a46aace858b2baad755c06c66235e152235 Mon Sep 17 00:00:00 2001
From: Brian Foster <bfoster@redhat.com>
Date: Mon, 1 Jun 2015 07:15:23 +1000
Subject: xfs: always log the inode on unwritten extent conversion

The fsync() requirements for crash consistency on XFS are to flush file
data and force any in-core inode updates to the log. We currently check
whether the inode is pinned to identify whether the log needs to be
forced, since a non-zero pin count generally represents an inode that
has transactions awaiting a flush to the on-disk log.

This is not sufficient in all cases, however. Reports of xfstests test
generic/311 failures on ppc64/s390x hosts have identified failures to
fsync outstanding inode modifications due to the inode not being pinned
at the time of the fsync. This occurs because certain bmap updates can
complete by logging bmapbt buffers but without ever dirtying (and thus
pinning) the core inode. The following is a specific incarnation of this
problem:

$ mount $dev /mnt -o noatime,nobarrier
$ for i in $(seq 0 2 31); do \
        xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
	done
$ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
	hexdump /mnt/file; \
	./xfstests-dev/src/godown /mnt
...
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0014000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000
$ umount /mnt; mount ...
$ hexdump /mnt/file
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000

In short, the unwritten extent conversion for the last write is lost
despite the fact that an fsync executed before the filesystem was
shutdown. Note that this is impossible to reproduce on v5 supers due to
unconditional time callbacks for di_changecount and highly difficult to
reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
reduces timer granularity enough to increase the odds that time updates
are skipped and allows this to reproduce within a handful of attempts.

To deal with this problem, unconditionally log the core in the unwritten
extent conversion path. Fix up logflags after the extent conversion to
keep the extent update code consistent with the other extent update
helpers. This fixup is not necessary for the other (hole, delay) extent
helpers because they execute in the block allocation codepath, which
already logs the inode for other reasons (e.g., for di_nblocks).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

(limited to 'fs/xfs')

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaaac0ec..68e9e233f369 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4417,7 +4417,15 @@ xfs_bmapi_convert_unwritten(
 	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
 			&bma->cur, mval, bma->firstblock, bma->flist,
 			&tmp_logflags);
-	bma->logflags |= tmp_logflags;
+	/*
+	 * Log the inode core unconditionally in the unwritten extent conversion
+	 * path because the conversion might not have done so (e.g., if the
+	 * extent count hasn't changed). We need to make sure the inode is dirty
+	 * in the transaction for the sake of fsync(), even if nothing has
+	 * changed, because fsync() will not force the log for this transaction
+	 * unless it sees the inode pinned.
+	 */
+	bma->logflags |= tmp_logflags | XFS_ILOG_CORE;
 	if (error)
 		return error;
 
-- 
cgit v1.2.3


From 7f884dc198c641c95f5c4325f0d782b1efd298b4 Mon Sep 17 00:00:00 2001
From: Brian Foster <bfoster@redhat.com>
Date: Mon, 1 Jun 2015 07:15:37 +1000
Subject: xfs: fix quota block reservation leak when tp allocates and frees
 blocks

Al Viro reports that generic/231 fails frequently on XFS and bisected
the problem to the following commit:

	5d11fb4b xfs: rework zero range to prevent invalid i_size updates

... which is just the first commit that happens to cause fsx to
reproduce the problem. fsx reproduces via zero range calls. The
aforementioned commit overhauls zero range to use hole punch and
fallocate. As it turns out, the problem is reproducible on demand using
basic hole punch as follows:

$ mkfs.xfs -f -m crc=1,finobt=1 <dev>
$ mount <dev> /mnt -o uquota
$ xfs_io -f -c "falloc 0 50m" /mnt/file
$ for i in $(seq 1 20); do xfs_io -c "fpunch ${i}m 32k" /mnt/file; done
$ rm -f /mnt/file
$ repquota -us /mnt
...
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --     32K      0K      0K              3     0     0

A file is allocated with a single 50m extent. The extent count increases
via hole punches until the bmap converts to btree format. The file is
removed but quota reports 32k of space usage for the user. This
reservation is effectively leaked for the lifetime of the mount.

The reason this occurs is because the quota block reservation tracking
is confused when a transaction happens to free and allocate blocks at
the same time. Consider the following sequence of events:

- tp is allocated from xfs_free_file_space() and reserves several blocks
  for btree management. Blocks are reserved against the dquot and marked
  as such in the transaction (qtrx->qt_blk_res).
- 8 blocks are accounted free when the 32k range is punched out.
  xfs_trans_mod_dquot() is called with XFS_TRANS_DQ_BCOUNT and sets
  ->qt_bcount_delta to -8.
- Subsequently, a block is allocated against the same transaction by
  xfs_bmap_extents_to_btree() for btree conversion. A call to
  xfs_trans_mod_dquot() increases qt_blk_res_used to 1 and qt_bcount_delta
  to -7.
- The transaction is dup'd and committed by xfs_bmap_finish().
  xfs_trans_dup_dqinfo() sets the first transaction up such that it has a
  matching qt_blk_res and qt_blk_res_used of 1. The remaining unused
  reservation is transferred to the duplicate tp.

When the transactions are committed, the dquots are fixed up in
xfs_trans_apply_dquot_deltas() according to one of two methods:

1.) If the transaction holds a block reservation (->qt_blk_res != 0),
_only_ the unused portion reservation is unaccounted from the dquot.
Note that the tp duplication behavior of xfs_bmap_finish() makes it such
that qt_blk_res is typically 0 for tp's with unused reservation.
2.) Otherwise, the dquot is fixed up based on the block delta
(->qt_bcount_delta) created by the transaction.

Therefore, if a transaction has a negative qt_bcount_delta and positive
qt_blk_res_used, the former set of blocks that have been removed from
the file are never factored out of the in-core dquot reservation.
Instead, *_apply_dquot_deltas() sees 1 block used out of a 1 block
reservation and believes there is nothing to fix up. The on-disk
d_bcount is updated independently from qt_bcount_delta, and thus is
correct (and allows the quota usage to correct on remount).

To deal with this situation, we effectively want the "used reservation"
part of the transaction to be consistent with any freed blocks with
respect to quota tracking. For example, if 8 blocks are freed, the
subsequent single block allocation does not need to consume the initial
reservation made by the tp. Instead, it simply borrows one from the
previously freed. One possible implementation of such borrowing is to
avoid the blks_res_used increment when bcount_delta is negative. This
alone is flawed logic in that it only handles the case where blocks are
freed before allocated, however.

Rather than add more complexity to manage synchronization between
bcount_delta and blks_res_used, kill the latter entirely. blk_res_used
is only updated in one place and always in sync with delta_bcount.
Therefore, the net block reservation consumption of the transaction is
always available from bcount_delta. Calculate the reservation
consumption on the fly where necessary based on whether the tp has a
reservation and results in a positive net block delta on the inode.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_quota.h       |  1 -
 fs/xfs/xfs_trans_dquot.c | 28 +++++++++++++++++-----------
 2 files changed, 17 insertions(+), 12 deletions(-)

(limited to 'fs/xfs')

diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index 5376dd406ba2..ce6506adab7b 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -55,7 +55,6 @@ struct xfs_trans;
 typedef struct xfs_dqtrx {
 	struct xfs_dquot *qt_dquot;	  /* the dquot this refers to */
 	ulong		qt_blk_res;	  /* blks reserved on a dquot */
-	ulong		qt_blk_res_used;  /* blks used from the reservation */
 	ulong		qt_ino_res;	  /* inode reserved on a dquot */
 	ulong		qt_ino_res_used;  /* inodes used from the reservation */
 	long		qt_bcount_delta;  /* dquot blk count changes */
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index 76a16df55ef7..58c0c6b2f4ed 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -92,6 +92,7 @@ xfs_trans_dup_dqinfo(
 	xfs_dqtrx_t	*oq, *nq;
 	int		i,j;
 	xfs_dqtrx_t	*oqa, *nqa;
+	ulong		blk_res_used;
 
 	if (!otp->t_dqinfo)
 		return;
@@ -109,11 +110,16 @@ xfs_trans_dup_dqinfo(
 		oqa = otp->t_dqinfo->dqs[j];
 		nqa = ntp->t_dqinfo->dqs[j];
 		for (i = 0; i < XFS_QM_TRANS_MAXDQS; i++) {
+			blk_res_used = 0;
+
 			if (oqa[i].qt_dquot == NULL)
 				break;
 			oq = &oqa[i];
 			nq = &nqa[i];
 
+			if (oq->qt_blk_res && oq->qt_bcount_delta > 0)
+				blk_res_used = oq->qt_bcount_delta;
+
 			nq->qt_dquot = oq->qt_dquot;
 			nq->qt_bcount_delta = nq->qt_icount_delta = 0;
 			nq->qt_rtbcount_delta = 0;
@@ -121,8 +127,8 @@ xfs_trans_dup_dqinfo(
 			/*
 			 * Transfer whatever is left of the reservations.
 			 */
-			nq->qt_blk_res = oq->qt_blk_res - oq->qt_blk_res_used;
-			oq->qt_blk_res = oq->qt_blk_res_used;
+			nq->qt_blk_res = oq->qt_blk_res - blk_res_used;
+			oq->qt_blk_res = blk_res_used;
 
 			nq->qt_rtblk_res = oq->qt_rtblk_res -
 				oq->qt_rtblk_res_used;
@@ -239,10 +245,6 @@ xfs_trans_mod_dquot(
 		 * disk blocks used.
 		 */
 	      case XFS_TRANS_DQ_BCOUNT:
-		if (qtrx->qt_blk_res && delta > 0) {
-			qtrx->qt_blk_res_used += (ulong)delta;
-			ASSERT(qtrx->qt_blk_res >= qtrx->qt_blk_res_used);
-		}
 		qtrx->qt_bcount_delta += delta;
 		break;
 
@@ -423,15 +425,19 @@ xfs_trans_apply_dquot_deltas(
 			 * reservation that a transaction structure knows of.
 			 */
 			if (qtrx->qt_blk_res != 0) {
-				if (qtrx->qt_blk_res != qtrx->qt_blk_res_used) {
-					if (qtrx->qt_blk_res >
-					    qtrx->qt_blk_res_used)
+				ulong blk_res_used = 0;
+
+				if (qtrx->qt_bcount_delta > 0)
+					blk_res_used = qtrx->qt_bcount_delta;
+
+				if (qtrx->qt_blk_res != blk_res_used) {
+					if (qtrx->qt_blk_res > blk_res_used)
 						dqp->q_res_bcount -= (xfs_qcnt_t)
 							(qtrx->qt_blk_res -
-							 qtrx->qt_blk_res_used);
+							 blk_res_used);
 					else
 						dqp->q_res_bcount -= (xfs_qcnt_t)
-							(qtrx->qt_blk_res_used -
+							(blk_res_used -
 							 qtrx->qt_blk_res);
 				}
 			} else {
-- 
cgit v1.2.3


From 39e56d9219dd46d696e6a5e1e24870392154dd0f Mon Sep 17 00:00:00 2001
From: Eric Sandeen <sandeen@redhat.com>
Date: Mon, 1 Jun 2015 07:15:38 +1000
Subject: xfs: don't cast string literals

The commit:

a9273ca5 xfs: convert attr to use unsigned names

added these (unsigned char *) casts, but then the _SIZE macros
return "7" - size of a pointer minus one - not the length of
the string.  This is harmless in the kernel, because the _SIZE
macros are not used, but as we sync up with userspace, this will
matter.

I don't think the cast is necessary; i.e. assigning the string
literal to an unsigned char *, or passing it to a function
expecting an unsigned char *, should be ok, right?

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_format.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'fs/xfs')

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 4daaa662337b..ff22a4d9ad0c 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1453,8 +1453,8 @@ struct xfs_acl {
 		sizeof(struct xfs_acl_entry) * XFS_ACL_MAX_ENTRIES((mp)))
 
 /* On-disk XFS extended attribute names */
-#define SGI_ACL_FILE		(unsigned char *)"SGI_ACL_FILE"
-#define SGI_ACL_DEFAULT		(unsigned char *)"SGI_ACL_DEFAULT"
+#define SGI_ACL_FILE		"SGI_ACL_FILE"
+#define SGI_ACL_DEFAULT		"SGI_ACL_DEFAULT"
 #define SGI_ACL_FILE_SIZE	(sizeof(SGI_ACL_FILE)-1)
 #define SGI_ACL_DEFAULT_SIZE	(sizeof(SGI_ACL_DEFAULT)-1)
 
-- 
cgit v1.2.3


From 339e4f66d1fa00a0161a2c27f3c9aa256d856979 Mon Sep 17 00:00:00 2001
From: Nan Jia <jiananmail@gmail.com>
Date: Mon, 1 Jun 2015 10:50:00 +1000
Subject: xfs: Clean up xfs_trans_dup_dqinfo

Fixed two missing spaces.

Signed-off-by: Nan Jia <jiananmail@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_trans_dquot.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'fs/xfs')

diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index 58c0c6b2f4ed..ce78534a047e 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -90,7 +90,7 @@ xfs_trans_dup_dqinfo(
 	xfs_trans_t	*ntp)
 {
 	xfs_dqtrx_t	*oq, *nq;
-	int		i,j;
+	int		i, j;
 	xfs_dqtrx_t	*oqa, *nqa;
 	ulong		blk_res_used;
 
@@ -103,7 +103,7 @@ xfs_trans_dup_dqinfo(
 	 * Because the quota blk reservation is carried forward,
 	 * it is also necessary to carry forward the DQ_DIRTY flag.
 	 */
-	if(otp->t_flags & XFS_TRANS_DQ_DIRTY)
+	if (otp->t_flags & XFS_TRANS_DQ_DIRTY)
 		ntp->t_flags |= XFS_TRANS_DQ_DIRTY;
 
 	for (j = 0; j < XFS_QM_TRANS_DQTYPES; j++) {
-- 
cgit v1.2.3