linux-toradex.git/drivers/md, branch v3.2.73

md/raid10: don't clear bitmap bit when bad-block-list write fails.

2015-11-17T15:54:45+00:00

commit c340702ca26a628832fade4f133d8160a55c29cc upstream.

When a write fails and a bad-block-list is present, we can
update the bad-block-list instead of writing the data. If
this succeeds then it is OK clear the relevant bitmap-bit as
no further 'sync' of the block is needed.

However if writing the bad-block-list fails then we need to
treat the write as failed and particularly must not clear
the bitmap bit. Otherwise the device can be re-added (after
any hardware connection issues are resolved) and because the
relevant bit in the bitmap is clear, that block will not be
resynced. This leads to data corruption.

We already delay the final bio_endio() on the write until
the bad-block-list is written so that when the write
returns: either that data is safe, the bad-block record is
safe, or the fact that the device is faulty is safe.
However we *don't* delay the clearing of the bitmap, so the
bitmap bit can be recorded as cleared before we know if the
bad-block-list was written safely.

So: delay that until the write really is safe.
i.e. move the call to close_write() until just before
calling bio_endio(), and recheck the 'is array degraded'
status before making that call.

This bug goes back to v3.1 when bad-block-lists were
introduced, though it only affects arrays created with
mdadm-3.3 or later as only those have bad-block lists.

Backports will require at least
Commit: 95af587e95aa ("md/raid10: ensure device failure recorded before write request returns.")
as well. I'll send that to 'stable' separately.

Note that of the two tests of R10BIO_WriteError that this
patch adds, the first is certain to fail and the second is
certain to succeed. However doing it this way makes the
patch more obviously correct. I will tidy the code up in a
future merge window.

Reported-by: Nate Dailey
Fixes: bd870a16c594 ("md/raid10: Handle write errors by updating badblock log.")
Signed-off-by: NeilBrown
Signed-off-by: Ben Hutchings

md/raid10: ensure device failure recorded before write request returns.

2015-11-17T15:54:45+00:00

commit 95af587e95aacb9cfda4a9641069a5244a540dc8 upstream.

When a write to one of the legs of a RAID10 fails, the failure is
recorded in the metadata of the other legs so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

md/raid1: don't clear bitmap bit when bad-block-list write fails.

2015-11-17T15:54:45+00:00

commit bd8688a199b864944bf62eebed0ca13b46249453 upstream.

When a write fails and a bad-block-list is present, we can
update the bad-block-list instead of writing the data.  If
this succeeds then it is OK clear the relevant bitmap-bit as
no further 'sync' of the block is needed.

However if writing the bad-block-list fails then we need to
treat the write as failed and particularly must not clear
the bitmap bit.  Otherwise the device can be re-added (after
any hardware connection issues are resolved) and because the
relevant bit in the bitmap is clear, that block will not be
resynced.  This leads to data corruption.

We already delay the final bio_endio() on the write until
the bad-block-list is written so that when the write
returns: either that data is safe, the bad-block record is
safe, or the fact that the device is faulty is safe.
However we *don't* delay the clearing of the bitmap, so the
bitmap bit can be recorded as cleared before we know if the
bad-block-list was written safely.

So: delay that until the write really is safe.
i.e. move the call to close_write() until just before
calling bio_endio(), and recheck the 'is array degraded'
status before making that call.

This bug goes back to v3.1 when bad-block-lists were
introduced, though it only affects arrays created with
mdadm-3.3 or later as only those have bad-block lists.

Backports will require at least
Commit: 55ce74d4bfe1 ("md/raid1: ensure device failure recorded before write request returns.")
as well.  I'll send that to 'stable' separately.

Note that of the two tests of R1BIO_WriteError that this
patch adds, the first is certain to fail and the second is
certain to succeed.  However doing it this way makes the
patch more obviously correct.  I will tidy the code up in a
future merge window.

Reported-and-tested-by: Nate Dailey 
Cc: Jes Sorensen 
Fixes: cd5ff9a16f08 ("md/raid1:  Handle write errors by updating badblock log.")
Signed-off-by: NeilBrown 
Signed-off-by: Ben Hutchings

md/raid1: ensure device failure recorded before write request returns.

2015-11-17T15:54:45+00:00

commit 55ce74d4bfe1b9444436264c637f39a152d1e5ac upstream.

When a write to one of the legs of a RAID1 fails, the failure is
recorded in the metadata of the other leg(s) so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again  (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

dm btree: fix leak of bufio-backed block in btree_split_beneath error path

2015-11-17T15:54:45+00:00

commit 4dcb8b57df3593dcb20481d9d6cf79d1dc1534be upstream.

btree_split_beneath()'s error path had an outstanding FIXME that speaks
directly to the potential for _not_ cleaning up a previously allocated
bufio-backed block.

Fix this by releasing the previously allocated bufio block using
unlock_block().

Reported-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Acked-by: Joe Thornber 
Signed-off-by: Ben Hutchings

dm btree remove: fix a bug when rebalancing nodes after removal

2015-11-17T15:54:45+00:00

commit 2871c69e025e8bc507651d5a9cf81a8a7da9d24b upstream.

Commit 4c7e309340ff ("dm btree remove: fix bug in redistribute3") wasn't
a complete fix for redistribute3().

The redistribute3 function takes 3 btree nodes and shares out the entries
evenly between them.  If the three nodes in total contained
(MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting
rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in
the center.

Fix this issue by being more careful about calculating the target number
of entries for the left and right nodes.

Unit tested in userspace using this program:
https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c

Signed-off-by: Joe Thornber 
Signed-off-by: Mike Snitzer 
Signed-off-by: Ben Hutchings

md/raid0: apply base queue limits before disk_stack_limits

2015-11-17T15:54:41+00:00

commit 66eefe5de11db1e0d8f2edc3880d50e7c36a9d43 upstream.

Calling e.g. blk_queue_max_hw_sectors() after calls to
disk_stack_limits() discards the settings determined by
disk_stack_limits().
So we need to make those calls first.

Fixes: 199dc6ed5179 ("md/raid0: update queue parameter in a safer location.")
Reported-by: Jes Sorensen 
Signed-off-by: NeilBrown 
[bwh: Backported to 3.2: the code being moved looks a little different]
Signed-off-by: Ben Hutchings

md/raid0: update queue parameter in a safer location.

2015-11-17T15:54:41+00:00

commit 199dc6ed5179251fa6158a461499c24bdd99c836 upstream.

When a (e.g.) RAID5 array is reshaped to RAID0, the updating
of queue parameters (e.g. max number of sectors per bio) is
done in the wrong place.
It should be part of ->run, but it is actually part of ->takeover.
This means it happens before level_store() calls:

	blk_set_stacking_limits(&mddev->queue->limits);

and so it ineffective.  This can lead to errors from underlying
devices.

So move all the relevant settings out of create_stripe_zones()
and into raid0_run().

As this can lead to a bug-on it is suitable for any -stable
kernel which supports reshape to RAID0.  So 2.6.35 or later.
As the bug has been present for five years there is no urgency,
so no need to rush into -stable.

Fixes: 9af204cf720c ("md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover")
Reported-by: Yi Zhang 
Signed-off-by: NeilBrown 
[bwh: Backported to 3.2:
 - md has no discard or write-same support
 - md is not used by dm-raid so mddev->queue is never null
 - Open-code rdev_for_each()
 - Adjust context]
Signed-off-by: Ben Hutchings

md: use kzalloc() when bitmap is disabled

2015-10-13T02:46:11+00:00

commit b6878d9e03043695dbf3fa1caa6dfc09db225b16 upstream.

In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a
mdu_bitmap_file_t called "file".

5769         file = kmalloc(sizeof(*file), GFP_NOIO);
5770         if (!file)
5771                 return -ENOMEM;

This structure is copied to user space at the end of the function.

5786         if (err == 0 &&
5787             copy_to_user(arg, file, sizeof(*file)))
5788                 err = -EFAULT

But if bitmap is disabled only the first byte of "file" is initialized
with zero, so it's possible to read some bytes (up to 4095) of kernel
space memory from user space. This is an information leak.

5775         /* bitmap disabled, zero the first byte and copy out */
5776         if (!mddev->bitmap_info.file)
5777                 file->pathname[0] = '\0';

Signed-off-by: Benjamin Randazzo 
Signed-off-by: NeilBrown 
[bwh: Backported to 3.2: patch both possible allocation calls]
Signed-off-by: Ben Hutchings

dm btree: add ref counting ops for the leaves of top level btrees

2015-10-13T02:46:03+00:00

commit b0dc3c8bc157c60b1d470163882be8c13e1950af upstream.

When using nested btrees, the top leaves of the top levels contain
block addresses for the root of the next tree down.  If we shadow a
shared leaf node the leaf values (sub tree roots) should be incremented
accordingly.

This is only an issue if there is metadata sharing in the top levels.
Which only occurs if metadata snapshots are being used (as is possible
with dm-thinp).  And could result in a block from the thinp metadata
snap being reused early, thus corrupting the thinp metadata snap.

Signed-off-by: Joe Thornber 
Signed-off-by: Mike Snitzer 
[bwh: Backported to 3.2:
 - Drop change to remove_one()
 - Remove const pointer qualifications]
Signed-off-by: Ben Hutchings

linux-toradex.git/drivers/md, branch v3.2.73

md/raid10: don't clear bitmap bit when bad-block-list write fails.

md/raid10: ensure device failure recorded before write request returns.

md/raid1: don't clear bitmap bit when bad-block-list write fails.

md/raid1: ensure device failure recorded before write request returns.

dm btree: fix leak of bufio-backed block in btree_split_beneath error path

dm btree remove: fix a bug when rebalancing nodes after removal

md/raid0: apply base queue limits *before* disk_stack_limits

md/raid0: update queue parameter in a safer location.

md: use kzalloc() when bitmap is disabled

dm btree: add ref counting ops for the leaves of top level btrees

md/raid0: apply base queue limits before disk_stack_limits