linux-toradex.git/drivers/md, branch v2.6.32.35

md: correctly handle probe of an 'mdp' device.

2011-03-02T14:47:05+00:00

commit 8f5f02c460b7ca74ce55ce126ce0c1e58a3f923d upstream.

'mdp' devices are md devices with preallocated device numbers
for partitions. As such it is possible to mknod and open a partition
before opening the whole device.

this causes  md_probe() to be called with a device number of a
partition, which in-turn calls mddev_find with such a number.

However mddev_find expects the number of a 'whole device' and
does the wrong thing with partition numbers.

So add code to mddev_find to remove the 'partition' part of
a device number and just work with the 'whole device'.

This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=28652

Reported-by: hkmaly@bigfoot.com
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

dm raid1: fix null pointer dereference in suspend

2011-03-02T14:46:44+00:00

commit 558569aa9d83e016295bac77d900342908d7fd85 upstream.

When suspending a failed mirror, bios are completed by mirror_end_io() and
__rh_lookup() in dm_rh_dec() returns NULL where a non-NULL return value is
required by design.  Fix this by not changing the state of the recovery failed
region from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end().

Issue

On 2.6.33-rc1 kernel, I hit the bug when I suspended the failed
mirror by dmsetup command.

BUG: unable to handle kernel NULL pointer dereference at 00000020
IP: [] dm_rh_dec+0x35/0xa1 [dm_region_hash]
...
EIP: 0060:[] EFLAGS: 00010046 CPU: 0
EIP is at dm_rh_dec+0x35/0xa1 [dm_region_hash]
EAX: 00000286 EBX: 00000000 ECX: 00000286 EDX: 00000000
ESI: eff79eac EDI: eff79e80 EBP: f6915cd4 ESP: f6915cc4
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process dmsetup (pid: 2849, ti=f6914000 task=eff03e80 task.ti=f6914000)
 ...
Call Trace:
 [] ? mirror_end_io+0x53/0x1b1 [dm_mirror]
 [] ? clone_endio+0x4d/0xa2 [dm_mod]
 [] ? mirror_end_io+0x0/0x1b1 [dm_mirror]
 [] ? clone_endio+0x0/0xa2 [dm_mod]
 [] ? bio_endio+0x28/0x2b
 [] ? hold_bio+0x2d/0x62 [dm_mirror]
 [] ? mirror_presuspend+0xeb/0xf7 [dm_mirror]
 [] ? vmap_page_range+0xb/0xd
 [] ? suspend_targets+0x2d/0x3b [dm_mod]
 [] ? dm_table_presuspend_targets+0xe/0x10 [dm_mod]
 [] ? dm_suspend+0x4d/0x150 [dm_mod]
 [] ? dev_suspend+0x55/0x18a [dm_mod]
 [] ? _copy_from_user+0x42/0x56
 [] ? dm_ctl_ioctl+0x22c/0x281 [dm_mod]
 [] ? dev_suspend+0x0/0x18a [dm_mod]
 [] ? dm_ctl_ioctl+0x0/0x281 [dm_mod]
 [] ? vfs_ioctl+0x22/0x85
 [] ? do_vfs_ioctl+0x4cb/0x516
 [] ? sys_ioctl+0x40/0x5a
 [] ? sysenter_do_call+0x12/0x28

Analysis

When recovery process of a region failed, dm_rh_recovery_end() function
changes the state of the region from RM_RH_RECOVERING to DM_RH_NOSYNC.
When recovery_complete() is executed between dm_rh_update_states() and
dm_writes() in do_mirror(), bios are processed with the region state,
DM_RH_NOSYNC. However, the region data is freed without checking its
pending count when dm_rh_update_states() is called next time.

When bios are finished by mirror_end_io(), __rh_lookup() in dm_rh_dec()
returns NULL even though a valid return value are expected.

Solution

Remove the state change of the recovery failed region from DM_RH_RECOVERING
to DM_RH_NOSYNC in dm_rh_recovery_end(). We can remove the state change
because:

  - If the region data has been released by dm_rh_update_states(),
    a new region data is created with the state of DM_RH_NOSYNC, and
    bios are processed according to the DM_RH_NOSYNC state.

  - If the region data has not been released by dm_rh_update_states(),
    a state of the region is DM_RH_RECOVERING and bios are put in the
    delayed_bio list.

The flag change from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end()
was added in the following commit:
  dm raid1: handle resync failures
  author  Jonathan Brassow 
    Thu, 12 Jul 2007 16:29:04 +0000 (17:29 +0100)
  http://git.kernel.org/linus/f44db678edcc6f4c2779ac43f63f0b9dfa28b724

Signed-off-by: Takahiro Yasui 
Reviewed-by: Mikulas Patocka 
Signed-off-by: Alasdair G Kergon 
Cc: maximilian attems 
Signed-off-by: Greg Kroah-Hartman

dm raid1: fail writes if errors are not handled and log fails

2011-03-02T14:46:43+00:00

commit 5528d17de1cf1462f285c40ccaf8e0d0e4c64dc0 upstream.

If the mirror log fails when the handle_errors option was not selected
and there is no remaining valid mirror leg, writes return success even
though they weren't actually written to any device.  This patch
completes them with EIO instead.

This code path is taken:
do_writes:
	bio_list_merge(&ms->failures, &sync);
do_failures:
	if (!get_valid_mirror(ms)) (false)
	else if (errors_handled(ms)) (false)
	else bio_endio(bio, 0);

The logic in do_failures is based on presuming that the write was already
tried: if it succeeded at least on one leg (without handle_errors) it
is reported as success.

Reference: https://bugzilla.redhat.com/show_bug.cgi?id=555197

Signed-off-by: Mikulas Patocka 
Signed-off-by: Alasdair G Kergon 
Cc: maximilian attems 
Signed-off-by: Greg Kroah-Hartman

dm mpath: disable blk_abort_queue

2011-02-17T23:37:14+00:00

commit 09c9d4c9b6a2b5909ae3c6265e4cd3820b636863 upstream.

Revert commit 224cb3e981f1b2f9f93dbd49eaef505d17d894c2
  dm: Call blk_abort_queue on failed paths

Multipath began to use blk_abort_queue() to allow for
lower latency path deactivation.  This was found to
cause list corruption:

   the cmd gets blk_abort_queued/timedout run on it and the scsi eh
   somehow is able to complete and run scsi_queue_insert while
   scsi_request_fn is still trying to process the request.

   https://www.redhat.com/archives/dm-devel/2010-November/msg00085.html

Signed-off-by: Mike Snitzer 
Signed-off-by: Alasdair G Kergon 
Cc: Mike Anderson 
Cc: Mike Christie 
Signed-off-by: Greg Kroah-Hartman

dm: dont take i_mutex to change device size

2011-02-17T23:37:13+00:00

commit c217649bf2d60ac119afd71d938278cffd55962b upstream.

No longer needlessly hold md->bdev->bd_inode->i_mutex when changing the
size of a DM device.  This additional locking is unnecessary because
i_size_write() is already protected by the existing critical section in
dm_swap_table().  DM already has a reference on md->bdev so the
associated bd_inode may be changed without lifetime concerns.

A negative side-effect of having held md->bdev->bd_inode->i_mutex was
that a concurrent DM device resize and flush (via fsync) would deadlock.
Dropping md->bdev->bd_inode->i_mutex eliminates this potential for
deadlock.  The following reproducer no longer deadlocks:
  https://www.redhat.com/archives/dm-devel/2009-July/msg00284.html

Signed-off-by: Mike Snitzer 
Signed-off-by: Mikulas Patocka 
Signed-off-by: Alasdair G Kergon 
Signed-off-by: Greg Kroah-Hartman

md: fix regression with re-adding devices to arrays with no metadata

2011-02-17T23:37:09+00:00

commit bf572541ab44240163eaa2d486b06f306a31d45a upstream.

Commit 1a855a0606 (2.6.37-rc4) fixed a problem where devices were
re-added when they shouldn't be but caused a regression in a less
common case that means sometimes devices cannot be re-added when they
should be.

In particular, when re-adding a device to an array without metadata
we should always access the device, but after the above commit we
didn't.

This patch sets the In_sync flag in that case so that the re-add
succeeds.

This patch is suitable for any -stable kernel to which 1a855a0606 was
applied.

Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead

2011-01-07T22:43:18+00:00

commit e692cb668fdd5a712c6ed2a2d6f2a36ee83997b4 upstream.

When stacking devices, a request_queue is not always available. This
forced us to have a no_cluster flag in the queue_limits that could be
used as a carrier until the request_queue had been set up for a
metadevice.

There were several problems with that approach. First of all it was up
to the stacking device to remember to set queue flag after stacking had
completed. Also, the queue flag and the queue limits had to be kept in
sync at all times. We got that wrong, which could lead to us issuing
commands that went beyond the max scatterlist limit set by the driver.

The proper fix is to avoid having two flags for tracking the same thing.
We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
block layer merging functions. The queue_limit 'no_cluster' is turned
into 'cluster' to avoid double negatives and to ease stacking.
Clustering defaults to being enabled as before. The queue flag logic is
removed from the stacking function, and explicitly setting the cluster
flag is no longer necessary in DM and MD.

Reported-by: Ed Lin 
Signed-off-by: Martin K. Petersen 
Acked-by: Mike Snitzer 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

md: fix bug with re-adding of partially recovered device.

2011-01-07T22:43:09+00:00

commit 1a855a0606653d2d82506281e2c686bacb4b2f45 upstream.

With v0.90 metadata, a hot-spare does not become a full member of the
array until recovery is complete.  So if we re-add such a device to
the array, we know that all of it is as up-to-date as the event count
would suggest, and so it a bitmap-based recovery is possible.

However with v1.x metadata, the hot-spare immediately becomes a full
member of the array, but it record how much of the device has been
recovered.  If the array is stopped and re-assembled recovery starts
from this point.

When such a device is hot-added to an array we currently lose the 'how
much is recovered' information and incorrectly included it as a full
in-sync member (after bitmap-based fixup).
This is wrong and unsafe and could corrupt data.

So be more careful about setting saved_raid_disk - which is what
guides the re-adding of devices back into an array.
The new code matches the code in slot_store which does a similar
thing, which is encouraging.

This is suitable for any -stable kernel.

Reported-by: "Dailey, Nate" 
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: fix return value of rdev_size_change()

2010-12-09T21:26:46+00:00

commit c26a44ed1e552aaa1d4ceb71842002d235fe98d7 upstream.

When trying to grow an array by enlarging component devices,
rdev_size_store() expects the return value of rdev_size_change() to be
in sectors, but the actual value is returned in KBs.

This functionality was broken by commit
     dd8ac336c13fd8afdb082ebacb1cddd5cf727889
so this patch is suitable for any kernel since 2.6.30.

Signed-off-by: Justin Maggard 
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md/raid1: really fix recovery looping when single good device fails.

2010-12-09T21:26:46+00:00

commit 8f9e0ee38f75d4740daa9e42c8af628d33d19a02 upstream.

Commit 4044ba58dd15cb01797c4fd034f39ef4a75f7cc3 supposedly fixed a
problem where if a raid1 with just one good device gets a read-error
during recovery, the recovery would abort and immediately restart in
an infinite loop.

However it depended on raid1_remove_disk removing the spare device
from the array.  But that does not happen in this case.  So add a test
so that in the 'recovery_disabled' case, the device will be removed.

This suitable for any kernel since 2.6.29 which is when
recovery_disabled was introduced.

Reported-by: Sebastian Färber 
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman