<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/drivers/md/raid1.c, branch v2.6.27.38</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>md: avoid races when stopping resync.</title>
<updated>2009-03-17T00:52:54+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2009-02-25T02:18:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a9785b107920670a8ec8ef1744e0e0c59dde9eb4'/>
<id>a9785b107920670a8ec8ef1744e0e0c59dde9eb4</id>
<content type='text'>
commit 73d5c38a9536142e062c35997b044e89166e063b upstream.

There has been a race in raid10 and raid1 for a long time
which has only recently started showing up due to a scheduler changed.

When a sync_read request finishes, as soon as reschedule_retry
is called, another thread can mark the resync request as having
completed, so md_do_sync can finish, -&gt;stop can be called, and
-&gt;conf can be freed.  So using conf after reschedule_retry is not
safe.

Similarly, when finishing a sync_write, calling md_done_sync must be
the last thing we do, as it allows a chain of events which will free
conf and other data structures.

The first of these requires action in raid10.c
The second requires action in raid1.c and raid10.c

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 73d5c38a9536142e062c35997b044e89166e063b upstream.

There has been a race in raid10 and raid1 for a long time
which has only recently started showing up due to a scheduler changed.

When a sync_read request finishes, as soon as reschedule_retry
is called, another thread can mark the resync request as having
completed, so md_do_sync can finish, -&gt;stop can be called, and
-&gt;conf can be freed.  So using conf after reschedule_retry is not
safe.

Similarly, when finishing a sync_write, calling md_done_sync must be
the last thing we do, as it allows a chain of events which will free
conf and other data structures.

The first of these requires action in raid10.c
The second requires action in raid1.c and raid10.c

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>md: Make mddev-&gt;array_size sector-based.</title>
<updated>2008-07-21T07:05:22+00:00</updated>
<author>
<name>Andre Noll</name>
<email>maan@systemlinux.org</email>
</author>
<published>2008-07-21T07:05:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f233ea5c9e0d8b95e4283bf6a3436b88f6fd3586'/>
<id>f233ea5c9e0d8b95e4283bf6a3436b88f6fd3586</id>
<content type='text'>
This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.

Signed-off-by: Andre Noll &lt;maan@systemlinux.org&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.

Signed-off-by: Andre Noll &lt;maan@systemlinux.org&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: resolve external metadata handling deadlock in md_allow_write</title>
<updated>2008-07-01T00:18:19+00:00</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2008-06-28T04:44:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b5470dc5fc18a8ff6517c3bb538d1479e58ecb02'/>
<id>b5470dc5fc18a8ff6517c3bb538d1479e58ecb02</id>
<content type='text'>
md_allow_write() marks the metadata dirty while holding mddev-&gt;lock and then
waits for the write to complete.  For externally managed metadata this causes a
deadlock as userspace needs to take the lock to communicate that the metadata
update has completed.

Change md_allow_write() in the 'external' case to start the 'mark active'
operation and then return -EAGAIN.  The expected side effects while waiting for
userspace to write 'active' to 'array_state' are holding off reshape (code
currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
cause updates to 'raid_disks' to fail.  Except for 'stripe_cache_size' changes
these failures can be mitigated by coordinating with mdmon.

md_write_start() still prevents writes from occurring until the metadata
handler has had a chance to take action as it unconditionally waits for
MD_CHANGE_CLEAN to be cleared.

[neilb@suse.de: return -EAGAIN, try GFP_NOIO]
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
md_allow_write() marks the metadata dirty while holding mddev-&gt;lock and then
waits for the write to complete.  For externally managed metadata this causes a
deadlock as userspace needs to take the lock to communicate that the metadata
update has completed.

Change md_allow_write() in the 'external' case to start the 'mark active'
operation and then return -EAGAIN.  The expected side effects while waiting for
userspace to write 'active' to 'array_state' are holding off reshape (code
currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
cause updates to 'raid_disks' to fail.  Except for 'stripe_cache_size' changes
these failures can be mitigated by coordinating with mdmon.

md_write_start() still prevents writes from occurring until the metadata
handler has had a chance to take action as it unconditionally waits for
MD_CHANGE_CLEAN to be cleared.

[neilb@suse.de: return -EAGAIN, try GFP_NOIO]
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>rationalise return value for -&gt;hot_add_disk method.</title>
<updated>2008-06-27T22:31:33+00:00</updated>
<author>
<name>Neil Brown</name>
<email>neilb@notabene.brown</email>
</author>
<published>2008-06-27T22:31:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=199050ea1ff2270174ee525b73bc4c3323098897'/>
<id>199050ea1ff2270174ee525b73bc4c3323098897</id>
<content type='text'>
For all array types but linear, -&gt;hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.

This doesn't cause a functional problem because the -&gt;hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.

So convert all to return 0 for success or -errno on failure
and fix call sites to match.

Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
For all array types but linear, -&gt;hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.

This doesn't cause a functional problem because the -&gt;hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.

So convert all to return 0 for success or -errno on failure
and fix call sites to match.

Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Support adding a spare to a live md array with external metadata.</title>
<updated>2008-06-27T22:31:31+00:00</updated>
<author>
<name>Neil Brown</name>
<email>neilb@notabene.brown</email>
</author>
<published>2008-06-27T22:31:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda'/>
<id>6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda</id>
<content type='text'>
i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill an vacant slot in an and md array.

Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill an vacant slot in an and md array.

Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: restart recovery cleanly after device failure.</title>
<updated>2008-05-24T16:56:10+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2008-05-23T20:04:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=dfc7064500061677720fa26352963c772d3ebe6b'/>
<id>dfc7064500061677720fa26352963c772d3ebe6b</id>
<content type='text'>
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.

For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.

We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
  - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
    which causes the recovery not be be checkpointed.
  - We remove spares and then re-added them which loses important state
    information.

The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed.  If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error.  So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.

Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded).  Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.

Issue:  If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available,  do we want to:
 1/ complete the current rebuild, then go back and rebuild the new spare or
 2/ restart the rebuild from the start and rebuild both devices in
    parallel.

Both options can be argued for.  The code currently takes option 2 as
  a/ this requires least code change
  b/ this results in a minimally-degraded array in minimal time.

Cc: "Eivind Sarto" &lt;ivan@kasenna.com&gt;
Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.

For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.

We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
  - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
    which causes the recovery not be be checkpointed.
  - We remove spares and then re-added them which loses important state
    information.

The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed.  If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error.  So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.

Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded).  Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.

Issue:  If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available,  do we want to:
 1/ complete the current rebuild, then go back and rebuild the new spare or
 2/ restart the rebuild from the start and rebuild both devices in
    parallel.

Both options can be argued for.  The code currently takes option 2 as
  a/ this requires least code change
  b/ this results in a minimally-degraded array in minimal time.

Cc: "Eivind Sarto" &lt;ivan@kasenna.com&gt;
Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: raid1: Fix restoration of bio between failed read and write.</title>
<updated>2008-05-24T16:56:10+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2008-05-23T20:04:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=698b18c1e8bddf39cbf1ba50792b0fe302dbe6d6'/>
<id>698b18c1e8bddf39cbf1ba50792b0fe302dbe6d6</id>
<content type='text'>
When performing a "recovery" or "check" pass on a RAID1 array, we read
from each device and possible, if there is a difference or a read error,
write back to some devices.

We use the same 'bio' for both read and write, resetting various fields
between the two operations.

We forgot to reset bv_offset and bv_len however.  These are often left
unchanged, but in the case where there is an IO error one or two sectors
into a page, they are changed.

This results in correctable errors not being corrected properly.  It does
not result in any data corruption.

Cc: "Fairbanks, David" &lt;David.Fairbanks@stratus.com&gt;
Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When performing a "recovery" or "check" pass on a RAID1 array, we read
from each device and possible, if there is a difference or a read error,
write back to some devices.

We use the same 'bio' for both read and write, resetting various fields
between the two operations.

We forgot to reset bv_offset and bv_len however.  These are often left
unchanged, but in the case where there is an IO error one or two sectors
into a page, they are changed.

This results in correctable errors not being corrected properly.  It does
not result in any data corruption.

Cc: "Fairbanks, David" &lt;David.Fairbanks@stratus.com&gt;
Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: fix possible oops when removing a bitmap from an active array</title>
<updated>2008-05-24T16:56:09+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2008-05-23T20:04:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=84255d1018c50e72c71a49f359989597d53a3f53'/>
<id>84255d1018c50e72c71a49f359989597d53a3f53</id>
<content type='text'>
It is possible to add a write-intent bitmap to an active array, or remove
the bitmap that is there.

When we do with the 'quiesce' the array, which causes make_request to
block in "wait_barrier()".

However we are sampling the value of "mddev-&gt;bitmap" before the
wait_barrier call, and using it afterwards.  This can result in using a
bitmap structure that has been freed.

Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
It is possible to add a write-intent bitmap to an active array, or remove
the bitmap that is there.

When we do with the 'quiesce' the array, which causes make_request to
block in "wait_barrier()".

However we are sampling the value of "mddev-&gt;bitmap" before the
wait_barrier call, and using it afterwards.  This can result in using a
bitmap structure that has been freed.

Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Remove blkdev warning triggered by using md</title>
<updated>2008-05-15T02:11:15+00:00</updated>
<author>
<name>Neil Brown</name>
<email>neilb@suse.de</email>
</author>
<published>2008-05-14T23:05:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e7e72bf641b1fc7b9df6f40bd2c36dfccd8d647c'/>
<id>e7e72bf641b1fc7b9df6f40bd2c36dfccd8d647c</id>
<content type='text'>
As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.

For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q-&gt;queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock, us
q-&gt;__queue_lock.  So always initialise that lock when allocated.

With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.

Thanks to Dan Williams for review and fixing the silly bugs.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Jens Axboe &lt;jens.axboe@oracle.com&gt;
Cc: Alistair John Strachan &lt;alistair@devzero.co.uk&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: "Rafael J. Wysocki" &lt;rjw@sisk.pl&gt;
Cc: Jacek Luczak &lt;difrost.kernel@gmail.com&gt;
Cc: Prakash Punnoor &lt;prakash@punnoor.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.

For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q-&gt;queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock, us
q-&gt;__queue_lock.  So always initialise that lock when allocated.

With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.

Thanks to Dan Williams for review and fixing the silly bugs.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Jens Axboe &lt;jens.axboe@oracle.com&gt;
Cc: Alistair John Strachan &lt;alistair@devzero.co.uk&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: "Rafael J. Wysocki" &lt;rjw@sisk.pl&gt;
Cc: Jacek Luczak &lt;difrost.kernel@gmail.com&gt;
Cc: Prakash Punnoor &lt;prakash@punnoor.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>md: support blocking writes to an array on device failure</title>
<updated>2008-04-30T15:29:33+00:00</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2008-04-30T07:52:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6bfe0b499082fd3950429017cd8ebf2a6c458aa5'/>
<id>6bfe0b499082fd3950429017cd8ebf2a6c458aa5</id>
<content type='text'>
Allows a userspace metadata handler to take action upon detecting a device
failure.

Based on an original patch by Neil Brown.

Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
 userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev-&gt;external

Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Allows a userspace metadata handler to take action upon detecting a device
failure.

Based on an original patch by Neil Brown.

Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
 userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev-&gt;external

Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
