| Age | Commit message (Collapse) | Author |
|
commit 6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7 upstream.
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.
This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which
will break the infinitely retry when md_open enter racing area.
This patch is partly fix soft lockup issue, full fix needs mddev_find
is split into two functions: mddev_find & mddev_find_or_alloc. And
md_open should call new mddev_find (it only does searching job).
For more detail, please refer with Christoph's "split mddev_find" patch
in later commits.
*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt
*** script ***
about trigger every time with below script
```
1 node1="mdcluster1"
2 node2="mdcluster2"
3
4 mdadm -Ss
5 ssh ${node2} "mdadm -Ss"
6 wipefs -a /dev/sda /dev/sdb
7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
/dev/sdb --assume-clean
8
9 for i in {1..10}; do
10 echo ==== $i ====;
11
12 echo "test ...."
13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14 sleep 1
15
16 echo "clean ....."
17 ssh ${node2} "mdadm -Ss"
18 done
```
I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.
*** stack ***
```
[ 884.226509] mddev_put+0x1c/0xe0 [md_mod]
[ 884.226515] md_open+0x3c/0xe0 [md_mod]
[ 884.226518] __blkdev_get+0x30d/0x710
[ 884.226520] ? bd_acquire+0xd0/0xd0
[ 884.226522] blkdev_get+0x14/0x30
[ 884.226524] do_dentry_open+0x204/0x3a0
[ 884.226531] path_openat+0x2fc/0x1520
[ 884.226534] ? seq_printf+0x4e/0x70
[ 884.226536] do_filp_open+0x9b/0x110
[ 884.226542] ? md_release+0x20/0x20 [md_mod]
[ 884.226543] ? seq_read+0x1d8/0x3e0
[ 884.226545] ? kmem_cache_alloc+0x18a/0x270
[ 884.226547] ? do_sys_open+0x1bd/0x260
[ 884.226548] do_sys_open+0x1bd/0x260
[ 884.226551] do_syscall_64+0x5b/0x1e0
[ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9
```
*** rootcause ***
"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.
The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.
```
<thread 1>: "mdadm -Ss"
md_release
mddev_put deletes mddev from all_mddevs
queue_work for mddev_delayed_delete
at this time, "/dev/md0" is still available for opening
<thread 2>: "mdadm --monitor ..."
md_open
+ mddev_find can't find mddev of /dev/md0, and create a new mddev and
| return.
+ trigger "if (mddev->gendisk != bdev->bd_disk)" and return
-ERESTARTSYS.
```
In non-preempt kernel, <thread 2> is occupying on current CPU. and
mddev_delayed_delete which was created in <thread 1> also can't be
schedule.
In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after <thread 2> running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
the soft lockup is broken.
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8b57251f9a91f5e5a599de7549915d2d226cc3af upstream.
Factor out a self-contained helper to just lookup a mddev by the dev_t
"unit".
Cc: stable@vger.kernel.org
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f6766ff6afff70e2aaf39e1511e16d471de7c3ae ]
We need to check mddev->del_work before flush workqueu since the purpose
of flush is to ensure the previous md is disappeared. Otherwise the similar
deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the
bdev->bd_mutex before flush workqueue.
kernel: [ 154.522645] ======================================================
kernel: [ 154.522647] WARNING: possible circular locking dependency detected
kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O
kernel: [ 154.522651] ------------------------------------------------------
kernel: [ 154.522653] mdadm/2482 is trying to acquire lock:
kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
kernel: [ 154.522673]
kernel: [ 154.522673] but task is already holding lock:
kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522691]
kernel: [ 154.522691] which lock already depends on the new lock.
kernel: [ 154.522691]
kernel: [ 154.522694]
kernel: [ 154.522694] the existing dependency chain (in reverse order) is:
kernel: [ 154.522696]
kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
kernel: [ 154.522704] __mutex_lock+0x87/0x950
kernel: [ 154.522706] __blkdev_get+0x79/0x590
kernel: [ 154.522708] blkdev_get+0x65/0x140
kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40
kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod]
kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod]
kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod]
kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522735] vfs_write+0xad/0x1a0
kernel: [ 154.522737] ksys_write+0xa4/0xe0
kernel: [ 154.522745] do_syscall_64+0x64/0x2b0
kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522749]
kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
kernel: [ 154.522752] __mutex_lock+0x87/0x950
kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod]
kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522763] vfs_write+0xad/0x1a0
kernel: [ 154.522765] ksys_write+0xa4/0xe0
kernel: [ 154.522767] do_syscall_64+0x64/0x2b0
kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522770]
kernel: [ 154.522770] -> #2 (kn->count#253){++++}:
kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0
kernel: [ 154.522778] kernfs_remove+0x1f/0x30
kernel: [ 154.522780] kobject_del+0x28/0x60
kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod]
kernel: [ 154.522786] process_one_work+0x2a7/0x5f0
kernel: [ 154.522788] worker_thread+0x2d/0x3d0
kernel: [ 154.522793] kthread+0x117/0x130
kernel: [ 154.522795] ret_from_fork+0x3a/0x50
kernel: [ 154.522796]
kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
kernel: [ 154.522800] process_one_work+0x27e/0x5f0
kernel: [ 154.522802] worker_thread+0x2d/0x3d0
kernel: [ 154.522804] kthread+0x117/0x130
kernel: [ 154.522806] ret_from_fork+0x3a/0x50
kernel: [ 154.522807]
kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
kernel: [ 154.522813] __lock_acquire+0x1392/0x1690
kernel: [ 154.522816] lock_acquire+0xb4/0x1a0
kernel: [ 154.522818] flush_workqueue+0xab/0x4b0
kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522823] __blkdev_get+0xea/0x590
kernel: [ 154.522825] blkdev_get+0x65/0x140
kernel: [ 154.522828] do_dentry_open+0x1d1/0x380
kernel: [ 154.522831] path_openat+0x567/0xcc0
kernel: [ 154.522834] do_filp_open+0x9b/0x110
kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0
kernel: [ 154.522838] do_sys_open+0x57/0x80
kernel: [ 154.522840] do_syscall_64+0x64/0x2b0
kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522844]
kernel: [ 154.522844] other info that might help us debug this:
kernel: [ 154.522844]
kernel: [ 154.522846] Chain exists of:
kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
kernel: [ 154.522846]
kernel: [ 154.522850] Possible unsafe locking scenario:
kernel: [ 154.522850]
kernel: [ 154.522852] CPU0 CPU1
kernel: [ 154.522853] ---- ----
kernel: [ 154.522854] lock(&bdev->bd_mutex);
kernel: [ 154.522856] lock(&mddev->reconfig_mutex);
kernel: [ 154.522858] lock(&bdev->bd_mutex);
kernel: [ 154.522860] lock((wq_completion)md_misc);
kernel: [ 154.522861]
kernel: [ 154.522861] *** DEADLOCK ***
kernel: [ 154.522861]
kernel: [ 154.522864] 1 lock held by mdadm/2482:
kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522868]
kernel: [ 154.522868] stack backtrace:
kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25
kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
kernel: [ 154.522878] Call Trace:
kernel: [ 154.522881] dump_stack+0x8f/0xcb
kernel: [ 154.522884] check_noncircular+0x194/0x1b0
kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690
kernel: [ 154.522890] __lock_acquire+0x1392/0x1690
kernel: [ 154.522893] lock_acquire+0xb4/0x1a0
kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522898] flush_workqueue+0xab/0x4b0
kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522910] __blkdev_get+0xea/0x590
kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0
kernel: [ 154.522914] blkdev_get+0x65/0x140
kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0
kernel: [ 154.522918] do_dentry_open+0x1d1/0x380
kernel: [ 154.522921] path_openat+0x567/0xcc0
kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690
kernel: [ 154.522926] do_filp_open+0x9b/0x110
kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0
kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630
kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0
kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0
kernel: [ 154.522944] do_sys_open+0x57/0x80
kernel: [ 154.522946] do_syscall_64+0x64/0x2b0
kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae
And md_alloc also flushed the same workqueue, but the thing is different
here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and
the flush is necessary to avoid race condition, so leave it as it is.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 062f5b2ae12a153644c765e7ba3b0f825427be1d ]
When a disk is added to array, the following path is called in mdadm.
Manage_subdevs -> sysfs_freeze_array
-> Manage_add
-> sysfs_set_str(&info, NULL, "sync_action","idle")
Then from kernel side, Manage_add invokes the path (add_new_disk ->
validate_super = super_1_validate) to set In_sync flag.
Since In_sync means "device is in_sync with rest of array", and the new
added disk need to resync thread to help the synchronization of data.
And md_reap_sync_thread would call spare_active to set In_sync for the
new added disk finally. So don't set In_sync if array is in frozen.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
can't work
[ Upstream commit 0d8ed0e9bf9643f27f4816dca61081784dedb38d ]
When add one disk to array, the md_reap_sync_thread is responsible
to activate the spare and set In_sync flag for the new member in
spare_active().
But if raid1 has one member disk A, and disk B is added to the array.
Then we offline A before all the datas are synchronized from A to B,
obviously B doesn't have the latest data as A, but B is still marked
with In_sync flag.
So let's not call spare_active under the condition, otherwise B is
still showed with 'U' state which is not correct.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9642fa73d073527b0cbc337cc17a47d545d82cd2 ]
Stopping external metadata arrays during resync/recovery causes
retries, loop of interrupting and starting reconstruction, until it
hit at good moment to stop completely. While these retries
curr_mark_cnt can be small- especially on HDD drives, so subtraction
result can be smaller than 0. However it is casted to uint without
checking. As a result of it the status bar in /proc/mdstat while stopping
is strange (it jumps between 0% and 99%).
The real problem occurs here after commit 72deb455b5ec ("block: remove
CONFIG_LBDAF"). Sector_div() macro has been changed, now the
divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0.
Check if db value can be really counted and replace these macro by
div64_u64() inline.
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit ee37e62191a59d253fc916b9fc763deb777211e2 upstream.
When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
which can avoid potential NULL pointer derefence in fallowing
add_bound_rdev().
Fixes: a6da4ef85cef ("md: re-add a failed disk")
Cc: Xiao Ni <xni@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> # 4.4+
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9e753ba9b9b405e3902d9f08aec5f2ea58a0c317 upstream.
Commit d595567dc4f0 (MD: fix invalid stored role for a disk) broke linear
hotadd. Let's only fix the role for disks in raid1/10.
Based on Guoqing's original patch.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d595567dc4f0c1d90685ec1e2e296e2cad2643ac ]
If we change the number of array's device after device is removed from array,
then add the device back to array, we can see that device is added as active
role instead of spare which we expected.
Please see the below link for details:
https://marc.info/?l=linux-raid&m=153736982015076&w=2
This is caused by that we prefer to use device's previous role which is
recorded by saved_raid_disk, but we should respect the new number of
conf->raid_disks since it could be changed after device is removed.
Reported-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Tested-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c42a0e2675721e1444f56e6132a07b7b1ec169ac ]
We met NULL pointer BUG as follow:
[ 151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[ 151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
[ 151.762039] Oops: 0000 [#1] SMP PTI
[ 151.762406] Modules linked in:
[ 151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
[ 151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
[ 151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
[ 151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
[ 151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
[ 151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
[ 151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
[ 151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
[ 151.769160] FS: 00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
[ 151.769971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
[ 151.771272] Call Trace:
[ 151.771542] md_ioctl+0x1df2/0x1e10
[ 151.771906] ? __switch_to+0x129/0x440
[ 151.772295] ? __schedule+0x244/0x850
[ 151.772672] blkdev_ioctl+0x4bd/0x970
[ 151.773048] block_ioctl+0x39/0x40
[ 151.773402] do_vfs_ioctl+0xa4/0x610
[ 151.773770] ? dput.part.23+0x87/0x100
[ 151.774151] ksys_ioctl+0x70/0x80
[ 151.774493] __x64_sys_ioctl+0x16/0x20
[ 151.774877] do_syscall_64+0x5b/0x180
[ 151.775258] entry_SYSCALL_64_after_hwframe+0x44/0xa9
For raid6, when two disk of the array are offline, two spare disks can
be added into the array. Before spare disks recovery completing,
system reboot and mdadm thinks it is ok to restart the degraded
array by md_ioctl(). Since disks in raid6 is not only_parity(),
raid5_run() will abort, when there is no PPL feature or not setting
'start_dirty_degraded' parameter. Therefore, mddev->pers is NULL.
But, mddev->raid_disks has been set and it will not be cleared when
raid5_run abort. md_ioctl() can execute cmd 'HOT_REMOVE_DISK' to
remove a disk by mdadm, which will cause NULL pointer dereference
in remove_and_add_spares() finally.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 011abdc9df559ec75779bb7c53a744c69b2a94c6 upstream.
If "re-add" is written to the "state" file for a device
which is faulty, this has an effect similar to removing
and re-adding the device. It should take up the
same slot in the array that it previously had, and
an accelerated (e.g. bitmap-based) rebuild should happen.
The slot that "it previously had" is determined by
rdev->saved_raid_disk.
However this is not set when a device fails (only when a device
is added), and it is cleared when resync completes.
This means that "re-add" will normally work once, but may not work a
second time.
This patch includes two fixes.
1/ when a device fails, record the ->raid_disk value in
->saved_raid_disk before clearing ->raid_disk
2/ when "re-add" is written to a device for which
->saved_raid_disk is not set, fail.
I think this is suitable for stable as it can
cause re-adding a device to be forced to do a full
resync which takes a lot longer and so puts data at
more risk.
Cc: <stable@vger.kernel.org> (v4.1)
Fixes: 97f6cd39da22 ("md-cluster: re-add capabilities")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3312c951efaba55080958974047414576b9e5d63 upstream.
When CONFIG_LBDAF is not set, sector_t is only 32-bits wide, which
means we cannot have devices with more than 2TB, and the code that
is trying to handle compatibility support for large devices in
md version 0.90 is meaningless but also causes a compile-time warning:
drivers/md/md.c: In function 'super_90_load':
drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow]
drivers/md/md.c: In function 'super_90_rdev_size_change':
drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow]
This adds a check for CONFIG_LBDAF to avoid even getting into this
code path, and also adds an explicit cast to let the compiler know
it doesn't have to warn about the truncation.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3fb632e40d7667d8bedfabc28850ac06d5493f54 upstream.
The sb->super_offset should be big-endian, but the rdev->sb_start is in
host byte order, so fix this by adding cpu_to_le64.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1345921393ba23b60d3fcf15933e699232ad25ae upstream.
The sb->layout is of type __le32, so we shoud use le32_to_cpu.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 82a301cb0ea2df8a5c88213094a01660067c7fb4 upstream.
Fixes: 90f5f7ad4f38("md: Wait for md_check_recovery before attempting device
removal.")
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 47a7b0d8888c04c9746812820b6e60553cc77bbc upstream.
The md-cluster is compiled as module by default,
if it is compiled by built-in way, then we can't
make md-cluster works.
[64782.630008] md/raid1:md127: active with 2 out of 2 mirrors
[64782.630528] md-cluster module not found.
[64782.630530] md127: Could not setup cluster service (-2)
Fixes: edb39c9 ("Introduce md_cluster_operations to handle cluster functions")
Reported-by: Marc Smith <marc.smith@mcc.edu>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9c573de3283af007ea11c17bde1e4568d9417328 upstream.
blk_queue_split marks bio unmergeable, which makes sense for normal bio.
But if dispatching the bio to underlayer disk, the blk_queue_split
checks are invalid, hence it's possible the bio becomes mergeable.
In the reported bug, this bug causes trim against raid0 performance slash
https://bugzilla.kernel.org/show_bug.cgi?id=117051
Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com>
Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio)
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Reviewed-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1501efadc524a0c99494b576923091589a52d2a4 upstream.
It is not safe for an integrity profile to be changed while i/o is
in-flight in the queue. Prevent adding new disks or otherwise online
spares to an array if the device has an incompatible integrity profile.
The original change to the blk_integrity_unregister implementation in
md, commmit c7bfced9a671 "md: suspend i/o during runtime
blk_integrity_unregister" introduced an immediate hang regression.
This policy of disallowing changes the integrity profile once one has
been established is shared with DM.
Here is an abbreviated log from a test run that:
1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [ 59.076127]
2/ Tries to add an integrity-disabled device (pmem1m) [ 90.489209]
3/ Retries with an integrity-enabled device (pmem1s) [ 205.671277]
[ 59.076127] md/raid1:md0: active with 1 out of 2 mirrors
[ 59.078302] md: data integrity enabled on md0
[..]
[ 90.489209] md0: incompatible integrity profile for pmem1m
[..]
[ 205.671277] md: super_written gets error=-5
[ 205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
[ 205.677386] md/raid1:md0: Operation continuing on 1 devices.
[ 205.683037] RAID1 conf printout:
[ 205.684699] --- wd:1 rd:2
[ 205.685972] disk 0, wo:0, o:1, dev:pmem0s
[ 205.687562] disk 1, wo:1, o:1, dev:pmem1s
[ 205.691717] md: recovery of RAID array md0
Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister")
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
md currently doesn't allow a 'sync_action' such as 'reshape' to be set
while MD_RECOVERY_NEEDED is set.
This s a problem, particularly since commit 738a273806ee as that can
cause ->check_shape to call mddev_resume() which sets
MD_RECOVERY_NEEDED. So by the time we come to start 'reshape' it is
very likely that MD_RECOVERY_NEEDED is still set.
Testing for this flag is not really needed and is in any case very
racy as it can be set at any moment - asynchronously. Any race
between setting a sync_action and setting MD_RECOVERY_NEEDED must
already be handled properly in some locked code, probably
md_check_recovery(), so remove the test here.
The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case
so we should test it again after getting mddev_lock().
As this fixes a race and a regression which can cause 'reshape' to
fail, it is suitable for -stable kernels since 4.1
Reported-by: Xiao Ni <xni@redhat.com>
Fixes: 738a273806ee ("md/raid5: fix allocation of 'scribble' array.")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Commit 2910ff17d154baa5eb50e362a91104e831eb2bb6
introduced a regression which would remove a recently added spare via
slot_store. Revert part of the patch which touches slot_store() and add
the disk directly using pers->hot_add_disk()
Fixes: 2910ff17d154 ("md: remove_and_add_spares() to activate specific
rdev")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
The patch c7bfced9a6716ff66c9d61f934bb60af08d4688c committed to 4.4-rc
causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with
this BUG, the reason is that we attempt to suspend a device that is
already suspended. See also
https://bugzilla.redhat.com/show_bug.cgi?id=1283491
This patch fixes the bug by changing functions mddev_suspend and
mddev_resume to always nest.
The number of nested calls to mddev_nested_suspend is kept in the
variable mddev->suspended.
[neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend]
kernel BUG at drivers/md/md.c:317!
CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1
task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001000000000000001111 Not tainted
r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810
r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810
r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff
r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4
r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001
r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1
r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000
r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001
sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800
sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390
IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748
CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000
ORIG_R28: 00000000000000b0
IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod]
IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod]
RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1]
Backtrace:
[<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1]
[<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid]
[<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod]
[<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod]
[<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod]
[<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod]
[<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod]
[<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558
Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Neil pointed out setting journal disk role to raid_disks will confuse
reshape if we support reshape eventually. Switching the role to 0 (we
should be fine as long as the value >=0) and skip sysfs file creation to
avoid error.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
|
|
Pull md updates from Neil Brown:
"Two major components to this update.
1) The clustered-raid1 support from SUSE is nearly complete. There
are a few outstanding issues being worked on. Maybe half a dozen
patches will bring this to a usable state.
2) The first stage of journalled-raid5 support from Facebook makes an
appearance. With a journal device configured (typically NVRAM or
SSD), the "RAID5 write hole" should be closed - a crash during
degraded operations cannot result in data corruption.
The next stage will be to use the journal as a write-behind cache
so that latency can be reduced and in some cases throughput
increased by performing more full-stripe writes.
* tag 'md/4.4' of git://neil.brown.name/md: (66 commits)
MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
MD: set journal disk ->raid_disk
MD: kick out journal disk if it's not fresh
raid5-cache: start raid5 readonly if journal is missing
MD: add new bit to indicate raid array with journal
raid5-cache: IO error handling
raid5: journal disk can't be removed
raid5-cache: add trim support for log
MD: fix info output for journal disk
raid5-cache: use bio chaining
raid5-cache: small log->seq cleanup
raid5-cache: new helper: r5_reserve_log_entry
raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
raid5-cache: take rdev->data_offset into account early on
raid5-cache: refactor bio allocation
raid5-cache: clean up r5l_get_meta
raid5-cache: simplify state machine when caches flushes are not needed
raid5-cache: factor out a helper to run all stripes for an I/O unit
raid5-cache: rename flushed_ios to finished_ios
raid5-cache: free I/O units earlier
...
|
|
Pull block integrity updates from Jens Axboe:
""This is the joint work of Dan and Martin, cleaning up and improving
the support for block data integrity"
* 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
block: blk_flush_integrity() for bio-based drivers
block: move blk_integrity to request_queue
block: generic request_queue reference counting
nvme: suspend i/o during runtime blk_integrity_unregister
md: suspend i/o during runtime blk_integrity_unregister
md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
block: Inline blk_integrity in struct gendisk
block: Export integrity data interval size in sysfs
block: Reduce the size of struct blk_integrity
block: Consolidate static integrity profile properties
block: Move integrity kobject to struct gendisk
|
|
When RAID-4/5/6 array suffers from missing journal device, we put
the array in read only state. We should not allow trasition to
read-write states (clean and active) before replacing journal device.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of
0, because we already have a disk with ->raid_disk 0 and this causes
sysfs entry creation conflict. A lot of places assumes disk with
->raid_disk >=0 is normal raid disk, so we add check for journal disk.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
When journal disk is faulty and we are reassemabling the raid array, the
journal disk is old. We don't allow the journal disk added to the raid
array. Since journal disk is missing in the array, the raid5 will mark
the array readonly.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
If a raid array has journal feature bit set, add a new bit to indicate
this. If the array is started without journal disk existing, we know
there is something wrong.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
journal disk can be faulty. The Journal and Faulty aren't exclusive with
each other.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Journal disk state sysfs entry should indicate it's journal
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
match_mddev_units is used to check whether 2 RAID arrays share
same disk(s). Arrays that share disk(s) will not do resync at the
same time for better performance (fewer HDD seek). However, this
check should not apply to Spare, Faulty, and Journal disks, as
they do not paticipate in resync.
In this patch, match_mddev_units skips check for disks with flag
"Faulty" or "Journal" or raid_disk < 0.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
If a raid array has journal, the journal can guarantee the consistency,
we can skip resync after a unclean shutdown. The exception is raid
creation or user initiated resync, which we still do a raid resync.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
This reverts commit 7eb418851f3278de67126ea0c427641ab4792c57.
This commit is poorly justified, I can find not discusison in email,
and it clearly causes a problem.
If a device which is being recovered fails and is subsequently
re-added to an array, there could easily have been changes to the
array *before* the point where the recovery was up to. So the
recovery must start again from the beginning.
If a spare is being recovered and fails, then when it is re-added we
really should do a bitmap-based recovery up to the recovery-offset,
and then a full recovery from there. Before this reversion, we only
did the "full recovery from there" which is not corect. After this
reversion with will do a full recovery from the start, which is safer
but not ideal.
It will be left to a future patch to arrange the two different styles
of recovery.
Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (3.14+)
Fixes: 7eb418851f32 ("md: allow a partially recovered device to be hot-added to an array.")
|
|
Journal device stores data in a log structure. We need record the log
start. Here we override md superblock recovery_offset for this purpose.
This field of a journal device is meaningless otherwise.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Next patches will use a disk as raid5/6 journaling. We need a new disk
role to present the journal device and add MD_FEATURE_JOURNAL to
feature_map for backward compability.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Add the following two macros for special roles: spare and faulty
MD_DISK_ROLE_SPARE 0xffff
MD_DISK_ROLE_FAULTY 0xfffe
Add MD_DISK_ROLE_MAX 0xff00 as the maximal possible regular role,
and minimal value of special role.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
To incorporate --grow feature executed on one node, other nodes need to
acknowledge the change in number of disks. Call update_raid_disks()
to update internal data structures.
This leads to call check_reshape() -> md_allow_write() -> md_update_sb(),
this results in a deadlock. This is done so it can safely allocate memory
(which might trigger writeback which might write to raid1). This is
not required for md with a bitmap.
In the clustered case, we don't perform md_update_sb() in md_allow_write(),
but in do_md_run(). Also we disable safemode for clustered mode.
mddev->recovery_cp need not be set in check_sb_changes() because this
is required only when a node reads another node's bitmap. mddev->recovery_cp
(which is read from sb->resync_offset), is set only if mddev is in_sync.
Since we disabled safemode, in_sync is set to zero.
In a clustered environment, the MD may not be in sync because another
node could be writing to it. So make sure that in_sync is not set in
case of clustered node in __md_stop_writes().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Synchronize pending i/o against a change in the integrity profile to
avoid the possibility of spurious integrity errors. Given linear_add()
is suspending the mddev before manipulating the mddev, do the same for
the other personalities.
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Now that the integrity profile is statically allocated there is no work
to do when shutting down an integrity enabled block device.
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <JBottomley@Odin.com>
Acked-by: NeilBrown <neilb@suse.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.
This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.
Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:
- Moving the blk_integrity definition to genhd.h.
- Inlining blk_integrity in struct gendisk.
- Removing the dynamic allocation code.
- Adding helper functions which allow gendisk to set up and tear down
the integrity sysfs dir when a disk is added/deleted.
- Adding a blk_integrity_revalidate() callback for updating the stable
pages bdi setting.
- The calls that depend on whether a device has an integrity profile or
not now key off of the bi->profile pointer.
- Simplifying the integrity support routines in DM (Mike Snitzer).
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We shouldn't run related funs of md_cluster_ops in case
metadata_update_start returned failure.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
|
|
For cluster raid, we should not kick it from array if the disk can't be
remove from array successfully.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
|
Adding the disk worked incorrectly with the new reload code. Fix it:
- No operation should be performed on rdev marked as Candidate
- After a metadata update operation, kick disk if role is 0xfffe
else clear Candidate bit and continue with the regular change check.
- Saving the mode of the lock resource to check if token lock is already
locked, because it can be called twice while adding a disk. However,
unlock_comm() must be called only once.
- add_new_disk() is called by the node initiating the --add operation.
If it needs to be canceled, call add_new_disk_cancel(). The operation
is completed by md_update_sb() which will write and unlock the
communication.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
|
Resync or recovery must be performed by only one node at a time.
A DLM lock resource, resync_lockres provides the mutual exclusion
so that only one node performs the recovery/resync at a time.
If a node is unable to get the resync_lockres, because recovery is
being performed by another node, it set MD_RECOVER_NEEDED so as
to schedule recovery in the future.
Remove the debug message in resync_info_update()
used during development.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
|
In a clustered environment, a change such as marking a device faulty,
can be recorded by any of the nodes. This is communicated to all the
nodes and re-recording such a change is unnecessary, and quite often
pretty disruptive.
With this patch, just before the update, we detect for the changes
and if the changes are already in superblock, we abort the update
after clearing all the flags
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
|
md_reload_sb is too simplistic and it explicitly needs to determine
the changes made by the writing node. However, there are multiple areas
where a simple reload could fail.
Instead, read the superblock of one of the "good" rdevs and update
the necessary information:
- read the superblock into a newly allocated page, by temporarily
swapping out rdev->sb_page and calling ->load_super.
- if that fails return
- if it succeeds, call check_sb_changes
1. iterates over list of active devices and checks the matching
dev_roles[] value.
If that is 'faulty', the device must be marked as faulty
- call md_error to mark the device as faulty. Make sure
not to set CHANGE_DEVS and wakeup mddev->thread or else
it would initiate a resync process, which is the responsibility
of the "primary" node.
- clear the Blocked bit
- Call remove_and_add_spares() to hot remove the device.
If the device is 'spare':
- call remove_and_add_spares() to get the number of spares
added in this operation.
- Reduce mddev->degraded to mark the array as not degraded.
2. reset recovery_cp
- read the rest of the rdevs to update recovery_offset. If recovery_offset
is equal to MaxSector, call spare_active() to set it In_sync
This required that recovery_offset be initialized to MaxSector, as
opposed to zero so as to communicate the end of sync for a rdev.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
|
remove_and_add_spares() checks for all devices to activate spare.
Change it to activate a specific device if a non-null rdev
argument is passed.
remove_and_add_spares() can be used to activate spares in
slot_store() as well.
For hot_remove_disk(), check if rdev->raid_disk == -1 before
calling remove_and_add_spares()
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
|
|
Suspending the entire device for resync could take too long. Resync
in small chunks.
cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:
1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
list.
bitmap_cond_end_sync is modified to allow to force a sync inorder
to get the curr_resync_completed uptodate with the sector passed.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
to assemble a clustered device.
In order to maximize compatibility, the major version is set to
BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.
Added MD_FEATURE_CLUSTERED in order to return error for older
kernels which would assemble MD even if the bitmap is corrupted.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|