summaryrefslogtreecommitdiff
path: root/drivers/md/md.c
AgeCommit message (Collapse)Author
2021-05-22md: md_open returns -EBUSY when entering racing areaZhao Heming
commit 6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7 upstream. commit d3374825ce57 ("md: make devices disappear when they are no longer needed.") introduced protection between mddev creating & removing. The md_open shouldn't create mddev when all_mddevs list doesn't contain mddev. With currently code logic, there will be very easy to trigger soft lockup in non-preempt env. This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which will break the infinitely retry when md_open enter racing area. This patch is partly fix soft lockup issue, full fix needs mddev_find is split into two functions: mddev_find & mddev_find_or_alloc. And md_open should call new mddev_find (it only does searching job). For more detail, please refer with Christoph's "split mddev_find" patch in later commits. *** env *** kvm-qemu VM 2C1G with 2 iscsi luns kernel should be non-preempt *** script *** about trigger every time with below script ``` 1 node1="mdcluster1" 2 node2="mdcluster2" 3 4 mdadm -Ss 5 ssh ${node2} "mdadm -Ss" 6 wipefs -a /dev/sda /dev/sdb 7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \ /dev/sdb --assume-clean 8 9 for i in {1..10}; do 10 echo ==== $i ====; 11 12 echo "test ...." 13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb" 14 sleep 1 15 16 echo "clean ....." 17 ssh ${node2} "mdadm -Ss" 18 done ``` I use mdcluster env to trigger soft lockup, but it isn't mdcluster speical bug. To stop md array in mdcluster env will do more jobs than non-cluster array, which will leave enough time/gap to allow kernel to run md_open. *** stack *** ``` [ 884.226509] mddev_put+0x1c/0xe0 [md_mod] [ 884.226515] md_open+0x3c/0xe0 [md_mod] [ 884.226518] __blkdev_get+0x30d/0x710 [ 884.226520] ? bd_acquire+0xd0/0xd0 [ 884.226522] blkdev_get+0x14/0x30 [ 884.226524] do_dentry_open+0x204/0x3a0 [ 884.226531] path_openat+0x2fc/0x1520 [ 884.226534] ? seq_printf+0x4e/0x70 [ 884.226536] do_filp_open+0x9b/0x110 [ 884.226542] ? md_release+0x20/0x20 [md_mod] [ 884.226543] ? seq_read+0x1d8/0x3e0 [ 884.226545] ? kmem_cache_alloc+0x18a/0x270 [ 884.226547] ? do_sys_open+0x1bd/0x260 [ 884.226548] do_sys_open+0x1bd/0x260 [ 884.226551] do_syscall_64+0x5b/0x1e0 [ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` *** rootcause *** "mdadm -A" (or other array assemble commands) will start a daemon "mdadm --monitor" by default. When "mdadm -Ss" is running, the stop action will wakeup "mdadm --monitor". The "--monitor" daemon will immediately get info from /proc/mdstat. This time mddev in kernel still exist, so /proc/mdstat still show md device, which makes "mdadm --monitor" to open /dev/md0. The previously "mdadm -Ss" is removing action, the "mdadm --monitor" open action will trigger md_open which is creating action. Racing is happening. ``` <thread 1>: "mdadm -Ss" md_release mddev_put deletes mddev from all_mddevs queue_work for mddev_delayed_delete at this time, "/dev/md0" is still available for opening <thread 2>: "mdadm --monitor ..." md_open + mddev_find can't find mddev of /dev/md0, and create a new mddev and | return. + trigger "if (mddev->gendisk != bdev->bd_disk)" and return -ERESTARTSYS. ``` In non-preempt kernel, <thread 2> is occupying on current CPU. and mddev_delayed_delete which was created in <thread 1> also can't be schedule. In preempt kernel, it can also trigger above racing. But kernel doesn't allow one thread running on a CPU all the time. after <thread 2> running some time, the later "mdadm -A" (refer above script line 13) will call md_alloc to alloc a new gendisk for mddev. it will break md_open statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller, the soft lockup is broken. Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22md: factor out a mddev_find_locked helper from mddev_findChristoph Hellwig
commit 8b57251f9a91f5e5a599de7549915d2d226cc3af upstream. Factor out a self-contained helper to just lookup a mddev by the dev_t "unit". Cc: stable@vger.kernel.org Reviewed-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-20md: don't flush workqueue unconditionally in md_openGuoqing Jiang
[ Upstream commit f6766ff6afff70e2aaf39e1511e16d471de7c3ae ] We need to check mddev->del_work before flush workqueu since the purpose of flush is to ensure the previous md is disappeared. Otherwise the similar deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the bdev->bd_mutex before flush workqueue. kernel: [ 154.522645] ====================================================== kernel: [ 154.522647] WARNING: possible circular locking dependency detected kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O kernel: [ 154.522651] ------------------------------------------------------ kernel: [ 154.522653] mdadm/2482 is trying to acquire lock: kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0 kernel: [ 154.522673] kernel: [ 154.522673] but task is already holding lock: kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522691] kernel: [ 154.522691] which lock already depends on the new lock. kernel: [ 154.522691] kernel: [ 154.522694] kernel: [ 154.522694] the existing dependency chain (in reverse order) is: kernel: [ 154.522696] kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}: kernel: [ 154.522704] __mutex_lock+0x87/0x950 kernel: [ 154.522706] __blkdev_get+0x79/0x590 kernel: [ 154.522708] blkdev_get+0x65/0x140 kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40 kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod] kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod] kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod] kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522735] vfs_write+0xad/0x1a0 kernel: [ 154.522737] ksys_write+0xa4/0xe0 kernel: [ 154.522745] do_syscall_64+0x64/0x2b0 kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522749] kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}: kernel: [ 154.522752] __mutex_lock+0x87/0x950 kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod] kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522763] vfs_write+0xad/0x1a0 kernel: [ 154.522765] ksys_write+0xa4/0xe0 kernel: [ 154.522767] do_syscall_64+0x64/0x2b0 kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522770] kernel: [ 154.522770] -> #2 (kn->count#253){++++}: kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0 kernel: [ 154.522778] kernfs_remove+0x1f/0x30 kernel: [ 154.522780] kobject_del+0x28/0x60 kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod] kernel: [ 154.522786] process_one_work+0x2a7/0x5f0 kernel: [ 154.522788] worker_thread+0x2d/0x3d0 kernel: [ 154.522793] kthread+0x117/0x130 kernel: [ 154.522795] ret_from_fork+0x3a/0x50 kernel: [ 154.522796] kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}: kernel: [ 154.522800] process_one_work+0x27e/0x5f0 kernel: [ 154.522802] worker_thread+0x2d/0x3d0 kernel: [ 154.522804] kthread+0x117/0x130 kernel: [ 154.522806] ret_from_fork+0x3a/0x50 kernel: [ 154.522807] kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}: kernel: [ 154.522813] __lock_acquire+0x1392/0x1690 kernel: [ 154.522816] lock_acquire+0xb4/0x1a0 kernel: [ 154.522818] flush_workqueue+0xab/0x4b0 kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522823] __blkdev_get+0xea/0x590 kernel: [ 154.522825] blkdev_get+0x65/0x140 kernel: [ 154.522828] do_dentry_open+0x1d1/0x380 kernel: [ 154.522831] path_openat+0x567/0xcc0 kernel: [ 154.522834] do_filp_open+0x9b/0x110 kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522838] do_sys_open+0x57/0x80 kernel: [ 154.522840] do_syscall_64+0x64/0x2b0 kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522844] kernel: [ 154.522844] other info that might help us debug this: kernel: [ 154.522844] kernel: [ 154.522846] Chain exists of: kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex kernel: [ 154.522846] kernel: [ 154.522850] Possible unsafe locking scenario: kernel: [ 154.522850] kernel: [ 154.522852] CPU0 CPU1 kernel: [ 154.522853] ---- ---- kernel: [ 154.522854] lock(&bdev->bd_mutex); kernel: [ 154.522856] lock(&mddev->reconfig_mutex); kernel: [ 154.522858] lock(&bdev->bd_mutex); kernel: [ 154.522860] lock((wq_completion)md_misc); kernel: [ 154.522861] kernel: [ 154.522861] *** DEADLOCK *** kernel: [ 154.522861] kernel: [ 154.522864] 1 lock held by mdadm/2482: kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522868] kernel: [ 154.522868] stack backtrace: kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25 kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 kernel: [ 154.522878] Call Trace: kernel: [ 154.522881] dump_stack+0x8f/0xcb kernel: [ 154.522884] check_noncircular+0x194/0x1b0 kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690 kernel: [ 154.522890] __lock_acquire+0x1392/0x1690 kernel: [ 154.522893] lock_acquire+0xb4/0x1a0 kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522898] flush_workqueue+0xab/0x4b0 kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522910] __blkdev_get+0xea/0x590 kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522914] blkdev_get+0x65/0x140 kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522918] do_dentry_open+0x1d1/0x380 kernel: [ 154.522921] path_openat+0x567/0xcc0 kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690 kernel: [ 154.522926] do_filp_open+0x9b/0x110 kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0 kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630 kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0 kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522944] do_sys_open+0x57/0x80 kernel: [ 154.522946] do_syscall_64+0x64/0x2b0 kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae And md_alloc also flushed the same workqueue, but the thing is different here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and the flush is necessary to avoid race condition, so leave it as it is. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05md: don't set In_sync if array is frozenGuoqing Jiang
[ Upstream commit 062f5b2ae12a153644c765e7ba3b0f825427be1d ] When a disk is added to array, the following path is called in mdadm. Manage_subdevs -> sysfs_freeze_array -> Manage_add -> sysfs_set_str(&info, NULL, "sync_action","idle") Then from kernel side, Manage_add invokes the path (add_new_disk -> validate_super = super_1_validate) to set In_sync flag. Since In_sync means "device is in_sync with rest of array", and the new added disk need to resync thread to help the synchronization of data. And md_reap_sync_thread would call spare_active to set In_sync for the new added disk finally. So don't set In_sync if array is in frozen. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05md: don't call spare_active in md_reap_sync_thread if all member devices ↵Guoqing Jiang
can't work [ Upstream commit 0d8ed0e9bf9643f27f4816dca61081784dedb38d ] When add one disk to array, the md_reap_sync_thread is responsible to activate the spare and set In_sync flag for the new member in spare_active(). But if raid1 has one member disk A, and disk B is added to the array. Then we offline A before all the datas are synchronized from A to B, obviously B doesn't have the latest data as A, but B is still marked with In_sync flag. So let's not call spare_active under the condition, otherwise B is still showed with 'U' state which is not correct. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-21md: fix for divide error in status_resyncMariusz Tkaczyk
[ Upstream commit 9642fa73d073527b0cbc337cc17a47d545d82cd2 ] Stopping external metadata arrays during resync/recovery causes retries, loop of interrupting and starting reconstruction, until it hit at good moment to stop completely. While these retries curr_mark_cnt can be small- especially on HDD drives, so subtraction result can be smaller than 0. However it is casted to uint without checking. As a result of it the status bar in /proc/mdstat while stopping is strange (it jumps between 0% and 99%). The real problem occurs here after commit 72deb455b5ec ("block: remove CONFIG_LBDAF"). Sector_div() macro has been changed, now the divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0. Check if db value can be really counted and replace these macro by div64_u64() inline. Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-11md: add mddev->pers to avoid potential NULL pointer dereferenceYufen Yu
commit ee37e62191a59d253fc916b9fc763deb777211e2 upstream. When doing re-add, we need to ensure rdev->mddev->pers is not NULL, which can avoid potential NULL pointer derefence in fallowing add_bound_rdev(). Fixes: a6da4ef85cef ("md: re-add a failed disk") Cc: Xiao Ni <xni@redhat.com> Cc: NeilBrown <neilb@suse.com> Cc: <stable@vger.kernel.org> # 4.4+ Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21MD: fix invalid stored role for a disk - try2Shaohua Li
commit 9e753ba9b9b405e3902d9f08aec5f2ea58a0c317 upstream. Commit d595567dc4f0 (MD: fix invalid stored role for a disk) broke linear hotadd. Let's only fix the role for disks in raid1/10. Based on Guoqing's original patch. Reported-by: kernel test robot <rong.a.chen@intel.com> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21MD: fix invalid stored role for a diskShaohua Li
[ Upstream commit d595567dc4f0c1d90685ec1e2e296e2cad2643ac ] If we change the number of array's device after device is removed from array, then add the device back to array, we can see that device is added as active role instead of spare which we expected. Please see the below link for details: https://marc.info/?l=linux-raid&m=153736982015076&w=2 This is caused by that we prefer to use device's previous role which is recorded by saved_raid_disk, but we should respect the new number of conf->raid_disks since it could be changed after device is removed. Reported-by: Gioh Kim <gi-oh.kim@profitbricks.com> Tested-by: Gioh Kim <gi-oh.kim@profitbricks.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-06md: fix NULL dereference of mddev->pers in remove_and_add_spares()Yufen Yu
[ Upstream commit c42a0e2675721e1444f56e6132a07b7b1ec169ac ] We met NULL pointer BUG as follow: [ 151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060 [ 151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0 [ 151.762039] Oops: 0000 [#1] SMP PTI [ 151.762406] Modules linked in: [ 151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238 [ 151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014 [ 151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0 [ 151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246 [ 151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000 [ 151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000 [ 151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051 [ 151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600 [ 151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000 [ 151.769160] FS: 00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000 [ 151.769971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0 [ 151.771272] Call Trace: [ 151.771542] md_ioctl+0x1df2/0x1e10 [ 151.771906] ? __switch_to+0x129/0x440 [ 151.772295] ? __schedule+0x244/0x850 [ 151.772672] blkdev_ioctl+0x4bd/0x970 [ 151.773048] block_ioctl+0x39/0x40 [ 151.773402] do_vfs_ioctl+0xa4/0x610 [ 151.773770] ? dput.part.23+0x87/0x100 [ 151.774151] ksys_ioctl+0x70/0x80 [ 151.774493] __x64_sys_ioctl+0x16/0x20 [ 151.774877] do_syscall_64+0x5b/0x180 [ 151.775258] entry_SYSCALL_64_after_hwframe+0x44/0xa9 For raid6, when two disk of the array are offline, two spare disks can be added into the array. Before spare disks recovery completing, system reboot and mdadm thinks it is ok to restart the degraded array by md_ioctl(). Since disks in raid6 is not only_parity(), raid5_run() will abort, when there is no PPL feature or not setting 'start_dirty_degraded' parameter. Therefore, mddev->pers is NULL. But, mddev->raid_disks has been set and it will not be cleared when raid5_run abort. md_ioctl() can execute cmd 'HOT_REMOVE_DISK' to remove a disk by mdadm, which will cause NULL pointer dereference in remove_and_add_spares() finally. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03md: fix two problems with setting the "re-add" device state.NeilBrown
commit 011abdc9df559ec75779bb7c53a744c69b2a94c6 upstream. If "re-add" is written to the "state" file for a device which is faulty, this has an effect similar to removing and re-adding the device. It should take up the same slot in the array that it previously had, and an accelerated (e.g. bitmap-based) rebuild should happen. The slot that "it previously had" is determined by rdev->saved_raid_disk. However this is not set when a device fails (only when a device is added), and it is cleared when resync completes. This means that "re-add" will normally work once, but may not work a second time. This patch includes two fixes. 1/ when a device fails, record the ->raid_disk value in ->saved_raid_disk before clearing ->raid_disk 2/ when "re-add" is written to a device for which ->saved_raid_disk is not set, fail. I think this is suitable for stable as it can cause re-adding a device to be forced to do a full resync which takes a lot longer and so puts data at more risk. Cc: <stable@vger.kernel.org> (v4.1) Fixes: 97f6cd39da22 ("md-cluster: re-add capabilities") Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25md: avoid warning for 32-bit sector_tArnd Bergmann
commit 3312c951efaba55080958974047414576b9e5d63 upstream. When CONFIG_LBDAF is not set, sector_t is only 32-bits wide, which means we cannot have devices with more than 2TB, and the code that is trying to handle compatibility support for large devices in md version 0.90 is meaningless but also causes a compile-time warning: drivers/md/md.c: In function 'super_90_load': drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow] drivers/md/md.c: In function 'super_90_rdev_size_change': drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow] This adds a check for CONFIG_LBDAF to avoid even getting into this code path, and also adds an explicit cast to let the compiler know it doesn't have to warn about the truncation. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15md: fix super_offset endianness in super_1_rdev_size_changeJason Yan
commit 3fb632e40d7667d8bedfabc28850ac06d5493f54 upstream. The sb->super_offset should be big-endian, but the rdev->sb_start is in host byte order, so fix this by adding cpu_to_le64. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15md: fix incorrect use of lexx_to_cpu in does_sb_need_changingJason Yan
commit 1345921393ba23b60d3fcf15933e699232ad25ae upstream. The sb->layout is of type __le32, so we shoud use le32_to_cpu. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12md: MD_RECOVERY_NEEDED is set for mddev->recoveryShaohua Li
commit 82a301cb0ea2df8a5c88213094a01660067c7fb4 upstream. Fixes: 90f5f7ad4f38("md: Wait for md_check_recovery before attempting device removal.") Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-24md-cluster: make md-cluster also can work when compiled into kernelGuoqing Jiang
commit 47a7b0d8888c04c9746812820b6e60553cc77bbc upstream. The md-cluster is compiled as module by default, if it is compiled by built-in way, then we can't make md-cluster works. [64782.630008] md/raid1:md127: active with 2 out of 2 mirrors [64782.630528] md-cluster module not found. [64782.630530] md127: Could not setup cluster service (-2) Fixes: edb39c9 ("Introduce md_cluster_operations to handle cluster functions") Reported-by: Marc Smith <marc.smith@mcc.edu> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-11MD: make bio mergeableShaohua Li
commit 9c573de3283af007ea11c17bde1e4568d9417328 upstream. blk_queue_split marks bio unmergeable, which makes sense for normal bio. But if dispatching the bio to underlayer disk, the blk_queue_split checks are invalid, hence it's possible the bio becomes mergeable. In the reported bug, this bug causes trim against raid0 performance slash https://bugzilla.kernel.org/show_bug.cgi?id=117051 Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com> Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio) Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Reviewed-by: Jens Axboe <axboe@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17md/raid: only permit hot-add of compatible integrity profilesDan Williams
commit 1501efadc524a0c99494b576923091589a52d2a4 upstream. It is not safe for an integrity profile to be changed while i/o is in-flight in the queue. Prevent adding new disks or otherwise online spares to an array if the device has an incompatible integrity profile. The original change to the blk_integrity_unregister implementation in md, commmit c7bfced9a671 "md: suspend i/o during runtime blk_integrity_unregister" introduced an immediate hang regression. This policy of disallowing changes the integrity profile once one has been established is shared with DM. Here is an abbreviated log from a test run that: 1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [ 59.076127] 2/ Tries to add an integrity-disabled device (pmem1m) [ 90.489209] 3/ Retries with an integrity-enabled device (pmem1s) [ 205.671277] [ 59.076127] md/raid1:md0: active with 1 out of 2 mirrors [ 59.078302] md: data integrity enabled on md0 [..] [ 90.489209] md0: incompatible integrity profile for pmem1m [..] [ 205.671277] md: super_written gets error=-5 [ 205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device. [ 205.677386] md/raid1:md0: Operation continuing on 1 devices. [ 205.683037] RAID1 conf printout: [ 205.684699] --- wd:1 rd:2 [ 205.685972] disk 0, wo:0, o:1, dev:pmem0s [ 205.687562] disk 1, wo:1, o:1, dev:pmem1s [ 205.691717] md: recovery of RAID array md0 Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister") Cc: Mike Snitzer <snitzer@redhat.com> Reported-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-21md: remove check for MD_RECOVERY_NEEDED in action_store.NeilBrown
md currently doesn't allow a 'sync_action' such as 'reshape' to be set while MD_RECOVERY_NEEDED is set. This s a problem, particularly since commit 738a273806ee as that can cause ->check_shape to call mddev_resume() which sets MD_RECOVERY_NEEDED. So by the time we come to start 'reshape' it is very likely that MD_RECOVERY_NEEDED is still set. Testing for this flag is not really needed and is in any case very racy as it can be set at any moment - asynchronously. Any race between setting a sync_action and setting MD_RECOVERY_NEEDED must already be handled properly in some locked code, probably md_check_recovery(), so remove the test here. The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case so we should test it again after getting mddev_lock(). As this fixes a race and a regression which can cause 'reshape' to fail, it is suitable for -stable kernels since 4.1 Reported-by: Xiao Ni <xni@redhat.com> Fixes: 738a273806ee ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18Fix remove_and_add_spares removes drive added as spare in slot_storeGoldwyn Rodrigues
Commit 2910ff17d154baa5eb50e362a91104e831eb2bb6 introduced a regression which would remove a recently added spare via slot_store. Revert part of the patch which touches slot_store() and add the disk directly using pers->hot_add_disk() Fixes: 2910ff17d154 ("md: remove_and_add_spares() to activate specific rdev") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18md: fix bug due to nested suspendMikulas Patocka
The patch c7bfced9a6716ff66c9d61f934bb60af08d4688c committed to 4.4-rc causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with this BUG, the reason is that we attempt to suspend a device that is already suspended. See also https://bugzilla.redhat.com/show_bug.cgi?id=1283491 This patch fixes the bug by changing functions mddev_suspend and mddev_resume to always nest. The number of nested calls to mddev_nested_suspend is kept in the variable mddev->suspended. [neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend] kernel BUG at drivers/md/md.c:317! CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1 task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001000000000000001111 Not tainted r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810 r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810 r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4 r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001 r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1 r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000 r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001 sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390 IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748 CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000 ORIG_R28: 00000000000000b0 IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod] IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod] RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1] Backtrace: [<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1] [<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid] [<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod] [<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod] [<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod] [<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod] [<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod] [<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558 Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18MD: change journal disk role to disk 0Shaohua Li
Neil pointed out setting journal disk role to raid_disks will confuse reshape if we support reshape eventually. Switching the role to 0 (we should be fine as long as the value >=0) and skip sysfs file creation to avoid error. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-07block: change ->make_request_fn() and users to return a queue cookieJens Axboe
No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-04Merge tag 'md/4.4' of git://neil.brown.name/mdLinus Torvalds
Pull md updates from Neil Brown: "Two major components to this update. 1) The clustered-raid1 support from SUSE is nearly complete. There are a few outstanding issues being worked on. Maybe half a dozen patches will bring this to a usable state. 2) The first stage of journalled-raid5 support from Facebook makes an appearance. With a journal device configured (typically NVRAM or SSD), the "RAID5 write hole" should be closed - a crash during degraded operations cannot result in data corruption. The next stage will be to use the journal as a write-behind cache so that latency can be reduced and in some cases throughput increased by performing more full-stripe writes. * tag 'md/4.4' of git://neil.brown.name/md: (66 commits) MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW MD: set journal disk ->raid_disk MD: kick out journal disk if it's not fresh raid5-cache: start raid5 readonly if journal is missing MD: add new bit to indicate raid array with journal raid5-cache: IO error handling raid5: journal disk can't be removed raid5-cache: add trim support for log MD: fix info output for journal disk raid5-cache: use bio chaining raid5-cache: small log->seq cleanup raid5-cache: new helper: r5_reserve_log_entry raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta raid5-cache: take rdev->data_offset into account early on raid5-cache: refactor bio allocation raid5-cache: clean up r5l_get_meta raid5-cache: simplify state machine when caches flushes are not needed raid5-cache: factor out a helper to run all stripes for an I/O unit raid5-cache: rename flushed_ios to finished_ios raid5-cache: free I/O units earlier ...
2015-11-04Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk
2015-11-01MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RWSong Liu
When RAID-4/5/6 array suffers from missing journal device, we put the array in read only state. We should not allow trasition to read-write states (clean and active) before replacing journal device. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01MD: set journal disk ->raid_diskShaohua Li
Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of 0, because we already have a disk with ->raid_disk 0 and this causes sysfs entry creation conflict. A lot of places assumes disk with ->raid_disk >=0 is normal raid disk, so we add check for journal disk. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01MD: kick out journal disk if it's not freshSong Liu
When journal disk is faulty and we are reassemabling the raid array, the journal disk is old. We don't allow the journal disk added to the raid array. Since journal disk is missing in the array, the raid5 will mark the array readonly. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01MD: add new bit to indicate raid array with journalSong Liu
If a raid array has journal feature bit set, add a new bit to indicate this. If the array is started without journal disk existing, we know there is something wrong. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01MD: fix info output for journal diskShaohua Li
journal disk can be faulty. The Journal and Faulty aren't exclusive with each other. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01md: show journal for journal disk in disk state sysfsShaohua Li
Journal disk state sysfs entry should indicate it's journal Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01skip match_mddev_units check for special rolesSong Liu
match_mddev_units is used to check whether 2 RAID arrays share same disk(s). Arrays that share disk(s) will not do resync at the same time for better performance (fewer HDD seek). However, this check should not apply to Spare, Faulty, and Journal disks, as they do not paticipate in resync. In this patch, match_mddev_units skips check for disks with flag "Faulty" or "Journal" or raid_disk < 0. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01md: skip resync for raid array with journalShaohua Li
If a raid array has journal, the journal can guarantee the consistency, we can skip resync after a unclean shutdown. The exception is raid creation or user initiated resync, which we still do a raid resync. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-31Revert "md: allow a partially recovered device to be hot-added to an array."NeilBrown
This reverts commit 7eb418851f3278de67126ea0c427641ab4792c57. This commit is poorly justified, I can find not discusison in email, and it clearly causes a problem. If a device which is being recovered fails and is subsequently re-added to an array, there could easily have been changes to the array *before* the point where the recovery was up to. So the recovery must start again from the beginning. If a spare is being recovered and fails, then when it is re-added we really should do a bitmap-based recovery up to the recovery-offset, and then a full recovery from there. Before this reversion, we only did the "full recovery from there" which is not corect. After this reversion with will do a full recovery from the start, which is safer but not ideal. It will be left to a future patch to arrange the two different styles of recovery. Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org (3.14+) Fixes: 7eb418851f32 ("md: allow a partially recovered device to be hot-added to an array.")
2015-10-24md: override md superblock recovery_offset for journal deviceShaohua Li
Journal device stores data in a log structure. We need record the log start. Here we override md superblock recovery_offset for this purpose. This field of a journal device is meaningless otherwise. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24MD: add a new disk role to present write journal deviceSong Liu
Next patches will use a disk as raid5/6 journaling. We need a new disk role to present the journal device and add MD_FEATURE_JOURNAL to feature_map for backward compability. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24MD: replace special disk roles with macrosSong Liu
Add the following two macros for special roles: spare and faulty MD_DISK_ROLE_SPARE 0xffff MD_DISK_ROLE_FAULTY 0xfffe Add MD_DISK_ROLE_MAX 0xff00 as the maximal possible regular role, and minimal value of special role. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24md-cluster: Call update_raid_disks() if another node --grow's raid_disksGoldwyn Rodrigues
To incorporate --grow feature executed on one node, other nodes need to acknowledge the change in number of disks. Call update_raid_disks() to update internal data structures. This leads to call check_reshape() -> md_allow_write() -> md_update_sb(), this results in a deadlock. This is done so it can safely allocate memory (which might trigger writeback which might write to raid1). This is not required for md with a bitmap. In the clustered case, we don't perform md_update_sb() in md_allow_write(), but in do_md_run(). Also we disable safemode for clustered mode. mddev->recovery_cp need not be set in check_sb_changes() because this is required only when a node reads another node's bitmap. mddev->recovery_cp (which is read from sb->resync_offset), is set only if mddev is in_sync. Since we disabled safemode, in_sync is set to zero. In a clustered environment, the MD may not be in sync because another node could be writing to it. So make sure that in_sync is not set in case of clustered node in __md_stop_writes(). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-21md: suspend i/o during runtime blk_integrity_unregisterDan Williams
Synchronize pending i/o against a change in the integrity profile to avoid the possibility of spurious integrity errors. Given linear_add() is suspending the mddev before manipulating the mddev, do the same for the other personalities. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdownDan Williams
Now that the integrity profile is statically allocated there is no work to do when shutting down an integrity enabled block device. Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <JBottomley@Odin.com> Acked-by: NeilBrown <neilb@suse.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21block: Inline blk_integrity in struct gendiskMartin K. Petersen
Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12md: check the return value for metadata_update_startGuoqing Jiang
We shouldn't run related funs of md_cluster_ops in case metadata_update_start returned failure. Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12md-cluster: only call kick_rdev_from_array after remove disk successfullyGuoqing Jiang
For cluster raid, we should not kick it from array if the disk can't be remove from array successfully. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: Fix adding of new disk with new reload codeGoldwyn Rodrigues
Adding the disk worked incorrectly with the new reload code. Fix it: - No operation should be performed on rdev marked as Candidate - After a metadata update operation, kick disk if role is 0xfffe else clear Candidate bit and continue with the regular change check. - Saving the mode of the lock resource to check if token lock is already locked, because it can be called twice while adding a disk. However, unlock_comm() must be called only once. - add_new_disk() is called by the node initiating the --add operation. If it needs to be canceled, call add_new_disk_cancel(). The operation is completed by md_update_sb() which will write and unlock the communication. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: Perform resync/recovery under a DLM lockGoldwyn Rodrigues
Resync or recovery must be performed by only one node at a time. A DLM lock resource, resync_lockres provides the mutual exclusion so that only one node performs the recovery/resync at a time. If a node is unable to get the resync_lockres, because recovery is being performed by another node, it set MD_RECOVER_NEEDED so as to schedule recovery in the future. Remove the debug message in resync_info_update() used during development. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: Perform a lazy updateGoldwyn Rodrigues
In a clustered environment, a change such as marking a device faulty, can be recorded by any of the nodes. This is communicated to all the nodes and re-recording such a change is unnecessary, and quite often pretty disruptive. With this patch, just before the update, we detect for the changes and if the changes are already in superblock, we abort the update after clearing all the flags Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: Improve md_reload_sb to be less error proneGoldwyn Rodrigues
md_reload_sb is too simplistic and it explicitly needs to determine the changes made by the writing node. However, there are multiple areas where a simple reload could fail. Instead, read the superblock of one of the "good" rdevs and update the necessary information: - read the superblock into a newly allocated page, by temporarily swapping out rdev->sb_page and calling ->load_super. - if that fails return - if it succeeds, call check_sb_changes 1. iterates over list of active devices and checks the matching dev_roles[] value. If that is 'faulty', the device must be marked as faulty - call md_error to mark the device as faulty. Make sure not to set CHANGE_DEVS and wakeup mddev->thread or else it would initiate a resync process, which is the responsibility of the "primary" node. - clear the Blocked bit - Call remove_and_add_spares() to hot remove the device. If the device is 'spare': - call remove_and_add_spares() to get the number of spares added in this operation. - Reduce mddev->degraded to mark the array as not degraded. 2. reset recovery_cp - read the rest of the rdevs to update recovery_offset. If recovery_offset is equal to MaxSector, call spare_active() to set it In_sync This required that recovery_offset be initialized to MaxSector, as opposed to zero so as to communicate the end of sync for a rdev. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md: remove_and_add_spares() to activate specific rdevGoldwyn Rodrigues
remove_and_add_spares() checks for all devices to activate spare. Change it to activate a specific device if a non-null rdev argument is passed. remove_and_add_spares() can be used to activate spares in slot_store() as well. For hot_remove_disk(), check if rdev->raid_disk == -1 before calling remove_and_add_spares() Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: Use a small window for resyncGoldwyn Rodrigues
Suspending the entire device for resync could take too long. Resync in small chunks. cluster's resync window (32M) is maintained in r1conf as cluster_sync_low and cluster_sync_high and processed in raid1's sync_request(). If the current resync is outside the cluster resync window: 1. Set the cluster_sync_low to curr_resync_completed. 2. Check if the sync will fit in the new window, if not issue a wait_barrier() and set cluster_sync_low to sector_nr. 3. Set cluster_sync_high to cluster_sync_low + resync_window. 4. Send a message to all nodes so they may add it in their suspension list. bitmap_cond_end_sync is modified to allow to force a sync inorder to get the curr_resync_completed uptodate with the sector passed. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-10-12md: Increment version for clustered bitmapsGoldwyn Rodrigues
Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels to assemble a clustered device. In order to maximize compatibility, the major version is set to BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered. Added MD_FEATURE_CLUSTERED in order to return error for older kernels which would assemble MD even if the bitmap is corrupted. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>