| Age | Commit message (Collapse) | Author |
|
commit 34cab6f42003cb06f48f86a86652984dec338ae9 upstream.
When we get a read error from the last working device, we don't
try to repair it, and don't fail the device. We simple report a
read error to the caller.
However the current test for 'is this the last working device' is
wrong.
When there is only one fully working device, it assumes that a
non-faulty device is that device. However a spare which is rebuilding
would be non-faulty but so not the only working device.
So change the test from "!Faulty" to "In_sync". If ->degraded says
there is only one fully working device and this device is in_sync,
this must be the one.
This bug has existed since we allowed read_balance to read from
a recovering spare in v3.0
Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 1c7518794a3647eb345d59ee52844e8a40405198 upstream.
Allocate memory using GFP_NOIO when deleting a btree. dm_btree_del()
can be called via an ioctl and we don't want to recurse into the FS or
block layer.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit a822c83e47d97cdef38c4352e1ef62d9f46cfe98 upstream.
Given the pool's cell_sort_array holds 8192 pointers it triggers an
order 5 allocation via kmalloc. This order 5 allocation is prone to
failure as system memory gets more fragmented over time.
Fix this by allocating the cell_sort_array using vmalloc.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[bwh: Backported to 3.2: make a similar change in prison_{create,destroy}()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 4c7e309340ff85072e96f529582d159002c36734 upstream.
redistribute3() shares entries out across 3 nodes. Some entries were
being moved the wrong way, breaking the ordering. This manifested as a
BUG() in dm-btree-remove.c:shift() when entries were removed from the
btree.
For additional context see:
https://www.redhat.com/archives/dm-devel/2015-May/msg00113.html
Signed-off-by: Dennis Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 6e9eac2dcee5e19f125967dd2be3e36558c42fff upstream.
If any memory allocation in resize_stripes fails we will return
-ENOMEM, but in some cases we update conf->pool_size anyway.
This means that if we try again, the allocations will be assumed
to be larger than they are, and badness results.
So only update pool_size if there is no error.
This bug was introduced in 2.6.17 and the patch is suitable for
-stable.
Fixes: ad01c9e3752f ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
REQ_WRITE_SAME
commit e5db29806b99ce2b2640d2e4d4fcb983cea115c5 upstream.
Since it's possible for the discard and write same queue limits to
change while the upper level command is being sliced and diced, fix up
both of them (a) to reject IO if the special command is unsupported at
the start of the function and (b) read the limits once and let the
commands error out on their own if the status happens to change.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[bwh: Backported to 3.2: adjust context; drop the write_same handling]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit ab7c7bb6f4ab95dbca96fcfc4463cd69843e3e24 upstream.
__dm_destroy() must take the suspend_lock so that its presuspend and
postsuspend calls do not race with an internal suspend.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 22aa66a3ee5b61e0f4a0bfeabcaa567861109ec3 upstream.
When the snapshot target is unloaded, snapshot_dtr() waits until
pending_exceptions_count drops to zero. Then, it destroys the snapshot.
Therefore, the function that decrements pending_exceptions_count
should not touch the snapshot structure after the decrement.
pending_complete() calls free_pending_exception(), which decrements
pending_exceptions_count, and then it performs up_write(&s->lock) and it
calls retry_origin_bios() which dereferences s->origin. These two
memory accesses to the fields of the snapshot may touch the dm_snapshot
struture after it is freed.
This patch moves the call to free_pending_exception() to the end of
pending_complete(), so that the snapshot will not be destroyed while
pending_complete() is in progress.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 2bec1f4a8832e74ebbe859f176d8a9cb20dd97f4 upstream.
The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.
dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.
There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.
To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 37527b869207ad4c208b1e13967d69b8bba1fbf9 upstream.
I created a dm-raid1 device backed by a device that supports DISCARD
and another device that does NOT support DISCARD with the following
dm configuration:
# echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
# lsblk -D
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 4K 1G 0
`-moo (dm-0) 0 4K 1G 0
sdb 0 0B 0B 0
`-moo (dm-0) 0 4K 1G 0
Notice that the mirror device /dev/mapper/moo advertises DISCARD
support even though one of the mirror halves doesn't.
If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
loop in do_region() when it tries to issue a DISCARD request to sdb.
The problem is that when we call do_region() against sdb, num_sectors
is set to zero because q->limits.max_discard_sectors is zero.
Therefore, "remaining" never decreases and the loop never terminates.
To fix this: before entering the loop, check for the combination of
REQ_DISCARD and no discard and return -EOPNOTSUPP to avoid hanging up
the mirror device.
This bug was found by the unfortunate coincidence of pvmove and a
discard operation in the RHEL 6.5 kernel; upstream is also affected.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f2ed51ac64611d717d1917820a01930174c2f236 upstream.
It may be possible that a device claims discard support but it rejects
discards with -EOPNOTSUPP. It happens when using loopback on ext2/ext3
filesystem driven by the ext4 driver. It may also happen if the
underlying devices are moved from one disk on another.
If discard error happens, we reject the bio with -EOPNOTSUPP, but we do
not degrade the array.
This patch fixes failed test shell/lvconvert-repair-transient.sh in the
lvm2 testsuite if the testsuite is extracted on an ext2 or ext3
filesystem and it is being driven by the ext4 driver.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit c1c6156fe4d4577444b769d7edd5dd503e57bbc9 upstream.
This function isn't right and it causes a static checker warning:
drivers/md/dm-thin.c:3016 maybe_resize_data_dev()
error: potentially using uninitialized 'sb_data_size'.
It should set "*count" and return zero on success the same as the
sm_metadata_get_nr_blocks() function does earlier.
Fixes: 3241b1d3e0aa ('dm: add persistent data library')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 40d43c4b4cac4c2647bf07110d7b07d35f399a84 upstream.
The dm-raid superblock (struct dm_raid_superblock) is padded to 512
bytes and that size is being used to read it in from the metadata
device into one preallocated page.
Reading or writing this on a 512-byte sector device works fine but on
a 4096-byte sector device this fails.
Set the dm-raid superblock's size to the logical block size of the
metadata device, because IO at that size is guaranteed too work. Also
add a size check to avoid silent partial metadata loss in case the
superblock should ever grow past the logical block size or PAGE_SIZE.
[includes pointer math fix from Dan Carpenter]
Reported-by: "Liuhua Wang" <lwang@suse.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 9d28eb12447ee08bb5d1e8bb3195cf20e1ecd1c0 upstream.
The shrinker uses gfp flags to indicate what kind of operation can the
driver wait for. If __GFP_IO flag is present, the driver can wait for
block I/O operations, if __GFP_FS flag is present, the driver can wait on
operations involving the filesystem.
dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block
device that makes calls into the filesystem. If __GFP_IO is present and
__GFP_FS isn't, dm-bufio could still block on filesystem operations if it
runs on a loop block device.
The change from __GFP_IO to __GFP_FS supposedly fixes one observed (though
unreproducible) deadlock involving dm-bufio and loop device.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[bwh: Backported to 3.2:
- There's only one shrinker callback
- Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 56ec16cb1e1ce46354de8511eef962a417c32c92 upstream.
If cn_add_callback() fails in dm_ulog_tfr_init(), it does not
deallocate prealloced memory but calls cn_del_callback().
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit eb76faf53b1ff7a77ce3f78cc98ad392ac70c2a0 upstream.
The 'last_accessed' member of the dm_buffer structure was only set when
the the buffer was created. This led to each buffer being discarded
after dm_bufio_max_age time even if it was used recently. In practice
this resulted in all thinp metadata being evicted soon after being read
-- this is particularly problematic for metadata intensive workloads
like multithreaded small random IO.
'last_accessed' is now updated each time the buffer is moved to the head
of the LRU list, so the buffer is now properly discarded if it was not
used in dm_bufio_max_age time.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit d49ec52ff6ddcda178fc2476a109cf1bd1fa19ed upstream.
The DM crypt target accesses memory beyond allocated space resulting in
a crash on 32 bit x86 systems.
This bug is very old (it dates back to 2.6.25 commit 3a7f6c990ad04 "dm
crypt: use async crypto"). However, this bug was masked by the fact
that kmalloc rounds the size up to the next power of two. This bug
wasn't exposed until 3.17-rc1 commit 298a9fa08a ("dm crypt: use per-bio
data"). By switching to using per-bio data there was no longer any
padding beyond the end of a dm-crypt allocated memory block.
To minimize allocation overhead dm-crypt puts several structures into one
block allocated with kmalloc. The block holds struct ablkcipher_request,
cipher-specific scratch pad (crypto_ablkcipher_reqsize(any_tfm(cc))),
struct dm_crypt_request and an initialization vector.
The variable dmreq_start is set to offset of struct dm_crypt_request
within this memory block. dm-crypt allocates the block with this size:
cc->dmreq_start + sizeof(struct dm_crypt_request) + cc->iv_size.
When accessing the initialization vector, dm-crypt uses the function
iv_of_dmreq, which performs this calculation: ALIGN((unsigned long)(dmreq
+ 1), crypto_ablkcipher_alignmask(any_tfm(cc)) + 1).
dm-crypt allocated "cc->iv_size" bytes beyond the end of dm_crypt_request
structure. However, when dm-crypt accesses the initialization vector, it
takes a pointer to the end of dm_crypt_request, aligns it, and then uses
it as the initialization vector. If the end of dm_crypt_request is not
aligned on a crypto_ablkcipher_alignmask(any_tfm(cc)) boundary the
alignment causes the initialization vector to point beyond the allocated
space.
Fix this bug by calculating the variable iv_size_padding and adding it
to the allocated size.
Also correct the alignment of dm_crypt_request. struct dm_crypt_request
is specific to dm-crypt (it isn't used by the crypto subsystem at all),
so it is aligned on __alignof__(struct dm_crypt_request).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 9c4bdf697c39805078392d5ddbbba5ae5680e0dd upstream.
During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.
If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.
This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.
Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then. In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().
Fixes: 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 2446dba03f9dabe0b477a126cbeb377854785b47 upstream.
Currently we don't abort recovery on a write error if the write error
to the recovering device was triggerd by normal IO (as opposed to
recovery IO).
This means that for one bitmap region, the recovery might write to the
recovering device for a few sectors, then not bother for subsequent
sectors (as it never writes to failed devices). In this case
the bitmap bit will be cleared, but it really shouldn't.
The result is that if the recovering device fails and is then re-added
(after fixing whatever hardware problem triggerred the failure),
the second recovery won't redo the region it was in the middle of,
so some of the device will not be recovered properly.
If we abort the recovery, the region being processes will be cancelled
(bit not cleared) and the whole region will be retried.
As the bug can result in data corruption the patch is suitable for
-stable. For kernels prior to 3.11 there is a conflict in raid10.c
which will require care.
Original-from: jiao hui <jiaohui@bwstor.com.cn>
Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
Signed-off-by: NeilBrown <neilb@suse.de>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 10f1d5d111e8aed46a0f1179faf9a3cf422f689e upstream.
There's a race condition between the atomic_dec_and_test(&io->count)
in dec_count() and the waking of the sync_io() thread. If the thread
is spuriously woken immediately after the decrement it may exit,
making the on stack io struct invalid, yet the dec_count could still
be using it.
Fix this race by using a completion in sync_io() and dec_count().
Reported-by: Minfei Huang <huangminfei@ucloud.cn>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
[bwh: Backported to 3.2: use wait_for_completion() as wait_for_completion_io()
is not available]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 133d4527eab8d199a62eee6bd433f0776842df2e upstream.
When we write to a degraded array which has a bitmap, we
make sure the relevant bit in the bitmap remains set when
the write completes (so a 're-add' can quickly rebuilt a
temporarily-missing device).
If, immediately after such a write starts, we incorporate a spare,
commence recovery, and skip over the region where the write is
happening (because the 'needs recovery' flag isn't set yet),
then that write will not get to the new device.
Once the recovery finishes the new device will be trusted, but will
have incorrect data, leading to possible corruption.
We cannot set the 'needs recovery' flag when we start the write as we
do not know easily if the write will be "degraded" or not. That
depends on details of the particular raid level and particular write
request.
This patch fixes a corruption issue of long standing and so it
suitable for any -stable kernel. It applied correctly to 3.0 at
least and will minor editing to earlier kernels.
Reported-by: Bill <billstuff2001@sbcglobal.net>
Tested-by: Bill <billstuff2001@sbcglobal.net>
Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 3991b31ea072b070081ca3bfa860a077eda67de5 upstream.
If mddev->ro is set, md_to_sync will (correctly) abort.
However in that case MD_RECOVERY_INTR isn't set.
If a RESHAPE had been requested, then ->finish_reshape() will be
called and it will think the reshape was successful even though
nothing happened.
Normally a resync will not be requested if ->ro is set, but if an
array is stopped while a reshape is on-going, then when the array is
started, the reshape will be restarted. If the array is also set
read-only at this point, the reshape will instantly appear to success,
resulting in data corruption.
Consequently, this patch is suitable for any -stable kernel.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream.
If an md array with externally managed metadata (e.g. DDF or IMSM)
is in use, then we should not set safemode==2 at shutdown because:
1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
2/ The safemode management code doesn't cope with safemode==2 on external metadata
and md_check_recover enters an infinite loop.
Even at shutdown, an infinite-looping process can be problematic, so this
could cause shutdown to hang.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit fe76cd88e654124d1431bb662a0fc6e99ca811a5 upstream.
If unable to ensure_next_mapping() we must add the current bio, which
was removed from the @bios list via bio_list_pop, back to the
deferred_bios list before all the remaining @bios.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 789b5e0315284463617e106baad360cb9e8db3ac upstream.
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:
register_cpu_notifier(&foobar_cpu_notifier);
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
put_online_cpus();
A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.
So reorganize the code in raid5 this way to fix the deadlock with callback
registration.
Cc: linux-raid@vger.kernel.org
Fixes: 36d1c6476be51101778882897b315bd928c8c7b5
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 9f97e4b128d2ea90a5f5063ea0ee3b0911f4c669 upstream.
Before a write starts we set a bit in the write-intent bitmap.
When the write completes we clear that bit if the write was successful
to all devices. However if the write wasn't fully successful we
should not clear the bit. If the faulty drive is subsequently
re-added, the fact that the bit is still set ensure that we will
re-write the data that is missing.
This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
bitmap bit when this flag is not set.
Currently we correctly set the flag if a write starts when some
devices are failed or missing. But we do *not* set the flag if some
device failed during the write attempt.
This is wrong and can result in clearing the bit inappropriately.
So: set the flag when a write fails.
This bug has been present since bitmaps were introduces, so the fix is
suitable for any -stable kernel.
Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
Signed-off-by: NeilBrown <neilb@suse.de>
[bwh: Backported to 3.2: adjust context, indentation]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 2995fa78e423d7193f3b57835f6c1c75006a0315 upstream.
This reverts commit be35f48610 ("dm: wait until embedded kobject is
released before destroying a device") and provides an improved fix.
The kobject release code that calls the completion must be placed in a
non-module file, otherwise there is a module unload race (if the process
calling dm_kobject_release is preempted and the DM module unloaded after
the completion is triggered, but before dm_kobject_release returns).
To fix this race, this patch moves the completion code to dm-builtin.c
which is always compiled directly into the kernel if BLK_DEV_DM is
selected.
The patch introduces a new dm_kobject_holder structure, its purpose is
to keep the completion and kobject in one place, so that it can be
accessed from non-module code without the need to export the layout of
struct mapped_device to that code.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[bwh: Backported to 3.2:
- Adjust context
- Remove paranoid check of container_of() result in dm_get_from_kobject(),
which would now be incorrect
- Include <linux/export.h> in dm-builtin.c, included indirectly upstream]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 12c91a5c2d2a8e8cc40a9552313e1e7b0a2d9ee3 upstream.
When extending a low level space map we should update nr_blocks at
the start so the new space is used for the index entries.
Otherwise extend can fail, e.g.: sm_metadata_extend call sequence
that fails:
-> sm_ll_extend
-> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block
=> returns -ENOSPC because smm->begin == smm->ll.nr_blocks
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit be35f486108227e10fe5d96fd42fb2b344c59983 upstream.
There may be other parts of the kernel holding a reference on the dm
kobject. We must wait until all references are dropped before
deallocating the mapped_device structure.
The dm_kobject_release method signals that all references are dropped
via completion. But dm_kobject_release doesn't free the kobject (which
is embedded in the mapped_device structure).
This is the sequence of operations:
* when destroying a DM device, call kobject_put from dm_sysfs_exit
* wait until all users stop using the kobject, when it happens the
release method is called
* the release method signals the completion and should return without
delay
* the dm device removal code that waits on the completion continues
* the dm device removal code drops the dm_mod reference the device had
* the dm device removal code frees the mapped_device structure that
contains the kobject
Using kobject this way should avoid the module unload race that was
mentioned at the beginning of this thread:
https://lkml.org/lkml/2014/1/4/83
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit e8b849158508565e0cd6bc80061124afc5879160 upstream.
commit e875ecea266a543e643b19e44cf472f1412708f9
md/raid10 record bad blocks as needed during recovery.
added code to the "cannot recover this block" path to record a bad
block rather than fail the whole recovery.
Unfortunately this new case was placed *after* r10bio was freed rather
than *before*, yet it still uses r10bio.
This is will crash with a null dereference.
So move the freeing of r10bio down where it is safe.
Fixes: e875ecea266a543e643b19e44cf472f1412708f9
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit b50c259e25d9260b9108dc0c2964c26e5ecbe1c1 upstream.
If we discover a bad block when reading we split the request and
potentially read some of it from a different device.
The code path of this has two bugs in RAID10.
1/ we get a spin_lock with _irq, but unlock without _irq!!
2/ The calculation of 'sectors_handled' is wrong, as can be clearly
seen by comparison with raid1.c
This leads to at least 2 warnings and a probable crash is a RAID10
ever had known bad blocks.
Fixes: 856e08e23762dfb92ffc68fd0a8d228f9e152160
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 1cc03eb93245e63b0b7a7832165efdc52e25b4e6 upstream.
commit 5d8c71f9e5fbdd95650be00294d238e27a363b5c
md: raid5 crash during degradation
Fixed a crash in an overly simplistic way which could leave
R5_WriteError or R5_MadeGood set in the stripe cache for devices
for which it is no longer relevant.
When those devices are removed and spares added the flags are still
set and can cause incorrect behaviour.
commit 14a75d3e07c784c004b4b44b34af996b8e4ac453
md/raid5: preferentially read from replacement device if possible.
Fixed the same bug if a more effective way, so we can now revert
the original commit.
Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 5d8c71f9e5fbdd95650be00294d238e27a363b5c
Signed-off-by: NeilBrown <neilb@suse.de>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 4cb57ab4a2e61978f3a9b7d4f53988f30d61c27f upstream.
Some module parameters in dm-bufio are read-only. These parameters
inform the user about memory consumption. They are not supposed to be
changed by the user.
However, despite being read-only, these parameters can be set on
modprobe or insmod command line, for example:
modprobe dm-bufio current_allocated_bytes=12345
The kernel doesn't expect that these variables can be non-zero at module
initialization and if the user sets them, it results in BUG.
This patch initializes the variables in the module init routine, so that
user-supplied values are ignored.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 5b2d06576c5410c10d95adfd5c4d8b24de861d87 upstream.
The dm_round_up function may overflow to zero. In this case,
dm_table_create() must fail rather than go on to allocate an empty array
with alloc_targets().
This fixes a possible memory corruption that could be caused by passing
too large a number in "param->target_count".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 230c83afdd9cd384348475bea1e14b80b3b6b1b8 upstream.
There is a possible leak of snapshot space in case of crash.
The reason for space leaking is that chunks in the snapshot device are
allocated sequentially, but they are finished (and stored in the metadata)
out of order, depending on the order in which copying finished.
For example, supposed that the metadata contains the following records
SUPERBLOCK
METADATA (blocks 0 ... 250)
DATA 0
DATA 1
DATA 2
...
DATA 250
Now suppose that you allocate 10 new data blocks 251-260. Suppose that
copying of these blocks finish out of order (block 260 finished first
and the block 251 finished last). Now, the snapshot device looks like
this:
SUPERBLOCK
METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256)
DATA 0
DATA 1
DATA 2
...
DATA 250
DATA 251
DATA 252
DATA 253
DATA 254
DATA 255
METADATA (blocks 255, 254, 253, 252, 251)
DATA 256
DATA 257
DATA 258
DATA 259
DATA 260
Now, if the machine crashes after writing the first metadata block but
before writing the second metadata block, the space for areas DATA 250-255
is leaked, it contains no valid data and it will never be used in the
future.
This patch makes dm-snapshot complete exceptions in the same order they
were allocated, thus fixing this bug.
Note: when backporting this patch to the stable kernel, change the version
field in the following way:
* if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0}
* if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it
to {1, 10, 2}
Userspace reads the version to determine if the bug was fixed, so the
version change is needed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 718822c1c112dc99e0c72c8968ee1db9d9d910f0 upstream.
The dm-delay target uses a shared workqueue for multiple instances. This
can cause deadlock if two or more dm-delay targets are stacked on the top
of each other.
This patch changes dm-delay to use a per-instance workqueue.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 954a73d5d3073df2231820c718fdd2f18b0fe4c9 upstream.
Whenever multipath_dtr() is happening we must prevent queueing any
further path activation work. Implement this by adding a new
'pg_init_disabled' flag to the multipath structure that denotes future
path activation work should be skipped if it is set. By disabling
pg_init and then re-enabling in flush_multipath_work() we also avoid the
potential for pg_init to be initiated while suspending an mpath device.
Without this patch a race condition exists that may result in a kernel
panic:
1) If after pg_init_done() decrements pg_init_in_progress to 0, a call
to wait_for_pg_init_completion() assumes there are no more pending path
management commands.
2) If pg_init_required is set by pg_init_done(), due to retryable
mode_select errors, then process_queued_ios() will again queue the
path activation work.
3) If free_multipath() completes before activate_path() work is called a
NULL pointer dereference like the following can be seen when
accessing members of the recently destructed multipath:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
RIP: 0010:[<ffffffffa003db1b>] [<ffffffffa003db1b>] activate_path+0x1b/0x30 [dm_multipath]
[<ffffffff81090ac0>] worker_thread+0x170/0x2a0
[<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
[switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer]
Signed-off-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Reviewed-by: Krishnasamy Somasundaram <somasundaram.krishnasamy@netapp.com>
Tested-by: Speagle Andy <Andy.Speagle@netapp.com>
Acked-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[bwh: Backported to 3.2:
- Adjust context
- Bump version to 1.3.2 not 1.6.0]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f36afb3957353d2529cb2b00f78fdccd14fc5e9c upstream.
dm-mpath and dm-thin must process messages even if some device is
suspended, so we allocate argv buffer with GFP_NOIO. These messages have
a small fixed number of arguments.
On the other hand, dm-switch needs to process bulk data using messages
so excessive use of GFP_NOIO could cause trouble.
The patch also lowers the default number of arguments from 64 to 8, so
that there is smaller load on GFP_NOIO allocations.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit e9c6a182649f4259db704ae15a91ac820e63b0ca upstream.
This patch fixes a particular type of data corruption that has been
encountered when loading a snapshot's metadata from disk.
When we allocate a new chunk in persistent_prepare, we increment
ps->next_free and we make sure that it doesn't point to a metadata area
by further incrementing it if necessary.
When we load metadata from disk on device activation, ps->next_free is
positioned after the last used data chunk. However, if this last used
data chunk is followed by a metadata area, ps->next_free is positioned
erroneously to the metadata area. A newly-allocated chunk is placed at
the same location as the metadata area, resulting in data or metadata
corruption.
This patch changes the code so that ps->next_free skips the metadata
area when metadata are loaded in function read_exceptions.
The patch also moves a piece of code from persistent_prepare_exception
to a separate function skip_metadata to avoid code duplication.
CVE-2013-4299
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 60e356f381954d79088d0455e357db48cfdd6857 upstream.
LVM2, since version 2.02.96, creates origin with zero size, then loads
the snapshot driver and then loads the origin. Consequently, the
snapshot driver sees the origin size zero and sets the hash size to the
lower bound 64. Such small hash table causes performance degradation.
This patch changes it so that the hash size is determined by the size of
snapshot volume, not minimum of origin and snapshot size. It doesn't
make sense to set the snapshot size significantly larger than the origin
size, so we do not need to take origin size into account when
calculating the hash size.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit d74c6d514fe314b8bdab58b487b25992291577ec upstream.
__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bv_idx. Currently, the only usage is to walk all the
bvecs after the bio has been advanced by specifying 0 index.
For immutable bvecs, we need to split these apart;
bio_for_each_segment() is going to have a different implementation.
This will also help document the intent of code that's using it -
bio_for_each_segment_all() is only legal to use for code that owns the
bio.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Neil Brown <neilb@suse.de>
CC: Boaz Harrosh <bharrosh@panasas.com>
[bwh: Backported to 3.2: drop inapplicable change to drivers/block/rbd.c.
This is a prerequisite for commit 35dc248383bb 'sg: Fix user memory
corruption when SG_IO is interrupted by a signal']
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
non-rebuilding drive completed it.
commit 3056e3aec8d8ba61a0710fb78b2d562600aa2ea7 upstream.
Without that fix, the following scenario could happen:
- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.
So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.
[neilb - applied identical change to raid10 as well]
This bug can result in lost data, so it is suitable for any
-stable kernel.
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
[bwh: Backported to 3.2: for raid10, s/rdev/conf->mirrors[dev].rdev/]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 502624bdad3dba45dfaacaf36b7d83e39e74b2d2 upstream.
This patch uses memalloc_noio_save to avoid a possible deadlock in
dm-bufio. (it could happen only with large block size, at most
PAGE_SIZE << MAX_ORDER (typically 8MiB).
__vmalloc doesn't fully respect gfp flags. The specified gfp flags are
used for allocation of requested pages, structures vmap_area, vmap_block
and vm_struct and the radix tree nodes.
However, the kernel pagetables are allocated always with GFP_KERNEL.
Thus the allocation of pagetables can recurse back to the I/O layer and
cause a deadlock.
This patch uses the function memalloc_noio_save to set per-process
PF_MEMALLOC_NOIO flag and the function memalloc_noio_restore to restore
it. When this flag is set, all allocations in the process are done with
implied GFP_NOIO flag, thus the deadlock can't happen.
This should be backported to stable kernels, but they don't have the
PF_MEMALLOC_NOIO flag and memalloc_noio_save/memalloc_noio_restore
functions. So, PF_MEMALLOC should be set and restored instead.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
[bwh: Backported to 3.2 as recommended]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 09e8b813897a0f85bb401435d009228644c81214 upstream.
Return -ENOMEM instead of success if unable to allocate pending
exception mempool in snapshot_ctr.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 486adf72ccc0c235754923d47a2270c5dcb0c98b upstream.
Maintenance of a bad-block-list currently defaults to 'enabled'
and is then disabled when it cannot be supported.
This is backwards and causes problem for dm-raid which didn't know
to disable it.
So fix the defaults, and only enabled for v1.x metadata which
explicitly has bad blocks enabled.
The problem with dm-raid has been present since badblock support was
added in v3.1, so this patch is suitable for any -stable from 3.1
onwards.
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f046f89a99ccfd9408b94c653374ff3065c7edb3 upstream.
Fix a bug in dm_btree_remove that could leave leaf values with incorrect
reference counts. The effect of this was that removal of a shared block
could result in the space maps thinking the block was no longer used.
More concretely, if you have a thin device and a snapshot of it, sending
a discard to a shared region of the thin could corrupt the snapshot.
Thinp uses a 2-level nested btree to store it's mappings. This first
level is indexed by thin device, and the second level by logical
block.
Often when we're removing an entry in this mapping tree we need to
rebalance nodes, which can involve shadowing them, possibly creating a
copy if the block is shared. If we do create a copy then children of
that node need to have their reference counts incremented. In this
way reference counts percolate down the tree as shared trees diverge.
The rebalance functions were incrementing the children at the
appropriate time, but they were always assuming the children were
internal nodes. This meant the leaf values (in our case packed
block/flags entries) were not being incremented.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
[bwh: Backported to 3.2: bump target version numbers from 1.0.1 to 1.0.2]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 23cb21092eb9dcec9d3604b68d95192b79915890 upstream.
Add module aliases so that autoloading works correctly if the user
tries to activate "snapshot-origin" or "snapshot-merge" targets.
Reference: https://bugzilla.redhat.com/889973
Reported-by: Chao Yang <chyang@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit fd7c092e711ebab55b2688d3859d95dfd0301f73 upstream.
Avoid returning a truncated table or status string instead of setting
the DM_BUFFER_FULL_FLAG when the last target of a table fills the
buffer.
When processing a table or status request, the function retrieve_status
calls ti->type->status. If ti->type->status returns non-zero,
retrieve_status assumes that the buffer overflowed and sets
DM_BUFFER_FULL_FLAG.
However, targets don't return non-zero values from their status method
on overflow. Most targets returns always zero.
If a buffer overflow happens in a target that is not the last in the
table, it gets noticed during the next iteration of the loop in
retrieve_status; but if a buffer overflow happens in the last target, it
goes unnoticed and erroneously truncated data is returned.
In the current code, the targets behave in the following way:
* dm-crypt returns -ENOMEM if there is not enough space to store the
key, but it returns 0 on all other overflows.
* dm-thin returns errors from the status method if a disk error happened.
This is incorrect because retrieve_status doesn't check the error
code, it assumes that all non-zero values mean buffer overflow.
* all the other targets always return 0.
This patch changes the ti->type->status function to return void (because
most targets don't use the return code). Overflow is detected in
retrieve_status: if the status method fills up the remaining space
completely, it is assumed that buffer overflow happened.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
[bwh: Backported to 3.2:
- Adjust context
- dm_status_fn doesn't take a status_flags parameter
- Bump the last component of each current version (verified not to
match any version used in mainline)
- Drop changes to dm-verity]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 58ebb34c49fcfcaa029e4b1c1453d92583900f9a upstream.
Create_stripe_zones returns an error slightly differently to
raid0_run and to raid0_takeover_*.
The error returned used by the second was wrong and an error would
result in mddev->private being set to NULL and sooner or later a
crash.
So never return NULL, return ERR_PTR(err), not NULL from
create_stripe_zones.
This bug has been present since 2.6.35 so the fix is suitable
for any kernel since then.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit a64685399181780998281fe07309a94b25dd24c3 upstream.
You cannot resize a RAID0 array (in terms of making the devices
bigger), but the code doesn't entirely stop you.
So:
disable setting of the available size on each device for
RAID0 and Linear devices. This must not change as doing so
can change the effective layout of data.
Make sure that the size that raid0_size() reports is accurate,
but rounding devices sizes to chunk sizes. As the device sizes
cannot change now, this isn't so important, but it is best to be
safe.
Without this change:
mdadm --grow /dev/md0 -z max
mdadm --grow /dev/md0 -Z max
then read to the end of the array
can cause a BUG in a RAID0 array.
These bugs have been present ever since it became possible
to resize any device, which is a long time. So the fix is
suitable for any -stable kerenl.
Signed-off-by: NeilBrown <neilb@suse.de>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|