Age | Commit message (Collapse) | Author |
|
commit 09ee96b21456883e108c3b00597bb37ec512151b upstream.
The "dm snapshot: suspend origin when doing exception handover" commit
fixed a exception store handover bug associated with pending exceptions
to the "snapshot-origin" target.
However, a similar problem exists in snapshot merging. When snapshot
merging is in progress, we use the target "snapshot-merge" instead of
"snapshot-origin". Consequently, during exception store handover, we
must find the snapshot-merge target and suspend its associated
mapped_device.
To avoid lockdep warnings, the target must be suspended and resumed
without holding _origins_lock.
Introduce a dm_hold() function that grabs a reference on a
mapped_device, but unlike dm_get(), it doesn't crash if the device has
the DMF_FREEING flag set, it returns an error in this case.
In snapshot_resume() we grab the reference to the origin device using
dm_hold() while holding _origins_lock (_origins_lock guarantees that the
device won't disappear). Then we release _origins_lock, suspend the
device and grab _origins_lock again.
NOTE to stable@ people:
When backporting to kernels 3.18 and older, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b735fede8d957d9d255e9c5cf3964cfa59799637 upstream.
In the function snapshot_resume we perform exception store handover. If
there is another active snapshot target, the exception store is moved
from this target to the target that is being resumed.
The problem is that if there is some pending exception, it will point to
an incorrect exception store after that handover, causing a crash due to
dm-snap-persistent.c:get_exception()'s BUG_ON.
This bug can be triggered by repeatedly changing snapshot permissions
with "lvchange -p r" and "lvchange -p rw" while there are writes on the
associated origin device.
To fix this bug, we must suspend the origin device when doing the
exception store handover to make sure that there are no pending
exceptions:
- introduce _origin_hash that keeps track of dm_origin structures.
- introduce functions __lookup_dm_origin, __insert_dm_origin and
__remove_dm_origin that manipulate the origin hash.
- modify snapshot_resume so that it calls dm_internal_suspend_fast() and
dm_internal_resume_fast() on the origin device.
NOTE to stable@ people:
When backporting to kernels 3.12-3.18, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.
When backporting to kernels older than 3.12, you need to pick functions
dm_internal_suspend and dm_internal_resume from the commit
fd2ed4d252701d3bbed4cd3e3d267ad469bb832a.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5f027a3bf184d1d36e68745f7cd3718a8b879cc0 upstream.
It was always intended that a read to an unprovisioned block will return
zeroes regardless of whether the pool is in read-only or read-write
mode. thin_bio_map() was inconsistent with its handling of such reads
when the pool is in read-only mode, it now properly zero-fills the bios
it returns in response to unprovisioned block reads.
Eliminate thin_bio_map()'s special read-only mode handling of -ENODATA
and just allow the IO to be deferred to the worker which will result in
pool->process_bio() handling the IO (which already properly zero-fills
reads to unprovisioned blocks).
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
REQ_WRITE_SAME
commit e5db29806b99ce2b2640d2e4d4fcb983cea115c5 upstream.
Since it's possible for the discard and write same queue limits to
change while the upper level command is being sliced and diced, fix up
both of them (a) to reject IO if the special command is unsupported at
the start of the function and (b) read the limits once and let the
commands error out on their own if the status happens to change.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ab7c7bb6f4ab95dbca96fcfc4463cd69843e3e24 upstream.
__dm_destroy() must take the suspend_lock so that its presuspend and
postsuspend calls do not race with an internal suspend.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 22aa66a3ee5b61e0f4a0bfeabcaa567861109ec3 upstream.
When the snapshot target is unloaded, snapshot_dtr() waits until
pending_exceptions_count drops to zero. Then, it destroys the snapshot.
Therefore, the function that decrements pending_exceptions_count
should not touch the snapshot structure after the decrement.
pending_complete() calls free_pending_exception(), which decrements
pending_exceptions_count, and then it performs up_write(&s->lock) and it
calls retry_origin_bios() which dereferences s->origin. These two
memory accesses to the fields of the snapshot may touch the dm_snapshot
struture after it is freed.
This patch moves the call to free_pending_exception() to the end of
pending_complete(), so that the snapshot will not be destroyed while
pending_complete() is in progress.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2bec1f4a8832e74ebbe859f176d8a9cb20dd97f4 upstream.
The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.
dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.
There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.
To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 37527b869207ad4c208b1e13967d69b8bba1fbf9 upstream.
I created a dm-raid1 device backed by a device that supports DISCARD
and another device that does NOT support DISCARD with the following
dm configuration:
# echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
# lsblk -D
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 4K 1G 0
`-moo (dm-0) 0 4K 1G 0
sdb 0 0B 0B 0
`-moo (dm-0) 0 4K 1G 0
Notice that the mirror device /dev/mapper/moo advertises DISCARD
support even though one of the mirror halves doesn't.
If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
loop in do_region() when it tries to issue a DISCARD request to sdb.
The problem is that when we call do_region() against sdb, num_sectors
is set to zero because q->limits.max_discard_sectors is zero.
Therefore, "remaining" never decreases and the loop never terminates.
To fix this: before entering the loop, check for the combination of
REQ_DISCARD and no discard and return -EOPNOTSUPP to avoid hanging up
the mirror device.
This bug was found by the unfortunate coincidence of pvmove and a
discard operation in the RHEL 6.5 kernel; upstream is also affected.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f2ed51ac64611d717d1917820a01930174c2f236 upstream.
It may be possible that a device claims discard support but it rejects
discards with -EOPNOTSUPP. It happens when using loopback on ext2/ext3
filesystem driven by the ext4 driver. It may also happen if the
underlying devices are moved from one disk on another.
If discard error happens, we reject the bio with -EOPNOTSUPP, but we do
not degrade the array.
This patch fixes failed test shell/lvconvert-repair-transient.sh in the
lvm2 testsuite if the testsuite is extracted on an ext2 or ext3
filesystem and it is being driven by the ext4 driver.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d1901ef099c38afd11add4cfb3312c02ef21ec4a upstream.
When a drive is marked write-mostly it should only be the
target of reads if there is no other option.
This behaviour was broken by
commit 9dedf60313fa4dddfd5b9b226a0ef12a512bf9dc
md/raid1: read balance chooses idlest disk for SSD
which causes a write-mostly device to be *preferred* is some cases.
Restore correct behaviour by checking and setting
best_dist_disk and best_pending_disk rather than best_disk.
We only need to test one of these as they are both changed
from -1 or >=0 at the same time.
As we leave min_pending and best_dist unchanged, any non-write-mostly
device will appear better than the write-mostly device.
Reported-by: Tomáš Hodek <tomas.hodek@volny.cz>
Reported-by: Dark Penguin <darkpenguin@yandex.ru>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: http://marc.info/?l=linux-raid&m=135982797322422
Fixes: 9dedf60313fa4dddfd5b9b226a0ef12a512bf9dc
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 26ac107378c4742978216be1005b7291b799c7b2 upstream.
Commit a7854487cd7128a30a7f4f5259de9f67d5efb95f:
md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.
Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.
So change the condition to only force RCW on non-degraded arrays.
Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: a7854487cd7128a30a7f4f5259de9f67d5efb95f
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pull two fixes for md from Neil Brown:
- Another live lock, needs backporting
- work-around false positive with new warnings.
* tag 'md/3.19-fixes' of git://neil.brown.name/md:
md/bitmap: fix a might_sleep() warning.
md/raid5: fix another livelock caused by non-aligned writes.
|
|
commit 8eb23b9f35aae413140d3fda766a98092c21e9b0
sched: Debug nested sleeps
causes false-positive warnings in RAID5 code.
This annotation removes them and adds a comment
explaining why there is no real problem.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
If a non-page-aligned write is destined for a device which
is missing/faulty, we can deadlock.
As the target device is missing, a read-modify-write cycle
is not possible.
As the write is not for a full-page, a recontruct-write cycle
is not possible.
This should be handled by logic in fetch_block() which notices
there is a non-R5_OVERWRITE write to a missing device, and so
loads all blocks.
However since commit 67f455486d2ea2, that code requires
STRIPE_PREREAD_ACTIVE before it will active, and those circumstances
never set STRIPE_PREREAD_ACTIVE.
So: in handle_stripe_dirtying, if neither rmw or rcw was possible,
set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE be set
after a suitable delay.
Fixes: 67f455486d2ea20b2d94d6adf5b9b783d079e321
Cc: stable@vger.kernel.org (v3.16+)
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
FAIL mode
You can't modify the metadata in these modes. It's better to fail these
messages immediately than let the block-manager deny write locks on
metadata blocks. Otherwise these failed metadata changes will trigger
'needs_check' to get set in the metadata superblock -- requiring repair
using the thin_check utility.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
Commit 9b1cc9f251 ("dm cache: share cache-metadata object across
inactive and active DM tables") mistakenly ignored the use of ERR_PTR
returns. Restore missing IS_ERR checks and ERR_PTR returns where
appropriate.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
Commit ffcc393641 ("dm: enhance internal suspend and resume interface")
attempted to handle multiple internal suspends on the same device, but
it did that incorrectly. When these functions are called in this order
on the same device the device is no longer suspended, but it should be:
dm_internal_suspend_noflush
dm_internal_suspend_noflush
dm_internal_resume
Fix this bug by maintaining an 'internal_suspend_count' and resuming
the device when this count drops to zero.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Introduce a new variable to count the number of allocated migration
structures. The existing variable cache->nr_migrations became
overloaded. It was used to:
i) track of the number of migrations in flight for the purposes of
quiescing during suspend.
ii) to estimate the amount of background IO occuring.
Recent discard changes meant that REQ_DISCARD bios are processed with
a migration. Discards are not background IO so nr_migrations was not
incremented. However this could cause quiescing to complete early.
(i) is now handled with a new variable cache->nr_allocated_migrations.
cache->nr_migrations has been renamed cache->nr_io_migrations.
cleanup_migration() is now called free_io_migration(), since it
decrements that variable.
Also, remove the unused cache->next_migration variable that got replaced
with with prealloc_structs a while ago.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
If a DM table is reloaded with an inactive table when the device is not
suspended (normal procedure for LVM2), then there will be two dm-bufio
objects that can diverge. This can lead to a situation where the
inactive table uses bufio to read metadata at the same time the active
table writes metadata -- resulting in the inactive table having stale
metadata buffers once it is promoted to the active table slot.
Fix this by using reference counting and a global list of cache metadata
objects to ensure there is only one metadata object per metadata device.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
In bio-based DM's clone_endio(), when target_type doesn't implement
.end_io (e.g. linear) r will be always be initialized 0. So if a
WRITE SAME bio fails WRITE SAME will not be disabled as intended.
Fix this by initializing r to error, rather than 0, in clone_endio().
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 7eee4ae2db ("dm: disable WRITE SAME if it fails")
Cc: stable@vger.kernel.org
|
|
Commit 80e96c5484be ("dm thin: do not allow thin device activation
while pool is suspended") delayed the initialization of a new thin
device's refcount and completion until after this new thin was added
to the pool's active_thins list and the pool lock is released. This
opens a race with a worker thread that walks the list and calls
thin_get/put, noticing that the refcount goes to 0 and calling
complete, freezing up the system and giving the oops below:
kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90
kernel: Call Trace:
kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20
kernel: [<ffffffff810d3dc7>] complete+0x37/0x50
kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool]
kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool]
kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0
kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400
kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490
kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260
kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100
kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0
kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
Set the thin device's initial refcount and initialize the completion
before adding it to the pool's active_thins list in thin_ctr().
Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
are released
Discard bios and thin device deletion have the potential to release data
blocks. If the thin-pool is in out-of-data-space mode, and blocks were
released, transition the thin-pool back to full write mode.
The correct time to do this is just after the thin-pool metadata commit.
It cannot be done before the commit because the space maps will not
allow immediate reuse of the data blocks in case there's a rollback
following power failure.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard
function pointer was incorrectly being set to
process_prepared_discard_passdown rather than process_prepared_discard.
This incorrect function pointer meant the discard was being passed down,
but not effecting the mapping. As such any discard that was issued, in
an attempt to reclaim blocks, would not successfully free data space.
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
Pull md updates from Neil Brown:
"Three fixes for md.
I did have a largish set of locking changes queued, but late testing
showed they weren't quite as stable as I thought and while I fixed
what I found, I decided it safer to delay them a release ...
particularly as I'll be AFK for a few weeks. So expect a larger batch
next time :-)"
* tag 'md/3.19' of git://neil.brown.name/md:
md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.
md: fix semicolon.cocci warnings
md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
|
|
Pull block layer driver updates from Jens Axboe:
- NVMe updates:
- The blk-mq conversion from Matias (and others)
- A stack of NVMe bug fixes from the nvme tree, mostly from Keith.
- Various bug fixes from me, fixing issues in both the blk-mq
conversion and generic bugs.
- Abort and CPU online fix from Sam.
- Hot add/remove fix from Indraneel.
- A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp)
- With the generic IO stat accounting from 3.19/core, converting md,
bcache, and rsxx to use those. From Gu Zheng.
- Boundary check for queue/irq mode for null_blk from Matias. Fixes
cases where invalid values could be given, causing the device to hang.
- The xen blkfront pull request, with two bug fixes from Vitaly.
* 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits)
NVMe: fix race condition in nvme_submit_sync_cmd()
NVMe: fix retry/error logic in nvme_queue_rq()
NVMe: Fix FS mount issue (hot-remove followed by hot-add)
NVMe: fix error return checking from blk_mq_alloc_request()
NVMe: fix freeing of wrong request in abort path
xen/blkfront: remove redundant flush_op
xen/blkfront: improve protection against issuing unsupported REQ_FUA
NVMe: Fix command setup on IO retry
null_blk: boundary check queue_mode and irqmode
block/rsxx: use generic io stats accounting functions to simplify io stat accounting
md: use generic io stats accounting functions to simplify io stat accounting
drbd: use generic io stats accounting functions to simplify io stat accounting
md/bcache: use generic io stats accounting functions to simplify io stat accounting
NVMe: Update module version major number
NVMe: fail pci initialization if the device doesn't have any BARs
NVMe: add ->exit_hctx() hook
NVMe: make setup work for devices that don't do INTx
NVMe: enable IO stats by default
NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
NVMe: replace blk_put_request() with blk_mq_free_request()
...
|
|
A recent change to md started the ->sync_thread from a asynchronously
from a work_queue rather than synchronously. This means that there
can be a small window between the time when MD_RECOVERY_RUNNING is set
and when ->sync_thread is set.
So code that checks ->sync_thread might now conclude that the thread
has not been started and (because a lock is held) will not be started.
That is no longer the case.
Most of those places are best fixed by testing MD_RECOVERY_RUNNING
as well. To make this completely reliable, we wake_up(&resync_wait)
after clearing that flag as well as after clearing ->sync_thread.
Other places are better served by flushing the relevant workqueue
to ensure that that if the sync thread was starting, it has now
started. This is particularly best if we are about to stop the
sync thread.
Fixes: ac05f256691fe427a3e84c19261adb0b67dd73c0
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Significant DM thin-provisioning performance improvements to meet
performance requirements that were requested by the Gluster
distributed filesystem.
Specifically, dm-thinp now takes care to aggregate IO that will be
issued to the same thinp block before issuing IO to the underlying
devices. This really helps improve performance on HW RAID6 devices
that have a writeback cache because it avoids RMW in the HW RAID
controller.
- Some stable fixes: fix leak in DM bufio if integrity profiles were
enabled, use memzero_explicit in DM crypt to avoid any potential for
information leak, and a DM cache fix to properly mark a cache block
dirty if it was promoted to the cache via the overwrite optimization.
- A few simple DM persistent data library fixes
- DM cache multiqueue policy block promotion improvements.
- DM cache discard improvements that take advantage of range
(multiblock) discard support in the DM bio-prison. This allows for
much more efficient bulk discard processing (e.g. when mkfs.xfs
discards the entire device).
- Some small optimizations in DM core and RCU deference cleanups
- DM core changes to suspend/resume code to introduce the new internal
suspend/resume interface that the DM thin-pool target now uses to
suspend/resume active thin devices when the thin-pool must
suspend/resume.
This avoids forcing userspace to track all active thin volumes in a
thin-pool when the thin-pool is suspended for the purposes of
metadata or data space resize.
* tag 'dm-3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (49 commits)
dm crypt: use memzero_explicit for on-stack buffer
dm space map metadata: fix sm_bootstrap_get_count()
dm space map metadata: fix sm_bootstrap_get_nr_blocks()
dm bufio: fix memleak when using a dm_buffer's inline bio
dm cache: fix spurious cell_defer when dealing with partial block at end of device
dm cache: dirty flag was mistakenly being cleared when promoting via overwrite
dm cache: only use overwrite optimisation for promotion when in writeback mode
dm cache: discard block size must be a multiple of cache block size
dm cache: fix a harmless race when working out if a block is discarded
dm cache: when reloading a discard bitset allow for a different discard block size
dm cache: fix some issues with the new discard range support
dm array: if resizing the array is a noop set the new root to the old one
dm: use rcu_dereference_protected instead of rcu_dereference
dm thin: fix pool_io_hints to avoid looking at max_hw_sectors
dm thin: suspend/resume active thin devices when reloading thin-pool
dm: enhance internal suspend and resume interface
dm thin: do not allow thin device activation while pool is suspended
dm: add presuspend_undo hook to target_type
dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl
dm thin: remove stale 'trim' message in block comment above pool_message
...
|
|
drivers/md/md.c:7175:43-44: Unneeded semicolon
Removes unneeded semicolon.
Generated by: scripts/coccinelle/misc/semicolon.cocci
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
It is critical that fetch_block() and handle_stripe_dirtying()
are consistent in their analysis of what needs to be loaded.
Otherwise raid5 can wait forever for a block that won't be loaded.
Currently when writing to a RAID5 that is resyncing, to a location
beyond the resync offset, handle_stripe_dirtying chooses a
reconstruct-write cycle, but fetch_block() assumes a
read-modify-write, and a lockup can happen.
So treat that case just like RAID6, just as we do in
handle_stripe_dirtying. RAID6 always does reconstruct-write.
This bug was introduced when the behaviour of handle_stripe_dirtying
was changed in 3.7, so the patch is suitable for any kernel since,
though it will need careful merging for some versions.
Cc: stable@vger.kernel.org (v3.7+)
Fixes: a7854487cd7128a30a7f4f5259de9f67d5efb95f
Reported-by: Henry Cai <henryplusplus@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
|
|
Use memzero_explicit to cleanup sensitive data allocated on stack
to prevent the compiler from optimizing and removing memset() calls.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
Must set 'result' accordingly rather than return it.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
This function isn't right and it causes a static checker warning:
drivers/md/dm-thin.c:3016 maybe_resize_data_dev()
error: potentially using uninitialized 'sb_data_size'.
It should set "*count" and return zero on success the same as the
sm_metadata_get_nr_blocks() function does earlier.
Fixes: 3241b1d3e0aa ('dm: add persistent data library')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
When dm-bufio sets out to use the bio built into a struct dm_buffer to
issue an IO, it needs to call bio_reset after it's done with the bio
so that we can free things attached to the bio such as the integrity
payload. Therefore, inject our own endio callback to take care of
the bio_reset after calling submit_io's end_io callback.
Test case:
1. modprobe scsi_debug delay=0 dif=1 dix=199 ato=1 dev_size_mb=300
2. Set up a dm-bufio client, e.g. dm-verity, on the scsi_debug device
3. Repeatedly read metadata and watch kmalloc-192 leak!
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
device
We never bother caching a partial block that is at the back end of the
origin device. No cell ever gets locked, but the calling code was
assuming it was and trying to release it.
Now the code only releases if the cell has been set to a non NULL
value.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
If the incoming bio is a WRITE and completely covers a block then we
don't bother to do any copying for a promotion operation. Once this is
done the cache block and origin block will be different, so we need to
set it to 'dirty'.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
Overwrite causes the cache block and origin blocks to diverge, which
is only allowed in writeback mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
Otherwise the cache blocks may span two discard blocks, which we don't
handle when doing the discard lookup.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
It is more correct to hold the cell before checking the discard state.
These flags are only used as hints to the policy so this change will
have negligable effect.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
block size
The discard block size can change if the origin changes size or if an
old DM cache is upgraded from using a discard block size that was equal
to cache block size.
To fix this an extent of discarded blocks is established for the purpose
of translating the old discard block size to the new in-core discard
block size and set bits. The old (potentially huge) discard bitset is
left ondisk until it is re-written using the new in-core information on
the next successful DM cache shutdown.
Fixes: 7ae34e777896 ("dm cache: improve discard support")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Commit 7ae34e777 ("dm cache: improve discard support") needed to also:
- discontinue having DM core split the discard bios on cache block
boundaries
- calculate the cache's discard_nr_blocks relative to the determined
discard_block_size rather than using oblock_to_dblock()
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
This could've been quite bad (to return success but not update the new
root to point at the old) but in practice the only known consumer of the
dm array code is the DM cache target. And the DM cache target passes in
the same old root to array_resize() anyway.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
accounting
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Kent Overstreet <kmo@datera.io>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
rcu_dereference() should be used in sections protected by rcu_read_lock.
For writers, holding some kind of mutex or lock,
rcu_dereference_protected() is the way to go, adding explicit lockdep
bits.
In __unbind(), we are the last user of this mapped device, so can use
the constant '1' instead of a lockdep_is_held(), not consistent with
other uses of rcu_dereference_protected() which use md->suspend_lock
mutex.
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33423974bfc1 ("dm: Use rcu_dereference() for accessing rcu pointer")
Cc: Pranith Kumar <bobby.prani@gmail.com>
[snitzer: allow lines longer than 80 columns, refine subject]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Simplify the pool_io_hints code that works to establish a max_sectors
value that is a power-of-2 factor of the thin-pool's blocksize. The
biggest associated improvement is that the DM thin-pool is no longer
concerning itself with the data device's max_hw_sectors when adjusting
max_sectors.
This fixes the relative fragility of the original "dm thin: adjust
max_sectors_kb based on thinp blocksize" commit that only became
apparent when testing was performed using a DM thin-pool ontop of a
virtio_blk device. One proposed upstream patch detailed the problems
inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611
So even though virtio_blk incorrectly set its max_hw_sectors it actually
helped make it clear that we need DM thinp to be tolerant of any future
Linux driver that incorrectly sets max_hw_sectors.
We only need to be concerned with modifying the thin-pool device's
max_sectors limit if it is smaller than the thin-pool's blocksize. In
this case the value of max_sectors does become a limiting factor when
upper layers (e.g. filesystems) construct their bios. But if the
hardware can support IOs larger than the thin-pool's blocksize the user
is encouraged to adjust the thin-pool's data device's max_sectors
accordingly -- doing so will enable the thin-pool to inherit the
established user-defined max_sectors.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Before this change it was expected that userspace would first suspend
all active thin devices, reload/resize the thin-pool target, then resume
all active thin devices. Now the thin-pool suspend/resume will trigger
the suspend/resume of all active thins via appropriate calls to
dm_internal_suspend and dm_internal_resume.
Store the mapped_device for each thin device in struct thin_c to make
these calls possible.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast
-- dm-stats will continue using these methods to avoid all the extra
suspend/resume logic that is not needed in order to quickly flush IO.
Introduce dm_internal_suspend_noflush() variant that actually calls the
mapped_device's target callbacks -- otherwise target-specific hooks are
avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend). Common
code between dm_internal_{suspend_noflush,resume} and
dm_{suspend,resume} was factored out as __dm_{suspend,resume}.
Update dm_internal_{suspend_noflush,resume} to always take and release
the mapped_device's suspend_lock. Also update dm_{suspend,resume} to be
aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond
accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to
be cleared. Add lockdep annotation to dm_suspend() and dm_resume().
The existing DM_SUSPEND_FLAG remains unchanged.
DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and
cleared by dm_internal_resume().
Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device
was already suspended when dm_internal_suspend_noflush() was called --
this can be thought of as a "nested suspend". A "nested suspend" can
occur with legacy userspace dm-thin code that might suspend all active
thin volumes before suspending the pool for resize.
But otherwise, in the normal dm-thin-pool suspend case moving forward:
the thin-pool will have DM_SUSPEND_FLAG set and all active thins from
that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set.
Also add DM_INTERNAL_SUSPEND_FLAG to status report. This new
DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with
debugging (e.g. 'dmsetup info' will report an internally suspended
device accordingly).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
Otherwise IO could be issued to the pool while it is suspended.
Care was taken to properly interlock between the thin and thin-pool
targets when accessing the pool's 'suspended' flag. The thin_ctr will
not add a new thin device to the pool's active_thins list if the pool is
susepended.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
The DM thin-pool target now must undo the changes performed during
pool_presuspend() so introduce presuspend_undo hook in target_type.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
|
|
No point checking if the device is suspended if the current target
doesn't even implement .ioctl
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|