linux-toradex.git/drivers/md, branch v4.9.16

dm: flush queued bios when process blocks to avoid deadlock

2017-03-18T11:14:34+00:00

commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 upstream.

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks
by redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However other deadlocks (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

** the related dm-snapshot deadlock is detailed here:
https://www.redhat.com/archives/dm-devel/2016-July/msg00065.html

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule() call.  Consequently,
when the process blocks on a mutex, the bios queued on
current->bio_list are dispatched to independent workqueus and they can
complete without waiting for the mutex to be available.

The structure blk_plug contains an entry cb_list and this list can contain
arbitrary callback functions that are called when the process blocks.
To implement this fix DM (ab)uses the onstack plug's cb_list interface
to get its flush_current_bio_list() called at schedule() time.

This fixes the snapshot deadlock - if the map method blocks,
flush_current_bio_list() will be called and it redirects bios waiting
on current->bio_list to appropriate workqueues.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman

md linear: fix a race between linear_add() and linear_congested()

2017-03-12T05:41:52+00:00

commit 03a9e24ef2aaa5f1f9837356aed79c860521407a upstream.

Recently I receive a bug report that on Linux v3.0 based kerenl, hot add
disk to a md linear device causes kernel crash at linear_congested(). From
the crash image analysis, I find in linear_congested(), mddev->raid_disks
contains value N, but conf->disks[] only has N-1 pointers available. Then
a NULL pointer deference crashes the kernel.

There is a race between linear_add() and linear_congested(), RCU stuffs
used in these two functions cannot avoid the race. Since Linuv v4.0
RCU code is replaced by introducing mddev_suspend().  After checking the
upstream code, it seems linear_congested() is not called in
generic_make_request() code patch, so mddev_suspend() cannot provent it
from being called. The possible race still exists.

Here I explain how the race still exists in current code.  For a machine
has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
md linear device; at the same time on other CPU, linear_congested() is
called to detect whether this md linear device is congested before issuing
an I/O request onto it.

Now I use a possible code execution time sequence to demo how the possible
race happens,

seq    linear_add()                linear_congested()
 0                                 conf=mddev->private
 1   oldconf=mddev->private
 2   mddev->raid_disks++
 3                              for (i=0; iraid_disks;i++)
 4                                bdev_get_queue(conf->disks[i].rdev->bdev)
 5   mddev->private=newconf

In linear_add() mddev->raid_disks is increased in time seq 2, and on
another CPU in linear_congested() the for-loop iterates conf->disks[i] by
the increased mddev->raid_disks in time seq 3,4. But conf with one more
element (which is a pointer to struct dev_info type) to conf->disks[] is
not updated yet, accessing its structure member in time seq 4 will cause a
NULL pointer deference fault.

To fix this race, there are 2 parts of modification in the patch,
 1) Add 'int raid_disks' in struct linear_conf, as a copy of
    mddev->raid_disks. It is initialized in linear_conf(), always being
    consistent with pointers number of 'struct dev_info disks[]'. When
    iterating conf->disks[] in linear_congested(), use conf->raid_disks to
    replace mddev->raid_disks in the for-loop, then NULL pointer deference
    will not happen again.
 2) RCU stuffs are back again, and use kfree_rcu() in linear_add() to
    free oldconf memory. Because oldconf may be referenced as mddev->private
    in linear_congested(), kfree_rcu() makes sure that its memory will not
    be released until no one uses it any more.
Also some code comments are added in this patch, to make this modification
to be easier understandable.

This patch can be applied for kernels since v4.0 after commit:
3be260cc18f8 ("md/linear: remove rcu protections in favour of
suspend/resume"). But this bug is reported on Linux v3.0 based kernel, for
people who maintain kernels before Linux v4.0, they need to do some back
back port to this patch.

Changelog:
 - V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
       replace rcu_call() in linear_add().
 - v2: add RCU stuffs by suggestion from Shaohua and Neil.
 - v1: initial effort.

Signed-off-by: Coly Li 
Cc: Shaohua Li 
Cc: Neil Brown 
Signed-off-by: Shaohua Li 
Signed-off-by: Greg Kroah-Hartman

dm raid: fix data corruption on reshape request

2017-03-12T05:41:44+00:00

commit d36a19541fe8f392778ac137d60f9be8dfdd8f9d upstream.

The lvm2 sequence to manage dm-raid constructor flags that trigger a
rebuild or a reshape is defined as:

1) load table with flags (e.g. rebuild/delta_disks/data_offset)
2) clear out the flags in lvm2 metadata
3) store the lvm2 metadata, reload the table to reset the flags
   previously established during the initial load (1) -- in order to
   prevent repeatedly requesting a rebuild or a reshape on activation

Currently, loading an inactive table with rebuild/reshape flags
specified will cause dm-raid to rebuild/reshape on resume and thus start
updating the raid metadata (about the progress).  When the second table
reload, to reset the flags, occurs the constructor accesses the volatile
progress state kept in the raid superblocks.  Because the active mapping
is still processing the rebuild/reshape, that position will be stale by
the time the device is resumed.

In the reshape case, this causes data corruption by processing already
reshaped stripes again.  In the rebuild case, it does _not_ cause data
corruption but instead involves superfluous rebuilds.

Fix by keeping the raid set frozen during the first resume and then
allow the rebuild/reshape during the second resume.

Fixes: 9dbd1aa3a ("dm raid: add reshaping support to the target")
Signed-off-by: Heinz Mauelshagen 
Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman

dm round robin: revert "use percpu 'repeat_count' and 'current_path'"

2017-03-12T05:41:44+00:00

commit 37a098e9d10db6e2efc05fe61e3a6ff2e9802c53 upstream.

The sloppy nature of lockless access to percpu pointers
(s->current_path) in rr_select_path(), from multiple threads, is
causing some paths to used more than others -- which results in less
IO performance being observed.

Revert these upstream commits to restore truly symmetric round-robin
IO submission in DM multipath:

b0b477c dm round robin: use percpu 'repeat_count' and 'current_path'
802934b dm round robin: do not use this_cpu_ptr() without having preemption disabled

There is no benefit to all this complexity if repeat_count = 1 (which is
the recommended default).

Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman

dm stats: fix a leaked s->histogram_boundaries array

2017-03-12T05:41:44+00:00

commit 6085831883c25860264721df15f05bbded45e2a2 upstream.

Fixes: dfcfac3e4cd9 ("dm stats: collect and report histogram of IO latencies")
Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman

dm cache: fix corruption seen when using cache > 2TB

2017-03-12T05:41:44+00:00

commit ca763d0a53b264a650342cee206512bc92ac7050 upstream.

A rounding bug due to compiler generated temporary being 32bit was found
in remap_to_cache().  A localized cast in remap_to_cache() fixes the
corruption but this preferred fix (changing from uint32_t to sector_t)
eliminates potential for future rounding errors elsewhere.

Signed-off-by: Joe Thornber 
Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman

bcache: Make gc wakeup sane, remove set_task_state()

2017-02-23T16:44:36+00:00

commit be628be09563f8f6e81929efbd7cf3f45c344416 upstream.

Signed-off-by: Kent Overstreet 
Cc: Coly Li 
Signed-off-by: Greg Kroah-Hartman

dm rq: cope with DM device destruction while in dm_old_request_fn()

2017-02-14T23:25:32+00:00

commit 4087a1fffe38106e10646606a27f10d40451862d upstream.

Fixes a crash in dm_table_find_target() due to a NULL struct dm_table
being passed from dm_old_request_fn() that races with DM device
destruction.

Reported-by: artem@flashgrid.io
Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman

md: fix refcount problem on mddev when stopping array.

2017-01-12T10:39:35+00:00

commit e2342ca832726a840ca6bd196dd2cc073815b08a upstream.

md_open() gets a counted reference on an mddev using mddev_find().
If it ends up returning an error, it must drop this reference.

There are two error paths where the reference is not dropped.
One only happens if the process is signalled and an awkward time,
which is quite unlikely.
The other was introduced recently in commit af8d8e6f0.

Change the code to ensure the drop the reference when returning an error,
and make it harded to re-introduce this sort of bug in the future.

Reported-by: Marc Smith 
Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
Signed-off-by: NeilBrown 
Acked-by: Guoqing Jiang 
Signed-off-by: Shaohua Li 
Signed-off-by: Greg Kroah-Hartman

md: MD_RECOVERY_NEEDED is set for mddev->recovery

2017-01-12T10:39:34+00:00

commit 82a301cb0ea2df8a5c88213094a01660067c7fb4 upstream.

Fixes: 90f5f7ad4f38("md: Wait for md_check_recovery before attempting device
removal.")

Reviewed-by: NeilBrown 
Signed-off-by: Shaohua Li 
Signed-off-by: Greg Kroah-Hartman