<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/block/elevator.c, branch v3.3.5</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions</title>
<updated>2012-02-08T08:19:38+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2012-02-08T08:19:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=050c8ea80e3e90019d9e981c6a117ef614e882ed'/>
<id>050c8ea80e3e90019d9e981c6a117ef614e882ed</id>
<content type='text'>
blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test.  blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.

elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge().  elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge().  While at it, make rq_merge_ok() functions return
bool.

This is to prepare for plug merge update and doesn't introduce any
behavior change.

This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
LKML-Reference: &lt;4F16F3CA.90904@kernel.dk&gt;
Original-patch-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test.  blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.

elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge().  elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge().  While at it, make rq_merge_ok() functions return
bool.

This is to prepare for plug merge update and doesn't introduce any
behavior change.

This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
LKML-Reference: &lt;4F16F3CA.90904@kernel.dk&gt;
Original-patch-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Revert "block: recursive merge requests"</title>
<updated>2012-01-15T09:29:48+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2012-01-15T09:29:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5d381efb3d1f1ef10535a31ca0dd9b22fe1e1922'/>
<id>5d381efb3d1f1ef10535a31ca0dd9b22fe1e1922</id>
<content type='text'>
This reverts commit 274193224cdabd687d804a26e0150bb20f2dd52c.

We have some problems related to selection of empty queues
that need to be resolved, evidence so far points to the
recursive merge logic making either being the cause or at
least the accelerator for this. So revert it for now, until
we figure this out.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This reverts commit 274193224cdabd687d804a26e0150bb20f2dd52c.

We have some problems related to selection of empty queues
that need to be resolved, evidence so far points to the
recursive merge logic making either being the cause or at
least the accelerator for this. So revert it for now, until
we figure this out.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: recursive merge requests</title>
<updated>2011-12-16T13:00:31+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shaohua.li@intel.com</email>
</author>
<published>2011-12-16T13:00:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=274193224cdabd687d804a26e0150bb20f2dd52c'/>
<id>274193224cdabd687d804a26e0150bb20f2dd52c</id>
<content type='text'>
In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1,
a+3,.... When the requests are flushed to queue, a and a+1 are merged
to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3)
aren't merged.
With recursive merge below, the workload throughput gets improved 20%
and context switch drops 60%.

Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1,
a+3,.... When the requests are flushed to queue, a and a+1 are merged
to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3)
aren't merged.
With recursive merge below, the workload throughput gets improved 20%
and context switch drops 60%.

Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block, cfq: move io_cq exit/release to blk-ioc.c</title>
<updated>2011-12-13T23:33:42+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2011-12-13T23:33:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7e5a8794492e43e9eebb68a98a23be055888ccd0'/>
<id>7e5a8794492e43e9eebb68a98a23be055888ccd0</id>
<content type='text'>
With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too.  The odd -&gt;io_cq-&gt;exit/release() callbacks are replaced
with elevator_ops-&gt;elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc.  The elevator
operation only need to perform exit operation specific to the elevator
- in cfq's case, exiting the cfqq's.

Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.

Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released.  New field
io_cq-&gt;__rcu_icq_cache is added for this purpose.  As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too.  The odd -&gt;io_cq-&gt;exit/release() callbacks are replaced
with elevator_ops-&gt;elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc.  The elevator
operation only need to perform exit operation specific to the elevator
- in cfq's case, exiting the cfqq's.

Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.

Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released.  New field
io_cq-&gt;__rcu_icq_cache is added for this purpose.  As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block, cfq: move icq cache management to block core</title>
<updated>2011-12-13T23:33:42+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2011-12-13T23:33:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=3d3c2379feb177a5fd55bb0ed76776dc9d4f3243'/>
<id>3d3c2379feb177a5fd55bb0ed76776dc9d4f3243</id>
<content type='text'>
Let elevators set -&gt;icq_size and -&gt;icq_align in elevator_type and
elv_register() and elv_unregister() respectively create and destroy
kmem_cache for icq.

* elv_register() now can return failure.  All callers updated.

* icq caches are automatically named "ELVNAME_io_cq".

* cfq_slab_setup/kill() are collapsed into cfq_init/exit().

* While at it, minor indentation change for iosched_cfq.elevator_name
  for consistency.

This will help moving icq management to block core.  This doesn't
introduce any functional change.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Let elevators set -&gt;icq_size and -&gt;icq_align in elevator_type and
elv_register() and elv_unregister() respectively create and destroy
kmem_cache for icq.

* elv_register() now can return failure.  All callers updated.

* icq caches are automatically named "ELVNAME_io_cq".

* cfq_slab_setup/kill() are collapsed into cfq_init/exit().

* While at it, minor indentation change for iosched_cfq.elevator_name
  for consistency.

This will help moving icq management to block core.  This doesn't
introduce any functional change.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block, cfq: move cfqd-&gt;icq_list to request_queue and add request-&gt;elv.icq</title>
<updated>2011-12-13T23:33:41+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2011-12-13T23:33:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a612fddf0d8090f2877305c9168b6c1a34fb5d90'/>
<id>a612fddf0d8090f2877305c9168b6c1a34fb5d90</id>
<content type='text'>
Most of icq management is about to be moved out of cfq into blk-ioc.
This patch prepares for it.

* Move cfqd-&gt;icq_list to request_queue-&gt;icq_list

* Make request explicitly point to icq instead of through elevator
  private data.  -&gt;elevator_private[3] is replaced with sub struct elv
  which contains icq pointer and priv[2].  cfq is updated accordingly.

* Meaningless clearing of -&gt;elevator_private[0] removed from
  elv_set_request().  At that point in code, the field was guaranteed
  to be %NULL anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Most of icq management is about to be moved out of cfq into blk-ioc.
This patch prepares for it.

* Move cfqd-&gt;icq_list to request_queue-&gt;icq_list

* Make request explicitly point to icq instead of through elevator
  private data.  -&gt;elevator_private[3] is replaced with sub struct elv
  which contains icq pointer and priv[2].  cfq is updated accordingly.

* Meaningless clearing of -&gt;elevator_private[0] removed from
  elv_set_request().  At that point in code, the field was guaranteed
  to be %NULL anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: remove elevator_queue-&gt;ops</title>
<updated>2011-12-13T23:33:41+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2011-12-13T23:33:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=22f746e235a5cbee2a6ca9887b1be2aa7d31fe71'/>
<id>22f746e235a5cbee2a6ca9887b1be2aa7d31fe71</id>
<content type='text'>
elevator_queue-&gt;ops points to the same ops struct -&gt;elevator_type.ops
is pointing to.  The only effect of caching it in elevator_queue is
shorter notation - it doesn't save any indirect derefence.

Relocate elevator_type-&gt;list which used only during module init/exit
to the end of the structure, rename elevator_queue-&gt;elevator_type to
-&gt;type, and replace elevator_queue-&gt;ops with elevator_queue-&gt;type.ops.

This doesn't introduce any functional difference.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
elevator_queue-&gt;ops points to the same ops struct -&gt;elevator_type.ops
is pointing to.  The only effect of caching it in elevator_queue is
shorter notation - it doesn't save any indirect derefence.

Relocate elevator_type-&gt;list which used only during module init/exit
to the end of the structure, rename elevator_queue-&gt;elevator_type to
-&gt;type, and replace elevator_queue-&gt;ops with elevator_queue-&gt;type.ops.

This doesn't introduce any functional difference.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: reorder elevator switch sequence</title>
<updated>2011-12-13T23:33:40+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2011-12-13T23:33:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f8fc877d3c1f10457d0d73d8540a0c51a1fa718a'/>
<id>f8fc877d3c1f10457d0d73d8540a0c51a1fa718a</id>
<content type='text'>
Elevator switch sequence first attached the new elevator, then tried
registering it (sysfs) and if that failed attached back the old
elevator.  However, sysfs registration doesn't require the elevator to
be attached, so there is no reason to do the "detach, attach new,
register, maybe re-attach old" sequence.  It can just do "register,
detach, attach".

* elevator_init_queue() is updated to set -&gt;elevator_data directly and
  return 0 / -errno.  This allows elevator_exit() on an unattached
  elevator.

* __elv_unregister_queue() which was necessary to unregister
  unattached q is removed in favor of __elv_register_queue() which can
  register unattached q.

* elevator_attach() becomes a single assignment and obscures more then
  it helps.  Dropped.

This will help cleaning up io_context handling across elevator switch.

This patch doesn't introduce visible behavior change.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Elevator switch sequence first attached the new elevator, then tried
registering it (sysfs) and if that failed attached back the old
elevator.  However, sysfs registration doesn't require the elevator to
be attached, so there is no reason to do the "detach, attach new,
register, maybe re-attach old" sequence.  It can just do "register,
detach, attach".

* elevator_init_queue() is updated to set -&gt;elevator_data directly and
  return 0 / -errno.  This allows elevator_exit() on an unattached
  elevator.

* __elv_unregister_queue() which was necessary to unregister
  unattached q is removed in favor of __elv_register_queue() which can
  register unattached q.

* elevator_attach() becomes a single assignment and obscures more then
  it helps.  Dropped.

This will help cleaning up io_context handling across elevator switch.

This patch doesn't introduce visible behavior change.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block, cfq: remove delayed unlink</title>
<updated>2011-12-13T23:33:39+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2011-12-13T23:33:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b9a1920837bc53430d339380e393a6e4c372939f'/>
<id>b9a1920837bc53430d339380e393a6e4c372939f</id>
<content type='text'>
Now that all cic's are immediately unlinked from both ioc and queue,
lazy dropping from lookup path and trimming on elevator unregister are
unnecessary.  Kill them and remove now unused elevator_ops-&gt;trim().

This also leaves call_for_each_cic() without any user.  Removed.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now that all cic's are immediately unlinked from both ioc and queue,
lazy dropping from lookup path and trimming on elevator unregister are
unnecessary.  Kill them and remove now unused elevator_ops-&gt;trim().

This also leaves call_for_each_cic() without any user.  Removed.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown</title>
<updated>2011-10-19T12:42:16+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2011-10-19T12:42:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c9a929dde3913780b5c416f4bb9d9ed804f509ce'/>
<id>c9a929dde3913780b5c416f4bb9d9ed804f509ce</id>
<content type='text'>
request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.

This is fundamentally broken.  The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.

With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.

 sd 0:0:1:0: [sdb] Stopping disk
 ata1.01: disabled
 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
 RIP: 0010:[&lt;ffffffff8137d651&gt;]  [&lt;ffffffff8137d651&gt;] elv_rqhash_find+0x61/0x100
 ...
 Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
 ...
 Call Trace:
  [&lt;ffffffff8137d774&gt;] elv_merge+0x84/0xe0
  [&lt;ffffffff81385b54&gt;] blk_queue_bio+0xf4/0x400
  [&lt;ffffffff813838ea&gt;] generic_make_request+0xca/0x100
  [&lt;ffffffff81383994&gt;] submit_bio+0x74/0x100
  [&lt;ffffffff811c53ec&gt;] dio_bio_submit+0xbc/0xc0
  [&lt;ffffffff811c610e&gt;] __blockdev_direct_IO+0x92e/0xb40
  [&lt;ffffffff811c39f7&gt;] blkdev_direct_IO+0x57/0x60
  [&lt;ffffffff8113b1c5&gt;] generic_file_aio_read+0x6d5/0x760
  [&lt;ffffffff8118c1ca&gt;] do_sync_read+0xda/0x120
  [&lt;ffffffff8118ce55&gt;] vfs_read+0xc5/0x180
  [&lt;ffffffff8118cfaa&gt;] sys_pread64+0x9a/0xb0
  [&lt;ffffffff81afaf6b&gt;] system_call_fastpath+0x16/0x1b

This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.

Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not.  Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.

The way it should work is having the usual clear distinction between
shutdown and release.  Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue.  Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immmediately failed.  The rest of teardown
happens on release.

This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.

* QUEUE_FLAG_DEAD is now set while holding both q-&gt;exit_mutex and
  queue_lock.

* Unsynchronized DEAD check in generic_make_request_checks() removed.
  This couldn't make any meaningful difference as the queue could die
  after the check.

* blk_drain_queue() updated such that it can drain all requests and is
  now called during cleanup.

* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
  drains all throttled bios during cleanup and free td when queue is
  released.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.

This is fundamentally broken.  The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.

With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.

 sd 0:0:1:0: [sdb] Stopping disk
 ata1.01: disabled
 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
 RIP: 0010:[&lt;ffffffff8137d651&gt;]  [&lt;ffffffff8137d651&gt;] elv_rqhash_find+0x61/0x100
 ...
 Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
 ...
 Call Trace:
  [&lt;ffffffff8137d774&gt;] elv_merge+0x84/0xe0
  [&lt;ffffffff81385b54&gt;] blk_queue_bio+0xf4/0x400
  [&lt;ffffffff813838ea&gt;] generic_make_request+0xca/0x100
  [&lt;ffffffff81383994&gt;] submit_bio+0x74/0x100
  [&lt;ffffffff811c53ec&gt;] dio_bio_submit+0xbc/0xc0
  [&lt;ffffffff811c610e&gt;] __blockdev_direct_IO+0x92e/0xb40
  [&lt;ffffffff811c39f7&gt;] blkdev_direct_IO+0x57/0x60
  [&lt;ffffffff8113b1c5&gt;] generic_file_aio_read+0x6d5/0x760
  [&lt;ffffffff8118c1ca&gt;] do_sync_read+0xda/0x120
  [&lt;ffffffff8118ce55&gt;] vfs_read+0xc5/0x180
  [&lt;ffffffff8118cfaa&gt;] sys_pread64+0x9a/0xb0
  [&lt;ffffffff81afaf6b&gt;] system_call_fastpath+0x16/0x1b

This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.

Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not.  Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.

The way it should work is having the usual clear distinction between
shutdown and release.  Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue.  Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immmediately failed.  The rest of teardown
happens on release.

This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.

* QUEUE_FLAG_DEAD is now set while holding both q-&gt;exit_mutex and
  queue_lock.

* Unsynchronized DEAD check in generic_make_request_checks() removed.
  This couldn't make any meaningful difference as the queue could die
  after the check.

* blk_drain_queue() updated such that it can drain all requests and is
  now called during cleanup.

* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
  drains all throttled bios during cleanup and free td when queue is
  released.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
</feed>
