Age | Commit message (Collapse) | Author |
|
BugLink: http://bugs.launchpad.net/bugs/946928
We create "bsg" link if q->kobj.sd is not NULL, so remove it only
when the same condition is true.
Fixes:
WARNING: at fs/sysfs/inode.c:323 sysfs_hash_and_remove+0x2b/0x77()
sysfs: can not remove 'bsg', no directory
Call Trace:
[<c0429683>] warn_slowpath_common+0x6a/0x7f
[<c0537a68>] ? sysfs_hash_and_remove+0x2b/0x77
[<c042970b>] warn_slowpath_fmt+0x2b/0x2f
[<c0537a68>] sysfs_hash_and_remove+0x2b/0x77
[<c053969a>] sysfs_remove_link+0x20/0x23
[<c05d88f1>] bsg_unregister_queue+0x40/0x6d
[<c0692263>] __scsi_remove_device+0x31/0x9d
[<c069149f>] scsi_forget_host+0x41/0x52
[<c0689fa9>] scsi_remove_host+0x71/0xe0
[<f7de5945>] quiesce_and_remove_host+0x51/0x83 [usb_storage]
[<f7de5a1e>] usb_stor_disconnect+0x18/0x22 [usb_storage]
[<c06c29de>] usb_unbind_interface+0x4e/0x109
[<c067a80f>] __device_release_driver+0x6b/0xa6
[<c067a861>] device_release_driver+0x17/0x22
[<c067a46a>] bus_remove_device+0xd6/0xe6
[<c06785e2>] device_del+0xf2/0x137
[<c06c101f>] usb_disable_device+0x94/0x1a0
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 37b40adf2d1b4a5e51323be73ccf8ddcf3f15dd3)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
|
|
commit 0bfc96cb77224736dfa35c3c555d37b3646ef35e upstream.
[ Changes with respect to 3.3: return -ENOTTY from scsi_verify_blk_ioctl
and -ENOIOCTLCMD from sd_compat_ioctl. ]
Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
will pass the command to the underlying block device. This is
well-known, but it is also a large security problem when (via Unix
permissions, ACLs, SELinux or a combination thereof) a program or user
needs to be granted access only to part of the disk.
This patch lets partitions forward a small set of harmless ioctls;
others are logged with printk so that we can see which ioctls are
actually sent. In my tests only CDROM_GET_CAPABILITY actually occurred.
Of course it was being sent to a (partition on a) hard disk, so it would
have failed with ENOTTY and the patch isn't changing anything in
practice. Still, I'm treating it specially to avoid spamming the logs.
In principle, this restriction should include programs running with
CAP_SYS_RAWIO. If for example I let a program access /dev/sda2 and
/dev/sdb, it still should not be able to read/write outside the
boundaries of /dev/sda2 independent of the capabilities. However, for
now programs with CAP_SYS_RAWIO will still be allowed to send the
ioctls. Their actions will still be logged.
This patch does not affect the non-libata IDE driver. That driver
however already tests for bd != bd->bd_contains before issuing some
ioctl; it could be restricted further to forbid these ioctls even for
programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO.
Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Make it also print the command name when warning - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backport to 2.6.32 - ENOIOCTLCMD does not get converted to
ENOTTY, so we must return ENOTTY directly]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 577ebb374c78314ac4617242f509e2f5e7156649 upstream.
Introduce a wrapper around scsi_cmd_ioctl that takes a block device.
The function will then be enhanced to detect partition block devices
and, in that case, subject the ioctls to whitelisting.
Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
[bwh: Backport to 2.6.32 - adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 5eb46851de3904cd1be9192fdacb8d34deadc1fc upstream.
cfq_cic_link() has race condition. When some processes which shared ioc
issue I/O to same block device simultaneously, cfq_cic_link() returns -EEXIST
sometimes. The race condition might stop I/O by following steps:
step 1: Process A: Issue an I/O to /dev/sda
step 2: Process A: Get an ioc (iocA here) in get_io_context() which does not
linked with a cic for the device
step 3: Process A: Get a new cic for the device (cicA here) in
cfq_alloc_io_context()
step 4: Process B: Issue an I/O to /dev/sda
step 5: Process B: Get iocA in get_io_context() since process A and B share the
same ioc
step 6: Process B: Get a new cic for the device (cicB here) in
cfq_alloc_io_context() since iocA has not been linked with a
cic for the device yet
step 7: Process A: Link cicA to iocA in cfq_cic_link()
step 8: Process A: Dispatch I/O to driver and finish it
step 9: Process B: Try to link cicB to iocA in cfq_cic_link()
But it fails with showing "cfq: cic link failed!" kernel
message, since iocA has already linked with cicA at step 7.
step 10: Process B: Wait for finishig I/O in get_request_wait()
The function does not wake up, when there is no I/O to the
device.
When cfq_cic_link() returns -EEXIST, it means ioc has already linked with cic.
So when cfq_cic_link() return -EEXIST, retry cfq_cic_lookup().
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit c10b61f0910466b4b99c266a7d76ac4390743fb5 upstream.
Hi,
A user reported a kernel bug when running a particular program that did
the following:
created 32 threads
- each thread took a mutex, grabbed a global offset, added a buffer size
to that offset, released the lock
- read from the given offset in the file
- created a new thread to do the same
- exited
The result is that cfq's close cooperator logic would trigger, as the
threads were issuing I/O within the mean seek distance of one another.
This workload managed to routinely trigger a use after free bug when
walking the list of merge candidates for a particular cfqq
(cfqq->new_cfqq). The logic used for merging queues looks like this:
static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
{
int process_refs, new_process_refs;
struct cfq_queue *__cfqq;
/* Avoid a circular list and skip interim queue merges */
while ((__cfqq = new_cfqq->new_cfqq)) {
if (__cfqq == cfqq)
return;
new_cfqq = __cfqq;
}
process_refs = cfqq_process_refs(cfqq);
/*
* If the process for the cfqq has gone away, there is no
* sense in merging the queues.
*/
if (process_refs == 0)
return;
/*
* Merge in the direction of the lesser amount of work.
*/
new_process_refs = cfqq_process_refs(new_cfqq);
if (new_process_refs >= process_refs) {
cfqq->new_cfqq = new_cfqq;
atomic_add(process_refs, &new_cfqq->ref);
} else {
new_cfqq->new_cfqq = cfqq;
atomic_add(new_process_refs, &cfqq->ref);
}
}
When a merge candidate is found, we add the process references for the
queue with less references to the queue with more. The actual merging
of queues happens when a new request is issued for a given cfqq. In the
case of the test program, it only does a single pread call to read in
1MB, so the actual merge never happens.
Normally, this is fine, as when the queue exits, we simply drop the
references we took on the other cfqqs in the merge chain:
/*
* If this queue was scheduled to merge with another queue, be
* sure to drop the reference taken on that queue (and others in
* the merge chain). See cfq_setup_merge and cfq_merge_cfqqs.
*/
__cfqq = cfqq->new_cfqq;
while (__cfqq) {
if (__cfqq == cfqq) {
WARN(1, "cfqq->new_cfqq loop detected\n");
break;
}
next = __cfqq->new_cfqq;
cfq_put_queue(__cfqq);
__cfqq = next;
}
However, there is a hole in this logic. Consider the following (and
keep in mind that each I/O keeps a reference to the cfqq):
q1->new_cfqq = q2 // q2 now has 2 process references
q3->new_cfqq = q2 // q2 now has 3 process references
// the process associated with q2 exits
// q2 now has 2 process references
// queue 1 exits, drops its reference on q2
// q2 now has 1 process reference
// q3 exits, so has 0 process references, and hence drops its references
// to q2, which leaves q2 also with 0 process references
q4 comes along and wants to merge with q3
q3->new_cfqq still points at q2! We follow that link and end up at an
already freed cfqq.
So, the fix is to not follow a merge chain if the top-most queue does
not have a process reference, otherwise any queue in the chain could be
already freed. I also changed the logic to disallow merging with a
queue that does not have any process references. Previously, we did
this check for one of the merge candidates, but not the other. That
doesn't really make sense.
Without the attached patch, my system would BUG within a couple of
seconds of running the reproducer program. With the patch applied, my
system ran the program for over an hour without issues.
This addresses the following bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=16217
Thanks a ton to Phil Carns for providing the bug report and an excellent
reproducer.
[ Note for stable: this applies to 2.6.32/33/34 ].
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Phil Carns <carns@mcs.anl.gov>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit e00ef7997195e4f8e10593727a6286e2e2802159 upstream
We need to rework this logic post the cooperating cfq_queue merging,
for now just get rid of it and Jeff Moyer will fix the fall out.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit e6c5bc737ab71e4af6025ef7d150f5a26ae5f146 upstream.
cfq_queues are merged if they are issuing requests within the mean seek
distance of one another. This patch detects when the coopearting stops and
breaks the queues back up.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit b3b6d0408c953524f979468562e7e210d8634150 upstream
The flag used to indicate that a cfqq was allowed to jump ahead in the
scheduling order due to submitting a request close to the queue that
just executed. Since closely cooperating queues are now merged, the flag
holds little meaning. Change it to indicate that multiple queues were
merged. This will later be used to allow the breaking up of merged queues
when they are no longer cooperating.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit df5fe3e8e13883f58dc97489076bbcc150789a21 upstream.
When cooperating cfq_queues are detected currently, they are allowed to
skip ahead in the scheduling order. It is much more efficient to
automatically share the cfq_queue data structure between cooperating processes.
Performance of the read-test2 benchmark (which is written to emulate the
dump(8) utility) went from 12MB/s to 90MB/s on my SATA disk. NFS servers
with multiple nfsd threads also saw performance increases.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit b2c18e1e08a5a9663094d57bb4be2f02226ee61c upstream.
async cfq_queue's are already shared between processes within the same
priority, and forthcoming patches will change the mapping of cic to sync
cfq_queue from 1:1 to 1:N. So, calculate the seekiness of a process
based on the cfq_queue instead of the cfq_io_context.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 3181faa85bda3dc3f5e630a1846526c9caaa38e3 upstream.
I got a rcu warnning at boot. the ioc->ioc_data is rcu_deferenced, but
doesn't hold rcu_read_lock.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit ab4bd22d3cce6977dc039664cc2d052e3147d662 upstream.
Since we are modifying this RCU pointer, we need to hold
the lock protecting it around it.
This fixes a potential reuse and double free of a cfq
io_context structure. The bug has been in CFQ for a long
time, it hit very few people but those it did hit seemed
to see it a lot.
Tracked in RH bugzilla here:
https://bugzilla.redhat.com/show_bug.cgi?id=577968
Credit goes to Paul Bolle for figuring out that the issue
was around the one-hit ioc->ioc_data cache. Thanks to his
hard work the issue is now fixed.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit d86e0e83b32bc84600adb0b6ea1fce389b266682 upstream.
We need them in SCSI to fix a bug, but currently they are not
exported to modules. Export them.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 0a58e077eb600d1efd7e54ad9926a75a39d7f8ae upstream.
blk_cleanup_queue() calls elevator_exit() and after this, we can't
touch the elevator without oopsing. __elv_next_request() must check
for this state because in the refcounted queue model, we can still
call it after blk_cleanup_queue() has been called.
This was reported as causing an oops attributable to scsi.
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit ed5302d3c25006a9edc7a7fbea97a30483f89ef7 upstream.
We do not call blk_trace_remove_sysfs() in err return path
if kobject_add() fails. This path fixes it.
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit e692cb668fdd5a712c6ed2a2d6f2a36ee83997b4 upstream.
When stacking devices, a request_queue is not always available. This
forced us to have a no_cluster flag in the queue_limits that could be
used as a carrier until the request_queue had been set up for a
metadevice.
There were several problems with that approach. First of all it was up
to the stacking device to remember to set queue flag after stacking had
completed. Also, the queue flag and the queue limits had to be kept in
sync at all times. We got that wrong, which could lead to us issuing
commands that went beyond the max scatterlist limit set by the driver.
The proper fix is to avoid having two flags for tracking the same thing.
We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
block layer merging functions. The queue_limit 'no_cluster' is turned
into 'cluster' to avoid double negatives and to ease stacking.
Clustering defaults to being enabled as before. The queue flag logic is
removed from the stacking function, and explicitly setting the cluster
flag is no longer necessary in DM and MD.
Reported-by: Ed Lin <ed.lin@promise.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 9284bcf4e335e5f18a8bc7b26461c33ab60d0689 upstream.
Ensure that we pass down properly validated iov segments before
calling into the mapping or copy functions.
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 9f864c80913467312c7b8690e41fb5ebd1b50e92 upstream.
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 892b6f90db81cccb723d5d92f4fddc2d68b206e1 upstream.
Physical block size was declared unsigned int to accomodate the maximum
size reported by READ CAPACITY(16). Make sure we use the right type in
the related functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 478971600e47cb83ff2d3c63c5c24f2b04b0d6a1 upstream.
bsg incorrectly returns sg's masked_status value for device_status.
[jejb: fix up expression logic]
Reported-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit a534dbe96e9929c7245924d8252d89048c23d569 upstream.
blk_rq_timed_out_timer() relied on blk_add_timer() never returning a
timer value of zero, but commit 7838c15b8dd18e78a523513749e5b54bda07b0cb
removed the code that bumped this value when it was zero.
Therefore when jiffies is near wrap we could get unlucky & not set the
timeout value correctly.
This patch uses a flag to indicate that the timeout value was set and so
handles jiffies wrap correctly, and it keeps all the logic in one
function so should be easier to maintain in the future.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
block: Backport of various I/O topology fixes from 2.6.33 and 2.6.34
The stacking code incorrectly scaled up the data offset in some cases
causing misaligned devices to report alignment. Rewrite the stacking
algorithm to remedy this.
(Upstream commit 9504e0864b58b4a304820dcf3755f1da80d5e72f)
The top device misalignment flag would not be set if the added bottom
device was already misaligned as opposed to causing a stacking failure.
Also massage the reporting so that an error is only returned if adding
the bottom device caused the misalignment. I.e. don't return an error
if the top is already flagged as misaligned.
(Upstream commit fe0b393f2c0a0d23a9bc9ed7dc51a1ee511098bd)
lcm() was defined to take integer-sized arguments. The supplied
arguments are multiplied, however, causing us to overflow given
sufficiently large input. That in turn led to incorrect optimal I/O
size reporting in some cases. Switch lcm() over to unsigned long
similar to gcd() and move the function from blk-settings.c to lib.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 17be8c245054b9c7786545af3ba3ca4e54cd4ad9 upstream.
DM does not want to know about partition offsets. Add a partition-aware
wrapper that DM can use when stacking block devices.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
CFQ has an optimization for cooperated applications. if several
io-context have close requests, they will get boost. But the
optimization get abused. Considering thread a, b, which work on one
file. a reads sectors s, s+2, s+4, ...; b reads sectors s+1, s+3, s
+5, ... Both a and b are sequential read, so they can open idle window.
a reads a sector s and goes to idle window and wakeup b. b reads sector
s+1, since in current implementation, cfq_should_preempt() thinks a and
b are cooperators, b will preempt a. b then reads sector s+1 and goes to
idle window and wakeup a. for the same reason, a will preempt b and
reads s+2. a and b will continue the circle. The circle will be very
long, and a and b will occupy whole disk queue. Other applications will
nearly have no chance to run.
Fix this limiting coop preempt until a queue is scheduled normally
again.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Commit a6151c3a5c8e1ff5a28450bc8d6a99a2a0add0a7 inadvertently reversed
a preempt condition check, potentially causing a performance regression.
Make the meta check correct again.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
With 2.6.32-rc5 in a KVM guest using dm and virtio_blk, we see the
following errors:
end_request: I/O error, dev vda, sector 0
end_request: I/O error, dev vda, sector 0
The errors go away if dm stops submitting empty barriers, by reverting:
commit 52b1fd5a27c625c78373e024bf570af3c9d44a79
Author: Mikulas Patocka <mpatocka@redhat.com>
dm: send empty barriers to targets in dm_flush
We should silently error all barriers, even empty barriers, on devices
like virtio_blk which don't support them.
See also:
https://bugzilla.redhat.com/514901
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Fix kernel-doc notation in blk-settings.c::blk_queue_max_discard_sectors().
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
elv_iosched_store() ignore the return value of strstrip(). It makes small
inconsistent behavior.
This patch fixes it.
<before>
====================================
# cd /sys/block/{blockdev}/queue
case1:
# echo "anticipatory" > scheduler
# cat scheduler
noop [anticipatory] deadline cfq
case2:
# echo "anticipatory " > scheduler
# cat scheduler
noop [anticipatory] deadline cfq
case3:
# echo " anticipatory" > scheduler
bash: echo: write error: Invalid argument
<after>
====================================
# cd /sys/block/{blockdev}/queue
case1:
# echo "anticipatory" > scheduler
# cat scheduler
noop [anticipatory] deadline cfq
case2:
# echo "anticipatory " > scheduler
# cat scheduler
noop [anticipatory] deadline cfq
case3:
# echo " anticipatory" > scheduler
noop [anticipatory] deadline cfq
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
If the average think time is larger than the remaining time slice
for any given queue, don't allow it to idle. A succesful idle also
means that we need to dispatch and complete a request, so if we don't
even have time left for the idle process, we would overrun the slice
in any case.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Saves 16 bytes of text, woohoo. But the more important point is
that it makes the code more readable when returning bool for 0/1
cases.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
CFQ enables idle only for processes that think less than the allowed
idle time. Since idle time is lower for seeky queues, we should use the
correct value in the comparison.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
We should subtract the slice residual from the rb tree key, since
a negative residual count indicates that the cfqq overran its slice
the last time. Hence we want to add the overrun time, to position
it a bit further away in the service tree.
Reported-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Makes the whole thing easier to read, cfq_dispatch_requests() was
a bit messy before.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Makes it easier to read than the 0.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Commit a9327cac440be4d8333bba975cbbf76045096275 added seperate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress seperately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.
So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b
The problem was in part_round_stats_single(), I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
It was briefly introduced to allow CFQ to to delayed scheduling,
but we ended up removing that feature again. So lets kill the
function and export, and just switch CFQ back to the normal work
schedule since it is now passing in a '0' delay from all call
sites.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
The RR service tree is indexed by a key that is relative to current jiffies.
This can cause problems on jiffies wraparound.
The patch fixes it using time_before comparison, and changing
the add_front path to use a relative number, too.
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
cfq uses rq->start_time as the fifo indicator, but that field may
get modified prior to cfq doing it's fifo list adjustment when
a request gets merged with another request. This can cause the
fifo list to become unordered.
Reported-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
This reverts commit a9327cac440be4d8333bba975cbbf76045096275.
Corrado Zoccolo <czoccolo@gmail.com> reports:
"with 2.6.32-rc1 I started getting the following strange output from
"iostat -kx 2":
Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
10,70 0,00 3,16 15,75 0,00 70,38
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 18,22 0,00 0,67 0,01 14,77 0,02
43,94 0,01 10,53 39043915,03 2629219,87
sdb 60,89 9,68 50,79 3,04 1724,43 50,52
65,95 0,70 13,06 488437,47 2629219,87
avg-cpu: %user %nice %system %iowait %steal %idle
2,72 0,00 0,74 0,00 0,00 96,53
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
sdb 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
avg-cpu: %user %nice %system %iowait %steal %idle
6,68 0,00 0,99 0,00 0,00 92,33
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
sdb 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
avg-cpu: %user %nice %system %iowait %steal %idle
4,40 0,00 0,73 1,47 0,00 93,40
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 100,00
sdb 0,00 4,00 0,00 3,00 0,00 28,00
18,67 0,06 19,50 333,33 100,00
Global values for service time and utilization are garbage. For
interval values, utilization is always 100%, and service time is
higher than normal.
I bisected it down to:
[a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write
statistics of in_flight requests
and verified that reverting just that commit indeed solves the issue
on 2.6.32-rc1."
So until this is debugged, revert the bad commit.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
We cannot delay for the first dispatch of the async queue if it
hasn't dispatched at all, since that could present a local user
DoS attack vector using an app that just did slow timed sync reads
while filling memory.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Not all users of the topology information want to use libblkid. Provide
the topology information through bdev ioctls.
Also clarify sector size comments for existing BLK ioctls.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
We should use the sysfs modified slice sync value, in case it differs
from the default.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Don't think that's necessarily a perfect description of what this
option fiddles with, but it's probably better than 'desktop'.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
This slowly ramps up the async queue depth based on the time
passed since the sync IO, and doesn't allow async at all until
a sync slice period has passed.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
o Do not allow more than max_dispatch requests from an async queue, if some
sync request has finished recently. This is in the hope that sync activity
is still going on in the system and we might receive a sync request soon.
Most likely from a sync queue which finished a request and we did not enable
idling on it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
This is basically identical to what Vivek Goyal posted, but combined
into one and labelled 'desktop' instead of 'fairness'. The goal
is to continue to improve on the latency side of things as it relates
to interactiveness, keeping the questionable bits under this sysfs
tunable so it would be easy for throughput-only people to turn off.
Apart from adding the interactive sysfs knob, it also adds the
behavioural change of allowing slice idling even if the hardware
does tagged command queuing.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Since 2.6.31 now has request-based device-mapper, it's useful to have
a tracepoint for request-remapping as well as bio-remapping.
This patch adds a tracepoint for request-remapping, trace_block_rq_remap().
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Currently we set the bio size to the byte equivalent of the blocks to
be trimmed when submitting the initial DISCARD ioctl. That means it
is subject to the max_hw_sectors limitation of the HBA which is
much lower than the size of a DISCARD request we can support.
Add a separate max_discard_sectors tunable to limit the size for discard
requests.
We limit the max discard request size in bytes to 32bit as that is the
limit for bio->bi_size. This could be much larger if we had a way to pass
that information through the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
prepare_discard_fn() was being called in a place where memory allocation
was effectively impossible. This makes it inappropriate for all but
the most trivial translations of Linux's DISCARD operation to the block
command set. Additionally adding a payload there makes the ownership
of the bio backing unclear as it's now allocated by the device driver
and not the submitter as usual.
It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether
the queue supports discard operations or not. blkdev_issue_discard now
allocates a one-page, sector-length payload which is the right thing
for the common ATA and SCSI implementations.
The mtd implementation of prepare_discard_fn() is replaced with simply
checking for the request being a discard.
Largely based on a previous patch from Matthew Wilcox <matthew@wil.cx>
which did the prepare_discard_fn but not the different payload allocation
yet.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
|
Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs
introduced in commit 1d54ad6da9192fed5dd3b60224d9f2dfea0dcd82.
Release kobject also in case the request_fn is NULL.
Problem was noticed via kmemleak backtrace when some sysfs entries were
note properly destroyed during device removal:
unreferenced object 0xffff88001aa76640 (size 80):
comm "lvcreate", pid 2120, jiffies 4294885144
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 f0 65 a7 1a 00 88 ff ff .........e......
90 66 a7 1a 00 88 ff ff 86 1d 53 81 ff ff ff ff .f........S.....
backtrace:
[<ffffffff813f9cc6>] kmemleak_alloc+0x26/0x60
[<ffffffff8111d693>] kmem_cache_alloc+0x133/0x1c0
[<ffffffff81195891>] sysfs_new_dirent+0x41/0x120
[<ffffffff81194b0c>] sysfs_add_file_mode+0x3c/0xb0
[<ffffffff81197c81>] internal_create_group+0xc1/0x1a0
[<ffffffff81197d93>] sysfs_create_group+0x13/0x20
[<ffffffff810d8004>] blk_trace_init_sysfs+0x14/0x20
[<ffffffff8123f45c>] blk_register_queue+0x3c/0xf0
[<ffffffff812447e4>] add_disk+0x94/0x160
[<ffffffffa00d8b08>] dm_create+0x598/0x6e0 [dm_mod]
[<ffffffffa00de951>] dev_create+0x51/0x350 [dm_mod]
[<ffffffffa00de823>] ctl_ioctl+0x1a3/0x240 [dm_mod]
[<ffffffffa00de8f2>] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod]
[<ffffffff81177bfd>] compat_sys_ioctl+0xcd/0x4f0
[<ffffffff81036ed8>] sysenter_dispatch+0x7/0x2c
[<ffffffffffffffff>] 0xffffffffffffffff
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|