summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2013-04-29Merge branch 'for-3.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Fixes and a lot of cleanups. Locking cleanup is finally complete. cgroup_mutex is no longer exposed to individual controlelrs which used to cause nasty deadlock issues. Li fixed and cleaned up quite a bit including long standing ones like racy cgroup_path(). - device cgroup now supports proper hierarchy thanks to Aristeu. - perf_event cgroup now supports proper hierarchy. - A new mount option "__DEVEL__sane_behavior" is added. As indicated by the name, this option is to be used for development only at this point and generates a warning message when used. Unfortunately, cgroup interface currently has too many brekages and inconsistencies to implement a consistent and unified hierarchy on top. The new flag is used to collect the behavior changes which are necessary to implement consistent unified hierarchy. It's likely that this flag won't be used verbatim when it becomes ready but will be enabled implicitly along with unified hierarchy. The option currently disables some of broken behaviors in cgroup core and also .use_hierarchy switch in memcg (will be routed through -mm), which can be used to make very unusual hierarchy where nesting is partially honored. It will also be used to implement hierarchy support for blk-throttle which would be impossible otherwise without introducing a full separate set of control knobs. This is essentially versioning of interface which isn't very nice but at this point I can't see any other options which would allow keeping the interface the same while moving towards hierarchy behavior which is at least somewhat sane. The planned unified hierarchy is likely to require some level of adaptation from userland anyway, so I think it'd be best to take the chance and update the interface such that it's supportable in the long term. Maintaining the existing interface does complicate cgroup core but shouldn't put too much strain on individual controllers and I think it'd be manageable for the foreseeable future. Maybe we'll be able to drop it in a decade. Fix up conflicts (including a semantic one adding a new #include to ppc that was uncovered by header the file changes) as per Tejun. * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits) cpuset: fix compile warning when CONFIG_SMP=n cpuset: fix cpu hotplug vs rebuild_sched_domains() race cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn() cgroup: restore the call to eventfd->poll() cgroup: fix use-after-free when umounting cgroupfs cgroup: fix broken file xattrs devcg: remove parent_cgroup. memcg: force use_hierarchy if sane_behavior cgroup: remove cgrp->top_cgroup cgroup: introduce sane_behavior mount option move cgroupfs_root to include/linux/cgroup.h cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix cgroup: make cgroup_path() not print double slashes Revert "cgroup: remove bind() method from cgroup_subsys." perf: make perf_event cgroup hierarchical cgroup: implement cgroup_is_descendant() cgroup: make sure parent won't be destroyed before its children cgroup: remove bind() method from cgroup_subsys. devcg: remove broken_hierarchy tag cgroup: remove cgroup_lock_is_held() ...
2013-04-29Merge tag 'driver-core-3.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core update from Greg Kroah-Hartman: "Here's the merge request for the driver core tree for 3.10-rc1 It's pretty small, just a number of driver core and sysfs updates and fixes, all of which have been in linux-next for a while now. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" Fixed conflict in kernel/rtmutex-tester.c, the locking tree had a better fix for the same sysfs file mode problem. * tag 'driver-core-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: PM / Runtime: Idle devices asynchronously after probe|release driver core: handle user namespaces properly with the uid/gid devtmpfs change driver core: devtmpfs: fix compile failure with CONFIG_UIDGID_STRICT_TYPE_CHECKS devtmpfs: add base.h include driver core: add uid and gid to devtmpfs sysfs: check if one entry has been removed before freeing sysfs: fix crash_notes_size build warning sysfs: fix use after free in case of concurrent read/write and readdir rtmutex-tester: fix mode of sysfs files Documentation: Add ABI entry for crash_notes and crash_notes_size sysfs: Add crash_notes_size to export percpu note size driver core: platform_device.h: fix checkpatch errors and warnings driver core: platform.c: fix checkpatch errors and warnings driver core: warn that platform_driver_probe can not use deferred probing sysfs: use atomic_inc_unless_negative in sysfs_get_active base: core: WARN() about bogus permissions on device attributes device: separate all subsys mutexes
2013-04-18Revert "block: add missing block_bio_complete() tracepoint"Linus Torvalds
This reverts commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f. Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic. Jens says: "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request." Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Requested-by: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-14Merge 3.9-rc7 into driver-core-nextGreg Kroah-Hartman
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-11driver core: handle user namespaces properly with the uid/gid devtmpfs changeGreg Kroah-Hartman
Now that devtmpfs is caring about uid/gid, we need to use the correct internal types so users who have USER_NS enabled will have things work properly for them. Thanks to Eric for pointing this out, and the patch review. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-08driver core: add uid and gid to devtmpfsKay Sievers
Some drivers want to tell userspace what uid and gid should be used for their device nodes, so allow that information to percolate through the driver core to userspace in order to make this happen. This means that some systems (i.e. Android and friends) will not need to even run a udev-like daemon for their device node manager and can just rely in devtmpfs fully, reducing their footprint even more. Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-08Revert "loop: cleanup partitions when detaching loop device"Jens Axboe
This reverts commit 8761a3dc1f07b163414e2215a2cadbb4cfe2a107. There are situations where the destruction path is called with the bdev->bd_mutex already held, which then deadlocks in loop_clr_fd(). The normal partition cleanup does a trylock() on the mutex, but it'd be nice to have a more bullet proof method in loop. So punt this more involved fix to the next merge window, and just back out this buggy fix for now. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-04-03block: avoid using uninitialized value in from queue_var_storeArnd Bergmann
As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions that use a value generated by queue_var_store independent of whether that value was set or not. block/blk-sysfs.c: In function 'queue_store_nonrot': block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized] Unlike most other such warnings, this one is not a false positive, writing any non-number string into the sysfs files indeed has an undefined result, rather than returning an error. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22Block: blk-flush: Fixed indent code styleAlice Ferrazzi
Fixed code indent should use tabs where possible. Signed-off-by: Alice Ferrazzi <alice.ferrazzi@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22loop: cleanup partitions when detaching loop devicePhillip Susi
Any partitions added by user space to the loop device were being left in place after detaching the loop device. This was because the detach path issued a BLKRRPART to clean up partitions if LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto scanned on attach. Replace this BLKRRPART with code that unconditionally cleans up partitions on detach instead. Signed-off-by: Phillip Susi <psusi@ubuntu.com> Modified by Jens to export delete_partition(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-04cgroup: fix cgroup_path() vs rename() raceLi Zefan
rename() will change dentry->d_name. The result of this race can be worse than seeing partially rewritten name, but we might access a stale pointer because rename() will re-allocate memory to hold a longer name. As accessing dentry->name must be protected by dentry->d_lock or parent inode's i_mutex, while on the other hand cgroup-path() can be called with some irq-safe spinlocks held, we can't generate cgroup path using dentry->d_name. Alternatively we make a copy of dentry->d_name and save it in cgrp->name when a cgroup is created, and update cgrp->name at rename(). v5: use flexible array instead of zero-size array. v4: - allocate root_cgroup_name and all root_cgroup->name points to it. - add cgroup_name() wrapper. v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path. v2: make cgrp->name RCU safe. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-28Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...
2013-02-27hlist: drop the node parameter from iteratorsSasha Levin
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block/partitions: optimize memory allocation in check_partition()Ming Lei
Currently, sizeof(struct parsed_partitions) may be 64KB in 32bit arch, so it is easy to trigger page allocation failure by check_partition, especially in hotplug block device situation(such as, USB mass storage, MMC card, ...), and Felipe Balbi has observed the failure. This patch does below optimizations on the allocation of struct parsed_partitions to try to address the issue: - make parsed_partitions.parts as pointer so that the pointed memory can fit in 32KB buffer, then approximate 32KB memory can be saved - vmalloc the buffer pointed by parsed_partitions.parts because 32KB is still a bit big for kmalloc - given that many devices have the partition count limit, so only allocate disk_max_parts() partitions instead of 256 partitions always Signed-off-by: Ming Lei <ming.lei@canonical.com> Reported-by: Felipe Balbi <balbi@ti.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block/partitions/mac.c: obey the state->limit constraintMing Lei
It isn't necessary to read the information of partitions whose number is equal and more than state->limit since only maximum state->limit partitions will be added inside rescan_partitions(). That is also what other kind of partitions are doing. Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block/partitions/efi.c: ensure that the GPT header is at least the size of ↵Peter Jones
the structure. UEFI 2.3.1D will include a change to the spec language mandating that a GPT header must be greater than *or equal to* the size of the defined structure. While verifying that this would work on Linux, I discovered that we're not actually checking the minimum bound at all. The result of this is that when we verify the checksum, it's possible that on a malformed header (with header_size of 0), we won't actually verify any data. [akpm@linux-foundation.org: fix printk warning] Signed-off-by: Peter Jones <pjones@redhat.com> Acked-by: Matt Fleming <matt.fleming@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Stephen Warren <swarren@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block/partition/msdos: detect AIX formatted disks even without 55aaPhilippe De Muyter
AIX formatted disks do not always have the MSDOS 55aa signature. This happens e.g. for unbootable AIX disks. Up to now, such disks were not recognized as AIX disks, because of the missing 55aa. Fix that by inverting the two tests. Let's first check for the AIX magic strings, and only if that fails check for the MSDOS magic word. Signed-off-by: Philippe De Muyter <phdm@macqel.be> Cc: Andreas Mohr <andi@lisas.de> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Jens Axboe <axboe@kernel.dk> Cc: Olaf Hering <olh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block: convert to idr_alloc()Tejun Heo
Convert to the much saner new idr interface. Both bsg and genhd protect idr w/ mutex making preloading unnecessary. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block: fix synchronization and limit check in blk_alloc_devt()Tejun Heo
idr allocation in blk_alloc_devt() wasn't synchronized against lookup and removal, and its limit check was off by one - 1 << MINORBITS is the number of minors allowed, not the maximum allowed minor. Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit checking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27block: fix ext_devt_idr handlingTomas Henzl
While adding and removing a lot of disks disks and partitions this sometimes shows up: WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted) Hardware name: sysfs: cannot create duplicate filename '/dev/block/259:751' Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan] Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1 Call Trace: warn_slowpath_common+0x87/0xc0 warn_slowpath_fmt+0x46/0x50 sysfs_add_one+0xc9/0x130 sysfs_do_create_link+0x12b/0x170 sysfs_create_link+0x13/0x20 device_add+0x317/0x650 idr_get_new+0x13/0x50 add_partition+0x21c/0x390 rescan_partitions+0x32b/0x470 sd_open+0x81/0x1f0 [sd_mod] __blkdev_get+0x1b6/0x3c0 blkdev_get+0x10/0x20 register_disk+0x155/0x170 add_disk+0xa6/0x160 sd_probe_async+0x13b/0x210 [sd_mod] add_wait_queue+0x46/0x60 async_thread+0x102/0x250 default_wake_function+0x0/0x20 async_thread+0x0/0x250 kthread+0x96/0xa0 child_rip+0xa/0x20 kthread+0x0/0xa0 child_rip+0x0/0x20 This most likely happens because dev_t is freed while the number is still used and idr_get_new() is not protected on every use. The fix adds a mutex where it wasn't before and moves the dev_t free function so it is called after device del. Signed-off-by: Tomas Henzl <thenzl@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23block/genhd.c: apply pm_runtime_set_memalloc_noio on block devicesMing Lei
Apply the introduced pm_runtime_set_memalloc_noio on block device so that PM core will teach mm to not allocate memory with GFP_IOFS when calling the runtime_resume and runtime_suspend callback for block devices and its ancestors. Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-22cfq: fix lock imbalance with failed allocationsGlauber Costa
While stress-running very-small container scenarios with the Kernel Memory Controller, I've run into a lockdep-detected lock imbalance in cfq-iosched.c. I'll apologize beforehand for not posting a backlog: I didn't anticipate it would be so hard to reproduce, so I didn't save my serial output and went directly on debugging. Turns out that it did not happen again in more than 20 runs, making it a quite rare pattern. But here is my analysis: When we are in very low-memory situations, we will arrive at cfq_find_alloc_queue and may not find a queue, having to resort to the oom queue, in an rcu-locked condition: if (!cfqq || cfqq == &cfqd->oom_cfqq) [ ... ] Next, we will release the rcu lock, and try to allocate a queue, retrying if we succeed: rcu_read_unlock(); spin_unlock_irq(cfqd->queue->queue_lock); new_cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask | __GFP_ZERO, cfqd->queue->node); spin_lock_irq(cfqd->queue->queue_lock); if (new_cfqq) goto retry; We are unlocked at this point, but it should be fine, since we will reacquire the rcu_read_lock when we retry. Except of course, that we may not retry: the allocation may very well fail and we'll keep on going through the flow: The next branch is: if (cfqq) { [ ... ] } else cfqq = &cfqd->oom_cfqq; And right before exiting, we'll issue rcu_read_unlock(). Being already unlocked, this is the likely source of our imbalance. Since cfqq is either already NULL or made NULL in the first statement of the outter branch, the only viable alternative here seems to be to return the oom queue right away in case of allocation failure. Please review the following patch and apply if you agree with my analysis. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-22block: don't select PERCPU_RWSEMMikulas Patocka
The block device doesn't use percpu rw-semaphore anymore, so don't select it for compilation. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-21block: optionally snapshot page contents to provide stable pages during writeDarrick J. Wong
This provides a band-aid to provide stable page writes on jbd without needing to backport the fixed locking and page writeback bit handling schemes of jbd2. The band-aid works by using bounce buffers to snapshot page contents instead of waiting. For those wondering about the ext3 bandage -- fixing the jbd locking (which was done as part of ext4dev years ago) is a lot of surgery, and setting PG_writeback on data pages when we actually hold the page lock dropped ext3 performance by nearly an order of magnitude. If we're going to migrate iscsi and raid to use stable page writes, the complaints about high latency will likely return. We might as well centralize their page snapshotting thing to one place. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21bdi: allow block devices to say that they require stable page writesDarrick J. Wong
This patchset ("stable page writes, part 2") makes some key modifications to the original 'stable page writes' patchset. First, it provides creators (devices and filesystems) of a backing_dev_info a flag that declares whether or not it is necessary to ensure that page contents cannot change during writeout. It is no longer assumed that this is true of all devices (which was never true anyway). Second, the flag is used to relaxed the wait_on_page_writeback calls so that wait only occurs if the device needs it. Third, it fixes up the remaining disk-backed filesystems to use this improved conditional-wait logic to provide stable page writes on those filesystems. It is hoped that (for people not using checksumming devices, anyway) this patchset will give back unnecessary performance decreases since the original stable page write patchset went into 3.0. Sorry about not fixing it sooner. Complaints were registered by several people about the long write latencies introduced by the original stable page write patchset. Generally speaking, the kernel ought to allocate as little extra memory as possible to facilitate writeout, but for people who simply cannot wait, a second page stability strategy is (re)introduced: snapshotting page contents. The waiting behavior is still the default strategy; to enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be set. This flag is used to bandaid^Henable stable page writeback on ext3[1], and is not used anywhere else. Given that there are already a few storage devices and network FSes that have rolled their own page stability wait/page snapshot code, it would be nice to move towards consolidating all of these. It seems possible that iscsi and raid5 may wish to use the new stable page write support to enable zero-copy writeout. Thank you to Jan Kara for helping fix a couple more filesystems. Per Andrew Morton's request, here are the result of using dbench to measure latencies on ext2: 3.8.0-rc3: Operation Count AvgLat MaxLat ---------------------------------------- WriteX 109347 0.028 59.817 ReadX 347180 0.004 3.391 Flush 15514 29.828 287.283 Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms 3.8.0-rc3 + patches: WriteX 105556 0.029 4.273 ReadX 335004 0.005 4.112 Flush 14982 30.540 298.634 Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms As you can see, for ext2 the maximum write latency decreases from ~60ms on a laptop hard disk to ~4ms. I'm not sure why the flush latencies increase, though I suspect that being able to dirty pages faster gives the flusher more work to do. On ext4, the average write latency decreases as well as all the maximum latencies: 3.8.0-rc3: WriteX 85624 0.152 33.078 ReadX 272090 0.010 61.210 Flush 12129 36.219 168.260 Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms 3.8.0-rc3 + patches: WriteX 86082 0.141 30.928 ReadX 273358 0.010 36.124 Flush 12214 34.800 165.689 Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms XFS seems to exhibit similar latency improvements as ext2: 3.8.0-rc3: WriteX 125739 0.028 104.343 ReadX 399070 0.005 4.115 Flush 17851 25.004 131.390 Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms 3.8.0-rc3 + patches: WriteX 123529 0.028 6.299 ReadX 392434 0.005 4.287 Flush 17549 25.120 188.687 Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms ...and btrfs, just to round things out, also shows some latency decreases: 3.8.0-rc3: WriteX 67122 0.083 82.355 ReadX 212719 0.005 2.828 Flush 9547 47.561 147.418 Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms 3.8.0-rc3 + patches: WriteX 64898 0.101 71.631 ReadX 206673 0.005 7.123 Flush 9190 47.963 219.034 Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too. After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own wait code, or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes. This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same results as -rc3. [1] The alternative fixes to ext3 include fixing the locking order and page bit handling like we did for ext4 (but then why not just use ext4?), or setting PG_writeback so early that ext3 becomes extremely slow. I tried that, but the number of write()s I could initiate dropped by nearly an order of magnitude. That was a bit much even for the author of the stable page series! :) This patch: Creates a per-backing-device flag that tracks whether or not pages must be held immutable during writeout. Eventually it will be used to waive wait_for_page_writeback() if nothing requires stable pages. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-19Merge branch 'for-3.9-async' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull async changes from Tejun Heo: "These are followups for the earlier deadlock issue involving async ending up waiting for itself through block requesting module[1]. The following changes are made by these commits. - Instead of requesting default elevator on each request_queue init, block now requests it once early during boot. - Kmod triggers warning if invoked from an async worker. - Async synchronization implementation has been reimplemented. It's a lot simpler now." * 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: async: initialise list heads to fix crash async: replace list of active domains with global list of pending items async: keep pending tasks on async_domain and remove async_pending async: use ULLONG_MAX for infinity cookie value async: bring sanity to the use of words domain and running async, kmod: warn on synchronous request_module() from async workers block: don't request module during elevator init init, block: try to load default elevator module early during boot
2013-02-19Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Main changes: - scheduler side full-dynticks (user-space execution is undisturbed and receives no timer IRQs) preparation changes that convert the cputime accounting code to be full-dynticks ready, from Frederic Weisbecker. - Initial sched.h split-up changes, by Clark Williams - select_idle_sibling() performance improvement by Mike Galbraith: " 1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs " - sched_rr_get_interval() ABI fix/change. We think this detail is not used by apps (so it's not an ABI in practice), but lets keep it under observation. - misc RT scheduling cleanups, optimizations" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h> cputime: Remove irqsave from seqlock readers sched, powerpc: Fix sched.h split-up build failure cputime: Restore CPU_ACCOUNTING config defaults for PPC64 sched/rt: Move rt specific bits into new header file sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice sched: Move sched.h sysctl bits into separate header sched: Fix signedness bug in yield_to() sched: Fix select_idle_sibling() bouncing cow syndrome sched/rt: Further simplify pick_rt_task() sched/rt: Do not account zero delta_exec in update_curr_rt() cputime: Safely read cputime of full dynticks CPUs kvm: Prepare to add generic guest entry/exit callbacks cputime: Use accessors to read task cputime stats cputime: Allow dynamic switch between tick/virtual based cputime accounting cputime: Generic on-demand virtual cputime accounting cputime: Move default nsecs_to_cputime() to jiffies based cputime file cputime: Librarize per nsecs resolution cputime definitions cputime: Avoid multiplication overflow on utime scaling context_tracking: Export context state for generic vtime ... Fix up conflict in kernel/context_tracking.c due to comment additions.
2013-02-15block: account iowait time when waiting for completion of IO requestVladimir Davydov
Using wait_for_completion() for waiting for a IO request to be executed results in wrong iowait time accounting. For example, a system having the only task doing write() and fdatasync() on a block device can be reported being idle instead of iowaiting as it should because blkdev_issue_flush() calls wait_for_completion() which in turn calls schedule() that does not increment the iowait proc counter and thus does not turn on iowait time accounting. The patch makes block layer use wait_for_completion_io() instead of wait_for_completion() where appropriate to account iowait time correctly. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-07sched: Move sched.h sysctl bits into separate headerClark Williams
Move the sysctl-related bits from include/linux/sched.h into a new file: include/linux/sched/sysctl.h. Then update source files requiring access to those bits by including the new header file. Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-22block: don't request module during elevator initTejun Heo
Block layer allows selecting an elevator which is built as a module to be selected as system default via kernel param "elevator=". This is achieved by automatically invoking request_module() whenever a new block device is initialized and the elevator is not available. This led to an interesting deadlock problem involving async and module init. Block device probing running off an async job invokes request_module(). While the module is being loaded, it performs async_synchronize_full() which ends up waiting for the async job which is already waiting for request_module() to finish, leading to deadlock. Invoking request_module() from deep in block device init path is already nasty in itself. It seems best to avoid these situations from the beginning by moving on-demand module loading out of block init path. The previous patch made sure that the default elevator module is loaded early during boot if available. This patch removes on-demand loading of the default elevator from elevator init path. As the module would have been loaded during boot, userland-visible behavior difference should be minimal. For more details, please refer to the following thread. http://thread.gmane.org/gmane.linux.kernel/1420814 v2: The bool parameter was named @request_module which conflicted with request_module(). This built okay w/ CONFIG_MODULES because request_module() was defined as a macro. W/o CONFIG_MODULES, it causes build breakage. Rename the parameter to @try_loading. Reported by Fengguang. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alex Riesen <raa.lkml@gmail.com> Cc: Fengguang Wu <fengguang.wu@intel.com>
2013-01-18init, block: try to load default elevator module early during bootTejun Heo
This patch adds default module loading and uses it to load the default block elevator. During boot, it's called right after initramfs or initrd is made available and right before control is passed to userland. This ensures that as long as the modules are available in the usual places in initramfs, initrd or the root filesystem, the default modules are loaded as soon as possible. This will replace the on-demand elevator module loading from elevator init path. v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alex Riesen <raa.lkml@gmail.com> Cc: Fengguang We <fengguang.wu@intel.com>
2013-01-14block: add @req to bio_{front|back}_merge tracepointsTejun Heo
bio_{front|back}_merge tracepoints report a bio merging into an existing request but didn't specify which request the bio is being merged into. Add @req to it. This makes it impossible to share the event template with block_bio_queue - split it out. @req isn't used or exported to userland at this point and there is no userland visible behavior change. Later changes will make use of the extra parameter. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-14block: add missing block_bio_complete() tracepointTejun Heo
bio completion didn't kick block_bio_complete TP. Only dm was explicitly triggering the TP on IO completion. This makes block_bio_complete TP useless for tracers which want to know about bios, and all other bio based drivers skip generating blktrace completion events. This patch makes all bio completions via bio_endio() generate block_bio_complete TP. * Explicit trace_block_bio_complete() invocation removed from dm and the trace point is unexported. * @rq dropped from trace_block_bio_complete(). bios may fly around w/o queue associated. Verifying and accessing the assocaited queue belongs to TP probes. * blktrace now gets both request and bio completions. Make it ignore bio completions if request completion path is happening. This makes all bio based drivers generate blktrace completion events properly and makes the block_bio_complete TP actually useful. v2: With this change, block_bio_complete TP could be invoked on sg commands which have bio's with %NULL bi_bdev. Update TP assignment code to check whether bio->bi_bdev is %NULL before dereferencing. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Namhyung Kim <namhyung@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-11Merge branch 'blkcg-cfq-hierarchy' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.9/core Tejun writes: Hello, Jens. Please consider pulling from the following branch to receive cfq blkcg hierarchy support. The branch is based on top of v3.8-rc2. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git blkcg-cfq-hierarchy The patchset was reviewd in the following thread. http://thread.gmane.org/gmane.linux.kernel.cgroups/5571
2013-01-11block: Remove should_sort judgement when flush blk_plugJianpeng Ma
In commit 975927b942c932,it add blk_rq_pos to sort rq when flushing. Although this commit was used for the situation which blk_plug handled multi devices on the same time like md device. I think there must be some situations like this but only single device. So remove the should_sort judgement. Because the parameter should_sort is only for this purpose,it can delete should_sort from blk_plug. CC: Shaohua Li <shli@kernel.org> Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-11block,elevator: use new hashtable implementationSasha Levin
Switch elevator to use the new hashtable implementation. This reduces the amount of generic unrelated code in the elevator. This also removes the dymanic allocation of the hash table. The size of the table is constant so there's no point in paying the price of an extra dereference when accessing it. This patch depends on d9b482c ("hashtable: introduce a small and naive hashtable") which was merged in v3.6. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-09cfq-iosched: add hierarchical cfq_group statisticsTejun Heo
Unfortunately, at this point, there's no way to make the existing statistics hierarchical without creating nasty surprises for the existing users. Just create recursive counterpart of the existing stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09cfq-iosched: collect stats from dead cfqgsTejun Heo
To support hierarchical stats, it's necessary to remember stats from dead children. Add cfqg->dead_stats and make a dying cfqg transfer its stats to the parent's dead-stats. The transfer happens form ->pd_offline_fn() and it is possible that there are some residual IOs completing afterwards. Currently, we lose these stats. Given that cgroup removal isn't a very high frequency operation and the amount of residual IOs on offline are likely to be nil or small, this shouldn't be a big deal and the complexity needed to handle residual IOs - another callback and rather elaborate synchronization to reach and lock the matching q - doesn't seem justified. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()Tejun Heo
Separate out cfqg_stats_reset() which takes struct cfqg_stats * from cfq_pd_reset_stats() and move the latter to where other pd methods are defined. cfqg_stats_reset() will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lockTejun Heo
Instead of holding blkcg->lock while walking ->blkg_list and executing prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while executing prfill(). This makes prfill() implementations easier as stats are mostly protected by queue lock. This will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09block: RCU free request_queueTejun Heo
RCU free request_queue so that blkcg_gq->q can be dereferenced under RCU lock. This will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()Tejun Heo
Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge(). The former two collect the [rw]stats designated by the target policy data and offset from the pd's subtree. The latter two add one [rw]stat to another. Note that the recursive sum functions require the queue lock to be held on entry to make blkg online test reliable. This is necessary to properly handle stats of a dying blkg. These will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09blkcg: export __blkg_prfill_rwstat()Tejun Heo
Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat(). Export it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
2013-01-09blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/Tejun Heo
Rename blkg_rwstat_sum() to blkg_rwstat_total(). sum will be used for summing up stats from multiple blkgs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->onlineTejun Heo
Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(), which are invoked as the policy_data gets activated and deactivated while holding both blkcg and q locks. Also, add blkcg_gq->online bool, which is set and cleared as the blkcg_gq gets activated and deactivated. This flag also is toggled while holding both blkcg and q locks. These will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09blkcg: add blkg_policy_data->plidTejun Heo
Add pd->plid so that the policy a pd belongs to can be identified easily. This will be used to implement hierarchical blkg_[rw]stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09cfq-iosched: enable full blkcg hierarchy supportTejun Heo
With the previous two patches, all cfqg scheduling decisions are based on vfraction and ready for hierarchy support. The only thing which keeps the behavior flat is cfqg_flat_parent() which makes vfraction calculation consider all non-root cfqgs children of the root cfqg. Replace it with cfqg_parent() which returns the real parent. This enables full blkcg hierarchy support for cfq-iosched. For example, consider the following hierarchy. root / \ A:500 B:250 / \ AA:500 AB:1000 For simplicity, let's say all the leaf nodes have active tasks and are on service tree. For each leaf node, vfraction would be AA: (500 / 1500) * (500 / 750) =~ 0.2222 AB: (1000 / 1500) * (500 / 750) =~ 0.4444 B: (250 / 750) =~ 0.3333 and vdisktime will be distributed accordingly. For more detail, please refer to Documentation/block/cfq-iosched.txt. v2: cfq-iosched.txt updated to describe group scheduling as suggested by Vivek. v3: blkio-controller.txt updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09cfq-iosched: convert cfq_group_slice() to use cfqg->vfractionTejun Heo
cfq_group_slice() calculates slice by taking a fraction of cfq_target_latency according to the ratio of cfqg->weight against service_tree->total_weight. This currently works only because all cfqgs are treated to be at the same level. To prepare for proper hierarchy support, convert cfq_group_slice() to base the calculation on cfqg->vfraction. As cfqg->vfraction is always a fraction of 1 and represents the fraction allocated to the cfqg with hierarchy considered, the slice can be simply calculated by multiplying cfqg->vfraction to cfq_target_latency (with fixed point shift factored in). As vfraction calculation currently treats all non-root cfqgs as children of the root cfqg, this patch doesn't introduce noticeable behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09cfq-iosched: implement hierarchy-ready cfq_group charge scalingTejun Heo
Currently, cfqg charges are scaled directly according to cfqg->weight. Regardless of the number of active cfqgs or the amount of active weights, a given weight value always scales charge the same way. This works fine as long as all cfqgs are treated equally regardless of their positions in the hierarchy, which is what cfq currently implements. It can't work in hierarchical settings because the interpretation of a given weight value depends on where the weight is located in the hierarchy. This patch reimplements cfqg charge scaling so that it can be used to support hierarchy properly. The scheme is fairly simple and light-weight. * When a cfqg is added to the service tree, v(disktime)weight is calculated. It walks up the tree to root calculating the fraction it has in the hierarchy. At each level, the fraction can be calculated as cfqg->weight / parent->level_weight By compounding these, the global fraction of vdisktime the cfqg has claim to - vfraction - can be determined. * When the cfqg needs to be charged, the charge is scaled inversely proportionally to the vfraction. The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point representation as before; however, the smallest scaling factor is now 1 (ie. 1 << CFQ_SERVICE_SHIFT). This is different from before where 1 was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller scaling factor. While this shifts the global scale of vdisktime a bit, it doesn't change the relative relationships among cfqgs and the scheduling result isn't different. cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending new cfqg to the service tree. The specific value of CFQ_IDLE_DELAY didn't have any relevance to vdisktime before and is unlikely to cause any visible behavior difference now especially as the scale shift isn't that large. As the new scheme now makes proper distinction between cfqg->weight and ->leaf_weight, reverse the weight aliasing for root cfqgs. For root, both weights are now mapped to ->leaf_weight instead of the other way around. Because we're still using cfqg_flat_parent(), this patch shouldn't change the scheduling behavior in any noticeable way. v2: Beefed up comments on vfraction as requested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2013-01-09cfq-iosched: implement cfq_group->nr_active and ->children_weightTejun Heo
To prepare for blkcg hierarchy support, add cfqg->nr_active and ->children_weight. cfqg->nr_active counts the number of active cfqgs at the cfqg's level and ->children_weight is sum of weights of those cfqgs. The level covers itself (cfqg->leaf_weight) and immediate children. The two values are updated when a cfqg enters and leaves the group service tree. Unless the hierarchy is very deep, the added overhead should be negligible. Currently, the parent is determined using cfqg_flat_parent() which makes the root cfqg the parent of all other cfqgs. This is to make the transition to hierarchy-aware scheduling gradual. Scheduling logic will be converted to use cfqg->children_weight without actually changing the behavior. When everything is ready, blkcg_weight_parent() will be replaced with proper parent function. This patch doesn't introduce any behavior chagne. v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>