linux-toradex.git/block, branch v3.10.51

blkcg: don't call into policy draining if root_blkg is already gone

2014-07-31T19:53:49+00:00

commit 0b462c89e31f7eb6789713437eb551833ee16ff3 upstream.

While a queue is being destroyed, all the blkgs are destroyed and its
->root_blkg pointer is set to NULL.  If someone else starts to drain
while the queue is in this state, the following oops happens.

  NULL pointer dereference at 0000000000000028
  IP: [] blk_throtl_drain+0x84/0x230
  PGD e4a1067 PUD b773067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
  CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
  RIP: 0010:[]  [] blk_throtl_drain+0x84/0x230
  RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
  RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
  R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
  R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
  FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
  Stack:
   ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
   ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
   ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
  Call Trace:
   [] blkcg_drain_queue+0x1f/0x60
   [] __blk_drain_queue+0x71/0x180
   [] blk_queue_bypass_start+0x6e/0xb0
   [] blkcg_deactivate_policy+0x38/0x120
   [] blk_throtl_exit+0x34/0x50
   [] blkcg_exit_queue+0x35/0x40
   [] blk_release_queue+0x26/0xd0
   [] kobject_cleanup+0x38/0x70
   [] kobject_put+0x28/0x60
   [] blk_put_queue+0x15/0x20
   [] scsi_device_dev_release_usercontext+0x16b/0x1c0
   [] execute_in_process_context+0x89/0xa0
   [] scsi_device_dev_release+0x1c/0x20
   [] device_release+0x32/0xa0
   [] kobject_cleanup+0x38/0x70
   [] kobject_put+0x28/0x60
   [] put_device+0x17/0x20
   [] __scsi_remove_device+0xa9/0xe0
   [] scsi_remove_device+0x2b/0x40
   [] sdev_store_delete+0x27/0x30
   [] dev_attr_store+0x18/0x30
   [] sysfs_kf_write+0x3e/0x50
   [] kernfs_fop_write+0xe7/0x170
   [] vfs_write+0xaf/0x1d0
   [] SyS_write+0x4d/0xc0
   [] system_call_fastpath+0x16/0x1b

776687bce42b ("block, blk-mq: draining can't be skipped even if
bypass_depth was non-zero") made it easier to trigger this bug by
making blk_queue_bypass_start() drain even when it loses the first
bypass test to blk_cleanup_queue(); however, the bug has always been
there even before the commit as blk_queue_bypass_start() could race
against queue destruction, win the initial bypass test but perform the
actual draining after blk_cleanup_queue() already destroyed all blkgs.

Fix it by skipping the call into policy draining if all the blkgs are
already gone.
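
The guard can be sketched in user space as follows.  This is a minimal
model, not the kernel code: the structures are hypothetical stand-ins
that carry only the fields needed here, and policy_drain() merely
imitates a policy hook that dereferences the root blkg.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct blkcg_gq { int nr_queued; };
struct request_queue { struct blkcg_gq *root_blkg; };

int policy_drain_calls;

/* Models a policy drain hook; it dereferences the root blkg. */
void policy_drain(struct request_queue *q)
{
	policy_drain_calls++;
	q->root_blkg->nr_queued = 0;
}

/* Returns 1 if policy draining ran, 0 if it was skipped. */
int drain_queue(struct request_queue *q)
{
	/* All blkgs are already destroyed: nothing for policies to drain. */
	if (!q->root_blkg)
		return 0;

	policy_drain(q);
	return 1;
}
```

Without the NULL test, calling drain_queue() on a queue whose root_blkg
was already cleared would fault inside policy_drain(), which is the
shape of the oops above.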

Signed-off-by: Tejun Heo 
Reported-by: Shirish Pargaonkar 
Reported-by: Sasha Levin 
Reported-by: Jet Chen 
Tested-by: Shirish Pargaonkar 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: don't assume last put of shared tags is for the host

2014-07-31T19:53:48+00:00

commit d45b3279a5a2252cafcd665bbf2db8c9b31ef783 upstream.

There is no inherent reason why the last put of a tag structure must be
the one for the Scsi_Host, as device model objects can be held for
arbitrary periods.  Merge blk_free_tags and __blk_free_tags into a single
function that just releases a reference, and get rid of the BUG() when the
host reference wasn't the last.
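
A minimal sketch of the resulting rule, "whoever drops the last
reference frees the map".  The names (tag_map, tag_map_put) are
hypothetical and the counter is a plain int for brevity; kernel code
would use an atomic reference count.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared tag map with a simple reference count. */
struct tag_map {
	int refcnt;
	int *tags;
};

int tag_maps_freed;

/* Allocates a map holding one reference (the creator's). */
struct tag_map *tag_map_alloc(int depth)
{
	struct tag_map *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;
	m->tags = calloc((size_t)depth, sizeof(*m->tags));
	if (!m->tags) {
		free(m);
		return NULL;
	}
	m->refcnt = 1;
	return m;
}

void tag_map_get(struct tag_map *m)
{
	m->refcnt++;
}

/* Drops one reference.  Returns 1 if this put freed the map. */
int tag_map_put(struct tag_map *m)
{
	assert(m->refcnt > 0);
	if (--m->refcnt)
		return 0;

	/* Last put: free regardless of which holder it came from. */
	free(m->tags);
	free(m);
	tag_maps_freed++;
	return 1;
}
```

If the creator (the host, in the commit's terms) puts first and another
holder puts later, the second put frees the map and nothing asserts
about who came last.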

Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: provide compat ioctl for BLKZEROOUT

2014-07-31T19:53:48+00:00

commit 3b3a1814d1703027f9867d0f5cbbfaf6c7482474 upstream.

This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer
to two uint64_t values, so there is no need to translate it.
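
The reasoning can be checked with a small layout sketch.  The helper
and the contrasting needs_compat structure are hypothetical and exist
only to show why fixed-width fields make translation unnecessary; the
order (start, then length, both in bytes) is how BLKZEROOUT is normally
called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two uint64_t values occupy 16 bytes in 32-bit and 64-bit ABIs alike. */
_Static_assert(sizeof(uint64_t[2]) == 16, "range must be 16 bytes");

/*
 * Hypothetical counter-example: unsigned long is 4 bytes in a 32-bit
 * ABI and 8 bytes in LP64, so an argument like this would have to be
 * translated by a compat handler.
 */
struct needs_compat {
	unsigned long start;
	unsigned long len;
};

/* Fills the argument: range[0] is the start, range[1] the length. */
void zeroout_fill(uint64_t range[2], uint64_t start, uint64_t len)
{
	range[0] = start;
	range[1] = len;
}
```

Because the size and field offsets are identical for 32-bit and 64-bit
callers, a compat handler can hand the user pointer to the native code
path unchanged.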

Signed-off-by: Mikulas Patocka 
Acked-by: Martin K. Petersen 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blktrace: fix accounting of partially completed requests

2014-05-31T04:52:11+00:00

commit af5040da01ef980670b3741b3e10733ee3e33566 upstream.

trace_block_rq_complete does not take into account that a request can
be partially completed, so we can get the following incorrect output
from blkparse:

  C   R 232 + 240 [0]
  C   R 240 + 232 [0]
  C   R 248 + 224 [0]
  C   R 256 + 216 [0]

but it should be:

  C   R 232 + 8 [0]
  C   R 240 + 8 [0]
  C   R 248 + 8 [0]
  C   R 256 + 8 [0]

Also, the summary statistics of completed requests and the final
throughput will be incorrect.

This patch takes into account the real completion size of the request and
fixes the wrong completion accounting.
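
The difference between the two outputs can be reproduced with a small
model.  The types and function names below are hypothetical; the model
assumes a 240-sector request starting at sector 232 that completes 8
sectors at a time, which matches the numbers shown above.

```c
#include <assert.h>

/* Hypothetical request: start sector and sectors still outstanding. */
struct request {
	unsigned long long sector;
	unsigned int nr_sectors;
};

struct trace_event {
	unsigned long long sector;
	unsigned int nr_sectors;
};

/* Old accounting: reports everything that is still outstanding. */
struct trace_event trace_complete_buggy(const struct request *rq)
{
	return (struct trace_event){ rq->sector, rq->nr_sectors };
}

/* Fixed accounting: reports only what this completion finished. */
struct trace_event trace_complete_fixed(const struct request *rq,
					unsigned int done)
{
	return (struct trace_event){ rq->sector, done };
}

/* Advances the request past the sectors that were just completed. */
void complete_partial(struct request *rq, unsigned int done)
{
	rq->sector += done;
	rq->nr_sectors -= done;
}
```

Tracing before each complete_partial(&rq, 8) call yields 232 + 240,
240 + 232, 248 + 224, ... with the old accounting and 232 + 8,
240 + 8, 248 + 8, ... with the fixed one.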

Signed-off-by: Roman Pen 
CC: Steven Rostedt 
CC: Frederic Weisbecker 
CC: Ingo Molnar 
CC: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: add cond_resched() to potentially long running ioctl discard loop

2014-02-22T20:41:28+00:00

commit c8123f8c9cb517403b51aa41c3c46ff5e10b2c17 upstream.

When mkfs issues a full device discard and the device only
supports discards of a smallish size, we can loop in
blkdev_issue_discard() for a long time. If preempt isn't enabled,
this can turn into a soft lockup situation and the kernel will
start complaining.

Add an explicit cond_resched() at the end of the loop to avoid
that.
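
A user-space sketch of the loop shape.  cond_resched_stub() is a
hypothetical stand-in that only counts calls; in the kernel the real
cond_resched() gives the scheduler a chance to run other tasks.  The
function name and parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

unsigned long resched_points;

/* Stand-in for cond_resched(); here it only counts how often we yield. */
void cond_resched_stub(void)
{
	resched_points++;
}

/*
 * Splits a discard of nr_sects sectors into chunks of at most
 * max_chunk sectors.  Returns the number of chunks issued.
 */
unsigned long issue_discard(uint64_t nr_sects, uint64_t max_chunk)
{
	unsigned long chunks = 0;

	assert(max_chunk > 0);
	while (nr_sects) {
		uint64_t n = nr_sects < max_chunk ? nr_sects : max_chunk;

		/* A real implementation would submit one discard here. */
		nr_sects -= n;
		chunks++;

		/* Yield once per iteration so a long loop cannot hog the CPU. */
		cond_resched_stub();
	}
	return chunks;
}
```

A large device with a small maximum discard size needs many
iterations, so the loop offers one reschedule point per chunk.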

Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: __elv_next_request() shouldn't call into the elevator if bypassing

2014-02-22T20:41:28+00:00

commit 556ee818c06f37b2e583af0363e6b16d0e0270de upstream.

request_queue bypassing is used to suppress higher-level functions of a
request_queue so that they can be switched, reconfigured and shut
down.  A request_queue does the following while bypassing.

* bypasses elevator and io_cq association and queues requests directly
  to the FIFO dispatch queue.

* bypasses block cgroup request_list lookup and always uses the root
  request_list.

Once confirmed to be bypassing, specific elevator and block cgroup
policy implementations can assume that nothing is in flight for them
and perform various operations which would be dangerous otherwise.

Such confirmation is achieved by short-circuiting all new requests
directly to the dispatch queue and waiting for all the requests which
were issued before to finish.  Unfortunately, while the request
allocating and draining sides were properly handled, we forgot to
actually plug the request dispatch path.  Even after bypassing mode is
confirmed, if the attached driver tries to fetch a request and the
dispatch queue is empty, __elv_next_request() would invoke the current
elevator's elevator_dispatch_fn() callback.  As all in-flight requests
were drained, the elevator wouldn't contain any request but once
bypass is confirmed we don't even know whether the elevator is even
there.  It might be in the process of being switched and half torn
down.

Frank Mayhar reports that this actually happened while switching
elevators, leading to an oops.

Let's fix it by making __elv_next_request() avoid invoking the
elevator_dispatch_fn() callback if the queue is bypassing.  It already
avoids invoking the callback if the queue is dying.  As a dying queue
is guaranteed to be bypassing, we can simply replace blk_queue_dying()
check with blk_queue_bypass().
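
The dispatch-path check can be modelled as below.  The structures,
flag values and the counting_elevator() stub are hypothetical and
greatly simplified; the point is only that the elevator callback is
never reached while the bypass flag is set.

```c
#include <assert.h>
#include <stddef.h>

#define Q_BYPASS 0x1u	/* hypothetical flag; a dying queue also sets it */

struct request {
	int id;
	struct request *next;
};

struct request_queue {
	unsigned int flags;
	struct request *dispatch_head;
	/* Returns non-zero if it moved a request to the dispatch list. */
	int (*elevator_dispatch)(struct request_queue *q);
	int elevator_calls;
};

/* Stub elevator with nothing queued; it only records that it ran. */
int counting_elevator(struct request_queue *q)
{
	q->elevator_calls++;
	return 0;
}

struct request *next_request(struct request_queue *q)
{
	for (;;) {
		if (q->dispatch_head) {
			struct request *rq = q->dispatch_head;

			q->dispatch_head = rq->next;
			return rq;
		}
		/*
		 * While bypassing, the elevator may be half torn down,
		 * so do not call into it at all.
		 */
		if ((q->flags & Q_BYPASS) || !q->elevator_dispatch(q))
			return NULL;
	}
}
```

Requests already on the dispatch list are still returned while
bypassing; only the call into the elevator is suppressed.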

Reported-by: Frank Mayhar 
References: http://lkml.kernel.org/g/1390319905.20232.38.camel@bobble.lax.corp.google.com
Tested-by: Frank Mayhar 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

Updates of blkg_stat and blkg_rwstat may happen in bh context, while u64_stats_fetch_retry only does preempt_disable on 32-bit UP systems. This is not enough to avoid preemption by bh, so a reader may see a strange 64-bit value.

2013-12-12T06:36:27+00:00

commit 2c575026fae6e63771bd2a4c1d407214a8096a89 upstream.
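
The "strange 64 bit value" can be illustrated with a single-threaded
model in which a 64-bit counter is stored as two 32-bit halves.  All
names are hypothetical, and the interrupt callback merely imitates a
bh that updates the counter between the reader's two loads.  The
retry variant shows how a sequence counter detects the torn read; on a
configuration that has no such counter, the reader instead has to keep
the writer from running during the read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit counter split into halves, plus a sequence counter. */
struct stat64 {
	unsigned int seq;
	uint32_t lo, hi;
};

void stat64_write(struct stat64 *s, uint64_t v)
{
	s->seq++;			/* odd: update in progress */
	s->lo = (uint32_t)v;
	s->hi = (uint32_t)(v >> 32);
	s->seq++;			/* even: stable again */
}

/* Imitates a bh that carries the counter across the 32-bit boundary. */
void bump_to_4g(struct stat64 *s)
{
	stat64_write(s, 0x100000000ULL);
}

/* Reader with no protection: the writer runs between the two loads. */
uint64_t stat64_read_unprotected(struct stat64 *s,
				 void (*interrupt)(struct stat64 *))
{
	uint32_t lo = s->lo;
	uint32_t hi;

	if (interrupt)
		interrupt(s);
	hi = s->hi;
	return ((uint64_t)hi << 32) | lo;
}

/* Reader that retries when the sequence counter moved under it. */
uint64_t stat64_read_retry(struct stat64 *s,
			   void (*interrupt)(struct stat64 *), int *retries)
{
	unsigned int start;
	uint32_t lo, hi;
	int pass = 0;

	do {
		start = s->seq;
		lo = s->lo;
		if (interrupt && pass == 0)
			interrupt(s);	/* writer sneaks in on the first pass */
		hi = s->hi;
		pass++;
	} while (start != s->seq);

	*retries = pass - 1;
	return ((uint64_t)hi << 32) | lo;
}
```

Starting from 0xFFFFFFFF, the unprotected reader combines the old low
half with the new high half and returns 0x1FFFFFFFF, a value the
counter never held.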

Signed-off-by: Hong Zhiguo 
Acked-by: Tejun Heo 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

elevator: acquire q->sysfs_lock in elevator_change()

2013-12-08T15:29:27+00:00

commit 7c8a3679e3d8e9d92d58f282161760a0e247df97 upstream.

Add locking of q->sysfs_lock into elevator_change() (an exported function)
to ensure it is held to protect q->elevator from elevator_init(), even if
elevator_change() is called from non-sysfs paths.
The sysfs path (elv_iosched_store) uses __elevator_change(), the non-locking
version, as the lock is already taken by elv_iosched_store().
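
The locked/unlocked split can be sketched as follows.  The mutex is a
hypothetical mock that only tracks whether it is held, and
elevator_change_locked() stands in for the non-locking
__elevator_change() (a leading double underscore is reserved for the
implementation in user-space C, hence the different name).

```c
#include <assert.h>
#include <string.h>

/* Mock mutex: records ownership and asserts on misuse. */
struct mock_mutex { int held; };

void mock_lock(struct mock_mutex *m)
{
	assert(!m->held);
	m->held = 1;
}

void mock_unlock(struct mock_mutex *m)
{
	assert(m->held);
	m->held = 0;
}

struct request_queue {
	struct mock_mutex sysfs_lock;
	char elevator[16];
};

/* Non-locking version: the caller must already hold q->sysfs_lock. */
int elevator_change_locked(struct request_queue *q, const char *name)
{
	assert(q->sysfs_lock.held);
	if (strlen(name) >= sizeof(q->elevator))
		return -1;
	strcpy(q->elevator, name);
	return 0;
}

/* Exported version: takes the lock itself, safe from any path. */
int elevator_change(struct request_queue *q, const char *name)
{
	int ret;

	mock_lock(&q->sysfs_lock);
	ret = elevator_change_locked(q, name);
	mock_unlock(&q->sysfs_lock);
	return ret;
}

/* Models the sysfs store path, which already holds the lock. */
int iosched_store(struct request_queue *q, const char *name)
{
	int ret;

	mock_lock(&q->sysfs_lock);	/* taken by the attribute handler */
	ret = elevator_change_locked(q, name);
	mock_unlock(&q->sysfs_lock);
	return ret;
}
```

Calling elevator_change() from iosched_store() would trip the
assertion in mock_lock(), which is the self-deadlock the non-locking
variant avoids.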

Signed-off-by: Tomoki Sekiyama 
Signed-off-by: Jens Axboe 
Cc: Josh Boyer 
Signed-off-by: Greg Kroah-Hartman

elevator: Fix a race in elevator switching and md device initialization

2013-12-08T15:29:27+00:00

commit eb1c160b22655fd4ec44be732d6594fd1b1e44f4 upstream.

The soft lockup below happens at boot time on a system that uses dm
multipath and udev rules to switch the scheduler.

[  356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
[  356.127001] RIP: 0010:[]  [] lock_timer_base.isra.35+0x1d/0x50
...
[  356.127001] Call Trace:
[  356.127001]  [] try_to_del_timer_sync+0x20/0x70
[  356.127001]  [] ? kmem_cache_alloc_node_trace+0x20a/0x230
[  356.127001]  [] del_timer_sync+0x52/0x60
[  356.127001]  [] cfq_exit_queue+0x32/0xf0
[  356.127001]  [] elevator_exit+0x2f/0x50
[  356.127001]  [] elevator_change+0xf1/0x1c0
[  356.127001]  [] elv_iosched_store+0x20/0x50
[  356.127001]  [] queue_attr_store+0x59/0xb0
[  356.127001]  [] sysfs_write_file+0xc6/0x140
[  356.127001]  [] vfs_write+0xbd/0x1e0
[  356.127001]  [] SyS_write+0x49/0xa0
[  356.127001]  [] system_call_fastpath+0x16/0x1b

This is caused by a race between md device initialization by multipathd and
a shell script that switches the scheduler using sysfs.

 - multipathd:
   SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
   -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
    q->elevator = elevator_alloc(q, e); // not yet initialized

 - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
   elevator_switch (in the call trace above)
    struct elevator_queue *old = q->elevator;
    q->elevator = elevator_alloc(q, new_e);
    elevator_exit(old);                 // lockup! (*)

 - multipathd: (cont.)
    err = e->ops.elevator_init_fn(q);   // init fails; q->elevator is modified

(*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
while timer->base == NULL. In this case, as the timer will never be
initialized, this results in a lockup.

This patch makes blk_init_allocated_queue() acquire q->sysfs_lock around
elevator_init(), to provide mutual exclusion between initialization of
q->elevator and switching of the scheduler.

This should fix this bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=902012

Signed-off-by: Tomoki Sekiyama 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-core: Fix memory corruption if blkcg_init_queue fails

2013-12-04T18:56:46+00:00

commit fff4996b7db7955414ac74386efa5e07fd766b50 upstream.

If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
to clean up structures allocated by the backing dev.

------------[ cut here ]------------
WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
ODEBUG: free active (active state 0) object type: percpu_counter hint:           (null)
Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
CPU: 0 PID: 2739 Comm: lvchange Tainted: G        W    3.10.15-devel #14
Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
 0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
 ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
 ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
Call Trace:
 [] dump_stack+0x19/0x1b
 [] warn_slowpath_common+0x6b/0xa0
 [] warn_slowpath_fmt+0x47/0x50
 [] ? debug_check_no_obj_freed+0xcf/0x250
 [] debug_print_object+0x85/0xa0
 [] debug_check_no_obj_freed+0x203/0x250
 [] kmem_cache_free+0x20c/0x3a0
 [] blk_alloc_queue_node+0x2a9/0x2c0
 [] blk_alloc_queue+0xe/0x10
 [] dm_create+0x1a3/0x530 [dm_mod]
 [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
 [] dev_create+0x57/0x2b0 [dm_mod]
 [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
 [] ? list_version_get_info+0xe0/0xe0 [dm_mod]
 [] ctl_ioctl+0x268/0x500 [dm_mod]
 [] ? get_lock_stats+0x22/0x70
 [] dm_ctl_ioctl+0xe/0x20 [dm_mod]
 [] do_vfs_ioctl+0x2ed/0x520
 [] ? fget_light+0x377/0x4e0
 [] SyS_ioctl+0x4b/0x90
 [] system_call_fastpath+0x1a/0x1f
---[ end trace 4b5ff0d55673d986 ]---
------------[ cut here ]------------

This fix should be backported to stable kernels starting with 2.6.37. Note
that in the kernels prior to 3.5 the affected code is different, but the
bug is still there - bdi_init is called and bdi_destroy isn't.
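
The init/destroy pairing can be sketched with the usual goto-unwind
pattern.  The *_model functions and the fail_blkcg switch are
hypothetical; bdi_live counts backing-dev models that were initialised
but not yet destroyed, so a leak on the error path shows up as a
non-zero count.

```c
#include <assert.h>
#include <stdlib.h>

int bdi_live;	/* initialised-but-not-destroyed backing-dev models */

struct queue { int bdi_ready; };

int bdi_init_model(struct queue *q)
{
	q->bdi_ready = 1;
	bdi_live++;
	return 0;
}

void bdi_destroy_model(struct queue *q)
{
	q->bdi_ready = 0;
	bdi_live--;
}

/* Fails on demand so the error path can be exercised. */
int blkcg_init_model(int fail)
{
	return fail ? -1 : 0;
}

struct queue *queue_alloc(int fail_blkcg)
{
	struct queue *q = calloc(1, sizeof(*q));

	if (!q)
		return NULL;
	if (bdi_init_model(q))
		goto fail_q;
	if (blkcg_init_model(fail_blkcg))
		goto fail_bdi;
	return q;

fail_bdi:
	/* Undo bdi_init before freeing; the buggy path skipped this. */
	bdi_destroy_model(q);
fail_q:
	free(q);
	return NULL;
}
```

Each label undoes exactly the steps that succeeded before the failure,
in reverse order, so the object is never freed while a sub-object is
still active.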

Signed-off-by: Mikulas Patocka 
Acked-by: Tejun Heo 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman