linux-toradex.git/fs/block_dev.c, branch v4.4.93

fs/block_dev: always invalidate cleancache in invalidate_bdev()

2017-05-20T12:27:01+00:00

commit a5f6a6a9c72eac38a7fadd1a038532bc8516337c upstream.

invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
which doen't make any sense.

Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
regardless of mapping->nrpages value.

Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin 
Reviewed-by: Jan Kara 
Acked-by: Konrad Rzeszutek Wilk 
Cc: Alexander Viro 
Cc: Ross Zwisler 
Cc: Jens Axboe 
Cc: Johannes Weiner 
Cc: Alexey Kuznetsov 
Cc: Christoph Hellwig 
Cc: Nikolay Borisov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

block: get rid of blk_integrity_revalidate()

2017-05-14T11:32:59+00:00

commit 19b7ccf8651df09d274671b53039c672a52ad84d upstream.

Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:

    if (bi->profile)
            disk->queue->backing_dev_info->capabilities |=
                    BDI_CAP_STABLE_WRITES;
    else
            disk->queue->backing_dev_info->capabilities &=
                    ~BDI_CAP_STABLE_WRITES;

It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time.  This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.

Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there.  This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.

Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen" 
Cc: Christoph Hellwig 
Cc: Mike Snitzer 
Tested-by: Dan Williams 
Signed-off-by: Ilya Dryomov 
[idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue]
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: protect iterate_bdevs() against concurrent close

2017-01-09T07:07:47+00:00

commit af309226db916e2c6e08d3eba3fa5c34225200c4 upstream.

If a block device is closed while iterate_bdevs() is handling it, the
following NULL pointer dereference occurs because bdev->b_disk is NULL
in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
turn called by the mapping_cap_writeback_dirty() call in
__filemap_fdatawrite_range()):

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000508
 IP: [] blk_get_backing_dev_info+0x10/0x20
 PGD 9e62067 PUD 9ee8067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000
 RIP: 0010:[]  [] blk_get_backing_dev_info+0x10/0x20
 RSP: 0018:ffff880009f5fe68  EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940
 RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0
 RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860
 R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38
 FS:  00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0
 Stack:
  ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff
  0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001
  ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f
 Call Trace:
  [] __filemap_fdatawrite_range+0x85/0x90
  [] filemap_fdatawrite+0x1f/0x30
  [] fdatawrite_one_bdev+0x16/0x20
  [] iterate_bdevs+0xf2/0x130
  [] sys_sync+0x63/0x90
  [] entry_SYSCALL_64_fastpath+0x12/0x76
 Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
 RIP  [] blk_get_backing_dev_info+0x10/0x20
  RSP 
 CR2: 0000000000000508
 ---[ end trace 2487336ceb3de62d ]---

The crash is easily reproducible by running the following command, if an
msleep(100) is inserted before the call to func() in iterate_devs():

 while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done

Fix it by holding the bd_mutex across the func() call and only calling
func() if the bdev is opened.

Fixes: 5c0d6b60a0ba ("vfs: Create function for iterating over block devices")
Reported-and-tested-by: Wei Fang 
Signed-off-by: Rabin Vincent 
Signed-off-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block_dev: don't test bdev->bd_contains when it is not stable

2017-01-06T10:16:11+00:00

commit bcc7f5b4bee8e327689a4d994022765855c807ff upstream.

bdev->bd_contains is not stable before calling __blkdev_get().
When __blkdev_get() is called on a parition with ->bd_openers == 0
it sets
  bdev->bd_contains = bdev;
which is not correct for a partition.
After a call to __blkdev_get() succeeds, ->bd_openers will be > 0
and then ->bd_contains is stable.

When FMODE_EXCL is used, blkdev_get() calls
   bd_start_claiming() ->  bd_prepare_to_claim() -> bd_may_claim()

This call happens before __blkdev_get() is called, so ->bd_contains
is not stable.  So bd_may_claim() cannot safely use ->bd_contains.
It currently tries to use it, and this can lead to a BUG_ON().

This happens when a whole device is already open with a bd_holder (in
use by dm in my particular example) and two threads race to open a
partition of that device for the first time, one opening with O_EXCL and
one without.

The thread that doesn't use O_EXCL gets through blkdev_get() to
__blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev;

Immediately thereafter the other thread, using FMODE_EXCL, calls
bd_start_claiming() from blkdev_get().  This should fail because the
whole device has a holder, but because bdev->bd_contains == bdev
bd_may_claim() incorrectly reports success.
This thread continues and blocks on bd_mutex.

The first thread then sets bdev->bd_contains correctly and drops the mutex.
The thread using FMODE_EXCL then continues and when it calls bd_may_claim()
again in:
			BUG_ON(!bd_may_claim(bdev, whole, holder));
The BUG_ON fires.

Fix this by removing the dependency on ->bd_contains in
bd_may_claim().  As bd_may_claim() has direct access to the whole
device, it can simply test if the target bdev is the whole device.

Fixes: 6b4517a7913a ("block: implement bd_claiming and claiming block")
Signed-off-by: NeilBrown 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: detach bdev inode from its wb in __blkdev_put()

2015-12-04T18:02:17+00:00

Since 52ebea749aae ("writeback: make backing_dev_info host
cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
gets attached to a wb (struct bdi_writeback).  Detaching happens on
evict, in inode_detach_wb() called from __destroy_inode(), and involves
updating wb.

However, detaching an internal bdev inode from its wb in
__destroy_inode() is too late.  Its bdi and by extension root wb are
embedded into struct request_queue, which has different lifetime rules
and can be freed long before the final bdput() is called (can be from
__fput() of a corresponding /dev inode, through dput() - evict() -
bd_forget().  bdevs hold onto the underlying disk/queue pair only while
opened; as soon as bdev is closed all bets are off.  In fact,
disk/queue can be gone before __blkdev_put() even returns:

1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
1500 {
...
1518         if (bdev->bd_contains == bdev) {
1519                 if (disk->fops->release)
1520                         disk->fops->release(disk, mode);

[ Driver puts its references to disk/queue ]

1521         }
1522         if (!bdev->bd_openers) {
1523                 struct module *owner = disk->fops->owner;
1524
1525                 disk_put_part(bdev->bd_part);
1526                 bdev->bd_part = NULL;
1527                 bdev->bd_disk = NULL;
1528                 if (bdev != bdev->bd_contains)
1529                         victim = bdev->bd_contains;
1530                 bdev->bd_contains = NULL;
1531
1532                 put_disk(disk);

[ We put ours, the queue is gone
  The last bdput() would result in a write to invalid memory ]

1533                 module_put(owner);
...
1539 }

Since bdev inodes are special anyway, detach them in __blkdev_put()
after clearing inode's dirty bits, turning the problematic
inode_detach_wb() in __destroy_inode() into a noop.

add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
gendisk hold a reference to its queue"), so the old ->release comment
is removed in favor of the new inode_detach_wb() comment.

Cc: stable@vger.kernel.org # 4.2+, needs backporting
Signed-off-by: Ilya Dryomov 
Acked-by: Tejun Heo 
Tested-by: Raghavendra K T 
Signed-off-by: Jens Axboe

block: protect rw_page against device teardown

2015-11-19T21:47:10+00:00

Fix use after free crashes like the following:

 general protection fault: 0000 [#1] SMP
 Call Trace:
  [] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
  [] pmem_rw_page+0x42/0x80 [nd_pmem]
  [] bdev_read_page+0x50/0x60
  [] do_mpage_readpage+0x510/0x770
  [] ? I_BDEV+0x20/0x20
  [] ? lru_cache_add+0x1c/0x50
  [] mpage_readpages+0x107/0x170
  [] ? I_BDEV+0x20/0x20
  [] ? I_BDEV+0x20/0x20
  [] blkdev_readpages+0x1d/0x20
  [] __do_page_cache_readahead+0x28f/0x310
  [] ? __do_page_cache_readahead+0x169/0x310
  [] ? pagecache_get_page+0x2d/0x1d0
  [] filemap_fault+0x396/0x530
  [] __do_fault+0x4e/0xf0
  [] handle_mm_fault+0x11bd/0x1b50

Cc: 
Cc: Jens Axboe 
Cc: Alexander Viro 
Reported-by: kbuild test robot 
Acked-by: Matthew Wilcox 
[willy: symmetry fixups]
Signed-off-by: Dan Williams

fs/block_dev.c: Remove WARN_ON() when inode writeback fails

2015-11-11T16:36:57+00:00

If a block device is hot removed and later last reference to device
is put, we try to writeback the dirty inode. But device is gone and
that writeback fails.

Currently we do a WARN_ON() which does not seem to be the right thing.
Convert it to a ratelimited kernel warning.

Reported-by: Andi Kleen 
Signed-off-by: Vivek Goyal 
Acked-by: Tejun Heo 
[jmoyer@redhat.com: get rid of unnecessary name initialization, 80 cols]
Signed-off-by: Jeff Moyer 
Signed-off-by: Jens Axboe

block: Inline blk_integrity in struct gendisk

2015-10-21T20:42:42+00:00

Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.

This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.

Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:

 - Moving the blk_integrity definition to genhd.h.

 - Inlining blk_integrity in struct gendisk.

 - Removing the dynamic allocation code.

 - Adding helper functions which allow gendisk to set up and tear down
   the integrity sysfs dir when a disk is added/deleted.

 - Adding a blk_integrity_revalidate() callback for updating the stable
   pages bdi setting.

 - The calls that depend on whether a device has an integrity profile or
   not now key off of the bi->profile pointer.

 - Simplifying the integrity support routines in DM (Mike Snitzer).

Signed-off-by: Martin K. Petersen 
Reported-by: Christoph Hellwig 
Reviewed-by: Sagi Grimberg 
Signed-off-by: Mike Snitzer 
Cc: Dan Williams 
Signed-off-by: Dan Williams 
Signed-off-by: Jens Axboe

blockdev: don't set S_DAX for misaligned partitions

2015-09-16T00:08:05+00:00

The dax code doesn't currently support misaligned partitions,
so disable O_DIRECT via dax until such time as that support
materializes.

Cc: 
Suggested-by: Boaz Harrosh 
Signed-off-by: Jeff Moyer 
Signed-off-by: Dan Williams

Merge branch 'akpm' (patches from Andrew)

2015-09-09T00:52:23+00:00

Merge second patch-bomb from Andrew Morton:
 "Almost all of the rest of MM.  There was an unusually large amount of
  MM material this time"

* emailed patches from Andrew Morton : (141 commits)
  zpool: remove no-op module init/exit
  mm: zbud: constify the zbud_ops
  mm: zpool: constify the zpool_ops
  mm: swap: zswap: maybe_preload & refactoring
  zram: unify error reporting
  zsmalloc: remove null check from destroy_handle_cache()
  zsmalloc: do not take class lock in zs_shrinker_count()
  zsmalloc: use class->pages_per_zspage
  zsmalloc: consider ZS_ALMOST_FULL as migrate source
  zsmalloc: partial page ordering within a fullness_list
  zsmalloc: use shrinker to trigger auto-compaction
  zsmalloc: account the number of compacted pages
  zsmalloc/zram: introduce zs_pool_stats api
  zsmalloc: cosmetic compaction code adjustments
  zsmalloc: introduce zs_can_compact() function
  zsmalloc: always keep per-class stats
  zsmalloc: drop unused variable `nr_to_migrate'
  mm/memblock.c: fix comment in __next_mem_range()
  mm/page_alloc.c: fix type information of memoryless node
  memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
  ...