linux-toradex.git/fs/aio.c, branch v3.14.43

ioctx_alloc(): fix vma (and file) leak on failure

2015-04-19T08:11:09+00:00

commit deeb8525f9bcea60f5e86521880c1161de7a5829 upstream.

If we fail past the aio_setup_ring(), we need to destroy the
mapping.  We don't need to care about anybody having found ctx,
or added requests to it, since the last failure exit is exactly
the failure to make ctx visible to lookups.

Reproducer (based on one by Joe Mario ):

void count(char *p)
{
	char s[80];
	printf("%s: ", p);
	fflush(stdout);
	sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid());
	system(s);
}

int main()
{
	io_context_t *ctx;
	int created, limit, i, destroyed;
	FILE *f;

	count("before");
	if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL)
		perror("opening aio-max-nr");
	else if (fscanf(f, "%d", &limit) != 1)
		fprintf(stderr, "can't parse aio-max-nr\n");
	else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL)
		perror("allocating aio_context_t array");
	else {
		for (i = 0, created = 0; i < limit; i++) {
			if (io_setup(1000, ctx + created) == 0)
				created++;
		}
		for (i = 0, destroyed = 0; i < created; i++)
			if (io_destroy(ctx[i]) == 0)
				destroyed++;
		printf("created %d, failed %d, destroyed %d\n",
			created, limit - created, destroyed);
		count("after");
	}
}

Found-by: Joe Mario 
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman

aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer

2014-12-06T23:55:37+00:00

commit 835f252c6debd204fcd607c79975089b1ecd3472 upstream.

https://bugzilla.kernel.org/show_bug.cgi?id=86831

Markus reported that when shutting down mysqld (with AIO support,
on a ext3 formatted Harddrive) leads to a negative number of dirty pages
(underrun to the counter). The negative number results in a drastic reduction
of the write performance because the page cache is not used, because the kernel
thinks it is still 2 ^ 32 dirty pages open.

Add a warn trace in __dec_zone_state will catch this easily:

static inline void __dec_zone_state(struct zone *zone, enum
	zone_stat_item item)
{
     atomic_long_dec(&zone->vm_stat[item]);
+    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
	atomic_long_read(&zone->vm_stat[item]) < 0);
     atomic_long_dec(&vm_stat[item]);
}

[   21.341632] ------------[ cut here ]------------
[   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
cancel_dirty_page+0x164/0x224()
[   21.355296] Modules linked in: wutbox_cp sata_mv
[   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[   21.366793] Workqueue: events free_ioctx
[   21.370760] [] (unwind_backtrace) from []
(show_stack+0x20/0x24)
[   21.378562] [] (show_stack) from []
(dump_stack+0x24/0x28)
[   21.385840] [] (dump_stack) from []
(warn_slowpath_common+0x84/0x9c)
[   21.393976] [] (warn_slowpath_common) from []
(warn_slowpath_null+0x2c/0x34)
[   21.402800] [] (warn_slowpath_null) from []
(cancel_dirty_page+0x164/0x224)
[   21.411524] [] (cancel_dirty_page) from []
(truncate_inode_page+0x8c/0x158)
[   21.420272] [] (truncate_inode_page) from []
(truncate_inode_pages_range+0x11c/0x53c)
[   21.429890] [] (truncate_inode_pages_range) from
[] (truncate_pagecache+0x88/0xac)
[   21.439252] [] (truncate_pagecache) from []
(truncate_setsize+0x5c/0x74)
[   21.447731] [] (truncate_setsize) from []
(put_aio_ring_file.isra.14+0x34/0x90)
[   21.456826] [] (put_aio_ring_file.isra.14) from
[] (aio_free_ring+0x20/0xcc)
[   21.465660] [] (aio_free_ring) from []
(free_ioctx+0x24/0x44)
[   21.473190] [] (free_ioctx) from []
(process_one_work+0x134/0x47c)
[   21.481132] [] (process_one_work) from []
(worker_thread+0x130/0x414)
[   21.489350] [] (worker_thread) from []
(kthread+0xd4/0xec)
[   21.496621] [] (kthread) from []
(ret_from_fork+0x14/0x20)
[   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty
(bypasses the VFS dirty pages increment) when init, and aio fs uses
*default_backing_dev_info* as the backing dev, which does not disable
the dirty pages accounting capability.
So truncating aio ring file will contribute to accounting dirty pages (VFS
dirty pages decrement), then error occurs.

The original goal is keeping these pages in memory (can not be reclaimed
or swapped) in life-time via marking it dirty. But thinking more, we have
already pinned pages via elevating the page's refcount, which can already
achieve the goal, so the SetPageDirty seems unnecessary.

In order to fix the issue, using the __set_page_dirty_no_writeback instead
of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually
set the dirty flags, don't disable set_page_dirty(), rely on default behaviour).

With the above change, the dirty pages accounting can work well. But as we
known, aio fs is an anonymous one, which should never cause any real write-back,
we can ignore the dirty pages (write back) accounting by disabling the dirty
pages (write back) accounting capability. So we introduce an aio private
backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to
replace the default one.

Reported-by: Markus Königshaus 
Signed-off-by: Gu Zheng 
Acked-by: Andrew Morton 
Signed-off-by: Benjamin LaHaise 
Signed-off-by: Greg Kroah-Hartman

aio: block exit_aio() until all context requests are completed

2014-10-05T21:52:24+00:00

commit 6098b45b32e6baeacc04790773ced9340601d511 upstream.

It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in current implemention, so fix
it in the same way as we did in io_destroy.

Signed-off-by: Gu Zheng 
Signed-off-by: Benjamin LaHaise 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

aio: add missing smp_rmb() in read_events_ring

2014-10-05T21:52:10+00:00

commit 2ff396be602f10b5eab8e73b24f20348fa2de159 upstream.

We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events.  After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves.  This small patch fixes the
problem in testing.

Thanks to Zach for helping to look into this, and suggesting the fix.

Signed-off-by: Jeff Moyer 
Signed-off-by: Benjamin LaHaise 
Signed-off-by: Greg Kroah-Hartman

aio: fix reqs_available handling

2014-10-05T21:52:10+00:00

commit d856f32a86b2b015ab180ab7a55e455ed8d3ccc5 upstream.

As reported by Dan Aloni, commit f8567a3845ac ("aio: fix aio request
leak when events are reaped by userspace") introduces a regression when
user code attempts to perform io_submit() with more events than are
available in the ring buffer.  Reverting that commit would reintroduce a
regression when user space event reaping is used.

Fixing this bug is a bit more involved than the previous attempts to fix
this regression.  Since we do not have a single point at which we can
count events as being reaped by user space and io_getevents(), we have
to track event completion by looking at the number of events left in the
event ring.  So long as there are as many events in the ring buffer as
there have been completion events generate, we cannot call
put_reqs_available().  The code to check for this is now placed in
refill_reqs_available().

A test program from Dan and modified by me for verifying this bug is available
at http://www.kvack.org/~bcrl/20140824-aio_bug.c .

Reported-by: Dan Aloni 
Signed-off-by: Benjamin LaHaise 
Acked-by: Dan Aloni 
Cc: Kent Overstreet 
Cc: Mateusz Guzik 
Cc: Petr Matousek 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

aio: protect reqs_available updates from changes in interrupt handlers

2014-07-28T15:06:03+00:00

commit 263782c1c95bbddbb022dc092fd89a36bb8d5577 upstream.

As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to
have put_reqs_available() called from irq context.  While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU.  This
lead to aio_complete() corrupting the available io requests count when run
under a heavy O_DIRECT workloads as reported by Robert Elliott.  Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.

Many thanks to Robert and folks for testing and tracking this down.

Reported-by: Robert Elliot 
Tested-by: Robert Elliot 
Signed-off-by: Benjamin LaHaise 
Cc: Jens Axboe , Christoph Hellwig 
Signed-off-by: Greg Kroah-Hartman

aio: block io_destroy() until all context requests are completed

2014-07-09T18:18:28+00:00

commit e02ba72aabfade4c9cd6e3263e9b57bf890ad25c upstream.

deletes aio context and all resources related to. It makes sense that
no IO operations connected to the context should be running after the context
is destroyed. As we removed io_context we have no chance to
get requests status or call io_getevents().

man page for io_destroy says that this function may block until
all context's requests are completed. Before kernel 3.11 io_destroy()
blocked indeed, but since aio refactoring in 3.11 it is not true anymore.

Here is a pseudo-code that shows a testcase for a race condition discovered
in 3.11:

  initialize io_context
  io_submit(read to buffer)
  io_destroy()

  // context is destroyed so we can free the resources
  free(buffers);

  // if the buffer is allocated by some other user he'll be surprised
  // to learn that the buffer still filled by an outstanding operation
  // from the destroyed io_context

The fix is straight-forward - add a completion struct and wait on it
in io_destroy, complete() should be called when number of in-fligh requests
reaches zero.

If two or more io_destroy() called for the same context simultaneously then
only the first one waits for IO completion, other calls behaviour is undefined.

Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and
  do not see the race condition anymore.

Signed-off-by: Anatol Pomozov 
Signed-off-by: Benjamin LaHaise 
Cc: Jan Kara 
Signed-off-by: Greg Kroah-Hartman

aio: fix kernel memory disclosure in io_getevents() introduced in v3.10

2014-07-01T03:12:01+00:00

commit edfbbf388f293d70bf4b7c0bc38774d05e6f711a upstream.

A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380bed817aa25f8830ad23e1a0480fef797.  The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206.  Thanks to Mateusz and
Petr for disclosing this issue.

This patch applies to v3.12+.  A separate backport is needed for 3.10/3.11.

Signed-off-by: Benjamin LaHaise 
Cc: Mateusz Guzik 
Cc: Petr Matousek 
Cc: Kent Overstreet 
Cc: Jeff Moyer 
Signed-off-by: Greg Kroah-Hartman

aio: fix aio request leak when events are reaped by userspace

2014-07-01T03:12:01+00:00

commit f8567a3845ac05bb28f3c1b478ef752762bd39ef upstream.

The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping.  Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+.  A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.

Signed-off-by: Benjamin LaHaise 
Cc: Kent Overstreet 
Cc: Mateusz Guzik 
Cc: Petr Matousek 
Signed-off-by: Greg Kroah-Hartman

aio: fix potential leak in aio_run_iocb().

2014-06-07T17:28:11+00:00

commit 754320d6e166d3a12cb4810a452bde00afbd4e9a upstream.

iovec should be reclaimed whenever caller of rw_copy_check_uvector() returns,
but it doesn't hold when failure happens right after aio_setup_vectored_rw().

Fix that in a such way to avoid hairy goto.

Signed-off-by: Leon Yu 
Signed-off-by: Benjamin LaHaise 
Signed-off-by: Greg Kroah-Hartman