linux-toradex.git/fs/buffer.c, branch v3.14.57

mm: non-atomically mark page accessed during page cache allocation where possible

2015-01-30T01:40:52+00:00

commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage.  The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this care are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not.  Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations.  The size of the
file is 1/10th physical memory to avoid dirty page balancing.  In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO.  The sync results are expected to be
more stable.  The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison.  As async results were variable
do to writback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman 
Cc: Johannes Weiner 
Cc: Vlastimil Babka 
Cc: Jan Kara 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Theodore Ts'o 
Cc: "Paul E. McKenney" 
Cc: Oleg Nesterov 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Tested-by: Prabhakar Lad 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Mel Gorman 
Signed-off-by: Greg Kroah-Hartman

fs: buffer: do not use unnecessary atomic operations when discarding buffers

2015-01-30T01:40:52+00:00

commit e7470ee89f003634a88e7b5e5a7b65b3025987de upstream.

Discarding buffers uses a bunch of atomic operations when discarding
buffers because ......  I can't think of a reason.  Use a cmpxchg loop to
clear all the necessary flags.  In most (all?) cases this will be a single
atomic operations.

[akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
Signed-off-by: Mel Gorman 
Cc: Johannes Weiner 
Cc: Vlastimil Babka 
Cc: Jan Kara 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Theodore Ts'o 
Cc: "Paul E. McKenney" 
Cc: Oleg Nesterov 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Mel Gorman 
Signed-off-by: Greg Kroah-Hartman

vfs: fix data corruption when blocksize < pagesize for mmaped data

2014-11-14T16:59:47+00:00

commit 90a8020278c1598fafd071736a0846b38510309c upstream.

->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to return a page fault if there is not enough space
available, user exceeds quota or similar problem happens, rather than
silently discarding data later when writepage is called.

However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size which is generally undesirable. But later at
->writepage() time, we also need to store data at offset 4095 but we
don't have block allocated for it.

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

fs: make cont_expand_zero interruptible

2014-11-14T16:59:46+00:00

commit c2ca0fcd202863b14bd041a7fece2e789926c225 upstream.

This patch makes it possible to kill a process looping in
cont_expand_zero. A process may spend a lot of time in this function, so
it is desirable to be able to kill it.

It happened to me that I wanted to copy a piece data from the disk to a
file. By mistake, I used the "seek" parameter to dd instead of "skip". Due
to the "seek" parameter, dd attempted to extend the file and became stuck
doing so - the only possibility was to reset the machine or wait many
hours until the filesystem runs out of space and cont_expand_zero fails.
We need this patch to be able to terminate the process.

Signed-off-by: Mikulas Patocka 
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman

Fix nasty 32-bit overflow bug in buffer i/o code.

2014-10-05T21:52:22+00:00

commit f2d5a94436cc7cc0221b9a81bba2276a25187dd3 upstream.

On 32-bit architectures, the legacy buffer_head functions are not always
handling the sector number with the proper 64-bit types, and will thus
fail on 4TB+ disks.

Any code that uses __getblk() (and thus bread(), breadahead(),
sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit
block on a 32-bit arch (where "long" is 32-bit) causes an inifinite loop
in __getblk_slow() with an infinite stream of errors logged to dmesg
like this:

  __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648
  b_state=0x00000020, b_size=512
  device sda1 blocksize: 512

Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the
top 32-bits are missing (in this case the 0x1 at the top).

This is because grow_dev_page() is broken and has a 32-bit overflow due
to shifting the page index value (a pgoff_t - which is just 32 bits on
32-bit architectures) left-shifted as the block number.  But the top
bits to get lost as the pgoff_t is not type cast to sector_t / 64-bit
before the shift.

This patch fixes this issue by type casting "index" to sector_t before
doing the left shift.

Note this is not a theoretical bug but has been seen in the field on a
4TiB hard drive with logical sector size 512 bytes.

This patch has been verified to fix the infinite loop problem on 3.17-rc5
kernel using a 4TB disk image mounted using "-o loop".  Without this patch
doing a "find /nt" where /nt is an NTFS volume causes the inifinite loop
100% reproducibly whilst with the patch it works fine as expected.

Signed-off-by: Anton Altaparmakov 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq

2014-02-06T21:48:51+00:00

To use spin_{un}lock_irq is dangerous if caller disabled interrupt.
During aio buffer migration, we have a possibility to see the following
call stack.

aio_migratepage  [disable interrupt]
  migrate_page_copy
    clear_page_dirty_for_io
      set_page_dirty
        __set_page_dirty_buffers
          __set_page_dirty
            spin_lock_irq

This mean, current aio migration is a deadlockable.  spin_lock_irqsave
is a safer alternative and we should use it.

Signed-off-by: KOSAKI Motohiro 
Reported-by: David Rientjes rientjes@google.com>
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

block: Replace __this_cpu_ptr with raw_cpu_ptr

2013-12-04T02:19:41+00:00

__this_cpu_ptr is being phased out.

Cc: Jens Axboe 
Signed-off-by: Christoph Lameter 
Signed-off-by: Jens Axboe

block: Abstract out bvec iterator

2013-11-24T06:33:47+00:00

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet 
Cc: Jens Axboe 
Cc: Geert Uytterhoeven 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: "Ed L. Cashin" 
Cc: Nick Piggin 
Cc: Lars Ellenberg 
Cc: Jiri Kosina 
Cc: Matthew Wilcox 
Cc: Geoff Levand 
Cc: Yehuda Sadeh 
Cc: Sage Weil 
Cc: Alex Elder 
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris 
Cc: Philip Kelleher 
Cc: Rusty Russell 
Cc: "Michael S. Tsirkin" 
Cc: Konrad Rzeszutek Wilk 
Cc: Jeremy Fitzhardinge 
Cc: Neil Brown 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh 
Cc: Benny Halevy 
Cc: "James E.J. Bottomley" 
Cc: Greg Kroah-Hartman 
Cc: "Nicholas A. Bellinger" 
Cc: Alexander Viro 
Cc: Chris Mason 
Cc: "Theodore Ts'o" 
Cc: Andreas Dilger 
Cc: Jaegeuk Kim 
Cc: Steven Whitehouse 
Cc: Dave Kleikamp 
Cc: Joern Engel 
Cc: Prasad Joshi 
Cc: Trond Myklebust 
Cc: KONISHI Ryusuke 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Ben Myers 
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt 
Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
Cc: Len Brown 
Cc: Pavel Machek 
Cc: "Rafael J. Wysocki" 
Cc: Herton Ronaldo Krzesinski 
Cc: Ben Hutchings 
Cc: Andrew Morton 
Cc: Guo Chao 
Cc: Tejun Heo 
Cc: Asai Thambi S P 
Cc: Selvan Mani 
Cc: Sam Bradshaw 
Cc: Wei Yongjun 
Cc: "Roger Pau Monné" 
Cc: Jan Beulich 
Cc: Stefano Stabellini 
Cc: Ian Campbell 
Cc: Sebastian Ott 
Cc: Christian Borntraeger 
Cc: Minchan Kim 
Cc: Jiang Liu 
Cc: Nitin Gupta 
Cc: Jerome Marchand 
Cc: Joe Perches 
Cc: Peng Tao 
Cc: Andy Adamson 
Cc: fanchaoting 
Cc: Jie Liu 
Cc: Sunil Mushran 
Cc: "Martin K. Petersen" 
Cc: Namjae Jeon 
Cc: Pankaj Kumar 
Cc: Dan Magenheimer 
Cc: Mel Gorman 6

fs: buffer: move allocation failure loop into the allocator

2013-10-17T04:35:53+00:00

Buffer allocation has a very crude indefinite loop around waking the
flusher threads and performing global NOFS direct reclaim because it can
not handle allocation failures.

The most immediate problem with this is that the allocation may fail due
to a memory cgroup limit, where flushers + direct reclaim might not make
any progress towards resolving the situation at all.  Because unlike the
global case, a memory cgroup may not have any cache at all, only
anonymous pages but no swap.  This situation will lead to a reclaim
livelock with insane IO from waking the flushers and thrashing unrelated
filesystem cache in a tight loop.

Use __GFP_NOFAIL allocations for buffers for now.  This makes sure that
any looping happens in the page allocator, which knows how to
orchestrate kswapd, direct reclaim, and the flushers sensibly.  It also
allows memory cgroups to detect allocations that can't handle failure
and will allow them to ultimately bypass the limit if reclaim can not
make progress.

Reported-by: azurIt 
Signed-off-by: Johannes Weiner 
Cc: Michal Hocko 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: vmscan: take page buffers dirty and locked state into account

2013-07-03T23:07:29+00:00

Page reclaim keeps track of dirty and under writeback pages and uses it
to determine if wait_iff_congested() should stall or if kswapd should
begin writing back pages.  This fails to account for buffer pages that
can be under writeback but not PageWriteback which is the case for
filesystems like ext3 ordered mode.  Furthermore, PageDirty buffer pages
can have all the buffers clean and writepage does no IO so it should not
be accounted as congested.

This patch adds an address_space operation that filesystems may
optionally use to check if a page is really dirty or really under
writeback.  An implementation is provided for for buffer_heads is added
and used for block operations and ext3 in ordered mode.  By default the
page flags are obeyed.

Credit goes to Jan Kara for identifying that the page flags alone are
not sufficient for ext3 and sanity checking a number of ideas on how the
problem could be addressed.

Signed-off-by: Mel Gorman 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: KAMEZAWA Hiroyuki 
Cc: Jiri Slaby 
Cc: Valdis Kletnieks 
Cc: Zlatko Calusic 
Cc: dormando 
Cc: Trond Myklebust 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds