<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/mm/filemap.c, branch v3.15-rc7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>mm/filemap.c: avoid always dirtying mapping-&gt;flags on O_DIRECT</title>
<updated>2014-05-23T16:37:29+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@fb.com</email>
</author>
<published>2014-05-22T18:54:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d'/>
<id>7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d</id>
<content type='text'>
In some testing I ran today (some fio jobs that spread over two nodes),
we end up spending 40% of the time in filemap_check_errors().  That
smells fishy.  Looking further, this is basically what happens:

blkdev_aio_read()
    generic_file_aio_read()
        filemap_write_and_wait_range()
            if (!mapping-&gt;nr_pages)
                filemap_check_errors()

and filemap_check_errors() always attempts two test_and_clear_bit() on
the mapping flags, thus dirtying it for every single invocation.  The
patch below tests each of these bits before clearing them, avoiding this
issue.  In my test case (4-socket box), performance went from 1.7M IOPS
to 4.0M IOPS.

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Acked-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In some testing I ran today (some fio jobs that spread over two nodes),
we end up spending 40% of the time in filemap_check_errors().  That
smells fishy.  Looking further, this is basically what happens:

blkdev_aio_read()
    generic_file_aio_read()
        filemap_write_and_wait_range()
            if (!mapping-&gt;nr_pages)
                filemap_check_errors()

and filemap_check_errors() always attempts two test_and_clear_bit() on
the mapping flags, thus dirtying it for every single invocation.  The
patch below tests each of these bits before clearing them, avoiding this
issue.  In my test case (4-socket box), performance went from 1.7M IOPS
to 4.0M IOPS.

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Acked-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: filemap: update find_get_pages_tag() to deal with shadow entries</title>
<updated>2014-05-06T20:04:59+00:00</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2014-05-06T19:50:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=139b6a6fb1539e04b01663d61baff3088c63dbb5'/>
<id>139b6a6fb1539e04b01663d61baff3088c63dbb5</id>
<content type='text'>
Dave Jones reports the following crash when find_get_pages_tag() runs
into an exceptional entry:

  kernel BUG at mm/filemap.c:1347!
  RIP: find_get_pages_tag+0x1cb/0x220
  Call Trace:
    find_get_pages_tag+0x36/0x220
    pagevec_lookup_tag+0x21/0x30
    filemap_fdatawait_range+0xbe/0x1e0
    filemap_fdatawait+0x27/0x30
    sync_inodes_sb+0x204/0x2a0
    sync_inodes_one_sb+0x19/0x20
    iterate_supers+0xb2/0x110
    sys_sync+0x44/0xb0
    ia32_do_call+0x13/0x13

  1343                         /*
  1344                          * This function is never used on a shmem/tmpfs
  1345                          * mapping, so a swap entry won't be found here.
  1346                          */
  1347                         BUG();

After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in
page cache radix trees") this comment and BUG() are out of date because
exceptional entries can now appear in all mappings - as shadows of
recently evicted pages.

However, as Hugh Dickins notes,

  "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
   any other PAGECACHE_TAG_*) to appear on an exceptional entry.

   I expect it comes down to an occasional race in RCU lookup of the
   radix_tree: lacking absolute synchronization, we might sometimes
   catch an exceptional entry, with the tag which really belongs with
   the unexceptional entry which was there an instant before."

And indeed, not only is the tree walk lockless, the tags are also read
in chunks, one radix tree node at a time.  There is plenty of time for
page reclaim to swoop in and replace a page that was already looked up
as tagged with a shadow entry.

Remove the BUG() and update the comment.  While reviewing all other
lookup sites for whether they properly deal with shadow entries of
evicted pages, update all the comments and fix memcg file charge moving
to not miss shmem/tmpfs swapcache pages.

Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Dave Jones reports the following crash when find_get_pages_tag() runs
into an exceptional entry:

  kernel BUG at mm/filemap.c:1347!
  RIP: find_get_pages_tag+0x1cb/0x220
  Call Trace:
    find_get_pages_tag+0x36/0x220
    pagevec_lookup_tag+0x21/0x30
    filemap_fdatawait_range+0xbe/0x1e0
    filemap_fdatawait+0x27/0x30
    sync_inodes_sb+0x204/0x2a0
    sync_inodes_one_sb+0x19/0x20
    iterate_supers+0xb2/0x110
    sys_sync+0x44/0xb0
    ia32_do_call+0x13/0x13

  1343                         /*
  1344                          * This function is never used on a shmem/tmpfs
  1345                          * mapping, so a swap entry won't be found here.
  1346                          */
  1347                         BUG();

After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in
page cache radix trees") this comment and BUG() are out of date because
exceptional entries can now appear in all mappings - as shadows of
recently evicted pages.

However, as Hugh Dickins notes,

  "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
   any other PAGECACHE_TAG_*) to appear on an exceptional entry.

   I expect it comes down to an occasional race in RCU lookup of the
   radix_tree: lacking absolute synchronization, we might sometimes
   catch an exceptional entry, with the tag which really belongs with
   the unexceptional entry which was there an instant before."

And indeed, not only is the tree walk lockless, the tags are also read
in chunks, one radix tree node at a time.  There is plenty of time for
page reclaim to swoop in and replace a page that was already looked up
as tagged with a shadow entry.

Remove the BUG() and update the comment.  While reviewing all other
lookup sites for whether they properly deal with shadow entries of
evicted pages, update all the comments and fix memcg file charge moving
to not miss shmem/tmpfs swapcache pages.

Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reported-by: Dave Jones &lt;davej@redhat.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: fix new kernel-doc warning in filemap.c</title>
<updated>2014-04-18T23:40:09+00:00</updated>
<author>
<name>Randy Dunlap</name>
<email>rdunlap@infradead.org</email>
</author>
<published>2014-04-18T22:07:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b59b8cbca6291a23928f1a103defba03d1cdab2c'/>
<id>b59b8cbca6291a23928f1a103defba03d1cdab2c</id>
<content type='text'>
Fix new kernel-doc warning in mm/filemap.c:

  Warning(mm/filemap.c:2600): Excess function parameter 'ppos' description in '__generic_file_aio_write'

Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fix new kernel-doc warning in mm/filemap.c:

  Warning(mm/filemap.c:2600): Excess function parameter 'ppos' description in '__generic_file_aio_write'

Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2014-04-12T21:49:50+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2014-04-12T21:49:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5166701b368caea89d57b14bf41cf39e819dad51'/>
<id>5166701b368caea89d57b14bf41cf39e819dad51</id>
<content type='text'>
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe-&gt;buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe-&gt;buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>memcg: rename high level charging functions</title>
<updated>2014-04-07T23:35:57+00:00</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.cz</email>
</author>
<published>2014-04-07T22:37:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d715ae08f2ff87508a081c4df78061bf4f7211d6'/>
<id>d715ae08f2ff87508a081c4df78061bf4f7211d6</id>
<content type='text'>
mem_cgroup_newpage_charge is used only for charging anonymous memory so
it is better to rename it to mem_cgroup_charge_anon.

mem_cgroup_cache_charge is used for file backed memory so rename it to
mem_cgroup_charge_file.

Signed-off-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
mem_cgroup_newpage_charge is used only for charging anonymous memory so
it is better to rename it to mem_cgroup_charge_anon.

mem_cgroup_cache_charge is used for file backed memory so rename it to
mem_cgroup_charge_file.

Signed-off-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: cleanup size checks in filemap_fault() and filemap_map_pages()</title>
<updated>2014-04-07T23:35:53+00:00</updated>
<author>
<name>Kirill A. Shutemov</name>
<email>kirill.shutemov@linux.intel.com</email>
</author>
<published>2014-04-07T22:37:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=99e3e53f4e0fd807607bf381e14f6de8feedd383'/>
<id>99e3e53f4e0fd807607bf381e14f6de8feedd383</id>
<content type='text'>
Minor cleanups:
 - 'size' variable is now in bytes, not pages;
 - use round_up(): it should be easier to read.

Signed-off-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Matthew Wilcox &lt;matthew.r.wilcox@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Ning Qu &lt;quning@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Minor cleanups:
 - 'size' variable is now in bytes, not pages;
 - use round_up(): it should be easier to read.

Signed-off-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Matthew Wilcox &lt;matthew.r.wilcox@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Ning Qu &lt;quning@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: implement -&gt;map_pages for page cache</title>
<updated>2014-04-07T23:35:53+00:00</updated>
<author>
<name>Kirill A. Shutemov</name>
<email>kirill.shutemov@linux.intel.com</email>
</author>
<published>2014-04-07T22:37:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f1820361f83d556a7f0a9f629100f3825e594328'/>
<id>f1820361f83d556a7f0a9f629100f3825e594328</id>
<content type='text'>
filemap_map_pages() is generic implementation of -&gt;map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for -&gt;map_pages() if
filesystem use filemap_fault() for -&gt;fault().

Signed-off-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Matthew Wilcox &lt;matthew.r.wilcox@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Ning Qu &lt;quning@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
filemap_map_pages() is generic implementation of -&gt;map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for -&gt;map_pages() if
filesystem use filemap_fault() for -&gt;fault().

Signed-off-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Matthew Wilcox &lt;matthew.r.wilcox@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Ning Qu &lt;quning@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: remove read_cache_page_async()</title>
<updated>2014-04-03T23:21:04+00:00</updated>
<author>
<name>Sasha Levin</name>
<email>sasha.levin@oracle.com</email>
</author>
<published>2014-04-03T21:48:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=67f9fd91f93c582b7de2ab9325b6e179db77e4d5'/>
<id>67f9fd91f93c582b7de2ab9325b6e179db77e4d5</id>
<content type='text'>
This patch removes read_cache_page_async() which wasn't really needed
anywhere and simplifies the code around it a bit.

read_cache_page_async() is useful when we want to read a page into the
cache without waiting for it to complete.  This happens when the
appropriate callback 'filler' doesn't complete its read operation and
releases the page lock immediately, and instead queues a different
completion routine to do that.  This never actually happened anywhere in
the code.

read_cache_page_async() had 3 different callers:

- read_cache_page() which is the sync version, it would just wait for
  the requested read to complete using wait_on_page_read().

- JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
  function it supplied doesn't do any async reads, and would complete
  before the filler function returns - making it actually a sync read.

- CRAMFS would call it using the read_mapping_page_async() wrapper, with
  a similar story to JFFS2 - the filler function doesn't do anything that
  reminds async reads and would always complete before the filler function
  returns.

To sum it up, the code in mm/filemap.c never took advantage of having
read_cache_page_async().  While there are filler callbacks that do async
reads (such as the block one), we always called it with the
read_cache_page().

This patch adds a mandatory wait for read to complete when adding a new
page to the cache, and removes read_cache_page_async() and its wrappers.

Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch removes read_cache_page_async() which wasn't really needed
anywhere and simplifies the code around it a bit.

read_cache_page_async() is useful when we want to read a page into the
cache without waiting for it to complete.  This happens when the
appropriate callback 'filler' doesn't complete its read operation and
releases the page lock immediately, and instead queues a different
completion routine to do that.  This never actually happened anywhere in
the code.

read_cache_page_async() had 3 different callers:

- read_cache_page() which is the sync version, it would just wait for
  the requested read to complete using wait_on_page_read().

- JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
  function it supplied doesn't do any async reads, and would complete
  before the filler function returns - making it actually a sync read.

- CRAMFS would call it using the read_mapping_page_async() wrapper, with
  a similar story to JFFS2 - the filler function doesn't do anything that
  reminds async reads and would always complete before the filler function
  returns.

To sum it up, the code in mm/filemap.c never took advantage of having
read_cache_page_async().  While there are filler callbacks that do async
reads (such as the block one), we always called it with the
read_cache_page().

This patch adds a mandatory wait for read to complete when adding a new
page to the cache, and removes read_cache_page_async() and its wrappers.

Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: keep page cache radix tree nodes in check</title>
<updated>2014-04-03T23:21:01+00:00</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2014-04-03T21:47:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=449dd6984d0e47643c04c807f609dd56d48d5bcc'/>
<id>449dd6984d0e47643c04c807f609dd56d48d5bcc</id>
<content type='text'>
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers.  But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed.  This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting.  The shadow entries will just
sit there and waste memory.  In the worst case, the shadow entries will
accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads.  A simple shrinker will then
reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits.

3. Tree modification needs tree lock and tree root, which are located
   in the address space, so store an address_space backpointer in the
   node.  The parent pointer of the node is in a union with the 2-word
   rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.

[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Bob Liu &lt;bob.liu@oracle.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Luigi Semenzato &lt;semenzato@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Metin Doslu &lt;metin@citusdata.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Ozgun Erdogan &lt;ozgun@citusdata.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Roman Gushchin &lt;klamm@yandex-team.ru&gt;
Cc: Ryan Mallon &lt;rmallon@gmail.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers.  But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed.  This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting.  The shadow entries will just
sit there and waste memory.  In the worst case, the shadow entries will
accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads.  A simple shrinker will then
reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits.

3. Tree modification needs tree lock and tree root, which are located
   in the address space, so store an address_space backpointer in the
   node.  The parent pointer of the node is in a union with the 2-word
   rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.

[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Bob Liu &lt;bob.liu@oracle.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Luigi Semenzato &lt;semenzato@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Metin Doslu &lt;metin@citusdata.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Ozgun Erdogan &lt;ozgun@citusdata.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Roman Gushchin &lt;klamm@yandex-team.ru&gt;
Cc: Ryan Mallon &lt;rmallon@gmail.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: thrash detection-based file cache sizing</title>
<updated>2014-04-03T23:21:01+00:00</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2014-04-03T21:47:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a528910e12ec7ee203095eb1711468a66b9b60b0'/>
<id>a528910e12ec7ee203095eb1711468a66b9b60b0</id>
<content type='text'>
The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past.  We call the recently usedbut
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By maintaining
a history of recently evicted file pages it can detect frequently used
pages with an arbitrarily small inactive list size, and subsequently
apply pressure on the active list based on actual demand for cache, not
just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Reviewed-by: Bob Liu &lt;bob.liu@oracle.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Luigi Semenzato &lt;semenzato@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Metin Doslu &lt;metin@citusdata.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Ozgun Erdogan &lt;ozgun@citusdata.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Roman Gushchin &lt;klamm@yandex-team.ru&gt;
Cc: Ryan Mallon &lt;rmallon@gmail.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past.  We call the recently usedbut
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By maintaining
a history of recently evicted file pages it can detect frequently used
pages with an arbitrarily small inactive list size, and subsequently
apply pressure on the active list based on actual demand for cache, not
just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Reviewed-by: Bob Liu &lt;bob.liu@oracle.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Luigi Semenzato &lt;semenzato@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Metin Doslu &lt;metin@citusdata.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Ozgun Erdogan &lt;ozgun@citusdata.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Roman Gushchin &lt;klamm@yandex-team.ru&gt;
Cc: Ryan Mallon &lt;rmallon@gmail.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
