linux-toradex.git/mm, branch v2.6.37.5

mm: fix possible cause of a page_mapped BUG

2011-03-14T21:17:34+00:00

commit a3e8cc643d22d2c8ed36b9be7d9c9ca21efcf7f7 upstream.

Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
a hole with madvise(,, MADV_REMOVE).  That path is under mutex, and
cannot be explained by lack of serialization in unmap_mapping_range().

Reviewing the code, I found one place where vm_truncate_count handling
should have been updated, when I switched at the last minute from one
way of managing the restart_addr to another: mremap move changes the
virtual addresses, so it ought to adjust the restart_addr.

But rather than exporting the notion of restart_addr from memory.c, or
converting to restart_pgoff throughout, simply reset vm_truncate_count
to 0 to force a rescan if mremap move races with preempted truncation.

We have no confirmation that this fixes Robert's BUG,
but it is a fix that's worth making anyway.

Signed-off-by: Hugh Dickins 
Signed-off-by: Linus Torvalds 
Cc: Kerin Millar 
Signed-off-by: Greg Kroah-Hartman

mm: vmstat: use a single setter function and callback for adjusting percpu thresholds

2011-03-07T23:05:21+00:00

commit b44129b30652c8771db2265939bb8b463724043d upstream.

reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
errors due to counter drift.  The functions duplicate some code so this
patch replaces them with a single set_pgdat_percpu_threshold() that takes
a callback function to calculate the desired threshold as a parameter.

[akpm@linux-foundation.org: readability tweak]
[kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
Signed-off-by: Mel Gorman 
Reviewed-by: Christoph Lameter 
Reviewed-by: KAMEZAWA Hiroyuki 
Signed-off-by: KOSAKI Motohiro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: fix dubious code in __count_immobile_pages()

2011-03-07T23:05:11+00:00

commit 29723fccc837d20039078f7a571e8d457eb0d6c6 upstream.

When pfn_valid_within() failed 'iter' was incremented twice.

Signed-off-by: Namhyung Kim 
Reviewed-by: KAMEZAWA Hiroyuki 
Reviewed-by: Minchan Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: prevent concurrent unmap_mapping_range() on the same inode

2011-03-07T23:05:08+00:00

commit 2aa15890f3c191326678f1bd68af61ec6b8753ec upstream.

Michael Leun reported that running parallel opens on a fuse filesystem
can trigger a "kernel BUG at mm/truncate.c:475"

Gurudas Pai reported the same bug on NFS.

The reason is, unmap_mapping_range() is not prepared for more than
one concurrent invocation per inode.  For example:

  thread1: going through a big range, stops in the middle of a vma and
     stores the restart address in vm_truncate_count.

  thread2: comes in with a small (e.g. single page) unmap request on
     the same vma, somewhere before restart_address, finds that the
     vma was already unmapped up to the restart address and happily
     returns without doing anything.

Another scenario would be two big unmap requests, both having to
restart the unmapping and each one setting vm_truncate_count to its
own value.  This could go on forever without any of them being able to
finish.

Truncate and hole punching already serialize with i_mutex.  Other
callers of unmap_mapping_range() do not, and it's difficult to get
i_mutex protection for all callers.  In particular ->d_revalidate(),
which calls invalidate_inode_pages2_range() in fuse, may be called
with or without i_mutex.

This patch adds a new mutex to 'struct address_space' to prevent
running multiple concurrent unmap_mapping_range() on the same mapping.

[ We'll hopefully get rid of all this with the upcoming mm
  preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
  lockbreak" patch in particular.  But that is for 2.6.39 ]

Signed-off-by: Miklos Szeredi 
Reported-by: Michael Leun 
Reported-by: Gurudas Pai 
Tested-by: Gurudas Pai 
Acked-by: Hugh Dickins 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

memory hotplug: one more lock on memory hotplug

2011-02-17T23:14:50+00:00

commit 925268a06dc2b1ff7bfcc37419a6827a0e739639 upstream.

Now, memory_hotplug_(un)lock() is used for add/remove/offline pages
for avoiding races with hibernation. But this should be held in
online_pages(), too. It seems asymmetric.

There are cases where one has to avoid a race with memory hotplug
notifier and his own local code, and hotplug v.s. hotplug.
This will add a generic solution for avoiding races. In other view,
having lock here has no big impacts. online pages is tend to be
done by udev script at el against each memory section one by one.

Then, it's better to have lock here, too.

Reviewed-by: Christoph Lameter 
Acked-by: David Rientjes 
Signed-off-by: KAMEZAWA Hiroyuki 
Signed-off-by: Pekka Enberg 
Signed-off-by: Greg Kroah-Hartman

memsw: handle swapaccount kernel parameter correctly

2011-02-17T23:14:46+00:00

commit fceda1bf498677501befc7da72fd2e4de7f18466 upstream.

__setup based kernel command line parameters handlers which are handled in
obsolete_checksetup are provided with the parameter value including =
(more precisely everything right after the parameter name).

This means that the current implementation of swapaccount[=1|0] doesn't
work at all because if there is a value for the parameter then we are
testing for "0" resp.  "1" but we are getting "=0" resp.  "=1" and if
there is no parameter value we are getting an empty string rather than
NULL.

The original noswapccount parameter, which doesn't care about the value,
works correctly.

Signed-off-by: Michal Hocko 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Daisuke Nishimura 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: page allocator: adjust the per-cpu counter threshold when memory is low

2011-02-17T23:14:38+00:00

commit 88f5acf88ae6a9778f6d25d0d5d7ec2d57764a97 upstream.

Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a threshold is above a
threshold.  On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high.  The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues.  The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.

Unfortately, as reported by Shaohua Li this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing.  This patch takes a different approach.  For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level.  This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node.  There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat().  We are still using an expensive
function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark
functions is reduced.  The following report is on the percentage of time
spent cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().

vanilla                      11.6615%
disable-threshold            0.2584%

David said:

: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity.  I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).

Signed-off-by: Mel Gorman 
Reported-by: Shaohua Li 
Reviewed-by: Christoph Lameter 
Tested-by: Nicolas Bareil 
Cc: David Rientjes 
Cc: Kyle McMartin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

slub: Avoid use of slub_lock in show_slab_objects()

2011-02-17T23:14:37+00:00

commit 04d94879c8a4973b5499dc26b9d38acee8928791 upstream.

The purpose of the locking is to prevent removal and additions
of nodes when statistics are gathered for a slab cache. So we
need to avoid racing with memory hotplug functionality.

It is enough to take the memory hotplug locks there instead
of the slub_lock.

online_pages() currently does not acquire the memory_hotplug
lock. Another patch will be submitted by the memory hotplug
authors to take the memory hotplug lock and describe the
uses of the memory hotplug lock to protect against
adding and removal of nodes from non hotplug data structures.

Reported-and-tested-by: Bart Van Assche 
Acked-by: David Rientjes 
Signed-off-by: Christoph Lameter 
Signed-off-by: Pekka Enberg 
Signed-off-by: Greg Kroah-Hartman

memcg: fix account leak at failure of memsw acconting

2011-02-17T23:14:36+00:00

commit 01c88e2d6b7330c0cc5867fe2297e7d826e1337d upstream.

Commit 4b53433468 ("memcg: clean up try_charge main loop") removes a
cancel of charge at case: memory charge-> success.  mem+swap charge->
failure.

This leaks usage of memory.  Fix it.

Signed-off-by: KAMEZAWA Hiroyuki 
Reviewed-by: Johannes Weiner 
Acked-by: Daisuke Nishimura 
Cc: Balbir Singh 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: fix hugepage migration

2011-02-17T23:14:18+00:00

commit fd4a4663db293bfd5dc20fb4113977f62895e550 upstream.

2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
but its anon_vma handling was still based around the 2.6.35 conventions.
Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
drop_anon_vma in the same way as we're now changing unmap_and_move().

I don't particularly like to propose this for stable when I've not seen
its problems in practice nor tested the solution: but it's clearly out of
synch at present.

Signed-off-by: Hugh Dickins 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Naoya Horiguchi 
Cc: "Jun'ichi Nomura" 
Cc: Andi Kleen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman