linux-toradex.git/include/linux/memcontrol.h, branch v2.6.33.1

Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6

2009-12-16T20:36:49+00:00

* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
  HWPOISON: Remove stray phrase in a comment
  HWPOISON: Try to allocate migration page on the same node
  HWPOISON: Don't do early filtering if filter is disabled
  HWPOISON: Add a madvise() injector for soft page offlining
  HWPOISON: Add soft page offline support
  HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
  HWPOISON: Use new shake_page in memory_failure
  HWPOISON: Use correct name for MADV_HWPOISON in documentation
  HWPOISON: mention HWPoison in Kconfig entry
  HWPOISON: Use get_user_page_fast in hwpoison madvise
  HWPOISON: add an interface to switch off/on all the page filters
  HWPOISON: add memory cgroup filter
  memcg: add accessor to mem_cgroup.css
  memcg: rename and export try_get_mem_cgroup_from_page()
  HWPOISON: add page flags filter
  mm: export stable page flags
  HWPOISON: limit hwpoison injector to known page types
  HWPOISON: add fs/device filters
  HWPOISON: return 0 to indicate success reliably
  HWPOISON: make semantics of IGNORED/DELAYED clear
  ...

memcg: make memcg's file mapped consistent with global VM

2009-12-16T15:20:07+00:00

In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE.  This makes
grep difficult.  Replace memcg's MAPPED_FILE with FILE_MAPPED

And in global VM, mapped shared memory is accounted into FILE_MAPPED.
But memcg doesn't. fix it.
Note:
  page_is_file_cache() just checks SwapBacked or not.
  So, we need to check PageAnon.

Cc: Balbir Singh 
Reviewed-by: Daisuke Nishimura 
Signed-off-by: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: coalesce uncharge during unmap/truncate

2009-12-16T15:20:07+00:00

In massive parallel enviroment, res_counter can be a performance
bottleneck.  One strong techinque to reduce lock contention is reducing
calls by coalescing some amount of calls into one.

Considering charge/uncharge chatacteristic,
	- charge is done one by one via demand-paging.
	- uncharge is done by
		- in chunk at munmap, truncate, exit, execve...
		- one by one via vmscan/paging.

It seems we have a chance to coalesce uncharges for improving scalability
at unmap/truncation.

This patch is a for coalescing uncharge.  For avoiding scattering memcg's
structure to functions under /mm, this patch adds memcg batch uncharge
information to the task.  A reason for per-task batching is for making use
of caller's context information.  We do batched uncharge (deleyed
uncharge) when truncation/unmap occurs but do direct uncharge when
uncharge is called by memory reclaim (vmscan.c).

The degree of coalescing depends on callers
  - at invalidate/trucate... pagevec size
  - at unmap ....ZAP_BLOCK_SIZE
(memory itself will be freed in this degree.)
Then, we'll not coalescing too much.

On x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per a cpu by taskset and see sum of the number of page faults
in 60secs.

[without memcg config]
  40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
  27.67 cache-miss/faults
[root cgroup]
  36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
  31.58 miss/faults
[in a child cgroup]
  18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
  69.96 miss/faults
[child with this patch]
  27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
  47.16 miss/faults

We can see some amounts of improvement.
(root cgroup doesn't affected by this patch)
Another patch for "charge" will follow this and above will be improved more.

Changelog(since 2009/10/02):
 - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes)
 - some clean up and commentary/description updates.
 - added initialize code to copy_process(). (possible bug fix)

Changelog(old):
 - fixed !CONFIG_MEM_CGROUP case.
 - rebased onto the latest mmotm + softlimit fix patches.
 - unified patch for callers
 - added commetns.
 - make ->do_batch as bool.
 - removed css_get() at el. We don't need it.

Signed-off-by: KAMEZAWA Hiroyuki 
Cc: Balbir Singh 
Cc: Daisuke Nishimura 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: add accessor to mem_cgroup.css

2009-12-16T11:19:59+00:00

So that an outside user can free the reference count grabbed by
try_get_mem_cgroup_from_page().

CC: KOSAKI Motohiro 
CC: Hugh Dickins 
CC: Daisuke Nishimura 
CC: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Signed-off-by: Wu Fengguang 
Signed-off-by: Andi Kleen

memcg: rename and export try_get_mem_cgroup_from_page()

2009-12-16T11:19:59+00:00

So that the hwpoison injector can get mem_cgroup for arbitrary page
and thus know whether it is owned by some mem_cgroup task(s).

[AK: Merged with latest git tree]

CC: KOSAKI Motohiro 
CC: Hugh Dickins 
CC: Daisuke Nishimura 
CC: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Signed-off-by: Wu Fengguang 
Signed-off-by: Andi Kleen

memory controller: soft limit reclaim on contention

2009-09-24T14:20:59+00:00

Implement reclaim from groups over their soft limit

Permit reclaim from memory cgroups on contention (via the direct reclaim
path).

memory cgroup soft limit reclaim finds the group that exceeds its soft
limit by the largest number of pages and reclaims pages from it and then
reinserts the cgroup into its correct place in the rbtree.

Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
loops in case all swap is turned off.  The code has been refactored and
the loop check (loop < 2) has been enhanced for soft limits.  For soft
limits, we try to do more targetted reclaim.  Instead of bailing out after
two loops, the routine now reclaims memory proportional to the size by
which the soft limit is exceeded.  The proportion has been empirically
determined.

[akpm@linux-foundation.org: build fix]
[kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
[nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
Signed-off-by: Balbir Singh 
Cc: KAMEZAWA Hiroyuki 
Cc: Li Zefan 
Acked-by: KOSAKI Motohiro 
Signed-off-by: KAMEZAWA Hiroyuki 
Signed-off-by: Daisuke Nishimura 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: add file-based RSS accounting

2009-06-18T20:03:47+00:00

Add file RSS tracking per memory cgroup

We currently don't track file RSS, the RSS we report is actually anon RSS.
 All the file mapped pages, come in through the page cache and get
accounted there.  This patch adds support for accounting file RSS pages.
It should

1. Help improve the metrics reported by the memory resource controller
2. Will form the basis for a future shared memory accounting heuristic
   that has been proposed by Kamezawa.

Unfortunately, we cannot rename the existing "rss" keyword used in
memory.stat to "anon_rss".  We however, add "mapped_file" data and hope to
educate the end user through documentation.

[hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
Signed-off-by: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Li Zefan 
Cc: Paul Menage 
Cc: Dhaval Giani 
Cc: Daisuke Nishimura 
Cc: YAMAMOTO Takashi 
Cc: KOSAKI Motohiro 
Cc: David Rientjes 
Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vmscan: evict use-once pages first

2009-06-17T02:47:38+00:00

When the file LRU lists are dominated by streaming IO pages, evict those
pages first, before considering evicting other pages.

This should be safe from deadlocks or performance problems
because only three things can happen to an inactive file page:

1) referenced twice and promoted to the active list
2) evicted by the pageout code
3) under IO, after which it will get evicted or promoted

The pages freed in this way can either be reused for streaming IO, or
allocated for something else.  If the pages are used for streaming IO,
this pageout pattern continues.  Otherwise, we will fall back to the
normal pageout pattern.

Signed-off-by: Rik van Riel 
Reported-by: Elladan 
Cc: KOSAKI Motohiro 
Cc: Peter Zijlstra 
Cc: Lee Schermerhorn 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: fix mem_cgroup_shrink_usage()

2009-05-02T22:36:09+00:00

Current mem_cgroup_shrink_usage() has two problems.

1. It doesn't call mem_cgroup_out_of_memory and doesn't update
   last_oom_jiffies, so pagefault_out_of_memory invokes global OOM.

2. Considering hierarchy, shrinking has to be done from the
   mem_over_limit, not from the memcg which the page would be charged to.

mem_cgroup_try_charge_swapin() does all of these things properly, so we
use it and call cancel_charge_swapin when it succeeded.

The name of "shrink_usage" is not appropriate for this behavior, so we
change it too.

Signed-off-by: Daisuke Nishimura 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Li Zefan 
Cc: Paul Menage 
Cc: Dhaval Giani 
Cc: Daisuke Nishimura 
Cc: YAMAMOTO Takashi 
Cc: KOSAKI Motohiro 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: use rcu_dereference to access mm->owner

2009-04-21T20:41:48+00:00

mm->owner should be accessed with rcu_dereference().

Reported-by: KOSAKI Motohiro 
Signed-off-by: KAMEZAWA Hiroyuki 
Acked-by: Balbir Singh 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds