<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/linux/mmzone.h, branch v2.6.36-rc7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake</title>
<updated>2010-09-10T01:57:25+00:00</updated>
<author>
<name>Christoph Lameter</name>
<email>cl@linux.com</email>
</author>
<published>2010-09-09T23:38:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=aa45484031ddee09b06350ab8528bfe5b2c76d1c'/>
<id>aa45484031ddee09b06350ab8528bfe5b2c76d1c</id>
<content type='text'>
Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists.  To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold.  On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is much higher than the
number of pages actually free in the buddy lists, the VM can allocate pages
below the min watermark, at worst reducing the real number of free pages to
zero.  Even if the OOM killer then kills a victim to free memory, the victim
may be unable to free memory if its exit path requires a new page, resulting
in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken.  The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.
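
A minimal sketch of the snapshot idea (simplified; the vmstat field names are
assumed from the kernel's per-cpu counter internals and may differ from the
patch): fold the not-yet-drained per-cpu deltas back into the global counter
before comparing against the watermark.

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	/* Start from the (possibly stale) global vmstat counter. */
	long x = atomic_long_read(&amp;zone-&gt;vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	/* Fold in the per-cpu deltas that have not been drained yet. */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone-&gt;pageset, cpu)-&gt;vm_stat_diff[item];

	if (x &lt; 0)
		x = 0;
#endif
	return x;
}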

Signed-off-by: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vmscan: kill prev_priority completely</title>
<updated>2010-08-10T03:45:00+00:00</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2010-08-10T00:19:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=25edde0332916ae706ccf83de688be57bcc844b7'/>
<id>25edde0332916ae706ccf83de688be57bcc844b7</id>
<content type='text'>
Since 2.6.28, zone-&gt;prev_priority has been unused, so it can be removed
safely. This also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and would be useful, but four (or more) attempts
have not produced good performance numbers, so I am giving up on that
approach.

The rest of this changelog is notes on prev_priority: why it existed in
the first place and why it might not be necessary any more. This
information is based heavily on discussions between Andrew Morton, Rik van
Riel and Kosaki Motohiro, who is quoted heavily below.

Historically prev_priority was important because it determined when the VM
would start unmapping PTE pages, i.e. there are no balances of note within
the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
is a potential risk of unnecessarily increasing minor faults, as a large
amount of read activity on use-once pages could push mapped pages to the
end of the LRU, where they get unmapped.

There is no proof this is still a problem, but currently it is not
considered to be one. Active file pages are not deactivated if the active
file list is smaller than the inactive list, reducing the likelihood that
file-mapped pages are pushed off the LRU, and referenced executable pages
are kept on the active list to avoid them being pushed out by read
activity.

Even if it is still a problem, prev_priority would not work nowadays.
First of all, current vmscan still has a lot of UP-centric code, which
exposes weaknesses on machines with dozens of CPUs. I think we need more
and more improvement.

The problem is that current vmscan mixes up per-system pressure, per-zone
pressure and per-task pressure a bit. For example, prev_priority tries to
boost a task's priority toward that of the other concurrent reclaimers,
but if the other task has a mempolicy restriction this is unnecessary and
also causes wrongly large latency and excessive reclaim. Per-task priority
plus the prev_priority adjustment emulates per-system pressure, but that
has two issues: 1) the emulation is too rough and brutal, and 2) we need
per-zone pressure, not per-system pressure.

Another example: currently DEF_PRIORITY is 12, which means the LRU is
rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking
the OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim at the
same time, the system is under higher memory pressure than priority==0
suggests (1/4096 * 10,000 &gt; 2). prev_priority can't solve such a
multithreaded workload issue. In other words, the prev_priority concept
assumes the system doesn't have lots of threads.
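
A small worked example of the arithmetic above (illustration only, not
kernel code; it only assumes that the fraction of the LRU scanned at
priority p is roughly 1/2^p):

#include &lt;stdio.h&gt;

int main(void)
{
	double one_descent = 0.0;
	int prio;

	/* One full descent from DEF_PRIORITY (12) down to 0. */
	for (prio = 12; prio &gt;= 0; prio--)
		one_descent += 1.0 / (1 &lt;&lt; prio);	/* 1/4096 + ... + 1 */

	printf("one DEF_PRIORITY descent:  %.4f LRUs\n", one_descent);
	/* 10,000 concurrent threads scanning 1/4096 each already exceed that. */
	printf("10,000 threads at prio 12: %.4f LRUs\n", 10000.0 / 4096);
	return 0;
}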

Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Reviewed-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Chris Mason &lt;chris.mason@oracle.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Michael Rubin &lt;mrubin@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mmzone.h: remove dead prototype</title>
<updated>2010-08-10T03:44:57+00:00</updated>
<author>
<name>Alexander Nevenchannyy</name>
<email>a.nevenchannyy@gmail.com</email>
</author>
<published>2010-08-10T00:19:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b645bd1286f2fbcd2eb4ab3bed5884f63c42e363'/>
<id>b645bd1286f2fbcd2eb4ab3bed5884f63c42e363</id>
<content type='text'>
get_zone_counts() was dropped from the kernel tree (see
http://www.mail-archive.com/mm-commits@vger.kernel.org/msg07313.html), but
its prototype remains.

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>numa: introduce numa_mem_id()- effective local memory node id</title>
<updated>2010-05-27T16:12:57+00:00</updated>
<author>
<name>Lee Schermerhorn</name>
<email>lee.schermerhorn@hp.com</email>
</author>
<published>2010-05-26T21:45:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7aac789885512388a66d47280d7e7777ffba1e59'/>
<id>7aac789885512388a66d47280d7e7777ffba1e59</id>
<content type='text'>
Introduce numa_mem_id(), based on generic percpu variable infrastructure
to track "nearest node with memory" for archs that support memoryless
nodes.

Define the API in &lt;linux/topology.h&gt; when CONFIG_HAVE_MEMORYLESS_NODES is
defined, else provide stubs.  Architectures will define HAVE_MEMORYLESS_NODES
if/when they support memoryless nodes.

Archs can override the definitions of:

numa_mem_id() - returns node number of the "local memory" node
set_numa_mem() - initialize [this cpu's] per-cpu variable 'numa_mem'
cpu_to_mem()  - return numa_mem for the specified cpu; may be used as an lvalue

Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
This will initialize the boot cpu at boot time, and all cpus on a change of
numa_zonelist_order or when node or memory hot-plug requires a zonelist
rebuild.  Archs that support memoryless nodes will need to initialize
'numa_mem' for secondary cpus as they're brought on-line.
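
A rough sketch of the API shape described above (simplified; the per-cpu
variable name and the accessors used here are assumptions, the exact
definitions in &lt;linux/topology.h&gt; may differ):

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
DECLARE_PER_CPU(int, _numa_mem_);	/* nearest node with memory */

static inline void set_numa_mem(int node)
{
	this_cpu_write(_numa_mem_, node);
}

static inline int numa_mem_id(void)
{
	return this_cpu_read(_numa_mem_);
}

/* macro so it can also be used as an lvalue */
#define cpu_to_mem(cpu)		per_cpu(_numa_mem_, (cpu))

#else	/* !CONFIG_HAVE_MEMORYLESS_NODES */

/* Stubs: the "local memory" node is simply the local node. */
static inline int numa_mem_id(void)
{
	return numa_node_id();
}

#define cpu_to_mem(cpu)		cpu_to_node(cpu)

#endif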

[akpm@linux-foundation.org: fix build]
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Eric Whitney &lt;eric.whitney@hp.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: "Luck, Tony" &lt;tony.luck@intel.com&gt;
Cc: Pekka Enberg &lt;penberg@cs.helsinki.fi&gt;
Cc: &lt;linux-arch@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mem-hotplug: fix potential race while building zonelist for new populated zone</title>
<updated>2010-05-25T15:07:02+00:00</updated>
<author>
<name>Haicheng Li</name>
<email>haicheng.li@linux.intel.com</email>
</author>
<published>2010-05-24T21:32:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4eaf3f64397c3db3c5785eee508270d62a9fabd9'/>
<id>4eaf3f64397c3db3c5785eee508270d62a9fabd9</id>
<content type='text'>
Add global mutex zonelists_mutex to fix the possible race:

     CPU0                                  CPU1                    CPU2
(1) zone-&gt;present_pages += online_pages;
(2)                                       build_all_zonelists();
(3)                                                               alloc_page();
(4)                                                               free_page();
(5) build_all_zonelists();
(6)   __build_all_zonelists();
(7)     zone-&gt;pageset = alloc_percpu();

In steps (3) and (4), zone-&gt;pageset still points to boot_pageset, so bad
things may happen if two or more nodes are in this state. Even if only one
node is accessing the boot_pageset, step (3) may still consume enough memory
to make the allocation in step (7) fail.

Besides, performing the rebuild atomically ensures that alloc_percpu() in
step (7) will never fail, since a fresh memory block has just been added in
step (6).
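
A minimal sketch of the serialization this introduces (the wrapper name is
hypothetical and for illustration only; the real change wires
zonelists_mutex through build_all_zonelists() and the memory hotplug path):

static DEFINE_MUTEX(zonelists_mutex);

/* Every zonelist rebuild for a newly populated zone runs under the mutex,
 * so steps (5)-(7) cannot race with another node that is still using the
 * shared boot_pageset. */
static void rebuild_zonelists_locked(struct zone *zone)
{
	mutex_lock(&amp;zonelists_mutex);
	build_all_zonelists(zone);	/* replaces zone-&gt;pageset under the lock */
	mutex_unlock(&amp;zonelists_mutex);
}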

[haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
Signed-off-by: Haicheng Li &lt;haicheng.li@linux.intel.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Reviewed-by: Andi Kleen &lt;andi.kleen@intel.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset</title>
<updated>2010-05-25T15:07:01+00:00</updated>
<author>
<name>Haicheng Li</name>
<email>haicheng.li@linux.intel.com</email>
</author>
<published>2010-05-24T21:32:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1f522509c77a5dea8dc384b735314f03908a6415'/>
<id>1f522509c77a5dea8dc384b735314f03908a6415</id>
<content type='text'>
For each newly populated zone of a hot-added node, we need to update its
pagesets with dynamically allocated per_cpu_pageset structs for all possible
CPUs:

    1) Detach zone-&gt;pageset from the shared boot_pageset
       at the end of __build_all_zonelists().

    2) Use a mutex to protect zone-&gt;pageset while it is still
       shared in onlined_pages().

Otherwise, multiple zones on different nodes would share the same
bootstrapping boot_pageset for the same CPU, which eventually causes the
kernel panic below:

  ------------[ cut here ]------------
  kernel BUG at mm/page_alloc.c:1239!
  invalid opcode: 0000 [#1] SMP
  ...
  Call Trace:
   [&lt;ffffffff811300c1&gt;] __alloc_pages_nodemask+0x131/0x7b0
   [&lt;ffffffff81162e67&gt;] alloc_pages_current+0x87/0xd0
   [&lt;ffffffff81128407&gt;] __page_cache_alloc+0x67/0x70
   [&lt;ffffffff811325f0&gt;] __do_page_cache_readahead+0x120/0x260
   [&lt;ffffffff81132751&gt;] ra_submit+0x21/0x30
   [&lt;ffffffff811329c6&gt;] ondemand_readahead+0x166/0x2c0
   [&lt;ffffffff81132ba0&gt;] page_cache_async_readahead+0x80/0xa0
   [&lt;ffffffff8112a0e4&gt;] generic_file_aio_read+0x364/0x670
   [&lt;ffffffff81266cfa&gt;] nfs_file_read+0xca/0x130
   [&lt;ffffffff8117b20a&gt;] do_sync_read+0xfa/0x140
   [&lt;ffffffff8117bf75&gt;] vfs_read+0xb5/0x1a0
   [&lt;ffffffff8117c151&gt;] sys_read+0x51/0x80
   [&lt;ffffffff8103c032&gt;] system_call_fastpath+0x16/0x1b
  RIP  [&lt;ffffffff8112ff13&gt;] get_page_from_freelist+0x883/0x900
   RSP &lt;ffff88000d1e78a8&gt;
  ---[ end trace 4bda28328b9990db ]
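
A sketch of step 1) above (simplified; the helper name and the callees are
assumptions based on the description, not the literal patch): give the
newly populated zone its own per-cpu pagesets so it stops sharing
boot_pageset with zones on other nodes.

static void setup_zone_pageset(struct zone *zone)
{
	int cpu;

	zone-&gt;pageset = alloc_percpu(struct per_cpu_pageset);
	for_each_possible_cpu(cpu) {
		struct per_cpu_pageset *pcp = per_cpu_ptr(zone-&gt;pageset, cpu);

		setup_pageset(pcp, zone_batchsize(zone));
	}
}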

[akpm@linux-foundation.org: merge fix]
Signed-off-by: Haicheng Li &lt;haicheng.li@linux.intel.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Reviewed-by: Andi Kleen &lt;andi.kleen@intel.com&gt;
Reviewed-by: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix NR_SECTION_ROOTS == 0 when using sparsemem extreme.</title>
<updated>2010-05-25T15:07:01+00:00</updated>
<author>
<name>Marcelo Roberto Jimenez</name>
<email>mroberto@cpti.cetuc.puc-rio.br</email>
</author>
<published>2010-05-24T21:32:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0faa56389c793cda7f967117415717bbab24fe4e'/>
<id>0faa56389c793cda7f967117415717bbab24fe4e</id>
<content type='text'>
Got this while compiling for ARM/SA1100:

mm/sparse.c: In function '__section_nr':
mm/sparse.c:135: warning: 'root' is used uninitialized in this function

This patch follows Russell King's suggestion for a new calculation for
NR_SECTION_ROOTS.  Thanks also to Sergei Shtylyov for pointing out the
existence of the macro DIV_ROUND_UP.
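
The resulting calculation is essentially the following (assumption: this
mirrors the patch, and NR_MEM_SECTIONS / SECTIONS_PER_ROOT are the usual
sparsemem constants):

#define NR_SECTION_ROOTS	DIV_ROUND_UP(NR_MEM_SECTIONS, SECTIONS_PER_ROOT)

rather than a plain NR_MEM_SECTIONS / SECTIONS_PER_ROOT, which truncates to
0 whenever NR_MEM_SECTIONS &lt; SECTIONS_PER_ROOT.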

Atsushi Nemoto observed:
: This fix doesn't just silence the warning - it fixes a real problem.
:
: Without this fix, mem_section[] might have 0 size so mem_section[0]
: will share other variable area.  For example, I got:
:
: c030c700 b __warned.16478
: c030c700 B mem_section
: c030c701 b __warned.16483
:
: This might cause very strange behavior.  Your patch actually fixes it.

Signed-off-by: Marcelo Roberto Jimenez &lt;mroberto@cpti.cetuc.puc-rio.br&gt;
Cc: Atsushi Nemoto &lt;anemo@mba.ocn.ne.jp&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Cc: Sergei Shtylyov &lt;sshtylyov@mvista.com&gt;
Cc: Russell King &lt;rmk@arm.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: compaction: defer compaction using an exponential backoff when compaction fails</title>
<updated>2010-05-25T15:07:00+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mel@csn.ul.ie</email>
</author>
<published>2010-05-24T21:32:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4f92e2586b43a2402e116055d4edda704f911b5b'/>
<id>4f92e2586b43a2402e116055d4edda704f911b5b</id>
<content type='text'>
The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail.  There are two obvious reasons as to why:

  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone (1 &lt;&lt; compact_defer_shift) times.
If the next compaction attempt also fails, compact_defer_shift is
increased up to a maximum of 6.  If compaction succeeds, the defer
counters are reset again.
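
A sketch of the backoff bookkeeping (simplified; compact_defer_shift comes
from the description above, while compact_considered, the maximum and the
helper names are assumptions and may differ from the patch):

/* Do not skip compaction more than 1 &lt;&lt; 6 times in a row. */
#define COMPACT_MAX_DEFER_SHIFT 6

/* Called when compaction ran but the following allocation still failed. */
static inline void defer_compaction(struct zone *zone)
{
	zone-&gt;compact_considered = 0;
	if (++zone-&gt;compact_defer_shift &gt; COMPACT_MAX_DEFER_SHIFT)
		zone-&gt;compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* Returns true while the zone is still inside its backoff window. */
static inline bool compaction_deferred(struct zone *zone)
{
	unsigned long defer_limit = 1UL &lt;&lt; zone-&gt;compact_defer_shift;

	if (++zone-&gt;compact_considered &gt; defer_limit)
		zone-&gt;compact_considered = defer_limit;	/* avoid overflow */

	return zone-&gt;compact_considered &lt; defer_limit;
}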

The zone that is deferred is the first zone in the zonelist - i.e.  the
preferred zone.  To defer compaction in the other zones, the information
would need to be stored in the zonelist or implemented similar to the
zonelist_cache.  This would impact the fast-paths and is not justified at
this time.

Signed-off-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-next' into for-linus</title>
<updated>2010-03-08T15:55:37+00:00</updated>
<author>
<name>Jiri Kosina</name>
<email>jkosina@suse.cz</email>
</author>
<published>2010-03-08T15:55:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=318ae2edc3b29216abd8a2510f3f80b764f06858'/>
<id>318ae2edc3b29216abd8a2510f3f80b764f06858</id>
<content type='text'>
Conflicts:
	Documentation/filesystems/proc.txt
	arch/arm/mach-u300/include/mach/debug-macro.S
	drivers/net/qlge/qlge_ethtool.c
	drivers/net/qlge/qlge_main.c
	drivers/net/typhoon.c
</content>
</entry>
<entry>
<title>mm: restore zone-&gt;all_unreclaimable to independence word</title>
<updated>2010-03-06T19:26:25+00:00</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2010-03-05T21:41:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=93e4a89a8c987189b168a530a331ef6d0fcf07a7'/>
<id>93e4a89a8c987189b168a530a331ef6d0fcf07a7</id>
<content type='text'>
commit e815af95 ("change all_unreclaimable zone member to flags") changed
the all_unreclaimable member into a bit flag.  But that had an undesirable
side effect: free_one_page() is one of the hottest paths in the Linux
kernel, and increasing the number of atomic ops in it can reduce kernel
performance a bit.

Thus, this patch partially reverts that commit: at the least,
all_unreclaimable should not share a memory word with the other zone flags.
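
Two illustrative fragments of what this means in practice (not the literal
diff; the flag helper and bit name are assumptions): the flag moves back to
a plain member, so the free path writes it with an ordinary store instead
of an atomic bit operation.

	/* before: atomic bit op on the shared zone-&gt;flags word */
	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);

	/* after: plain store to a dedicated member */
	zone-&gt;all_unreclaimable = 0;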

[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Huang Shijie &lt;shijie8@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
