<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/linux/mmzone.h, branch v2.6.36-rc7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake</title>
<updated>2010-09-10T01:57:25+00:00</updated>
<author>
<name>Christoph Lameter</name>
<email>cl@linux.com</email>
</author>
<published>2010-09-09T23:38:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=aa45484031ddee09b06350ab8528bfe5b2c76d1c'/>
<id>aa45484031ddee09b06350ab8528bfe5b2c76d1c</id>
<content type='text'>
Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists.  To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold.  On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is much higher than the
number of pages actually free in the buddy lists, the VM can allocate pages
below the min watermark, at worst reducing the real number of free pages to
zero.  Even if the OOM killer then kills a victim to free memory, the victim
may be unable to free memory if its exit path requires a new page, resulting
in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken.  The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.
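
A minimal sketch of the snapshot idea (simplified; the vmstat field names are
assumed from the kernel's per-cpu counter internals and may differ from the
patch): fold the not-yet-drained per-cpu deltas back into the global counter
before comparing against the watermark.

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	/* Start from the (possibly stale) global vmstat counter. */
	long x = atomic_long_read(&amp;zone-&gt;vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	/* Fold in the per-cpu deltas that have not been drained yet. */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone-&gt;pageset, cpu)-&gt;vm_stat_diff[item];

	if (x &lt; 0)
		x = 0;
#endif
	return x;
}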

Signed-off-by: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vmscan: kill prev_priority completely</title>
<updated>2010-08-10T03:45:00+00:00</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2010-08-10T00:19:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=25edde0332916ae706ccf83de688be57bcc844b7'/>
<id>25edde0332916ae706ccf83de688be57bcc844b7</id>
<content type='text'>
Since 2.6.28, zone-&gt;prev_priority has been unused, so it can be removed
safely. This also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and would be useful, but four (or more) attempts
have not produced good performance numbers, so I am giving up on that
approach.

The rest of this changelog is notes on prev_priority: why it existed in
the first place and why it might not be necessary any more. This
information is based heavily on discussions between Andrew Morton, Rik van
Riel and Kosaki Motohiro, who is quoted heavily below.

Historically prev_priority was important because it determined when the VM
would start unmapping PTE pages, i.e. there are no balances of note within
the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
is a potential risk of unnecessarily increasing minor faults, as a large
amount of read activity on use-once pages could push mapped pages to the
end of the LRU, where they get unmapped.

There is no proof this is still a problem, but currently it is not
considered to be one. Active file pages are not deactivated if the active
file list is smaller than the inactive list, reducing the likelihood that
file-mapped pages are pushed off the LRU, and referenced executable pages
are kept on the active list to avoid them being pushed out by read
activity.

Even if it is still a problem, prev_priority would not work nowadays.
First of all, current vmscan still has a lot of UP-centric code, which
exposes weaknesses on machines with dozens of CPUs. I think we need more
and more improvement.

The problem is that current vmscan mixes up per-system pressure, per-zone
pressure and per-task pressure a bit. For example, prev_priority tries to
boost a task's priority toward that of the other concurrent reclaimers,
but if the other task has a mempolicy restriction this is unnecessary and
also causes wrongly large latency and excessive reclaim. Per-task priority
plus the prev_priority adjustment emulates per-system pressure, but that
has two issues: 1) the emulation is too rough and brutal, and 2) we need
per-zone pressure, not per-system pressure.

Another example: currently DEF_PRIORITY is 12, which means the LRU is
rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking
the OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim at the
same time, the system is under higher memory pressure than priority==0
suggests (1/4096 * 10,000 &gt; 2). prev_priority can't solve such a
multithreaded workload issue. In other words, the prev_priority concept
assumes the system doesn't have lots of threads.
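
A small worked example of the arithmetic above (illustration only, not
kernel code; it only assumes that the fraction of the LRU scanned at
priority p is roughly 1/2^p):

#include &lt;stdio.h&gt;

int main(void)
{
	double one_descent = 0.0;
	int prio;

	/* One full descent from DEF_PRIORITY (12) down to 0. */
	for (prio = 12; prio &gt;= 0; prio--)
		one_descent += 1.0 / (1 &lt;&lt; prio);	/* 1/4096 + ... + 1 */

	printf("one DEF_PRIORITY descent:  %.4f LRUs\n", one_descent);
	/* 10,000 concurrent threads scanning 1/4096 each already exceed that. */
	printf("10,000 threads at prio 12: %.4f LRUs\n", 10000.0 / 4096);
	return 0;
}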

Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Reviewed-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Chris Mason &lt;chris.mason@oracle.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Michael Rubin &lt;mrubin@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mmzone.h: remove dead prototype</title>
<updated>2010-08-10T03:44:57+00:00</updated>
<author>
<name>Alexander Nevenchannyy</name>
<email>a.nevenchannyy@gmail.com</email>
</author>
<published>2010-08-10T00:19:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b645bd1286f2fbcd2eb4ab3bed5884f63c42e363'/>
<id>b645bd1286f2fbcd2eb4ab3bed5884f63c42e363</id>
<content type='text'>
get_zone_counts() was dropped from the kernel tree (see
http://www.mail-archive.com/mm-commits@vger.kernel.org/msg07313.html), but
its prototype remains.

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>numa: introduce numa_mem_id()- effective local memory node id</title>
<updated>2010-05-27T16:12:57+00:00</updated>
<author>
<name>Lee Schermerhorn</name>
<email>lee.schermerhorn@hp.com</email>
</author>
<published>2010-05-26T21:45:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7aac789885512388a66d47280d7e7777ffba1e59'/>
<id>7aac789885512388a66d47280d7e7777ffba1e59</id>
<content type='text'>
Introduce numa_mem_id(), based on generic percpu variable infrastructure
to track "nearest node with memory" for archs that support memoryless
nodes.

Define the API in &lt;linux/topology.h&gt; when CONFIG_HAVE_MEMORYLESS_NODES is
defined, else provide stubs.  Architectures will define HAVE_MEMORYLESS_NODES
if/when they support memoryless nodes.

Archs can override the definitions of:

numa_mem_id() - returns node number of the "local memory" node
set_numa_mem() - initialize [this cpu's] per-cpu variable 'numa_mem'
cpu_to_mem()  - return numa_mem for the specified cpu; may be used as an lvalue

Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
This will initialize the boot cpu at boot time, and all cpus on a change of
numa_zonelist_order or when node or memory hot-plug requires a zonelist
rebuild.  Archs that support memoryless nodes will need to initialize
'numa_mem' for secondary cpus as they're brought on-line.
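
A rough sketch of the API shape described above (simplified; the per-cpu
variable name and the accessors used here are assumptions, the exact
definitions in &lt;linux/topology.h&gt; may differ):

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
DECLARE_PER_CPU(int, _numa_mem_);	/* nearest node with memory */

static inline void set_numa_mem(int node)
{
	this_cpu_write(_numa_mem_, node);
}

static inline int numa_mem_id(void)
{
	return this_cpu_read(_numa_mem_);
}

/* macro so it can also be used as an lvalue */
#define cpu_to_mem(cpu)		per_cpu(_numa_mem_, (cpu))

#else	/* !CONFIG_HAVE_MEMORYLESS_NODES */

/* Stubs: the "local memory" node is simply the local node. */
static inline int numa_mem_id(void)
{
	return numa_node_id();
}

#define cpu_to_mem(cpu)		cpu_to_node(cpu)

#endif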

[akpm@linux-foundation.org: fix build]
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Eric Whitney &lt;eric.whitney@hp.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: "Luck, Tony" &lt;tony.luck@intel.com&gt;
Cc: Pekka Enberg &lt;penberg@cs.helsinki.fi&gt;
Cc: &lt;linux-arch@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mem-hotplug: fix potential race while building zonelist for new populated zone</title>
<updated>2010-05-25T15:07:02+00:00</updated>
<author>
<name>Haicheng Li</name>
<email>haicheng.li@linux.intel.com</email>
</author>
<published>2010-05-24T21:32:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4eaf3f64397c3db3c5785eee508270d62a9fabd9'/>
<id>4eaf3f64397c3db3c5785eee508270d62a9fabd9</id>
<content type='text'>
Add global mutex zonelists_mutex to fix the possible race:

     CPU0                                  CPU1                    CPU2
(1) zone-&gt;present_pages += online_pages;
(2)                                       build_all_zonelists();
(3)                                                               alloc_page();
(4)                                                               free_page();
(5) build_all_zonelists();
(6)   __build_all_zonelists();
(7)     zone-&gt;pageset = alloc_percpu();

In steps (3) and (4), zone-&gt;pageset still points to boot_pageset, so bad
things may happen if two or more nodes are in this state. Even if only one
node is accessing the boot_pageset, step (3) may still consume enough memory
to make the allocation in step (7) fail.

Besides, performing the rebuild atomically ensures that alloc_percpu() in
step (7) will never fail, since a fresh memory block has just been added in
step (6).
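
A minimal sketch of the serialization this introduces (the wrapper name is
hypothetical and for illustration only; the real change wires
zonelists_mutex through build_all_zonelists() and the memory hotplug path):

static DEFINE_MUTEX(zonelists_mutex);

/* Every zonelist rebuild for a newly populated zone runs under the mutex,
 * so steps (5)-(7) cannot race with another node that is still using the
 * shared boot_pageset. */
static void rebuild_zonelists_locked(struct zone *zone)
{
	mutex_lock(&amp;zonelists_mutex);
	build_all_zonelists(zone);	/* replaces zone-&gt;pageset under the lock */
	mutex_unlock(&amp;zonelists_mutex);
}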

[haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
Signed-off-by: Haicheng Li &lt;haicheng.li@linux.intel.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Reviewed-by: Andi Kleen &lt;andi.kleen@intel.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset</title>
<updated>2010-05-25T15:07:01+00:00</updated>
<author>
<name>Haicheng Li</name>
<email>haicheng.li@linux.intel.com</email>
</author>
<published>2010-05-24T21:32:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1f522509c77a5dea8dc384b735314f03908a6415'/>
<id>1f522509c77a5dea8dc384b735314f03908a6415</id>
<content type='text'>
For each newly populated zone of a hot-added node, we need to update its
pagesets with dynamically allocated per_cpu_pageset structs for all possible
CPUs:

    1) Detach zone-&gt;pageset from the shared boot_pageset
       at the end of __build_all_zonelists().

    2) Use a mutex to protect zone-&gt;pageset while it is still
       shared in onlined_pages().

Otherwise, multiple zones on different nodes would share the same
bootstrapping boot_pageset for the same CPU, which eventually causes the
kernel panic below:

  ------------[ cut here ]------------
  kernel BUG at mm/page_alloc.c:1239!
  invalid opcode: 0000 [#1] SMP
  ...
  Call Trace:
   [&lt;ffffffff811300c1&gt;] __alloc_pages_nodemask+0x131/0x7b0
   [&lt;ffffffff81162e67&gt;] alloc_pages_current+0x87/0xd0
   [&lt;ffffffff81128407&gt;] __page_cache_alloc+0x67/0x70
   [&lt;ffffffff811325f0&gt;] __do_page_cache_readahead+0x120/0x260
   [&lt;ffffffff81132751&gt;] ra_submit+0x21/0x30
   [&lt;ffffffff811329c6&gt;] ondemand_readahead+0x166/0x2c0
   [&lt;ffffffff81132ba0&gt;] page_cache_async_readahead+0x80/0xa0
   [&lt;ffffffff8112a0e4&gt;] generic_file_aio_read+0x364/0x670
   [&lt;ffffffff81266cfa&gt;] nfs_file_read+0xca/0x130
   [&lt;ffffffff8117b20a&gt;] do_sync_read+0xfa/0x140
   [&lt;ffffffff8117bf75&gt;] vfs_read+0xb5/0x1a0
   [&lt;ffffffff8117c151&gt;] sys_read+0x51/0x80
   [&lt;ffffffff8103c032&gt;] system_call_fastpath+0x16/0x1b
  RIP  [&lt;ffffffff8112ff13&gt;] get_page_from_freelist+0x883/0x900
   RSP &lt;ffff88000d1e78a8&gt;
  ---[ end trace 4bda28328b9990db ]
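
A sketch of step 1) above (simplified; the helper name and the callees are
assumptions based on the description, not the literal patch): give the
newly populated zone its own per-cpu pagesets so it stops sharing
boot_pageset with zones on other nodes.

static void setup_zone_pageset(struct zone *zone)
{
	int cpu;

	zone-&gt;pageset = alloc_percpu(struct per_cpu_pageset);
	for_each_possible_cpu(cpu) {
		struct per_cpu_pageset *pcp = per_cpu_ptr(zone-&gt;pageset, cpu);

		setup_pageset(pcp, zone_batchsize(zone));
	}
}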

[akpm@linux-foundation.org: merge fix]
Signed-off-by: Haicheng Li &lt;haicheng.li@linux.intel.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Reviewed-by: Andi Kleen &lt;andi.kleen@intel.com&gt;
Reviewed-by: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix NR_SECTION_ROOTS == 0 when using sparsemem extreme.</title>
<updated>2010-05-25T15:07:01+00:00</updated>
<author>
<name>Marcelo Roberto Jimenez</name>
<email>mroberto@cpti.cetuc.puc-rio.br</email>
</author>
<published>2010-05-24T21:32:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0faa56389c793cda7f967117415717bbab24fe4e'/>
<id>0faa56389c793cda7f967117415717bbab24fe4e</id>
<content type='text'>
Got this while compiling for ARM/SA1100:

mm/sparse.c: In function '__section_nr':
mm/sparse.c:135: warning: 'root' is used uninitialized in this function

This patch follows Russell King's suggestion for a new calculation for
NR_SECTION_ROOTS.  Thanks also to Sergei Shtylyov for pointing out the
existence of the macro DIV_ROUND_UP.
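
The resulting calculation is essentially the following (assumption: this
mirrors the patch, and NR_MEM_SECTIONS / SECTIONS_PER_ROOT are the usual
sparsemem constants):

#define NR_SECTION_ROOTS	DIV_ROUND_UP(NR_MEM_SECTIONS, SECTIONS_PER_ROOT)

rather than a plain NR_MEM_SECTIONS / SECTIONS_PER_ROOT, which truncates to
0 whenever NR_MEM_SECTIONS &lt; SECTIONS_PER_ROOT.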

Atsushi Nemoto observed:
: This fix doesn't just silence the warning - it fixes a real problem.
:
: Without this fix, mem_section[] might have 0 size so mem_section[0]
: will share other variable area.  For example, I got:
:
: c030c700 b __warned.16478
: c030c700 B mem_section
: c030c701 b __warned.16483
:
: This might cause very strange behavior.  Your patch actually fixes it.

Signed-off-by: Marcelo Roberto Jimenez &lt;mroberto@cpti.cetuc.puc-rio.br&gt;
Cc: Atsushi Nemoto &lt;anemo@mba.ocn.ne.jp&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Yinghai Lu &lt;yinghai@kernel.org&gt;
Cc: Sergei Shtylyov &lt;sshtylyov@mvista.com&gt;
Cc: Russell King &lt;rmk@arm.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: compaction: defer compaction using an exponential backoff when compaction fails</title>
<updated>2010-05-25T15:07:00+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mel@csn.ul.ie</email>
</author>
<published>2010-05-24T21:32:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4f92e2586b43a2402e116055d4edda704f911b5b'/>
<id>4f92e2586b43a2402e116055d4edda704f911b5b</id>
<content type='text'>
The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail.  There are two obvious reasons as to why:

  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone (1 &lt;&lt; compact_defer_shift) times.
If the next compaction attempt also fails, compact_defer_shift is
increased up to a maximum of 6.  If compaction succeeds, the defer
counters are reset again.
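
A sketch of the backoff bookkeeping (simplified; compact_defer_shift comes
from the description above, while compact_considered, the maximum and the
helper names are assumptions and may differ from the patch):

/* Do not skip compaction more than 1 &lt;&lt; 6 times in a row. */
#define COMPACT_MAX_DEFER_SHIFT 6

/* Called when compaction ran but the following allocation still failed. */
static inline void defer_compaction(struct zone *zone)
{
	zone-&gt;compact_considered = 0;
	if (++zone-&gt;compact_defer_shift &gt; COMPACT_MAX_DEFER_SHIFT)
		zone-&gt;compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* Returns true while the zone is still inside its backoff window. */
static inline bool compaction_deferred(struct zone *zone)
{
	unsigned long defer_limit = 1UL &lt;&lt; zone-&gt;compact_defer_shift;

	if (++zone-&gt;compact_considered &gt; defer_limit)
		zone-&gt;compact_considered = defer_limit;	/* avoid overflow */

	return zone-&gt;compact_considered &lt; defer_limit;
}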

The zone that is deferred is the first zone in the zonelist - i.e.  the
preferred zone.  To defer compaction in the other zones, the information
would need to be stored in the zonelist or implemented similar to the
zonelist_cache.  This would impact the fast-paths and is not justified at
this time.

Signed-off-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-next' into for-linus</title>
<updated>2010-03-08T15:55:37+00:00</updated>
<author>
<name>Jiri Kosina</name>
<email>jkosina@suse.cz</email>
</author>
<published>2010-03-08T15:55:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=318ae2edc3b29216abd8a2510f3f80b764f06858'/>
<id>318ae2edc3b29216abd8a2510f3f80b764f06858</id>
<content type='text'>
Conflicts:
	Documentation/filesystems/proc.txt
	arch/arm/mach-u300/include/mach/debug-macro.S
	drivers/net/qlge/qlge_ethtool.c
	drivers/net/qlge/qlge_main.c
	drivers/net/typhoon.c
</content>
</entry>
<entry>
<title>mm: restore zone-&gt;all_unreclaimable to independence word</title>
<updated>2010-03-06T19:26:25+00:00</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2010-03-05T21:41:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=93e4a89a8c987189b168a530a331ef6d0fcf07a7'/>
<id>93e4a89a8c987189b168a530a331ef6d0fcf07a7</id>
<content type='text'>
commit e815af95 ("change all_unreclaimable zone member to flags") changed
the all_unreclaimable member into a bit flag.  But that had an undesirable
side effect: free_one_page() is one of the hottest paths in the Linux
kernel, and increasing the number of atomic ops in it can reduce kernel
performance a bit.

Thus, this patch partially reverts that commit: at the least,
all_unreclaimable should not share a memory word with the other zone flags.
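
Two illustrative fragments of what this means in practice (not the literal
diff; the flag helper and bit name are assumptions): the flag moves back to
a plain member, so the free path writes it with an ordinary store instead
of an atomic bit operation.

	/* before: atomic bit op on the shared zone-&gt;flags word */
	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);

	/* after: plain store to a dedicated member */
	zone-&gt;all_unreclaimable = 0;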

[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Huang Shijie &lt;shijie8@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
