<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/linux/mempolicy.h, branch v3.7-rc2</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>UAPI: (Scripted) Disintegrate include/linux</title>
<updated>2012-10-13T09:46:48+00:00</updated>
<author>
<name>David Howells</name>
<email>dhowells@redhat.com</email>
</author>
<published>2012-10-13T09:46:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=607ca46e97a1b6594b29647d98a32d545c24bdff'/>
<id>607ca46e97a1b6594b29647d98a32d545c24bdff</id>
<content type='text'>
Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
Acked-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Acked-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Acked-by: Dave Jones &lt;davej@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
Acked-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Michael Kerrisk &lt;mtk.manpages@gmail.com&gt;
Acked-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Acked-by: Dave Jones &lt;davej@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mempolicy: fix a race in shared_policy_replace()</title>
<updated>2012-10-09T07:22:22+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-10-08T23:29:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b22d127a39ddd10d93deee3d96e643657ad53a49'/>
<id>b22d127a39ddd10d93deee3d96e643657ad53a49</id>
<content type='text'>
shared_policy_replace() use of sp_alloc() is unsafe.  1) sp_node cannot
be dereferenced if sp-&gt;lock is not held and 2) another thread can modify
sp_node between spin_unlock for allocating a new sp node and next
spin_lock.  The bug was introduced before 2.6.12-rc2.

Kosaki's original patch for this problem was to allocate an sp node and
policy within shared_policy_replace and initialise it when the lock is
reacquired.  I was not keen on this approach because it partially
duplicates sp_alloc().  As the paths were sp-&gt;lock is taken are not that
performance critical this patch converts sp-&gt;lock to sp-&gt;mutex so it can
sleep when calling sp_alloc().

[kosaki.motohiro@jp.fujitsu.com: Original patch]
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Reviewed-by: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Josh Boyer &lt;jwboyer@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
shared_policy_replace() use of sp_alloc() is unsafe.  1) sp_node cannot
be dereferenced if sp-&gt;lock is not held and 2) another thread can modify
sp_node between spin_unlock for allocating a new sp node and next
spin_lock.  The bug was introduced before 2.6.12-rc2.

Kosaki's original patch for this problem was to allocate an sp node and
policy within shared_policy_replace and initialise it when the lock is
reacquired.  I was not keen on this approach because it partially
duplicates sp_alloc().  As the paths were sp-&gt;lock is taken are not that
performance critical this patch converts sp-&gt;lock to sp-&gt;mutex so it can
sleep when calling sp_alloc().

[kosaki.motohiro@jp.fujitsu.com: Original patch]
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Reviewed-by: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Josh Boyer &lt;jwboyer@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: kill vma flag VM_RESERVED and mm-&gt;reserved_vm counter</title>
<updated>2012-10-09T07:22:19+00:00</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>khlebnikov@openvz.org</email>
</author>
<published>2012-10-08T23:29:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=314e51b9851b4f4e8ab302243ff5a6fc6147f379'/>
<id>314e51b9851b4f4e8ab302243ff5a6fc6147f379</id>
<content type='text'>
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes reserved_vm counter from mm_struct.  Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.

Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@openvz.org&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Carsten Otte &lt;cotte@de.ibm.com&gt;
Cc: Chris Metcalf &lt;cmetcalf@tilera.com&gt;
Cc: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Cc: Eric Paris &lt;eparis@redhat.com&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: James Morris &lt;james.l.morris@oracle.com&gt;
Cc: Jason Baron &lt;jbaron@redhat.com&gt;
Cc: Kentaro Takeda &lt;takedakn@nttdata.co.jp&gt;
Cc: Matt Helsley &lt;matthltc@us.ibm.com&gt;
Cc: Nick Piggin &lt;npiggin@kernel.dk&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Robert Richter &lt;robert.richter@amd.com&gt;
Cc: Suresh Siddha &lt;suresh.b.siddha@intel.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Venkatesh Pallipadi &lt;venki@google.com&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes reserved_vm counter from mm_struct.  Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.

Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@openvz.org&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Carsten Otte &lt;cotte@de.ibm.com&gt;
Cc: Chris Metcalf &lt;cmetcalf@tilera.com&gt;
Cc: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Cc: Eric Paris &lt;eparis@redhat.com&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: James Morris &lt;james.l.morris@oracle.com&gt;
Cc: Jason Baron &lt;jbaron@redhat.com&gt;
Cc: Kentaro Takeda &lt;takedakn@nttdata.co.jp&gt;
Cc: Matt Helsley &lt;matthltc@us.ibm.com&gt;
Cc: Nick Piggin &lt;npiggin@kernel.dk&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Robert Richter &lt;robert.richter@amd.com&gt;
Cc: Suresh Siddha &lt;suresh.b.siddha@intel.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Venkatesh Pallipadi &lt;venki@google.com&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>slab/mempolicy: always use local policy from interrupt context</title>
<updated>2012-06-20T07:01:04+00:00</updated>
<author>
<name>Andi Kleen</name>
<email>ak@linux.intel.com</email>
</author>
<published>2012-06-09T09:40:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e7b691b085fda913830e5280ae6f724b2a63c824'/>
<id>e7b691b085fda913830e5280ae6f724b2a63c824</id>
<content type='text'>
slab_node() could access current-&gt;mempolicy from interrupt context.
However there's a race condition during exit where the mempolicy
is first freed and then the pointer zeroed.

Using this from interrupts seems bogus anyways. The interrupt
will interrupt a random process and therefore get a random
mempolicy. Many times, this will be idle's, which noone can change.

Just disable this here and always use local for slab
from interrupts. I also cleaned up the callers of slab_node a bit
which always passed the same argument.

I believe the original mempolicy code did that in fact,
so it's likely a regression.

v2: send version with correct logic
v3: simplify. fix typo.
Reported-by: Arun Sharma &lt;asharma@fb.com&gt;
Cc: penberg@kernel.org
Cc: cl@linux.com
Signed-off-by: Andi Kleen &lt;ak@linux.intel.com&gt;
[tdmackey@twitter.com: Rework control flow based on feedback from
cl@linux.com, fix logic, and cleanup current task_struct reference]
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: David Mackey &lt;tdmackey@twitter.com&gt;
Signed-off-by: Pekka Enberg &lt;penberg@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
slab_node() could access current-&gt;mempolicy from interrupt context.
However there's a race condition during exit where the mempolicy
is first freed and then the pointer zeroed.

Using this from interrupts seems bogus anyways. The interrupt
will interrupt a random process and therefore get a random
mempolicy. Many times, this will be idle's, which noone can change.

Just disable this here and always use local for slab
from interrupts. I also cleaned up the callers of slab_node a bit
which always passed the same argument.

I believe the original mempolicy code did that in fact,
so it's likely a regression.

v2: send version with correct logic
v3: simplify. fix typo.
Reported-by: Arun Sharma &lt;asharma@fb.com&gt;
Cc: penberg@kernel.org
Cc: cl@linux.com
Signed-off-by: Andi Kleen &lt;ak@linux.intel.com&gt;
[tdmackey@twitter.com: Rework control flow based on feedback from
cl@linux.com, fix logic, and cleanup current task_struct reference]
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: David Mackey &lt;tdmackey@twitter.com&gt;
Signed-off-by: Pekka Enberg &lt;penberg@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: do_migrate_pages(): rename arguments</title>
<updated>2012-05-29T23:22:20+00:00</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@linux-foundation.org</email>
</author>
<published>2012-05-29T22:06:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0ce72d4f7333248efbef1f3309770c7edb1b2625'/>
<id>0ce72d4f7333248efbef1f3309770c7edb1b2625</id>
<content type='text'>
s/from_nodes/from and s/to_nodes/to/.  The "_nodes" is redundant - it
duplicates the argument's type.

Done in a fit of irritation over 80-col issues :(

Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: KOSAKI Motohiro &lt;mkosaki@redhat.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
s/from_nodes/from and s/to_nodes/to/.  The "_nodes" is redundant - it
duplicates the argument's type.

Done in a fit of irritation over 80-col issues :(

Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: KOSAKI Motohiro &lt;mkosaki@redhat.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/mempolicy.c: mpol_equal(): use bool</title>
<updated>2012-01-11T00:30:45+00:00</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2012-01-10T23:08:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=fcfb4dcc9698f932836aa63ba0d82e7dbd300fb3'/>
<id>fcfb4dcc9698f932836aa63ba0d82e7dbd300fb3</id>
<content type='text'>
mpol_equal() logically returns a boolean.  Use a bool type to slightly
improve readability.

Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Stephen Wilson &lt;wilsons@start.ca&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
mpol_equal() logically returns a boolean.  Use a bool type to slightly
improve readability.

Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Stephen Wilson &lt;wilsons@start.ca&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: declare mpol_to_str() when CONFIG_TMPFS=n</title>
<updated>2011-05-25T15:39:34+00:00</updated>
<author>
<name>Stephen Wilson</name>
<email>wilsons@start.ca</email>
</author>
<published>2011-05-25T00:12:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=13057efb0a0063eb8042d99093ec88a52c4f1593'/>
<id>13057efb0a0063eb8042d99093ec88a52c4f1593</id>
<content type='text'>
When CONFIG_TMPFS=n mpol_to_str() is not declared in mempolicy.h.
However, in the NUMA case, the definition is always compiled.

Since it is not strictly true that tmpfs is the only client, and since the
symbol was always lurking around anyways, export mpol_to_str()
unconditionally.  Furthermore, this will allow us to move show_numa_map()
out of mempolicy.c and into the procfs subsystem.

Signed-off-by: Stephen Wilson &lt;wilsons@start.ca&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Randy Dunlap &lt;rdunlap@xenotime.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When CONFIG_TMPFS=n mpol_to_str() is not declared in mempolicy.h.
However, in the NUMA case, the definition is always compiled.

Since it is not strictly true that tmpfs is the only client, and since the
symbol was always lurking around anyways, export mpol_to_str()
unconditionally.  Furthermore, this will allow us to move show_numa_map()
out of mempolicy.c and into the procfs subsystem.

Signed-off-by: Stephen Wilson &lt;wilsons@start.ca&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Randy Dunlap &lt;rdunlap@xenotime.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: export get_vma_policy()</title>
<updated>2011-05-25T15:39:32+00:00</updated>
<author>
<name>Stephen Wilson</name>
<email>wilsons@start.ca</email>
</author>
<published>2011-05-25T00:12:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d98f6cb67fb5b9376d4957d7ba9f32eac35c2e08'/>
<id>d98f6cb67fb5b9376d4957d7ba9f32eac35c2e08</id>
<content type='text'>
In commit 48fce3429d ("mempolicies: unexport get_vma_policy()")
get_vma_policy() was marked static as all clients were local to
mempolicy.c.

However, the decision to generate /proc/pid/numa_maps in the numa memory
policy code and outside the procfs subsystem introduces an artificial
interdependency between the two systems.  Exporting get_vma_policy() once
again is the first step to clean up this interdependency.

Signed-off-by: Stephen Wilson &lt;wilsons@start.ca&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In commit 48fce3429d ("mempolicies: unexport get_vma_policy()")
get_vma_policy() was marked static as all clients were local to
mempolicy.c.

However, the decision to generate /proc/pid/numa_maps in the numa memory
policy code and outside the procfs subsystem introduces an artificial
interdependency between the two systems.  Exporting get_vma_policy() once
again is the first step to clean up this interdependency.

Signed-off-by: Stephen Wilson &lt;wilsons@start.ca&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>oom: select task from tasklist for mempolicy ooms</title>
<updated>2010-08-10T03:44:56+00:00</updated>
<author>
<name>David Rientjes</name>
<email>rientjes@google.com</email>
</author>
<published>2010-08-10T00:18:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6f48d0ebd907ae419387f27b602ee98870cfa7bb'/>
<id>6f48d0ebd907ae419387f27b602ee98870cfa7bb</id>
<content type='text'>
The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score.  This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score.  This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mempolicy: restructure rebinding-mempolicy functions</title>
<updated>2010-05-25T15:06:57+00:00</updated>
<author>
<name>Miao Xie</name>
<email>miaox@cn.fujitsu.com</email>
</author>
<published>2010-05-24T21:32:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=708c1bbc9d0c3e57f40501794d9b0eed29d10fce'/>
<id>708c1bbc9d0c3e57f40501794d9b0eed29d10fce</id>
<content type='text'>
Nick Piggin reported that the allocator may see an empty nodemask when
changing cpuset's mems[1].  It happens only on the kernel that do not do
atomic nodemask_t stores.  (MAX_NUMNODES &gt; BITS_PER_LONG)

But I found that there is also a problem on the kernel that can do atomic
nodemask_t stores.  The problem is that the allocator can't find a node to
alloc page when changing cpuset's mems though there is a lot of free
memory.  The reason is like this:

(mpol: mempolicy)
	task1			task1's mpol	task2
	alloc page		1
	  alloc on node0? NO	1
				1		change mems from 1 to 0
				1		rebind task1's mpol
				0-1		  set new bits
				0	  	  clear disallowed bits
	  alloc on node1? NO	0
	  ...
	can't alloc page
	  goto oom

I can use the attached program reproduce it by the following step:

# mkdir /dev/cpuset
# mount -t cpuset cpuset /dev/cpuset
# mkdir /dev/cpuset/1
# echo `cat /dev/cpuset/cpus` &gt; /dev/cpuset/1/cpus
# echo `cat /dev/cpuset/mems` &gt; /dev/cpuset/1/mems
# echo $$ &gt; /dev/cpuset/1/tasks
# numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog &lt;nr_tasks&gt; &amp;
   &lt;nr_tasks&gt; = max(nr_cpus - 1, 1)
# killall -s SIGUSR1 cpuset_mem_hog
# ./change_mems.sh

several hours later, oom will happen though there is a lot of free memory.

This patchset fixes this problem by expanding the nodes range first(set
newly allowed bits) and shrink it lazily(clear newly disallowed bits).  So
we use a variable to tell the write-side task that read-side task is
reading nodemask, and the write-side task clears newly disallowed nodes
after read-side task ends the current memory allocation.

This patch:

In order to fix no node to alloc memory, when we want to update mempolicy
and mems_allowed, we expand the set of nodes first (set all the newly
nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the
mempolicy's rebind functions may breaks the expanding.

So we restructure the mempolicy's rebind functions and split the rebind
work to two steps, just like the update of cpuset's mems: The 1st step:
expand the set of the mempolicy's nodes.  The 2nd step: shrink the set of
the mempolicy's nodes.  It is used when there is no real lock to protect
the mempolicy in the read-side.  Otherwise we can do rebind work at once.

In order to implement it, we define

	enum mpol_rebind_step {
		MPOL_REBIND_ONCE,
		MPOL_REBIND_STEP1,
		MPOL_REBIND_STEP2,
		MPOL_REBIND_NSTEP,
	};

If the mempolicy needn't be updated by two steps, we can pass
MPOL_REBIND_ONCE to the rebind functions.  Or we can pass
MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
MPOL_REBIND_STEP2 to do the second step work.

Besides that, it maybe long time between these two step and we have to
release the lock that protects mempolicy and mems_allowed.  If we hold the
lock once again, we must check whether the current mempolicy is under the
rebinding (the first step has been done) or not, because the task may
alloc a new mempolicy when we don't hold the lock.  So we defined the
following flag to identify it:

#define MPOL_F_REBINDING (1 &lt;&lt; 2)

The new functions will be used in the next patch.

Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Paul Menage &lt;menage@google.com&gt;
Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Hugh Dickins &lt;hugh.dickins@tiscali.co.uk&gt;
Cc: Ravikiran Thirumalai &lt;kiran@scalex86.org&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Nick Piggin reported that the allocator may see an empty nodemask when
changing cpuset's mems[1].  It happens only on the kernel that do not do
atomic nodemask_t stores.  (MAX_NUMNODES &gt; BITS_PER_LONG)

But I found that there is also a problem on the kernel that can do atomic
nodemask_t stores.  The problem is that the allocator can't find a node to
alloc page when changing cpuset's mems though there is a lot of free
memory.  The reason is like this:

(mpol: mempolicy)
	task1			task1's mpol	task2
	alloc page		1
	  alloc on node0? NO	1
				1		change mems from 1 to 0
				1		rebind task1's mpol
				0-1		  set new bits
				0	  	  clear disallowed bits
	  alloc on node1? NO	0
	  ...
	can't alloc page
	  goto oom

I can use the attached program reproduce it by the following step:

# mkdir /dev/cpuset
# mount -t cpuset cpuset /dev/cpuset
# mkdir /dev/cpuset/1
# echo `cat /dev/cpuset/cpus` &gt; /dev/cpuset/1/cpus
# echo `cat /dev/cpuset/mems` &gt; /dev/cpuset/1/mems
# echo $$ &gt; /dev/cpuset/1/tasks
# numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog &lt;nr_tasks&gt; &amp;
   &lt;nr_tasks&gt; = max(nr_cpus - 1, 1)
# killall -s SIGUSR1 cpuset_mem_hog
# ./change_mems.sh

several hours later, oom will happen though there is a lot of free memory.

This patchset fixes this problem by expanding the nodes range first(set
newly allowed bits) and shrink it lazily(clear newly disallowed bits).  So
we use a variable to tell the write-side task that read-side task is
reading nodemask, and the write-side task clears newly disallowed nodes
after read-side task ends the current memory allocation.

This patch:

In order to fix no node to alloc memory, when we want to update mempolicy
and mems_allowed, we expand the set of nodes first (set all the newly
nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the
mempolicy's rebind functions may breaks the expanding.

So we restructure the mempolicy's rebind functions and split the rebind
work to two steps, just like the update of cpuset's mems: The 1st step:
expand the set of the mempolicy's nodes.  The 2nd step: shrink the set of
the mempolicy's nodes.  It is used when there is no real lock to protect
the mempolicy in the read-side.  Otherwise we can do rebind work at once.

In order to implement it, we define

	enum mpol_rebind_step {
		MPOL_REBIND_ONCE,
		MPOL_REBIND_STEP1,
		MPOL_REBIND_STEP2,
		MPOL_REBIND_NSTEP,
	};

If the mempolicy needn't be updated by two steps, we can pass
MPOL_REBIND_ONCE to the rebind functions.  Or we can pass
MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
MPOL_REBIND_STEP2 to do the second step work.

Besides that, it maybe long time between these two step and we have to
release the lock that protects mempolicy and mems_allowed.  If we hold the
lock once again, we must check whether the current mempolicy is under the
rebinding (the first step has been done) or not, because the task may
alloc a new mempolicy when we don't hold the lock.  So we defined the
following flag to identify it:

#define MPOL_F_REBINDING (1 &lt;&lt; 2)

The new functions will be used in the next patch.

Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Paul Menage &lt;menage@google.com&gt;
Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Hugh Dickins &lt;hugh.dickins@tiscali.co.uk&gt;
Cc: Ravikiran Thirumalai &lt;kiran@scalex86.org&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
