<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/mm/mlock.c, branch v2.6.28.7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>mm: fix error case in mlock downgrade reversion</title>
<updated>2009-02-12T17:50:36+00:00</updated>
<author>
<name>Hugh Dickins</name>
<email>hugh@veritas.com</email>
</author>
<published>2009-02-08T20:56:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7bf97ba5bb635e84e2092d4de4092dcf76ad82a0'/>
<id>7bf97ba5bb635e84e2092d4de4092dcf76ad82a0</id>
<content type='text'>
commit d5b562330ec766292a3ac54ae5e0673610bd5b3d upstream.

Commit 27421e211a39784694b597dbf35848b88363c248, Manually revert
"mlock: downgrade mmap sem while populating mlocked regions", has
introduced its own regression: __mlock_vma_pages_range() may report
an error (for example, -EFAULT from trying to lock down pages from
beyond EOF), but mlock_vma_pages_range() must hide that from its
callers as before.
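
The shape of the fix is roughly the following (an illustrative fragment
with the arguments simplified, not the exact upstream hunk):

        long ret = __mlock_vma_pages_range(vma, start, end);

        /*
         * __mlock_vma_pages_range() may fail (e.g. -EFAULT when the range
         * extends past EOF), but mlock_vma_pages_range() must not leak
         * that error to its callers, so mask it before returning.
         */
        if (ret &lt; 0)
                ret = 0;
        return ret;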

Reported-by: Sami Farin &lt;safari-kernel@safari.iki.fi&gt;
Signed-off-by: Hugh Dickins &lt;hugh@veritas.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;

</content>
</entry>
<entry>
<title>Manually revert "mlock: downgrade mmap sem while populating mlocked regions"</title>
<updated>2009-02-06T21:47:16+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2009-02-01T19:00:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a473fe79d2e74f0697969f003fe503c590376a2c'/>
<id>a473fe79d2e74f0697969f003fe503c590376a2c</id>
<content type='text'>
commit 27421e211a39784694b597dbf35848b88363c248 upstream.

This essentially reverts commit 8edb08caf68184fb170f4f69c7445929e199eaea.

It downgraded our mmap semaphore to a read-lock while mlocking pages, in
order to allow other threads (and external accesses like "ps" et al) to
walk the vma lists and take page faults etc.  Which is a nice idea, but
the implementation does not work.

Because we cannot upgrade the lock back to a write lock without
releasing the mmap semaphore, the code had to release the lock entirely
and then re-take it as a write lock.  However, that meant that the caller
possibly lost the vma chain that it was following, since now another
thread could come in and mmap/munmap the range.

The code tried to work around that by just looking up the vma again and
erroring out if that happened, but quite frankly, that was just a buggy
hack that doesn't actually protect against anything (the other thread
could just have replaced the vma with another one instead of totally
unmapping it).

The only way to downgrade to a read lock _reliably_ is to do it at the
end, which is likely the right thing to do: do all the 'vma' operations
with the write-lock held, then downgrade to a read after completing them
all, and then do the "populate the newly mlocked regions" while holding
just the read lock.  And then just drop the read-lock and return to user
space.
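
With the usual rwsem primitives the ordering would look roughly like
this (an illustrative sketch with names and signatures simplified; this
is not code from the tree):

        down_write(&amp;mm-&gt;mmap_sem);

        /* all vma merge/split work happens under the write lock */
        error = mlock_fixup(vma, &amp;prev, start, end, newflags);

        /* only after every vma operation is done do we downgrade ... */
        downgrade_write(&amp;mm-&gt;mmap_sem);

        /* ... and populate the newly mlocked region under the read lock */
        if (!error)
                __mlock_vma_pages_range(vma, start, end);

        up_read(&amp;mm-&gt;mmap_sem);
        return error;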

The (perhaps somewhat simpler) alternative is to just make all the
callers of mlock_vma_pages_range() know that the mmap lock got dropped,
and just re-grab the mmap semaphore if it needs to mlock more than one
vma region.

So we can do this "downgrade mmap sem while populating mlocked regions"
thing right, but the way it was done here was absolutely not correct.
Thus the revert, in the expectation that we will do it all correctly
some day.

Cc: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>System call wrappers part 14</title>
<updated>2009-01-18T18:43:56+00:00</updated>
<author>
<name>Heiko Carstens</name>
<email>heiko.carstens@de.ibm.com</email>
</author>
<published>2009-01-14T13:14:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=eb728e5bd36c757f3086ef17bd2d62c5d46d3dc4'/>
<id>eb728e5bd36c757f3086ef17bd2d62c5d46d3dc4</id>
<content type='text'>
commit 3480b25743cb7404928d57efeaa3d085708b04c2 upstream.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</content>
</entry>
<entry>
<title>System call wrappers part 13</title>
<updated>2009-01-18T18:43:56+00:00</updated>
<author>
<name>Heiko Carstens</name>
<email>heiko.carstens@de.ibm.com</email>
</author>
<published>2009-01-14T13:14:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0bba72e5cf1d17a28052ecfd3c49e815b573560a'/>
<id>0bba72e5cf1d17a28052ecfd3c49e815b573560a</id>
<content type='text'>
commit 6a6160a7b5c27b3c38651baef92a14fa7072b3c1 upstream.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</content>
</entry>
<entry>
<title>uninitialized return value in mm/mlock.c: __mlock_vma_pages_range()</title>
<updated>2008-11-16T23:55:36+00:00</updated>
<author>
<name>Helge Deller</name>
<email>deller@gmx.de</email>
</author>
<published>2008-11-16T23:30:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=72eb8c6747b49e41fd2b042510f03ac7c13426fc'/>
<id>72eb8c6747b49e41fd2b042510f03ac7c13426fc</id>
<content type='text'>
Fix an uninitialized return value when compiling on parisc (with CONFIG_UNEVICTABLE_LRU=y):
	mm/mlock.c: In function `__mlock_vma_pages_range':
	mm/mlock.c:165: warning: `ret' might be used uninitialized in this function
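
The change essentially amounts to giving the local a defined value:

        -       int ret;
        +       int ret = 0;    /* defined result even for an (impossible) empty range */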

Signed-off-by: Helge Deller &lt;deller@gmx.de&gt;
[ It isn't ever really used uninitialized, since no caller should ever
  call this function with an empty range.  But the compiler is correct
  that from a local analysis standpoint that is impossible to see, and
  fixing the warning is appropriate.  ]
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove lru_add_drain_all() from the munlock path</title>
<updated>2008-11-13T01:17:16+00:00</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2008-11-12T21:26:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8891d6da17db0f9bb507d3a017f130b9970c3087'/>
<id>8891d6da17db0f9bb507d3a017f130b9970c3087</id>
<content type='text'>
lockdep emits the warning below at boot time on one of my test machines:
schedule_on_each_cpu() shouldn't be called while the task holds mmap_sem.

lru_add_drain_all() exists to keep unevictable pages from staying on the
reclaimable LRU lists, but the current unevictable code can rescue such
pages even when they remain on a reclaimable list.

So removing the call is better.

In addition, this patch adds lru_add_drain_all() to sys_mlock() and
sys_mlockall().  It isn't strictly required, but it reduces failures to
move pages onto the unevictable list; such failures can still be rescued
by vmscan later, but fewer of them is better.

Note: if the above rescuing happens, the Mlocked and Unevictable fields
in /proc/meminfo can mismatch, but that doesn't cause any real trouble.
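
In other words, the drain moves out of the page walk (which runs under
mmap_sem) and up into the syscall entry points, before the semaphore is
taken.  Roughly, assuming mm/mlock.c's do_mlock() helper and with the
rlimit handling elided (an illustrative sketch, not the exact hunk):

        asmlinkage long sys_mlock(unsigned long start, size_t len)
        {
                int error;

                if (!can_do_mlock())
                        return -EPERM;

                lru_add_drain_all();    /* drain pagevecs before, not under, mmap_sem */

                down_write(&amp;current-&gt;mm-&gt;mmap_sem);
                error = do_mlock(start, len, 1);
                up_write(&amp;current-&gt;mm-&gt;mmap_sem);

                return error;
        }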

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.28-rc2-mm1 #2
-------------------------------------------------------
lvm/1103 is trying to acquire lock:
 (&amp;cpu_hotplug.lock){--..}, at: [&lt;c0130789&gt;] get_online_cpus+0x29/0x50

but task is already holding lock:
 (&amp;mm-&gt;mmap_sem){----}, at: [&lt;c01878ae&gt;] sys_mlockall+0x4e/0xb0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-&gt; #3 (&amp;mm-&gt;mmap_sem){----}:
       [&lt;c0153da2&gt;] check_noncircular+0x82/0x110
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0156161&gt;] validate_chain+0xb11/0x1070
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0			(*) grab mmap_sem
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0185e9b&gt;] might_fault+0x7b/0xa0
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0294dd0&gt;] copy_to_user+0x30/0x60
       [&lt;c01ae3ec&gt;] filldir+0x7c/0xd0
       [&lt;c01e3a6a&gt;] sysfs_readdir+0x11a/0x1f0			(*) grab sysfs_mutex
       [&lt;c01ae370&gt;] filldir+0x0/0xd0
       [&lt;c01ae370&gt;] filldir+0x0/0xd0
       [&lt;c01ae4c6&gt;] vfs_readdir+0x86/0xa0			(*) grab i_mutex
       [&lt;c01ae75b&gt;] sys_getdents+0x6b/0xc0
       [&lt;c010355a&gt;] syscall_call+0x7/0xb
       [&lt;ffffffff&gt;] 0xffffffff

-&gt; #2 (sysfs_mutex){--..}:
       [&lt;c0153da2&gt;] check_noncircular+0x82/0x110
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c0156161&gt;] validate_chain+0xb11/0x1070
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0			(*) grab sysfs_mutex
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c04f8b55&gt;] mutex_lock_nested+0xa5/0x2f0
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c01e422f&gt;] create_dir+0x3f/0x90
       [&lt;c01e42a9&gt;] sysfs_create_dir+0x29/0x50
       [&lt;c04faaf5&gt;] _spin_unlock+0x25/0x40
       [&lt;c028f21d&gt;] kobject_add_internal+0xcd/0x1a0
       [&lt;c028f37a&gt;] kobject_set_name_vargs+0x3a/0x50
       [&lt;c028f41d&gt;] kobject_init_and_add+0x2d/0x40
       [&lt;c019d4d2&gt;] sysfs_slab_add+0xd2/0x180
       [&lt;c019d580&gt;] sysfs_add_func+0x0/0x70
       [&lt;c019d5dc&gt;] sysfs_add_func+0x5c/0x70			(*) grab slub_lock
       [&lt;c01400f2&gt;] run_workqueue+0x172/0x200
       [&lt;c014008f&gt;] run_workqueue+0x10f/0x200
       [&lt;c0140bd0&gt;] worker_thread+0x0/0xf0
       [&lt;c0140c6c&gt;] worker_thread+0x9c/0xf0
       [&lt;c0143c80&gt;] autoremove_wake_function+0x0/0x50
       [&lt;c0140bd0&gt;] worker_thread+0x0/0xf0
       [&lt;c0143972&gt;] kthread+0x42/0x70
       [&lt;c0143930&gt;] kthread+0x0/0x70
       [&lt;c01042db&gt;] kernel_thread_helper+0x7/0x1c
       [&lt;ffffffff&gt;] 0xffffffff

-&gt; #1 (slub_lock){----}:
       [&lt;c0153d2d&gt;] check_noncircular+0xd/0x110
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c0156161&gt;] validate_chain+0xb11/0x1070
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c015433d&gt;] mark_lock+0x35d/0xd00
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c04f93a3&gt;] down_read+0x43/0x80
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0		(*) grab slub_lock
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c04fd9ac&gt;] notifier_call_chain+0x3c/0x70
       [&lt;c04f5454&gt;] _cpu_up+0x84/0x110
       [&lt;c04f552b&gt;] cpu_up+0x4b/0x70				(*) grab cpu_hotplug.lock
       [&lt;c06d1530&gt;] kernel_init+0x0/0x170
       [&lt;c06d15e5&gt;] kernel_init+0xb5/0x170
       [&lt;c06d1530&gt;] kernel_init+0x0/0x170
       [&lt;c01042db&gt;] kernel_thread_helper+0x7/0x1c
       [&lt;ffffffff&gt;] 0xffffffff

-&gt; #0 (&amp;cpu_hotplug.lock){--..}:
       [&lt;c0155bff&gt;] validate_chain+0x5af/0x1070
       [&lt;c040f7e0&gt;] dev_status+0x0/0x50
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
       [&lt;c04f8b55&gt;] mutex_lock_nested+0xa5/0x2f0
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
       [&lt;c017bc30&gt;] lru_add_drain_per_cpu+0x0/0x10
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50			(*) grab cpu_hotplug.lock
       [&lt;c0140cf2&gt;] schedule_on_each_cpu+0x32/0xe0
       [&lt;c0187095&gt;] __mlock_vma_pages_range+0x85/0x2c0
       [&lt;c0156945&gt;] __lock_acquire+0x285/0xa10
       [&lt;c0188f09&gt;] vma_merge+0xa9/0x1d0
       [&lt;c0187450&gt;] mlock_fixup+0x180/0x200
       [&lt;c0187548&gt;] do_mlockall+0x78/0x90			(*) grab mmap_sem
       [&lt;c01878e1&gt;] sys_mlockall+0x81/0xb0
       [&lt;c010355a&gt;] syscall_call+0x7/0xb
       [&lt;ffffffff&gt;] 0xffffffff

other info that might help us debug this:

1 lock held by lvm/1103:
 #0:  (&amp;mm-&gt;mmap_sem){----}, at: [&lt;c01878ae&gt;] sys_mlockall+0x4e/0xb0

stack backtrace:
Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2
Call Trace:
 [&lt;c01555fc&gt;] print_circular_bug_tail+0x7c/0xd0
 [&lt;c0155bff&gt;] validate_chain+0x5af/0x1070
 [&lt;c040f7e0&gt;] dev_status+0x0/0x50
 [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
 [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c04f8b55&gt;] mutex_lock_nested+0xa5/0x2f0
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c017bc30&gt;] lru_add_drain_per_cpu+0x0/0x10
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c0140cf2&gt;] schedule_on_each_cpu+0x32/0xe0
 [&lt;c0187095&gt;] __mlock_vma_pages_range+0x85/0x2c0
 [&lt;c0156945&gt;] __lock_acquire+0x285/0xa10
 [&lt;c0188f09&gt;] vma_merge+0xa9/0x1d0
 [&lt;c0187450&gt;] mlock_fixup+0x180/0x200
 [&lt;c0187548&gt;] do_mlockall+0x78/0x90
 [&lt;c01878e1&gt;] sys_mlockall+0x81/0xb0
 [&lt;c010355a&gt;] syscall_call+0x7/0xb

Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Tested-by: Kamalesh Babulal &lt;kamalesh@linux.vnet.ibm.com&gt;
Cc: Lee Schermerhorn &lt;Lee.Schermerhorn@hp.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Nick Piggin &lt;nickpiggin@yahoo.com.au&gt;
Cc: Hugh Dickins &lt;hugh@veritas.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mlock: make mlock error return Posixly Correct</title>
<updated>2008-10-20T15:52:31+00:00</updated>
<author>
<name>Lee Schermerhorn</name>
<email>lee.schermerhorn@hp.com</email>
</author>
<published>2008-10-19T03:26:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9978ad583e100945b74e4f33e73317983ea32df9'/>
<id>9978ad583e100945b74e4f33e73317983ea32df9</id>
<content type='text'>
Rework POSIX error return for mlock().

POSIX requires, for some conditions, error codes from the mlock*() system
calls that differ from what kernel low-level functions, such as
get_user_pages(), return for those conditions.  For more info, see:

http://marc.info/?l=linux-kernel&amp;m=121750892930775&amp;w=2

This patch provides the same translation of get_user_pages() error codes
to POSIX-specified error codes in the context of the mlock rework for the
unevictable LRU.
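
The translation can be concentrated in one small helper on the mlock
fault path; roughly (a sketch of the idea, not necessarily the exact
code):

        static int __mlock_posix_error_return(long retval)
        {
                /*
                 * get_user_pages() reports -EFAULT for an unmapped hole, but
                 * POSIX wants mlock() to return ENOMEM there; -ENOMEM from
                 * the fault path maps to EAGAIN ("could not be locked now").
                 */
                if (retval == -EFAULT)
                        retval = -ENOMEM;
                else if (retval == -ENOMEM)
                        retval = -EAGAIN;
                return retval;
        }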

[akpm@linux-foundation.org: fix build]
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vmstat: mlocked pages statistics</title>
<updated>2008-10-20T15:52:31+00:00</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@suse.de</email>
</author>
<published>2008-10-19T03:26:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5344b7e648980cc2ca613ec03a56a8222ff48820'/>
<id>5344b7e648980cc2ca613ec03a56a8222ff48820</id>
<content type='text'>
Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).
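
Conceptually the counter is bumped whenever a page is newly marked
mlocked and pulled off the LRU, e.g. (a sketch; TestSetPageMlocked()
and the UNEVICTABLE_PGMLOCKED event come from this patch series):

        if (!TestSetPageMlocked(page)) {
                /* first time this page is marked mlocked: account it */
                inc_zone_page_state(page, NR_MLOCK);
                count_vm_event(UNEVICTABLE_PGMLOCKED);
        }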

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.

[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mmap: handle mlocked pages during map, remap, unmap</title>
<updated>2008-10-20T15:52:31+00:00</updated>
<author>
<name>Rik van Riel</name>
<email>riel@redhat.com</email>
</author>
<published>2008-10-19T03:26:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=ba470de43188cdbff795b5da43a1474523c6c2fb'/>
<id>ba470de43188cdbff795b5da43a1474523c6c2fb</id>
<content type='text'>
Originally by Nick Piggin &lt;npiggin@suse.de&gt;

Remove mlocked pages from the LRU using the "unevictable infrastructure"
during mmap(), munmap(), mremap() and truncate().  Try to move pages back
to the normal LRU lists on munmap() when the last mlocked mapping is
removed.  Clear the PageMlocked() status when a page is truncated from
its file.
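
For the truncate case, for instance, the idea is simply to drop the
mlock state as the page leaves its file (a sketch using the
clear_page_mlock() helper from this series; the exact call sites may
differ):

        /*
         * A page truncated from its file can no longer be mlocked through
         * any mapping, so push it back towards the normal LRU.
         */
        if (unlikely(PageMlocked(page)))
                clear_page_mlock(page);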

[akpm@linux-foundation.org: cleanup]
[kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
[kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
[lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
[akpm@linux-foundation.org: remove bogus kerneldoc token]
Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: KAMEZAWA Hiroyuki &lt;kamewzawa.hiroyu@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mlock: downgrade mmap sem while populating mlocked regions</title>
<updated>2008-10-20T15:52:31+00:00</updated>
<author>
<name>Lee Schermerhorn</name>
<email>lee.schermerhorn@hp.com</email>
</author>
<published>2008-10-19T03:26:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8edb08caf68184fb170f4f69c7445929e199eaea'/>
<id>8edb08caf68184fb170f4f69c7445929e199eaea</id>
<content type='text'>
We need to hold the mmap_sem for write to initiate mlock()/munlock()
because we may need to merge/split vmas.  However, this can lead to very
long lock hold times attempting to fault in a large memory region to mlock
it into memory.  This can hold off other faults against the mm
[multithreaded tasks] and other scans of the mm, such as via /proc.  To
alleviate this, downgrade the mmap_sem to read mode during the population
of the region for locking.  This is especially the case if we need to
reclaim memory to lock down the region.  We [probably?] don't need to do
this for unlocking as all of the pages should be resident--they're already
mlocked.

Now, the callers of the mlock functions [mlock_fixup() and
mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode.
Changing all callers appears to be way too much effort at this point.
So, restore write mode before returning.  Note that this opens a window
where the mmap list could change in a multithreaded process.  So, at least
for mlock_fixup(), where we could be called in a loop over multiple vmas,
we check that a vma still exists at the start address and that vma still
covers the page range [start,end).  If not, we return an error, -EAGAIN,
and let the caller deal with it.
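
The dance described above looks roughly like this (an illustrative
sketch with names simplified, not the exact patch):

        downgrade_write(&amp;mm-&gt;mmap_sem); /* let readers in while we fault pages */
        ret = __mlock_vma_pages_range(vma, start, end);

        /* an rwsem cannot be upgraded: drop the read lock, re-take for write */
        up_read(&amp;mm-&gt;mmap_sem);
        down_write(&amp;mm-&gt;mmap_sem);

        /* the map may have changed while we held no lock at all */
        vma = find_vma(mm, start);
        if (!vma || vma-&gt;vm_start &gt; start || vma-&gt;vm_end &lt; end)
                ret = -EAGAIN;  /* let the caller deal with it */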

Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if
the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma.  Again, let the caller deal
with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.

With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down.  However, I occasionally see delays while unlocking or
unmapping a large mlocked region.  Should we also downgrade the mmap_sem
for the unlock path?

Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
