<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/mm/mlock.c, branch v2.6.28.3</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>System call wrappers part 14</title>
<updated>2009-01-18T18:43:56+00:00</updated>
<author>
<name>Heiko Carstens</name>
<email>heiko.carstens@de.ibm.com</email>
</author>
<published>2009-01-14T13:14:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=eb728e5bd36c757f3086ef17bd2d62c5d46d3dc4'/>
<id>eb728e5bd36c757f3086ef17bd2d62c5d46d3dc4</id>
<content type='text'>
commit 3480b25743cb7404928d57efeaa3d085708b04c2 upstream.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 3480b25743cb7404928d57efeaa3d085708b04c2 upstream.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>System call wrappers part 13</title>
<updated>2009-01-18T18:43:56+00:00</updated>
<author>
<name>Heiko Carstens</name>
<email>heiko.carstens@de.ibm.com</email>
</author>
<published>2009-01-14T13:14:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0bba72e5cf1d17a28052ecfd3c49e815b573560a'/>
<id>0bba72e5cf1d17a28052ecfd3c49e815b573560a</id>
<content type='text'>
commit 6a6160a7b5c27b3c38651baef92a14fa7072b3c1 upstream.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 6a6160a7b5c27b3c38651baef92a14fa7072b3c1 upstream.

Signed-off-by: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>uninitialized return value in mm/mlock.c: __mlock_vma_pages_range()</title>
<updated>2008-11-16T23:55:36+00:00</updated>
<author>
<name>Helge Deller</name>
<email>deller@gmx.de</email>
</author>
<published>2008-11-16T23:30:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=72eb8c6747b49e41fd2b042510f03ac7c13426fc'/>
<id>72eb8c6747b49e41fd2b042510f03ac7c13426fc</id>
<content type='text'>
Fix an uninitialized return value when compiling on parisc (with CONFIG_UNEVICTABLE_LRU=y):
	mm/mlock.c: In function `__mlock_vma_pages_range':
	mm/mlock.c:165: warning: `ret' might be used uninitialized in this function

Signed-off-by: Helge Deller &lt;deller@gmx.de&gt;
[ It isn't ever really used uninitialized, since no caller should ever
  call this function with an empty range.  But the compiler is correct
  that from a local analysis standpoint that is impossible to see, and
  fixing the warning is appropriate.  ]
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fix an uninitialized return value when compiling on parisc (with CONFIG_UNEVICTABLE_LRU=y):
	mm/mlock.c: In function `__mlock_vma_pages_range':
	mm/mlock.c:165: warning: `ret' might be used uninitialized in this function

Signed-off-by: Helge Deller &lt;deller@gmx.de&gt;
[ It isn't ever really used uninitialized, since no caller should ever
  call this function with an empty range.  But the compiler is correct
  that from a local analysis standpoint that is impossible to see, and
  fixing the warning is appropriate.  ]
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: remove lru_add_drain_all() from the munlock path</title>
<updated>2008-11-13T01:17:16+00:00</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2008-11-12T21:26:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8891d6da17db0f9bb507d3a017f130b9970c3087'/>
<id>8891d6da17db0f9bb507d3a017f130b9970c3087</id>
<content type='text'>
lockdep warns about the following message at boot time on one of my test
machines.  In short, schedule_on_each_cpu() shouldn't be called while the
task holds mmap_sem.

lru_add_drain_all() exists to keep unevictable pages from staying on a
reclaimable LRU list, but the current unevictable code can rescue
unevictable pages even while they sit on a reclaimable list.

So removing the call is better.

In addition, this patch adds lru_add_drain_all() to sys_mlock() and
sys_mlockall().  It isn't strictly required, but it reduces failures to
move pages to the unevictable list.  Such failures can be rescued by
vmscan later, but reducing them is better.

Note: if the above rescuing happens, the Mlocked and Unevictable fields
in /proc/meminfo can mismatch, but this doesn't cause any real trouble.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.28-rc2-mm1 #2
-------------------------------------------------------
lvm/1103 is trying to acquire lock:
 (&amp;cpu_hotplug.lock){--..}, at: [&lt;c0130789&gt;] get_online_cpus+0x29/0x50

but task is already holding lock:
 (&amp;mm-&gt;mmap_sem){----}, at: [&lt;c01878ae&gt;] sys_mlockall+0x4e/0xb0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-&gt; #3 (&amp;mm-&gt;mmap_sem){----}:
       [&lt;c0153da2&gt;] check_noncircular+0x82/0x110
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0156161&gt;] validate_chain+0xb11/0x1070
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0			(*) grab mmap_sem
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0185e9b&gt;] might_fault+0x7b/0xa0
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0294dd0&gt;] copy_to_user+0x30/0x60
       [&lt;c01ae3ec&gt;] filldir+0x7c/0xd0
       [&lt;c01e3a6a&gt;] sysfs_readdir+0x11a/0x1f0			(*) grab sysfs_mutex
       [&lt;c01ae370&gt;] filldir+0x0/0xd0
       [&lt;c01ae370&gt;] filldir+0x0/0xd0
       [&lt;c01ae4c6&gt;] vfs_readdir+0x86/0xa0			(*) grab i_mutex
       [&lt;c01ae75b&gt;] sys_getdents+0x6b/0xc0
       [&lt;c010355a&gt;] syscall_call+0x7/0xb
       [&lt;ffffffff&gt;] 0xffffffff

-&gt; #2 (sysfs_mutex){--..}:
       [&lt;c0153da2&gt;] check_noncircular+0x82/0x110
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c0156161&gt;] validate_chain+0xb11/0x1070
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0			(*) grab sysfs_mutex
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c04f8b55&gt;] mutex_lock_nested+0xa5/0x2f0
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c01e422f&gt;] create_dir+0x3f/0x90
       [&lt;c01e42a9&gt;] sysfs_create_dir+0x29/0x50
       [&lt;c04faaf5&gt;] _spin_unlock+0x25/0x40
       [&lt;c028f21d&gt;] kobject_add_internal+0xcd/0x1a0
       [&lt;c028f37a&gt;] kobject_set_name_vargs+0x3a/0x50
       [&lt;c028f41d&gt;] kobject_init_and_add+0x2d/0x40
       [&lt;c019d4d2&gt;] sysfs_slab_add+0xd2/0x180
       [&lt;c019d580&gt;] sysfs_add_func+0x0/0x70
       [&lt;c019d5dc&gt;] sysfs_add_func+0x5c/0x70			(*) grab slub_lock
       [&lt;c01400f2&gt;] run_workqueue+0x172/0x200
       [&lt;c014008f&gt;] run_workqueue+0x10f/0x200
       [&lt;c0140bd0&gt;] worker_thread+0x0/0xf0
       [&lt;c0140c6c&gt;] worker_thread+0x9c/0xf0
       [&lt;c0143c80&gt;] autoremove_wake_function+0x0/0x50
       [&lt;c0140bd0&gt;] worker_thread+0x0/0xf0
       [&lt;c0143972&gt;] kthread+0x42/0x70
       [&lt;c0143930&gt;] kthread+0x0/0x70
       [&lt;c01042db&gt;] kernel_thread_helper+0x7/0x1c
       [&lt;ffffffff&gt;] 0xffffffff

-&gt; #1 (slub_lock){----}:
       [&lt;c0153d2d&gt;] check_noncircular+0xd/0x110
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c0156161&gt;] validate_chain+0xb11/0x1070
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c015433d&gt;] mark_lock+0x35d/0xd00
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c04f93a3&gt;] down_read+0x43/0x80
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0		(*) grab slub_lock
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c04fd9ac&gt;] notifier_call_chain+0x3c/0x70
       [&lt;c04f5454&gt;] _cpu_up+0x84/0x110
       [&lt;c04f552b&gt;] cpu_up+0x4b/0x70				(*) grab cpu_hotplug.lock
       [&lt;c06d1530&gt;] kernel_init+0x0/0x170
       [&lt;c06d15e5&gt;] kernel_init+0xb5/0x170
       [&lt;c06d1530&gt;] kernel_init+0x0/0x170
       [&lt;c01042db&gt;] kernel_thread_helper+0x7/0x1c
       [&lt;ffffffff&gt;] 0xffffffff

-&gt; #0 (&amp;cpu_hotplug.lock){--..}:
       [&lt;c0155bff&gt;] validate_chain+0x5af/0x1070
       [&lt;c040f7e0&gt;] dev_status+0x0/0x50
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
       [&lt;c04f8b55&gt;] mutex_lock_nested+0xa5/0x2f0
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
       [&lt;c017bc30&gt;] lru_add_drain_per_cpu+0x0/0x10
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50			(*) grab cpu_hotplug.lock
       [&lt;c0140cf2&gt;] schedule_on_each_cpu+0x32/0xe0
       [&lt;c0187095&gt;] __mlock_vma_pages_range+0x85/0x2c0
       [&lt;c0156945&gt;] __lock_acquire+0x285/0xa10
       [&lt;c0188f09&gt;] vma_merge+0xa9/0x1d0
       [&lt;c0187450&gt;] mlock_fixup+0x180/0x200
       [&lt;c0187548&gt;] do_mlockall+0x78/0x90			(*) grab mmap_sem
       [&lt;c01878e1&gt;] sys_mlockall+0x81/0xb0
       [&lt;c010355a&gt;] syscall_call+0x7/0xb
       [&lt;ffffffff&gt;] 0xffffffff

other info that might help us debug this:

1 lock held by lvm/1103:
 #0:  (&amp;mm-&gt;mmap_sem){----}, at: [&lt;c01878ae&gt;] sys_mlockall+0x4e/0xb0

stack backtrace:
Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2
Call Trace:
 [&lt;c01555fc&gt;] print_circular_bug_tail+0x7c/0xd0
 [&lt;c0155bff&gt;] validate_chain+0x5af/0x1070
 [&lt;c040f7e0&gt;] dev_status+0x0/0x50
 [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
 [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c04f8b55&gt;] mutex_lock_nested+0xa5/0x2f0
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c017bc30&gt;] lru_add_drain_per_cpu+0x0/0x10
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c0140cf2&gt;] schedule_on_each_cpu+0x32/0xe0
 [&lt;c0187095&gt;] __mlock_vma_pages_range+0x85/0x2c0
 [&lt;c0156945&gt;] __lock_acquire+0x285/0xa10
 [&lt;c0188f09&gt;] vma_merge+0xa9/0x1d0
 [&lt;c0187450&gt;] mlock_fixup+0x180/0x200
 [&lt;c0187548&gt;] do_mlockall+0x78/0x90
 [&lt;c01878e1&gt;] sys_mlockall+0x81/0xb0
 [&lt;c010355a&gt;] syscall_call+0x7/0xb

Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Tested-by: Kamalesh Babulal &lt;kamalesh@linux.vnet.ibm.com&gt;
Cc: Lee Schermerhorn &lt;Lee.Schermerhorn@hp.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Nick Piggin &lt;nickpiggin@yahoo.com.au&gt;
Cc: Hugh Dickins &lt;hugh@veritas.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
lockdep warns about the following message at boot time on one of my test
machines.  In short, schedule_on_each_cpu() shouldn't be called while the
task holds mmap_sem.

lru_add_drain_all() exists to keep unevictable pages from staying on a
reclaimable LRU list, but the current unevictable code can rescue
unevictable pages even while they sit on a reclaimable list.

So removing the call is better.

In addition, this patch adds lru_add_drain_all() to sys_mlock() and
sys_mlockall().  It isn't strictly required, but it reduces failures to
move pages to the unevictable list.  Such failures can be rescued by
vmscan later, but reducing them is better.

Note: if the above rescuing happens, the Mlocked and Unevictable fields
in /proc/meminfo can mismatch, but this doesn't cause any real trouble.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.28-rc2-mm1 #2
-------------------------------------------------------
lvm/1103 is trying to acquire lock:
 (&amp;cpu_hotplug.lock){--..}, at: [&lt;c0130789&gt;] get_online_cpus+0x29/0x50

but task is already holding lock:
 (&amp;mm-&gt;mmap_sem){----}, at: [&lt;c01878ae&gt;] sys_mlockall+0x4e/0xb0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-&gt; #3 (&amp;mm-&gt;mmap_sem){----}:
       [&lt;c0153da2&gt;] check_noncircular+0x82/0x110
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0156161&gt;] validate_chain+0xb11/0x1070
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0			(*) grab mmap_sem
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0185e9b&gt;] might_fault+0x7b/0xa0
       [&lt;c0185e6a&gt;] might_fault+0x4a/0xa0
       [&lt;c0294dd0&gt;] copy_to_user+0x30/0x60
       [&lt;c01ae3ec&gt;] filldir+0x7c/0xd0
       [&lt;c01e3a6a&gt;] sysfs_readdir+0x11a/0x1f0			(*) grab sysfs_mutex
       [&lt;c01ae370&gt;] filldir+0x0/0xd0
       [&lt;c01ae370&gt;] filldir+0x0/0xd0
       [&lt;c01ae4c6&gt;] vfs_readdir+0x86/0xa0			(*) grab i_mutex
       [&lt;c01ae75b&gt;] sys_getdents+0x6b/0xc0
       [&lt;c010355a&gt;] syscall_call+0x7/0xb
       [&lt;ffffffff&gt;] 0xffffffff

-&gt; #2 (sysfs_mutex){--..}:
       [&lt;c0153da2&gt;] check_noncircular+0x82/0x110
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c0156161&gt;] validate_chain+0xb11/0x1070
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0			(*) grab sysfs_mutex
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c04f8b55&gt;] mutex_lock_nested+0xa5/0x2f0
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c01e3d2c&gt;] sysfs_addrm_start+0x2c/0xc0
       [&lt;c01e422f&gt;] create_dir+0x3f/0x90
       [&lt;c01e42a9&gt;] sysfs_create_dir+0x29/0x50
       [&lt;c04faaf5&gt;] _spin_unlock+0x25/0x40
       [&lt;c028f21d&gt;] kobject_add_internal+0xcd/0x1a0
       [&lt;c028f37a&gt;] kobject_set_name_vargs+0x3a/0x50
       [&lt;c028f41d&gt;] kobject_init_and_add+0x2d/0x40
       [&lt;c019d4d2&gt;] sysfs_slab_add+0xd2/0x180
       [&lt;c019d580&gt;] sysfs_add_func+0x0/0x70
       [&lt;c019d5dc&gt;] sysfs_add_func+0x5c/0x70			(*) grab slub_lock
       [&lt;c01400f2&gt;] run_workqueue+0x172/0x200
       [&lt;c014008f&gt;] run_workqueue+0x10f/0x200
       [&lt;c0140bd0&gt;] worker_thread+0x0/0xf0
       [&lt;c0140c6c&gt;] worker_thread+0x9c/0xf0
       [&lt;c0143c80&gt;] autoremove_wake_function+0x0/0x50
       [&lt;c0140bd0&gt;] worker_thread+0x0/0xf0
       [&lt;c0143972&gt;] kthread+0x42/0x70
       [&lt;c0143930&gt;] kthread+0x0/0x70
       [&lt;c01042db&gt;] kernel_thread_helper+0x7/0x1c
       [&lt;ffffffff&gt;] 0xffffffff

-&gt; #1 (slub_lock){----}:
       [&lt;c0153d2d&gt;] check_noncircular+0xd/0x110
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c0156161&gt;] validate_chain+0xb11/0x1070
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c015433d&gt;] mark_lock+0x35d/0xd00
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c04f93a3&gt;] down_read+0x43/0x80
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0		(*) grab slub_lock
       [&lt;c04f650f&gt;] slab_cpuup_callback+0x11f/0x1d0
       [&lt;c04fd9ac&gt;] notifier_call_chain+0x3c/0x70
       [&lt;c04f5454&gt;] _cpu_up+0x84/0x110
       [&lt;c04f552b&gt;] cpu_up+0x4b/0x70				(*) grab cpu_hotplug.lock
       [&lt;c06d1530&gt;] kernel_init+0x0/0x170
       [&lt;c06d15e5&gt;] kernel_init+0xb5/0x170
       [&lt;c06d1530&gt;] kernel_init+0x0/0x170
       [&lt;c01042db&gt;] kernel_thread_helper+0x7/0x1c
       [&lt;ffffffff&gt;] 0xffffffff

-&gt; #0 (&amp;cpu_hotplug.lock){--..}:
       [&lt;c0155bff&gt;] validate_chain+0x5af/0x1070
       [&lt;c040f7e0&gt;] dev_status+0x0/0x50
       [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
       [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
       [&lt;c04f8b55&gt;] mutex_lock_nested+0xa5/0x2f0
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
       [&lt;c017bc30&gt;] lru_add_drain_per_cpu+0x0/0x10
       [&lt;c0130789&gt;] get_online_cpus+0x29/0x50			(*) grab cpu_hotplug.lock
       [&lt;c0140cf2&gt;] schedule_on_each_cpu+0x32/0xe0
       [&lt;c0187095&gt;] __mlock_vma_pages_range+0x85/0x2c0
       [&lt;c0156945&gt;] __lock_acquire+0x285/0xa10
       [&lt;c0188f09&gt;] vma_merge+0xa9/0x1d0
       [&lt;c0187450&gt;] mlock_fixup+0x180/0x200
       [&lt;c0187548&gt;] do_mlockall+0x78/0x90			(*) grab mmap_sem
       [&lt;c01878e1&gt;] sys_mlockall+0x81/0xb0
       [&lt;c010355a&gt;] syscall_call+0x7/0xb
       [&lt;ffffffff&gt;] 0xffffffff

other info that might help us debug this:

1 lock held by lvm/1103:
 #0:  (&amp;mm-&gt;mmap_sem){----}, at: [&lt;c01878ae&gt;] sys_mlockall+0x4e/0xb0

stack backtrace:
Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2
Call Trace:
 [&lt;c01555fc&gt;] print_circular_bug_tail+0x7c/0xd0
 [&lt;c0155bff&gt;] validate_chain+0x5af/0x1070
 [&lt;c040f7e0&gt;] dev_status+0x0/0x50
 [&lt;c0156923&gt;] __lock_acquire+0x263/0xa10
 [&lt;c015714c&gt;] lock_acquire+0x7c/0xb0
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c04f8b55&gt;] mutex_lock_nested+0xa5/0x2f0
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c017bc30&gt;] lru_add_drain_per_cpu+0x0/0x10
 [&lt;c0130789&gt;] get_online_cpus+0x29/0x50
 [&lt;c0140cf2&gt;] schedule_on_each_cpu+0x32/0xe0
 [&lt;c0187095&gt;] __mlock_vma_pages_range+0x85/0x2c0
 [&lt;c0156945&gt;] __lock_acquire+0x285/0xa10
 [&lt;c0188f09&gt;] vma_merge+0xa9/0x1d0
 [&lt;c0187450&gt;] mlock_fixup+0x180/0x200
 [&lt;c0187548&gt;] do_mlockall+0x78/0x90
 [&lt;c01878e1&gt;] sys_mlockall+0x81/0xb0
 [&lt;c010355a&gt;] syscall_call+0x7/0xb

Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Tested-by: Kamalesh Babulal &lt;kamalesh@linux.vnet.ibm.com&gt;
Cc: Lee Schermerhorn &lt;Lee.Schermerhorn@hp.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Cc: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Nick Piggin &lt;nickpiggin@yahoo.com.au&gt;
Cc: Hugh Dickins &lt;hugh@veritas.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mlock: make mlock error return Posixly Correct</title>
<updated>2008-10-20T15:52:31+00:00</updated>
<author>
<name>Lee Schermerhorn</name>
<email>lee.schermerhorn@hp.com</email>
</author>
<published>2008-10-19T03:26:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=9978ad583e100945b74e4f33e73317983ea32df9'/>
<id>9978ad583e100945b74e4f33e73317983ea32df9</id>
<content type='text'>
Rework Posix error return for mlock().

POSIX requires error codes for the mlock*() system calls, for some
conditions, that differ from what kernel low-level functions such as
get_user_pages() return for those conditions.  For more info, see:

http://marc.info/?l=linux-kernel&amp;m=121750892930775&amp;w=2

This patch provides the same translation of get_user_pages()
error codes to POSIX-specified error codes in the context
of the mlock rework for unevictable lru.
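
A minimal userspace sketch of the behaviour described above (illustration
only, not part of the patch; the unmapped address below is just an
example):

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;errno.h&gt;
#include &lt;sys/mman.h&gt;

int main(void)
{
	/* an address range that is (almost certainly) not mapped */
	void *unmapped = (void *)0x1000;

	if (mlock(unmapped, 4096) == -1)
		/* expect ENOMEM for an unmapped range; EAGAIN means the
		 * pages could not be locked when the call was made */
		printf("mlock: %s (errno %d)\n", strerror(errno), errno);

	return 0;
}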

[akpm@linux-foundation.org: fix build]
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Rework Posix error return for mlock().

POSIX requires error codes for the mlock*() system calls, for some
conditions, that differ from what kernel low-level functions such as
get_user_pages() return for those conditions.  For more info, see:

http://marc.info/?l=linux-kernel&amp;m=121750892930775&amp;w=2

This patch provides the same translation of get_user_pages()
error codes to POSIX-specified error codes in the context
of the mlock rework for unevictable lru.
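
A minimal userspace sketch of the behaviour described above (illustration
only, not part of the patch; the unmapped address below is just an
example):

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;errno.h&gt;
#include &lt;sys/mman.h&gt;

int main(void)
{
	/* an address range that is (almost certainly) not mapped */
	void *unmapped = (void *)0x1000;

	if (mlock(unmapped, 4096) == -1)
		/* expect ENOMEM for an unmapped range; EAGAIN means the
		 * pages could not be locked when the call was made */
		printf("mlock: %s (errno %d)\n", strerror(errno), errno);

	return 0;
}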

[akpm@linux-foundation.org: fix build]
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>vmstat: mlocked pages statistics</title>
<updated>2008-10-20T15:52:31+00:00</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@suse.de</email>
</author>
<published>2008-10-19T03:26:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5344b7e648980cc2ca613ec03a56a8222ff48820'/>
<id>5344b7e648980cc2ca613ec03a56a8222ff48820</id>
<content type='text'>
Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.
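
A rough way to watch the new counter from userspace (illustration only,
not part of the patch; the 4 MB size is arbitrary and the program needs
sufficient RLIMIT_MEMLOCK or root):

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/mman.h&gt;

static long mlocked_kb(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[128];
	long kb = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "Mlocked: %ld kB", &amp;kb) == 1)
			break;
	fclose(f);
	return kb;
}

int main(void)
{
	size_t len = 4UL &lt;&lt; 20;		/* 4 MB, arbitrary */
	void *buf = malloc(len);

	printf("Mlocked before: %ld kB\n", mlocked_kb());
	if (buf &amp;&amp; mlock(buf, len) == 0)
		printf("Mlocked after:  %ld kB\n", mlocked_kb());
	else
		perror("mlock");
	return 0;
}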

[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.
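
A rough way to watch the new counter from userspace (illustration only,
not part of the patch; the 4 MB size is arbitrary and the program needs
sufficient RLIMIT_MEMLOCK or root):

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/mman.h&gt;

static long mlocked_kb(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char line[128];
	long kb = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "Mlocked: %ld kB", &amp;kb) == 1)
			break;
	fclose(f);
	return kb;
}

int main(void)
{
	size_t len = 4UL &lt;&lt; 20;		/* 4 MB, arbitrary */
	void *buf = malloc(len);

	printf("Mlocked before: %ld kB\n", mlocked_kb());
	if (buf &amp;&amp; mlock(buf, len) == 0)
		printf("Mlocked after:  %ld kB\n", mlocked_kb());
	else
		perror("mlock");
	return 0;
}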

[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mmap: handle mlocked pages during map, remap, unmap</title>
<updated>2008-10-20T15:52:31+00:00</updated>
<author>
<name>Rik van Riel</name>
<email>riel@redhat.com</email>
</author>
<published>2008-10-19T03:26:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=ba470de43188cdbff795b5da43a1474523c6c2fb'/>
<id>ba470de43188cdbff795b5da43a1474523c6c2fb</id>
<content type='text'>
Originally by Nick Piggin &lt;npiggin@suse.de&gt;

Remove mlocked pages from the LRU using "unevictable infrastructure"
during mmap(), munmap(), mremap() and truncate().  Try to move back to
normal LRU lists on munmap() when last mlocked mapping removed.  Remove
PageMlocked() status when page truncated from file.

[akpm@linux-foundation.org: cleanup]
[kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
[kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
[lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
[akpm@linux-foundation.org: remove bogus kerneldoc token]
Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Originally by Nick Piggin &lt;npiggin@suse.de&gt;

Remove mlocked pages from the LRU using "unevictable infrastructure"
during mmap(), munmap(), mremap() and truncate().  Try to move back to
normal LRU lists on munmap() when last mlocked mapping removed.  Remove
PageMlocked() status when page truncated from file.

[akpm@linux-foundation.org: cleanup]
[kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
[kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
[lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
[akpm@linux-foundation.org: remove bogus kerneldoc token]
Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mlock: downgrade mmap sem while populating mlocked regions</title>
<updated>2008-10-20T15:52:31+00:00</updated>
<author>
<name>Lee Schermerhorn</name>
<email>lee.schermerhorn@hp.com</email>
</author>
<published>2008-10-19T03:26:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8edb08caf68184fb170f4f69c7445929e199eaea'/>
<id>8edb08caf68184fb170f4f69c7445929e199eaea</id>
<content type='text'>
We need to hold the mmap_sem for write to initiate mlock()/munlock()
because we may need to merge/split vmas.  However, this can lead to very
long lock hold times attempting to fault in a large memory region to mlock
it into memory.  This can hold off other faults against the mm
[multithreaded tasks] and other scans of the mm, such as via /proc.  To
alleviate this, downgrade the mmap_sem to read mode during the population
of the region for locking.  This is especially the case if we need to
reclaim memory to lock down the region.  We [probably?] don't need to do
this for unlocking as all of the pages should be resident--they're already
mlocked.

Now, the callers of the mlock functions [mlock_fixup() and
mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode.
Changing all callers appears to be way too much effort at this point.
So, restore write mode before returning.  Note that this opens a window
where the mmap list could change in a multithreaded process.  So, at least
for mlock_fixup(), where we could be called in a loop over multiple vmas,
we check that a vma still exists at the start address and that vma still
covers the page range [start,end).  If not, we return an error, -EAGAIN,
and let the caller deal with it.

Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if
the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma.  Again, let the caller deal
with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.
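
A schematic of the locking pattern described above (simplified sketch, not
the literal patch; the actual page-population step is elided and the
helper name is made up):

#include &lt;linux/mm.h&gt;
#include &lt;linux/rwsem.h&gt;

/* caller holds mm-&gt;mmap_sem for write; vma merge/split already done */
static long populate_locked_range(struct mm_struct *mm,
				  unsigned long start, unsigned long end)
{
	struct vm_area_struct *vma;
	long ret = 0;

	downgrade_write(&amp;mm-&gt;mmap_sem);		/* write -&gt; read */

	/* fault the pages in under the read lock; this may sleep and
	 * reclaim -- in the patch it is __mlock_vma_pages_range() */

	up_read(&amp;mm-&gt;mmap_sem);
	down_write(&amp;mm-&gt;mmap_sem);		/* callers expect write mode */

	/* the map may have changed while we held only the read lock */
	vma = find_vma(mm, start);
	if (!vma || vma-&gt;vm_start &gt; start || vma-&gt;vm_end &lt; end)
		ret = -EAGAIN;

	return ret;
}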

With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down.  However, I occasionally see delays while unlocking or
unmapping a large mlocked region.  Should we also downgrade the mmap_sem
for the unlock path?

Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We need to hold the mmap_sem for write to initiate mlock()/munlock()
because we may need to merge/split vmas.  However, this can lead to very
long lock hold times attempting to fault in a large memory region to mlock
it into memory.  This can hold off other faults against the mm
[multithreaded tasks] and other scans of the mm, such as via /proc.  To
alleviate this, downgrade the mmap_sem to read mode during the population
of the region for locking.  This is especially the case if we need to
reclaim memory to lock down the region.  We [probably?] don't need to do
this for unlocking as all of the pages should be resident--they're already
mlocked.

Now, the callers of the mlock functions [mlock_fixup() and
mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode.
Changing all callers appears to be way too much effort at this point.
So, restore write mode before returning.  Note that this opens a window
where the mmap list could change in a multithreaded process.  So, at least
for mlock_fixup(), where we could be called in a loop over multiple vmas,
we check that a vma still exists at the start address and that vma still
covers the page range [start,end).  If not, we return an error, -EAGAIN,
and let the caller deal with it.

Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if
the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma.  Again, let the caller deal
with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.
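
A schematic of the locking pattern described above (simplified sketch, not
the literal patch; the actual page-population step is elided and the
helper name is made up):

#include &lt;linux/mm.h&gt;
#include &lt;linux/rwsem.h&gt;

/* caller holds mm-&gt;mmap_sem for write; vma merge/split already done */
static long populate_locked_range(struct mm_struct *mm,
				  unsigned long start, unsigned long end)
{
	struct vm_area_struct *vma;
	long ret = 0;

	downgrade_write(&amp;mm-&gt;mmap_sem);		/* write -&gt; read */

	/* fault the pages in under the read lock; this may sleep and
	 * reclaim -- in the patch it is __mlock_vma_pages_range() */

	up_read(&amp;mm-&gt;mmap_sem);
	down_write(&amp;mm-&gt;mmap_sem);		/* callers expect write mode */

	/* the map may have changed while we held only the read lock */
	vma = find_vma(mm, start);
	if (!vma || vma-&gt;vm_start &gt; start || vma-&gt;vm_end &lt; end)
		ret = -EAGAIN;

	return ret;
}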

With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down.  However, I occasionally see delays while unlocking or
unmapping a large mlocked region.  Should we also downgrade the mmap_sem
for the unlock path?

Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mlock: mlocked pages are unevictable</title>
<updated>2008-10-20T15:52:30+00:00</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@suse.de</email>
</author>
<published>2008-10-19T03:26:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=b291f000393f5a0b679012b39d79fbc85c018233'/>
<id>b291f000393f5a0b679012b39d79fbc85c018233</id>
<content type='text'>
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.

This is achieved through various strategies:

1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.

   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.

2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.

3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.

4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism lets pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.

Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;

splitlru: introduce __get_user_pages():

  The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS
  because the current get_user_pages() can't grab PROT_NONE pages and
  therefore PROT_NONE pages can't be munlocked.

[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Dave Hansen &lt;dave@linux.vnet.ibm.com&gt;
Cc: Matt Mackall &lt;mpm@selenic.com&gt;
Signed-off-by: Hugh Dickins &lt;hugh@veritas.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.

This is achieved through various strategies:

1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.

   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.

2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.

3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.

4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism lets pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.

Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;

splitlru: introduce __get_user_pages():

  The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS
  because the current get_user_pages() can't grab PROT_NONE pages and
  therefore PROT_NONE pages can't be munlocked.

[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Lee Schermerhorn &lt;lee.schermerhorn@hp.com&gt;
Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Cc: Dave Hansen &lt;dave@linux.vnet.ibm.com&gt;
Cc: Matt Mackall &lt;mpm@selenic.com&gt;
Signed-off-by: Hugh Dickins &lt;hugh@veritas.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mlock() fix return values</title>
<updated>2008-08-04T23:58:45+00:00</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2008-08-04T20:41:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=a477097d9c37c1cf289c7f0257dffcfa42d50197'/>
<id>a477097d9c37c1cf289c7f0257dffcfa42d50197</id>
<content type='text'>
Halesh says:

Please find below the testcase provided to test mlock.

Test Case :
===========================

#include &lt;sys/resource.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;errno.h&gt;
#include &lt;stdlib.h&gt;

int main(void)
{
  int fd,ret, i = 0;
  char *addr, *addr1 = NULL;
  unsigned int page_size;
  struct rlimit rlim;

  if (0 != geteuid())
  {
   printf("Execute this pgm as root\n");
   exit(1);
  }

  /* create a file */
  if ((fd = open("mmap_test.c",O_RDWR|O_CREAT,0755)) == -1)
  {
   printf("cant create test file\n");
   exit(1);
  }

  page_size = sysconf(_SC_PAGE_SIZE);

  /* set the MEMLOCK limit */
  rlim.rlim_cur = 2000;
  rlim.rlim_max = 2000;

  if ((ret = setrlimit(RLIMIT_MEMLOCK,&amp;rlim)) != 0)
  {
   printf("Cant change limit values\n");
   exit(1);
  }

  addr = 0;
  while (1)
  {
  /* map a page into memory each time*/
  if ((addr = (char *) mmap(addr,page_size, PROT_READ |
PROT_WRITE,MAP_SHARED,fd,0)) == MAP_FAILED)
  {
   printf("cant do mmap on file\n");
   exit(1);
  }

  if (0 == i)
    addr1 = addr;
  i++;
  errno = 0;
  /* lock the mapped memory pagewise*/
  if ((ret = mlock((char *)addr, 1500)) == -1)
  {
   printf("errno value is %d\n", errno);
   printf("cant lock maped region\n");
   exit(1);
  }
  addr = addr + page_size;
 }
}
======================================================

This testcase results in an mlock() failure with errno 14, i.e. EFAULT,
but nowhere is it specified that mlock() will return EFAULT.  When I
tested the same on older kernels like 2.6.18, I got the correct result,
i.e. errno 12 (ENOMEM).

I think that in the mlock(2) source, setting errno to ENOMEM has been
missed in do_mlock() on mlock_fixup() failure.

SUSv3 requires the following behavior from mlock(2).

[ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

[EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

This rule isn't very nice and is slightly strange, but many people think
POSIX/SUS compliance is important.

Reported-by: Halesh Sadashiv &lt;halesh.sadashiv@ap.sony.com&gt;
Tested-by: Halesh Sadashiv &lt;halesh.sadashiv@ap.sony.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: &lt;stable@kernel.org&gt;		[2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Halesh says:

Please find below the testcase provided to test mlock.

Test Case :
===========================

#include &lt;sys/resource.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;errno.h&gt;
#include &lt;stdlib.h&gt;

int main(void)
{
  int fd,ret, i = 0;
  char *addr, *addr1 = NULL;
  unsigned int page_size;
  struct rlimit rlim;

  if (0 != geteuid())
  {
   printf("Execute this pgm as root\n");
   exit(1);
  }

  /* create a file */
  if ((fd = open("mmap_test.c",O_RDWR|O_CREAT,0755)) == -1)
  {
   printf("cant create test file\n");
   exit(1);
  }

  page_size = sysconf(_SC_PAGE_SIZE);

  /* set the MEMLOCK limit */
  rlim.rlim_cur = 2000;
  rlim.rlim_max = 2000;

  if ((ret = setrlimit(RLIMIT_MEMLOCK,&amp;rlim)) != 0)
  {
   printf("Cant change limit values\n");
   exit(1);
  }

  addr = 0;
  while (1)
  {
  /* map a page into memory each time*/
  if ((addr = (char *) mmap(addr,page_size, PROT_READ |
PROT_WRITE,MAP_SHARED,fd,0)) == MAP_FAILED)
  {
   printf("cant do mmap on file\n");
   exit(1);
  }

  if (0 == i)
    addr1 = addr;
  i++;
  errno = 0;
  /* lock the mapped memory pagewise*/
  if ((ret = mlock((char *)addr, 1500)) == -1)
  {
   printf("errno value is %d\n", errno);
   printf("cant lock maped region\n");
   exit(1);
  }
  addr = addr + page_size;
 }
}
======================================================

This testcase results in an mlock() failure with errno 14, i.e. EFAULT,
but nowhere is it specified that mlock() will return EFAULT.  When I
tested the same on older kernels like 2.6.18, I got the correct result,
i.e. errno 12 (ENOMEM).

I think that in the mlock(2) source, setting errno to ENOMEM has been
missed in do_mlock() on mlock_fixup() failure.

SUSv3 requires the following behavior from mlock(2).

[ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

[EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

This rule isn't very nice and is slightly strange, but many people think
POSIX/SUS compliance is important.

Reported-by: Halesh Sadashiv &lt;halesh.sadashiv@ap.sony.com&gt;
Tested-by: Halesh Sadashiv &lt;halesh.sadashiv@ap.sony.com&gt;
Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: &lt;stable@kernel.org&gt;		[2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
