linux-toradex.git/mm, branch v2.6.34.15

Remove user-triggerable BUG from mpol_to_str

2014-02-10T21:11:34+00:00

commit 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a upstream.

Trivially triggerable, found by trinity:

  kernel BUG at mm/mempolicy.c:2546!
  Process trinity-child2 (pid: 23988, threadinfo ffff88010197e000, task ffff88007821a670)
  Call Trace:
    show_numa_map+0xd5/0x450
    show_pid_numa_map+0x13/0x20
    traverse+0xf2/0x230
    seq_read+0x34b/0x3e0
    vfs_read+0xac/0x180
    sys_pread64+0xa2/0xc0
    system_call_fastpath+0x1a/0x1f
  RIP: mpol_to_str+0x156/0x360

Signed-off-by: Dave Jones 
Signed-off-by: Linus Torvalds 
Signed-off-by: Paul Gortmaker

tmpfs: fix use-after-free of mempolicy object

2014-02-10T21:11:25+00:00

commit 5f00110f7273f9ff04ac69a5f85bb535a4fd0987 upstream.

The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
option is not specified in the remount request.  A new policy can be
specified if mpol=M is given.

Before this patch remounting an mpol bound tmpfs without specifying
mpol= mount option in the remount request would set the filesystem's
mempolicy object to a freed mempolicy object.

To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
    # mkdir /tmp/x

    # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0

    # mount -o remount,size=200M nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
        # note ? garbage in mpol=... output above

    # dd if=/dev/zero of=/tmp/x/f count=1
        # panic here

Panic:
    BUG: unable to handle kernel NULL pointer dereference at           (null)
    IP: [<          (null)>]           (null)
    [...]
    Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
    Call Trace:
      mpol_shared_policy_init+0xa5/0x160
      shmem_get_inode+0x209/0x270
      shmem_mknod+0x3e/0xf0
      shmem_create+0x18/0x20
      vfs_create+0xb5/0x130
      do_last+0x9a1/0xea0
      path_openat+0xb3/0x4d0
      do_filp_open+0x42/0xa0
      do_sys_open+0xfe/0x1e0
      compat_sys_open+0x1b/0x20
      cstar_dispatch+0x7/0x1f

Non-debug kernels will not crash immediately because referencing the
dangling mpol will not cause a fault.  Instead the filesystem will
reference a freed mempolicy object, which will cause unpredictable
behavior.

The problem boils down to a dropped mpol reference below if
shmem_parse_options() does not allocate a new mpol:

    config = *sbinfo
    shmem_parse_options(data, &config, true)
    mpol_put(sbinfo->mpol)
    sbinfo->mpol = config.mpol  /* BUG: saves unreferenced mpol */

This patch avoids the crash by not releasing the mempolicy if
shmem_parse_options() doesn't create a new mpol.

How far back does this issue go? I see it in both 2.6.36 and 3.3.  I did
not look back further.

Signed-off-by: Greg Thelen 
Acked-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Paul Gortmaker

mempolicy: fix a race in shared_policy_replace()

2014-02-10T21:11:10+00:00

commit b22d127a39ddd10d93deee3d96e643657ad53a49 upstream.

shared_policy_replace() use of sp_alloc() is unsafe.  1) sp_node cannot
be dereferenced if sp->lock is not held and 2) another thread can modify
sp_node between spin_unlock for allocating a new sp node and next
spin_lock.  The bug was introduced before 2.6.12-rc2.

Kosaki's original patch for this problem was to allocate an sp node and
policy within shared_policy_replace and initialise it when the lock is
reacquired.  I was not keen on this approach because it partially
duplicates sp_alloc().  As the paths were sp->lock is taken are not that
performance critical this patch converts sp->lock to sp->mutex so it can
sleep when calling sp_alloc().

[kosaki.motohiro@jp.fujitsu.com: Original patch]
Signed-off-by: Mel Gorman 
Acked-by: KOSAKI Motohiro 
Reviewed-by: Christoph Lameter 
Cc: Josh Boyer 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Paul Gortmaker

mm: Hold a file reference in madvise_remove

2014-02-10T21:11:10+00:00

commit 9ab4233dd08036fe34a89c7dc6f47a8bf2eb29eb upstream.

Otherwise the code races with munmap (causing a use-after-free
of the vma) or with close (causing a use-after-free of the struct
file).

The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix
mmap_sem i_mutex deadlock")

[bwh: Backported to 3.2:
 - Adjust context
 - madvise_remove() calls vmtruncate_range(), not do_fallocate()]
[luto: Backported to 3.0: Adjust context]

Cc: Hugh Dickins 
Cc: Miklos Szeredi 
Cc: Badari Pulavarty 
Cc: Nick Piggin 
Signed-off-by: Ben Hutchings 
Signed-off-by: Andy Lutomirski 
Signed-off-by: Greg Kroah-Hartman 
[PG: commit e12fcd38abe8a869cbabd77724008f1cf812a3e7 in v3.0.37]
Signed-off-by: Paul Gortmaker

mm: mmu_notifier: fix freed page still mapped in secondary MMU

2014-02-10T21:11:09+00:00

commit 3ad3d901bbcfb15a5e4690e55350db0899095a68 upstream.

mmu_notifier_release() is called when the process is exiting.  It will
delete all the mmu notifiers.  But at this time the page belonging to the
process is still present in page tables and is present on the LRU list, so
this race will happen:

      CPU 0                 CPU 1
mmu_notifier_release:    try_to_unmap:
   hlist_del_init_rcu(&mn->hlist);
                            ptep_clear_flush_notify:
                                  mmu nofifler not found
                            free page  !!!!!!
                            /*
                             * At the point, the page has been
                             * freed, but it is still mapped in
                             * the secondary MMU.
                             */

  mn->ops->release(mn, mm);

Then the box is not stable and sometimes we can get this bug:

[  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
[  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
[  738.075936] page flags: 0x20000000000014(referenced|dirty)

The same issue is present in mmu_notifier_unregister().

We can call ->release before deleting the notifier to ensure the page has
been unmapped from the secondary MMU before it is freed.

Signed-off-by: Xiao Guangrong 
Cc: Avi Kivity 
Cc: Marcelo Tosatti 
Cc: Paul Gortmaker 
Cc: Andrea Arcangeli 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Paul Gortmaker

mm: fix invalidate_complete_page2() lock ordering

2014-02-10T21:11:09+00:00

commit ec4d9f626d5908b6052c2973f37992f1db52e967 upstream.

In fuzzing with trinity, lockdep protested "possible irq lock inversion
dependency detected" when isolate_lru_page() reenabled interrupts while
still holding the supposedly irq-safe tree_lock:

invalidate_inode_pages2
  invalidate_complete_page2
    spin_lock_irq(&mapping->tree_lock)
    clear_page_mlock
      isolate_lru_page
        spin_unlock_irq(&zone->lru_lock)

isolate_lru_page() is correct to enable interrupts unconditionally:
invalidate_complete_page2() is incorrect to call clear_page_mlock() while
holding tree_lock, which is supposed to nest inside lru_lock.

Both truncate_complete_page() and invalidate_complete_page() call
clear_page_mlock() before taking tree_lock to remove page from radix_tree.
 I guess invalidate_complete_page2() preferred to test PageDirty (again)
under tree_lock before committing to the munlock; but since the page has
already been unmapped, its state is already somewhat inconsistent, and no
worse if clear_page_mlock() moved up.

Reported-by: Sasha Levin 
Deciphered-by: Andrew Morton 
Signed-off-by: Hugh Dickins 
Acked-by: Mel Gorman 
Cc: Rik van Riel 
Cc: Johannes Weiner 
Cc: Michel Lespinasse 
Cc: Ying Han 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Paul Gortmaker

mm: bugfix: set current->reclaim_state to NULL while returning from kswapd()

2014-02-10T21:11:09+00:00

commit b0a8cc58e6b9aaae3045752059e5e6260c0b94bc upstream.

In kswapd(), set current->reclaim_state to NULL before returning, as
current->reclaim_state holds reference to variable on kswapd()'s stack.

In rare cases, while returning from kswapd() during memory offlining,
__free_slab() and freepages() can access the dangling pointer of
current->reclaim_state.

Signed-off-by: Takamori Yamaguchi 
Signed-off-by: Aaditya Kumar 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Paul Gortmaker

mm: fix vma_resv_map() NULL pointer

2014-02-10T21:11:08+00:00

commit 4523e1458566a0e8ecfaff90f380dd23acc44d27 upstream.

hugetlb_reserve_pages() can be used for either normal file-backed
hugetlbfs mappings, or MAP_HUGETLB.  In the MAP_HUGETLB, semi-anonymous
mode, there is not a VMA around.  The new call to resv_map_put() assumed
that there was, and resulted in a NULL pointer dereference:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
  IP: vma_resv_map+0x9/0x30
  PGD 141453067 PUD 1421e1067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP
  ...
  Pid: 14006, comm: trinity-child6 Not tainted 3.4.0+ #36
  RIP: vma_resv_map+0x9/0x30
  ...
  Process trinity-child6 (pid: 14006, threadinfo ffff8801414e0000, task ffff8801414f26b0)
  Call Trace:
    resv_map_put+0xe/0x40
    hugetlb_reserve_pages+0xa6/0x1d0
    hugetlb_file_setup+0x102/0x2c0
    newseg+0x115/0x360
    ipcget+0x1ce/0x310
    sys_shmget+0x5a/0x60
    system_call_fastpath+0x16/0x1b

This was reported by Dave Jones, but was reproducible with the
libhugetlbfs test cases, so shame on me for not running them in the
first place.

With this, the oops is gone, and the output of libhugetlbfs's
run_tests.py is identical to plain 3.4 again.

[ Marked for stable, since this was introduced by commit c50ac050811d
  ("hugetlb: fix resv_map leak in error path") which was also marked for
  stable ]

Reported-by: Dave Jones 
Cc: Mel Gorman 
Cc: KOSAKI Motohiro 
Cc: Christoph Lameter 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Paul Gortmaker

hugetlb: fix resv_map leak in error path

2014-02-10T21:11:08+00:00

commit c50ac050811d6485616a193eb0f37bfbd191cc89 upstream.

When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
does a resv_map_alloc().  It depends on code in hugetlbfs's
vm_ops->close() to release that allocation.

However, in the mmap() failure path, we do a plain unmap_region() without
the remove_vma() which actually calls vm_ops->close().

This is a decent fix.  This leak could get reintroduced if new code (say,
after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
an error.  But, I think it would have to unroll the reservation anyway.

Christoph's test case:

	http://marc.info/?l=linux-mm&m=133728900729735

This patch applies to 3.4 and later.  A version for earlier kernels is at
https://lkml.org/lkml/2012/5/22/418.

Signed-off-by: Dave Hansen 
Acked-by: Mel Gorman 
Acked-by: KOSAKI Motohiro 
Reported-by: Christoph Lameter 
Tested-by: Christoph Lameter 
Cc: Andrea Arcangeli 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Paul Gortmaker

Revert "percpu: fix chunk range calculation"

2014-02-10T21:10:41+00:00

This reverts commit 264266e6897dd81c894d1c5cbd90b133707b32f3.

The backport had dependencies on other mm/percpu.c restructurings,
like those in commit 020ec6537aa65c18e9084c568d7b94727f2026fd
("percpu: factor out pcpu_addr_in_first/reserved_chunk() and update
per_cpu_ptr_to_phys()").  Rather than drag in more changes, we simply
revert the incomplete backport.

Reported-by: George G. Davis 
Signed-off-by: Paul Gortmaker