linux-toradex.git/mm, branch v3.2.23

mm: Hold a file reference in madvise_remove

2012-07-12T03:32:19+00:00

commit 9ab4233dd08036fe34a89c7dc6f47a8bf2eb29eb upstream.

Otherwise the code races with munmap (causing a use-after-free
of the vma) or with close (causing a use-after-free of the struct
file).

The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix
mmap_sem i_mutex deadlock")

Cc: Hugh Dickins 
Cc: Miklos Szeredi 
Cc: Badari Pulavarty 
Cc: Nick Piggin 
Signed-off-by: Andy Lutomirski 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.2:
 - Adjust context
 - madvise_remove() calls vmtruncate_range(), not do_fallocate()]
Signed-off-by: Ben Hutchings

splice: fix racy pipe->buffers uses

2012-07-12T03:31:59+00:00

commit 047fe3605235888f3ebcda0c728cb31937eadfe6 upstream.

Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
by splice_shrink_spd() called from vmsplice_to_pipe()

commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes)
added capability to adjust pipe->buffers.

Problem is some paths don't hold pipe mutex and assume pipe->buffers
doesn't change for their duration.

Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
use it in place of pipe->buffers where appropriate.

splice_shrink_spd() loses its struct pipe_inode_info argument.

Reported-by: Dave Jones 
Signed-off-by: Eric Dumazet 
Cc: Jens Axboe 
Cc: Alexander Viro 
Cc: Tom Herbert 
Tested-by: Dave Jones 
Signed-off-by: Jens Axboe 
[bwh: Backported to 3.2:
 - Adjust context in vmsplice_to_pipe()
 - Update one more call to splice_shrink_spd(), from skb_splice_bits()]
Signed-off-by: Ben Hutchings

swap: fix shmem swapping when more than 8 areas

2012-06-19T22:18:28+00:00

commit 9b15b817f3d62409290fd56fe3cbb076a931bb0a upstream.

Minchan Kim reports that when a system has many swap areas, and tmpfs
swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
back the page cannot locate it, and the read fails with -ENOMEM.

Whoops.  Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
swp_entry_to_pte()) technique for determining maximum usable swap
offset, without stopping to realize that that actually depends upon the
pte swap encoding shifting swap offset to the higher bits and truncating
it there.  Whereas our radix_tree swap encoding leaves offset in the
lower bits: it's swap "type" (that is, index of swap area) that was
truncated.

Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().

This does not reduce the usable size of a swap area any further, it
leaves it as claimed when making the original commit: no change from 3.0
on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
per swapfile on i386 with PAE.  It's not a change I would have risked
five years ago, but with x86_64 supported for ten years, I believe it's
appropriate now.

Hmm, and what if some architecture implements its swap pte with offset
encoded below type? That would equally break the maximum usable swap
offset check.  Happily, they all follow the same tradition of encoding
offset above type, but I'll prepare a check on that for next.

Reported-and-Reviewed-and-Tested-by: Minchan Kim 
Signed-off-by: Hugh Dickins 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

slub: fix a memory leak in get_partial_node()

2012-06-10T13:41:48+00:00

commit 02d7633fa567be7bf55a993b79d2a31b95ce2227 upstream.

In the case which is below,

1. acquire slab for cpu partial list
2. free object to it by remote cpu
3. page->freelist = t

then memory leak is occurred.

Change acquire_slab() not to zap freelist when it works for cpu partial list.
I think it is a sufficient solution for fixing a memory leak.

Below is output of 'slabinfo -r kmalloc-256'
when './perf stat -r 30 hackbench 50 process 4000 > /dev/null' is done.

***Vanilla***
Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     256  Total  :     468   Sanity Checks : Off  Total: 3833856
SlabObj:     256  Full   :     111   Redzoning     : Off  Used : 2004992
SlabSiz:    8192  Partial:     302   Poisoning     : Off  Loss : 1828864
Loss   :       0  CpuSlab:      55   Tracking      : Off  Lalig:       0
Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0

***Patched***
Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     256  Total  :     300   Sanity Checks : Off  Total: 2457600
SlabObj:     256  Full   :     204   Redzoning     : Off  Used : 2348800
SlabSiz:    8192  Partial:      33   Poisoning     : Off  Loss :  108800
Loss   :       0  CpuSlab:      63   Tracking      : Off  Lalig:       0
Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0

Total and loss number is the impact of this patch.

Acked-by: Christoph Lameter 
Signed-off-by: Joonsoo Kim 
Signed-off-by: Pekka Enberg 
Signed-off-by: Ben Hutchings

mm: fix vma_resv_map() NULL pointer

2012-06-10T13:41:47+00:00

commit 4523e1458566a0e8ecfaff90f380dd23acc44d27 upstream.

hugetlb_reserve_pages() can be used for either normal file-backed
hugetlbfs mappings, or MAP_HUGETLB.  In the MAP_HUGETLB, semi-anonymous
mode, there is not a VMA around.  The new call to resv_map_put() assumed
that there was, and resulted in a NULL pointer dereference:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
  IP: vma_resv_map+0x9/0x30
  PGD 141453067 PUD 1421e1067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP
  ...
  Pid: 14006, comm: trinity-child6 Not tainted 3.4.0+ #36
  RIP: vma_resv_map+0x9/0x30
  ...
  Process trinity-child6 (pid: 14006, threadinfo ffff8801414e0000, task ffff8801414f26b0)
  Call Trace:
    resv_map_put+0xe/0x40
    hugetlb_reserve_pages+0xa6/0x1d0
    hugetlb_file_setup+0x102/0x2c0
    newseg+0x115/0x360
    ipcget+0x1ce/0x310
    sys_shmget+0x5a/0x60
    system_call_fastpath+0x16/0x1b

This was reported by Dave Jones, but was reproducible with the
libhugetlbfs test cases, so shame on me for not running them in the
first place.

With this, the oops is gone, and the output of libhugetlbfs's
run_tests.py is identical to plain 3.4 again.

[ Marked for stable, since this was introduced by commit c50ac050811d
  ("hugetlb: fix resv_map leak in error path") which was also marked for
  stable ]

Reported-by: Dave Jones 
Cc: Mel Gorman 
Cc: KOSAKI Motohiro 
Cc: Christoph Lameter 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

mm: fix faulty initialization in vmalloc_init()

2012-06-10T13:41:46+00:00

commit dbda591d920b4c7692725b13e3f68ecb251e9080 upstream.

The transfer of ->flags causes some of the static mapping virtual
addresses to be prematurely freed (before the mapping is removed) because
VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set.  This might
cause subsequent vmalloc/ioremap calls to fail because it might allocate
one of the freed virtual address ranges that aren't unmapped.

va->flags has different types of flags from tmp->flags.  If a region with
VM_IOREMAP set is registered with vm_area_add_early(), it will be removed
by __purge_vmap_area_lazy().

Fix vmalloc_init() to correctly initialize vmap_area for the given
vm_struct.

Also initialise va->vm.  If it is not set, find_vm_area() for the early
vm regions will always fail.

Signed-off-by: KyongHo Cho 
Cc: "Olav Haugan" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

mm/vmalloc.c: change void* into explict vm_struct*

2012-06-10T13:41:46+00:00

commit db1aecafef58b5dda39c4228debe2c845e4a27ab upstream.

vmap_area->private is void* but we don't use the field for various purpose
but use only for vm_struct.  So change it to a vm_struct* with naming to
improve for readability and type checking.

Signed-off-by: Minchan Kim 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

hugetlb: fix resv_map leak in error path

2012-06-10T13:41:45+00:00

commit c50ac050811d6485616a193eb0f37bfbd191cc89 upstream.

When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
does a resv_map_alloc().  It depends on code in hugetlbfs's
vm_ops->close() to release that allocation.

However, in the mmap() failure path, we do a plain unmap_region() without
the remove_vma() which actually calls vm_ops->close().

This is a decent fix.  This leak could get reintroduced if new code (say,
after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
an error.  But, I think it would have to unroll the reservation anyway.

Christoph's test case:

	http://marc.info/?l=linux-mm&m=133728900729735

Signed-off-by: Dave Hansen 
[Christoph Lameter: I have rediffed the patch against 2.6.32 and 3.2.0.]
Signed-off-by: Ben Hutchings

mm: consider all swapped back pages in used-once logic

2012-06-10T13:41:45+00:00

commit e48982734ea0500d1eba4f9d96195acc5406cad6 upstream.

Commit 645747462435 ("vmscan: detect mapped file pages used only once")
made mapped pages have another round in inactive list because they might
be just short lived and so we could consider them again next time.  This
heuristic helps to reduce pressure on the active list with a streaming
IO worklods.

This patch fixes a regression introduced by this commit for heavy shmem
based workloads because unlike Anon pages, which are excluded from this
heuristic because they are usually long lived, shmem pages are handled
as a regular page cache.

This doesn't work quite well, unfortunately, if the workload is mostly
backed by shmem (in memory database sitting on 80% of memory) with a
streaming IO in the background (backup - up to 20% of memory).  Anon
inactive list is full of (dirty) shmem pages when watermarks are hit.
Shmem pages are kept in the inactive list (they are referenced) in the
first round and it is hard to reclaim anything else so we reach lower
scanning priorities very quickly which leads to an excessive swap out.

Let's fix this by excluding all swap backed pages (they tend to be long
lived wrt.  the regular page cache anyway) from used-once heuristic and
rather activate them if they are referenced.

The customer's workload is shmem backed database (80% of RAM) and they
are measuring transactions/s with an IO in the background (20%).
Transactions touch more or less random rows in the table.  The
transaction rate fell by a factor of 3 (in the worst case) because of
commit 64574746.  This patch restores the previous numbers.

Signed-off-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Mel Gorman 
Cc: Minchan Kim 
Cc: KAMEZAWA Hiroyuki 
Reviewed-by: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

memcg: free spare array to avoid memory leak

2012-05-30T23:43:51+00:00

commit 8c7577637ca31385e92769a77e2ab5b428e8b99c upstream.

When the last event is unregistered, there is no need to keep the spare
array anymore.  So free it to avoid memory leak.

Signed-off-by: Sha Zhengju 
Acked-by: KAMEZAWA Hiroyuki 
Reviewed-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings