linux-toradex.git/mm, branch v3.10.60

mm: Remove false WARN_ON from pagecache_isize_extended()

2014-11-14T16:48:00+00:00

commit f55fefd1a5a339b1bd08c120b93312d6eb64a9fb upstream.

The WARN_ON checking whether i_mutex is held in
pagecache_isize_extended() was wrong because some filesystems (e.g.
XFS) use different locks for serialization of truncates / writes. So
just remove the check.

Signed-off-by: Jan Kara 
Reviewed-by: Dave Chinner 
Signed-off-by: Dave Chinner 
Signed-off-by: Greg Kroah-Hartman

cgroup/kmemleak: add kmemleak_free() for cgroup deallocations.

2014-11-14T16:47:59+00:00

commit 401507d67d5c2854f5a88b3f93f64fc6f267bca5 upstream.

Commit ff7ee93f4715 ("cgroup/kmemleak: Annotate alloc_page() for cgroup
allocations") introduces kmemleak_alloc() for alloc_page_cgroup(), but
corresponding kmemleak_free() is missing, which makes kmemleak be
wrongly disabled after memory offlining.  Log is pasted at the end of
this commit message.

This patch add kmemleak_free() into free_page_cgroup().  During page
offlining, this patch removes corresponding entries in kmemleak rbtree.
After that, the freed memory can be allocated again by other subsystems
without killing kmemleak.

  bash # for x in 1 2 3 4; do echo offline > /sys/devices/system/memory/memory$x/state ; sleep 1; done ; dmesg | grep leak

  Offlined Pages 32768
  kmemleak: Cannot insert 0xffff880016969000 into the object search tree (overlaps existing)
  CPU: 0 PID: 412 Comm: sleep Not tainted 3.17.0-rc5+ #86
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x46/0x58
    create_object+0x266/0x2c0
    kmemleak_alloc+0x26/0x50
    kmem_cache_alloc+0xd3/0x160
    __sigqueue_alloc+0x49/0xd0
    __send_signal+0xcb/0x410
    send_signal+0x45/0x90
    __group_send_sig_info+0x13/0x20
    do_notify_parent+0x1bb/0x260
    do_exit+0x767/0xa40
    do_group_exit+0x44/0xa0
    SyS_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b

  kmemleak: Kernel memory leak detector disabled
  kmemleak: Object 0xffff880016900000 (size 524288):
  kmemleak:   comm "swapper/0", pid 0, jiffies 4294667296
  kmemleak:   min_count = 0
  kmemleak:   count = 0
  kmemleak:   flags = 0x1
  kmemleak:   checksum = 0
  kmemleak:   backtrace:
        log_early+0x63/0x77
        kmemleak_alloc+0x4b/0x50
        init_section_page_cgroup+0x7f/0xf5
        page_cgroup_init+0xc5/0xd0
        start_kernel+0x333/0x408
        x86_64_start_reservations+0x2a/0x2c
        x86_64_start_kernel+0xf5/0xfc

Fixes: ff7ee93f4715 (cgroup/kmemleak: Annotate alloc_page() for cgroup allocations)
Signed-off-by: Wang Nan 
Acked-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Steven Rostedt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

OOM, PM: OOM killed task shouldn't escape PM suspend

2014-11-14T16:47:58+00:00

commit 5695be142e203167e3cb515ef86a88424f3524eb upstream.

PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.

Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.

Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.

Changes since v1
- push the re-check loop out of freeze_processes into
  check_frozen_processes and invert the condition to make the code more
  readable as per Rafael

Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
Signed-off-by: Michal Hocko 
Signed-off-by: Rafael J. Wysocki 
Signed-off-by: Greg Kroah-Hartman

vfs: fix data corruption when blocksize < pagesize for mmaped data

2014-11-14T16:47:54+00:00

commit 90a8020278c1598fafd071736a0846b38510309c upstream.

->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to return a page fault if there is not enough space
available, user exceeds quota or similar problem happens, rather than
silently discarding data later when writepage is called.

However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size which is generally undesirable. But later at
->writepage() time, we also need to store data at offset 4095 but we
don't have block allocated for it.

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

Revert "percpu: free percpu allocation info for uniprocessor system"

2014-11-14T16:47:53+00:00

commit bb2e226b3bef596dd56be97df655d857b4603923 upstream.

This reverts commit 3189eddbcafc ("percpu: free percpu allocation info for
uniprocessor system").

The commit causes a hang with a crisv32 image. This may be an architecture
problem, but at least for now the revert is necessary to be able to boot a
crisv32 image.

Cc: Tejun Heo 
Cc: Honggang Li 
Signed-off-by: Guenter Roeck 
Signed-off-by: Tejun Heo 
Fixes: 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system")
Signed-off-by: Greg Kroah-Hartman

mm: numa: Do not mark PTEs pte_numa when splitting huge pages

2014-10-09T19:18:42+00:00

commit abc40bd2eeb77eb7c2effcaf63154aad929a1d5f upstream.

This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
NUMA type from the pmd to the pte"). If a huge page is being split due
a protection change and the tail will be in a PROT_NONE vma then NUMA
hinting PTEs are temporarily created in the protected VMA.

 VM_RW|VM_PROTNONE
|-----------------|
      ^
      split here

In the specific case above, it should get fixed up by change_pte_range()
but there is a window of opportunity for weirdness to happen. Similarly,
if a huge page is shrunk and split during a protection update but before
pmd_numa is cleared then a pte_numa can be left behind.

Instead of adding complexity trying to deal with the case, this patch
will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
will not be triggered which is marginal in comparison to the complexity
in dealing with the corner cases during THP split.

Signed-off-by: Mel Gorman 
Acked-by: Rik van Riel 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm, thp: move invariant bug check out of loop in __split_huge_page_map

2014-10-09T19:18:42+00:00

commit f8303c2582b889351e261ff18c4d8eb197a77db2 upstream.

In __split_huge_page_map(), the check for page_mapcount(page) is
invariant within the for loop.  Because of the fact that the macro is
implemented using atomic_read(), the redundant check cannot be optimized
away by the compiler leading to unnecessary read to the page structure.

This patch moves the invariant bug check out of the loop so that it will
be done only once.  On a 3.16-rc1 based kernel, the execution time of a
microbenchmark that broke up 1000 transparent huge pages using munmap()
had an execution time of 38,245us and 38,548us with and without the
patch respectively.  The performance gain is about 1%.

Signed-off-by: Waiman Long 
Acked-by: Kirill A. Shutemov 
Cc: Andrea Arcangeli 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Scott J Norton 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

vm_is_stack: use for_each_thread() rather then buggy while_each_thread()

2014-10-05T21:54:16+00:00

commit 4449a51a7c281602d3a385044ab928322a122a02 upstream.

Aleksei hit the soft lockup during reading /proc/PID/smaps.  David
investigated the problem and suggested the right fix.

while_each_thread() is racy and should die, this patch updates
vm_is_stack().

Signed-off-by: Oleg Nesterov 
Reported-by: Aleksei Besogonov 
Tested-by: Aleksei Besogonov 
Suggested-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Li Zefan 
Signed-off-by: Greg Kroah-Hartman

oom_kill: add rcu_read_lock() into find_lock_task_mm()

2014-10-05T21:54:15+00:00

commit 4d4048be8a93769350efa31d2482a038b7de73d0 upstream.

find_lock_task_mm() expects it is called under rcu or tasklist lock, but
it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and
mem_cgroup_out_of_memory()->oom_badness() can call it lockless.

Perhaps we could fix the callers, but this patch simply adds rcu lock
into find_lock_task_mm().  This also allows to simplify a bit one of its
callers, oom_kill_process().

Signed-off-by: Oleg Nesterov 
Cc: Sergey Dyasly 
Cc: Sameer Nanda 
Cc: "Eric W. Biederman" 
Cc: Frederic Weisbecker 
Cc: Mandeep Singh Baines 
Cc: "Ma, Xindong" 
Reviewed-by: Michal Hocko 
Cc: "Tu, Xiaobing" 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Li Zefan 
Signed-off-by: Greg Kroah-Hartman

oom_kill: has_intersects_mems_allowed() needs rcu_read_lock()

2014-10-05T21:54:15+00:00

commit ad96244179fbd55b40c00f10f399bc04739b8e1f upstream.

At least out_of_memory() calls has_intersects_mems_allowed() without
even rcu_read_lock(), this is obviously buggy.

Add the necessary rcu_read_lock().  This means that we can not simply
return from the loop, we need "bool ret" and "break".

While at it, swap the names of task_struct's (the argument and the
local).  This cleans up the code a little bit and avoids the unnecessary
initialization.

Signed-off-by: Oleg Nesterov 
Reviewed-by: Sergey Dyasly 
Tested-by: Sergey Dyasly 
Reviewed-by: Sameer Nanda 
Cc: "Eric W. Biederman" 
Cc: Frederic Weisbecker 
Cc: Mandeep Singh Baines 
Cc: "Ma, Xindong" 
Reviewed-by: Michal Hocko 
Cc: "Tu, Xiaobing" 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Li Zefan 
Signed-off-by: Greg Kroah-Hartman